On the importance of memory buffer...

I have been working for a long time on tracking down random crashes on the OCaml Forge. I first suspected failing hardware, but found no evidence of that.

Desperate for a solution, I installed cacti to investigate the issue. The good news is that it let me capture a pretty nice graph of a crash.

Load average - Crash

Just before the crash the average load was 200...

My conclusion was that too many processes were running at the same time, so I started hunting for processes that loaded the server for nothing. One of them was darcsweb. This process doesn't really create load by itself, but it calls darcs for various operations, and most of them are quite expensive. The most expensive one is darcs diff. I first turned on darcsweb's caching, which reduced the number of darcs invocations. But it didn't make a real difference (except that the website was faster to load).

I continued to investigate. In order to reduce the overall load on the server, I installed various robots.txt files to prevent crawlers from requesting diffs for all the VCSs on the forge. Crawler traffic is 10 times the normal traffic, and having diffs indexed by Google is not very interesting. That didn't make a real difference either.
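To double-check that rules like these really keep crawlers away from the expensive pages, one can feed them to Python's standard robots.txt parser. This is only a sketch: the URL prefixes below are hypothetical placeholders, not the actual layout of the forge, and the standard parser matches plain path prefixes only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: these prefixes are placeholders for wherever the
# VCS web front ends (darcsweb, loggerhead, ...) are mounted.
ROBOTS_TXT = """\
User-agent: *
Disallow: /darcsweb/
Disallow: /loggerhead/
"""


def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("/darcsweb/diff", "/projects/"):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
```

Well-behaved crawlers honour these rules, but nothing forces them to, so this only reduces the load coming from the polite ones.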

Load average - Peaks

I installed a script to analyze every load peak above 2. The peaks happened every hour and varied in size (from 4 to 10). I discovered that these peaks were related to the hourly FusionForge cronjob that fixes repository permissions. I was pretty surprised, because updating permissions should not generate this much load.
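The script itself is not shown here. A minimal sketch of the idea, assuming a Linux host with /proc/loadavg and the procps version of ps, could look like the following. The function names and the 15-line cutoff are my own choices for illustration.

```python
import subprocess
from pathlib import Path

THRESHOLD = 2.0  # peaks above this 1-minute load get recorded


def one_minute_load(loadavg_text: str) -> float:
    """Extract the 1-minute load from the content of /proc/loadavg."""
    return float(loadavg_text.split()[0])


def is_peak(loadavg_text: str, threshold: float = THRESHOLD) -> bool:
    """Return True when the 1-minute load is strictly above the threshold."""
    return one_minute_load(loadavg_text) > threshold


if __name__ == "__main__":
    text = Path("/proc/loadavg").read_text()
    if is_peak(text):
        # Snapshot of the busiest processes at the time of the peak.
        ps = subprocess.run(
            ["ps", "-eo", "pid,pcpu,pmem,stat,args", "--sort=-pcpu"],
            capture_output=True, text=True, check=False,
        )
        print(text.strip())
        print("\n".join(ps.stdout.splitlines()[:15]))
```

Run every minute from cron with the output appended to a log file, something like this is enough to see which jobs coincide with the peaks.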

This morning, I had a simple idea. There are a lot of processes on the server, and one of them (the bzr daemon, loggerhead) eats a big chunk of memory (or at least never frees it). I just had a quick look at the "buffers" memory... Only 4MB!

I just restarted loggerhead. The robots.txt improvement makes it a lot more stable with regard to memory consumption, since we don't run bzr diff anymore. The "buffers" memory is now at 100MB and guess what! No more peaks...
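For reference, the "buffers" figure is the one that free reports, and on Linux it can also be read from the Buffers line of /proc/meminfo. A small sketch of how to pull that number out, with a helper name that is mine:

```python
from pathlib import Path


def meminfo_kb(meminfo_text: str, field: str) -> int:
    """Return the value (in kB) of `field` from /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name == field:
            # Lines look like "Buffers:            4096 kB".
            return int(rest.split()[0])
    raise KeyError(field)


if __name__ == "__main__":
    text = Path("/proc/meminfo").read_text()
    print("Buffers: %.1f MB" % (meminfo_kb(text, "Buffers") / 1024))
```

Graphing this value next to the load average would have made the problem visible much earlier.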

Here is the result:

Load average - Restart

Conclusion: sometimes the cause of a high load lies in a single number...

