Thank you everyone for your kind words
I do not have many more details about the experiments I made, but I can tell you what I tried that did not work:
- Increasing the number of generations with `-G3` resulted in a bigger heap, so I quickly reverted to the default.
- The `-c` parameter enables a compacting algorithm for collecting the oldest generation. It resulted in a much lower memory footprint, but also in much longer GC pauses, which made the nodes less responsive and, on a producer node, would make it randomly miss a few leader checks per hour. This might be usable on machines with fast CPUs, though such machines generally also have enough RAM…
- The `-M` parameter limits the maximum heap size and raises an internal exception when the limit is reached, so that the program can try to stop cleanly (you can set a post-exception leeway to give the program time to stop itself). I tried this, but this parameter also enables `-c` once a certain threshold is reached, causing the same performance issue.
- The `-xn` parameter enables the new non-moving GC algorithm and gives better performance overall, but the problem is that it's unstable (random segmentation faults).
- I also tried various values for the `-F` parameter: the lower the value, the more often the GC is triggered by heap overflow, and the resulting frequent memory allocations make everything very slow. So `1.5` seems a good compromise. Something to try might be a lower value combined with a bigger `-H`, e.g. `-F1 -H5G`, which might allow for a lower heap size (>=5G), but this would need to be carefully tuned to the actual memory requirement of the node, which is difficult to determine…
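When experimenting with these flags, it helps to log the GC settings that are actually in force. A minimal sketch, using the `GHC.RTS.Flags` module from `base` (the field selection here is my choice, not something from the node's code):

```haskell
-- Print the effective RTS GC settings at startup, so the values actually
-- in force (-G, -F, -M, -c) can be recorded next to any memory measurements.
import GHC.RTS.Flags (GCFlags (..), getGCFlags)

main :: IO ()
main = do
  gc <- getGCFlags
  putStrLn $ "generations:  " ++ show (generations gc)   -- set by -G
  putStrLn $ "oldGenFactor: " ++ show (oldGenFactor gc)  -- set by -F
  putStrLn $ "maxHeapSize:  " ++ show (maxHeapSize gc)   -- set by -M (0 = no limit)
  putStrLn $ "compact:      " ++ show (compact gc)       -- True when -c is on
```

This makes it obvious when a flag you passed was silently overridden or mistyped.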
Apart from this, another thing I noticed is that the memory footprint depends heavily on the garbage collections that happen during node startup, which are mostly random and dependent, it seems, on I/O efficiency. If one is lucky and the GCs happen at the right times (before the live data becomes big and right after a lot of garbage was created), then the memory footprint will stay low.
I'm not sure if there is code in the node to call `performMajorGC`, but this could also be a solution because, in the end, my solution just consists in making sure a major GC is performed on average every 10 minutes (I'm using `-Iw600` depending on my relays; values lower than 600 don't seem to make any difference in heap size).
The best solution would be for the node to call `performMajorGC` regularly on its own when there is time, instead of waiting for the RTS to do it on heap overflow (which is what makes the heap grow). It could also be performed after certain operations that are known to create a lot of garbage. When the node is acting as a producer, we could imagine it checking a minute in advance whether it is the slot leader and, when it knows it has enough time, triggering a major GC…
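As a rough sketch of that idea, forcing a major GC on a timer from inside the program only needs `System.Mem.performMajorGC` and a background thread. The helper name `periodicMajorGC` is hypothetical, not something in the node; the interval mirrors the `-Iw600` setting above:

```haskell
-- Sketch: force a major GC on a fixed interval from inside the program,
-- instead of relying on the RTS idle-GC settings (-I / -Iw).
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever)
import System.Mem (performMajorGC)

-- Hypothetical helper: fork a thread that performs a major GC every
-- `seconds` seconds for the lifetime of the process.
periodicMajorGC :: Int -> IO ()
periodicMajorGC seconds = do
  _ <- forkIO . forever $ do
    threadDelay (seconds * 1000000)  -- threadDelay takes microseconds
    performMajorGC
  pure ()

main :: IO ()
main = do
  periodicMajorGC 600  -- every 10 minutes, matching -Iw600
  performMajorGC       -- one immediate collection, just to demonstrate
  putStrLn "major GC forced"
```

A real node would want to skip the collection when it knows a leader check is imminent, which is exactly the scheduling idea described above.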
I think that GHC 9.2, with the new `-I0.3` (the default) and `-Iw600`, will solve this issue completely, as it will allow releasing memory to the OS, which never happens with GHC 8.10.x.
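Whether memory really is handed back to the OS can be checked from inside the process: with `+RTS -T`, `GHC.Stats` exposes both the peak live data and the peak memory the RTS is holding from the OS. A small sketch (my own measurement idea, not node code):

```haskell
-- Compare peak live data with peak memory obtained from the OS.
-- A large gap between the two is heap the RTS is holding but not using.
-- Requires running the program with: +RTS -T -RTS
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "re-run with +RTS -T -RTS to enable RTS statistics"
    else do
      s <- getRTSStats
      putStrLn $ "major GCs so far:   " ++ show (major_gcs s)
      putStrLn $ "max live bytes:     " ++ show (max_live_bytes s)
      putStrLn $ "max mem from OS:    " ++ show (max_mem_in_use_bytes s)
```

Logging these two numbers periodically on GHC 8.10 versus 9.2 would show directly whether the newer RTS returns memory to the OS.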