Oh yeah - if people are still struggling with this, try the following:
+RTS -N(num of PHYSICAL CPUs) -xn -G2 -T -S -A128m -n1m -C0 -I0.3 -Iw3600 -O4000m -RTS
I think the magic is in -C0 - this tells GHC to context switch as often as it can (at every heap block allocation) instead of waiting for the default 20 ms interval to expire first.
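-O4000m sets the minimum size of the old generation to 4000 MB, so the old generation is not collected before it has grown to that size. -n1m splits the allocation area (the nursery sized by -A) into 1 MB chunks.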
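
In case it helps, here is a rough sketch of how the flags could be wrapped in a small shell helper so the core count is easy to change. The function name rts_flags and the binary name ./myapp are made up for illustration, and the core count is passed in by hand rather than detected. It also assumes the program was linked with -threaded and -rtsopts, otherwise the RTS will reject most of these options.

```shell
# Build the RTS flag string from the post for a given number of physical cores.
# Assumption: the caller passes the core count; nothing is auto-detected here.
rts_flags() {
  cores="$1"
  # -N<n>     run on n capabilities (one per physical core)
  # -xn       use the non-moving collector for the oldest generation
  # -G2       two generations (this is also the default)
  # -T        collect GC statistics for in-program access
  # -S        print detailed per-GC statistics to stderr (noisy)
  # -A128m    allocation area (nursery) of 128 MB
  # -n1m      split the allocation area into 1 MB chunks
  # -C0       context switch as often as possible
  # -I0.3     run an idle GC after 0.3 s of inactivity
  # -Iw3600   wait at least 3600 s between idle GCs
  # -O4000m   minimum old generation size of 4000 MB
  printf '%s' "+RTS -N${cores} -xn -G2 -T -S -A128m -n1m -C0 -I0.3 -Iw3600 -O4000m -RTS"
}

# Hypothetical usage, assuming an 8-core machine and a binary called ./myapp:
#   ./myapp $(rts_flags 8)
```

Dropping -S is probably worth it once things look healthy, since it writes a line to stderr for every collection.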