Improving GHC's code generation - Summer of Code - Complete Report
Disclaimer: Initially this was a gist - So the actual date is estimated.
Improvements to GHCs Code Generator
This is a writeup of the work done by me (Andreas Klebinger) during my SoC 2018 Project.
Improved code layout for GHC
The vast majority of my time was spent working on this part of the Project. While the patch was not yet merged there is not much missing.
The state of the patch can be viewed on GHCs Phab. Some discussion is also on the Bug Tracker
A explaination is also available on the GHC wiki
Preliminary results:
Numbers are speedups, higher is better.
Library | Sandy Bridge (Linux) | Haswell (Linux) | Skylake (Win) |
---|---|---|---|
aeson | +2.6% | +2.3% | +1.2% |
containers | +1.4% | +1.1% | +1.7% |
megaparsec | +3.2% | +13.6% 1 ) | +8.0% |
perf-xml 2 ) | +0.2% | NA | +1.1% |
text | +3.0% | NA | NA |
Vector *2 | +2.5% | +2.5% | +1.3% |
- 1 ) Possibly exaggerated because of background load.
- 2 ) https://github.com/haskell-perf/xml
Layout project review
Motivation
The idea was born out of neccesity. Some time ago I worked on adding static analysis to GHC to mark branches as likely or unlikely to be taken. Ultimately however this showed almost no benefit on runtime since code layout did not take advantage of this information. Even producing worse code at times.
Initial work
It became quickly clear that in order to improve on this GHC would need more information about control flow available when doing code layout.
There were two ways to approach this: Attach metadata to assembly instructions or have this information available out of band. I chose the later, building up a CFG from the CMM intermediate Language before generating assembly from it. Then doing static analysis on that to assign edge weights which approximate control flow patterns and using these to find a good code layout.
I started with a greedy algorithm for placing blocks based on edge weights. This went surprisingly well and I did not run into major issues during the implementation. It was my favourite part of the project.
First Roadblock
Next I wrote the code required to construct the CFG. However soon when testing the code I hit linker errors. After a good deal of debugging I realized that GHC adds (or removes!) basic blocks at non obvious places in the code. This meant blocks where invisible in the CFG, hence not placed and leading to issues at link time.
Finding all the places that modify the CFG was the first unplanned part which took a larger amount of time. But for the x64 backend the code now updates the CFG as we modify the code.
Benchmarks
Finally I was able to benchmark the actual performance. Which was worse initially. However after experimentation it became clear that this was mostly a matter of adjusting how we assign weights. With results improving ever since.
I initially only used nofib for benchmarks which has it’s own drawbacks.
Over time I incorporated various library benchmark suits into my approach. Which in hindsight was important as these showcased both benefits and drawbacks which where not obvious looking only at nofib result.
However benchmarking in general took a lot of work and time. Unexpected issues where:
- Benchmarks straight up being broken - some of the issues are listed below:
- Nofib - Some benchmarks did not use the given compiler flags - Fixed during SoC
- Nofib - Most of the benchmarks did not actually contribute to the reported runtime difference - Fixed during SoC
- Vector - Various issues - Submitted a PR
- Aeson - Broken dependency - Local workaround
- text - benchmark not building on windows
- Issues with tooling. I’ve run into a few issues with cabal new-build in particular. These invalidated some results so a good bit of time was spent reporting these issues and finding workarounds.
- Computing time required - Getting a meaningful performance comparison took a lot of compute time. Especially since it was unclear which approach to assigning edge weights gives the best results. The longest ones are listed below.
- nofib - >10 Hours.
- containers - 4 Hours
- text - 7 hours
- Benchmarks being broken when using a unreleased GHC version.
I ended up writing a lot of scripts (bash/R) in order to compare the results of benchmarks. They are available here but are better used as a source of inspiration. As they rely a lot on hardcoded paths/flags and the like.
CMOV Support - Unfinished
Convert certain code patterns to branchless instructions. There is a working prototype on Phab.
However there are still bugs with certain edge cases. Ultimately the work on this was posponed in favour of the layout work which promised bigger benefits in more cases.
Improvements to nofib
Updated nofib default runtimes/settings
Driven by the need to get reliable benchmarks I reworked the default settings for the nofib benchmark suite in order to harmonize runtimes between the different benchmarks.
This has not yet been merged but has been accepted to be merged. Tracker, Patch
QOL Patch - Ignore warnings about tabs vs spaces for nofib.
This has been merged upstream.
Nofib pretty consistently uses tabs especially in old benchmarks. Given the nature of the code it doesn’t make sense to emit warnings about this. Patch
Bugfix - Don’t search for perl binary in a hotcoded path.
Merged upstream. Patch
Bugfix - Some benchmarks ignored given compiler options if these were overriden by O2
Merged upstream. Patch
Other work
Small documentation fixes about GHCs dump flags
Merged upstream. Patch 1, Patch 2
Allow users to hide most uniques when comparing Cmm code
Merged upstream. Patch
This makes it easier to compare Cmm dumps in cases where the uniques are different but the code is the same.
Small performance improvement for GHCs OrdList/Bag
Merged upstream: Patch
Use strict left folds in GHC.
Has been accepted but not yet merged. Patch
Alignment of symbols.
I wrote a patch to specify the alignment of generated functions at compile time to help rule out performance differences caused by alignment.
This work has been merged upstream. Tracker, Phab
Eliminate conditional branches which jump to the same target on true and false.
This has been merged upstream. Tracker,
Invert float comparisons to eliminate explicit parity checks.
Improve generated code by taking advantage of the fact that certain checks always fail on unordered floating point numbers.
The patch is complete but review and acceptance from maintainers is still required.
Contributions to the wider eco system
- BugReport: Undocumented lib numa dependency by GHC
- Bugreport: cabal ignoring store-dir option
- Bugreport: cabal is broken with relative store-dir path
- Bugreport: cabal caches invalid store-dir path
- Pull request: update vector benchmarks
- Pull request: Update metadata for container benchmarks
- Bug report: Cabal has a race condition with custom store directory and new-build on windows
- Bug report: Undocumented/Wrong behaviour for cabal new-configure
- Bug report: Cabal fails with <<loop>> under certain conditions.
- Bug report: hashtables depends on outdated packages
- Bug report: Benchmarks fails with out of memory exception
- Code review on other patches. Here and here.