The critical loop in type merging is (unsurprisingly) the one that iterates over every type record and remaps indices.
This patch mostly focuses on improving inlining behavior and avoiding unnecessary memcpys. For every record, the algorithm tries to insert the record into a hash table. If the insertion succeeds (because the record is new), it then calls a callback to serialize the record and save it off. Previously there were multiple levels of outlined functions in this tight loop. This brings my test case down from 40 seconds to ~35 seconds when built with clang with optimizations.
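
For illustration, here is a minimal sketch of the insert-then-serialize pattern described above. It is not the code from this patch: the names (RecordTable, insertRecord, mergeRecords) are hypothetical, records are modeled as plain byte strings, and std::unordered_map stands in for whatever hash table the real implementation uses. The point it tries to show is that passing the callback as a template parameter gives the compiler a chance to inline it into the loop, and that serialization only runs for records that were not already in the table.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is a sketch, not the patch's code.
using TypeIndex = std::uint32_t;

struct RecordTable {
  std::unordered_map<std::string, TypeIndex> Seen; // record bytes -> new index
  std::vector<std::string> Serialized;             // records saved off, in order

  // Returns the remapped index for Record. Serialize is invoked only when
  // the record is new. Taking it as a template parameter (rather than a
  // function pointer or std::function) lets the compiler inline it.
  template <typename Fn>
  TypeIndex insertRecord(const std::string &Record, Fn &&Serialize) {
    auto Result =
        Seen.emplace(Record, static_cast<TypeIndex>(Serialized.size()));
    if (Result.second) // insertion succeeded, so the record is new
      Serialized.push_back(Serialize(Record));
    return Result.first->second;
  }
};

// The hot loop: visit every record and collect its remapped index.
std::vector<TypeIndex> mergeRecords(RecordTable &Table,
                                    const std::vector<std::string> &Records) {
  std::vector<TypeIndex> Remapped;
  Remapped.reserve(Records.size());
  for (const std::string &R : Records) {
    // Assumed serializer: here it just copies the bytes through unchanged.
    Remapped.push_back(Table.insertRecord(
        R, [](const std::string &Bytes) { return Bytes; }));
  }
  return Remapped;
}
```

In this sketch duplicate records map to the index assigned on first insertion, and the serializer runs once per distinct record.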