The critical loop in type merging is (unsurprisingly) the one that iterates over every type record and remaps indices.
This patch mostly focuses on improving inlining behavior and avoiding unnecessary memcpys. For every record, the algorithm tries to insert it into a hash table; if the insertion succeeds (because the record is new), it invokes a callback to serialize the record and store it. Previously, this tight loop went through multiple levels of outlined functions. With this patch, my test case goes from 40 seconds to ~35 seconds when built with clang with optimizations enabled.
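For readers who want the shape of that loop, here is a minimal sketch of the insert-then-serialize pattern. It is not the actual code in this patch: `TypeTableSketch`, `insertRecordAs`, and `mergeAll` are hypothetical names, and records are modeled as `std::string` rather than raw byte arrays. The point it illustrates is that taking the callback as a template parameter lets the compiler inline it into the loop, and that the callback writes directly into the stored slot instead of into a temporary that is copied afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a deduplicating type table (not the real class).
class TypeTableSketch {
public:
  // Returns the index assigned to Key. CreateRecord runs only when Key is new.
  // Because CreateFn is a template parameter, the call can be inlined here.
  template <typename CreateFn>
  uint32_t insertRecordAs(const std::string &Key, CreateFn CreateRecord) {
    auto Result = Seen.emplace(Key, static_cast<uint32_t>(Records.size()));
    if (Result.second) {
      // New record: serialize straight into its final storage slot.
      Records.emplace_back();
      CreateRecord(Records.back());
    }
    return Result.first->second;
  }

  std::size_t size() const { return Records.size(); }

  const std::string &record(uint32_t Index) const {
    assert(Index < Records.size() && "index out of range");
    return Records[Index];
  }

private:
  std::unordered_map<std::string, uint32_t> Seen;
  std::vector<std::string> Records;
};

// Merges every input record, appends the old-to-new index mapping to IndexMap,
// and returns how many times the serialization callback actually ran.
inline int mergeAll(TypeTableSketch &Table,
                    const std::vector<std::string> &Input,
                    std::vector<uint32_t> &IndexMap) {
  int SerializeCalls = 0;
  for (const std::string &Raw : Input) {
    uint32_t NewIndex = Table.insertRecordAs(Raw, [&](std::string &Storage) {
      Storage = Raw; // stands in for "serialize the remapped record"
      ++SerializeCalls;
    });
    IndexMap.push_back(NewIndex);
  }
  return SerializeCalls;
}
```

In this sketch a duplicate record costs one hash lookup and nothing else; the serialization callback is skipped entirely.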
A possible next step is to templatize this so we don't have to build a vector of offsets to iterate over. It may not matter much, but it would at least save the branch on the TiRefKind inside the loop.
I think the ideal code for type index remapping would basically be a giant switch on record kind, with the remapping code inlined directly into each case block.
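As a rough illustration of that idea (a sketch under simplifying assumptions, not something this patch implements): the `RecordKind` values, the `Record` layout, and the `remapOne`/`remapRecord` helpers below are all hypothetical. Real records are variable-length byte blobs, and this sketch treats every index as remappable. What it shows is that once the switch has picked the record kind, the positions of the type indices are known statically, so no offset vector and no per-reference kind branch are needed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, heavily simplified record kinds.
enum class RecordKind : uint16_t { Pointer, Modifier, Array, Leaf };

// Hypothetical fixed layout: up to two type-index fields per record.
struct Record {
  RecordKind Kind;
  uint32_t Fields[2];
};

// Rewrites one type index through IndexMap; fails on an out-of-range index.
inline bool remapOne(uint32_t &Index, const std::vector<uint32_t> &IndexMap) {
  if (Index >= IndexMap.size())
    return false;
  Index = IndexMap[Index];
  return true;
}

// One switch on record kind, with the remapping written out in each case.
// On failure the record may be left partially remapped.
inline bool remapRecord(Record &R, const std::vector<uint32_t> &IndexMap) {
  switch (R.Kind) {
  case RecordKind::Pointer:  // one index: the referent type
  case RecordKind::Modifier: // one index: the modified type
    return remapOne(R.Fields[0], IndexMap);
  case RecordKind::Array:    // two indices: element type and index type
    return remapOne(R.Fields[0], IndexMap) &&
           remapOne(R.Fields[1], IndexMap);
  case RecordKind::Leaf:     // no type indices to remap
    return true;
  }
  return false; // unknown kind
}
```

Whether this beats the offset-vector approach in practice would need to be measured; the sketch only shows the structure.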