My latest fix for the TLS linker optimization bug was not quite sufficient: at -O3 we hit another bootstrap failure. In that case the introduced register copy was not coalesced away, so the target of the add-immediate was still not GPR3.
It seems the only reliable way to constrain the register assignment is to hide the add-immediate and the call together in a single pseudo-op flagged as defining GPR3, so that the add-immediate cannot float away from the call. This still permits the call sequence to be fully commoned by MachineCSE, but avoids the register assignment problem.
At one point I considered gluing the output register copy (COPY %vregout = %X3) onto this pseudo-op at the SelectionDAG level, but on reflection that would break CSE. So I am keeping the creation of that copy in the PPCTLSDynamicCall pass prior to RA. This pass now also expands the combined pseudo into its two constituent pseudos with proper defs and uses of GPR3.
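
To make the expansion concrete, here is a toy model of it in Python, not the actual pass code. The MInstr class and the opcode names (TLS_ADDI_CALL, TLS_ADDI, TLS_CALL) are hypothetical stand-ins chosen for illustration. The real pass operates on LLVM MachineInstrs, and the real sequence may contain details this sketch omits.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical opcode names, used only for this illustration.
COMBINED = "TLS_ADDI_CALL"  # combined pseudo: add-immediate plus call
ADDI = "TLS_ADDI"           # add-immediate half
CALL = "TLS_CALL"           # call half
GPR3 = "X3"                 # physical register the linker expects


@dataclass
class MInstr:
    """Toy stand-in for a machine instruction: opcode, defs, uses."""
    opcode: str
    defs: List[str]
    uses: List[str]


def expand_tls_calls(block: List[MInstr]) -> List[MInstr]:
    """Expand each combined pseudo into add-immediate, call, and copy.

    Both halves explicitly define GPR3, so the add-immediate result is
    pinned to GPR3 instead of relying on a copy being coalesced away.
    The copy out of GPR3 is created here, after any commoning of the
    combined pseudos has already had its chance to run.
    """
    out: List[MInstr] = []
    for mi in block:
        if mi.opcode != COMBINED:
            out.append(mi)  # unrelated instructions pass through unchanged
            continue
        (vreg_out,) = mi.defs  # virtual register receiving the result
        (vreg_in,) = mi.uses   # virtual register feeding the add-immediate
        out.append(MInstr(ADDI, defs=[GPR3], uses=[vreg_in]))
        out.append(MInstr(CALL, defs=[GPR3], uses=[GPR3]))
        out.append(MInstr("COPY", defs=[vreg_out], uses=[GPR3]))
    return out
```

In this model, a block containing one combined pseudo that defines %v2 from %v1 comes out as an add-immediate defining X3 from %v1, a call using and defining X3, and a COPY from X3 into %v2, in that order and with nothing in between.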
With this change, the test suites still pass, and if I disable the workaround that shuts off the linker optimizations, I can still bootstrap clang at both -O2 and -O3. So I'm feeling fairly comfortable that this solution will hold up.