Page MenuHomePhabricator

[PowerPC] Fix __tls_get_addr sequence to avoid register assignment issues

Authored by wschmidt on Feb 8 2015, 11:40 AM.



My latest fix to avoid the TLS linker optimization bug was not quite sufficient. At -O3 we ran into another bootstrap failure. We ran into a situation where the introduced register copy was not coalesced away, and as a result the target of the add-immediate was still not GPR3.

It seems the only way to truly constrain this is to hide the add-immediate and the call together in a separate pseudo-op flagged to define GPR3, so that the add-immediate cannot float away from the call. This still permits the call sequence to be fully commoned by MachineCSE, but avoids the register assignment problem.

At one point I had thought to glue the output register copy (COPY %vregout = %X3) onto this pseudo-op at the SelectionDAG level, but a little thought reminded me that this would break CSE. So I am maintaining the creation of that copy in the PPCTLSDynamicCall pass prior to RA. This pass now also expands the combined pseudo into its two constituent pseudos with proper defs and uses of GPR3.

With this change, the test suites still pass, and if I disable the workaround that shuts off the linker optimizations, I can still bootstrap clang at both -O2 and -O3. So I'm feeling fairly comfortable that this solution will hold up.

Diff Detail

Event Timeline

wschmidt updated this revision to Diff 19550.Feb 8 2015, 11:40 AM
wschmidt retitled this revision from to [PowerPC] Fix __tls_get_addr sequence to avoid register assignment issues.
wschmidt updated this object.
wschmidt edited the test plan for this revision. (Show Details)
wschmidt added reviewers: hfinkel, kbarton, nemanjai, seurer.
wschmidt added a subscriber: Unknown Object (MLST).
hfinkel edited edge metadata.Feb 8 2015, 12:28 PM

Can you please rebase this? It does not apply cleanly against trunk (I had to revert the previous attempted fix in its entirety because, unfortunately, independent of the linker flag, it was causing bootstrapping failures).

OK, I was not aware you had reverted this. I am going through the pain of re-merging now. I recommend you have a look at the original diff to see what the changes are, though, as the new diff is going to include all that older code that was reverted and will make it difficult to see what I've changed.

wschmidt updated this revision to Diff 19555.EditedFeb 8 2015, 2:15 PM
wschmidt edited edge metadata.

Rebased patch provided. I've bootstrapped with -O2, -O3, -O3 -DNDEBUG, and -O3 -DNDEBUG -mcpu=native. However, those tests were on a POWER8. Hal, I'd appreciate it if you could apply to your POWER7 and see if you run into any issues this time.


wschmidt updated this revision to Diff 19558.Feb 8 2015, 3:38 PM

Re-posting the last one because PPCTLSDynamicCall.cpp came out as a diff instead of an add of a new file. Sorry for any problems.

Re-posting the last one because PPCTLSDynamicCall.cpp came out as a diff instead of an add of a new file. Sorry for any problems.

Unfortunately, I see the same problem here that caused me to revert the previous version. Even leaving Clang untouched (so we still get the linker flag to disable the tls optimizations), self-hosting fails on the P7:

Building arm_neon.h...
0  clang-tblgen 0x000000001009fdb4
1  clang-tblgen 0x00000000100a0cb4
2  clang-tblgen 0x00000000100a0f64
3               0x00000fff8a020448 __kernel_sigtramp_rt64 + 0
4  clang-tblgen 0x000000001008d29c
5  clang-tblgen 0x000000001007d6ec
6    0x00000080b782c088
7    0x00000080b782c280 __libc_start_main + 4293375408
/bin/sh: line 1: 36646 Segmentation fault      (core dumped) ../../../../bin/clang-tblgen -gen-arm-neon -I/home/hfinkel/src/llvm/tools/clang/lib/Headers -I /home/hfinkel/src/llvm/lib/Target -I /home/hfinkel/src/llvm/include /home/hfinkel/src/llvm/tools/clang/include/clang/Basic/ -o /home/hfinkel/build/ppc64/llvm-stage1/tools/clang/lib/Headers/arm_neon.h.tmp
make[2]: *** [tools/clang/lib/Headers/arm_neon.h.tmp] Error 139

This is a cmake self-hosted build, gcc 4.8.2 providing the libstdc++, -O3 -mcpu=native -DNDEBUG.

When we had talked on IRC, we had discussed expanding the pseudo-instruction late (after register allocation), but before post-RA scheduling. You're still expanding it here before RA. Did you change your mind?

Also, as is, you're doing things in between MI scheduling and RA without preserving live intervals, etc. I don't think that's a good idea.

Alright, here's what happening on the P7:

Starting program: /home/hfinkel/build/ppc64/llvm-stage1/bin/clang-tblgen -gen-arm-neon -I /home/hfinkel/src/llvm/tools/clang/lib/Headers -I /home/hfinkel/src/llvm/lib/Target -I /home/hfinkel/src/llvm/include /home/hfinkel/src/llvm/tools/clang/include/clang/Basic/ -o /home/hfinkel/build/ppc64/llvm-stage1/tools/clang/lib/Headers/arm_neon.h.tmp
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/".

Program received signal SIGSEGV, Segmentation fault.
0x000000001008d2a4 in .llvm::PrettyStackTraceEntry::PrettyStackTraceEntry() ()
(gdb) disassemble
Dump of assembler code for function ._ZN4llvm21PrettyStackTraceEntryC2Ev:
   0x000000001008d270 <+0>:   mflr    r0
   0x000000001008d274 <+4>:   std     r31,-8(r1)
   0x000000001008d278 <+8>:   std     r0,16(r1)
   0x000000001008d27c <+12>:  stdu    r1,-64(r1)
   0x000000001008d280 <+16>:  addis   r12,r2,-1
   0x000000001008d284 <+20>:  nop
   0x000000001008d288 <+24>:  mr      r4,r3
   0x000000001008d28c <+28>:  mr      r31,r1
   0x000000001008d290 <+32>:  addi    r3,r2,-32392
   0x000000001008d294 <+36>:  addi    r6,r12,-2248
   0x000000001008d298 <+40>:  bl      0x10005078 <00000011.plt_call.__tls_get_addr@@GLIBC_2.3+0>
   0x000000001008d29c <+44>:  ld      r2,40(r1)
   0x000000001008d2a0 <+48>:  addi    r5,r6,16
=> 0x000000001008d2a4 <+52>:  std     r5,0(r4)
   0x000000001008d2a8 <+56>:  addis   r3,r3,0
   0x000000001008d2ac <+60>:  ori     r2,r2,0
   0x000000001008d2b0 <+64>:  ld      r5,-32768(r3)
   0x000000001008d2b4 <+68>:  std     r5,8(r4)
   0x000000001008d2b8 <+72>:  std     r4,-32768(r3)
   0x000000001008d2bc <+76>:  addi    r1,r1,64
   0x000000001008d2c0 <+80>:  ld      r0,16(r1)
   0x000000001008d2c4 <+84>:  ld      r31,-8(r1)
   0x000000001008d2c8 <+88>:  mtlr    r0
   0x000000001008d2cc <+92>:  blr
   0x000000001008d2d0 <+96>:  .long 0x0
   0x000000001008d2d4 <+100>: .long 0x0
   0x000000001008d2d8 <+104>: .long 0x0
End of assembler dump.

Looks to me like the instruction that becomes the function call needs to do more than clobber r3. Maybe it needs the full clobber mask associated with a regular function call? Based on this crash, it looks like it at least also clobbers r4.

Maybe it needs the full clobber mask associated with a regular function call? Based on this crash, it looks like it at least also clobbers r4.

To answer my own question, Google seems to say that glibc/sysdeps/powerpc/tls-macros.h has this:

#define __TLS_CALL_CLOBBERS						      \
	"0", "4", "5", "6", "7", "8", "9", "10", "11", "12",		      \
	"lr", "ctr", "cr0", "cr1", "cr5", "cr6", "cr7"

(and that's what is used by the TLS_LD/TLS_GD macros (etc.) that contain calls to __tls_get_addr.

Let's define a minimal CC reg mask that only has these registers, and apply it to the pseudo that becomes the function call (or something like that).

OK, that makes sense. That was formerly handled by the call logic under the old method. Thanks for looking into this! I'll work tomorrow (well, today) on all the changes that are needed.

wschmidt updated this revision to Diff 19606.Feb 9 2015, 1:21 PM

Once more, with feeling...

Changes from the previous patch:

  • Call clobbers are handled by annotating both the GETtlsADDR* and ADDItls*LADDR* instructions with implied defs. This still permits MachineCSE to do its work while preventing the behavior that was breaking the POWER7 bootstrap.
  • PPCTLSDynamicCall.cpp now properly repairs live intervals.
  • The new PPCTLSDynamicCall pass is now only called if the relocation model is PIC_.

Once again, I've tested various bootstrap scenarios, but would appreciate a sniff test on your POWER7 build. Thanks!

wschmidt updated this revision to Diff 19674.Feb 10 2015, 6:51 AM

Slight change to the previous patch. I observed on inspection that I was missing the INITIALIZE_PASS_DEPENDENCY macro for LiveIntervals. While I was at it, I added the SlotIndexes analysis as required and preserved to be on the safe side.

Sorry for the omission!

hfinkel accepted this revision.Feb 10 2015, 9:40 AM
hfinkel edited edge metadata.

LGTM, thanks! (I checked self-hosting on my P7 box, and it was clean).

This revision is now accepted and ready to land.Feb 10 2015, 9:40 AM

(also, I check self-hosting on my P7 box with the tls linker flag removed, and that was also clean)

Excellent! Thanks very much for all your help with debug and test!

Committed as r228725. Thanks again for all the help and patience!

wschmidt closed this revision.Feb 10 2015, 1:03 PM

It looks like the buildbots are happy at this point, so closing this.