Page MenuHomePhabricator

[LLD][PowerPC] Fix bug in PC-Relative initial exec
ClosedPublic

Authored by stefanp on Jan 22 2021, 12:56 PM.

Details

Summary

There is a bug when initial exec is relaxed to local exec.
In the following situation:

InitExec.c

extern __thread unsigned TGlobal;
unsigned getConst(unsigned*);
unsigned addVal(unsigned, unsigned*);

unsigned GetAddrT() {
  return addVal(getConst(&TGlobal), &TGlobal);
}

Def.c

__thread unsigned TGlobal;

unsigned getConst(unsigned* A) {
  return *A + 3;
}

unsigned addVal(unsigned A, unsigned* B) {
  return A + *B;
}

The problem is in InitExec.c but Def.c is required if you want to link the example and see the problem.
To compile everything:

clang -O3 -mcpu=pwr10 -c InitExec.c
clang -O3 -mcpu=pwr10 -c Def.c
ld.lld InitExec.o Def.o -o IeToLe

If you objdump the problem object file:

$ llvm-objdump -dr --mcpu=pwr10 InitExec.o

you will get the following assembly:

0000000000000000 <GetAddrT>:
       0: a6 02 08 7c  	mflr 0
       4: f0 ff c1 fb  	std 30, -16(1)
       8: 10 00 01 f8  	std 0, 16(1)
       c: d1 ff 21 f8  	stdu 1, -48(1)
      10: 00 00 10 04 00 00 60 e4      	pld 3, 0(0), 1
		0000000000000010:  R_PPC64_GOT_TPREL_PCREL34	TGlobal
      18: 14 6a c3 7f  	add 30, 3, 13
		0000000000000019:  R_PPC64_TLS	TGlobal
      1c: 78 f3 c3 7f  	mr	3, 30
      20: 01 00 00 48  	bl 0x20
		0000000000000020:  R_PPC64_REL24_NOTOC	getConst
      24: 78 f3 c4 7f  	mr	4, 30
      28: 30 00 21 38  	addi 1, 1, 48
      2c: 10 00 01 e8  	ld 0, 16(1)
      30: f0 ff c1 eb  	ld 30, -16(1)
      34: a6 03 08 7c  	mtlr 0
      38: 00 00 00 48  	b 0x38
		0000000000000038:  R_PPC64_REL24_NOTOC	addVal

The lines of interest are:

      10: 00 00 10 04 00 00 60 e4      	pld 3, 0(0), 1
		0000000000000010:  R_PPC64_GOT_TPREL_PCREL34	TGlobal
      18: 14 6a c3 7f  	add 30, 3, 13
		0000000000000019:  R_PPC64_TLS	TGlobal
      1c: 78 f3 c3 7f  	mr	3, 30

Which once linked gets turned into:

10010210: ff ff 03 06 00 90 6d 38      	paddi 3, 13, -28672, 0
10010218: 00 00 00 60  	nop
1001021c: 78 f3 c3 7f  	mr	3, 30

The problem is that register 30 is never set after the optimization.

Therefore it is not correct to relax the above instructions by replacing
the add instruction with a nop.
Instead the add instruction should be replaced with a copy (mr) instruction.
If the add uses the same resgiter as input and as ouput then it is safe to
continue to replace the add with a nop.

Diff Detail

Event Timeline

stefanp created this revision.Jan 22 2021, 12:56 PM
stefanp requested review of this revision.Jan 22 2021, 12:56 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 22 2021, 12:56 PM
MaskRay added inline comments.Jan 24 2021, 10:09 AM
lld/test/ELF/ppc64-tls-pcrel-ie.s
82

In this case, does the codegen assume that r3 is killed and cannot be reused?

stefanp added inline comments.Jan 25 2021, 4:44 AM
lld/test/ELF/ppc64-tls-pcrel-ie.s
82

This is an artificial test case. There is no real reason why r4 is used instead of r3 in the test above. Generating a real test would create more code and it would just clutter the test file. This is a snippet from a real test:

10: 00 00 10 04 00 00 60 e4      	pld 3, 0(0), 1
		0000000000000010:  R_PPC64_GOT_TPREL_PCREL34	TGlobal
18: 14 6a c3 7f  	add 30, 3, 13
		0000000000000019:  R_PPC64_TLS	TGlobal
1c: 78 f3 c3 7f  	mr	3, 30
20: 01 00 00 48  	bl 0x20
		0000000000000020:  R_PPC64_REL24_NOTOC	getConst
24: 78 f3 c4 7f  	mr	4, 30

the idea is that the bl clobbers r3 and so we save to r30 first so that we can use the result of the add both before and after the call.
The only thing I want to test is what happens if the result register of the add is different from the result register of the pld.

MaskRay added inline comments.Jan 25 2021, 10:30 AM
lld/test/ELF/ppc64-tls-pcrel-ie.s
82
pld 3, 12488(0), 1
add 4, 3, 13

r3 != r4. If the compiler generated code makes use of r3, no matter how the linker relaxes the sequence, it will be broken, right?

nemanjai added inline comments.Jan 25 2021, 4:53 PM
lld/test/ELF/ppc64-tls-pcrel-ie.s
82

Is it possible for r3 to be used for anything other than materializing the address of x? Basically, the result in r3 before the relaxation is "The offset of x from the thread pointer" and the result in r4 is "The address of x". After the relaxation, both r3 and r4 contain "the address of x".

stefanp added inline comments.Feb 1 2021, 12:13 PM
lld/test/ELF/ppc64-tls-pcrel-ie.s
82

Technically one could create assembly by hand that would break this relaxation. As Nemanja mentioned we do change what r3 stores when we relax it.

However, I do not believe that our compiler can generate code like that. My impression based on the ELFv2 ABI is that R_PPC64_GOT_TPREL_PCREL34 is only used in code sequences for Initial Exec. If the linker receives this relocation in other situations and r3 is used in other ways I would consider that bad code gen from the compiler.

I am basically assuming that all uses of r3 will be marked with R_PPC64_TLS and can be relaxed safely. If that assumption cannot be made we either have to cancel this relaxation completely or we have to add more look-ahead code to check that the add in the TLS pattern always uses the same register as input and output.

Gentle ping.

MaskRay added inline comments.Feb 16 2021, 10:10 AM
lld/test/ELF/ppc64-tls-pcrel-ie.s
82

I thought there was unactioned items in this comment thread. Can you briefly summarize the compiler behavior and roll that into the description?

I'd like a proof that

pld 3, x@got@tprel@pcrel(0), 1
add 4, 3, x@tls@pcrel

is legitimate compiler emitted code sequence, so relaxing it is a bug. There are other compiler emitted code sequences which are technically really difficult to detect, but compilers do not emit them.

I'm sorry, I should have added the full test case.

InitExec.c

extern __thread unsigned TGlobal;
unsigned getConst(unsigned*);
unsigned addVal(unsigned, unsigned*);

unsigned GetAddrT() {
  return addVal(getConst(&TGlobal), &TGlobal);
}

Def.c

__thread unsigned TGlobal;

unsigned getConst(unsigned* A) {
  return *A + 3;
}

unsigned addVal(unsigned A, unsigned* B) {
  return A + *B;
}

The problem is in InitExec.c but Def.c is required if you want to link the example and see the problem.
To compile everything:

clang -O3 -mcpu=pwr10 -c InitExec.c
clang -O3 -mcpu=pwr10 -c Def.c
ld.lld InitExec.o Def.o -o IeToLe

If you objdump the problem object file:

$ llvm-objdump -dr --mcpu=pwr10 InitExec.o

you will get the following assembly:

0000000000000000 <GetAddrT>:
       0: a6 02 08 7c  	mflr 0
       4: f0 ff c1 fb  	std 30, -16(1)
       8: 10 00 01 f8  	std 0, 16(1)
       c: d1 ff 21 f8  	stdu 1, -48(1)
      10: 00 00 10 04 00 00 60 e4      	pld 3, 0(0), 1
		0000000000000010:  R_PPC64_GOT_TPREL_PCREL34	TGlobal
      18: 14 6a c3 7f  	add 30, 3, 13
		0000000000000019:  R_PPC64_TLS	TGlobal
      1c: 78 f3 c3 7f  	mr	3, 30
      20: 01 00 00 48  	bl 0x20
		0000000000000020:  R_PPC64_REL24_NOTOC	getConst
      24: 78 f3 c4 7f  	mr	4, 30
      28: 30 00 21 38  	addi 1, 1, 48
      2c: 10 00 01 e8  	ld 0, 16(1)
      30: f0 ff c1 eb  	ld 30, -16(1)
      34: a6 03 08 7c  	mtlr 0
      38: 00 00 00 48  	b 0x38
		0000000000000038:  R_PPC64_REL24_NOTOC	addVal

The lines of interest are:

      10: 00 00 10 04 00 00 60 e4      	pld 3, 0(0), 1
		0000000000000010:  R_PPC64_GOT_TPREL_PCREL34	TGlobal
      18: 14 6a c3 7f  	add 30, 3, 13
		0000000000000019:  R_PPC64_TLS	TGlobal
      1c: 78 f3 c3 7f  	mr	3, 30

Which once linked get turned into:

10010210: ff ff 03 06 00 90 6d 38      	paddi 3, 13, -28672, 0
10010218: 00 00 00 60  	nop
1001021c: 78 f3 c3 7f  	mr	3, 30

The problem is that register 30 is never set after the optimization.

I'll add it to the description as well. Is this what you were looking for @MaskRay ?

stefanp edited the summary of this revision. (Show Details)Feb 22 2021, 4:57 AM
stefanp edited the summary of this revision. (Show Details)Feb 22 2021, 5:00 AM
stefanp updated this revision to Diff 331052.Mar 16 2021, 11:51 AM

Added full test case that was compiled from C source to show the issue.

I think I may have misunderstood your initial comment.
I don't think that the ABI has an issue with the code pattern:

pld 3, <Immediate Value>(0), 1
add 4, 3, 13

The ABI does not explicitly state that those registers must be equal. It also does not say explicitly that they can differ so I don't think there is a clear direction from that perspective. Logically it would make sense for those registers to be able to differ if they are used in multiple places or if they must be used across function calls (as in the example I added).
Linking the object files generated by LLVM with ld produces the correct result with the mr and not the nop. Therefore it seems that ld expects to have to handle this kind of code. While this situation will not arise very often I believe that this is something LLD should be able to handle.

I hope that this answers your questions @MaskRay .

MaskRay accepted this revision.Mar 17 2021, 9:52 AM

Sorry for my slow response. Mostly looks good.

lld/ELF/Arch/PPC64.cpp
924

RT -> rt

926
lld/test/ELF/ppc64-tls-le-relax.s
83

Delete trailing empty lines

This revision is now accepted and ready to land.Mar 17 2021, 9:52 AM
stefanp updated this revision to Diff 332367.Mar 22 2021, 11:07 AM

Thank you for the review!

Fixed naming of some local variables.
Fixed braces on if statement.
Removed empty lines at the bottom of the test case.

This revision was landed with ongoing or failed builds.Mar 22 2021, 11:16 AM
This revision was automatically updated to reflect the committed changes.