This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Implement native TLS for Windows
ClosedPublic

Authored by mstorsjo on Mar 1 2018, 1:58 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

mstorsjo created this revision.Mar 1 2018, 1:58 PM
rnk resigned from this revision.Mar 7 2018, 5:26 PM

I don't feel like I know enough AArch64 asm to review this.

I don't know much about COFF TLS to tell why you need the special treatment, but the AArch64 asm looks correct (though very inefficient, but you know that:).

I don't know much about COFF TLS to tell why you need the special treatment, but the AArch64 asm looks correct (though very inefficient, but you know that:).

Thanks for taking a look in any case!

Let's see if I can get some more progress on this if I turn it into a few more concrete questions:

First: Currently there's the pseudoinstruction LOADgot which lowers into a "adrp + ldr" instruction pair, but the ldr always loads a full X-register. Here I needed to do that, but load a W-register instead, to only read 32 bits. I tried to do this by adding a separate LOADgot32 pseudoinstruction, but I wasn't able to make it return a 32 bit value on the SelectionDAG level.

Secondly: When I do a series of SHL + ZERO_EXTEND + Load, I would want it lowered into a ldr x0, [x0, w1 uxtw #3], but it ends up as lsl w1, w1, #3; ldr x0, [x0, w1 uxtw] which feels pointless. I remember doing similar things elsewhere, where it was folded properly into the ldr as a shift - what's missing here?

Who's familiar with the tablegen part of the AArch64 backend and/or SelectionDAG and can help out with this?

First: Currently there's the pseudoinstruction LOADgot which lowers into a "adrp + ldr" instruction pair, but the ldr always loads a full X-register. Here I needed to do that, but load a W-register instead, to only read 32 bits. I tried to do this by adding a separate LOADgot32 pseudoinstruction, but I wasn't able to make it return a 32 bit value on the SelectionDAG level.

Right, I see. This seems to be a new problem. When trying it on current trunk, I do get an unreachable, but when I try on older versions (ex. 3.8) I see a much simpler code coming out.

@tlsVar16 = thread_local global i16 0
@tlsVar32 = thread_local global i32 0
@tlsVar64 = thread_local global i64 0

define i64 @getVar() {
  %1 = load i16, i16* @tlsVar16
  %2 = load i32, i32* @tlsVar32
  %3 = load i64, i64* @tlsVar64

  %4 = zext i16 %1 to i32
  %5 = add i32 %2, %4

  %6 = zext i32 %5 to i64
  %7 = add i64 %3, %6
  ret i64 %7
}

then:

$ llc -mtriple aarch64-windows tls-32b.ll -o -

I get:

mrs  x8, TPIDR_EL0
add  x9, x8, :tprel_hi12:tlsVar16
add  x10, x8, :tprel_hi12:tlsVar32
add  x8, x8, :tprel_hi12:tlsVar64
add  x9, x9, :tprel_lo12_nc:tlsVar16
add  x10, x10, :tprel_lo12_nc:tlsVar32
add  x8, x8, :tprel_lo12_nc:tlsVar64
ldr  w10, [x10]
ldrh w9, [x9]
ldr  x8, [x8]
add  w9, w10, w9
add  x0, x8, w9, uxtw
ret

Which does the correct loads for half, single and double sizes.

Maybe the functionality was lost for not being tested properly on Windows?

Secondly: When I do a series of SHL + ZERO_EXTEND + Load, I would want it lowered into a ldr x0, [x0, w1 uxtw #3], but it ends up as lsl w1, w1, #3; ldr x0, [x0, w1 uxtw] which feels pointless. I remember doing similar things elsewhere, where it was folded properly into the ldr as a shift - what's missing here?

Who's familiar with the tablegen part of the AArch64 backend and/or SelectionDAG and can help out with this?

AFAICR, this is supposed to be done by the DAGCombine. However, it depends on when the Pseudos get expanded (I forget). If it's too late, then they won't get merged.

First: Currently there's the pseudoinstruction LOADgot which lowers into a "adrp + ldr" instruction pair, but the ldr always loads a full X-register. Here I needed to do that, but load a W-register instead, to only read 32 bits. I tried to do this by adding a separate LOADgot32 pseudoinstruction, but I wasn't able to make it return a 32 bit value on the SelectionDAG level.

Maybe the functionality was lost for not being tested properly on Windows?

No, that's probably because it aarch64-windows didn't exist at all back then (there's even no publicly available hardware yet), so this just did something else (looks like ELF relocations). And the issue isn't with reading tls vars of different sizes, but just in one small piece of implementing it properly - when reading the global _tls_index variable, which is a 32 bit value.

To reproduce the issue with this, that I'm facing, try adding the following to AArch64InstrInfo.td (next to the existing similar pattern for LOADgot):

def LOADgot32 : Pseudo<(outs GPR32:$dst), (ins i64imm:$addr),
                     [(set GPR32:$dst, (AArch64LOADgot tglobaladdr:$addr))]>,
              Sched<[WriteLDAdr]>;

When building with this pattern in place, tablegen fails like this:

FAILED: cd /home/martin/code/llvm/build/lib/Target/AArch64 && /home/martin/code/llvm/build/bin/llvm-tblgen -gen-instr-info -I /home/martin/code/llvm/lib/Target/AArch64 -I /home/martin/code/llvm/include -I /home/martin/code/llvm/lib/Target /home/martin/code/llvm/lib/Target/AArch64/AArch64.td -o /home/martin/code/llvm/build/lib/Target/AArch64/AArch64GenInstrInfo.inc.tmp
Type set is empty for each HW mode:
possible type contradiction in the pattern below (use -print-records with llvm-tblgen to see all expanded records).
LOADgot32:      (LOADgot32:{ *:[i32] } (tglobaladdr:{ *:[] }):$addr)
UNREACHABLE executed at ../utils/TableGen/CodeGenDAGPatterns.cpp:817!

I guess I could try to see what dag nodes this produces if I do such a load of a 32 bit global variable from normal IR.

No, that's probably because it aarch64-windows didn't exist at all back then (there's even no publicly available hardware yet), so this just did something else (looks like ELF relocations).

D'oh, of course! :)

I guess I could try to see what dag nodes this produces if I do such a load of a 32 bit global variable from normal IR.

That's what I'd recommend, yes.

mstorsjo updated this revision to Diff 137840.Mar 9 2018, 2:19 PM
mstorsjo edited the summary of this revision. (Show Details)

Managed to implement it without adding any new pseudo instructions. No significant effect on the generated code (the two missing combines are still as before), but this isn't nearly as messy in as many places as before.

While still suboptimal, I wouldn't be quite as embarrased to commit this as such now, with the rest of the instruction selection mess cleaned up now.

Right, this makes a lot more sense! :)

I'm assuming only 32-bit values are supported? If not, would it be too hard to add different sizes now?

mstorsjo updated this revision to Diff 137895.Mar 9 2018, 8:17 PM

Managed to merge the lsl into the ldr addressing by swapping the order of ZERO_EXTEND and SHL.

mstorsjo added a comment.EditedMar 9 2018, 8:19 PM

Right, this makes a lot more sense! :)

I'm assuming only 32-bit values are supported? If not, would it be too hard to add different sizes now?

No, it supports any size. This added code just calculates the address of the tls variable. (The fiddling with a 32 bit load was only for the _tls_index ABI variable, not the user defined tls variable itself.)

mstorsjo edited the summary of this revision. (Show Details)Mar 10 2018, 1:55 AM
mstorsjo updated this revision to Diff 137910.Mar 10 2018, 4:53 AM

Added testcases for getting a plain pointer, for storing to the tls var, and for other sizes of tls vars.

rengolin accepted this revision.Mar 10 2018, 5:50 AM

Right, now it's clear what it's doing, thanks! I'm happy with it, LGTM.

This revision is now accepted and ready to land.Mar 10 2018, 5:50 AM
This revision was automatically updated to reflect the committed changes.