This is an archive of the discontinued LLVM Phabricator instance.

Use 64bit jump table with large code model on 64bit
Needs ReviewPublic

Authored by yuyichao on Jun 20 2017, 9:45 AM.

Details

Summary

The data and code segments can be more than 32bit apart so the offset needs to be 64bit in size.

Diff Detail

Event Timeline

yuyichao created this revision.Jun 20 2017, 9:45 AM
t.p.northover accepted this revision.Jun 20 2017, 12:24 PM
t.p.northover added a subscriber: t.p.northover.

This looks reasonable to me.

This revision is now accepted and ready to land.Jun 20 2017, 12:24 PM
joerg requested changes to this revision.Jun 20 2017, 1:42 PM

At least the PPC change is definitely wrong. AArch64 should be wrong as well from what I discussed with Tim.

Blindly moving to 64bit differences is *not* the right approach. Moving to function relative offsets or not using a separate section is.

The entirely sensible assumption of the PPC backend is that a single function is no longer than 2GB/4GB. The position of the sections is irrelevant. The jump offsets inside the function can all be expressed as 32bit offset. The only problematic part is hooking up the address computation of this base in the backend.

This revision now requires changes to proceed.Jun 20 2017, 1:42 PM

The entirely sensible assumption of the PPC backend is that a single function is no longer than 2GB/4GB.

What's being produced is an offset from the basic blocks (.text) to the Jump table (.rodata). That's not necessarily 32-bits, and I think there are entirely legitimate reasons for putting jump tables in .rodata.

I expect something better could be done for PPC, but this is entirely in line with the existing 32-bit code and correctness comes before performance. I pretty strongly object to characterising the patch as "wrong".

joerg added a comment.Jun 20 2017, 2:08 PM

The entirely sensible assumption of the PPC backend is that a single function is no longer than 2GB/4GB.

What's being produced is an offset from the basic blocks (.text) to the Jump table (.rodata). That's not necessarily 32-bits, and I think there are entirely legitimate reasons for putting jump tables in .rodata.

That's not what PPC64 creates. It puts offsets between the BB and a picbase into the jumptable.

I expect something better could be done for PPC, but this is entirely in line with the existing 32-bit code and correctness comes before performance. I pretty strongly object to characterising the patch as "wrong".

Please read what I said. The PPC change is wrong: the code works correctly for the large code model. For X86_64, two option exists: non-PIC should work fine when using absolute pointers. That's the primary use case of large code model right now, i.e. for JIT. The better approach is to follow what PPC does and this is not done by this patch.

yuyichao added a comment.EditedJun 20 2017, 2:20 PM

I expect something better could be done for PPC, but this is entirely in line with the existing 32-bit code and correctness comes before performance. I pretty strongly object to characterising the patch as "wrong".

Exactly, the code generated for aarch64, ppc64 and x86-64 in this case was obviously wrong in correctness sense (feels wierd to say it out...). The new code generated for x86-64 and aarch64 looks correct to me. I'm not very familiar with ppc64 assembly to tell but given the code generating the assembly is pretty generic I expect it to be correct at least. I have no idea what exactly would be a better code to generate for either of these cases. I do agree a function can be assumed to be smaller than 2G so if a different but existing code path can be used I'll be happy to make that change for a specific arch. Otherwise, I'd like to keep the "generic" way currently used since that matches what's done in 32bit mode and leave the performance optimization part to people more familiar with the performance model and trade offs on different archs.

FWIW, the PPC backend does seems to be doing some transformation so that a function local offset is end up being used. The original function never returns EK_GPRel32BlockAddress and I can't really verify the correctlyness of the resulting assembly so I went with the version that shares the code path with two other archs that I can verify. If EK_GPRel32BlockAddress should work on ppc64 and generates an pc offset table, I'll be happy to make that change.

It puts offsets between the BB and a picbase into the jumptable.

Yes, I assumed it's some other valid transformation done by the backend since I checked in the debugger that an offset between the BB and a data section was asked for. Would EK_GPRel32BlockAddress be the right solution on ppc64 then?

For AArch64 there was a previous discussion here: https://reviews.llvm.org/D32564 (some replies don't seem to be here, the thread is at http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20170424/448545.html). That patch would supersede this one I believe, but until I get around to resurrecting it this patch at least makes things work.

FWIW, the PPC backend does seems to be doing some transformation so that a function local offset is end up being used.

Ah yes, I've run the code myself and see what Joerg meant now (apologies, and thanks for overriding me). It seems like disabling this for PPC is probably a good idea, the existing 32-bit offsets are probably fine.

Which leaves x86. It looks like x86 is much like PPC and correct right now: jump tables immediately follow each function.

So I think I should probably actually do something about that older patch.

joerg added a comment.Jun 20 2017, 3:03 PM

The x86_64 code is mixed: it is correct for non-PIC but large, it doesn't work correctly for PIC. That covers the cases most people have been interested in, especially since large model and PIC runs into various other issues too from what I remember.

It would be really nice to use the proper 32bit difference for PIC in general. It would be a stepping stone to using 8bit or 16bit labels for smaller functions. But I run out of time banging my head against the lowering when I last looked at it.

the existing 32-bit offsets are probably fine.

Ah, I just realize that the large code model is indeed doing some magic here. I'll leave it as is then and update the test. (Though it feels a little strange since the wrong format is asked for....)

Which leaves x86. It looks like x86 is much like PPC and correct right now:

I believe it's actually wrong. The jump table is emitted into .rodata and can be placed far away. It's where I actually saw segfault/assertion.

That covers the cases most people have been interested in, especially since large model and PIC runs into various other issues too from what I remember.

For some reason we are using large code model with PIC in the JIT and this is the only issue I see on x86-64 so far.

yuyichao updated this revision to Diff 103283.Jun 20 2017, 4:03 PM
yuyichao edited edge metadata.

PPC changes reverted and tests updated. I believe this addresses all the comment so far.

So to summarize,

On PPC64, there's some magic that fixes the large offset so the final code is correct already.
On AArch64, there might be something fancy we can do (use the pointer that we use to find the data section?) but this should make it work correctly before that.
On X86-64, this is needed to make it behave correctly. Not sure if there's better trick.

joerg added a comment.Jun 20 2017, 4:12 PM

For some reason we are using large code model with PIC in the JIT and this is the only issue I see on x86-64 so far.

Who is we? I'm moderately sure that is not the case in general, since it would create less efficient code.

joerg added a comment.Jun 20 2017, 4:15 PM

If you want to create a stop-gap solution that bad, I would make it emit the jump table as writable for non-PIC && largeish code model. That's a much more contained change. It would be really better to work on the real fix and not add more hacks...

Who is we? I'm moderately sure that is not the case in general, since it would create less efficient code.

No that's not the default. "We" are julia on x86-64...... Ref https://github.com/JuliaLang/julia/pull/22451 . In short I think we turned it on to support threading. Comments there are welcome =)

Why would a writable jump table help?

joerg added a comment.Jun 21 2017, 2:16 AM

If the jump table is writable, you can just use the absolute pointers, PIC or no PIC. That would be the "block address" encoding.

If the jump table is writable

FWIW, that also sounds like a hack and a (minor?) security issue.

It would be really better to work on the real fix and not add more hacks...

And what exactly do you mean by "the real fix". I believe this is a reasonable generic fallback before an optimization is implemented for a particular arch and given that this is exactly what GCC does on x86-64 I think this is (one of) the correct solution on x86-64 too.
On AArch64, GCC throws an error, which I think is much better than silently generating the wrong code....

joerg added a comment.Jun 21 2017, 7:20 AM

If the jump table is writable

FWIW, that also sounds like a hack and a (minor?) security issue.

It's no better or worse than the GOT.

It would be really better to work on the real fix and not add more hacks...

And what exactly do you mean by "the real fix". I believe this is a reasonable generic fallback before an optimization is implemented for a particular arch and given that this is exactly what GCC does on x86-64 I think this is (one of) the correct solution on x86-64 too.
On AArch64, GCC throws an error, which I think is much better than silently generating the wrong code....

There are three options given here so far:
(1) Use the plain block address. Requires replacing the default getSectionForJumpTable and getJumpTableEncoding. Change is localized to the affected architectures.
(2) Introduce a whole new generic 64bit label difference. Non-localized infrastructure change.
(3) Properly switch to function-relative 32bit labels. Change is localized the affected architecture or at least support glue.

In terms of code overhead for the access, (1) is strictly the shortest, plain indirect branch to a indexed memory location. (2) needs a pointer load + offset computation, (3) needs a pointer load + PC-relative offset computation. As such, (2) and (3) are often somewhat equal. (1) and (2) require the same amount of memory for the jump table, making (2) not very attractive when relocations themselve are ephemeral. (3) saves significant amounts of space for any non-trivial jump table.

Note that GCC is quite different as it often will not create a separate jump table section. That's also an option [(4)] supported by LLVM with some overrides and it will work for large code model at the expensive of making more static data executable.

Based on all that, I do not consider the complexity of (2) justified at all for a short term workaround of target-specific limitations. (1) and (4) are easier and create faster code. (3) is the preferred implementation for 64bit platforms as it minimizes size of executable code and total binary size at the expensive of a slightly more complex access vector. The only reason why it isn't implemented for AArch64 and X86_64 yet is the necessary function-specific base address.

I see the non-arch specific property as the good part since currently everything other than PPC claims to support large model with PIC and just generate wrong code. If a fallback that's always implementable and correct is defined, the only arch-specific changes needed will be for efficiency and not for correctness.

Note that GCC is quite different as it often will not create a separate jump table section.

AFAICT it is using rodata as the section. Same as LLVM here.

The only reason why it isn't implemented for AArch64 and X86_64 yet is the necessary function-specific base address.

Looking at the assembly, I think AArch64 should always has a function base address to use (accessing the jump table in PIC way already requires adrp) but that does not seem to be the case for x86_64 where the function local address isn't stored in any register and it seems that doing that will in general require one more instructions.