Page MenuHomePhabricator

[x86] Introduce a pass to begin more systematically fixing PR36028 and similar issues.

Authored by chandlerc on Apr 1 2018, 2:18 AM.



The key idea is to lower COPY nodes populating EFLAGS by scanning the
uses of EFLAGS and introducing dedicated code to preserve the necessary
state in a GPR. In the vast majority of cases, these uses are cmovCC and
jCC instructions. For such cases, we can very easily save and restore
the necessary information by simply inserting a setCC into a GPR where
the original flags are live, and then testing that GPR directly to feed
the cmov or conditional branch.

However, things are a bit more tricky if arithmetic is using the flags.
This patch handles the vast majority of cases that seem to come up in
practice: adc, adcx, adox, rcl, and rcr; all without taking advantage of
partially preserved EFLAGS as LLVM doesn't currently model that at all.

There are a large number of operations that techinaclly observe EFLAGS
currently but shouldn't in this case -- they typically are using DF.
Currently, they will not be handled by this approach. However, I have
never seen this issue come up in practice. It is already pretty rare to
have these patterns come up in practical code with LLVM. I had to resort
to writing MIR tests to cover most of the logic in this pass already.
I suspect even with its current amount of coverage of arithmetic users
of EFLAGS it will be a significant improvement over the current use of
pushf/popf. It will also produce substantially faster code in most of
the common patterns.

The patch isn't complete yet. I at least want to add an optimization
which looks for opposite conditions and when present simply inverts
whatever test is being done.

I also have a MIR test file I've been using to test this stuff. I need
to add CHECK lines to it and add it to the patch before it can go in.

However, I wanted to get some early feedback on the approach and how I'm
doing this, so I'm sending the patch on out.

Lots of thanks to Reid who came up with several aspects of this
approach, and Craig who helped me work out a couple of things tripping
me up while working on this.

Diff Detail


Event Timeline

chandlerc created this revision.Apr 1 2018, 2:18 AM

I may go ahead and pull the changes out of this. The SchedRW is incomplete for ADCX, it needs a ReadAfterLd.

Shoudl we pull out the X86InstrInfo.cpp changes too? Is this the same as the change in the spectre load hardening patch too?

461 ↗(On Diff #140575)


144 ↗(On Diff #140575)


chandlerc updated this revision to Diff 140598.Apr 1 2018, 3:04 PM

Rebase after splitting out patches and after fixing some minor code review comments.

chandlerc marked 2 inline comments as done.Apr 1 2018, 3:05 PM

I may go ahead and pull the changes out of this. The SchedRW is incomplete for ADCX, it needs a ReadAfterLd.

After talking on IRC, I've just landed this part of the patch. It's less broken and should make it easy for you to improve this stuff.

Shoudl we pull out the X86InstrInfo.cpp changes too? Is this the same as the change in the spectre load hardening patch too?


Also, addressed the minor comments inline.

rnk added a comment.Apr 2 2018, 11:08 AM


73 ↗(On Diff #140598)

nit: They're really pseudo instrutions. Now that we're not in the DAG, calling them nodes doesn't seem accurate.

11–14 ↗(On Diff #140598)

I feel like the first one sentence summary of why we need this pass shouldn't include "so we don't need pushf/popf". That presumes the reader already thinks that "popf" is a solution the problem, when it was never actually a viable solution to begin with for all the reasons you outline. The first sentence should state directly that there is no correct or efficient way to copy comparison flags on x86. Then you can go on and explain that the reader might consider popf, lahf, etc, but these are innefficient and incorrect.

101 ↗(On Diff #140598)

This looks dead, I can't find it in the rest of the pass.

357 ↗(On Diff #140598)

Any reason not to delete dead copies so we have the invariant that after this pass, regardless of future pass ordering, no flags copies will be emitted?

371 ↗(On Diff #140598)

Maybe call it FlagsKilled.

399–401 ↗(On Diff #140598)

This would be clearer as:
Once we encounter a branch, the rest of the instructions must also be branches. We can't rewrite them in place here, so we handle them below.

403 ↗(On Diff #140598)

This might need to be addressed before committing.

424 ↗(On Diff #140598)

Would it be better to assert this and set IsKilled before the rewrite? This is similar to how you set the IsKilled bool above before rewrites.

449–454 ↗(On Diff #140598)

Do you want to commit this? If so, I'd add braces (DEBUG({ ... });) and re-format it.

Given that we report_fatal_error here, it doesn't seem like we need to mention popf in the logs. The plan is to remove that physreg copy logic so LLVM never emits it.

462–464 ↗(On Diff #140598)

splitBlock has lots of side effects. This would be clearer as:

MachineBasicBlock *JmpMBB = &MBB;
if (JmpI != JmpIs.front())
  JmpMBB = &splitBlock(...);

It's used as a pointer in the assert anyway.

470–471 ↗(On Diff #140598)

I see. Yeah, that is annoying to implement. Don't later passes try to infer this logic? Maybe it's OK to rely on that?

479–480 ↗(On Diff #140598)

This seems redundant with -print-after=

561 ↗(On Diff #140598)

This seems like the only use of getMnemonicFromOpcode. It seems like you could simplify all that stuff down to just three states:

  1. non-flags-using arithmetic
  2. arithmetic using CF
  3. arithmetic using OF (ADDOX)

I'm guessing this more detailed classification of opcodes is useful if we want to do sophisticated ADDOX optimizations for multi-precision arithmetic. Do you think it's worth keeping around?

584–585 ↗(On Diff #140598)

Hah, I was scratching my head at this, but I guess at this point during codegen we haven't done two-operand conversions yet. :)

620 ↗(On Diff #140598)

Pass name is wrong. Blech, boiler-plate. =P

chandlerc marked 12 inline comments as done.Apr 2 2018, 8:38 PM
chandlerc added inline comments.
11–14 ↗(On Diff #140598)

Yeah, totally rewrote this. Lemme know what you think.

101 ↗(On Diff #140598)

Wow, sorry. Copy-pasted from my other pass.

403 ↗(On Diff #140598)

Amazingly, I have found no test case that hits this. But yes, we need to handle this.

I'm going to work on crafting a MIR test case to hit this and then will fix it.

424 ↗(On Diff #140598)

Yes, thanks.

462–464 ↗(On Diff #140598)

Ended up rewriting this anyways, and yeah, much clearer.

470–471 ↗(On Diff #140598)

It's OK, just suboptimal. IT will make some things less efficient.

479–480 ↗(On Diff #140598)

I find it crazy useful. But sure.

561 ↗(On Diff #140598)

I can simplify it if you really want.

Honestly, having done the simplification, it's a mess to read. The issue is that the macros to do all the silly matching of the crazy number of instruction op codes ends up making it really hard to read the business logic here. So I think just for readability, I'd rather keep the enum and static mapping function.

The use of a switch here is just to force us to update it when/if we add a new mnemonic.

584–585 ↗(On Diff #140598)

Yeah, hilariously.

chandlerc updated this revision to Diff 140732.Apr 2 2018, 8:40 PM
chandlerc marked 6 inline comments as done.

Major update. Addresses all but one code review comment. Also updates all tests
and adds my dedicated MIR test for this functionality.

The test updates are particularly great. Tons of removed code. Everything
consolidates on a common (good) codegen pattern.

I've also implemented various optimizations here such as scavenging existing
setCC instructions and looking for inverted conditions when its straightforward
to do so.

chandlerc updated this revision to Diff 140751.Apr 3 2018, 3:50 AM

Another major update.

We now have test coverage of tail calls which was the last major functionality
missing from this patch.

I've also now removed all of the old lowering for EFLAGS copies. I've also
removed the forced dynamic stack adjustment due to the presence of EFLAGS. This
even further improved the generated assembly for all the test cases in
question, many of which no longer require a frame pointer.

This also has passed the LLVM test suite with asserts enabled. I also have the
separation of DF from EFLAGS patch working on top of this one, passing all of
its tests and the LLVM test suite.

chandlerc marked 2 inline comments as done.Apr 3 2018, 3:53 AM

Last major issue I'm aware of with this patch addressed (see below). I think this is ready for another round of review.

403 ↗(On Diff #140598)

Turns out to not be amazing at all.

We don't form conditional tail calls until post-RA and we form them out of conditional branches. So by fixing conditional branches here we get conditional tail calls to use this lowering. I've added a test to copy-eflags.ll which actually forms a conditional tail call in the end and it passes and in fact uses very nice lowering. =D

I've updated this comment to reflect what is going on here as it is a surprising detail to realize here.

niravd added a subscriber: niravd.Apr 3 2018, 11:18 AM
rnk added inline comments.Apr 3 2018, 1:43 PM
11–14 ↗(On Diff #140598)

Looks good

561 ↗(On Diff #140598)

Makes sense.

545–546 ↗(On Diff #140751)

Do you think we need to check if this is a virtual register? It won't work if it's a physreg, but I also can't imagine how we'd end up with a SETcc into a physical register...

548 ↗(On Diff #140751)

It's a correctness issue to keep scanning, since if you find other SETcc's, they will be from the wrong flag state. This might be clearer as: Stop scanning when we find the instruction that defined the flags.

6804 ↗(On Diff #140751)


11–12 ↗(On Diff #140751)

Can we do this now?

11–29 ↗(On Diff #140751)

These comments are super stale. Maybe update them to reflect the new checks?

10–11 ↗(On Diff #140751)

I guess all this IR up here is MIR boilerplate. =/

1 ↗(On Diff #140751)

This seems like a questionable use of auto-generated llc checks. :(

5–6 ↗(On Diff #140751)

Can we do this?

chandlerc marked 22 inline comments as done.Apr 3 2018, 4:40 PM

All done. Everything just passed the machine verifier. =]

11–29 ↗(On Diff #140751)

Updated a bit. We still do get the long change of copies, but now they fit into the register allocator so it generates much better code.

10–11 ↗(On Diff #140751)


1 ↗(On Diff #140751)

Yeah, but not gonna fix here. Already tried (hard) to make it better and it is still bleh.

chandlerc updated this revision to Diff 140884.Apr 3 2018, 4:40 PM
chandlerc marked 3 inline comments as done.

Update based on review comments.

craig.topper added inline comments.Apr 3 2018, 10:57 PM
615 ↗(On Diff #140884)

I think these Addend values are backwards. 255 would work for carry, and 127 would work for overflow

rnk added inline comments.Apr 4 2018, 11:22 AM
615 ↗(On Diff #140884)

I confirmed this with assembly. You might want to do some multi-precision ADDC execution tests locally to validate the change after fixing this.

chandlerc added inline comments.Apr 4 2018, 10:08 PM
615 ↗(On Diff #140884)

I tried sooooo many tests, included a bunch of high-precision arithmetic library tests... Nothing failed. =[ The only thing I've seen actually hit this code path is the mul-i1024.ll regression test ironically.

Maybe I can kill two birds with one stone by moving mul-i1024.ll to instead be a bitcode test of basic arithmetic. Anyways, any reason to not commit with this fixed?

craig.topper added inline comments.Apr 4 2018, 10:52 PM
378 ↗(On Diff #140884)

Is this longer than 80 columns?

500 ↗(On Diff #140884)

Nit - extra space before the exclamation mark.

589 ↗(On Diff #140884)

Are we clever enough to make that conversion to the shorter encoding?

chandlerc added inline comments.Apr 5 2018, 4:24 AM
615 ↗(On Diff #140884)

Ok, I've now built a framework to let me add execution tests to the test-suite with arbitrary hand-written IR and written some high-precision integre multiply code that we know can hit this because we have a file-check test for it.

First problem, i have to do an i4096 x i4096 multiply to see EFLAGS get saved and restored on x86-64.

Second problem, I can't find an input to the multiply that produces wrong results. I know such an input exists, but only a relatively small number of carries in the multiply actually get hit with the bug. Finding the right input seems ... challenging.

Anyways, I'll clean up this execution test and land it. At the least it will let us get rid of a FileCheck based test for huge integer multiply lowering. SHould also show a good path to adding *very* reduced execution test cases which seems valuable.

I can add an execution test that *specifically* forces EFLAGS to be copied feeding adds with carries if you want, but it feels .... very narrow. Won't catch any future bugs.

chandlerc updated this revision to Diff 141589.Apr 9 2018, 12:37 AM
chandlerc marked 6 inline comments as done.

Rebase and updates. I think this is probably ready to go in now?

I'm sending my execution tests in a separate patch to the test suite, but
hopefully they don't block this. Even with testing designed to maximize
carrying, I still have never managed to produce a real-world test that would
fail. The synthetic tests at least cover the correct CF/OF behavior.

This revision is now accepted and ready to land.Apr 9 2018, 9:45 AM
This revision was automatically updated to reflect the committed changes.