Page MenuHomePhabricator

SelectionDAGBuilder: Improve canonicalization by not swapping branch targets
Needs ReviewPublic

Authored by MatzeB on Sep 23 2021, 8:53 AM.



Introduces a TargetLowerig::optimizeFallthroughsEarly() switch to
stop SelectionDAG flipping jump targets to create fallthroughs. Optimizing
for fallthroughs can negatively affect canonicalization and we can usually
leave the decision to MachineBlockPlacement.


In most targets comparison instructions produce results for multiple
relations, so a program like if (x == C) { ... } else if (x > C) { ... }
can use a single compare instructions for 2 conditional branches. This pattern
is widespread in one of our systems (a decref operation with support for
immortal objects).

Machine-CSE would not consistently trigger, because:
SelectionDAG would often optimizes for fallthroughs,
flip true/false and negate the condition:

x > C
--> !(x > C) with true/false branches flipped
--> X <= C
--> X < (C + 1)  canonicalization

So we end up with a different constant inhibiting CSE!

However with MachineBlockPlacement optimzing control flow
and fallthroughs anyway there is no real benefit from
SelectionDAG optimizing early when it negatively affects

Target Specifics

I did not apply the change to most targets:

  • ARM: IfConverter pass currently relies on fallthroughs existing
  • AArch64: Has a custom AArch64ConditionOptimizer pass to CSE comparison. Strangely the unit tests do not show consistent behavior with and without these changes. Glancing at the code I wonder if the algorithm only checks control flow among the "true" direction of a conditional branch and misses opportunities in the "false" direction.
  • PowerPC+Mips: It seems in some tests not instructions are inserted when optimizing for fallthrough instead of flipping true/false.
  • BPF+WebAssembly: The TargetInstrInfo::analyzeBranch implementation appears to be incomplete so MachineBlockPlacement cannot optimize for fallthroughs.
  • Other targets: Test regressions that I did not understand.

The changes are enabled for: X86, RISCV, Lanai, XCore and by default
for new targets.

Test Changes

There is a lot of test churn. Most is benign because of different
canonicalization (which in fact makes it more likely to see the same
constant in LLVM-IR and assembly now).

There are a number of tests containing unoptimized conditional branches
jumping on true/false, getting better or worse. But those are just
artifacts of bugpoint reduction or manual construction and
won't happen in practice after InstCombine:


These tests show improved codegen because duplicated CMPs get

test/CodeGen/X86/cse-cmp.ll (new test)

These appear to improve because of different normalization:


Diff Detail

Unit TestsFailed

150 msx64 debian > ORC-x86_64-linux.TestCases/Linux/x86-64::trivial-cxa-atexit.S
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -m64 -c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/orc/X86_64LinuxConfig/TestCases/Linux/x86-64/Output/trivial-cxa-atexit.S.tmp /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/orc/TestCases/Linux/x86-64/trivial-cxa-atexit.S
160 msx64 debian > ORC-x86_64-linux.TestCases/Linux/x86-64::trivial-static-initializer.S
Script: -- : 'RUN: at line 7'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -m64 -c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/orc/X86_64LinuxConfig/TestCases/Linux/x86-64/Output/trivial-static-initializer.S.tmp /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/orc/TestCases/Linux/x86-64/trivial-static-initializer.S
160 msx64 debian > ORC-x86_64-linux.TestCases/Linux/x86-64::trivial-tls.S
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -m64 -c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/orc/X86_64LinuxConfig/TestCases/Linux/x86-64/Output/trivial-tls.S.tmp /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/orc/TestCases/Linux/x86-64/trivial-tls.S

Event Timeline

MatzeB created this revision.Sep 23 2021, 8:53 AM
MatzeB requested review of this revision.Sep 23 2021, 8:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 23 2021, 8:53 AM
MatzeB edited the summary of this revision. (Show Details)Sep 23 2021, 8:55 AM

I always found this annoying. Can we just unconditionally remove it?

MatzeB added a subscriber: ast.Sep 23 2021, 8:57 AM

I always found this annoying. Can we just unconditionally remove it?

I would love to, but I am not in a position to deal with all the test churn (and I tried for the last couple days). Theres little details that need fixing (see "Target Specifics" section in the summary) and general test churn that needs review.

RKSimon added inline comments.Sep 23 2021, 9:17 AM

This kind of thing is going to increase EFLAGS usage - I addressed this for CMOVs + SETs with SGT/UGT in D101074 but haven't had time to address JMP/BRANCH, or add the SLE/ULE equivalent yet.

MatzeB added inline comments.Sep 23 2021, 9:28 AM

This kind of thing is going to increase EFLAGS usage

Wow, I was not aware that affects performance. Does that mean MachineCodePlacement / analyzeBranch can pessimize code when it flips the true/false branches and negates the condition of a jump?

craig.topper added inline comments.Sep 23 2021, 9:38 AM

At least on Intel CPUs, I'm not aware of a performance issue for jumps. There is an extra uop for CMOV and SETCC conditions that use the C flag and Z flag. That would cmovbe, cmova, setbe, and seta. There is no extra uop for branches on the same conditions.

I can't speak to AMD CPUs.

RKSimon added inline comments.Sep 23 2021, 9:44 AM

Its often tricky to see the effect of this in real world code, but yes we should always be trying to minimize the number of EFLAGS bits we depend on. Some x86 cpus split the dependencies into individual/groups of bits, others keep them together, its quite a mess and isn't something we've properly tried to model......

I'll see if I can resurrect the followups to D101074 - iirc I encountered a similar mammoth test churn to you in this patch.

Without having a good way to progress here, I went for a new approach of making X86InstrInfo::optimizeCompareInstr more powerful in D110862, D110863, D110865 and D110867...