This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/trunk/
-
trunk/
-
lib/Target/X86/
-
Target/
-
X86/
-
X86RegisterInfo.h
-
test/
-
CodeGen/X86/
-
X86/
-
GlobalISel/
-
add-scalar.ll
-
and-scalar.ll
-
ashr-scalar.ll
-
binop.ll
-
callingconv.ll
-
ext-x86-64.ll
-
ext.ll
-
lshr-scalar.ll
-
memop-scalar.ll
-
mul-scalar.ll
-
or-scalar.ll
-
phi.ll
-
ptrtoint.ll
-
shl-scalar-widening.ll
-
shl-scalar.ll
-
sub-scalar.ll
-
trunc.ll
-
undef.ll
-
xor-scalar.ll
-
add.ll
-
addcarry.ll
-
and-encoding.ll
-
andimm8.ll
-
anyext.ll
-
apm.ll
-
atomic-eflags-reuse.ll
-
atomic128.ll
-
avg.ll
-
avoid-sfb.ll
-
avx-intel-ocl.ll
-
avx-vinsertf128.ll
-
avx512-arith.ll
-
avx512-calling-conv.ll
-
avx512-insert-extract.ll
-
avx512-intrinsics-upgrade.ll
-
avx512-mask-op.ll
-
avx512-regcall-NoMask.ll
-
avx512-schedule.ll
-
avx512-select.ll
-
avx512bw-mask-op.ll
-
avx512dq-mask-op.ll
-
avx512vl-arith.ll
-
bigstructret.ll
-
bitcast-i256.ll
-
bitcast-int-to-vector-bool.ll
-
bitreverse.ll
-
bmi-intrinsics-fast-isel-x86_64.ll
-
bmi-intrinsics-fast-isel.ll
-
bmi.ll
-
bmi2.ll
-
bool-math.ll
-
bool-simplify.ll
-
bswap-rotate.ll
-
bswap-wide-int.ll
-
bswap.ll
-
bswap_tree.ll
-
bswap_tree2.ll
-
bt.ll
-
btc_bts_btr.ll
-
bypass-slow-division-64.ll
-
clear-highbits.ll
-
clear-lowbits.ll
-
cmov-into-branch.ll
-
cmov.ll
-
cmovcmov.ll
-
cmp.ll
-
cmpxchg-clobber-flags.ll
-
cmpxchg-i128-i1.ll
-
combine-add.ll
-
combine-rotates.ll
-
combine-sdiv.ll
-
combine-udiv.ll
-
combine-urem.ll
-
conditional-indecrement.ll
-
dagcombine-select.ll
-
divide-by-constant.ll
-
divrem.ll
-
divrem8_ext.ll
-
extract-lowbits.ll
-
fast-isel-fold-mem.ll
-
fast-isel-select-cmov.ll
-
fast-isel-select-cmov2.ll
-
fast-isel-select-pseudo-cmov.ll
-
fast-isel-sext-zext.ll
-
fast-isel-shift.ll
-
fast-isel-store.ll
-
fixup-bw-copy.ll
-
fma.ll
-
fold-vector-sext-crash2.ll
-
funnel-shift-rot.ll
-
funnel-shift.ll
-
ghc-cc64.ll
-
hipe-cc64.ll
-
i128-mul.ll
-
iabs.ll
-
imul.ll
-
lea-opt.ll
-
legalize-shift-64.ll
-
legalize-shl-vec.ll
-
machine-combiner-int.ll
-
machine-cp.ll
-
machine-cse.ll
-
madd.ll
-
mask-negated-bool.ll
-
misched-matmul.ll
-
mul-constant-i16.ll
-
mul-constant-i32.ll
-
mul-constant-i64.ll
-
mul-i1024.ll
-
mul-i256.ll
-
mul-i512.ll
-
mul128.ll
-
mul64.ll
-
mwaitx-schedule.ll
-
mwaitx.ll
-
negate-i1.ll
-
negate-shift.ll
-
negate.ll
-
no-sse2-avg.ll
-
not-and-simplify.ll
-
palignr.ll
-
peep-setb.ll
-
pku.ll
-
pmaddubsw.ll
-
pmulh.ll
-
pr12360.ll
-
pr15705.ll
-
pr15981.ll
-
pr23664.ll
-
pr28173.ll
-
pr34653.ll
-
pr34657.ll
-
promote-i16.ll
-
ptest.ll
-
rot16.ll
-
rot64.ll
-
rotate.ll
-
rotate2.ll
-
rotate4.ll
-
sar_fold64.ll
-
sat-add.ll
-
scalar_widen_div.ll
-
schedule-x86-64-shld.ll
-
schedule-x86_64.ll
-
select.ll
-
select_const.ll
-
selectcc-to-shiftand.ll
-
setcc-logic.ll
-
sext-i1.ll
-
shift-and.ll
-
shift-bmi2.ll
-
shift-double-x86_64.ll
-
shift-double.ll
-
shift-pair.ll
-
shuffle-of-insert.ll
-
signbit-shift.ll
-
sret-implicit.ll
-
sse1.ll
-
sse3-intrinsics-x86.ll
-
sse3-schedule.ll
-
sse42-intrinsics-fast-isel-x86_64.ll
-
sse42-intrinsics-fast-isel.ll
-
sse42-intrinsics-x86.ll
-
sse42-intrinsics-x86_64.ll
-
sse42-schedule.ll
-
sttni.ll
-
subcarry.ll
-
swift-return.ll
-
swifterror.ll
-
system-intrinsics-xsetbv.ll
-
tbm-intrinsics-fast-isel-x86_64.ll
-
tbm-intrinsics-fast-isel.ll
-
tbm_patterns.ll
-
trunc-subvector.ll
-
twoaddr-lea.ll
-
umul-with-overflow.ll
-
umulo-128-legalisation-lowering.ll
-
unfold-masked-merge-scalar-constmask-innerouter.ll
-
unfold-masked-merge-scalar-constmask-interleavedbits.ll
-
unfold-masked-merge-scalar-constmask-interleavedbytehalves.ll
-
unfold-masked-merge-scalar-constmask-lowhigh.ll
-
unfold-masked-merge-scalar-variablemask.ll
-
unfold-masked-merge-vector-variablemask-const.ll
-
unfold-masked-merge-vector-variablemask.ll
-
urem-power-of-two.ll
-
urem-seteq-optsize.ll
-
use-add-flags.ll
-
vec_cast.ll
-
vector-bitreverse.ll
-
vector-blend.ll
-
vector-compare-results.ll
-
vector-interleave.ll
-
vector-pcmp.ll
-
vector-reduce-fadd-fast.ll
-
vector-shift-ashr-128.ll
-
vector-shift-lshr-128.ll
-
vector-shuffle-128-v2.ll
-
vector-shuffle-combining-sse4a.ll
-
vector-shuffle-combining-ssse3.ll
-
vector-shuffle-combining.ll
-
vector-zext.ll
-
vectorcall.ll
-
vselect-minmax.ll
-
vselect.ll
-
widen_bitops-0.ll
-
widen_bitops-1.ll
-
widen_load-2.ll
-
widen_load-3.ll
-
win64_vararg.ll
-
x64-cet-intrinsics.ll
-
x86-64-bittest-logic.ll
-
x86-64-double-shifts-Oz-Os-O2.ll
-
x86-cmov-converter.ll
-
x86-shrink-wrapping.ll
-
xaluo.ll
-
xchg-nofold.ll
-
xmulo.ll
-
xor.ll
-
DebugInfo/
-
COFF/
-
pieces.ll
-
X86/
-
live-debug-values.ll
-
live-debug-variables.ll
-
pieces-3.ll

Differential D38128

Handle COPYs of physregs better (regalloc hints)
ClosedPublic

Authored by RKSimon on Sep 21 2017, 4:29 AM.

Download Raw Diff

Details

Reviewers

uweigand
qcolombet
MatzeB
t.p.northover
niravd
arsenm
rampitec
rovka
yanyh
aemerson
craig.topper
jonpa
efriedma
spatel
nemanjai

Commits

rG2d0f20cc0434: [X86] Handle COPYs of physregs better (regalloc hints)
rL342578: [X86] Handle COPYs of physregs better (regalloc hints)

Summary

While enabling the mischeduler for SystemZ, it was discovered that fore some reason a test needed one extra seemingly needless COPY (test/CodeGen/SystemZ/call-03.ll). The handling for that is resulted in this patch, which improves the register coalescing by providing not just one copy hint, but a sorted list of copy hints. On SystemZ, this gives ~12500 less register moves on SPEC, as well as marginally less spilling.

Instead of improving just the SystemZ backend, the improvement has been implemented in common-code (calculateSpillWeightAndHint(). This gives a lot of test failures, but since this should be a general improvement I hope that the involved targets will help and review the test updates.

Diff Detail

Repository: rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

In D38128#904781, @jonpa wrote:

In D38128#904581, @efriedma wrote:

Not sure if it is clear that coalescing with %R8 is generally better than %R0.

What do you mean, it isn't clear? Is the performance problem not clear? Or do you mean you're not sure how to detect this situation when you're sorting the hints?

I guess given your initial wording ("not great"), I was not sure if this is general and serious enough so that we really want to add an additional heuristic like "COPY to compare" on top of the sorting by weight. I suppose then there should be a flag like HasCompareUser which is then the tie-breaker when the weight is the same, or?

I don't believe there's anything special about compare here, this applies the same to any other use. In general, if you have something like "COPY dest, src" followed by "use dest", it is nearly always better to replace the second instruction by "use src" if this is possible (i.e. if src is still live and unchanged at that point). This is simply because on most processor architectures "COPY dest, src / use src" can execute in parallel, while "COPY dest, src / use dest" must execute sequentially.

In D38128#904861, @uweigand wrote:

In D38128#904781, @jonpa wrote:

In D38128#904581, @efriedma wrote:

Not sure if it is clear that coalescing with %R8 is generally better than %R0.

What do you mean, it isn't clear? Is the performance problem not clear? Or do you mean you're not sure how to detect this situation when you're sorting the hints?

I guess given your initial wording ("not great"), I was not sure if this is general and serious enough so that we really want to add an additional heuristic like "COPY to compare" on top of the sorting by weight. I suppose then there should be a flag like HasCompareUser which is then the tie-breaker when the weight is the same, or?

I don't believe there's anything special about compare here, this applies the same to any other use. In general, if you have something like "COPY dest, src" followed by "use dest", it is nearly always better to replace the second instruction by "use src" if this is possible (i.e. if src is still live and unchanged at that point). This is simply because on most processor architectures "COPY dest, src / use src" can execute in parallel, while "COPY dest, src / use dest" must execute sequentially.

That makes sense.

Would any of these be an improvement worth trying, then?

(1) If the source phys-reg is contained in the regclass of an immediately following instruction using the vreg, then increase the priority of the hint of that phys-reg.
(2) A simpler alternative would be to simply prefer phys-reg sources more than phys-reg dest-regs (if weight is equal). That would catch all the cases of (1).

In D38128#904899, @jonpa wrote:

Would any of these be an improvement worth trying, then?

(1) If the source phys-reg is contained in the regclass of an immediately following instruction using the vreg, then increase the priority of the hint of that phys-reg.
(2) A simpler alternative would be to simply prefer phys-reg sources more than phys-reg dest-regs (if weight is equal). That would catch all the cases of (1).

I believe (2) makes sense ... certainly worth a try.

In D38128#905052, @uweigand wrote:

In D38128#904899, @jonpa wrote:

Would any of these be an improvement worth trying, then?

(1) If the source phys-reg is contained in the regclass of an immediately following instruction using the vreg, then increase the priority of the hint of that phys-reg.
(2) A simpler alternative would be to simply prefer phys-reg sources more than phys-reg dest-regs (if weight is equal). That would catch all the cases of (1).

I believe (2) makes sense ... certainly worth a try.

I see that on SPEC/z13, I now get a few more copys left. Compared to master:

lgr            :               341784               328956   -12828
lr             :                26208                25803     -405
...
Spill|Reload   :               165135               164761     -374

-> with prefer phys reg source COPYs:

lgr            :               341784               329060   -12724
lr             :                26208                25817     -391
...
Spill|Reload   :               165135               164934     -201

I also see looking at just the loops that while 120 innner loops have changed size, 103 of them has gotten
bigger, which seems to indicate that at least currently (Wei Mi has an upcoming regsplit patch that may
change this), this idea is not quite promising.

It did however handle the test case under discussion (ARM/swifterror.ll)

Reversing this heuristic so that phys-reg dest copys are prioritized:

lgr            :               341784               328907   -12877
lr             :                26208                25762     -446
...
Spill|Reload   :               165135               164807     -328

For the innermost loops, there are only 34 different (15 smaller).

I am not sure it is right to tamper with this in this way given the above...

Right. Always coalescing with the copy source of course extends the live range of the source, and thus increases overall register pressure. So it's not a good idea to do this unconditionally.

On the other hand, in the example that triggered this discussion, there was no register pressure problem and source and destination register just happened to be both still live at the point of use, anyway. In this specific case, it would be preferable to use the source instead of the dest register at the point of use. On the other hand, if keeping the source alive would cause extra spilling, we don't want that, but want to use the dest after all.

Not sure how exactly to handle this in LLVM. It is probably not simply a matter of preference in the register coalescer.

Eli, do you have any comment on this?

What specifically do you want me to comment on? I don't really understand the register allocator well enough to comment on your code changes.

In D38128#908337, @efriedma wrote:

What specifically do you want me to comment on? I don't really understand the register allocator well enough to comment on your code changes.

Given that this is something that Quentin has been meaning to fix before I did, and that this is *generally* good, the question is now if we can accept minor issues like this. On SystemZ, I think this is quite clear since we get 12500 less COPYs and also less spilling (on SPEC). Could you perhaps try it on your target and see if you likewise would agree this is good enough despite the swifterror test issue?

Also, it could be that other tests with a similar code actually now improve, since this was a random result. So just one regression alone should not have to stop the patch, unless it is a sign of a more general issue.

It's a minor issue; I'm okay with committing this as-is and trying to address the issue later.

It's a minor issue; I'm okay with committing this as-is and trying to address the issue later.

All right - I take this to mean that you approve of the ARM test changes, then. Thanks.

PING!

Patch updated with a few X86 tests updated. Again, someone from each target with the right experience should review the test changes I have made:

TODO:
CodeGen : AArch64 AMDGPU BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

DONE:
CodeGen : ARM
DebugInfo :

ping @tstellar: CodeGen/AMDGPU/multilevel-break.ll

In D38128#910425, @jonpa wrote:

TODO:
CodeGen : AArch64 AMDGPU BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore

AArch64/win64_vararg.ll is still ok.
arm64-aapcs.ll and func-argpassing.ll also seem fine.

swifterror.ll seems to get a very minor regression, not sure who to get an authoritative approval from though.

PING!

@Quentin, I implemented this in common code because you had already plans for this... It's been weeks now since I updated all the tests manually, which was not a small effort. Still, there is very little progress on the review of the tests :-/ Do you know who I should add as reviewers? (BTW patch slightly changed since you approved it - still ok, I hope?)

@Tom, I hope you can find the time to look into the AMDGPU tests as I thought you meant to do...

Review of test changes:

TODO:
CodeGen : AMDGPU BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

PARTIALLY DONE (still more tests in need of review):
CodeGen :

AArch64: win64_vararg.ll, arm64-aapcs.ll, func-argpassing.ll (Martin Storsjo) swifterror.ll small regression / needs approval (Martin Storsjo)

DONE:
CodeGen :

ARM (Eli Friedman)

Just looking at the number and placement of reg-reg moves for PowerPC seems fine. It doesn't seem like there are regressions but no huge reduction in the number of copies. However, since the diff is without context, it's hard to tell exactly what is going on in the test cases. I'll apply this patch to ToT and have a look at how things change before signing off on the PPC changes.

The AMDGPU tests look OK, but the CHECK lines for test/CodeGen/AMDGPU/sgpr-control-flow.ll should be updated like this:
https://reviews.llvm.org/P8044

@Nemanja:

Just looking at the number and placement of reg-reg moves for PowerPC seems fine. It doesn't seem like there are regressions but no huge reduction in the number of copies. However, since the diff is without context, it's hard to tell exactly what is going on in the test cases. I'll apply this patch to ToT and have a look at how things change before signing off on the PPC changes.

Sorry about forgetting the context - fixed.

@Tom:

The AMDGPU tests look OK, but the CHECK lines for test/CodeGen/AMDGPU/sgpr-control-flow.ll should be updated like this: https://reviews.llvm.org/P8044

Thanks! sgpr-control-flow.ll updated, but there is also still the multilevel-break.ll failure. This is the only test that is failing even with this patch, and the reason for this is that I was not sure how to update it. We discussed this on Oct 11 here on Phabricator. Could you please take a look?

Test updates in need of review:

TODO:
CodeGen : BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

PARTIALLY DONE (still more tests in need of review):
CodeGen :

AArch64 : win64_vararg.ll, arm64-aapcs.ll, func-argpassing.ll (Martin Storsjo) swifterror.ll: small regression, needs approval (Martin Storsjo)
AMDGPU: (Tom Stellard) All reviewed, but multilevel-break.ll needs updating (fails with patch)

DONE:
CodeGen :

ARM (Eli Friedman)

I think the changes to the multilevel-break.ll test are OK. The register usage is increased by 2, but this won't have any impact on performance for this test. Here is a patch to make this test pass with your changes: https://reviews.llvm.org/P8046

The PowerPC CodeGen test changes are fine. I've also confirmed that this patch reduces the total number of register move instructions (which implement the copies). So that LGTM.

This revision is now accepted and ready to land.Nov 8 2017, 10:55 AM

@Tom:

I think the changes to the multilevel-break.ll test are OK. The register usage is increased by 2, but this won't have any impact on performance for this test. Here is a patch to make this test pass with your changes: https://reviews.llv\

m.org/P8046
Applied. Thanks!

@Nemanja:

The PowerPC CodeGen test changes are fine. I've also confirmed that this patch reduces the total number of register move instructions (which implement the copies). So that LGTM.

Thanks, nice to hear :-)

Regression? : test/DebugInfo/X86/live-debug-variables.ll

Test updates in need of review:

TODO:
CodeGen   : BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

PARTIALLY DONE :
 CodeGen :
  - AArch64: All done but swifterror.ll (small regression and needs approval according to Martin Storsjo)

DONE:
 CodeGen :
  - ARM (Eli Friedman)
  - AMDGPU (Tom Stellard)
  - PPC (nemanjai)

Ping!

Three backends now done (with confirmed reductions in number of COPYs after regalloc). ninja check passes with patch (all failing tests updated).

Please, could someone from each backend listed below take a look at the test changes (already updated).

TODO:
CodeGen   : BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

PARTIALLY DONE :
 CodeGen :
  - AArch64: All done but swifterror.ll (small regression and needs approval according to Martin Storsjo)

APPROVED:
 CodeGen :
  - ARM (Eli Friedman)
  - AMDGPU (Tom Stellard)
  - PPC (Nemanja Ivanovic)

(Regression? : test/DebugInfo/X86/live-debug-variables.ll)

@Quentin: I think the patch is starting to be somewhat convincing, as it is now also confirmed and approved on PowerPC, as well as has gotten approved test changes for AMDGPU and CodeGen/ARM. Given that it's now been two months of review, and that I updated all the tests already a month ago, I wonder what you would say about activating this for now under a target preference hook? I think that SystemZ, PowerPC and AMDGPU could benefit from this now, while other targets could do so after reviewing test changes. Would this work?

Status table corrected:

TODO:
CodeGen   : BPF Hexagon Mips SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

PARTIALLY DONE :
 CodeGen :
  - AArch64: All done but swifterror.ll (small regression and needs approval according to Martin Storsjo)

APPROVED:
 CodeGen :
  - CodeGen/ARM (Eli Friedman)
  - AMDGPU (Tom Stellard)
  - PowerPC (Nemanja Ivanovic)

(Regression? : test/DebugInfo/X86/live-debug-variables.ll)

Still LGTM :).
It makes sense to have it behind a hook for now, but could you follow-up with the targets owners so that it gets to be the default and we can get rid of the hook.
I believe this is general goodness and that we should use it for all targets going forward.

Patch updated to include a TargetRegisterInfo hook 'enableMultipleCopyHints()' that returns false per default, which gives NFC to trunk. OnlySystemZ returns true for now. It is up to all target maintainers to enable this and review test changes, which hopefully will be done
soon so that the hook can be removed again.

Some minor changes compared to last version in order to get the NFC to trunk when returning false in enableMultipleCopyHints():

reinstate the mri.isAllocatable() check for the hint, which was dropped for some reason (makes a small difference on the resulting weight in some rare cases).
If target has created a simple hint, it must be cleared before the best hint from CopyHints is inserted, so a new clearSimpleHint() method in MachineRegisterInfo was needed. Not sure that all those different methods for manipulating hints are really needed in the end once all targets have been converted. In particular, I am curious why AMDGPU sets those hints (of generic type) when they will be recomputed always here...
A new temporary variable CopyHint::HintOrder, in order to properly mimic the current behavior on trunk for NFC. This is because on trunk, the best hint is kept in the order of discovery, which is not what the CopyHints set was doing.

These hacks aren't very pretty, so PLEASE EVERYONE: HELP REMOVE THEM BY ENABLING THE MULTIPLE COPY HINTS!

@nemanjai: You approved this for Power PC, but one more test has changed now (licm-tocReg.ll), so please get the previous test changes from here on Phabricator (which apply fine) and review the new one, and then return true in enableMultipleCopyHints() once this is approved.

@tstellard, @efriedman: Thanks for helping reviewing the tests on your targets. I have not enabled those targets as I have not checked on those tests on latest trunk. I hope you can download those tests from here on Phabricator and make sure eveything's green as you enable this.

Please let me know if this is ok to commit.

jonpa requested review of this revision.Nov 28 2017, 6:16 AM

jonpa edited edge metadata.

ping!

The SystemZ changes LGTM. Given that Quentin approved the common part, this should be fine.

This revision is now accepted and ready to land.Dec 5 2017, 2:27 AM

Commited as r319754.

I will not close this review since the next step is to remove the temporary enableMultipleCopyHints() hook, after everyone is returning true.

Tom, Nemanja, Martin, Eli... I hope you will try this soon...

Targets that implement getRegAllocationHints() (ARM), might want to review that implementation in regards to the new common-code...

This has been committed, but is still only enabled for SystemZ.

I thought we might try to make progress on enabling this on all targets, so I decided to now do one target at a time.

First out is PowerPC:

CodeGen/PowerPC/licm-tocReg.ll: Updated. Seems to use one instruction (mr) less.
load-two-flts.ll: Updated: Smaller immediate offsets should be better, I hope.
ppc64-byval-align.ll: Updated: One move less, it seems.
select-i1-vs-i1.ll: Updated: a few functions a bit different. If-conversion?

Nemanja?

Ping!

Could anyone on PowerPC confirm these test changes and perhaps also that number of COPYs are decreasing overall, please.

jonpa requested review of this revision.Jan 18 2018, 9:03 AM

In D38128#980403, @jonpa wrote:

Ping!

Could anyone on PowerPC confirm these test changes and perhaps also that number of COPYs are decreasing overall, please.

Sorry about the delay. The test case changes look neutral or like improvements (slect-i1-vs-i1.ll looks like an improvement). I'm running it now to see the overall impact on the number of register copies. I'll report back soon.

This also reduces the overall number of register moves slightly. LGTM

This revision is now accepted and ready to land.Jan 29 2018, 10:36 AM

Thanks Nemanja, this is now enabled (r323858) also for PowerPC :-)

A 'ping' to the other targets that I updated tests for a while back ago - has anyone made any progress?

Thank you Jonas. As this no longer needs PPC input, I'm resigning from the review.

Let's try to enable this for AArch64, next.

I have reapplied the test changes from earlier. Martin, does this look OK?

Yes, this looks like mostly no-op reorderings. I don't remember exactly what part I thought was a potential performance regression in swifterror.ll before - it moves one "mov" instruction (in two places) to before a branch, so potentially executing one instruction more than before. I would say it's most probably ok (and the gains in some of the other register shuffling tests would make it sound like a net gain in any case).

I'm in no way authoritative for this though, so perhaps @aemerson or @qcolombet would like to give such a stamp of approval?

So with the absolutely minimal performance regression in swifterror.ll and lack of response otherwise, I think I'm ok with giving it the LGTM for aarch64.

This revision is now accepted and ready to land.Feb 9 2018, 12:44 AM

In D38128#1002953, @mstorsjo wrote:

So with the absolutely minimal performance regression in swifterror.ll and lack of response otherwise, I think I'm ok with giving it the LGTM for aarch64.

Thank you, Martin. Committed as r324720.

mstorsjo resigned from this revision.Feb 9 2018, 1:38 AM

This revision now requires review to proceed.Feb 9 2018, 1:38 AM

Sorry I didn't see this, I need to fix my email filters.

This revision is now accepted and ready to land.Feb 9 2018, 2:39 AM

aemerson closed this revision.Feb 9 2018, 2:39 AM

Enabled for ARM. Mostly same test updates as previously (still passing), plus a few more.

Somebody take a look, please. Eli?

jonpa reopened this revision.Feb 9 2018, 2:40 AM

This revision is now accepted and ready to land.Feb 9 2018, 2:40 AM

jonpa requested review of this revision.Feb 9 2018, 2:41 AM

Ping!

ARM test updates are waiting for approval.

ARM test changes LGTM

This revision is now accepted and ready to land.Feb 15 2018, 6:05 PM

In D38128#1009606, @efriedma wrote:

ARM test changes LGTM

Thanks, Eli.

r325327.

Enabled for AMDGPU, with just a few minor test updates.

@tstellar : This looks to be a subset of the test changes you previously approved, so I am hoping they still look ok?

[AMDGPU] Just a few test updates in need of approval.

Changes look non-essential to me.

This revision is now accepted and ready to land.Feb 16 2018, 8:13 AM

In D38128#1010271, @rampitec wrote:

Changes look non-essential to me.

Thank you, Stanislav. I committed this even though a last-minute change slipped in (diff from approved changes):

--- a/test/CodeGen/AMDGPU/ret.ll
+++ b/test/CodeGen/AMDGPU/ret.ll
@@ -178,8 +178,8 @@ bb:
 }

 ; GCN-LABEL: {{^}}sgpr:
-; GCN-DAG: s_add_i32 s0, s3, 2
-; GCN-DAG: s_mov_b32 s2, s3
+; GCN: s_mov_b32 s2, s3
+; GCN: s_add_i32 s0, s2, 2
 ; GCN-NOT: s_endpgm
 define amdgpu_vs { i32, i32, i32 } @sgpr([9 x <16 x i8>] addrspace(4)* byval %arg, i32 inreg %arg1, i32 inreg %arg2, float %arg3) #0 {
 bb:

This looks harmless to me - the add is using s2 instead of s3, directly after the move of s3 to s2, right? Please take a look.

Committed as r325425.

[BPF] multiple copy hints enabled. Tests updated and passing, please review.

[BPF] Please review updated tests.

In D38128#1011287, @jonpa wrote:
In D38128#1010271, @rampitec wrote:

Changes look non-essential to me.

Thank you, Stanislav. I committed this even though a last-minute change slipped in (diff from approved changes):
--- a/test/CodeGen/AMDGPU/ret.ll
+++ b/test/CodeGen/AMDGPU/ret.ll
@@ -178,8 +178,8 @@ bb:
 }

 ; GCN-LABEL: {{^}}sgpr:
-; GCN-DAG: s_add_i32 s0, s3, 2
-; GCN-DAG: s_mov_b32 s2, s3
+; GCN: s_mov_b32 s2, s3
+; GCN: s_add_i32 s0, s2, 2
 ; GCN-NOT: s_endpgm
 define amdgpu_vs { i32, i32, i32 } @sgpr([9 x <16 x i8>] addrspace(4)* byval %arg, i32 inreg %arg1, i32 inreg %arg2, float %arg3) #0 {
 bb:
This looks harmless to me - the add is using s2 instead of s3, directly after the move of s3 to s2, right? Please take a look.

Committed as r325425.

This change in mov/add does not affect anything, thanks.

(Yonghong Song)
I checked BFP, yes. The patch looks good to me and you can add my ACK and push in.

BFP enabled by r325457.

[Hexagon] Enabled with updated tests.

Please take a look at the test updates.

The Hexagon changes look ok.

This revision is now accepted and ready to land.Feb 21 2018, 7:52 AM

In D38128#1014574, @kparzysz wrote:

The Hexagon changes look ok.

Thank you.
r325697

[Mips] Multiple regalloc hints enabled with updated tests.

This revision now requires review to proceed.Feb 22 2018, 1:25 AM

sdardis mentioned this in rL325770: [mips] Regenerate tests for D38128 (NFC).Feb 22 2018, 3:56 AM

Hi Jonas, I've regenerated the tests in rL325770 and the changes seem ok. The main difference I'm seeing is some moves are being eliminated, but in a few cases in test/CodeGen/Mips/analyzebranch.ll the earlier move prevents a delay slot being filled but otherwise looks ok.

https://reviews.llvm.org/differential/diff/135383 is my local copy of this patch against the regenerated tests.

This revision is now accepted and ready to land.Feb 22 2018, 4:57 AM

sdardis patch applied.

In D38128#1015672, @sdardis wrote:

Hi Jonas, I've regenerated the tests in rL325770 and the changes seem ok. The main difference I'm seeing is some moves are being eliminated, but in a few cases in test/CodeGen/Mips/analyzebranch.ll the earlier move prevents a delay slot being filled but otherwise looks ok.

https://reviews.llvm.org/differential/diff/135383 is my local copy of this patch against the regenerated tests.

Thank you, Simon. I applied your patch with regenerated tests and committed it as r325870.

[SPARC] Patch enabled - please review the updated tests.

This revision now requires review to proceed.Feb 23 2018, 1:00 AM

Sparc changes look reasonable.

This revision is now accepted and ready to land.Feb 23 2018, 8:33 AM

In D38128#1017405, @jyknight wrote:

Sparc changes look reasonable.

Thank you, James. r326028.

[XCore] patch enabled - please review updated test and approve patch.

This revision now requires review to proceed.Feb 24 2018, 3:57 AM

please see in-line comments

test/CodeGen/XCore/byVal.ll
45–46 ↗	(On Diff #135790)	Hi jonpa, The output in test/CodeGen/XCore/byVal.ll is incorrect. viz the value in r1 is not copied into r0 before being overwritten by the value in sp[2] (r0's indirected value). These two lines should be the other way around. sorry robert

This revision now requires changes to proceed.Feb 25 2018, 9:51 AM

Test updated again.

jonpa marked an inline comment as done.Feb 25 2018, 11:58 PM

jonpa added inline comments.

test/CodeGen/XCore/byVal.ll
45–46 ↗	(On Diff #135790)	Ah, I think 'r1' slipped through in the place of 'r11'... Does it look ok now?

LGTM
thank you
robert

This revision is now accepted and ready to land.Feb 26 2018, 12:01 AM

In D38128#1018974, @robertlytton wrote:

LGTM
thank you
robert

Thank you, Robert. r326069

X86 is now the final backend to enable :-)

This patch enables X86 with regenerated tests in all (155) tests that previously had the "NOTE: : Assertions have been autogenerated..." comment. I have manually updated the remaining (28) ones.

Please review these test changes.

After this, we can finally get rid of the enableMultipleCopyHints() hook.

This revision now requires review to proceed.Feb 26 2018, 5:19 AM

thegameg added a subscriber: thegameg.Feb 26 2018, 7:29 AM

This looks like a nice improvement modulo a few issues:

In a number of places (notably register arguments to shift instructions) we generate movq instead of movl that's shorter and equivalent. I believe there's no real performance difference. Certainly we should for when compiling for size.
There's some unnecessary shuffling of register names in SSE4 tests.

Someone who's a more familiar with Debug should double check those tests though they seem fine to me.

test/CodeGen/X86/fast-isel-shift.ll
22 ↗	(On Diff #135887)	Why are we getting two kill comments about cx here?
test/CodeGen/X86/schedule-x86-64-shld.ll
129 ↗	(On Diff #135887)	This should be movl given optsize.
test/CodeGen/X86/sret-implicit.ll
13 ↗	(On Diff #135887)	This shouldn't be DAG matches anymore the 2nd line should always come first
27 ↗	(On Diff #135887)	Same as above
test/CodeGen/X86/vector-shift-ashr-128.ll
270 ↗	(On Diff #135887)	All of the SSE4 changes regarding shifts seem to generate unnecessary register shuffling.
test/CodeGen/X86/vectorcall.ll
26 ↗	(On Diff #135887)	another case of movq vs movl
152 ↗	(On Diff #135887)	This is probably fine, but confusing. Can you decompose this up into the X86 and X64? In fact, this file should probably be autogenerated with utils/update_llc_test_checks.py

RKSimon commandeered this revision.Apr 9 2018, 1:19 AM

RKSimon edited reviewers, added: jonpa; removed: RKSimon.

ping! Please don't forget to do this for X86, we still want to get rid of that temporary hook... :-)

RKSimon mentioned this in rL338262: [X86] Regenerate fast-isel tests..Jul 30 2018, 9:14 AM

RKSimon mentioned this in rL338264: [X86] Regenerate PKU test to merge 32/64-bit rdpkru checks.

RKSimon mentioned this in rL338265: [X86] Regenerate NOBMI/BMI combine-select tests..Jul 30 2018, 9:20 AM

Updated all x86 tests to trunk latest

Has anyone looked at @niravd's comments?

In D38128#1180526, @craig.topper wrote:

Has anyone looked at @niravd's comments?

I haven't - just updated all the tests

RKSimon added inline comments.Jul 30 2018, 11:50 AM

test/CodeGen/X86/vector-shift-ashr-128.ll
270 ↗	(On Diff #135887)	If I had to guess - this is probably due to SSE41's PBLENDV instructions being hardwired to use xmm0

RKSimon mentioned this in D52121: [X86] Fold (movmsk (setne (and X, (1 << C)), 0)) -> (movmsk (X << C)).Sep 14 2018, 3:10 PM

RKSimon mentioned this in D52109: [TwoAddressInstructionPass] Don't update SrcRegMap for copies inserted for tied register constraint when the src isn't killed.Sep 15 2018, 3:25 AM

Is this ready or any blockers?

rebased to trunk

In D38128#1235889, @xbolva00 wrote:

Is this ready or any blockers?

Other than the poor codegen issues (but not regressions) commented on by @niravd I don't think there are any blockers.

OK to commit?

The codesize issues are minor and shouldn't hold this patch up. The only blocker I see is the unnecessary data shuffling for SSE41 codegen which someone else should decide on.

In D38128#1238605, @RKSimon wrote:

OK to commit?

In D38128#1238797, @niravd wrote:

The codesize issues are minor and shouldn't hold this patch up. The only blocker I see is the unnecessary data shuffling for SSE41 codegen which someone else should decide on.

Is this just about the extra 'movdqa' in vector-shift-ashr-128.ll, or are there other diffs to look at?

I've gone through and marked all the places.

Is this just about the extra 'movdqa' in vector-shift-ashr-128.ll, or are there other diffs to look at?

test/CodeGen/X86/vector-shift-lshr-128.ll
232 ↗	(On Diff #165646)	Extra instruction here
test/CodeGen/X86/vselect-minmax.ll
4539 ↗	(On Diff #165646)	Extra instruction here.
4660 ↗	(On Diff #165646)	Extra instruction here.
4781 ↗	(On Diff #165646)	Extra Instruction here.
4901 ↗	(On Diff #165646)	Extra instruction here.
5019 ↗	(On Diff #165646)	Extra instructions here.
5170 ↗	(On Diff #165646)	Extra instructions here.
5319 ↗	(On Diff #165646)	Extra instructions here.
5468 ↗	(On Diff #165646)	Extra instructions here.
6995 ↗	(On Diff #165646)	Extra instructions here.
7116 ↗	(On Diff #165646)	Extra instruction here.
7235 ↗	(On Diff #165646)	Extra instruction here.
7356 ↗	(On Diff #165646)	Extra instructions here.
7505 ↗	(On Diff #165646)	Extra Instructions here.
7655 ↗	(On Diff #165646)	Extra instructions here.

In D38128#1239480, @niravd wrote:

I've gone through and marked all the places.

Is this just about the extra 'movdqa' in vector-shift-ashr-128.ll, or are there other diffs to look at?

These are all due to pblendvb/blendvpd/blendvps being hardwired to use the xmm0 as the mask register (limit goes away for avx)

In D38128#1239512, @RKSimon wrote:

In D38128#1239480, @niravd wrote:

I've gone through and marked all the places.

Is this just about the extra 'movdqa' in vector-shift-ashr-128.ll, or are there other diffs to look at?

These are all due to pblendvb/blendvpd/blendvps being hardwired to use the xmm0 as the mask register (limit goes away for avx)

Yep - thanks for marking all of those. That corner case shouldn't hold up the general improvement, and the number of customers specifically targeting SSE4.1 should go to zero over time, so LGTM.

This revision is now accepted and ready to land.Sep 19 2018, 9:41 AM

Closed by commit rL342578: [X86] Handle COPYs of physregs better (regalloc hints) (authored by RKSimon). · Explain WhySep 19 2018, 12:01 PM

This revision was automatically updated to reflect the committed changes.