This is an archive of the discontinued LLVM Phabricator instance.

Handle COPYs of physregs better (regalloc hints)
ClosedPublic

Authored by RKSimon on Sep 21 2017, 4:29 AM.

Details

Summary

While enabling the mischeduler for SystemZ, it was discovered that fore some reason a test needed one extra seemingly needless COPY (test/CodeGen/SystemZ/call-03.ll). The handling for that is resulted in this patch, which improves the register coalescing by providing not just one copy hint, but a sorted list of copy hints. On SystemZ, this gives ~12500 less register moves on SPEC, as well as marginally less spilling.

Instead of improving just the SystemZ backend, the improvement has been implemented in common-code (calculateSpillWeightAndHint(). This gives a lot of test failures, but since this should be a general improvement I hope that the involved targets will help and review the test updates.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Not sure if it is clear that coalescing with %R8 is generally better than %R0.

What do you mean, it isn't clear? Is the performance problem not clear? Or do you mean you're not sure how to detect this situation when you're sorting the hints?

I guess given your initial wording ("not great"), I was not sure if this is general and serious enough so that we really want to add an additional heuristic like "COPY to compare" on top of the sorting by weight. I suppose then there should be a flag like HasCompareUser which is then the tie-breaker when the weight is the same, or?

I don't believe there's anything special about compare here, this applies the same to any other use. In general, if you have something like "COPY dest, src" followed by "use dest", it is nearly always better to replace the second instruction by "use src" if this is possible (i.e. if src is still live and unchanged at that point). This is simply because on most processor architectures "COPY dest, src / use src" can execute in parallel, while "COPY dest, src / use dest" must execute sequentially.

jonpa added a comment.Oct 24 2017, 3:23 AM

Not sure if it is clear that coalescing with %R8 is generally better than %R0.

What do you mean, it isn't clear? Is the performance problem not clear? Or do you mean you're not sure how to detect this situation when you're sorting the hints?

I guess given your initial wording ("not great"), I was not sure if this is general and serious enough so that we really want to add an additional heuristic like "COPY to compare" on top of the sorting by weight. I suppose then there should be a flag like HasCompareUser which is then the tie-breaker when the weight is the same, or?

I don't believe there's anything special about compare here, this applies the same to any other use. In general, if you have something like "COPY dest, src" followed by "use dest", it is nearly always better to replace the second instruction by "use src" if this is possible (i.e. if src is still live and unchanged at that point). This is simply because on most processor architectures "COPY dest, src / use src" can execute in parallel, while "COPY dest, src / use dest" must execute sequentially.

That makes sense.

Would any of these be an improvement worth trying, then?

(1) If the source phys-reg is contained in the regclass of an immediately following instruction using the vreg, then increase the priority of the hint of that phys-reg.
(2) A simpler alternative would be to simply prefer phys-reg sources more than phys-reg dest-regs (if weight is equal). That would catch all the cases of (1).

Would any of these be an improvement worth trying, then?

(1) If the source phys-reg is contained in the regclass of an immediately following instruction using the vreg, then increase the priority of the hint of that phys-reg.
(2) A simpler alternative would be to simply prefer phys-reg sources more than phys-reg dest-regs (if weight is equal). That would catch all the cases of (1).

I believe (2) makes sense ... certainly worth a try.

jonpa added a comment.Oct 24 2017, 6:44 AM

Would any of these be an improvement worth trying, then?

(1) If the source phys-reg is contained in the regclass of an immediately following instruction using the vreg, then increase the priority of the hint of that phys-reg.
(2) A simpler alternative would be to simply prefer phys-reg sources more than phys-reg dest-regs (if weight is equal). That would catch all the cases of (1).

I believe (2) makes sense ... certainly worth a try.

I see that on SPEC/z13, I now get a few more copys left. Compared to master:

lgr            :               341784               328956   -12828
lr             :                26208                25803     -405
...
Spill|Reload   :               165135               164761     -374

-> with prefer phys reg source COPYs:

lgr            :               341784               329060   -12724
lr             :                26208                25817     -391
...
Spill|Reload   :               165135               164934     -201

I also see looking at just the loops that while 120 innner loops have changed size, 103 of them has gotten
bigger, which seems to indicate that at least currently (Wei Mi has an upcoming regsplit patch that may
change this), this idea is not quite promising.

It did however handle the test case under discussion (ARM/swifterror.ll)

Reversing this heuristic so that phys-reg dest copys are prioritized:

lgr            :               341784               328907   -12877
lr             :                26208                25762     -446
...
Spill|Reload   :               165135               164807     -328

For the innermost loops, there are only 34 different (15 smaller).

I am not sure it is right to tamper with this in this way given the above...

Right. Always coalescing with the copy source of course extends the live range of the source, and thus increases overall register pressure. So it's not a good idea to do this unconditionally.

On the other hand, in the example that triggered this discussion, there was no register pressure problem and source and destination register just happened to be both still live at the point of use, anyway. In this specific case, it would be preferable to use the source instead of the dest register at the point of use. On the other hand, if keeping the source alive would cause extra spilling, we don't want that, but want to use the dest after all.

Not sure how exactly to handle this in LLVM. It is probably not simply a matter of preference in the register coalescer.

jonpa added a comment.Oct 26 2017, 9:09 AM

Eli, do you have any comment on this?

What specifically do you want me to comment on? I don't really understand the register allocator well enough to comment on your code changes.

jonpa added a comment.Oct 27 2017, 2:53 AM

What specifically do you want me to comment on? I don't really understand the register allocator well enough to comment on your code changes.

Given that this is something that Quentin has been meaning to fix before I did, and that this is *generally* good, the question is now if we can accept minor issues like this. On SystemZ, I think this is quite clear since we get 12500 less COPYs and also less spilling (on SPEC). Could you perhaps try it on your target and see if you likewise would agree this is good enough despite the swifterror test issue?

Also, it could be that other tests with a similar code actually now improve, since this was a random result. So just one regression alone should not have to stop the patch, unless it is a sign of a more general issue.

It's a minor issue; I'm okay with committing this as-is and trying to address the issue later.

jonpa updated this revision to Diff 120782.Oct 30 2017, 2:14 AM

It's a minor issue; I'm okay with committing this as-is and trying to address the issue later.

All right - I take this to mean that you approve of the ARM test changes, then. Thanks.

PING!

Patch updated with a few X86 tests updated. Again, someone from each target with the right experience should review the test changes I have made:

TODO:
CodeGen : AArch64 AMDGPU BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

DONE:
CodeGen : ARM
DebugInfo :

ping @tstellar: CodeGen/AMDGPU/multilevel-break.ll

TODO:
CodeGen : AArch64 AMDGPU BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore

AArch64/win64_vararg.ll is still ok.
arm64-aapcs.ll and func-argpassing.ll also seem fine.

swifterror.ll seems to get a very minor regression, not sure who to get an authoritative approval from though.

jonpa updated this revision to Diff 121718.Nov 6 2017, 4:41 AM

PING!

@Quentin, I implemented this in common code because you had already plans for this... It's been weeks now since I updated all the tests manually, which was not a small effort. Still, there is very little progress on the review of the tests :-/ Do you know who I should add as reviewers? (BTW patch slightly changed since you approved it - still ok, I hope?)

@Tom, I hope you can find the time to look into the AMDGPU tests as I thought you meant to do...

Review of test changes:

TODO:
CodeGen : AMDGPU BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

PARTIALLY DONE (still more tests in need of review):
CodeGen :

  • AArch64: win64_vararg.ll, arm64-aapcs.ll, func-argpassing.ll (Martin Storsjo) swifterror.ll small regression / needs approval (Martin Storsjo)

DONE:
CodeGen :

  • ARM (Eli Friedman)

Just looking at the number and placement of reg-reg moves for PowerPC seems fine. It doesn't seem like there are regressions but no huge reduction in the number of copies. However, since the diff is without context, it's hard to tell exactly what is going on in the test cases. I'll apply this patch to ToT and have a look at how things change before signing off on the PPC changes.

The AMDGPU tests look OK, but the CHECK lines for test/CodeGen/AMDGPU/sgpr-control-flow.ll should be updated like this:
https://reviews.llvm.org/P8044

jonpa updated this revision to Diff 121843.Nov 7 2017, 12:46 AM

@Nemanja:

Just looking at the number and placement of reg-reg moves for PowerPC seems fine. It doesn't seem like there are regressions but no huge reduction in the number of copies. However, since the diff is without context, it's hard to tell exactly what is going on in the test cases. I'll apply this patch to ToT and have a look at how things change before signing off on the PPC changes.

Sorry about forgetting the context - fixed.

@Tom:

The AMDGPU tests look OK, but the CHECK lines for test/CodeGen/AMDGPU/sgpr-control-flow.ll should be updated like this: https://reviews.llvm.org/P8044

Thanks! sgpr-control-flow.ll updated, but there is also still the multilevel-break.ll failure. This is the only test that is failing even with this patch, and the reason for this is that I was not sure how to update it. We discussed this on Oct 11 here on Phabricator. Could you please take a look?

Test updates in need of review:

TODO:
CodeGen : BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

PARTIALLY DONE (still more tests in need of review):
CodeGen :

  • AArch64 : win64_vararg.ll, arm64-aapcs.ll, func-argpassing.ll (Martin Storsjo) swifterror.ll: small regression, needs approval (Martin Storsjo)
  • AMDGPU: (Tom Stellard) All reviewed, but multilevel-break.ll needs updating (fails with patch)

DONE:
CodeGen :

  • ARM (Eli Friedman)

I think the changes to the multilevel-break.ll test are OK. The register usage is increased by 2, but this won't have any impact on performance for this test. Here is a patch to make this test pass with your changes: https://reviews.llvm.org/P8046

nemanjai accepted this revision.Nov 8 2017, 10:55 AM

The PowerPC CodeGen test changes are fine. I've also confirmed that this patch reduces the total number of register move instructions (which implement the copies). So that LGTM.

This revision is now accepted and ready to land.Nov 8 2017, 10:55 AM
jonpa updated this revision to Diff 122212.Nov 9 2017, 2:31 AM

@Tom:

I think the changes to the multilevel-break.ll test are OK. The register usage is increased by 2, but this won't have any impact on performance for this test. Here is a patch to make this test pass with your changes: https://reviews.llv\

m.org/P8046
Applied. Thanks!

@Nemanja:

The PowerPC CodeGen test changes are fine. I've also confirmed that this patch reduces the total number of register move instructions (which implement the copies). So that LGTM.

Thanks, nice to hear :-)

Regression? : test/DebugInfo/X86/live-debug-variables.ll

Test updates in need of review:

TODO:
CodeGen   : BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

PARTIALLY DONE :
 CodeGen :
  - AArch64: All done but swifterror.ll (small regression and needs approval according to Martin Storsjo)

DONE:
 CodeGen :
  - ARM (Eli Friedman)
  - AMDGPU (Tom Stellard)
  - PPC (nemanjai)
jonpa updated this revision to Diff 122823.Nov 14 2017, 4:25 AM

Ping!

Three backends now done (with confirmed reductions in number of COPYs after regalloc). ninja check passes with patch (all failing tests updated).

Please, could someone from each backend listed below take a look at the test changes (already updated).

TODO:
CodeGen   : BPF Hexagon Mips PowerPC SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

PARTIALLY DONE :
 CodeGen :
  - AArch64: All done but swifterror.ll (small regression and needs approval according to Martin Storsjo)

APPROVED:
 CodeGen :
  - ARM (Eli Friedman)
  - AMDGPU (Tom Stellard)
  - PPC (Nemanja Ivanovic)

(Regression? : test/DebugInfo/X86/live-debug-variables.ll)
jonpa added a comment.Nov 15 2017, 6:32 AM

@Quentin: I think the patch is starting to be somewhat convincing, as it is now also confirmed and approved on PowerPC, as well as has gotten approved test changes for AMDGPU and CodeGen/ARM. Given that it's now been two months of review, and that I updated all the tests already a month ago, I wonder what you would say about activating this for now under a target preference hook? I think that SystemZ, PowerPC and AMDGPU could benefit from this now, while other targets could do so after reviewing test changes. Would this work?

Status table corrected:

TODO:
CodeGen   : BPF Hexagon Mips SPARC Thumb Thumb2 X86 XCore
DebugInfo : COFF X86

PARTIALLY DONE :
 CodeGen :
  - AArch64: All done but swifterror.ll (small regression and needs approval according to Martin Storsjo)

APPROVED:
 CodeGen :
  - CodeGen/ARM (Eli Friedman)
  - AMDGPU (Tom Stellard)
  - PowerPC (Nemanja Ivanovic)

(Regression? : test/DebugInfo/X86/live-debug-variables.ll)
qcolombet accepted this revision.Nov 17 2017, 10:17 AM

Still LGTM :).
It makes sense to have it behind a hook for now, but could you follow-up with the targets owners so that it gets to be the default and we can get rid of the hook.
I believe this is general goodness and that we should use it for all targets going forward.

jonpa updated this revision to Diff 124559.Nov 28 2017, 6:15 AM

Patch updated to include a TargetRegisterInfo hook 'enableMultipleCopyHints()' that returns false per default, which gives NFC to trunk. OnlySystemZ returns true for now. It is up to all target maintainers to enable this and review test changes, which hopefully will be done
soon so that the hook can be removed again.

Some minor changes compared to last version in order to get the NFC to trunk when returning false in enableMultipleCopyHints():

  • reinstate the mri.isAllocatable() check for the hint, which was dropped for some reason (makes a small difference on the resulting weight in some rare cases).
  • If target has created a simple hint, it must be cleared before the best hint from CopyHints is inserted, so a new clearSimpleHint() method in MachineRegisterInfo was needed. Not sure that all those different methods for manipulating hints are really needed in the end once all targets have been converted. In particular, I am curious why AMDGPU sets those hints (of generic type) when they will be recomputed always here...
  • A new temporary variable CopyHint::HintOrder, in order to properly mimic the current behavior on trunk for NFC. This is because on trunk, the best hint is kept in the order of discovery, which is not what the CopyHints set was doing.

These hacks aren't very pretty, so PLEASE EVERYONE: HELP REMOVE THEM BY ENABLING THE MULTIPLE COPY HINTS!

@nemanjai: You approved this for Power PC, but one more test has changed now (licm-tocReg.ll), so please get the previous test changes from here on Phabricator (which apply fine) and review the new one, and then return true in enableMultipleCopyHints() once this is approved.

@tstellard, @efriedman: Thanks for helping reviewing the tests on your targets. I have not enabled those targets as I have not checked on those tests on latest trunk. I hope you can download those tests from here on Phabricator and make sure eveything's green as you enable this.

Please let me know if this is ok to commit.

jonpa requested review of this revision.Nov 28 2017, 6:16 AM
jonpa edited edge metadata.
uweigand accepted this revision.Dec 5 2017, 2:27 AM

The SystemZ changes LGTM. Given that Quentin approved the common part, this should be fine.

This revision is now accepted and ready to land.Dec 5 2017, 2:27 AM
jonpa added a comment.Dec 5 2017, 2:58 AM

Commited as r319754.

I will not close this review since the next step is to remove the temporary enableMultipleCopyHints() hook, after everyone is returning true.

Tom, Nemanja, Martin, Eli... I hope you will try this soon...

Targets that implement getRegAllocationHints() (ARM), might want to review that implementation in regards to the new common-code...

jonpa updated this revision to Diff 129076.Jan 9 2018, 6:56 AM

This has been committed, but is still only enabled for SystemZ.

I thought we might try to make progress on enabling this on all targets, so I decided to now do one target at a time.

First out is PowerPC:

CodeGen/PowerPC/licm-tocReg.ll: Updated. Seems to use one instruction (mr) less.
load-two-flts.ll: Updated: Smaller immediate offsets should be better, I hope.
ppc64-byval-align.ll: Updated: One move less, it seems.
select-i1-vs-i1.ll: Updated: a few functions a bit different. If-conversion?

Nemanja?

jonpa added a comment.Jan 18 2018, 9:03 AM

Ping!

Could anyone on PowerPC confirm these test changes and perhaps also that number of COPYs are decreasing overall, please.

jonpa requested review of this revision.Jan 18 2018, 9:03 AM

Ping!

Could anyone on PowerPC confirm these test changes and perhaps also that number of COPYs are decreasing overall, please.

Sorry about the delay. The test case changes look neutral or like improvements (slect-i1-vs-i1.ll looks like an improvement). I'm running it now to see the overall impact on the number of register copies. I'll report back soon.

nemanjai accepted this revision.Jan 29 2018, 10:36 AM

This also reduces the overall number of register moves slightly. LGTM

This revision is now accepted and ready to land.Jan 29 2018, 10:36 AM
jonpa requested review of this revision.Jan 31 2018, 1:32 AM

Thanks Nemanja, this is now enabled (r323858) also for PowerPC :-)

A 'ping' to the other targets that I updated tests for a while back ago - has anyone made any progress?

nemanjai resigned from this revision.Feb 2 2018, 6:05 AM

Thank you Jonas. As this no longer needs PPC input, I'm resigning from the review.

jonpa updated this revision to Diff 132814.Feb 5 2018, 5:42 AM

Let's try to enable this for AArch64, next.

I have reapplied the test changes from earlier. Martin, does this look OK?

Yes, this looks like mostly no-op reorderings. I don't remember exactly what part I thought was a potential performance regression in swifterror.ll before - it moves one "mov" instruction (in two places) to before a branch, so potentially executing one instruction more than before. I would say it's most probably ok (and the gains in some of the other register shuffling tests would make it sound like a net gain in any case).

I'm in no way authoritative for this though, so perhaps @aemerson or @qcolombet would like to give such a stamp of approval?

mstorsjo accepted this revision.Feb 9 2018, 12:44 AM

So with the absolutely minimal performance regression in swifterror.ll and lack of response otherwise, I think I'm ok with giving it the LGTM for aarch64.

This revision is now accepted and ready to land.Feb 9 2018, 12:44 AM
jonpa added a comment.Feb 9 2018, 1:29 AM

So with the absolutely minimal performance regression in swifterror.ll and lack of response otherwise, I think I'm ok with giving it the LGTM for aarch64.

Thank you, Martin. Committed as r324720.

mstorsjo resigned from this revision.Feb 9 2018, 1:38 AM
This revision now requires review to proceed.Feb 9 2018, 1:38 AM
aemerson accepted this revision.Feb 9 2018, 2:39 AM

Sorry I didn't see this, I need to fix my email filters.

This revision is now accepted and ready to land.Feb 9 2018, 2:39 AM
aemerson closed this revision.Feb 9 2018, 2:39 AM
jonpa updated this revision to Diff 133575.Feb 9 2018, 2:40 AM

Enabled for ARM. Mostly same test updates as previously (still passing), plus a few more.

Somebody take a look, please. Eli?

jonpa reopened this revision.Feb 9 2018, 2:40 AM
This revision is now accepted and ready to land.Feb 9 2018, 2:40 AM
jonpa requested review of this revision.Feb 9 2018, 2:41 AM
jonpa added a comment.Feb 15 2018, 2:42 AM

Ping!

ARM test updates are waiting for approval.

efriedma accepted this revision.Feb 15 2018, 6:05 PM

ARM test changes LGTM

This revision is now accepted and ready to land.Feb 15 2018, 6:05 PM
jonpa added a comment.Feb 16 2018, 1:57 AM

ARM test changes LGTM

Thanks, Eli.

r325327.

jonpa updated this revision to Diff 134581.Feb 16 2018, 2:31 AM

Enabled for AMDGPU, with just a few minor test updates.

@tstellar : This looks to be a subset of the test changes you previously approved, so I am hoping they still look ok?

jonpa requested review of this revision.Feb 16 2018, 2:32 AM

[AMDGPU] Just a few test updates in need of approval.

rampitec accepted this revision.Feb 16 2018, 8:13 AM

Changes look non-essential to me.

This revision is now accepted and ready to land.Feb 16 2018, 8:13 AM
jonpa added a comment.Feb 17 2018, 2:06 AM

Changes look non-essential to me.

Thank you, Stanislav. I committed this even though a last-minute change slipped in (diff from approved changes):

--- a/test/CodeGen/AMDGPU/ret.ll
+++ b/test/CodeGen/AMDGPU/ret.ll
@@ -178,8 +178,8 @@ bb:
 }

 ; GCN-LABEL: {{^}}sgpr:
-; GCN-DAG: s_add_i32 s0, s3, 2
-; GCN-DAG: s_mov_b32 s2, s3
+; GCN: s_mov_b32 s2, s3
+; GCN: s_add_i32 s0, s2, 2
 ; GCN-NOT: s_endpgm
 define amdgpu_vs { i32, i32, i32 } @sgpr([9 x <16 x i8>] addrspace(4)* byval %arg, i32 inreg %arg1, i32 inreg %arg2, float %arg3) #0 {
 bb:

This looks harmless to me - the add is using s2 instead of s3, directly after the move of s3 to s2, right? Please take a look.

Committed as r325425.

jonpa updated this revision to Diff 134792.Feb 17 2018, 3:05 AM
jonpa removed subscribers: nemanjai, efriedma.

[BPF] multiple copy hints enabled. Tests updated and passing, please review.

jonpa requested review of this revision.Feb 17 2018, 3:06 AM

[BPF] Please review updated tests.

Changes look non-essential to me.

Thank you, Stanislav. I committed this even though a last-minute change slipped in (diff from approved changes):

--- a/test/CodeGen/AMDGPU/ret.ll
+++ b/test/CodeGen/AMDGPU/ret.ll
@@ -178,8 +178,8 @@ bb:
 }

 ; GCN-LABEL: {{^}}sgpr:
-; GCN-DAG: s_add_i32 s0, s3, 2
-; GCN-DAG: s_mov_b32 s2, s3
+; GCN: s_mov_b32 s2, s3
+; GCN: s_add_i32 s0, s2, 2
 ; GCN-NOT: s_endpgm
 define amdgpu_vs { i32, i32, i32 } @sgpr([9 x <16 x i8>] addrspace(4)* byval %arg, i32 inreg %arg1, i32 inreg %arg2, float %arg3) #0 {
 bb:

This looks harmless to me - the add is using s2 instead of s3, directly after the move of s3 to s2, right? Please take a look.

Committed as r325425.

This change in mov/add does not affect anything, thanks.

jonpa added a comment.Feb 18 2018, 2:16 AM

(Yonghong Song)
I checked BFP, yes. The patch looks good to me and you can add my ACK and push in.

BFP enabled by r325457.

jonpa updated this revision to Diff 134827.Feb 18 2018, 2:35 AM
jonpa added a reviewer: kparzysz.

[Hexagon] Enabled with updated tests.

Please take a look at the test updates.

kparzysz accepted this revision.Feb 21 2018, 7:52 AM

The Hexagon changes look ok.

This revision is now accepted and ready to land.Feb 21 2018, 7:52 AM
jonpa added a comment.Feb 21 2018, 8:40 AM

The Hexagon changes look ok.

Thank you.
r325697

jonpa updated this revision to Diff 135373.Feb 22 2018, 1:25 AM
jonpa edited reviewers, added: sdardis, atanasyan; removed: kparzysz.

[Mips] Multiple regalloc hints enabled with updated tests.

This revision now requires review to proceed.Feb 22 2018, 1:25 AM
sdardis accepted this revision.Feb 22 2018, 4:57 AM

Hi Jonas, I've regenerated the tests in rL325770 and the changes seem ok. The main difference I'm seeing is some moves are being eliminated, but in a few cases in test/CodeGen/Mips/analyzebranch.ll the earlier move prevents a delay slot being filled but otherwise looks ok.

https://reviews.llvm.org/differential/diff/135383 is my local copy of this patch against the regenerated tests.

This revision is now accepted and ready to land.Feb 22 2018, 4:57 AM
jonpa updated this revision to Diff 135383.Feb 23 2018, 12:21 AM

sdardis patch applied.

Hi Jonas, I've regenerated the tests in rL325770 and the changes seem ok. The main difference I'm seeing is some moves are being eliminated, but in a few cases in test/CodeGen/Mips/analyzebranch.ll the earlier move prevents a delay slot being filled but otherwise looks ok.

https://reviews.llvm.org/differential/diff/135383 is my local copy of this patch against the regenerated tests.

Thank you, Simon. I applied your patch with regenerated tests and committed it as r325870.

jonpa updated this revision to Diff 135599.Feb 23 2018, 1:00 AM
jonpa edited reviewers, added: JDevlieghere, fedor.sergeev; removed: sdardis, atanasyan.

[SPARC] Patch enabled - please review the updated tests.

This revision now requires review to proceed.Feb 23 2018, 1:00 AM
jyknight accepted this revision.Feb 23 2018, 8:33 AM

Sparc changes look reasonable.

This revision is now accepted and ready to land.Feb 23 2018, 8:33 AM

Sparc changes look reasonable.

Thank you, James. r326028.

jonpa updated this revision to Diff 135790.Feb 24 2018, 3:57 AM
jonpa edited reviewers, added: robertlytton; removed: JDevlieghere, fedor.sergeev, jyknight.

[XCore] patch enabled - please review updated test and approve patch.

This revision now requires review to proceed.Feb 24 2018, 3:57 AM
robertlytton requested changes to this revision.Feb 25 2018, 9:51 AM

please see in-line comments

test/CodeGen/XCore/byVal.ll
45–46 ↗(On Diff #135790)

Hi jonpa,
The output in test/CodeGen/XCore/byVal.ll is incorrect.
viz the value in r1 is not copied into r0 before being overwritten by the value in sp[2] (r0's indirected value).
These two lines should be the other way around.
sorry
robert

This revision now requires changes to proceed.Feb 25 2018, 9:51 AM
jonpa updated this revision to Diff 135853.Feb 25 2018, 11:57 PM

Test updated again.

jonpa marked an inline comment as done.Feb 25 2018, 11:58 PM
jonpa added inline comments.
test/CodeGen/XCore/byVal.ll
45–46 ↗(On Diff #135790)

Ah, I think 'r1' slipped through in the place of 'r11'... Does it look ok now?

robertlytton accepted this revision.Feb 26 2018, 12:01 AM

LGTM
thank you
robert

This revision is now accepted and ready to land.Feb 26 2018, 12:01 AM
jonpa marked an inline comment as done.Feb 26 2018, 12:09 AM

LGTM
thank you
robert

Thank you, Robert. r326069

jonpa updated this revision to Diff 135887.Feb 26 2018, 5:19 AM
jonpa edited reviewers, added: craig.topper; removed: robertlytton.

X86 is now the final backend to enable :-)

This patch enables X86 with regenerated tests in all (155) tests that previously had the "NOTE: : Assertions have been autogenerated..." comment. I have manually updated the remaining (28) ones.

Please review these test changes.

After this, we can finally get rid of the enableMultipleCopyHints() hook.

This revision now requires review to proceed.Feb 26 2018, 5:19 AM

This looks like a nice improvement modulo a few issues:

  • In a number of places (notably register arguments to shift instructions) we generate movq instead of movl that's shorter and equivalent. I believe there's no real performance difference. Certainly we should for when compiling for size.
  • There's some unnecessary shuffling of register names in SSE4 tests.

Someone who's a more familiar with Debug should double check those tests though they seem fine to me.

test/CodeGen/X86/fast-isel-shift.ll
25

Why are we getting two kill comments about cx here?

test/CodeGen/X86/schedule-x86-64-shld.ll
12–1

This should be movl given optsize.

test/CodeGen/X86/sret-implicit.ll
0–1

This shouldn't be DAG matches anymore the 2nd line should always come first

0–1

Same as above

test/CodeGen/X86/vector-shift-ashr-128.ll
1

All of the SSE4 changes regarding shifts seem to generate unnecessary register shuffling.

test/CodeGen/X86/vectorcall.ll
1

This is probably fine, but confusing.

Can you decompose this up into the X86 and X64? In fact, this file should probably be autogenerated with utils/update_llc_test_checks.py

2

another case of movq vs movl

RKSimon commandeered this revision.Apr 9 2018, 1:19 AM
RKSimon edited reviewers, added: jonpa; removed: RKSimon.
jonpa added a comment.Jul 24 2018, 4:26 AM

ping! Please don't forget to do this for X86, we still want to get rid of that temporary hook... :-)

RKSimon updated this revision to Diff 157993.Jul 30 2018, 10:17 AM
RKSimon added reviewers: efriedma, spatel.
RKSimon set the repository for this revision to rL LLVM.

Updated all x86 tests to trunk latest

Has anyone looked at @niravd's comments?

Has anyone looked at @niravd's comments?

I haven't - just updated all the tests

RKSimon added inline comments.Jul 30 2018, 11:50 AM
test/CodeGen/X86/vector-shift-ashr-128.ll
1

If I had to guess - this is probably due to SSE41's PBLENDV instructions being hardwired to use xmm0

Is this ready or any blockers?

RKSimon updated this revision to Diff 165646.Sep 15 2018, 10:04 AM

rebased to trunk

Is this ready or any blockers?

Other than the poor codegen issues (but not regressions) commented on by @niravd I don't think there are any blockers.

OK to commit?

The codesize issues are minor and shouldn't hold this patch up. The only blocker I see is the unnecessary data shuffling for SSE41 codegen which someone else should decide on.

OK to commit?

The codesize issues are minor and shouldn't hold this patch up. The only blocker I see is the unnecessary data shuffling for SSE41 codegen which someone else should decide on.

Is this just about the extra 'movdqa' in vector-shift-ashr-128.ll, or are there other diffs to look at?

I've gone through and marked all the places.

Is this just about the extra 'movdqa' in vector-shift-ashr-128.ll, or are there other diffs to look at?

test/CodeGen/X86/vector-shift-lshr-128.ll
232

Extra instruction here

test/CodeGen/X86/vselect-minmax.ll
4539

Extra instruction here.

4660

Extra instruction here.

4781

Extra Instruction here.

4901

Extra instruction here.

5019–5020

Extra instructions here.

5170–5176

Extra instructions here.

5319–5320

Extra instructions here.

5468–5469

Extra instructions here.

6995

Extra instructions here.

7116–7117

Extra instruction here.

7235

Extra instruction here.

7356–7357

Extra instructions here.

7505–7506

Extra Instructions here.

7655–7656

Extra instructions here.

I've gone through and marked all the places.

Is this just about the extra 'movdqa' in vector-shift-ashr-128.ll, or are there other diffs to look at?

These are all due to pblendvb/blendvpd/blendvps being hardwired to use the xmm0 as the mask register (limit goes away for avx)

spatel accepted this revision.Sep 19 2018, 9:41 AM

I've gone through and marked all the places.

Is this just about the extra 'movdqa' in vector-shift-ashr-128.ll, or are there other diffs to look at?

These are all due to pblendvb/blendvpd/blendvps being hardwired to use the xmm0 as the mask register (limit goes away for avx)

Yep - thanks for marking all of those. That corner case shouldn't hold up the general improvement, and the number of customers specifically targeting SSE4.1 should go to zero over time, so LGTM.

This revision is now accepted and ready to land.Sep 19 2018, 9:41 AM
This revision was automatically updated to reflect the committed changes.