This is an archive of the discontinued LLVM Phabricator instance.

performance improvements for ThunderX2 T99
Abandoned (Public)

Authored by steleman on Nov 17 2017, 7:14 AM.

Details

Summary

This changeset improves performance for the Cavium ThunderX2 T99 micro-arch. The changeset is specific to the T99 and does not affect any other micro-arch.

Tested with SPECcpu2017 and libquantum.

As an example, for the performance gains on libquantum, please see here:

https://docs.google.com/spreadsheets/d/1Lo1o2E1NjrpkwS7DvYYWsiVvPdd93h7KBaqeptMrZPY/edit?usp=sharing

Diff Detail

Repository
rL LLVM

Event Timeline

steleman created this revision. Nov 17 2017, 7:14 AM
mgrang edited reviewers, added: t.p.northover; removed: llvm-commits. Nov 17 2017, 11:56 AM
mgrang added a subscriber: llvm-commits.
mgrang added a subscriber: mgrang.

Could you please add a few tests? Also please remember to add llvm-commits as subscriber to your patches.

Could you please add a few tests? Also please remember to add llvm-commits as subscriber to your patches.

I'm not sure how I can add tests for this change. There is, in fact, no functional change.

davide added a subscriber: davide. Nov 17 2017, 4:01 PM

I'm under the impression this is not NFC.
Presumably, the changes in the scheduler reflect in a different schedule being produced.
You can then check the IR (or MIR) with the new schedule.
If you feel motivated, you can check the test before (standalone), so that the commit will highlight the differences pre/post & will make the changeset more readable.
If no change in the schedule is produced, I'm confused.

steleman updated this revision to Diff 123874. Nov 21 2017, 7:13 PM

Added test case.

davide added inline comments. Nov 21 2017, 8:48 PM
test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll
6 ↗(On Diff #123874)

I don't understand why this change is related to loop-unrolling, can you please elaborate?

steleman added inline comments. Nov 21 2017, 9:02 PM
test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll
6 ↗(On Diff #123874)

I don't understand why this change is related to loop-unrolling, can you please elaborate?

I am not sure I understand your question.

Are you referring to the entire test case, or just to the two lines above your comment?

If you are asking about the entire test case, then the answer is: the change in the T99 scheduler is about loop unrolling, and only about loop unrolling. It's not about anything else.

davide added inline comments. Nov 21 2017, 9:13 PM
test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll
6 ↗(On Diff #123874)

Can you please check in the test standalone and update the change to show the difference?

steleman added inline comments. Nov 21 2017, 9:44 PM
test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll
6 ↗(On Diff #123874)

I can't check in because I don't have check-in privileges. :-)

But I uploaded the relevant files to my Google Drive:

https://drive.google.com/open?id=18GseEsH4XN6bjD3UAcSB06GkXhWzPSAf

(Please let me know if you have problems accessing that directory; I don't think you will, since I set it to world-readable.)

There are 3 files in there:

  1. t99-loop-unrolling-without-scheduler-patch.txt
    • That's the output from opt from LLVM Git 2017/11/16 without my scheduler patch.
  2. t99-loop-unrolling-with-scheduler-patch.txt
    • That's the output from opt from LLVM Git 2017/11/16 with my scheduler patch.
  3. loop.c
    • That's a test/experiment that I wrote and used to exercise the CodeGen changes triggered by the change in the scheduler. It's an interesting test.

If you have the time, and the inclination, you can use loop.c to track what happens to the loop at different optimization levels (-O1, -O2, -O3).

Also, yes, it wasn't entirely clear to me this was related to unrolling, because the commit message is vague and doesn't really explain the change.
I think you should consider rephrasing it to be more explicit, thanks!

This change makes sense to me (after the comments are addressed), but please wait for another opinion (@MatzeB / @qcolombet are probably good people to take another look/sign off)

test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll
2 ↗(On Diff #123874)

I don't think you need the debug output.

kristof.beyls added inline comments.
lib/Target/AArch64/AArch64ISelLowering.cpp
10981–10984

I get the impression that the enabling of AggressiveFMAFusion is orthogonal to the change in the pipeline model.
If so, I think it's better to commit these separately.
Also, I think this needs a regression test to show that this does what you want.

fhahn added a subscriber: fhahn. Nov 22 2017, 1:39 AM
fhahn added inline comments. Nov 22 2017, 8:57 AM
lib/Target/AArch64/AArch64SchedThunderX2T99.td
394–395

Unsupported = 0 should be the default, so you should be able to just drop this line, see https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Target/TargetSchedule.td#L257

Also if I understand correctly, this should have an impact on scheduling instructions using WriteAtomic, like CASB. So it should be possible to create a test making sure the scheduler uses that info as expected. Although I am not sure if it's worth it and how much work that would be :)

Finally, this seems unrelated to the LoopMicroOpBufferSize change and could be split out in a separate patch.

Also if I understand correctly, this should have an impact on scheduling instructions using WriteAtomic, like CASB.

It doesn't. Maybe it was supposed to, in theory, but in reality it makes no difference whatsoever.

The only noticeable difference is in instruction cost: when Unsupported == 1, the estimated instruction cost is higher than when Unsupported == 0. Which makes no difference whatsoever in practice, given that the real cost of LSE instructions is pretty high anyway.

We've been emitting LSE instructions with no problems for many months. Whether or not Unsupported was 0 or 1 made no difference whatsoever.

fhahn added a comment. Nov 27 2017, 6:33 AM

Also if I understand correctly, this should have an impact on scheduling instructions using WriteAtomic, like CASB.

It doesn't. Maybe it was supposed to, in theory, but in reality it makes no difference whatsoever.

The only noticeable difference is in instruction cost: when Unsupported == 1, the estimated instruction cost is higher than when Unsupported == 0. Which makes no difference whatsoever in practice, given that the real cost of LSE instructions is pretty high anyway.

We've been emitting LSE instructions with no problems for many months. Whether or not Unsupported was 0 or 1 made no difference whatsoever.

It is probably unlikely that atomic instructions are in the hot loops of most benchmarks, so I am not surprised that you do not see any impact on runtimes (assuming that's what you meant by 'it makes no difference whatsoever'). However, there should be cases where it makes a difference when scheduling, but it probably requires some time to come up with a test case.

I think going forward, it would be best to split up this patch in 3 independent parts:

  1. enableAggressiveFMAFusion
  2. WriteAtomic change
  3. LoopMicroOpBufferSize

This would make it slightly easier to review/accept, to keep track of comments on independent things, and to revert in case something goes wrong.

MatzeB requested changes to this revision. Nov 27 2017, 12:49 PM
MatzeB added inline comments.
lib/Target/AArch64/AArch64ISelLowering.cpp
10981–10984

Please avoid getProcFamily() checks outside of AArch64SubtargetInfo. Use target features and transfer the logic into SubtargetInfo!

This revision now requires changes to proceed. Nov 27 2017, 12:49 PM
steleman added a comment (edited). Nov 27 2017, 3:00 PM

Please avoid getProcFamily() checks outside of AArch64SubtargetInfo. Use target features and transfer the logic into SubtargetInfo!

Can you please explain the rationale behind this restriction?

This restriction certainly isn't reflected in the class interface design, as the Subtarget pointer is accessible and Subtarget->getProcFamily() is accessible as well.

It also does not appear to be a real restriction at all, considering that Subtarget->getProcFamily() is already being used in AArch64ISelLowering.cpp.

What you are asking for here is the implementation of a new target feature. This seems (a) unnecessary and (b) a source of unnecessary code complexity.

For starters, aggressive FMA is a compiler-dependent subjective decision. It's not a target-dependent objective attribute, such as LSE or Neon. What's aggressive FMA to LLVM is not aggressive at all to GCC for example.

The design implication is that for every single existing public interface we are going to implement an additional, functionally duplicative target feature, which really isn't a target feature to begin with.

The implementation implication is that a one-line code change becomes a 10-line code change that accomplishes the same exact thing as the one-line change.

Thank you in advance for your explanation.

Please avoid getProcFamily() checks outside of AArch64SubtargetInfo. Use target features and transfer the logic into SubtargetInfo!

Can you please explain the rationale behind this restriction?

It's not a strict rule but unless there is a good reason not to do it we should :)

It's certainly not supported by the class interface design, as the Subtarget pointer is accessible and Subtarget->getProcFamily() is accessible as well.

It also does not appear to be a real restriction at all, considering that Subtarget->getProcFamily() is already being used in AArch64ISelLowering.cpp.

What you are asking for here is the implementation of a new target feature. This seems (a) unnecessary and (b) a source of unnecessary code complexity.

For starters, aggressive FMA is a compiler-dependent subjective decision. It's not a target-dependent objective attribute, such as LSE or Neon. What's aggressive FMA to LLVM is not aggressive at all to GCC for example.

The design implication is that for every single existing public interface we are going to implement an additional, functionally duplicative target feature, which really isn't a target feature to begin with.

The implementation implication is that a one-line code change becomes a 10-line code change that accomplishes the same exact thing as the one-line change.

Thank you in advance for your explanation.

  • I actively worked on changing those CPU specific checks/tweaks across the target code into target features, I'd like for this to stay this way.
  • The name "target feature" is certainly unfortunate, as those aren't really features but rather tuning objectives. Nevertheless they are target specific, so the target feature system works nicely here.
  • Some reasons as to why making feature bits out of them is a good thing:
    • Things often start out as an inconspicuous if (CPU) doSomething(). But in time more CPUs are added, resulting in longish if (CPU || CPUversion2 || CPUversion3 || otherVendorsCPU) doSomething() scattered across the code.
    • We suddenly start mixing code that deals with something like "more aggressive FMA folding" with small lists of CPUs that should use this. This is annoying when adding new CPUs: It is hard to spot all places where checks could/should be added. It feels a lot cleaner to do this in AArch64.td as a list of feature/tuning bits.
    • As a bonus, target features can be enabled/disabled with -Xclang -target-feature=-xxx,+yyy which is sometimes nice for compiler developers to experiment with.

Also if I understand correctly, this should have an impact on scheduling instructions using WriteAtomic, like CASB.

It doesn't. Maybe it was supposed to, in theory, but in reality it makes no difference whatsoever.

The only noticeable difference is in instruction cost: when Unsupported == 1, the estimated instruction cost is higher than when Unsupported == 0. Which makes no difference whatsoever in practice, given that the real cost of LSE instructions is pretty high anyway.

We've been emitting LSE instructions with no problems for many months. Whether or not Unsupported was 0 or 1 made no difference whatsoever.

It is probably unlikely that atomic instructions are in the hot loops of most benchmarks, so I am not surprised that you do not see any impact on runtimes (assuming that's what you meant by 'it makes no difference whatsoever'). However, there should be cases where it makes a difference when scheduling, but it probably requires some time to come up with a test case.

That's not what I meant. What I meant has nothing to do with hot loops in SPECcpu benchmarks.

You can convince yourself of what I am saying by compiling the following test program:

https://drive.google.com/file/d/1gZ_i738yJ1TlJcQQ_oo-vTghydPUU1ZR/view?usp=sharing

(lseadd.c)

You can change around the #define for ATOMIC_MEMORY_ORDERING in the test program.

clang { -O0|-O1|-O2|-O3 } -std=c99 -mcpu=thunderx2t99 -S lseadd.c -o lseadd-t99.S
clang { -O0|-O1|-O2|-O3 } -std=c99 -mcpu=cortex-a57 -S lseadd.c -o lseadd-a57.S

T99 supports LSE Atomics. Cortex-A57 does not. For Cortex-A57, LLVM will emit LL/SC for the atomics. For T99, LLVM will always emit LSE Atomics for this test program, regardless of whether Unsupported == 0 or Unsupported == 1 in the AArch64SchedThunderX2T99.td scheduler file.

When compiling for T99, no warning is emitted about Unsupported == 1.

Also, when compiling LLVM, no warning is emitted by llvm-tblgen about Unsupported == 0 while having LSE Atomics available.

steleman abandoned this revision. Nov 30 2017, 5:23 PM

Superseded by:

D40694 - Remove Unsupported flag from T99 Scheduler
D40695 - Improve loop unrolling performance on T99
D40696 - Enable aggressive FMA on T99 and provide AArch64 option for other micro-arch's