This is an archive of the discontinued LLVM Phabricator instance.

Performance enhancements for Cavium ThunderX2 T99
ClosedPublic

Authored by joelkevinjones on Apr 6 2017, 8:25 PM.

Details

Summary

This patch enables significant performance enhancements to the
Cavium ThunderX2T99 LLVM backend, as observed by running SPEC2K6.

The patch is based on LLVM ToT (Git mirror) as of 20 June 2017.

Related Bugzilla bug: http://bugs.llvm.org/show_bug.cgi?id=32562
Patch author: steleman

Diff Detail

Repository
rL LLVM

Event Timeline

steleman created this revision.Apr 6 2017, 8:25 PM
rengolin edited edge metadata.
rengolin added a subscriber: kristof.beyls.

Wow, great additions, thanks!

I think the check could be a target feature, or it could be true to all AArch64 targets.

@kristof.beyls isn't FMA always faster on vanilla AArch64?

@t.p.northover same for Apple's hardware, I think, no?

cheers,
--renato

lib/Target/AArch64/AArch64ISelLowering.cpp
7624 ↗(On Diff #94477)

This smells like a target feature. :)

javed.absar added inline comments.
lib/Target/AArch64/AArch64SchedThunderX2T99.td
370 ↗(On Diff #94477)

nitpick. These could be compressed into fewer lines separated as (instregex a, b, ...). Also, doesn't "^B" cover "^BL" as well?

I will re-submit with changes shortly - I also have a few more additions to the *.td
file as well.

lib/Target/AArch64/AArch64ISelLowering.cpp
7624 ↗(On Diff #94477)

I can definitely re-write it as a TargetFeature.

lib/Target/AArch64/AArch64SchedThunderX2T99.td
370 ↗(On Diff #94477)

Yes I will compress them and re-submit.

kristof.beyls added inline comments.Apr 11 2017, 8:17 AM
lib/Target/AArch64/AArch64ISelLowering.cpp
7624 ↗(On Diff #94477)

my 2 cents, hoping someone with more experience in this area can help further:

IIUC, the effect of this change is that also for vector types, this function now returns true, when targeting ThunderX2T99 or ThunderXT88?

I can see that for different cores/micro-architectures a different response could be wanted here, also depending on the types involved.
Since the answer could be different dependent on the type, I'm guessing that this won't map nicely to a target feature.
Maybe this information needs to become a subtarget hook somehow?

Also, please upload the next iteration of this patch with lots of context (e.g. git diff -U9999) - that makes reviewing it a bit easier.

steleman added inline comments.Apr 11 2017, 9:11 AM
lib/Target/AArch64/AArch64ISelLowering.cpp
7624 ↗(On Diff #94477)

IIUC, the effect of this change is that also for vector types, this function now returns true, when targeting ThunderX2T99 or ThunderXT88?

Correct, for floating-point vector types.

Since the answer could be different dependent on the type, I'm guessing that this won't map nicely to a target feature.

Since the answer could be different dependent on the type, I'm guessing that this won't map nicely to a target feature.
Maybe this information needs to become a subtarget hook somehow?

I think it depends on

  • how many other micro-architectures (besides Cavium T88/T99) want/need this sub-feature
  • how many different types would need different behaviors

So, in my mind, this comes down to where exactly does this complexity reside: in this function, or in a separate hook someplace else.

I don't have a clear answer to this question because I don't know what other micro-architectures need or want with respect to FMA.

t.p.northover added inline comments.Apr 11 2017, 12:20 PM
lib/Target/AArch64/AArch64ISelLowering.cpp
7624 ↗(On Diff #94477)

IIUC, the effect of this change is that also for vector types, this function now returns true, when targeting ThunderX2T99 or ThunderXT88?

Doesn't getScalarType already cover that? I was trying to work out what this exactly did yesterday and thought it was excluding scalars, but I now see that's not true either.

I strongly suspect it's a NOP (except for f16 CodeGen).

steleman added inline comments.Apr 11 2017, 2:55 PM
lib/Target/AArch64/AArch64ISelLowering.cpp
7624 ↗(On Diff #94477)

I strongly suspect it's a NOP (except for f16 CodeGen).

I am not so sure it's a NOP considering that the expression

if (VT.isFloatingPoint() && VT.isVector())

will evaluate to true - for the case of a floating-point vector type.

What I am unclear about is: why do I need to bother knowing the scalar type when I already know that a floating-point vector type will always be faster using FMA than by using MUL + ADD -- at least for the T88 and T99 micro-architectures.

t.p.northover added inline comments.Apr 11 2017, 3:07 PM
lib/Target/AArch64/AArch64ISelLowering.cpp
7624 ↗(On Diff #94477)

if (VT.isFloatingPoint() && VT.isVector()) will evaluate to true - for the case of a floating-point vector type.

Yes, but we already get true. If the input type is v4f32 (say) then VT = VT.getScalarType() returns f32 and the existing switch returns true.

steleman updated this revision to Diff 96673.Apr 25 2017, 11:52 PM

Updates to the T99 Scheduler.

This patch is based on LLVM git mirror from 2017/04/25.

steleman updated this revision to Diff 97036.Apr 27 2017, 8:10 PM

Updated to the most recent LLVM git mirror from 2017/04/27.
Several regex and instruction improvements.

rengolin added inline comments.Apr 28 2017, 2:44 AM
lib/Target/AArch64/AArch64ISelLowering.cpp
7624 ↗(On Diff #94477)

Tim is right, this is a NOP. This function already returns true for both scalar and vector f32/f64.

steleman added inline comments.Apr 28 2017, 6:45 AM
lib/Target/AArch64/AArch64ISelLowering.cpp
7624 ↗(On Diff #94477)

Tim is right, this is a NOP. This function already returns true [ ... ]

I had removed this change from my latest update, so it's no longer part of this revision, but:

Can't this entire function simply be re-written as one line:

bool AArch64TargetLowering::isFMAFasterThanFMulAndFAdd(EVT VT) const {

return VT.isFloatingPoint();

}

without having to go through all that extra stuff this function currently does?

Right, I clicked the link and it took me to the old code. :)

I'm not sure how f16/f128 would fare there. Probably not well.

Right, I clicked the link and it took me to the old code. :)

I'm not sure how f16/f128 would fare there. Probably not well.

So, then, this would be a case of "some micro-architectures can do this for any floating-point type, some others need special treatment".

Right now, f16 is completely excluded.

So, then, this would be a case of "some micro-architectures can do this for any floating-point type, some others need special treatment".

That's why I suggested sub-arch specific behaviour, either via subtarget or TLI hooks. The latter is probably better, for the reasons Kristof mentions.

But that's for another patch.

I'm ok with the change, as long as @javed.absar is happy. :)

javed.absar accepted this revision.May 10 2017, 1:40 AM

LGTM. Thanks for this.

This revision is now accepted and ready to land.May 10 2017, 1:40 AM
joelkevinjones commandeered this revision.Jun 20 2017, 5:15 PM
joelkevinjones edited reviewers, added: steleman; removed: joelkevinjones.
joelkevinjones edited edge metadata.

There is an issue with madd not being generated when this revision is applied to ToT. I have a fix that I will upload. It appears I have to commandeer the revision to do this?

joelkevinjones edited the summary of this revision. (Show Details)

Removed matching on "^SUB" which was inadvertently matching "SUBREG_TO_REG", with resulting failure in test/Codegen/AArch64/machine-combine-madd.ll

joelkevinjones requested review of this revision.Jun 26 2017, 11:43 AM
joelkevinjones edited edge metadata.

Ping?

This revision was automatically updated to reflect the committed changes.