This is an archive of the discontinued LLVM Phabricator instance.

Add loop unrolling and peeling preferences for RISCV
ClosedPublic

Authored by mcberg2021 on Nov 12 2021, 12:46 PM.

Details

Summary

Both these preference helper functions have initial support with this change. The loop unrolling preferences are set with initial settings to control thresholds, size and attributes of loops to unroll with some tuning done. The peeling preferences may need some tuning as well as the initial support looks much like what other architectures utilize. An unrolling test is added for RISCV as well to track how preferences modify/control loop unrolling.

Diff Detail

Event Timeline

mcberg2021 created this revision.Nov 12 2021, 12:46 PM
mcberg2021 requested review of this revision.Nov 12 2021, 12:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 12 2021, 12:46 PM
craig.topper added inline comments.Nov 12 2021, 1:05 PM
llvm/test/CodeGen/RISCV/unroll.ll
9 ↗(On Diff #386925)

Drop the FunctionAttrs comment and the dso_local and local_unnamed_addr

169 ↗(On Diff #386925)

I doubt we need all these attributes and metadata

craig.topper added inline comments.Nov 12 2021, 1:06 PM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
204

The MVE in this comment appears to have been copied from ARM.

jrtc27 added a comment.EditedNov 12 2021, 1:06 PM

This test screams of "I took C and shoved it into a test case" rather than actually taking the time to distill it down to a minimal example of IR free from irrelevant clutter

llvm/test/CodeGen/RISCV/unroll.ll
2 ↗(On Diff #386925)

opt test in CodeGen?

jrtc27 added inline comments.Nov 12 2021, 1:08 PM
llvm/test/CodeGen/RISCV/unroll.ll
2 ↗(On Diff #386925)

(this belong in llvm/test/Transforms/LoopUnroll/RISCV)

Updated as per feedback.

mcberg2021 marked an inline comment as done.Nov 12 2021, 1:45 PM
mcberg2021 marked 4 inline comments as done.
mcberg2021 added inline comments.
llvm/test/CodeGen/RISCV/unroll.ll
2 ↗(On Diff #386925)

Test moved to LoopUnroll/RISCV

mcberg2021 marked an inline comment as done.

Missed an update...

jrtc27 added inline comments.Nov 13 2021, 12:18 PM
llvm/test/Transforms/LoopUnroll/RISCV/unroll.ll
3

Use -mtriple=riscv64 unless it's genuinely OS-dependent (which this is not)

5

I doubt you need all these pointer attributes

147

These comments do nothing

147

This would also be more natural as the final block in the function; presumably the current block schedule is an artefact of the order in which various optimisation happened on the original C and IR

164

Is the TBAA actually needed?

Updated with simplifications and formatting.

mcberg2021 marked 5 inline comments as done.Nov 14 2021, 12:19 PM

Marked tasks as done.

jrtc27 added inline comments.Nov 14 2021, 12:25 PM
llvm/test/Transforms/LoopUnroll/RISCV/unroll.ll
147

This comment is still there

164

!0 is only used self-referentially; only !1 is referenced from outside of the metadata itself. So I don't think this distinct does anything.

165

Is this needed (same for the !llvm.loop)? If yes, just inline it like !llvm.loop, if no delete them.

More cleanup

mcberg2021 marked 3 inline comments as done.Nov 14 2021, 3:44 PM
frasercrmck added inline comments.Nov 15 2021, 2:00 AM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
204

Is this truly checking "vectorized loops" or just loops containing vector instructions? We've already checked that the loop isn't vectorized according to metadata. What about code written using RVV intrinsics, or with OpenCL/SYCL/etc? We might want to unroll those loops, right?

217

Does the explicit size of 4 help much or should we just use SmallVector<const Value*>?

khchen added a subscriber: khchen.Nov 15 2021, 4:34 AM
mcberg2021 added inline comments.Nov 15 2021, 9:10 AM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
204

I think for now I am going to mark this with a TODO for more tuning, I updated the comment for vectorized instructions, it will be uploaded soon...

217

This setting mirrors SLP's generic vector operand setting, which we utilize, so it does seem appropriate.

craig.topper added inline comments.Nov 15 2021, 9:11 AM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
204

This part of the patch came from a change I made internally. I believe this entire loop was just naively copied from the ARM target.

217

This was also copied from ARM.

Updated comments as needed.

mcberg2021 marked 4 inline comments as done.Nov 15 2021, 9:20 AM
craig.topper added inline comments.Nov 15 2021, 9:21 AM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
217

This shouldn't be mirroring anything. The ARM code I copied from pre-dates SmallVector's ability to automatically pick a value for the second template parameter. I think it will automatically pick a value more than 4. So I think the second parameter can be removed.

Removed size constraint on initialization of Operands to be consumed in getUserCost.

mcberg2021 marked an inline comment as done.Nov 15 2021, 9:36 AM

Are there any further concerns? If not can we progress towards approval?

This revision is now accepted and ready to land.Nov 19 2021, 10:00 AM

Perhaps we should run this across a set of benchmarks we're interested in?

Perhaps we should run this across a set of benchmarks we're interested in?

We've been using this internally on our SiFive 7 series and U8. Should we check the CPU?

Perhaps we should run this across a set of benchmarks we're interested in?

We've been using this internally on our SiFive 7 series and U8. Should we check the CPU?

I know that @asb often runs benchmarks over prospective patches, so I thought he might have some thoughts about we generally know this sort of thing is ready to go.

Ping, @asb can you take a look at this?

asb added a comment.Dec 6 2021, 12:24 PM

Sorry, I missed the previous ping and was out last week:

  • @luismarques reports this is performance neutral for Embench and Coremark on Ibex.
  • Representative benchmarks for anything like this is clearly difficult. If anyone has run e.g. SPEC on real RISC-V hardware, that would be interesting. Cases where the change is roughly performance neutral but may waste I$ might not show up on simple benchmarks.
  • Just as another datapoint, I ran this against the GCC torture suite. One case that stuck out to me was pr85169.c. It seems pretty unlikely that huge number of unrolled stores of zero byte is profitable (though maybe my intuition is wrong!). Could you please take a quick look at this case to see if there is any obvious tuning that can be done for it?

Otherwise, this looks good to me, and I don't think pr85169.c needs to be a blocker.

Alex, we have been using these unrolling preferences in house since mid summer for RISC-V. I will have a look at the outlier case too.

I altered the unrolling preferences for the indicated case, we meet the unroll criteria anyways, as the loop is small at the time of evaluation. Also the abort was motioned to the exit block by the time of the evaluation and so is never encountered as a call in the loop. So I think we are ok for this case.

jrtc27 added a comment.Dec 8 2021, 7:39 AM

Please include why you're reverting a commit in the commit message, as per the developer policy https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy

  • @luismarques reports this is performance neutral for Embench and Coremark on Ibex.

I think that was with an old version of this patch. With the current patch (now reverted) the numbers are:

CoreMark O3: +11.4% perf, 2.88 times the size
CoreMark O2: +9.36% perf, 2.88 times the size
Embench O3: no perf change, 23.6% size increase
Embench O2: no perf change, 28.6% size increase

No changes for Os and Oz.
That's almost tripling the CoreMark size.

mcberg2021 added a comment.EditedDec 8 2021, 12:04 PM

The initial checkin was reverted because a lit cfg was missing to exclude non target transform tests of RISCV loop unrolling, should be fixed shortly.

The initial checkin was reverted because a lit cfg was missing to exclude non target transform tests of RISCV loop unrolling, should be fixed shortly.

Let's discuss this patch tomorrow in the RISC-V sync-up call. Please don't merge it yet, even if that test issue is fixed.

luismarques reopened this revision.Dec 8 2021, 12:15 PM
This revision is now accepted and ready to land.Dec 8 2021, 12:15 PM
  • @luismarques reports this is performance neutral for Embench and Coremark on Ibex.

I think that was with an old version of this patch. With the current patch (now reverted) the numbers are:

CoreMark O3: +11.4% perf, 2.88 times the size
CoreMark O2: +9.36% perf, 2.88 times the size
Embench O3: no perf change, 23.6% size increase
Embench O2: no perf change, 28.6% size increase

No changes for Os and Oz.
That's almost tripling the CoreMark size.

CoreMark is a pretty tiny benchmark that fits in the L1 I$ of many processors, so that's probably not hugely surprising, though I suspect you could get a similar performance increase with a much more targeted segment of unrollings...

asb added a comment.Dec 9 2021, 7:48 AM

@luismarques and I were chatting about this patch some more. A few thoughts I'm writing down so we don't lose them (we should discuss on the call today too).

The key question is whether this unrolling should be enabled for all RISC-V targets or not. Looking at other backends:

  • AArch64: more aggressive unrolling options only enabled for in-order models
  • ARM: Most unrolling options only enabled for M-class cores

Is it your view that this transformation is worthwhile on all common RISC-V microarchitectures?

eopXD added a subscriber: eopXD.Dec 10 2021, 8:39 AM
lbenes added a subscriber: lbenes.Dec 10 2021, 10:43 AM

A branch containing this patch was accidentally pushed to GitHub: https://github.com/llvm/llvm-project/tree/arcpatch-D113798

Can someone please remove it?

A branch containing this patch was accidentally pushed to GitHub: https://github.com/llvm/llvm-project/tree/arcpatch-D113798

Can someone please remove it?

Gone.

This comment was removed by mcberg2021.
mcberg2021 added a comment.EditedDec 15 2021, 2:12 PM

The uploaded excel spreadsheet shows the difference between the default unroll preferences and the version presented in this review. The data is collected for Spec2k6 INT base.

Updated as per request, SiFive SubTargets are now guarded and default preferences used otherwise.

craig.topper added inline comments.Dec 17 2021, 5:47 PM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
175

I think this should use getTuneCPU not getCPU

Updated to use getTuneCPU in place of getCPU

mcberg2021 marked an inline comment as done.Dec 17 2021, 6:17 PM

Looking at recent issues, the test/Bindings/Go failure is intermittent on premerge testing for most changes.

This revision was landed with ongoing or failed builds.Dec 18 2021, 12:55 PM
This revision was automatically updated to reflect the committed changes.
zixuan-wu added inline comments.
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
179

How about getting UseDefaultPreferences flag from subtarget? And initialize the value in subtarget.

zixuan-wu added inline comments.Dec 22 2021, 1:55 AM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
175

It's not a general principle to just enumerate tune cpus. Could we predict some feature or parameter from subtarget such as whether it's out-of-order. Or we need get UseDefaultPreferences value which has been initialized in subtarget directly.