This change adds initial support for both preference helper functions. The loop unrolling preferences are set with initial thresholds controlling the size and attributes of loops to unroll, with some tuning already done. The peeling preferences may need further tuning, as the initial support closely follows what other architectures use. An unrolling test is also added for RISC-V to track how the preferences modify and control loop unrolling.
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
| Line | Comment |
|---|---|
| 204 | The MVE in this comment appears to have been copied from ARM. |
This test screams of "I took C and shoved it into a test case" rather than actually taking the time to distill it down to a minimal example of IR free from irrelevant clutter.
llvm/test/CodeGen/RISCV/unroll.ll
| Line | Comment |
|---|---|
| 2 ↗ (On Diff #386925) | opt test in CodeGen? |
llvm/test/CodeGen/RISCV/unroll.ll
| Line | Comment |
|---|---|
| 2 ↗ (On Diff #386925) | (this belongs in llvm/test/Transforms/LoopUnroll/RISCV) |
llvm/test/CodeGen/RISCV/unroll.ll
| Line | Comment |
|---|---|
| 2 ↗ (On Diff #386925) | Test moved to LoopUnroll/RISCV |
llvm/test/Transforms/LoopUnroll/RISCV/unroll.ll
| Line | Comment |
|---|---|
| 3 | Use -mtriple=riscv64 unless it's genuinely OS-dependent (which this is not) |
| 5 | I doubt you need all these pointer attributes |
| 147 | These comments do nothing |
| 147 | This would also be more natural as the final block in the function; presumably the current block schedule is an artefact of the order in which various optimisations happened on the original C and IR |
| 164 | Is the TBAA actually needed? |
llvm/test/Transforms/LoopUnroll/RISCV/unroll.ll
| Line | Comment |
|---|---|
| 147 | This comment is still there |
| 164 | !0 is only used self-referentially; only !1 is referenced from outside the metadata itself, so I don't think this `distinct` does anything. |
| 165 | Is this needed (same for the !llvm.loop)? If yes, just inline it like the !llvm.loop; if not, delete them. |
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
| Line | Comment |
|---|---|
| 204 | Is this truly checking "vectorized loops" or just loops containing vector instructions? We've already checked that the loop isn't vectorized according to metadata. What about code written using RVV intrinsics, or with OpenCL/SYCL/etc.? We might want to unroll those loops, right? |
| 217 | Does the explicit size of 4 help much, or should we just use SmallVector<const Value*>? |
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
| Line | Comment |
|---|---|
| 204 | For now I am going to mark this with a TODO for more tuning. I updated the comment for vectorized instructions; it will be uploaded soon. |
| 217 | This setting mirrors SLP's generic vector operand setting, which we utilize, so it does seem appropriate. |
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
| Line | Comment |
|---|---|
| 217 | This shouldn't be mirroring anything. The ARM code I copied from pre-dates SmallVector's ability to automatically pick a value for the second template parameter. I think it will automatically pick a value greater than 4, so the second parameter can be removed. |
We've been using this internally on our SiFive 7 series and U8. Should we check the CPU?
I know that @asb often runs benchmarks over prospective patches, so I thought he might have some thoughts about how we generally know this sort of thing is ready to go.
Sorry, I missed the previous ping and was out last week:
- @luismarques reports this is performance neutral for Embench and Coremark on Ibex.
- Finding representative benchmarks for anything like this is clearly difficult. If anyone has run e.g. SPEC on real RISC-V hardware, that would be interesting. Cases where the change is roughly performance neutral but may waste I$ might not show up in simple benchmarks.
- Just as another datapoint, I ran this against the GCC torture suite. One case that stuck out to me was pr85169.c. It seems pretty unlikely that a huge number of unrolled zero-byte stores is profitable (though maybe my intuition is wrong!). Could you please take a quick look at this case to see if there is any obvious tuning that can be done for it?
Otherwise, this looks good to me, and I don't think pr85169.c needs to be a blocker.
Alex, we have been using these unrolling preferences in house since mid summer for RISC-V. I will have a look at the outlier case too.
I altered the unrolling preferences for the indicated case; we meet the unroll criteria anyway, as the loop is small at the time of evaluation. Also, the abort was moved to the exit block by the time of evaluation, so it is never encountered as a call in the loop. So I think we are OK for this case.
Please include why you're reverting a commit in the commit message, as per the developer policy https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy
I think that was with an old version of this patch. With the current patch (now reverted) the numbers are:
CoreMark O3: +11.4% perf, 2.88 times the size
CoreMark O2: +9.36% perf, 2.88 times the size
Embench O3: no perf change, 23.6% size increase
Embench O2: no perf change, 28.6% size increase
No changes for Os and Oz.
That's almost tripling the CoreMark size.
The initial check-in was reverted because a lit cfg was missing to exclude non-target-transform tests of RISC-V loop unrolling; this should be fixed shortly.
Let's discuss this patch tomorrow in the RISC-V sync-up call. Please don't merge it yet, even if that test issue is fixed.
CoreMark is a pretty tiny benchmark that fits in the L1 I$ of many processors, so that's probably not hugely surprising, though I suspect you could get a similar performance increase with a much more targeted set of unrollings...
@luismarques and I were chatting about this patch some more. A few thoughts I'm writing down so we don't lose them (we should discuss on the call today too).
The key question is whether this unrolling should be enabled for all RISC-V targets or not. Looking at other backends:
- AArch64: more aggressive unrolling options only enabled for in-order models
- ARM: most unrolling options only enabled for M-class cores
Is it your view that this transformation is worthwhile on all common RISC-V microarchitectures?
A branch containing this patch was accidentally pushed to GitHub: https://github.com/llvm/llvm-project/tree/arcpatch-D113798
Can someone please remove it?
The uploaded Excel spreadsheet shows the difference between the default unroll preferences and the version presented in this review. The data was collected for Spec2k6 INT base.
Updated as requested: SiFive subtargets are now guarded, and default preferences are used otherwise.
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
| Line | Comment |
|---|---|
| 175 | I think this should use getTuneCPU, not getCPU |
Looking at recent issues, the test/Bindings/Go failure is intermittent on premerge testing for most changes.
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
| Line | Comment |
|---|---|
| 179 | How about getting a UseDefaultPreferences flag from the subtarget, and initializing the value there? |
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
| Line | Comment |
|---|---|
| 175 | Enumerating tune CPUs isn't really a general principle. Could we instead derive some feature or parameter from the subtarget, such as whether it's out-of-order? Or we could read a UseDefaultPreferences value that has been initialized in the subtarget directly. |