This change adds initial support for both preference helper functions. The loop unrolling preferences are given initial settings that control the thresholds, size, and attributes of loops to unroll, with some tuning applied. The peeling preferences may need further tuning as well; the initial support closely follows what other architectures use. An unrolling test is also added for RISC-V to track how the preferences modify and control loop unrolling.
Use -mtriple=riscv64 unless it's genuinely OS-dependent (which this is not)
I doubt you need all these pointer attributes
These comments do nothing
This would also be more natural as the final block in the function; presumably the current block schedule is an artefact of the order in which various optimisations happened on the original C and IR
Is the TBAA actually needed?
This comment is still there
!0 is only used self-referentially; only !1 is referenced from outside of the metadata itself. So I don't think this distinct does anything.
Is this needed (same for the !llvm.loop)? If yes, just inline it like !llvm.loop; if not, delete them.
Is this truly checking "vectorized loops" or just loops containing vector instructions? We've already checked that the loop isn't vectorized according to metadata. What about code written using RVV intrinsics, or with OpenCL/SYCL/etc? We might want to unroll those loops, right?
Does the explicit size of 4 help much or should we just use SmallVector<const Value*>?
I think for now I am going to mark this with a TODO for more tuning. I updated the comment for vectorized instructions; it will be uploaded soon...
This setting mirrors SLP's generic vector operand setting, which we utilize, so it does seem appropriate.
This shouldn't be mirroring anything. The ARM code I copied from pre-dates SmallVector's ability to automatically pick a value for the second template parameter. I think it will automatically pick a value larger than 4, so the second parameter can be removed.
Sorry, I missed the previous ping and was out last week:
- @luismarques reports this is performance neutral for Embench and Coremark on Ibex.
- Finding representative benchmarks for anything like this is clearly difficult. If anyone has run e.g. SPEC on real RISC-V hardware, that would be interesting. Cases where the change is roughly performance neutral but may waste I$ might not show up on simple benchmarks.
- Just as another datapoint, I ran this against the GCC torture suite. One case that stuck out to me was pr85169.c. It seems pretty unlikely that a huge number of unrolled stores of a zero byte is profitable (though maybe my intuition is wrong!). Could you please take a quick look at this case to see if there is any obvious tuning that can be done for it?
Otherwise, this looks good to me, and I don't think pr85169.c needs to be a blocker.
I altered the unrolling preferences for the indicated case; we meet the unroll criteria anyway, as the loop is small at the time of evaluation. Also, the abort was moved to the exit block by the time of evaluation and so is never encountered as a call in the loop. So I think we are ok for this case.
I think that was with an old version of this patch. With the current patch (now reverted) the numbers are:
- CoreMark O3: +11.4% perf, 2.88 times the size
- CoreMark O2: +9.36% perf, 2.88 times the size
- Embench O3: no perf change, 23.6% size increase
- Embench O2: no perf change, 28.6% size increase
- No changes for Os and Oz.
That's almost tripling the CoreMark size.
CoreMark is a pretty tiny benchmark that fits in the L1 I$ of many processors, so that's probably not hugely surprising, though I suspect you could get a similar performance increase with a much more targeted set of unrollings...
@luismarques and I were chatting about this patch some more. A few thoughts I'm writing down so we don't lose them (we should discuss on the call today too).
The key question is whether this unrolling should be enabled for all RISC-V targets or not. Looking at other backends:
- AArch64: more aggressive unrolling options only enabled for in-order models
- ARM: most unrolling options only enabled for M-class cores
Is it your view that this transformation is worthwhile on all common RISC-V microarchitectures?
Enumerating tune CPUs is not a general principle. Could we instead predict this from some subtarget feature or parameter, such as whether the core is out-of-order? Alternatively, we could read a UseDefaultPreferences value that has been initialized directly in the subtarget.