The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This patch enables it for RISC-V, since interleaved shuffle patterns can now be
lowered to vlseg/vsseg as of https://reviews.llvm.org/D145022
Repository: rG LLVM Github Monorepo
llvm/test/Transforms/LoopVectorize/RISCV/interleaved-accesses-expensive.ll | ||
---|---|---|
4 | We don't need to check if the type is legal in getInterleavedMemoryOpCost, since the loop vectorizer already checks if the type needs to be split by calling TTI.getNumberOfParts |
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp | ||
---|---|---|
347 | This doesn't look right. We need to account for the cost of the actual memory op, plus the interweave cost (if any). At a minimum, we need to have the full cost of the wide memory op as a baseline. I can't imagine hardware with an optimized segment-2 which beats the cost of a normal load/store op of the same width. The only question left is whether we need to explicitly model the shuffle cost. Depending on the hardware, we may or may not have an optimized segment load/store. I think it's probably safest to cost model this as if we're going to do a wide load followed by a shuffle. We can reduce that cost if we have a target which a) actually has faster segment-2, and b) cares about the cost difference. | |
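The costing rule proposed in the comment above can be sketched as follows. This is a minimal standalone illustration, not the code in RISCVTargetTransformInfo.cpp: the struct, its field names, and the `HasFastSegment` flag are hypothetical, and the cost units are arbitrary.

```cpp
// Hypothetical inputs; none of these names come from LLVM.
struct InterleaveCostInputs {
  unsigned WideMemOpCost;  // cost of one load/store of the full interleaved width
  unsigned ShuffleCost;    // cost of the (de)interleaving shuffle(s)
  bool HasFastSegment;     // assumption: target has an optimized segment load/store
};

// Conservative model: the wide memory op is always the baseline, and the
// shuffle cost is added unless the target is known to have fast segment ops.
unsigned interleavedMemoryOpCost(const InterleaveCostInputs &In) {
  unsigned Cost = In.WideMemOpCost;
  if (!In.HasFastSegment)
    Cost += In.ShuffleCost;
  return Cost;
}
```

With this shape, the result never drops below the cost of the plain wide memory op, which is the baseline the comment asks for.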
llvm/test/Transforms/LoopVectorize/RISCV/interleaved-accesses.ll | ||
---|---|---|
5 | Can you add a couple tests for SEW < 64 bits? Also, you should probably add an actual CostModel test rather than relying on indirectly testing this through the vectorizer. |
llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll | ||
---|---|---|
1 ↗ | (On Diff #503785) | Unfortunately, there doesn't seem to be a way to print the result of getInterleavedMemoryOpCost other than via -debug-only=loop-vectorize, i.e. it's not called in getInstructionCost. |
I don't understand this comment, and the code change appears to be unrelated. My LGTM does not hold with the change until this is separated and/or explained.
llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll | ||
---|---|---|
11 ↗ | (On Diff #505650) | @reames Apologies for jumping the gun there, this is the line that fails without that change to getShuffleCost. I've split it out into D146176, but the gist is that this deinterleaved load ends up getting costed as an interleave because it has a mask of <0, 2>. This test case covers it, but I wasn't able to find a way to test it in the aforementioned patch because getInstructionCost doesn't cost shuffles that change length. (Happy to look into that if needed.) |
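One way to see how a mask of <0, 2> can be read both ways is the sketch below. It is an illustration under assumptions, not LLVM's mask-matching code: both helper names are hypothetical, and the interleave helper assumes the common form where the low elements of two equally sized sources are alternated into a result of the same length.

```cpp
#include <vector>

// Deinterleave: select every Factor-th element, starting at Index, from one
// wide vector. The result is shorter than the source.
std::vector<int> deinterleaveMask(unsigned Factor, unsigned Index,
                                  unsigned NumResultElts) {
  std::vector<int> Mask;
  for (unsigned I = 0; I < NumResultElts; ++I)
    Mask.push_back(static_cast<int>(Index + I * Factor));
  return Mask;
}

// Interleave: alternate the low elements of two NumSrcElts-wide sources.
// Indices >= NumSrcElts refer to the second source.
std::vector<int> interleaveLowHalvesMask(unsigned NumSrcElts) {
  std::vector<int> Mask;
  for (unsigned I = 0; I < NumSrcElts; ++I)
    Mask.push_back(static_cast<int>((I % 2) * NumSrcElts + I / 2));
  return Mask;
}
```

For a two-element result both helpers produce <0, 2>, so a check that looks only at the mask and the result length cannot tell the two apart; the difference is the source length (four elements for the deinterleave, two per operand for the interleave). At four elements the masks diverge: <0, 2, 4, 6> versus <0, 4, 1, 5>.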
LGTM (again)
p.s. In this case, leaving the code in this patch would have been fine. It was the lack of explanation/example in your rebase which was problematic.