This patch also fixes up a number of cases in DAGCombine and
SelectionDAGBuilder where the size of a scalable vector is used in a
fixed-width context (thus triggering an assertion failure).
For every place you're adding `if (scalable) return false;`, I'd like to see a comment explaining why we're bailing out.
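For instance, the requested pattern might look like this (a hypothetical example; `MemVT` stands in for whatever value type the surrounding code inspects):

```cpp
// Hypothetical shape of the requested comments:
if (MemVT.isScalableVector()) {
  // Bail out: the logic below reasons about the type's size as a fixed
  // number of bytes, which is unknown at compile time when the vector is
  // scalable (its size is a runtime multiple of vscale).
  return false;
}
```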
llvm/include/llvm/CodeGen/TargetLowering.h
- 1255 ↗ (On Diff #232884): While you're here, indentation?

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
- 6882: Should we have a helper for this pattern?

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
- 1362: This is sort of weird for a method named "SelectAddrModeFrameIndexSVE"; should it not just fail?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
- 9332: Is this necessary?

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
- 2241: This seems sort of confusing. "Scale" here is implicitly multiplied by vl, and there isn't any way for the caller to tell except by checking the opcode.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
- 1221: IsLE? Are we supposed to do something different on big-endian targets?
- 1234: Should we always use PTRUE_B, even for non-byte element sizes, to encourage CSE? Should we prefer to use ldr/str where legal, to take advantage of the larger immediate offset?
- Added convenience function MemoryLocation::getSizeOrUnknown
- Removed isLE predicate from predicate store patterns.
- Added comments explaining why we bail out of a function when the type is a scalable vector.
- Addressed other suggestions to clean up code.
llvm/include/llvm/CodeGen/TargetLowering.h
- 1255 ↗ (On Diff #232884): Good spot!

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
- 6882: Yes, that would be useful. I've added MemoryLocation::getSizeOrUnknown(const TypeSize &).
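For reference, a sketch of what such a helper might look like (the signature comes from the comment above; the body is an assumption about the natural implementation, using the `TypeSize` method names of this era):

```cpp
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/Support/TypeSize.h"
using namespace llvm;

// Sketch: a scalable size has no compile-time byte count, so map it to
// LocationSize::unknown(); fixed sizes remain precise.
static LocationSize getSizeOrUnknown(const TypeSize &T) {
  return T.isScalable() ? LocationSize::unknown()
                        : LocationSize::precise(T.getFixedSize());
}
```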
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
- 1362: Agreed, that should not have been there. Fixed.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
- 9332: No, good catch!

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
- 2241: I'm not sure it's an actual issue in practice, though. Are you suggesting making Scale a TypeSize instead of an unsigned?
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
- 1221: No, that was a misunderstanding on my part. I've removed this now.
- 1234: Our experience is that vectorized loops have most predicates CSEd anyway. For a loop that operates on two lanes, a predicate is often already available and there is no need to introduce an extra ptrue_b. If a loop using floats is vectorized with VF=2, we don't want operations on <vscale x 2 x float> to use ptrue.b, because that would enable operations on all (vscale x) 4 lanes, which may not be valid.
  Using ldr/str would not be endian safe, hence the preference for ST1 (note that the in-memory layout is dictated by the AAPCS when passing vectors by reference). Saving/restoring to/from the stack like this is pretty rare: normal spills and fills will indeed use the STR/LDR instructions, and normal vector loads/stores that don't target a local will likely use other addressing modes such as reg+reg.
llvm/lib/Analysis/Loads.cpp
- 144: "how many bytes are dereferenced".

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
- 1362: I'm not sure how you're proving that "N" is a FrameIndexSDNode here?

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
- 2241: Yes, that would force the callers to explicitly handle scalable types. It looks like some of them don't.
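To illustrate the suggestion, a hypothetical fragment of a getMemOpInfo-style switch (the opcodes and scale values are examples only; `TypeSize::getFixed`/`getScalable` are assumed as the factory functions):

```cpp
// With Scale as a TypeSize instead of an unsigned, scalability travels
// with the value and callers must acknowledge it explicitly.
TypeSize Scale = TypeSize::getFixed(0);
switch (Opc) {
case AArch64::LDRXui:
  Scale = TypeSize::getFixed(8);     // a true 8-byte scale
  break;
case AArch64::LDR_ZXI:
  Scale = TypeSize::getScalable(16); // 16 bytes, implicitly scaled by vscale
  break;
}
// A caller that computes fixed byte offsets can no longer confuse the two:
if (Scale.isScalable())
  return false; // bail out rather than silently miscompute
```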
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
- 1234: Okay, that makes sense. For the CSE thing, we could maybe add an optimization pass after isel if it's necessary.
- Code in SelectAddrModeFrameIndexSVE now checks whether the index is a FrameIndexSDNode (rather than assuming it is one).
- Fixed whitespace and updated comment.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
- 2241: Given that this change propagates through the rest of the code-base, I will do it in a separate patch.
LGTM with one minor suggestion.
Sorry about the delay; I didn't understand the dependency relationship between this and D72758.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
- 15735: This comment doesn't seem quite right. We could theoretically merge two stores if they're both scalable. For example, two <vscale x 8 x i8> stores can be merged into one <vscale x 16 x i8> store; we know <vscale x 16 x i8> is exactly twice as large as <vscale x 8 x i8>. You'd need extra logic for that, though, so I'm not suggesting changing the code.
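The size reasoning can be made concrete with TypeSize (a purely illustrative sketch; method names as in current LLVM):

```cpp
#include "llvm/Support/TypeSize.h"
#include <cassert>
using namespace llvm;

int main() {
  // Store sizes in bytes: both are linear in the same runtime vscale, so
  // their ratio is a compile-time constant even though neither absolute
  // size is known until runtime.
  TypeSize Half = TypeSize::getScalable(8);  // <vscale x 8 x i8>
  TypeSize Full = TypeSize::getScalable(16); // <vscale x 16 x i8>
  assert(Full.isScalable() == Half.isScalable() &&
         Full.getKnownMinValue() == 2 * Half.getKnownMinValue());
  return 0;
}
```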
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
- 1267: nxv2i1 has the same memory layout as nxv16i1? I guess that makes sense given the available instructions. We might need to modify the datalayout to make that work properly; I think, without any explicit guidance from the layout string, it will assume an nxv2i1 load reads "vscale" bytes, not "vscale * 2" bytes. Not something to change in this patch, of course.
No worries, thanks for reviewing!
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
- 15735: You're right, I will change the comment!

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
- 1267: DataLayout assumes that each boolean has a memory size of i8, as each predicate needs to be individually addressable, which leads to storesize(<vscale x 2 x i1>) == storesize(<vscale x 2 x i8>).
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
- 1267: I should probably also point out that there is no interface for users to read/write these predicates other than through svbool_t, which is an opaque type, so I don't think there is any need to expand the store of nxv16i1 to a store of nxv16i8.
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
- 1267: The whole area is still messy, unfortunately. Like I stated before, the "store size" for vectors assumes the bits are tightly packed, and for non-scalable vectors, SelectionDAG legalization assumes the bits are tightly packed. (I think we fixed all the legalization routines to be consistent with this.) And for AVX-512, loads and stores of <16 x i1> etc. are lowered to bit-packed operations (kmovw).
  I just did some quick tests, though, and unfortunately it looks like the alignment (and therefore the allocation size) is messed up. The alignment of vectors is currently based on the alignment of the element type, not the size of the vector, so it's much larger than the store size for <N x i1>, unless the store size is exactly 64 or 128 bits, in which case the alignment is 64 or 128 bits respectively. Probably someone needs to spend more time in this area at some point.
"how many bytes are dereferenced".