This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Add patterns for unpredicated load/store to frame-indices.
ClosedPublic

Authored by sdesmalen on Dec 9 2019, 10:04 AM.

Details

Summary

This patch also fixes up a number of cases in DAGCombine and
SelectionDAGBuilder where the size of a scalable vector is used in a
fixed-width context (thus triggering an assertion failure).

Diff Detail

Event Timeline

sdesmalen created this revision.Dec 9 2019, 10:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2019, 10:04 AM

For every place you're adding if (scalable) return false;, I'd like to see a comment explaining why we're bailing out.

llvm/include/llvm/CodeGen/TargetLowering.h
1255

While you're here, indentation?

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
6852

Should we have a helper for this pattern?

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1326

This is sort of weird for a method named "SelectAddrModeFrameIndexSVE"; should it not just fail?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9186

Is this necessary?

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2237

This seems sort of confusing. "Scale" here is implicitly multiplied by vl, and there's isn't any way for the caller to tell except by checking the opcode.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
1165

IsLE? Are we supposed to do something different on big-endian targets?

1178

Should we always use PTRUE_B, even for non-byte element sizes, to encourage CSE?

Should we prefer to use ldr/str where legal, to take advantage of the larger immediate offset?

sdesmalen updated this revision to Diff 233449.Dec 11 2019, 2:28 PM
sdesmalen marked 8 inline comments as done.
  • Added convenience function MemoryLocation::getSizeOrUnknown
  • Removed isLE predicate from predicate store patterns.
  • Added comments explaining why we bail out of a function when the type is a scalable vector.
  • Addressed other suggestions to clean up code.
sdesmalen added inline comments.Dec 11 2019, 2:29 PM
llvm/include/llvm/CodeGen/TargetLowering.h
1255

Good spot!

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
6852

Yes, that would be useful. I've added MemoryLocation::getSizeOrUnknown(const TypeSize &)

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1326

Agreed, that should not have been there. Fixed.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9186

No, good catch!

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2237

I'm not sure if is an actual issue in practice though. Are you suggesting to make Scale a TypeSize instead of an unsigned?

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
1165

No, that was a misunderstanding on my part. I've removed this now.

1178

Should we always use PTRUE_B, even for non-byte element sizes, to encourage CSE?

Our experience is that vectorized loops have most predicates CSEd anyway. For a loop that operates on two lanes, often a predicate is already available and there is no need to introduce an extra ptrue_b. If a loop using floats is vectorized with VF=2, we don't want operations on <vscale x 2 x float> to use ptrue.b because that would enable operations on all (vscale x) 4 lanes, which may not be valid.

Should we prefer to use ldr/str where legal, to take advantage of the larger immediate offset?

That would not be endian safe, hence the preference to use ST1 (note that the order is dictated by the AAPCS for when passing the vectors by reference). This case of saving/restoring to/from the stack like this is pretty rare. Normal spills and fills will indeed use the STR/LDR instructions. And normal load/store vector instructions that are not storing to a local will likely use other addressing modes like reg+reg.

efriedma added inline comments.Dec 11 2019, 5:42 PM
llvm/lib/Analysis/Loads.cpp
144

"how many bytes are dereferenced".

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1326

I'm not sure how you're proving that "N" is a FrameIndexSDNode here?

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2237

Yes, that would force the callers to explicitly handle scalable types. It looks like some of them don't.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
1178

Okay, that makes sense. For the CSE thing, we could maybe add an optimization pass after isel if it's necessary.

sdesmalen updated this revision to Diff 237670.Jan 13 2020, 7:14 AM
sdesmalen marked 11 inline comments as done.
  • Code in SelectAddrModeFrameIndexSVE now checks if index is a FrameIndexSDNode (rather than assume it is one).
  • Fixed whitespace and updated comment.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2237

Given that this is a change propagates through the rest of the code-base, I will do this in a separate patch.

sdesmalen updated this revision to Diff 237674.Jan 13 2020, 7:19 AM

added context to the patch

sdesmalen marked 2 inline comments as done.Jan 15 2020, 3:46 AM
sdesmalen added inline comments.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2237

I've implemented this change in D72758.

sdesmalen marked an inline comment as done.Jan 21 2020, 11:18 AM

Gentle ping (I think I've addressed all comments on the patch)

efriedma accepted this revision.Jan 21 2020, 1:16 PM

LGTM with one minor suggestion.

Sorry about the delay; I didn't understand the dependency relationship between this and D72758.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15695

This comment doesn't seem quite right. We could theoretically merge two stores if they're both scalable. For example, two <vscale x 8 x i8> stores can be merged to one <vscale x 16 x i8> store; we know <vscale x 16 x i8> is exactly twice as large as <vscale x 8 x i8>.

You'd need extra logic for that, though, so I'm not suggesting changing the code.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
1202

nxv2i1 has the same memory layout as nxv16i1? I guess that makes sense given the available instructions. We might need to modify the datalayout to make that work properly; I think, without any explicit guidance from the layout string, it will assume a nxv2i1 load reads "vscale" bytes, not "vscale * 2" bytes.

Not something to change in this patch, of course.

This revision is now accepted and ready to land.Jan 21 2020, 1:16 PM
sdesmalen marked 2 inline comments as done.Jan 22 2020, 4:29 AM

LGTM with one minor suggestion.

Sorry about the delay; I didn't understand the dependency relationship between this and D72758.

No worries, thanks for reviewing!

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15695

You're right, I will change the comment!

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
1202

DataLayout assumes that each boolean has a memory size of i8 as each predicate needs to be individually addressable, which leads to storesize(<vscale x 2 x i1>) == storesize(<vscale x 2 x i8>).
This also means that alloca's of predicates may allocate too much stack space depending on the number of elements. The generated code is correct because all the offsets scale accordingly; you can see this for example in spill_nxv16i1 and spill_nxv2i1. The former (nxv16i1) allocates the sizeof two nxv16i8 vectors and loads the second predicate from offset 8 [* sizeof(predicate)], where the latter allocates the sizeof one nxv16i8 vector, and loads the second predicate from offset 2 [* sizeof(predicate)]. (sizeof(predicate) = (vscale * 2 bytes))
This is different from spills introduced by e.g. the register allocator, where LLVM allocates space for the size of an (otherwise opaque) predicate register, set to 2 bytes.

sdesmalen marked an inline comment as done.Jan 22 2020, 5:47 AM
sdesmalen added inline comments.
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
1202

I should probably also point out that there is no other interface for users to read/write these predicates other than through svbool_t, which is an opaque type, so I don't think there is any need to expand the store of nxv16i1 to a store of nxv16i8.

This revision was automatically updated to reflect the committed changes.
efriedma added inline comments.Jan 22 2020, 1:51 PM
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
1202

DataLayout assumes that each boolean has a memory size of i8 as each predicate needs to be individually addressable

The whole area is still messy, unfortunately.

Like I stated before, the "store size" for vectors assumes the bits are tightly packed. For non-scalable vectors, SelectionDAG legalization assumes the bits are tightly packed. (I think we fixed all the legalization routines to be consistent with this.) And for AVX-512, loads and stores of <16 x i1> etc. are lowered to bit-packed operations (kmovw).

I just did some quick tests, though, and unfortunately, it looks like the alignment (and therefore the allocation size) is messed up. The alignment of vectors is currently based on the alignment of the element type, not the size of the vector, so it's much larger than the store size for <N x i1>. Unless the store size is exactly 64 or 128 bits wide, in which case the alignment is 64/128 bits respectively.

Probably someone needs to spend more time in this area at some point.