This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Expand unaligned fixed-length vector memory accesses
ClosedPublic

Authored by frasercrmck on May 14 2021, 6:27 AM.

Details

Summary

RVV vectors must be aligned to their element types, so anything less is
unaligned.

For regular loads and stores, our custom-lowering of fixed-length
vectors meant that we opted out of LegalizeDAG's built-in unaligned
expansion. This patch adds that logic in to our custom lower function.

For masked intrinsics, we declare that anything unaligned is not legal,
leaving the ScalarizeMaskedMemIntrin pass to do the expansion for us.

Note that neither of these methods can handle the expansion of
scalable-vector memory ops, so those cases are left alone by this patch.
Scalable loads and stores already go through expansion by default but
hit an assertion, and scalable masked intrinsics will silently generate
incorrect code. It may be prudent to return an error in both of these
cases.

Diff Detail

Event Timeline

frasercrmck created this revision.May 14 2021, 6:27 AM
frasercrmck requested review of this revision.May 14 2021, 6:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 14 2021, 6:27 AM
frasercrmck added inline comments.May 14 2021, 6:29 AM
llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll
8–9

I just updated these tests naively, @craig.topper, but since I suspect the intention was only to test well-aligned values, perhaps we should add additional cases?

craig.topper added inline comments.May 14 2021, 10:05 AM
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
107

Does the project intend to eventually remove the implicit cast to unsigned from TypeSize? If so should this call getFixedSize() before doing the compare?

llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll
8–9

Yeah I think we should have additional tests. I'm not sure why I got the alignment right on some tests but not others.

  • rebase
  • avoid implicit TypeSize -> unsigned cast
  • add explicit costmodel tests for unaligned
frasercrmck marked 2 inline comments as done.May 17 2021, 7:36 AM
frasercrmck added inline comments.
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
107

Good spot! My understanding is that the implicit cast is not long for this world. I've added the call to getFixedSize().

llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll
8–9

I've added those now. I pre-committed the updated alignment for these existing cases.

This revision is now accepted and ready to land.May 18 2021, 10:11 AM

Maybe I'm missing something, but <16 x i8> is always aligned, right? Can you convert a <4 x i32> load to a <16 x i8> load, and bitcast the result?

Maybe I'm missing something, but <16 x i8> is always aligned, right? Can you convert a <4 x i32> load to a <16 x i8> load, and bitcast the result?

I hadn't actually considered that. It should work in theory, but we do currently have a cap on the size of the legal vector types so this wouldn't currently work with something like <128 x i32>. We could maybe do it pre-legalization, but I don't know if it'd get legalized right back to the unaligned version. The good news it'd presumably work for scalable vectors too. Do you know of other targets doing this?

Getting masked intrinsics to work via bitcasts may take a bit of doing though. I suppose you'd need to shuffle the indices and add on extra byte offsets like <0,1,2,3,0,1,...>. I don't know if that's safe to do since you may risk overflow.

@craig.topper can you see anything I've missed here?

craig.topper added a comment.EditedMay 19 2021, 10:52 AM

Maybe I'm missing something, but <16 x i8> is always aligned, right? Can you convert a <4 x i32> load to a <16 x i8> load, and bitcast the result?

I hadn't actually considered that. It should work in theory, but we do currently have a cap on the size of the legal vector types so this wouldn't currently work with something like <128 x i32>. We could maybe do it pre-legalization, but I don't know if it'd get legalized right back to the unaligned version. The good news it'd presumably work for scalable vectors too. Do you know of other targets doing this?

Can we add the missing MVT types or cap the vXi32/i64 vectors we support to the same total with as the longest vXi8 type?

Getting masked intrinsics to work via bitcasts may take a bit of doing though. I suppose you'd need to shuffle the indices and add on extra byte offsets like <0,1,2,3,0,1,...>. I don't know if that's safe to do since you may risk overflow.

You would also need to duplicate bits in the mask which isn't straight forward.

@craig.topper can you see anything I've missed here?

The regular load/store case is the most interesting because InstCombine merges bitcasts into loads/stores if I recall correctly. So it might create a bad case without realizing it.

Do you know of other targets doing this?

We end up doing this sort of thing on ARM in some cases. For example https://reviews.llvm.org/D100527 , https://reviews.llvm.org/D70176 .

I hadn't actually considered that. It should work in theory, but we do currently have a cap on the size of the legal vector types so this wouldn't currently work with something like <128 x i32>. We could maybe do it pre-legalization, but I don't know if it'd get legalized right back to the unaligned version. The good news it'd presumably work for scalable vectors too. Do you know of other targets doing this?

Can we add the missing MVT types or cap the vXi32/i64 vectors we support to the same total with as the longest vXi8 type?

I have a task to look into this which I may bring forward. I think this is what we want to see, I just hope we don't have too many legalization issues at the type "limits" like that shuffle I initially tried to fix by legalizing any_extend_vector_inreg.

Getting masked intrinsics to work via bitcasts may take a bit of doing though. I suppose you'd need to shuffle the indices and add on extra byte offsets like <0,1,2,3,0,1,...>. I don't know if that's safe to do since you may risk overflow.

You would also need to duplicate bits in the mask which isn't straight forward.

Indeed.

@craig.topper can you see anything I've missed here?

The regular load/store case is the most interesting because InstCombine merges bitcasts into loads/stores if I recall correctly. So it might create a bad case without realizing it.

Yeah. Though the test case that uncovered this bug for us is a masked scatter, but that was the vectorizer's choice which is slightly different.

We end up doing this sort of thing on ARM in some cases. For example https://reviews.llvm.org/D100527 , https://reviews.llvm.org/D70176 .

Thanks for the pointers! Pun -- sadly -- intended.

I think if it's okay by you, @craig.topper, I'll merge this as-is (since we need the bug fix) and start working on the improvement patch. It seems like we may need to get some other changes in before the bitcasting will work across the board.

  • rebase for test changes
Matt added a subscriber: Matt.May 27 2021, 9:46 AM

Ping @craig.topper, just making sure you're okay with this patch as-is before diving into the next stage.

I'm going to start looking into a total-size cap but I think we'll need extra MVTs: with the max supported MVT size of v128i8 we'd be reducing our legal i64 vectors from v256i64 to v32i64 which feels like a sharp drop.

Perhaps this is something to discuss more broadly with the RISC-V group. What's a sensible max fixed-length size for RVV? We could go on adding new MVTs for a long time until everything gets fed up with us hogging the enum values. 512b vectors? 1Kb? 4Kb? Though I've just checked and an unspecified out-of-tree backend was able to add v128i32 up to v2048i32 so maybe there's more wiggle room than I thought.

Ping @craig.topper, just making sure you're okay with this patch as-is before diving into the next stage.

I'm going to start looking into a total-size cap but I think we'll need extra MVTs: with the max supported MVT size of v128i8 we'd be reducing our legal i64 vectors from v256i64 to v32i64 which feels like a sharp drop.

Perhaps this is something to discuss more broadly with the RISC-V group. What's a sensible max fixed-length size for RVV? We could go on adding new MVTs for a long time until everything gets fed up with us hogging the enum values. 512b vectors? 1Kb? 4Kb? Though I've just checked and an unspecified out-of-tree backend was able to add v128i32 up to v2048i32 so maybe there's more wiggle room than I thought.

I'm ok merging this as-is.

  • rebase
  • update test checks