This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Use scalar stores for splats of zero to memory up to XLen
ClosedPublic

Authored by reames on May 16 2023, 1:39 PM.

Details

Summary

The direct motivation here is to undo an unprofitable vectorization performed by SLP, but the transform seems generally useful as well. If we are storing a splat of zero to memory, we can use a single scalar store (from X0) for all power-of-two sizes up to XLen.
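
As a rough sketch only (not the code from this patch), the combine could look something like the fragment below, assumed to live inside the RISC-V store combine in RISCVISelLowering.cpp; N, DAG, and Subtarget come from the surrounding combine, and real guards (e.g. skipping i1 element types, scalable vectors, and non-simple stores) are omitted:

// Sketch: if the stored value is an all-zeros fixed-length vector whose
// total size is a power of two no larger than XLen, store an integer zero
// of the same width instead; the scalar zero materializes as X0.
StoreSDNode *Store = cast<StoreSDNode>(N);
SDValue Val = Store->getValue();
MVT VT = Val.getSimpleValueType();
SDLoc DL(N);
if (VT.isFixedLengthVector() && ISD::isBuildVectorAllZeros(Val.getNode())) {
  uint64_t Bits = VT.getFixedSizeInBits();
  if (Bits <= Subtarget.getXLen() && isPowerOf2_64(Bits)) {
    MVT NewVT = MVT::getIntegerVT(Bits);
    SDValue Zero = DAG.getConstant(0, DL, NewVT);
    return DAG.getStore(Store->getChain(), DL, Zero, Store->getBasePtr(),
                        Store->getPointerInfo(), Store->getOriginalAlign(),
                        Store->getMemOperand()->getFlags());
  }
}

On RV64, for example, a store of <2 x i32> zeroinitializer would then become a single sd of x0 instead of a vector store.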

Note: We can extend this transform in a bunch of ways. I'm deliberately starting narrow to keep the focus on the open questions raised in the discussion below.

Diff Detail

Event Timeline

reames created this revision. May 16 2023, 1:39 PM
Herald added a project: Restricted Project. · View Herald Transcript · May 16 2023, 1:39 PM
reames requested review of this revision. May 16 2023, 1:39 PM
craig.topper added inline comments. May 16 2023, 5:55 PM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12176

Did you intend to leave FLAGIT in here?

reames added inline comments. May 16 2023, 6:11 PM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12176

Nope. :) That's my usual easy-to-grep flag for local work; will remove.

reames updated this revision to Diff 522869. May 16 2023, 6:16 PM
reames updated this revision to Diff 522873. May 16 2023, 6:28 PM
reames retitled this revision from [RISCV] Use scalar stores for splats of zero to memory upto 2 x XLen to [RISCV] Use scalar stores for splats of zero to memory up to XLen.
reames edited the summary of this revision.

While looking at something else, I noticed that DAG combine store merging will produce the <2 x i64> vector stores that this patch is splitting. The particular test case came from memset lowering. Oddly, this apparent conflict seems to work out, but it feels like we should really fix the store-merging preference before handling the two-store part of this.

(The following was removed from the review description because it no longer applies; it is saved here only for my later reference.)

One concern here is that splitting the store into two instructions could introduce a store-to-load forwarding stall. I think this is worth doing anyway, but what do others think? Should we restrict this to operations of at most XLEN bits, which can be done in a single store?

This transform potentially increases the number of memory operations (from 1 to 2). The same restriction mentioned above would resolve this as well.

craig.topper accepted this revision. May 16 2023, 8:09 PM

This seems reasonable to me.

This revision is now accepted and ready to land. May 16 2023, 8:09 PM
luke added inline comments. May 17 2023, 2:24 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12189

Do we want to check whether the memory access is fast? Currently we always report unaligned scalar accesses as slow, but that seems at odds with this combine:

bool RISCVTargetLowering::allowsMisalignedMemoryAccesses(
    EVT VT, unsigned AddrSpace, Align Alignment, MachineMemOperand::Flags Flags,
    unsigned *Fast) const {
  if (!VT.isVector()) {
    if (Fast)
      *Fast = 0;
    return Subtarget.enableUnalignedScalarMem();
  }
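
A guard along those lines might look roughly like the following sketch; NewVT and Store are assumed from the surrounding combine, and the exact query is illustrative rather than taken from the patch:

// Sketch: only fold to a scalar store if the target reports the (possibly
// misaligned) scalar access as fast; otherwise keep the original vector store.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
unsigned Fast = 0;
if (!TLI.allowsMemoryAccessForAlignment(*DAG.getContext(), DAG.getDataLayout(),
                                        NewVT, *Store->getMemOperand(), &Fast) ||
    !Fast)
  return SDValue();
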
asb added inline comments. May 17 2023, 4:14 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12189

Luke and I chatted about this earlier today; the specific case would be where a vector access that is aligned to the natural alignment of its elements is converted to a wider scalar access that is misaligned. Though now that I've reminded myself of how +unaligned-scalar-mem is used (https://reviews.llvm.org/D126085), this is probably not a concern, as it's only enabled when unaligned scalar memory access is performant enough to be worth using.

reames added inline comments. May 17 2023, 7:26 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12189

Honestly, this whole area is a bit of a mess. What exactly the Fast flag means (it's an unsigned, so it has many possible states) is very poorly defined. It's also extremely target specific but interacts with generic codegen, so it's the best kind of fun.

At the moment, we're interpreting it as basically a hint that misaligned accesses should be aggressively formed, as opposed to simply being used when they arise naturally during lowering.

In this case, I think moving forward without Fast is the right practical answer, but this is a bit of a code smell.

Thanks for pointing this out; I hadn't considered it. I'm going to be looking at a couple of other aspects of mem-op lowering, and I'll see if I can clean this up.