This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Convert vector store to scalar store if only one element updated
Abandoned · Public

Authored by qiucf on Dec 23 2019, 12:23 AM.

Details

Reviewers
spatel
nemanjai
andrewrk
fhahn
efriedma
bogner
lebedev.ri
jdoerfert
Group Reviewers
Restricted Project
Summary

This is a simplified version of https://reviews.llvm.org/D70223. Since we can at least be confident that a single-element store won't be worse than a vector store, it's clearer to put this transform into InstCombine. InstCombine already has logic that turns a shufflevector affecting only one element into an insertelement, so we can handle just that single-element case here.
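
To make the summary concrete, here is a minimal C++ sketch of the fold in InstCombine style; the variable names (SI, Vec, Elt, IdxC) and the surrounding plumbing are illustrative assumptions rather than the exact patch, and every safety check (aliasing, same block, no intervening writes) is elided here and covered in the review below.

// Sketch only: fold `store(insertelement(load %p, %s, C), %p)` into a scalar
// store of %s through an element GEP. Assumes `using namespace PatternMatch;`
// as elsewhere in InstCombine; SI is the visited StoreInst.
Value *Vec, *Elt;
ConstantInt *IdxC;
if (match(SI.getValueOperand(),
          m_InsertElt(m_Value(Vec), m_Value(Elt), m_ConstantInt(IdxC)))) {
  auto *Load = dyn_cast<LoadInst>(Vec);
  if (Load && Load->getPointerOperand() == SI.getPointerOperand()) {
    IRBuilder<> Builder(&SI);
    Value *GEP = Builder.CreateInBoundsGEP(
        Load->getType(), SI.getPointerOperand(),
        {Builder.getInt64(0), IdxC});
    Builder.CreateStore(Elt, GEP);
    // ...and the original vector store would then be erased.
  }
}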

Diff Detail

Event Timeline

qiucf created this revision.Dec 23 2019, 12:23 AM
RKSimon added subscribers: jdoerfert, RKSimon.

@spatel @jdoerfert Should/could we use the Attributor to keep track of the 'storable' memory range?

llvm/test/Transforms/InstCombine/single-element-store.ll
1 ↗(On Diff #235102)

Regenerate with update_test_checks.py ?

qiucf updated this revision to Diff 235104.Dec 23 2019, 12:52 AM

Updated the test using the auto-generation tool.

qiucf marked an inline comment as done.Dec 23 2019, 12:52 AM
lebedev.ri added inline comments.Dec 23 2019, 12:55 AM
llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1200

I'd strongly suggest simply creating a new store.

llvm/test/Transforms/InstCombine/single-element-store.ll
8 ↗(On Diff #235104)

The alignment doesn't seem correct to me?

spatel requested changes to this revision.Dec 23 2019, 5:47 AM

InstCombine has some minimal load/store transforms, so this might be ok to add here, but InstCombine is probably not a pass that should try to do anything more complicated with memory ops.

This patch has the same requirements that were requested here:
https://reviews.llvm.org/D70223#1755719
...but it does not implement those.

In other words, the following tests will be miscompiled (and so they should be added to trunk independently before this patch tries to go any further):

define void @insert_store_addr(<16 x i8>* %p, <16 x i8>* %q, i8 %s) {
  %ld = load <16 x i8>, <16 x i8>* %p
  %ins = insertelement <16 x i8> %ld, i8 %s, i32 3
  store <16 x i8> %ins, <16 x i8>* %q ; store to different address
  ret void
}

define void @insert_store_mem_mod(<16 x i8>* %p, <16 x i8>* %q, i8 %s) {
  %ld = load <16 x i8>, <16 x i8>* %p
  store <16 x i8> zeroinitializer, <16 x i8>* %q ; do pointers alias?
  %ins = insertelement <16 x i8> %ld, i8 %s, i32 3
  store <16 x i8> %ins, <16 x i8>* %p
  ret void
}
This revision now requires changes to proceed.Dec 23 2019, 5:47 AM
qiucf updated this revision to Diff 235318.Dec 25 2019, 11:18 PM

Addressed comments:

  • Add check for address of load and store.
  • Add check for any memory write instructions between load and store.
  • Add more test cases for cases above. (Thanks to spatel)
qiucf marked an inline comment as done.Dec 25 2019, 11:18 PM
lkail added a subscriber: lkail.Dec 26 2019, 12:01 AM
lkail added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1196

I'm not an expert on opt; I just wonder whether it would be more proper to check that Load->getPointerOperand() MustAlias SI.getPointerOperand(), since there might be bitcasts of these pointers?

@spatel @jdoerfert Should/could we use the Attributor to keep track of the 'storable' memory range?

For simple examples like this we already do. Run the Attributor on them and you get dereferenceable annotations for the argument pointers.
Nevertheless, we can lose information. My suggestion for keeping it is to emit something like this:

`call void @llvm.assume(i1 true) ["dereferenceable"(%ptr, sizeof(access-we-just-removed))]`

I do have an RFC on the mailing list on this topic, but no conclusion yet. If we decide to go forward with it, we would start emitting these, and we would write a combiner pass as well as integration into things like the Attributor to make use of the information.

Long story short, I think doing this is fine; once we have better assume capabilities, we should emit one to keep the information about dereferenceability.
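
For illustration, a hedged sketch of how such an assume could be emitted in C++; the bundle tag follows the comment above, and the helper names are assumptions rather than part of this patch:

// Sketch: record that the store's pointer stays dereferenceable for the bytes
// the removed vector access covered, using an assume with an operand bundle,
// e.g. `call void @llvm.assume(i1 true) ["dereferenceable"(%ptr, i64 16)]`.
Module *M = SI.getModule();
const DataLayout &DL = M->getDataLayout();
Function *AssumeFn = Intrinsic::getDeclaration(M, Intrinsic::assume);
IRBuilder<> Builder(&SI);
Value *Bytes =
    Builder.getInt64(DL.getTypeStoreSize(SI.getValueOperand()->getType()));
SmallVector<Value *, 2> BundleArgs = {SI.getPointerOperand(), Bytes};
OperandBundleDef Deref("dereferenceable", BundleArgs);
CallInst *Assume = CallInst::Create(AssumeFn, {Builder.getTrue()}, {Deref});
Assume->insertBefore(&SI);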

llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1196

I'd just call stripPointerCastsSameRepresentation() on both pointers. That should give you all the cases you care about (w.r.t. casts). Though the match above might actually take care of it.
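
As a small illustration of that suggestion (a sketch; Load and SI follow the earlier snippets):

// Compare the underlying pointers, looking through casts that do not change
// the representation, and bail out if they might differ.
if (Load->getPointerOperand()->stripPointerCastsSameRepresentation() !=
    SI.getPointerOperand()->stripPointerCastsSameRepresentation())
  return nullptr;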

1216

The GEP should be inbounds and the store should keep the nontemporal flag if present on the original.
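
The inbounds GEP is already shown in the sketch under the summary; carrying the flag over might look roughly like this (a sketch assuming NewSI is the newly created scalar store):

// Copy !nontemporal from the original vector store to the new scalar store,
// if the original carried it.
if (MDNode *NT = SI.getMetadata(LLVMContext::MD_nontemporal))
  NewSI->setMetadata(LLVMContext::MD_nontemporal, NT);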

qiucf updated this revision to Diff 235556.Dec 29 2019, 11:24 PM
  • Create the store through an inbounds GEP
  • Strip pointer casts before comparing the pointers
  • Make sure the load and store belong to the same BB
  • Keep the store's nontemporal metadata
jdoerfert resigned from this revision.Dec 29 2019, 11:32 PM

I am generally fine with this but others are reviewing this change already.

qiucf marked 3 inline comments as done.Jan 4 2020, 8:23 AM

Ping..

spatel added a comment.Jan 6 2020, 9:02 AM

I think this is safe now (although limited to a single basic block), but I don't have enough experience with memop transforms to confidently approve. Please get a 2nd look from @lebedev.ri @jdoerfert @efriedma or someone else.

I'm also not sure if there's a better way to preserve metadata. For example, we have "copyMetadataForLoad()" - is there something like that that we can use here?

llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1193

Using 'auto *' is recommended with dyn_cast.

1213–1217
llvm/test/Transforms/InstCombine/load-insert-store.ll
7

Need to regenerate the CHECK lines? All tests should have 'inbounds' on the GEP, right?

70

Please add a test comment to explain the 2 transforms. Something like:

; p and q may alias, so it is not safe to remove the first load.
; r is known to not alias the other pointers, so it is safe to remove the second load.

I think you need to check the elements are byte-sized?
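
One hedged way to encode that byte-size check (assuming DL is the module's DataLayout, as in InstCombine; the eventual patch may phrase it differently):

// Bail out unless the element's size in bits equals its in-memory store size,
// so the scalar store never has to write padding bits (e.g. reject i1 or i4
// elements).
Type *EltTy = cast<VectorType>(Load->getType())->getElementType();
if (!DL.typeSizeEqualsStoreSize(EltTy))
  return nullptr;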

I'm a little concerned we're doing this too early, and it will end up blocking optimizations like GVN (since it can't reason about the partially overlapping store). Not sure how likely that is to trigger in practice, though.

llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1204

AliasAnalysis has a dedicated method getModRefInfo() to compute the exact property you want without caring about the specific kind of instruction.
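
A rough sketch of that scan, assuming the load and store have already been confirmed to be in the same basic block (names follow the earlier snippets):

// Walk the instructions between the load and the store; give up if any of them
// may modify the memory the store writes. getModRefInfo covers calls, fences,
// and other memory-writing instructions uniformly.
MemoryLocation StoreLoc = MemoryLocation::get(&SI);
for (Instruction *I = Load->getNextNode(); I != &SI; I = I->getNextNode())
  if (isModSet(AA->getModRefInfo(I, StoreLoc)))
    return nullptr;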

qiucf updated this revision to Diff 236816.Jan 8 2020, 6:59 AM
qiucf marked 5 inline comments as done.

Address some comments.

  • Change some uses of auto.
  • Update test cases with comments.
  • Use getModRefInfo.

Unit tests: unknown.

clang-tidy: unknown.

clang-format: unknown.

Build artifacts: diff.json, console-log.txt

qiucf marked an inline comment as done.Jan 8 2020, 7:56 AM

The merge check bot seems to have problems resolving parent-child revisions when some of them are already committed. Currently, the patches apply to master without problems and the tests pass.

llvm/test/Transforms/InstCombine/load-insert-store.ll
7

Sure. In some cases the inbounds might have been simplified away; now we have them :-)

lebedev.ri requested changes to this revision.Jan 8 2020, 8:07 AM
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1189

Match m_Load() specifically?

1204
  1. Is it a bitfield? Should we be checking if (AA->getModRefInfo(&*BBI, MemoryLocation::get(&SI)) & ModRefInfo::Mod) ?
  2. This does work correctly for calls too, right, not just plain instructions?
This revision now requires changes to proceed.Jan 8 2020, 8:07 AM
qiucf updated this revision to Diff 237891.Jan 14 2020, 2:18 AM
qiucf marked 2 inline comments as done.
  • Use the dedicated method to check ModRefInfo.
  • Add tests involving calls.
llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1189

I think m_Load() is used for matching a load from some specific address; m_Load(Src) binds Src to the address of the load. But here we need the load instruction itself to do more checks. Is that necessary?

Unit tests: pass. 61802 tests passed, 0 failed and 781 were skipped.

clang-tidy: unknown.

clang-format: pass.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

lebedev.ri added inline comments.Jan 14 2020, 2:39 AM
llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1201–1203

Hm, how do we know both the load and the store are in the same basic block?

1204

This looks better, but I don't have prior experience with ModRefInfo, so this still looks suspicious to me - is this conservatively correct? isModSet() looks for MustMod, which seems stronger than Mod?

lebedev.ri requested changes to this revision.Jan 14 2020, 8:40 AM

(marking as reviewed)

llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1201–1203

(I would expect that there should either already be a function to do this check, or that this should be refactored into a helper function from the get-go.)

This revision now requires changes to proceed.Jan 14 2020, 8:40 AM
qiucf updated this revision to Diff 238436.Jan 16 2020, 2:40 AM
qiucf marked 4 inline comments as done.
  • Wrap the memory-modification check into a single method (which other code can reuse)
llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1201–1203

We check the parents of Load and SI above; if they aren't the same, we exit.

1204

Hmm, as the doc/comment says, I think this bit indicates that the memory location may/must be modified, and there's a separate bit for whether it's a 'must' or a 'may'. So the three bits should respectively mean May, Mod, and Ref.

So Mod means it may write and no must-alias was found, while MustMod means it may write and a must-alias was found. Here we only need to care whether it may write at all (the Mod bit), regardless of aliasing.

Maybe I'm not correct. I found a related commit, rG50db8a, worth a look :)

Unit tests: pass. 61912 tests passed, 0 failed and 783 were skipped.

clang-tidy: unknown.

clang-format: pass.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

lebedev.ri resigned from this revision.Jan 25 2020, 3:19 AM
lebedev.ri requested changes to this revision.Jun 8 2020, 9:55 AM

Hm, I don't fully recall the story here, but @qiucf, are you still interested in this? I think this looks about right now, with some nits.

@spatel, @jdoerfert: any other thoughts?

llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1200

We indeed can't easily (if at all?) handle the non-constant case. But there are no tests showing that we don't transform when the index is a variable.

Likewise, I suspect constant expressions will give us trouble here, so I strongly recommend using m_ConstantInt().

1203

I think we are better off early-returning instead.

1218

This needs an explicit Align(1) now. Though I think at least for constant offsets we could deduce the alignment, something close to:

commonAlignment(
    SI.getAlign(),
    Idx->getUniqueInteger().getLimitedValue() *
        (NewElement->getType()->getScalarSizeInBits() / 8))

But that can be done in a follow-up patch.

llvm/test/Transforms/InstCombine/load-insert-store.ll
2

I wasn't sure whether or not this is endian-sensitive; alive2 says it's not, but maybe we should be proactive and do:

; RUN: opt < %s -instcombine -S -data-layout=e | FileCheck %s --check-prefix=ALL --check-prefixes=CHECK,LE
; RUN: opt < %s -instcombine -S -data-layout=E | FileCheck %s --check-prefix=ALL --check-prefixes=CHECK,BE
4

Add same test but with <8 x i16> (so it's not byte-sized)?

This revision now requires changes to proceed.Jun 8 2020, 9:55 AM
lebedev.ri added inline comments.Jun 8 2020, 9:57 AM
llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1184–1185

Do we care whether these are single-use?

One thing I've realized reading this thread again is that it's not only the compiler that can get confused by a wrong-width store; the CPU itself can also run into issues with store->load forwarding. See recent discussion http://lists.llvm.org/pipermail/llvm-dev/2020-May/141837.html . So whether this transform is worthwhile might depend on the context.

Right. And if we know problematic context/conditions, we can always perform reverse fold later on, in backend. Correct?

The C memory model doesn't allow reversing the fold in a lot of cases without encoding more information in the IR. Also, I'm not sure what the heuristic looks like, so I'm not sure how hard it is to compute in the backend.

I feel there are opportunities here as our vector handling needs to improve a lot.
Though, without more data to decide (1) where to put this (in the pipeline) and (2) when to enable this (platform, context,...), it is hard to say for sure.
Generally, I would like to have this in-tree (potentially under a flag).

I think the code looks good, though we can add FIXMEs for missing parts, e.g., better alignment, the load version of this, ...

spatel added a comment.Jun 9 2020, 6:13 AM

We have -vector-combine now, and it only runs late in the pipeline. If we put this transform in there, that should bypass concerns about interfering with GVN. That pass also has access to the cost model, so we could try to limit the transform if there are known cases where it would not be profitable (ie, no need to try to implement a codegen reversal).
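
A hedged sketch of what such a cost gate in VectorCombine could look like; the TTI hooks and cost kind used here are assumptions for illustration, not necessarily what the eventual patch does:

// Only narrow the store if a scalar element store is no costlier than the
// original vector store, according to the target's cost model.
Type *VecTy = SI.getValueOperand()->getType();
Type *EltTy = cast<VectorType>(VecTy)->getElementType();
unsigned AS = SI.getPointerAddressSpace();
InstructionCost OldCost = TTI.getMemoryOpCost(
    Instruction::Store, VecTy, SI.getAlign(), AS,
    TargetTransformInfo::TCK_RecipThroughput);
InstructionCost NewCost = TTI.getMemoryOpCost(
    Instruction::Store, EltTy, Align(1), AS,
    TargetTransformInfo::TCK_RecipThroughput);
if (NewCost > OldCost)
  return false; // not profitable on this target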

qiucf marked an inline comment as done.Jun 9 2020, 9:18 AM

Thanks for the comments!

I'll try moving this into VectorCombine. Maybe we can do/know more with help from TTI.

llvm/test/Transforms/InstCombine/load-insert-store.ll
2

We have few docs that explicitly talk about this, and I remember that endianness doesn't matter here. But adding the BE/LE checks is fine. Thanks.

qiucf abandoned this revision.May 8 2021, 7:33 PM

D98240 landed.