This is an archive of the discontinued LLVM Phabricator instance.

Teach instcombine to canonicalize "element extraction" from a load of an integer and "element insertion" into a store of an integer into actual element extraction, element insertion, and vector loads and stores.
ClosedPublic

Authored by chandlerc on Dec 5 2014, 3:56 AM.

Details

Summary

Previously various parts of LLVM (including instcombine itself) would
introduce integer loads and stores into the code as a way of opaquely
loading and storing "bits". In some cases (such as a memcpy of
std::complex<float> object) we will eventually end up using those bits
in non-integer types. In order for SROA to effectively promote the
allocas involved, it splits these "store a bag of bits" integer loads
and stores up into the constituent parts. However, for non-alloca loads
and tsores which remain, it uses integer math to recombine the values
into a large integer to load or store.

All of this would be "fine", except that it forces LLVM to go through
integer math to combine and split up values. While this makes perfect
sense for integers (and in fact is critical for bitfields to end up
lowering efficiently) it is *terrible* for non-integer types, especially
floating point types. We have a much more canonical way of representing
the act of concatenating the bits of two SSA values in LLVM: a vector
and insertelement. This patch teaching InstCombine to use this
representation.

With this patch applied, LLVM will no longer introduce integer math into
the critical path of every loop over std::complex<float> operations such
as those that make up the hot path of ... oh, most HPC code, Eigen, and
any other heavy linear algebra library.

For the record, I looked *extensively* at fixing this in other parts of
the compiler, but it just doesn't work:

  • We really do want to canonicalize memcpy and other bit-motion to integer loads and stores. SSA values are tremendously more powerful than "copy" intrinsics. Not doing this regresses massive amounts of LLVM's scalar optimizer.
  • We really do need to split up integer loads and stores of this form in SROA or every memcpy of a trivially copyable struct will prevent SSA formation of the members of that struct. It essentially turns off SROA.
  • The closest alternative is to actually split the loads and stores when partitioning with SROA, but this has all of the downsides historically discussed of splitting up loads and stores -- the wide-store information is fundamentally lost. We would also see performance regressions for bitfield-heavy code and other places where the integers aren't really intended to be split.
  • We *can* effectively fix this in instcombine, so it isn't that hard of a choice to make IMO.

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc updated this revision to Diff 16983.Dec 5 2014, 3:56 AM
chandlerc retitled this revision from to Teach instcombine to canonicalize "element extraction" from a load of an integer and "element insertion" into a store of an integer into actual element extraction, element insertion, and vector loads and stores..
chandlerc updated this object.
chandlerc edited the test plan for this revision. (Show Details)
chandlerc added reviewers: hfinkel, majnemer.
chandlerc added a subscriber: Unknown Object (MLST).
chandlerc updated this revision to Diff 16984.Dec 5 2014, 5:50 AM

Fix a somewhat obvious goof with endianness and a bunch of formatting badness
that drifted into the patch.

Update the test case to be much more precise about what it checks and to
enforce that we get endianness correct now.

hfinkel accepted this revision.Dec 8 2014, 12:25 AM
hfinkel edited edge metadata.

LGTM.

This will obviously lead to the formation of more vector types, many/most of which will be scalarized. We'll need to be careful re: code quality here, as the default expansion of extract_vector_elt, for example, stores the vector to the stack and loads one element (and will do this separately for each extraction). Nevertheless, this does seem to make sense as a canonical form.

It seems like the two primary types affected by this change will be floating-point types and pointers. You have tests with fp types, but none with pointers. Can you please add a test case with pointers? [The underlying logic is obviously the same, but we should have regression tests covering this basic facet of our canonical form]

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
390 ↗(On Diff #16984)

You have a more-explanatory comment for the store version, perhaps you could copy that here, "All of the elements extracted need to be the same type...."

395 ↗(On Diff #16984)

So you will catch pointer types here and form vectors of pointers; this is likely worth a comment somewhere.

This revision is now accepted and ready to land.Dec 8 2014, 12:25 AM

Thanks for the review! I've cleaned up the test a touch, added the pointer vector test, fixed the things you commented on.

Planning to commit. While I agree there may need to be more work if we start getting vectors of pointers and stuff, what I've tested already lowers better than the old code with the inputs we have today (floating point, and the existing backends).

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
390 ↗(On Diff #16984)

Done.

395 ↗(On Diff #16984)

Sure.

chandlerc closed this revision.Dec 9 2014, 12:56 AM
chandlerc updated this revision to Diff 17073.

Closed by commit rL223764 (authored by @chandlerc).