This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Unify vectorization of load/stores of aggregate arguments and return values.
ClosedPublic

Authored by tra on Feb 15 2017, 3:12 PM.

Details

Summary

Original code only used vector loads/stores for explicit vector arguments.
It could also do more loads/stores than necessary (e.g v5f32 would
touch 8 f32 values). Aggregate types were loaded one element at a time,
even the vectors contained within.

This change attempts to generalize (and simplify) parameter space
loads/stores so that vector loads/stores can be used more broadly.
Functionality of the patch has been verified by compiling thrust
test suite and manually checking the differences between PTX
generated by llvm with and without the patch.

General algorithm:

  • ComputePTXValueVTs() flattens input/output argument into a flat list of scalars to load/store and returns their types and offsets.
  • VectorizePTXValueVTs() uses that data to create vectorization plan which returns an array of flags marking boundaries of vectorized load/stores. Scalars are represented as 1-element vectors.
  • Code that generates loads/stores implements a simple state machine that constructs a vector according to the plan.
NOTE: the tests for this patch have found a problem with the way we load/store unaligned fields. This will be addressed separately.

Diff Detail

Repository
rL LLVM

Event Timeline

tra created this revision.Feb 15 2017, 3:12 PM
jlebar added inline comments.Feb 16 2017, 5:30 PM
lib/Target/NVPTX/NVPTXISelLowering.cpp
190 ↗(On Diff #88607)

All right, now that I understand this, it's doing something deceptively simple. Perhaps:

Check whether we can merge loads/stores of some of the pieces of a flattened function parameter or return value into a single vector load/store.

The flattened parameter is represented as a list of EVTs and offsets, and the whole structure is aligned to ParamAlignment. This function determines whether we can load/store pieces of the parameter starting at index Idx using a single vectorized op of size AccessSize. If so, it returns the number of param pieces covered by the vector op. Otherwise, it returns 1.

Perhaps "CanVectorizeAt" is not a great name. I think we want "Param" in the name. Maybe CanMergeParamLoadStoresStartingAt? Ideally we'd also pick a name that doesn't sound like this returns a boolean, but I don't have a suggestion.

208 ↗(On Diff #88607)

Actually I'm not sure why we bail here if the sizes are equal. Don't we want to vectorize {i32, i32}, even if AccessSize is 4 bytes?

217 ↗(On Diff #88607)

Can we move this right below the NumElts computation? Long head-scratcher trying to figure out how we knew that the division above is exact. :)

220 ↗(On Diff #88607)

Nit, "2- and 4-element vector ops"

224 ↗(On Diff #88607)

Maybe it would make sense to write this as

for (unsigned j = Idx + 1; j < Idx + NumElts; ++j)  { ... }
237 ↗(On Diff #88607)

Please add a brief comment. I'm also not sure about the name, for the same reasons as CanVectorizeAt -- it's very generic. When we change the enumeration name, we should probably also change PTX_LDST to something else.

248 ↗(On Diff #88607)

Perhaps

Computes whether and how we can vectorize the loads/stores of a flattened function parameter or return value.

The flattened parameter is represented as the list of ValueVTs and Offsets, and is aligned to ParamAlignment bytes. We return a vector of the same size as ValueVTs indicating how each piece should be loaded/stored (i.e. as a scalar, or as part of a vector load/store).

252 ↗(On Diff #88607)

Can we return the SmallVector instead of using an outparam? We'll get RVO, so it's the same to the compiler.

1417 ↗(On Diff #88607)

Does this comment still hold?

1456 ↗(On Diff #88607)

Is this the right name? We're dealing with function params, not a return value, I think?

1461 ↗(On Diff #88607)

StoreOperands or StNodeOperands? "Ops" is a bad abbreviation for "operands". :) And it looks like this is specifically for stores?

1461 ↗(On Diff #88607)

Can we assert somewhere that this vector is empty when we're done with the loop?

1463 ↗(On Diff #88607)

"New store."?

1480 ↗(On Diff #88607)

s/load-stores/stores/?

1507 ↗(On Diff #88607)

s#load/store#store#?

2400 ↗(On Diff #88607)

This looks oddly familiar from the parameter handling above. We really can't factor it out without making this into more of a hairball? If not, same comments apply down here.

tra updated this revision to Diff 88917.Feb 17 2017, 10:38 AM
tra marked 14 inline comments as done.

Addressed Justin's comments.

lib/Target/NVPTX/NVPTXISelLowering.cpp
208 ↗(On Diff #88607)

If EltSize == AccessSize, then we can only do 1-element load/store and there's nothing to vectorize.
{i32,i32} needs AccessSize of at least 8 so we can fold it into ld/st.v2 .

2400 ↗(On Diff #88607)

Alas, I've found no good way to generalize this further. We have 4 cases that we need to vectorize:

  • Load of function args
  • Storing return value.
  • Storing function args before a call
  • Loading call result.

Loads are currently rather different, so there's not much to factor out between them.
Stores are indeed similar, but create different nodes that take different set of operands. Combining them would probably make things more confusing that it already is.

jlebar accepted this revision.Feb 18 2017, 11:29 AM
jlebar added inline comments.
lib/Target/NVPTX/NVPTXISelLowering.cpp
190 ↗(On Diff #88917)

Nit, please add blank line to separate paragraphs.

253 ↗(On Diff #88917)

Newline to separate para

This revision is now accepted and ready to land.Feb 18 2017, 11:29 AM
tra updated this revision to Diff 89265.Feb 21 2017, 12:58 PM

Fixed last two nits.

This revision was automatically updated to reflect the committed changes.