This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Unify vectorization of load/stores of aggregate arguments and return values.
ClosedPublic

Authored by tra on Feb 15 2017, 3:12 PM.

Details

Summary

Original code only used vector loads/stores for explicit vector arguments.
It could also do more loads/stores than necessary (e.g v5f32 would
touch 8 f32 values). Aggregate types were loaded one element at a time,
even the vectors contained within.

This change attempts to generalize (and simplify) parameter space
loads/stores so that vector loads/stores can be used more broadly.
Functionality of the patch has been verified by compiling thrust
test suite and manually checking the differences between PTX
generated by llvm with and without the patch.

General algorithm:

  • ComputePTXValueVTs() flattens input/output argument into a flat list of scalars to load/store and returns their types and offsets.
  • VectorizePTXValueVTs() uses that data to create vectorization plan which returns an array of flags marking boundaries of vectorized load/stores. Scalars are represented as 1-element vectors.
  • Code that generates loads/stores implements a simple state machine that constructs a vector according to the plan.
NOTE: the tests for this patch have found a problem with the way we load/store unaligned fields. This will be addressed separately.

Event Timeline

tra created this revision.Feb 15 2017, 3:12 PM
jlebar added inline comments.Feb 16 2017, 5:30 PM
lib/Target/NVPTX/NVPTXISelLowering.cpp
190

All right, now that I understand this, it's doing something deceptively simple. Perhaps:

Check whether we can merge loads/stores of some of the pieces of a flattened function parameter or return value into a single vector load/store.

The flattened parameter is represented as a list of EVTs and offsets, and the whole structure is aligned to ParamAlignment. This function determines whether we can load/store pieces of the parameter starting at index Idx using a single vectorized op of size AccessSize. If so, it returns the number of param pieces covered by the vector op. Otherwise, it returns 1.

Perhaps "CanVectorizeAt" is not a great name. I think we want "Param" in the name. Maybe CanMergeParamLoadStoresStartingAt? Ideally we'd also pick a name that doesn't sound like this returns a boolean, but I don't have a suggestion.

208

Actually I'm not sure why we bail here if the sizes are equal. Don't we want to vectorize {i32, i32}, even if AccessSize is 4 bytes?

217

Can we move this right below the NumElts computation? Long head-scratcher trying to figure out how we knew that the division above is exact. :)

220

Nit, "2- and 4-element vector ops"

224

Maybe it would make sense to write this as

for (unsigned j = Idx + 1; j < Idx + NumElts; ++j)  { ... }
237

Please add a brief comment. I'm also not sure about the name, for the same reasons as CanVectorizeAt -- it's very generic. When we change the enumeration name, we should probably also change PTX_LDST to something else.

248

Perhaps

Computes whether and how we can vectorize the loads/stores of a flattened function parameter or return value.

The flattened parameter is represented as the list of ValueVTs and Offsets, and is aligned to ParamAlignment bytes. We return a vector of the same size as ValueVTs indicating how each piece should be loaded/stored (i.e. as a scalar, or as part of a vector load/store).

252

Can we return the SmallVector instead of using an outparam? We'll get RVO, so it's the same to the compiler.

1417

Does this comment still hold?

1456

Is this the right name? We're dealing with function params, not a return value, I think?

1461

StoreOperands or StNodeOperands? "Ops" is a bad abbreviation for "operands". :) And it looks like this is specifically for stores?

1461

Can we assert somewhere that this vector is empty when we're done with the loop?

1463

"New store."?

1480

s/load-stores/stores/?

1507

s#load/store#store#?

2400

This looks oddly familiar from the parameter handling above. We really can't factor it out without making this into more of a hairball? If not, same comments apply down here.

tra updated this revision to Diff 88917.Feb 17 2017, 10:38 AM
tra marked 14 inline comments as done.

Addressed Justin's comments.

lib/Target/NVPTX/NVPTXISelLowering.cpp
208

If EltSize == AccessSize, then we can only do 1-element load/store and there's nothing to vectorize.
{i32,i32} needs AccessSize of at least 8 so we can fold it into ld/st.v2 .

2400

Alas, I've found no good way to generalize this further. We have 4 cases that we need to vectorize:

  • Load of function args
  • Storing return value.
  • Storing function args before a call
  • Loading call result.

Loads are currently rather different, so there's not much to factor out between them.
Stores are indeed similar, but create different nodes that take different set of operands. Combining them would probably make things more confusing that it already is.

jlebar accepted this revision.Feb 18 2017, 11:29 AM
jlebar added inline comments.
lib/Target/NVPTX/NVPTXISelLowering.cpp
190

Nit, please add blank line to separate paragraphs.

253

Newline to separate para

This revision is now accepted and ready to land.Feb 18 2017, 11:29 AM
tra updated this revision to Diff 89265.Feb 21 2017, 12:58 PM

Fixed last two nits.

This revision was automatically updated to reflect the committed changes.