Page MenuHomePhabricator

[AMDGPU] Produce flat|global_dwordx3 instructions
Needs ReviewPublic

Authored by rampitec on Jul 14 2017, 1:07 PM.

Details

Reviewers
vpykhtin
alex-t
Summary

The patch allows to produce dwordx3 loads out of v3i32/v3f32 loads.
There is still a future work to allow vectorizer to create that
vec3 loads.

Diff Detail

Event Timeline

rampitec created this revision.Jul 14 2017, 1:07 PM

This is the wrong way to handle this. I did most of the work to avoid having to select the machine nodes so early a long time ago. I have the patches to add v3* to MVT. Short of that a new LOAD_V3 node would be better than going direct to the instruction here

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2554

There's a global offset subtarget feature. I also have the patch to start selecting global, but I haven't committed it yet

This is the wrong way to handle this. I did most of the work to avoid having to select the machine nodes so early a long time ago. I have the patches to add v3* to MVT. Short of that a new LOAD_V3 node would be better than going direct to the instruction here

V3 is an alien to LLVM, so I had to do it this way. It works.
What about patches for MVT to support V3? Are they anywhere ready to be submitted?

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2554

This is not about offsets, this is about support of global_load instructions, which is started from GFX9. Then flat_load's also have offsets there and that is checked above as Subtarget->hasFlatInstOffsets(). A global offset is actually a feature to support offsetting workitem ids, so unrelated.

vpykhtin edited edge metadata.Jul 21 2017, 12:49 PM

The implementation of this approach looks good to me. The only question is which way to go to implement v3 vector.

The implementation of this approach looks good to me. The only question is which way to go to implement v3 vector.

Probably making v3 generally legal and simple type is a right thing to do. This will solve not only problem with loads, but compute on the resulting vector as well. Currently such compute is done on a 4 component vector created by the legalization with the promote of v3 to v4.

It however seems to be long way because a lot of places just designed to work with a power of 2 vectors, halfing and doubling them. I would do this style vec3 load in the short term and target legal v3 in a long term.

I could probably get the v3 patch in. IIRC I had all tests passing with a hack to keep the legalization unchanged and then got stuck fixing all cases with proper legalization

I could probably get the v3 patch in. IIRC I had all tests passing with a hack to keep the legalization unchanged and then got stuck fixing all cases with proper legalization

If legal v3 is around that is certainly preferable.

I could probably get the v3 patch in. IIRC I had all tests passing with a hack to keep the legalization unchanged and then got stuck fixing all cases with proper legalization

If legal v3 is around that is certainly preferable.

https://github.com/arsenm/llvm/tree/legal-vector3-v2

The first 2 commits here seem to work without test failures (but a few cost model regressions). They succeed in adding the basic types, the few after that need some more work

I could probably get the v3 patch in. IIRC I had all tests passing with a hack to keep the legalization unchanged and then got stuck fixing all cases with proper legalization

If legal v3 is around that is certainly preferable.

https://github.com/arsenm/llvm/tree/legal-vector3-v2

The first 2 commits here seem to work without test failures (but a few cost model regressions). They succeed in adding the basic types, the few after that need some more work

I wander what happens to passes which like to split a vector by half? I also can see that vectorizer (load/store and SLP) for instance is written in a way that does not support non power of 2 vectors. I hope some passes will just silently bail instead of silently fail at least when v3* will be reported as legal.

Anyway, if you are planning to fix these patches and merge I will hold current review. It also does not solve v3 operations problem other than load and potentially store, because v3 will be promoted on any arithmetic, so it is way not perfect.