Page MenuHomePhabricator

[LegalizeVectorTypes] Allow single loads and stores for more short vectors
ClosedPublic

Authored by tauril on Jan 2 2019, 8:17 AM.

Details

Summary

When lowering a load or store for TypeWidenVector, the type legalizer
would use a single load or store if the associated integer type was legal
or promoted. E.g. it loads a v4i8 as an i32 if i32 is legal/promotable.
(See https://reviews.llvm.org/rL236528 for reference.)

This patch applies that behaviour to vector types. If the vector type
is TypePromoteInteger, the element type is going to be
TypePromoteInteger as well, which will lead to have a single promoting
load rather than N individual promoting loads. For instance, if we have
a v3i1, we would now have a load of v4i1 instead of 3 loads of i1.

I don't have any knowledge with AMDGPU and it seems that this commit
introduces some less performant code with the R600 architecture. I would
appreciate if someone knowledgeable with that architecture could enlighten
me on how bad the changes are.

Diff Detail

Repository
rL LLVM

Event Timeline

tauril created this revision.Jan 2 2019, 8:17 AM
tauril edited the summary of this revision. (Show Details)Jan 3 2019, 3:28 AM
jvesely added a subscriber: arsenm.Jan 8 2019, 9:12 AM

Hi,

sorry for the delay. afaik r600 does not do any special handling wrt to coalescing loads. There is a general load/store vectorizer by @arsenm, so it looks like this patch is interfering with it, but I'd expect the same to happen for GCN as well.
I'm OK with these pessimizations, R600 loads/stores have bigger problems.

jan

test/CodeGen/AMDGPU/load-constant-i16.ll
197 ↗(On Diff #179849)

is there any reason to remove these lines from the test?

test/CodeGen/AMDGPU/load-global-i16.ll
201 ↗(On Diff #179849)

have the extra moves been eliminated by this patch? if not, is there another reason to remove these lines?

222 ↗(On Diff #179849)

same here

tauril marked 3 inline comments as done.Jan 9 2019, 2:20 AM
tauril added inline comments.
test/CodeGen/AMDGPU/load-constant-i16.ll
197 ↗(On Diff #179849)

I removed these lines from the test because the generated code no longer has them in it.

test/CodeGen/AMDGPU/load-global-i16.ll
201 ↗(On Diff #179849)

Exactly, the moves have been eliminated by the patch!

222 ↗(On Diff #179849)

Same thing here, the patch removed these lines from the generated code.

jvesely added inline comments.Jan 12 2019, 3:12 PM
test/CodeGen/AMDGPU/load-constant-i16.ll
197 ↗(On Diff #179849)

I see. If the values are directly loaded to registers used in STORE_RAW, it'd be nice to preserve the tracking instead of using generic {{T[0-9]\.[XYZW]}}
i.e:
VTX_READ_16 [[ST_LO]].X, ... 0, #1
VTX_READ_16 [[ST_LO]].Y, ... 2, #1
VTX_READ_16 [[ST_HI]].X, ... 4, #1

tauril updated this revision to Diff 181556.Jan 14 2019, 8:23 AM

I modified some AMDGPU tests to track more registers where possible as @jvesely suggested, and I added some missing new relevant generated instructions (BFE_INT).

I modified some AMDGPU tests to track more registers where possible as @jvesely suggested, and I added some missing new relevant generated instructions (BFE_INT).

thanks. the r600 test changes LGTM.
I can't comment on the x86 ones.

I modified some AMDGPU tests to track more registers where possible as @jvesely suggested, and I added some missing new relevant generated instructions (BFE_INT).

I can't comment on the x86 ones.

test/CodeGen/X86/load-local-i1.ll should be precommitted i guess, so the change is actually visible.

tauril updated this revision to Diff 181601.Jan 14 2019, 10:57 AM

I renamed the new test from test/CodeGen/X86/load-local-i1.ll to test/CodeGen/X86/load-local-v3i1.ll
because I made a mistake the first time.
I also precommitted that same new test so that we can see the difference that this patch introduces with it,
as suggested by @lebedev.ri.
The relevant information to notice is the removal of the negb instruction followed by the filling of the
arguments of the masked_load_v3 function with the same value (as it is explained in the beginning of
the test).

Because this updated patch now considers that the new test already existed, I don't know how this will be
handled if this patch gets accepted, when trying to commit it.

craig.topper added inline comments.Jan 23 2019, 10:29 AM
test/CodeGen/X86/load-local-v3i1.ll
28 ↗(On Diff #181601)

Can you add 'nounwind' to this to drop the .cfi lines

tauril updated this revision to Diff 183269.Jan 24 2019, 2:08 AM

I added the nounwind attribute to the new test function's definition as @craig.topper suggested.

tauril marked an inline comment as done.Jan 24 2019, 2:09 AM

X86 changes look good to me

Ping @bogner.

I also want to point out I don't have commit access if the patch were to be accepted.

tauril marked 4 inline comments as done.Feb 1 2019, 6:43 AM
arsenm added a comment.Feb 1 2019, 6:53 AM

The R600 code is worse, but it seems to just be for the 3x vector cases. I don't expect much out of r600, particularly for edge cases like this so I wouldn't worry about it too much

If no one else has any remark regarding this patch, could someone with the rights to approve the patch? @bogner

bogner accepted this revision.Mar 25 2019, 4:47 PM

Sorry for the delay, I somehow completely missed this. The change looks good - I'll commit it for you today or tomorrow.

This revision is now accepted and ready to land.Mar 25 2019, 4:47 PM
bogner requested changes to this revision.Mar 26 2019, 1:18 PM

I hit a couple of minor issues trying to apply the patch:

  1. We hit an assertion in test/CodeGen/AMDGPU/kernel-args.ll. Can you take a look?
  2. I don't have test/CodeGen/X86/load-local-v3i1.ll in my tree, so that part doesn't apply. Is it reasonable to drop this part of the change?
  3. A couple of the X86 tests have changed slightly - these are easy enough to upgrade with update_llc_test_checks though.
This revision now requires changes to proceed.Mar 26 2019, 1:18 PM
tauril updated this revision to Diff 192460.Wed, Mar 27, 10:02 AM

Hello @bogner, thanks for accepting this patch!

  1. We hit an assertion in test/CodeGen/AMDGPU/kernel-args.ll. Can you take a look?

The problem came from the R600ISelLowering where apparently the global addresspace is not supported for vector truncating stores. (@jvesely would it be possible that you have a look on the changes I made in R600ISelLowering.cpp to avoid the assert?). The failing test (v5i16_arg) was introduced in https://reviews.llvm.org/D58928 by @tpr.

  1. I don't have test/CodeGen/X86/load-local-v3i1.ll in my tree, so that part doesn't apply. Is it reasonable to drop this part of the change?

About that, it was requested by @lebedev.ri to have the file pre-committed to see how my modifications changed it. I've now removed the pre-committed file so that it is created by this patch.

  1. A couple of the X86 tests have changed slightly - these are easy enough to upgrade with update_llc_test_checks though.

I have updated the tests accordingly!

I also ran ninja check-llvm on a build with asserts activated to ensure everything passed.

Herald added a project: Restricted Project. · View Herald TranscriptWed, Mar 27, 10:02 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
bogner accepted this revision.Wed, Mar 27, 1:34 PM
This revision is now accepted and ready to land.Wed, Mar 27, 1:34 PM
This revision was automatically updated to reflect the committed changes.