This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Fix DPP combiner
ClosedPublic

Authored by vpykhtin on Dec 7 2018, 9:22 AM.

Details

Summary

Fix issues discussed after https://reviews.llvm.org/D55314

  1. The DPP move's uses and the old-register initializer must be in the same BB.
  2. bound_ctrl:0 is only considered when bank_mask and row_mask are fully enabled (0xF). Otherwise the old register value is checked for identity.
  3. Added add, subrev, and, or instructions to the old-folding function.
  4. The kill flag is cleared for src0 (the DPP register), as it may be copied into more than one user.

Tests are updated.

The pass is still disabled by default; I would appreciate it if someone could test it on RADV (with the pass turned on).

Diff Detail

Repository
rL LLVM

Event Timeline

vpykhtin created this revision.Dec 7 2018, 9:22 AM
vpykhtin updated this revision to Diff 177866.Dec 12 2018, 9:27 AM

updated diff, added tests.

  1. A dpp mov with "src reg" == "old reg" can also be combined.
  2. Added a check that the EXEC mask isn't modified between the dpp mov and the consumer instructions.
  3. Enhanced debug output.
cwabbott added inline comments.Dec 12 2018, 10:24 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
262

I think you have AND and OR mixed up... the identity for AND should be -1, and OR should be 0, right?

266

If the old operand is the identity in the original mov, then in the transformed instruction the old operand should be the un-swizzled source, not the identity again. I think this is the reason the add tests still fail on radv, since it generates something like:

v_mov_b32_e32 v5, 0 
s_nop 1
v_add_u32_dpp v5, vcc, v4, v4  row_shr:4 row_mask:0xf bank_mask:0xe

from

%88 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %87, i32 276, i32 15, i32 14, i1 false) #2
  %89 = add i32 %87, %88

when what we really want is

v_add_u32_dpp v4, vcc, v4, v4  row_shr:4 row_mask:0xf bank_mask:0xe

Connor, indeed, my bad. I'll try to fix this in a couple of days.

Thank you,
Valery

vpykhtin updated this revision to Diff 178878.Dec 19 2018, 6:29 AM

Fixed the issue with identity values and other cases; f32/f16 identity values are to be added later.

fma/mac instructions are disabled for now.

Test is fully reworked, added comments.

Connor, can you please take a look at the test in this update (first half)? I described the different cases to combine there. If possible, please also run the patch on the radv tests (the pass has to be switched on manually).

The pass is still disabled by default.

If there are no strong objections, I would like to submit my latest patch here, since the pass is disabled and the patch looks better anyway. Otherwise I'll return from the NY holidays only on Jan 11.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 9 2019, 5:49 AM
Closed by commit rL350721: [AMDGPU] Fix DPP combiner (authored by vpykhtin).
This revision was automatically updated to reflect the committed changes.

Hi Connor,

I submitted my latest patch; the pass is still disabled by default. Can you please run the RADV tests now (with the pass manually enabled)?

vpykhtin updated this revision to Diff 180842.EditedJan 9 2019, 7:30 AM

The patch was reverted as not reviewed. Uploaded rebased diff.

vpykhtin reopened this revision.Jan 9 2019, 7:36 AM

reopening revision.

cwabbott added a comment.EditedJan 10 2019, 5:34 AM

Sorry, I just got back from break this week. I've run the CTS with the pass enabled, and it now passes, although it seems most of the patterns we use don't get folded. The AND, XOR, unsigned max, and unsigned min cases are the most troubling, since the code that gets generated looks like it should be optimized:

	v_mov_b32_e32 v8, -1                                          ; 7E1002C1
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_and_b32_e32 v2, v2, v8                                      ; 26041102

as well as

	v_mov_b32_e32 v8, 0                                                         ; 7E100280
	s_nop 1                                                                     ; BF800001
	v_mov_b32_dpp v8, v2  row_shr:8 row_mask:0xf bank_mask:0xc                  ; 7E1002FA FC011802
	v_max_u32_e32 v2, v2, v8                                                    ; 1E041102

and

	v_mov_b32_e32 v8, -1                                          ; 7E1002C1
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_min_u32_e32 v2, v2, v8                                      ; 1C041102

and finally

	v_mov_b32_e32 v8, 0                                                         ; 7E100280
	s_nop 1                                                                     ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf               ; 7E1002FA AF014202
	v_xor_b32_e32 v2, v2, v8                                                    ; 2A041102

Anyways, the other cases look like maybe some other clever optimization for the immediate is hindering this one, for example this with signed minimum:

	v_bfrev_b32_e32 v8, -2                                        ; 7E1058C2
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_min_i32_e32 v2, v2, v8                                      ; 18041102

Maybe this pass needs to be moved earlier in the pipeline?

I'll take a more detailed look soon. But I'd also like to say that I'm a little worried that there are currently no integration tests that use DPP besides the Vulkan CTS, so there is currently no way for anyone working on this from the ROCm/compute side to test this pass properly, which means that none of you can work on this with any confidence. I think this exchange has clearly shown that unit tests aren't enough, since they only show that the code does what you think it does, not whether what you think it does is correct :) Maybe we should fix the atomic optimizer first so that it does something similar to what radv and AMDVLK do, to make it easier to test? Or maybe you just need to go write some tests.

Thank you, Connor.

I'll go through the cases you described; there may be issues with instruction commutation (the dpp src reg should be src0) or other problems (I already saw e64 instructions that cannot be converted to e32). I would appreciate it if you can attach a .ll file for your cases.

Speaking about testing, the test I added isn't just a unit test: it contains situations (each commented) that the pass is supposed to combine. I think we should review the test and agree whether it does the right thing first.

@cwabbott I planned to do a follow-up once this DPP change had landed to add the missing dpp/codegen patterns to the atomic optimizer - so watch this space!

I do agree, though, that we should probably add the DPP patterns that we want to be 100% certain get combined into a test case. I'll work with @vpykhtin to add this.

Anyways, the other cases look like maybe some other clever optimization for the immediate is hindering this one, for example this with signed minimum:

	v_bfrev_b32_e32 v8, -2                                        ; 7E1058C2
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_min_i32_e32 v2, v2, v8                                      ; 18041102

Maybe this pass needs to be moved earlier in the pipeline?

I'm not sure I can insert the pass that high, I'll think of how it can be skipped.

This kind of immediate-init optimization is performed by the SIShrinkInstructions pass, which runs after the DPP combiner pass, so the issue is different here. It would be nice to have a .ll file.

I figured it would be a little easier if I looked at these cases myself. It turns out there are more problems with isIdentityValue, including some correctness issues. After fixing these, everything works correctly now.

lib/Target/AMDGPU/GCNDPPCombine.cpp
265

You have min and max mixed up: the identity for min is the maximum possible value, not the minimum. The same goes for signed and unsigned min/max. Also, you can add V_XOR with an identity value of 0 here.

271–273

It turns out that LLVM sign-extends the immediate to 64 bits at some point during instruction selection, which means this won't work for any MIR generated from LLVM IR. We should ignore the high 32 bits when doing all these comparisons, since they're irrelevant.

vpykhtin updated this revision to Diff 181239.Jan 11 2019, 3:58 AM

Thanks! I wonder how easy it is to get confused there. I updated the diff with the latest found problems fixed.

vpykhtin marked 2 inline comments as done.Jan 11 2019, 4:00 AM

I think we've reached the state where this can be submitted (and probably enabled with a subsequent patch). This would allow any of us to make other fixes if required.

cwabbott added inline comments.Jan 14 2019, 10:27 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
369

I don't think this is correct. A lane can be set to 0 for the shift DPP controls (e.g. DPP_ROW_SL1) or if EXEC is zero for the lane to be read, in which case the user will do something non-trivial (e.g. OR will pass through the other operand) which we can't emulate in general.

385

Indentation is off here, seems like your editor is using hard tabs

391

Also here

412

This seems incorrect to me. In the lanes where the source is invalid or the row/bank mask is 0, the DPP move will act as a no-op here, so if that lane is then added, or'd, etc. with something else, we can't emulate that with a single instruction.

Hi Valery, I really like the way the different cases are listed in the explanatory comment at the top of the file, and I believe those cases are correct. Would it be possible to restructure the code in a way that follows those cases? I think that would make it much easier to follow.

That is, in combineDPPMov, can you restructure the first do/while construct so that it mirrors the cases of the top of file comment and defines new variables CombinedOld and CombinedBoundCtrl? Come to think of it, it may then be wise to move these top of file comments into the function to increase the chances that they will be kept up-to-date in the future.

lib/Target/AMDGPU/GCNDPPCombine.cpp
355

Having both OldOpndVGPR and OldOpndValue seems redundant. You could pass OldOpndValue as a MachineOperand to createDPPInst and have it assert on !CombinedOld || CombinedOld->Reg(). That should help cut down the code size here somewhat and make the code easier to follow.

424

These should be SmallVectors.

473

Can we simplify the code here by just commuting the OrigMI unconditionally?

That is, first check whether &Use != src0 && &Use == src1; in that case, try commuting the instruction. If that fails, break (but I don't think there'd be a need to rollback the commutation). If it succeeds, continue down the same path that is used for the &Use == src0 case. You end up with only a single call to createDPPInst and less code duplication in general.

vpykhtin marked 4 inline comments as done.Jan 15 2019, 4:04 AM
vpykhtin added inline comments.
lib/Target/AMDGPU/GCNDPPCombine.cpp
369

Could you please show an example of how it wouldn't work? Note that the EXEC mask remains the same (by the check above) and no combining is performed when the result of the DPP mov is used in any other way than being consumed by a DPP-capable VALU instruction. I think the result of the DPP mov should be the same as the DPP src operand used in the DPP-capable VALU instruction.

385

Yeah, thanks. I reinstalled my editor recently and forgot to turn off tabs. It should be OK with the latest diff.

412

mov_dpp v1, v1, ... (v1 of other lane is stored in v1 of this lane)
add_u32 v0, v1, ...

This is the case when the DPP src register is stored in the VGPR with the same name in the issuing lane. This way v1 would contain the same value after an unsuccessful DPP mov (no-op) and can therefore be used in the combined VALU op.

vpykhtin marked 4 inline comments as done.Jan 15 2019, 5:47 AM

Hi Nikolai,

thank you for the review. I'll think about how to restructure the code to match the top comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
355

There was a reason I did it this way initially, though after all this time I need to recall it. If I remember correctly, the value is tracked through instructions like mov/copy and subreg manipulation, and having only the value isn't enough to obtain the reg to store in the DPP instruction.

424

Ok.

473

This is a good idea: at some intuitive level I don't like leaving a commuted instruction behind on rollback, but I really don't have arguments against it at the moment :-)

cwabbott added inline comments.Jan 15 2019, 7:44 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
369

Yes, you're right. I was misremembering what bound_ctrl:0 does. Sorry!

412

I still don't see how this can work. For something like:

mov_dpp v1, v1, ...
add_u32 v0, v1, v2

lanes where the shared data is invalid based on the DPP ctrl or EXEC will return v1 (same lane) + v2, whereas this will transform it to something like

add_u32_dpp v0, v1, v2, ...

which will give you v0 (undef). What's an example of a transform you're trying to accomplish?

vpykhtin marked an inline comment as done.Jan 15 2019, 8:00 AM
vpykhtin added inline comments.
lib/Target/AMDGPU/GCNDPPCombine.cpp
412

Sorry, this should look like:

mov v1, X
mov v2, Y
mov_dpp v1, v1, some_DPP_ctrl
add_u32 v0, v1, v2

transformed to

mov v1, X
mov v2, Y
add_u32_dpp v0, v1, v2, some_DPP_ctrl

v1 should contain X on an invalid DPP access, or X from the other lane on a valid one.

vpykhtin marked an inline comment as done.Jan 15 2019, 8:35 AM
vpykhtin added inline comments.
lib/Target/AMDGPU/GCNDPPCombine.cpp
412

I forgot to mention that some_DPP_ctrl should have all masks fully enabled; otherwise the add wouldn't write its result and v0 would indeed be undef.

cwabbott added inline comments.Jan 16 2019, 1:13 AM
lib/Target/AMDGPU/GCNDPPCombine.cpp
412

This still seems wrong. For the first sequence you gave, the add_u32 will return X (same lane) + Y on invalid DPP access while add_u32_dpp will return undef (assuming v0 isn't initialized, since you set old to undef in this patch).

vpykhtin marked an inline comment as done.Jan 16 2019, 6:09 AM
vpykhtin added inline comments.
lib/Target/AMDGPU/GCNDPPCombine.cpp
412

It looks like the documentation is a bit unclear here. I thought that when bound ctrl is off, the hardware just uses the value of the issuing lane instead of the DPP read, but if it disables writing the result then you're right - v0 would be undef. I'll find out and fix this.

vpykhtin marked an inline comment as done.Jan 18 2019, 5:02 AM
vpykhtin added inline comments.
lib/Target/AMDGPU/GCNDPPCombine.cpp
473

Actually, a pass returns a boolean indicating whether it changed something in the IR. What should it return when it does a rollback? It may create new commuted instructions which need to be mapped in SlotIndexes, for example. I don't think it's a good idea to return true only when a rollback took place.

arsenm added inline comments.Jan 18 2019, 12:08 PM
lib/Target/AMDGPU/GCNDPPCombine.cpp
473

Can rollback be avoided?

I think the usual purpose of knowing something changed is to know if iterators are still valid, which probably isn't the case if anything was modified at any point

arsenm added inline comments.Jan 18 2019, 12:38 PM
lib/Target/AMDGPU/GCNDPPCombine.cpp
473

I suppose if it's just commuting, it's ok to report no change

vpykhtin marked an inline comment as done.Jan 21 2019, 6:34 AM
vpykhtin added inline comments.
lib/Target/AMDGPU/GCNDPPCombine.cpp
473

The commutation is the whole problem. I cannot know beforehand whether the instruction can be commuted. Also, an instruction copy should be created for the commutation - this means that previous analyses should be invalidated and the IR change should be reported.

vpykhtin marked an inline comment as done.Jan 21 2019, 6:42 AM
vpykhtin added inline comments.
lib/Target/AMDGPU/GCNDPPCombine.cpp
473

I think there is no way to avoid creating new instructions on commutation, because of reversed instructions.

vpykhtin updated this revision to Diff 183315.Jan 24 2019, 7:52 AM

Fixed the issue with old = dpp src register when bound ctrl is off.

Slightly refactored and simplified. The description is corrected, though it doesn't map one-to-one onto the code, as there are other things to do (such as reusing IMPLICIT_DEF instructions).

Variable names changed to be more consistent with the description.

I decided to leave the current commutation as is because it doesn't come for free: it may create new instructions (at least reversed instructions), and for that reason previous analyses (like SlotIndexes) should be invalidated.

Herald added a project: Restricted Project.Feb 4 2019, 8:40 AM
This revision is now accepted and ready to land.Feb 7 2019, 12:43 AM
vpykhtin updated this revision to Diff 185732.Feb 7 2019, 3:42 AM

rebased diff.

Thanks Nikolai!

Connor, can you please try this patch and perhaps accept it?

cwabbott accepted this revision.Feb 8 2019, 2:54 AM

I'm not going to read everything in detail, but the combining rules look correct to me and everything passes with this pass enabled. Feel free to re-enable it.

This revision was automatically updated to reflect the committed changes.

Thank you Connor! I really appreciate your effort on this DPP work.

I'll enable the pass with the subsequent submit.
