This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Improved wide multiplies
ClosedPublic

Authored by OutOfCache on Dec 16 2022, 5:00 AM.

Details

Summary

These checks show optimized instructions if an operand is known to be
(partially) zero.

Change-Id: Ie2f6d0d3ee9d5b279d1f4c1dd0787492e39cc77a

Diff Detail

Event Timeline

OutOfCache created this revision.Dec 16 2022, 5:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 16 2022, 5:00 AM
OutOfCache requested review of this revision.Dec 16 2022, 5:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 16 2022, 5:00 AM

Added missing commits

OutOfCache retitled this revision from [AMDGPU] Improved wide multiplies tests to [AMDGPU] Improved wide multiplies.Dec 16 2022, 5:14 AM
arsenm added inline comments.Dec 16 2022, 5:27 AM
llvm/include/llvm/CodeGen/GlobalISel/LegalizerHelper.h
85–87

Can you pre-commit the change to add GISelKnown bits to the legalizer

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
109–115

Don't need both constructors, just the one with default argument

Fixing indentation

arsenm added inline comments.Dec 16 2022, 5:32 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
2253–2255

Missing mir test updates? I thought we had mir coverage for mul legalize already

Adding missing commits.

foad added a comment.Dec 16 2022, 5:34 AM

I guess we could get the same improvement by running the binop_right_to_zero combine post-legalization?

tsymalla added inline comments.Dec 16 2022, 6:29 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2932

Can inline this into the loop

3058

Intendation issue?

3184

Naming is a bit unfortunate, as the function returning the pointer to the KBAnalysis is named the same as the function getting the known bits.

Somethign went wrong during applying the patch. Could you please have a look?
https://buildkite.com/llvm-project/diff-checks/builds/142783#01851b23-3226-4e98-bac5-219865bb07b5

OutOfCache marked an inline comment as done and an inline comment as not done.

Adressing the redundant constructor and minor format issues

OutOfCache updated this revision to Diff 483596.EditedDec 16 2022, 10:48 AM

Adding the new test file

tsymalla added inline comments.Dec 18 2022, 5:19 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3184

Can be merged to AtLeastOneSrcIsZero (just like with AtLeastOneArgIsZero)

3184

Intendation (maybe run clang-format)?

llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
6

Can you give some more insightful names to the tests (appending _xy vs. naming the test @v_mul_i64_zext_no_args or something similar)?

7

Can you pre-commit the tests so it is easier to identify the changes in the assembly?

146

Maybe add some test to show whats happening when both high and low bits are being masked (0x0000FFFF000...)

235

Should this tests and the ones following also be prefixed with v_?

270

Typo: differnt

272

Maybe add some 32-bit tests to show that your changes are being applied correctly.

It seems like this is breaking the LegalizerHelper.h / the LegalizerHelperTest:
/var/lib/buildkite-agent/builds/llvm-project/llvm/include/llvm/CodeGen/GlobalISel/LegalizerHelper.h:47:7: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 4 were provided
/var/lib/buildkite-agent/builds/llvm-project/llvm/unittests/CodeGen/GlobalISel/LegalizerHelperTest.cpp:641:19: error: no matching constructor for initialization of 'llvm::LegalizerHelper'

LegalizerHelper Helper(*MF, Info, Observer, B);

Improved tests with better naming.
Better variable naming in LegalizerInfo.
Added constructor again because that was causing issues.

This comment was removed by OutOfCache.
  • [AMDGPU] Improved naming of mul-known-bits tests
  • [AMDGPU] Better variable naming in LegalizerInfo

Improved tests with better naming.
Better variable naming in LegalizerInfo.
Added constructor again because that was causing issues.

Fixing constructor issues by moving the default parameter to the header.

The default parameter for the new LegalizerHelper constructor has to be set
in the header file and not the definition in the .cpp file.

Fixing constructor issues by moving the default parameter to the header.

The default parameter for the new LegalizerHelper constructor has to be set
in the header file and not the definition in the .cpp file.

  • [GISel] Adding KnownBitsAnalysis to Legalizer
  • [AMDGPU] Wide multiplies with Known Bits Analysis
  • [AMDGPU] Improved wide multiplies tests

Fixed clang-format issue.

  • [AMDGPU] Improved wide multiplies tests

Fixed clang-format.

  • [AMDGPU] Wide multiplies with Known Bits Analysis
  • [AMDGPU] Improved wide multiplies tests

Fixed clang-format for good.

A few nits / questions, otherwise LGTM.

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3002

It looks like this is doing the same thing like in LOC 3040. Does it make sense to cache the results (is there anything computed when invoking isZero())? You could also cache the pairs that are known to be zero, but this is just optional. I'd just think about re-implementing the same thing a few lines later.

3185

I'd still prefer to have the getter method Helper::getKnownBits() differently named. The one thing is returning a pointer to the known bits analysis, the other one is returning the known bits for a register.

llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
396

Can you please pre-commit these tests?

foad added a comment.Jan 9 2023, 5:35 AM

Can you rebase now that D140907 has landed please?

Thanks, this already looks good to me. I do have a small number of comments inline still.

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2901–2907

Did you run clang-format on this? Usually it chooses a different layout.

2933–2936

Please remove the assumption that Src0 and Src1 are equally long. I'd suggest using two for-each loops.

3002–3004

nit: I don't think having the separate variable is worth it anymore, just put the condition directly into the if-statement (same below)

3185

I think it's fine. The class is called GISelKnownBits, not GISelKnownBitsAnalysis or similar.

3188–3192

Does this actually happen? (Is there a test case that changes when you remove this?) I would have thought that such a G_MUL would have been eliminated by an earlier combine.

OutOfCache marked 9 inline comments as done.
  • [GISel] Adding KnownBitsAnalysis to Legalizer
  • [AMDGPU] Wide multiplies with Known Bits Analysis
  • [AMDGPU] Improved wide multiplies tests
  • [GISel/AMDGPU] caching results of isZero()
  • [AMDGPU] Inlining conditions in buildMultiply
  • [AMDGPU] Removing redundant zero-check for mults
  • Rebasing after D140907 landed
  • Addressing comments
OutOfCache marked 5 inline comments as done.Jan 10 2023, 4:32 AM
OutOfCache added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2901–2907

I did. It even caused failing pre-merge checks if I remember correctly. I ran clang-format again and this is still the result.

3002

isZero() is calculating the number of zeroes and checking if the sum is the same as the total number of bits. Because of that, I decided to cache the results of isZero() in the vectors.

3007

This check is required, when the accumulator is a zero register.

!LocalAccum[0] only checks for the existence of a Register. It is still true, if the Register is known to be all zeroes.
This particular case occurs when the lower bytes of an operand are masked.
In that case, the check in line 3048 will fail and no G_MAD will be created. LocalAccum[0] will still be set to the result of the Unmerge of the Tmp register in line 3060. Tmp is set to a zero register in line 3041, so it is all zeroes at this point.

By stepping through the debugger, I confirmed that in that case the first condition, !LocalAccum[0] will be false, but the second condition will be correctly evaluated to true and therefore skip the addition to 0.

3184

I can see the issue. I am not sure what to rename it to, though. getKnownBitsAnalysis? Not quite correct since the type is GISelKnownBits, but perhaps less confusing?

3185

I kept it as is because of the type name and because it has the same name in the CombinerHelper

tsymalla added inline comments.Jan 10 2023, 4:32 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2936

Can use ranged for-loops

3047

Can one of these accesses be out-of-bounds now that you removed the assumption that both Src0 and Src1 are of equal length?

tsymalla added inline comments.Jan 10 2023, 4:38 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2901–2907

You can try clang-format -style=file:path and see if this changes something. I think you can also manually invoke git-clang-format to see if anything would have been changed by clang-format.

OutOfCache marked an inline comment as done.
  • [AMDGPU] Removing redundant zero-check for mults
OutOfCache marked an inline comment as done.Jan 11 2023, 8:38 AM
OutOfCache added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2901–2907

I ran git clang-format but it didn't change this part. clang-format file -i changed the file in a lot of places, but not here either. Maybe someone could double-check on their machine?

3047

I double-checked and the way the code works currently makes sure that both Src0 and Src1 are the same length. In line 3179, both arrays are created in the same for-loop. I don't know if there is a case where they will have different sizes. If the operands are not of the same size, an error is thrown and it does not compile. So in theory yes, but in practice no? Is this a case that should be considered or would that be too much?

tsymalla added inline comments.Jan 11 2023, 4:31 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2901–2907

I just re-ran git-clang-format HEAD~1 and it did not change anything for me. The Buildbot does not seem to spill out an error due to clang-format, so I guess it's alright.

3047

@nhaehnle If both arrays are initialized in one loop, shouldn't we just move back to initialize Src0KnownZeros and Src1KnownZeros in one loop, too?

arsenm added inline comments.Jan 11 2023, 4:38 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3007

If you're just looking for zero, just looking for the constant zero is cheaper than going through getKnownBits

OutOfCache added inline comments.Jan 16 2023, 7:17 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3007

Sounds like a good idea, but how do I do that?

tsymalla added inline comments.Jan 17 2023, 3:00 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3007

I guess he meant checking the operands for being zero explicitly. I think using getKnownBits is fine.

  • [AMDGPU] reverting separation of for-loops
arsenm added inline comments.Jan 20 2023, 7:54 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3007

Check if it's G_CONSTANT i32 0. There are a few too many ways to check for it (I'd suggest MIPatternMatch's m_ZeroInt)

Can you re-apply clang-format on AMDGPULegalizerInfo.cpp please?

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2933

I think you wanted to iterate with a classical for loop and access Src0 and Src via the subscript operator?

3007

Is that check correct? Can you remove the comment please?

tsymalla added inline comments.Jan 20 2023, 10:31 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3047

Can you add some GISel tests showing the behavior by transforming MIR to MIR? You had an initial test case, you could re-use that. Maybe there are already MIR test cases you could extend. This should also show that Src0 and Src1 are handled correctly.

  • [AMDGPU] reverting separation of for-loops
  • [AMDGPU] reverting separation of for-loops
OutOfCache marked 2 inline comments as done.Jan 23 2023, 7:09 AM
OutOfCache added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3007

I tried mi_match(LocalAccum[0], MRI, m_ZeroInt()), but for some reason it always returned false.

I also tried replacing the SrcXKnownZeros.push_back(KB.getKnownBits(SrcX[i]).isZero()) with Src0KnownZeros.push_back(mi_match(SrcX[i], MRI, m_ZeroInt()) and similarly, it returned false when the first one returned true.

This also caused the @v_mul_i64_masked_src0_lo and @v_mul_i64_masked_src1_lo tests to fail and produce multiplications with 0.

OutOfCache added inline comments.Jan 23 2023, 7:15 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3007

This is the Code before the Legalizer:

bb.1.entry:
  liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
  %2:_(s32) = COPY $vgpr0
  %3:_(s32) = COPY $vgpr1
  %0:_(s64) = G_MERGE_VALUES %2:_(s32), %3:_(s32)
  %4:_(s32) = COPY $vgpr2
  %5:_(s32) = COPY $vgpr3
  %1:_(s64) = G_MERGE_VALUES %4:_(s32), %5:_(s32)
  %6:_(s64) = G_CONSTANT i64 -4294967296
  %7:_(s64) = G_AND %1:_, %6:_
  %8:_(s64) = G_MUL %0:_, %7:_
  %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %8:_(s64)
  $vgpr0 = COPY %9:_(s32)
  $vgpr1 = COPY %10:_(s32)
  SI_RETURN implicit $vgpr0, implicit $vgpr1

The only G_CONSTANTs are the mask for the G_AND and a 64-bit 0 for the G_MAD addition

tsymalla added inline comments.Jan 25 2023, 12:21 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3007

Your MIR should look something like that at the time of the LocalAccum[0] check, so you need to query appropriately.

/Users/seuchomat/Documents/Projekte/C++/llvm-project/build/bin/llc -global-isel -march=amdgcn -mcpu=gfx1010 /Users/seuchomat/Documents/Projekte/C++/llvm-project/llvm/test/CodeGen/AMDGPU/GlobalISel/mad.mir -run-pass=legalizer
  %18:_(s32) = G_MUL %11:_, %14:_
# Machine code for function test_mad: IsSSA, NoPHIs

bb.0.entry:
  %0:_(s32) = COPY $vgpr0
  %1:_(s32) = COPY $vgpr1
  %2:_(s64) = G_MERGE_VALUES %0:_(s32), %1:_(s32)
  %3:_(s32) = COPY $vgpr2
  %4:_(s32) = COPY $vgpr3
  %5:_(s64) = G_MERGE_VALUES %3:_(s32), %4:_(s32)
  %6:_(s64) = G_CONSTANT i64 -4294967296
  %7:_(s64) = G_AND %5:_, %6:_
  %11:_(s32), %13:_(s32) = G_UNMERGE_VALUES %2:_(s64)
  %12:_(s32), %14:_(s32) = G_UNMERGE_VALUES %7:_(s64)
  %15:_(s64) = G_CONSTANT i64 0
  %16:_(s32), %17:_(s32) = G_UNMERGE_VALUES %15:_(s64)
  %18:_(s32) = G_MUL %11:_, %14:_
  %8:_(s64) = G_MUL %2:_, %7:_
  %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %8:_(s64)
  $vgpr0 = COPY %9:_(s32)
  $vgpr1 = COPY %10:_(s32)
  SI_RETURN implicit $vgpr0, implicit $vgpr1

# End machine code for function test_mad.

So, I think, by applying your pattern matching to the register itself, it tries to find the last instruction that uses the constant 0 which is the G_UNMERGE_VALUES itself in llvm::getConstantVRegValWithLookThrough.

OutOfCache added inline comments.Feb 14 2023, 7:34 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3007

I tried following the suggestion of using mi_match. Unfortunately, it did not work. There is no i32 G_CONSTANT that can be matched.

The LocalAccum has the result of the G_MAD, which is not using a G_CONSTANT, even if the result is 0.

%16:_(s64), %17:_(s1) = G_AMDGPU_MAD_U64_U32 %2:_(s32), %12:_, %15:_
%18:_(s32), %19:_(s32) = G_UNMERGE_VALUES %16:_(s64)

Essentially we are looking at %18 and %19. If any of these are all zeroes, which we only know using the Known Bits Analysis (to my knowledge), we can save the following G_ADDs here:

%20:_(s32) = G_MUL %2:_, %14:_
%21:_(s32) = G_ADD %19:_, %20:_
%22:_(s32) = G_MUL %3:_, %12:_
%23:_(s32) = G_ADD %21:_, %22:_

The Multiplication Arguments are within VGPRs, for which the mi_match does not work.

Thank you for the suggestion, though, @arsenm! It would have been more efficient if I managed to make it work. Plus, I learned a lot along the way.

In case I should try something else, feel free to let me know.

I apologize for the delay regarding my answer, I took time off for my exams.

Can you please add some MIR test cases in addition to your LLVM IR tests? You already put some example code in the discussion, but I believe a test that shows the transformation on a GMIR level could be helpful.

Other than that, it looks good to me.

Added MIR Tests for gfx10 and gfx11.

  • [GISel] Adding KnownBitsAnalysis to Legalizer
  • [AMDGPU] Wide multiplies with Known Bits Analysis
  • [AMDGPU] Improved wide multiplies tests
  • [GISel/AMDGPU] caching results of isZero()
  • [AMDGPU] Inlining conditions in buildMultiply
  • [AMDGPU] Removing redundant zero-check for mults
  • [AMDGPU] reverting separation of for-loops
  • [AMDGPU] Update Tests for multiplies using KBA
tsymalla added inline comments.Feb 21 2023, 1:35 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.mir
10

Can you please pre-commit the tests based on the new test names so they don't show up in the diff?

Btw, it seems like gfx10 and gfx11 always get the same changes in the GMIR test. In that case, I'd rather remove the gfx11 tests and just execute one test. This keeps the GMIR test less cluttered.

Remove gfx11 from MIR tests

  • [GISel] Adding KnownBitsAnalysis to Legalizer
  • [AMDGPU] Wide multiplies with Known Bits Analysis
  • [AMDGPU] Improved wide multiplies tests
  • [GISel/AMDGPU] caching results of isZero()
  • [AMDGPU] Inlining conditions in buildMultiply
  • [AMDGPU] Removing redundant zero-check for mults
  • [AMDGPU] reverting separation of for-loops
  • [AMDGPU] Update Tests for multiplies using KBA
OutOfCache marked 2 inline comments as done.Feb 21 2023, 8:15 AM
This revision was not accepted when it landed; it landed in state Needs Review.Feb 22 2023, 7:40 AM
This revision was automatically updated to reflect the committed changes.