This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][DAG] Only apply trunc/shift combine to 16 bit types
Abandoned · Public

Authored by Pierre-vh on Oct 13 2022, 4:58 AM.

Details

Summary

Previously, the combine checked for a scalar size < 32, probably assuming that anything below 32 bits would be 16 bits.
However, odd integer types like i26 exist and are legal. Don't apply the combine in those cases.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Pierre-vh created this revision. Oct 13 2022, 4:58 AM

Herald added a project: Restricted Project. Oct 13 2022, 4:58 AM

Herald added subscribers: kosarev, foad, kerbowa and 7 others.

Pierre-vh requested review of this revision. Oct 13 2022, 4:58 AM

Herald added a project: Restricted Project. Oct 13 2022, 4:58 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B191945: Diff 467446. Oct 13 2022, 5:35 AM

What are you trying to solve here? alignbit isn't preferable to 32-bit shifts, and the shift doesn't seem to be wrong

Inline comments (arsenm):

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp, line 3240: Legal 16-bit isn't the point here, it's to avoid the 64-bit shift. Even if we didn't have 16-bit types we would want the combine.

llvm/test/CodeGen/AMDGPU/partial-shift-shrink.ll, line 157: The alignbit and shift are equally fast, and the shift is easier to understand.

This is trying to solve a miscompilation, but it might be the wrong fix. I know for sure this combine is responsible for a miscompilation in a sample, as disabling it/removing it/applying this fix resolves the issue.

This is the DAG being wrongly combined:

      t26: i64 = srl t23, Constant:i32<26>
      t28: i64 = mul nuw nsw t22, Constant:i64<12345678>
    t29: i64 = add nuw nsw t26, t28
  t32: i64 = srl t29, Constant:i32<25>
t33: i26 = truncate t32

Combining: t33: i26 = truncate t32
Creating new node: t82: i32 = truncate t29
Creating new node: t83: i32 = srl t82, Constant:i32<25>
Creating new node: t84: i26 = truncate t83
 ... into: t84: i26 = truncate t83

srl is a shift-right, so it doesn't seem right to me that we truncate the source to 32 bits as it changes the output.
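
To make the concern concrete, here is a minimal standalone sketch that models the two DAGs from the debug output with plain 64-bit arithmetic. It does not use any LLVM API; the helper names (truncToI26, originalDag, combinedDag) are hypothetical and only exist for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Keep only the low 26 bits, modelling a truncate to i26.
constexpr uint64_t truncToI26(uint64_t V) { return V & ((1ULL << 26) - 1); }

// Original DAG: i26 (trunc (srl i64:x, 25)). The result reads bits 25..50 of x.
constexpr uint64_t originalDag(uint64_t X) { return truncToI26(X >> 25); }

// Combined DAG: i26 (trunc (srl (i32 (trunc x)), 25)).
// The i32 truncate drops bits 32..63 before the shift can bring them down.
constexpr uint64_t combinedDag(uint64_t X) {
  return truncToI26(static_cast<uint32_t>(X) >> 25);
}

// Bit 40 of the source reaches the original result but is lost after the combine.
static_assert(originalDag(1ULL << 40) == (1ULL << 15), "original keeps bit 40");
static_assert(combinedDag(1ULL << 40) == 0, "combined form drops bit 40");

// A source whose set bits all lie below bit 32 is unaffected.
static_assert(originalDag(0xA000000) == combinedDag(0xA000000), "low bits agree");
```

The two forms only agree when every source bit the result reads lies below bit 32, which is not the case for a shift by 25 followed by a truncate to i26.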

Maybe the error is in the VT of the first trunc? Shouldn't we have at least 2*K for shift-right?
Maybe this is a better fix? This also works to fix the miscompilation I'm seeing.

EVT MidScalarTy = MVT::i32;

// For right shifts, ensure the VT of the shift source is wide
// enough that we don't lose bits in the result.
if (Src.getOpcode() == ISD::SRL || Src.getOpcode() == ISD::SRA) {
  // Don't risk losing info if we don't know the shift amount.
  if (!Known.isConstant())
    return SDValue();

  const uint64_t ScalarWidth = Known.getConstant().getZExtValue() * 2;
  if (ScalarWidth >= 64)
    return SDValue();

  MidScalarTy = EVT::getIntegerVT(*DAG.getContext(), ScalarWidth);
}

EVT MidVT = VT.isVector() ?
  EVT::getVectorVT(*DAG.getContext(), MidScalarTy,
                   VT.getVectorNumElements()) : MidScalarTy;
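
One way to sanity-check a candidate intermediate width is to model the narrowing with plain integers. The sketch below is not LLVM code and its helper names (lowMask, srlWide, srlViaMid, minMidWidth) are hypothetical. It assumes a logical right shift by a constant K smaller than the intermediate width. Under that model the result reads source bits K through K + Size - 1, so the intermediate type needs at least K + Size bits; for K = 25 and an i26 result, a width of 2 * K = 50 falls one bit short.

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low Bits bits set (Bits >= 1; 64 or more means all bits).
constexpr uint64_t lowMask(unsigned Bits) {
  return Bits >= 64 ? ~0ULL : ((1ULL << Bits) - 1);
}

// Models the unnarrowed form: iSize (trunc (srl i64:x, K)).
constexpr uint64_t srlWide(uint64_t X, unsigned K, unsigned Size) {
  return (X >> K) & lowMask(Size);
}

// Models the narrowed form: iSize (trunc (srl (iMid (trunc x)), K)).
// Assumption: K < Mid, so the shift is well defined in the narrower type.
constexpr uint64_t srlViaMid(uint64_t X, unsigned K, unsigned Size,
                             unsigned Mid) {
  return ((X & lowMask(Mid)) >> K) & lowMask(Size);
}

// Narrowest intermediate width that keeps every source bit the result reads.
constexpr unsigned minMidWidth(unsigned K, unsigned Size) { return K + Size; }

// K = 25, Size = 26: the result's top bit comes from source bit 50.
static_assert(srlWide(1ULL << 50, 25, 26) == (1ULL << 25), "wide form");
static_assert(srlViaMid(1ULL << 50, 25, 26, 50) == 0, "i50 drops bit 50");
static_assert(srlViaMid(1ULL << 50, 25, 26, 51) == (1ULL << 25), "i51 keeps it");
```

Whether a width such as i51 is worth creating at all is a separate question; the sketch only shows which source bits must survive the first truncate.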

The checks are meant to ensure that the shift doesn't cross the 32-bit boundary. Here is the broken case: https://alive2.llvm.org/ce/z/P6iBze

The shift amount check seems to be wrong. I think the correct condition is ShiftAmt <= (32 - VT.getScalarSizeInBits()) https://alive2.llvm.org/ce/z/uYZ9tq

I'm not sure whether alive2 still has the abstract condition checks that the old version had.

> The shift amount check seems to be wrong. I think the correct condition is ShiftAmt <= (32 - VT.getScalarSizeInBits()) https://alive2.llvm.org/ce/z/uYZ9tq

For right shifts. For left shifts it's size.
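
The right-shift condition can be cross-checked outside of alive2 with a small brute-force model. The sketch below is a hedged illustration, not LLVM code: all function names are hypothetical, shifts are modelled on raw bit patterns, and only single-bit patterns and their complements are probed, for shift amounts 0..31 and result widths 1..31.

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low Bits bits set, for Bits in 1..63.
constexpr uint64_t lowBits(unsigned Bits) { return (1ULL << Bits) - 1; }

// Arithmetic shift right on a 64-bit pattern, K in 0..63.
constexpr uint64_t sra64(uint64_t V, unsigned K) {
  return (V >> K) | ((V >> 63) ? ~(~0ULL >> K) : 0ULL);
}

// Arithmetic shift right on a 32-bit pattern, K in 0..31.
constexpr uint32_t sra32(uint32_t V, unsigned K) {
  return (V >> K) | ((V >> 31) ? ~(UINT32_MAX >> K) : 0u);
}

// True if truncating X to 32 bits before the right shift yields the same
// Size-bit result as shifting the full 64-bit value, for both srl and sra.
bool sameOn(uint64_t X, unsigned K, unsigned Size) {
  const uint32_t Lo = static_cast<uint32_t>(X);
  const bool SrlOk = ((X >> K) & lowBits(Size)) == ((Lo >> K) & lowBits(Size));
  const bool SraOk =
      (sra64(X, K) & lowBits(Size)) == (sra32(Lo, K) & lowBits(Size));
  return SrlOk && SraOk;
}

// Probe every single-bit pattern and its complement.
bool narrowingIsSafe(unsigned K, unsigned Size) {
  for (unsigned B = 0; B < 64; ++B) {
    const uint64_t Bit = 1ULL << B;
    if (!sameOn(Bit, K, Size) || !sameOn(~Bit, K, Size))
      return false;
  }
  return true;
}

// The condition suggested in the review for right shifts (Size < 32).
bool proposedCondition(unsigned K, unsigned Size) { return K <= 32 - Size; }

// Compare the brute-force answer with the condition over the whole grid.
bool conditionMatchesBruteForce() {
  for (unsigned K = 0; K < 32; ++K)
    for (unsigned Size = 1; Size < 32; ++Size)
      if (narrowingIsSafe(K, Size) != proposedCondition(K, Size))
        return false;
  return true;
}
```

In this model the i16 case with K = 16 passes, while the i26 case with K = 25 from the DAG dump does not. Left shifts are deliberately left out of the sketch.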

GlobalISel also seems to have half-ported this combine. For some reason it only handles shl.

Pierre-vh abandoned this revision. Oct 17 2022, 12:54 AM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp	2 lines
llvm/test/CodeGen/AMDGPU/partial-shift-shrink.ll	23 lines

Diff 467446

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp

(Preceding lines not shown; enclosing context: if (auto K = isConstOrConstSplat(Src.getOperand(1))) {)

       }
     }
   }
 
   // Partially shrink 64-bit shifts to 32-bit if reduced to 16-bit.
   //
   // i16 (trunc (srl i64:x, K)), K <= 16 ->
   //     i16 (trunc (srl (i32 (trunc x), K)))
-  if (VT.getScalarSizeInBits() < 32) {
+  if (VT.getScalarSizeInBits() == 16) {
     EVT SrcVT = Src.getValueType();
     if (SrcVT.getScalarSizeInBits() > 32 &&
         (Src.getOpcode() == ISD::SRL ||
          Src.getOpcode() == ISD::SRA ||
          Src.getOpcode() == ISD::SHL)) {
       SDValue Amt = Src.getOperand(1);
       KnownBits Known = DAG.computeKnownBits(Amt);
       unsigned Size = VT.getScalarSizeInBits();

(Remaining lines not shown.)

llvm/test/CodeGen/AMDGPU/partial-shift-shrink.ll

(Preceding lines not shown.)

 ; GCN-NEXT: v_and_b32_e32 v2, 31, v2
 ; GCN-NEXT: v_lshrrev_b64 v[0:1], v2, v[0:1]
 ; GCN-NEXT: s_setpc_b64 s[30:31]
   %amt.masked = and i64 %amt, 31
   %shift = lshr i64 %x, %amt.masked
   %trunc = trunc i64 %shift to i16
   ret i16 %trunc
 }
+
+; Checks that we don't blindly apply the combine on anything <32.
+; It's completely possible to trunc to weird integer types like i26
+; as an intermediate step of a bigger computation.
+;
+; Thus, we should have an alignbit here and not a lshrrev
+define i32 @trunc_srl_i64_25_to_i26(i64 %x) {
+; GCN-LABEL: trunc_srl_i64_25_to_i26:
+; GCN: ; %bb.0:
+; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT: v_and_b32_e32 v0, 0xa000000, v0
+; GCN-NEXT: v_alignbit_b32 v0, 0, v0, 25
+; GCN-NEXT: v_add_u32_e32 v0, 55, v0
+; GCN-NEXT: s_setpc_b64 s[30:31]
+  %value.knownbits2 = and i64 %x, 167772160 ; 0xA000000
+  %shift = lshr i64 %value.knownbits2, 25
+  %trunc = trunc i64 %shift to i26
+  %add = add i26 %trunc, 55
+  %ext = zext i26 %add to i32
+  ret i32 %ext
+}