This is an archive of the discontinued LLVM Phabricator instance.

Differential D117406

[DAGCombiner] Adjust some checks in DAGCombiner::reduceLoadWidth
ClosedPublic

Authored by bjope on Jan 15 2022, 2:20 PM.

Details

Reviewers

spatel
nemanjai
samparker

Commits

rG46cacdbb21c2: [DAGCombiner] Adjust some checks in DAGCombiner::reduceLoadWidth

Summary

In code review for D117104 two slightly weird checks were found
in DAGCombiner::reduceLoadWidth. They were typically checking
if BitsA was a multiple of BitsB by looking at (BitsA & (BitsB - 1)),
but such a comparison actually only makes sense if BitsB is a power
of two.
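
To make the problem with the mask test concrete, here is a minimal sketch comparing it against a plain remainder test. The helper names `maskTest` and `isMultipleOf` are made up for this illustration and are not part of DAGCombiner.

```cpp
#include <cassert>
#include <cstdint>

// Reference test: true if Bits is a multiple of Unit, for any non-zero Unit.
bool isMultipleOf(uint64_t Bits, uint64_t Unit) {
  assert(Unit != 0 && "unit width must be non-zero");
  return Bits % Unit == 0;
}

// The shortcut used by the old checks. It only agrees with isMultipleOf
// when Unit is a power of two.
bool maskTest(uint64_t Bits, uint64_t Unit) {
  assert(Unit != 0 && "unit width must be non-zero");
  return (Bits & (Unit - 1)) == 0;
}
```

With a power-of-two unit such as 16 the two tests agree (64 passes both, 40 fails both). With a unit of 24 they disagree in both directions: 48 is a multiple of 24 but fails the mask test, while 32 passes the mask test without being a multiple of 24.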

The checks were related to the code that attempted to shrink a load
based on the fact that the loaded value would be right shifted.

Afaict the legality of the value types is checked later (typically in
isLegalNarrowLdSt), so the existing checks were both overly
conservative and wrong whenever ExtVTBits wasn't a power of two.
(The latter situation is triggered by a number of lit tests, so we
could not just assert on ExtVTBits being a power of two.)

When attempting to simply remove the checks I found some problems
that seem to have been guarded by the checks (maybe just out of
luck). A typical example would be a pattern like this:

t1 = load i96* ptr
t2 = srl t1, 64
t3 = truncate t2 to i64

When DAGCombine is visiting the truncate, reduceLoadWidth is called
attempting to narrow the load to 64 bits (ExtVT := MVT::i64). Then
the SRL is detected and we set ShAmt to 64.

In the past we've bailed out due to i96 not being a multiple of 64.
If we simply remove that check then we would end up replacing the
load with a new load that would read 64 bits but with a base pointer
adjusted by 64 bits. So we would read 32 bits that weren't accessed by
the original load.

This patch will instead utilize the fact that the logical right shift
can be folded away by using a zextload. Thus, the pattern above will
now be combined into

t3 = load i32* ptr+offset, zext to i64
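
The width and offset arithmetic behind this can be sketched as follows. This is only an illustration of the reasoning in the summary, not the code in DAGCombiner: the `NarrowedLoad` struct and the `narrowShiftedLoad` function are hypothetical, and the sketch assumes a little-endian layout and a shift amount that is a whole number of bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch; not an LLVM data structure.
struct NarrowedLoad {
  uint64_t WidthInBits; // number of bits the narrowed load reads
  uint64_t ByteOffset;  // pointer adjustment (little-endian layout assumed)
  bool ZeroExtend;      // true when the load is narrower than the wanted type
};

// MemoryWidth: bits read by the original load.
// ShAmt:       constant right-shift amount applied to the loaded value.
// WantedBits:  width of the type requested by the truncate.
NarrowedLoad narrowShiftedLoad(uint64_t MemoryWidth, uint64_t ShAmt,
                               uint64_t WantedBits) {
  assert(ShAmt < MemoryWidth && "shift discards every loaded bit");
  assert(ShAmt % 8 == 0 && "sketch only handles whole-byte shifts");
  // Only the bits above the shift amount survive the shift, so never read
  // more than that; otherwise we would touch memory the original load did not.
  uint64_t BitsLeft = MemoryWidth - ShAmt;
  uint64_t Width = std::min(WantedBits, BitsLeft);
  return {Width, ShAmt / 8, Width < WantedBits};
}
```

For the i96 pattern, `narrowShiftedLoad(96, 64, 64)` yields a 32-bit load at byte offset 8 that must be zero-extended to i64, which corresponds to the zextload shown above.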

Another case is shown in the new X86/combine-srl-load.ll test case:

t1 = load i32* ptr
t2 = srl i32 t1, 8
t3 = truncate t2 to i16

In the past we bailed out due to the shift count (8) not being a
multiple of 16. Now the narrowing kicks in and we get

t3 = load i16* ptr+offset
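
As a sanity check of why this narrowing is sound, the following sketch models memory as a byte array and compares the wide load + shift + truncate against the narrow load. The helper names are invented for this example, and a little-endian layout is assumed (on a big-endian target the offset would differ).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assemble NumBytes bytes starting at Mem[Offset] as a little-endian integer.
uint64_t loadLE(const uint8_t *Mem, size_t Offset, size_t NumBytes) {
  assert(NumBytes <= 8 && "at most 64 bits fit in the result");
  uint64_t V = 0;
  for (size_t I = 0; I < NumBytes; ++I)
    V |= static_cast<uint64_t>(Mem[Offset + I]) << (8 * I);
  return V;
}

// t1 = load i32; t2 = srl t1, 8; t3 = truncate t2 to i16
uint16_t wideLoadShiftTrunc(const uint8_t *Mem) {
  uint32_t T1 = static_cast<uint32_t>(loadLE(Mem, 0, 4));
  uint32_t T2 = T1 >> 8;
  return static_cast<uint16_t>(T2);
}

// t3 = load i16 from ptr+1 (shift of 8 bits == 1 byte on little-endian)
uint16_t narrowLoad(const uint8_t *Mem) {
  return static_cast<uint16_t>(loadLE(Mem, 1, 2));
}
```

Both functions read only bytes that the original 32-bit load covered, and for any four-byte buffer they return the same 16-bit value.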

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

bjope created this revision. Jan 15 2022, 2:20 PM

Herald added subscribers: ecnelises, steven.zhang, hiraditya. Jan 15 2022, 2:20 PM

bjope requested review of this revision. Jan 15 2022, 2:20 PM

Herald added a project: Restricted Project. Jan 15 2022, 2:20 PM

bjope added a parent revision: D117104: [DAGCombine] Refactor DAGCombiner::ReduceLoadWidth. NFCI. Jan 15 2022, 2:20 PM

Harbormaster completed remote builds in B143619: Diff 400319. Jan 15 2022, 2:21 PM

bjope mentioned this in D117104: [DAGCombine] Refactor DAGCombiner::ReduceLoadWidth. NFCI. Jan 15 2022, 2:21 PM

bjope edited the summary of this revision.

bjope added a child revision: D116930: [DAGCombine] Fold SRA of a load into a narrower sign-extending load. Jan 15 2022, 2:28 PM

spatel mentioned this in D117508: [SDAG] add demanded bits transform for bswap. Jan 17 2022, 12:01 PM

spatel added inline comments. Jan 17 2022, 12:05 PM

llvm/test/CodeGen/PowerPC/pr39478.ll
6 ↗	(On Diff #400319)	This is a generic missed fold. I added more tests and a proposed fix: D117508

spatel mentioned this in rGba6485e25fc5: [SDAG] add demanded bits transform for bswap. Jan 17 2022, 3:34 PM

Rebased (now including the BSWAP fold from D117508 to avoid regression in a PPC test case).

Also added a new X86 test as a regression test for new fold when the shift
count isn't a multiple of the narrowed load width.

Herald added a subscriber: pengfei. Jan 18 2022, 2:03 AM

bjope edited the summary of this revision. Jan 18 2022, 2:06 AM

bjope added inline comments. Jan 18 2022, 2:10 AM

llvm/test/CodeGen/X86/combine-srl-load.ll
7 ↗	(On Diff #400773)	I've assumed that DAGCombiner::isLegalNarrowLdSt would have bailed out if this would result in an unaligned memory access for the target (it does use TLI.allowsMemoryAccess). It would perhaps have been nice to have a test case for some target where this would matter. Any suggestions?

bjope added inline comments. Jan 18 2022, 2:27 AM

llvm/test/CodeGen/X86/combine-srl-load.ll
7 ↗	(On Diff #400773)	Well, I've verified it by using riscv64 as a target. Not sure if I need to add a test case for that.

Harbormaster completed remote builds in B143967: Diff 400773. Jan 18 2022, 2:49 AM

spatel added inline comments. Jan 18 2022, 9:45 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12227	happend -> happen
12233–12234	Why set these variables if we are exiting the function? Can we specify in the comment which function does the narrowing of the load for this pattern?
llvm/test/CodeGen/X86/combine-srl-load.ll
4 ↗	(On Diff #400773)	It would be better to pre-commit this test with the baseline CHECKs. I don't think there's a specific test file for x86 used for verifying this type of fold, but "shift-folding.ll" has a test that looks similar, so you could add it to that file.

bjope added inline comments. Jan 18 2022, 10:32 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12233–12234	Oh, that return statement looks like a mistake. I started out by simply returning here, but then I realized that it should be OK to use ZEXTLOAD here, as long as the assert about replacing sextload by zextload never triggers. The idea was to remove the return statement again and not include it in the patch. The example with `(i64 (truncate (i96 (srl (load x), 64))))` would however be optimized also without this special case. But then it would start off by first rewriting srl+load into a larger zextload, which is later combined with the truncate to form a smaller zextload. By removing the return here we would trigger the full fold already based on the truncate.

bjope mentioned this in D117588: Pre-commit test case for trunc+lshr+load folds. Jan 18 2022, 10:49 AM

Rebased given parent patch with test cases for pre-commit (D117588).

Harbormaster completed remote builds in B144065: Diff 400915. Jan 18 2022, 10:54 AM

LGTM

This revision is now accepted and ready to land. Jan 18 2022, 12:06 PM

This revision was landed with ongoing or failed builds. Jan 24 2022, 3:24 AM

Closed by commit rG46cacdbb21c2: [DAGCombiner] Adjust some checks in DAGCombiner::reduceLoadWidth (authored by bjope).

This revision was automatically updated to reflect the committed changes.

bjope mentioned this in rG12a499eb00e3: Pre-commit test case for trunc+lshr+load folds.

bjope added a commit: rG46cacdbb21c2: [DAGCombiner] Adjust some checks in DAGCombiner::reduceLoadWidth.

Revision Contents

Path	Size
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	25 lines
llvm/test/CodeGen/ARM/shift-combine.ll	20 lines
llvm/test/CodeGen/X86/shift-folding.ll	4 lines

Diff 402449

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[... 12,208 lines not shown ...]
   if (Opc == ISD::SRL || N0.getOpcode() == ISD::SRL) {
     auto *SRL1C = dyn_cast<ConstantSDNode>(SRL.getOperand(1));
     if (!SRL1C || !LN)
       return SDValue();

     // If the shift amount is larger than the input type then we're not
     // accessing any of the loaded bytes. If the load was a zextload/extload
     // then the result of the shift+trunc is zero/undef (handled elsewhere).
     ShAmt = SRL1C->getZExtValue();
-    if (ShAmt >= LN->getMemoryVT().getSizeInBits())
+    uint64_t MemoryWidth = LN->getMemoryVT().getSizeInBits();
+    if (ShAmt >= MemoryWidth)
       return SDValue();

     // Because a SRL must be assumed to need to zero-extend the high bits
     // (as opposed to anyext the high bits), we can't combine the zextload
     // lowering of SRL and an sextload.
     if (LN->getExtensionType() == ISD::SEXTLOAD)
       return SDValue();

-    unsigned ExtVTBits = ExtVT.getScalarSizeInBits();
-    // Is the shift amount a multiple of size of ExtVT?
-    if ((ShAmt & (ExtVTBits - 1)) != 0)
-      return SDValue();
-    // Is the load width a multiple of size of ExtVT?
-    if ((SRL.getScalarValueSizeInBits() & (ExtVTBits - 1)) != 0)
-      return SDValue();
+    // Avoid reading outside the memory accessed by the original load (could
+    // happened if we only adjust the load base pointer by ShAmt). Instead we
+    // try to narrow the load even further. The typical scenario here is:
+    //   (i64 (truncate (i96 (srl (load x), 64)))) ->
+    //     (i64 (truncate (i96 (zextload (load i32 + offset) from i32))))
+    if (ExtVT.getScalarSizeInBits() > MemoryWidth - ShAmt) {
+      // Don't replace sextload by zextload.
+      if (ExtType == ISD::SEXTLOAD)
+        return SDValue();
+      // Narrow the load.
+      ExtType = ISD::ZEXTLOAD;
+      ExtVT = EVT::getIntegerVT(*DAG.getContext(), MemoryWidth - ShAmt);
+    }

     // If the SRL is only used by a masking AND, we may be able to adjust
     // the ExtVT to make the AND redundant.
     SDNode *Mask = *(SRL->use_begin());
     if (SRL.hasOneUse() && Mask->getOpcode() == ISD::AND &&
         isa<ConstantSDNode>(Mask->getOperand(1))) {
       const APInt& ShiftMask = Mask->getConstantOperandAPInt(1);
       if (ShiftMask.isMask()) {
         EVT MaskedVT = EVT::getIntegerVT(*DAG.getContext(),
                                          ShiftMask.countTrailingOnes());
         // If the mask is smaller, recompute the type.
-        if ((ExtVTBits > MaskedVT.getScalarSizeInBits()) &&
+        if ((ExtVT.getScalarSizeInBits() > MaskedVT.getScalarSizeInBits()) &&
             TLI.isLoadExtLegal(ExtType, SRL.getValueType(), MaskedVT))
           ExtVT = MaskedVT;
       }
     }

     N0 = SRL.getOperand(0);
   }

[... 11,830 lines not shown ...]

llvm/test/CodeGen/ARM/shift-combine.ll

[... 296 lines not shown ...]
 define arm_aapcscc i32 @test_lshr_load64_4_unaligned(i64* %a) {
 ; CHECK-ARM-LABEL: test_lshr_load64_4_unaligned:
 ; CHECK-ARM: @ %bb.0: @ %entry
 ; CHECK-ARM-NEXT: ldr r0, [r0, #2]
 ; CHECK-ARM-NEXT: bx lr
 ;
 ; CHECK-BE-LABEL: test_lshr_load64_4_unaligned:
 ; CHECK-BE: @ %bb.0: @ %entry
-; CHECK-BE-NEXT: ldr r1, [r0]
-; CHECK-BE-NEXT: ldrh r0, [r0, #4]
-; CHECK-BE-NEXT: orr r0, r0, r1, lsl #16
+; CHECK-BE-NEXT: ldr r0, [r0, #2]
 ; CHECK-BE-NEXT: bx lr
 ;
 ; CHECK-THUMB-LABEL: test_lshr_load64_4_unaligned:
 ; CHECK-THUMB: @ %bb.0: @ %entry
 ; CHECK-THUMB-NEXT: ldr.w r0, [r0, #2]
 ; CHECK-THUMB-NEXT: bx lr
 ;
 ; CHECK-ALIGN-LABEL: test_lshr_load64_4_unaligned:
[... 20 lines not shown ...]
 define arm_aapcscc i32 @test_lshr_load64_1_lsb(i64* %a) {
 ; CHECK-ARM-LABEL: test_lshr_load64_1_lsb:
 ; CHECK-ARM: @ %bb.0: @ %entry
 ; CHECK-ARM-NEXT: ldr r0, [r0, #3]
 ; CHECK-ARM-NEXT: bx lr
 ;
 ; CHECK-BE-LABEL: test_lshr_load64_1_lsb:
 ; CHECK-BE: @ %bb.0: @ %entry
-; CHECK-BE-NEXT: ldr r1, [r0]
-; CHECK-BE-NEXT: ldrb r0, [r0, #4]
-; CHECK-BE-NEXT: orr r0, r0, r1, lsl #8
+; CHECK-BE-NEXT: ldr r0, [r0, #1]
 ; CHECK-BE-NEXT: bx lr
 ;
 ; CHECK-THUMB-LABEL: test_lshr_load64_1_lsb:
 ; CHECK-THUMB: @ %bb.0: @ %entry
 ; CHECK-THUMB-NEXT: ldr.w r0, [r0, #3]
 ; CHECK-THUMB-NEXT: bx lr
 ;
 ; CHECK-ALIGN-LABEL: test_lshr_load64_1_lsb:
[... 81 lines not shown ...]
 entry:
   %1 = lshr i64 %0, 48
   %conv = trunc i64 %1 to i32
   ret i32 %conv
 }

 define arm_aapcscc i32 @test_lshr_load4_fail(i64* %a) {
 ; CHECK-ARM-LABEL: test_lshr_load4_fail:
 ; CHECK-ARM: @ %bb.0: @ %entry
-; CHECK-ARM-NEXT: ldrd r0, r1, [r0]
-; CHECK-ARM-NEXT: lsr r0, r0, #8
-; CHECK-ARM-NEXT: orr r0, r0, r1, lsl #24
+; CHECK-ARM-NEXT: ldr r0, [r0, #1]
 ; CHECK-ARM-NEXT: bx lr
 ;
 ; CHECK-BE-LABEL: test_lshr_load4_fail:
 ; CHECK-BE: @ %bb.0: @ %entry
-; CHECK-BE-NEXT: ldrd r0, r1, [r0]
-; CHECK-BE-NEXT: lsr r1, r1, #8
-; CHECK-BE-NEXT: orr r0, r1, r0, lsl #24
+; CHECK-BE-NEXT: ldr r0, [r0, #3]
 ; CHECK-BE-NEXT: bx lr
 ;
 ; CHECK-THUMB-LABEL: test_lshr_load4_fail:
 ; CHECK-THUMB: @ %bb.0: @ %entry
-; CHECK-THUMB-NEXT: ldrd r0, r1, [r0]
-; CHECK-THUMB-NEXT: lsrs r0, r0, #8
-; CHECK-THUMB-NEXT: orr.w r0, r0, r1, lsl #24
+; CHECK-THUMB-NEXT: ldr.w r0, [r0, #1]
 ; CHECK-THUMB-NEXT: bx lr
 ;
 ; CHECK-ALIGN-LABEL: test_lshr_load4_fail:
 ; CHECK-ALIGN: @ %bb.0: @ %entry
 ; CHECK-ALIGN-NEXT: ldrd r0, r1, [r0]
 ; CHECK-ALIGN-NEXT: lsrs r0, r0, #8
 ; CHECK-ALIGN-NEXT: orr.w r0, r0, r1, lsl #24
 ; CHECK-ALIGN-NEXT: bx lr
[... 437 lines not shown ...]

llvm/test/CodeGen/X86/shift-folding.ll

[... 82 lines not shown ...]
 ; CHECK-NEXT: retl
   ret i32 %xor
 }

 ; Should be possible to adjust the pointer and narrow the load to 16 bits.
 define i16 @srl_load_narrowing1(i32* %arg) {
 ; CHECK-LABEL: srl_load_narrowing1:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: movl {{[0-9]+}}(%esp), %eax
-; CHECK-NEXT: movl (%eax), %eax
-; CHECK-NEXT: shrl $8, %eax
-; CHECK-NEXT: # kill: def $ax killed $ax killed $eax
+; CHECK-NEXT: movzwl 1(%eax), %eax
 ; CHECK-NEXT: retl
   %tmp1 = load i32, i32* %arg, align 1
   %tmp2 = lshr i32 %tmp1, 8
   %tmp3 = trunc i32 %tmp2 to i16
   ret i16 %tmp3
 }

 define i16 @srl_load_narrowing2(i32* %arg) {
[... 11 lines not shown ...]