This is an archive of the discontinued LLVM Phabricator instance.

Differential D110170

[InstCombine] fold cast of right-shift if high bits are not demanded
Closed · Public

Authored by spatel on Sep 21 2021, 6:52 AM.

Details

Reviewers

lebedev.ri
nikic
mnadeem

Commits

rGf32c0fe8e505: [InstCombine] fold cast of right-shift if high bits are not demanded (3rd try)
rGbb9333c3504a: [InstCombine] fold cast of right-shift if high bits are not demanded (2nd try)
rG2f6b07316f56: [InstCombine] fold cast of right-shift if high bits are not demanded

Summary

(masked) trunc (lshr X, ShiftC) --> (masked) lshr (trunc X), ShiftC

Narrowing the shift should be better for analysis and can lead to follow-on transforms as shown.

Attempt at the general proof in Alive2:
https://alive2.llvm.org/ce/z/tRnnSF

Here are a couple of the specific tests:
https://alive2.llvm.org/ce/z/bCnTp-
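
For illustration, the claim behind the fold can also be checked by brute force on small widths. The sketch below is not LLVM code: it models the two instruction orders on plain integers, and every function name in it is hypothetical. It assumes the condition that the patch ended up using (the shift amount is smaller than the narrow width and no larger than the number of leading zeros in the demanded mask).

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low `bits` bits set (bits must be < 32).
static uint32_t lowMask(unsigned bits) { return (1u << bits) - 1u; }

// Original form: lshr in the wide type, then trunc to dstBits.
uint32_t shiftThenTrunc(uint32_t x, unsigned c, unsigned srcBits,
                        unsigned dstBits) {
  assert(dstBits < srcBits && srcBits <= 16 && c < srcBits);
  return ((x & lowMask(srcBits)) >> c) & lowMask(dstBits);
}

// Narrowed form: trunc to dstBits first, then lshr in the narrow type.
uint32_t truncThenShift(uint32_t x, unsigned c, unsigned dstBits) {
  assert(c < dstBits); // a larger shift would be poison in the narrow type
  return (x & lowMask(dstBits)) >> c;
}

// Number of leading zero bits of `mask` viewed as a `bits`-wide value.
unsigned leadingZeros(uint32_t mask, unsigned bits) {
  unsigned n = 0;
  while (n < bits && ((mask >> (bits - 1 - n)) & 1u) == 0)
    ++n;
  return n;
}

// True if both forms produce the same demanded bits for every input x.
bool formsAgreeOnDemandedBits(unsigned c, uint32_t demanded, unsigned srcBits,
                              unsigned dstBits) {
  for (uint32_t x = 0; x <= lowMask(srcBits); ++x) {
    uint32_t wide = shiftThenTrunc(x, c, srcBits, dstBits);
    uint32_t narrow = truncThenShift(x, c, dstBits);
    if ((wide & demanded) != (narrow & demanded))
      return false;
  }
  return true;
}

// Exhaustively checks that the assumed condition is sufficient:
// c < dstBits and c <= leadingZeros(demanded) implies the forms agree.
bool conditionIsSufficient(unsigned srcBits, unsigned dstBits) {
  for (unsigned c = 0; c < dstBits; ++c)
    for (uint32_t demanded = 0; demanded <= lowMask(dstBits); ++demanded)
      if (c <= leadingZeros(demanded, dstBits) &&
          !formsAgreeOnDemandedBits(c, demanded, srcBits, dstBits))
        return false;
  return true;
}
```

With i8 narrowed to i6, conditionIsSufficient(8, 6) holds, while formsAgreeOnDemandedBits(2, 31, 8, 6) does not: a shift by 2 with mask 31 lets a high bit of the source reach a demanded bit. This is only a sanity check on tiny widths; the Alive2 links above remain the actual proof.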

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Sep 21 2021, 6:52 AM

Herald added subscribers: hiraditya, mcrosier. Sep 21 2021, 6:53 AM

spatel requested review of this revision. Sep 21 2021, 6:53 AM

Herald added a project: Restricted Project. Sep 21 2021, 6:53 AM

Should this be demandedbits-driven?

Harbormaster completed remote builds in B124897: Diff 373900. Sep 21 2021, 7:05 AM

In D110170#3012379, @lebedev.ri wrote:

Should this be demandedbits-driven?

Yes - I thought about that while drafting this, but I didn't follow through. I'll conjure some new tests and try that.

spatel mentioned this in rG08ef71ca92d9: [InstCombine] move/add tests for trunc-of-lshr; NFC. Sep 21 2021, 9:11 AM

Patch updated:
Generalized to be a demanded-bits fold. I added tests ending in 'or' rather than just 'and'. ( https://alive2.llvm.org/ce/z/TfaHnb )
I think we already fold other cases (for example, ending with a left-shift).
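
As a rough sketch of what "demanded-bits fold" means for the 'and' and 'or' users mentioned here: an 'and' with a constant only lets through the bits where the constant is 1, and an 'or' with a constant hides the bits where the constant is 1. The code below is a simplified model, not the InstCombine implementation. It assumes the result of the 'and'/'or' is itself fully demanded, all names are hypothetical, and the constants in the usage note are taken from trunc-demand.ll in this revision.

```cpp
#include <cassert>
#include <cstdint>

// The single user of the truncated value, with a constant second operand.
enum class UserOp { And, Or };

// Bits of the first operand that can affect `op operand, constant`
// in a `bits`-wide type, assuming every result bit is demanded.
uint32_t demandedFromUser(UserOp op, uint32_t constant, unsigned bits) {
  assert(bits > 0 && bits < 32);
  const uint32_t all = (1u << bits) - 1u;
  constant &= all;
  // and: only positions where the constant is 1 pass through.
  // or: positions where the constant is 1 are forced to 1, so only the
  //     remaining positions depend on the operand.
  return op == UserOp::And ? constant : (~constant & all);
}

// Number of leading zero bits of `mask` viewed as a `bits`-wide value.
unsigned countLeadingZeros(uint32_t mask, unsigned bits) {
  unsigned n = 0;
  while (n < bits && ((mask >> (bits - 1 - n)) & 1u) == 0)
    ++n;
  return n;
}

// Model of the legality check: the shift amount must be valid in the
// narrow type and must not exceed the undemanded high bits of the result.
bool canNarrowShift(unsigned shiftAmt, uint32_t demanded,
                    unsigned narrowBits) {
  return shiftAmt < narrowBits &&
         shiftAmt <= countLeadingZeros(demanded, narrowBits);
}
```

For the i6 tests, demandedFromUser(UserOp::Or, 60, 6) is 3, so a shift by 4 can be narrowed (or_trunc_lshr_more), whereas an 'or' with 56 demands three low bits and blocks a shift by 4 (or_trunc_lshr_small_mask). A shift by 7 is rejected for i7 regardless of the mask (trunc_lshr_big_shift).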

spatel retitled this revision from [InstCombine] fold cast between shift and mask to [InstCombine] fold cast of right-shift if high bits are not demanded. Sep 21 2021, 10:01 AM

spatel edited the summary of this revision.

lebedev.ri added inline comments. Sep 21 2021, 10:17 AM

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
397	I think this is only part of the check. https://alive2.llvm.org/ce/z/59fb7i => https://alive2.llvm.org/ce/z/s_G7u7

Looks fine to me.

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
397	No, ignore me.

This revision is now accepted and ready to land. Sep 21 2021, 10:31 AM

Harbormaster completed remote builds in B124946: Diff 373972. Sep 21 2021, 10:32 AM

This revision was landed with ongoing or failed builds. Sep 21 2021, 1:10 PM

Closed by commit rG2f6b07316f56: [InstCombine] fold cast of right-shift if high bits are not demanded (authored by spatel).

This revision was automatically updated to reflect the committed changes.

spatel added a commit: rG2f6b07316f56: [InstCombine] fold cast of right-shift if high bits are not demanded.

This is causing timeouts on a number of multistage builds. I have seen it at least on PPC, SystemZ and AArch64:
PPC: https://lab.llvm.org/buildbot/#/builders/121/builds/11680
PPC: https://lab.llvm.org/buildbot/#/builders/36/builds/12596
SystemZ: https://lab.llvm.org/buildbot/#/builders/8/builds/1821
AArch64: https://lab.llvm.org/buildbot/#/builders/179/builds/1073

The timeout happens when the stage 1 compiler is building lib/Transforms/Coroutines/CoroFrame.cpp
I'll see if I can track down what is causing the infinite loop/recursion.

I also ran into infinite loops caused by this commit. It's reproducible with https://martin.st/temp/qsettings-preproc.cpp by running clang -target x86_64-w64-mingw32 -c qsettings-preproc.cpp -O3 -std=c++17. That file is not minimized/reduced at all, though (and it takes a significant amount of time to compile to begin with).

uabelho added a subscriber: uabelho. Sep 22 2021, 3:36 AM

spatel added a reverting change: rGc6013f71a455: Revert "[InstCombine] fold cast of right-shift if high bits are not demanded". Sep 22 2021, 4:45 AM

In D110170#3014407, @mstorsjo wrote:

I also ran into infinite loops caused by this commit. It's reproducible with https://martin.st/temp/qsettings-preproc.cpp by running clang -target x86_64-w64-mingw32 -c qsettings-preproc.cpp -O3 -std=c++17. That file is not minimized/reduced at all, though (and it takes a significant amount of time to compile to begin with).

In D110170#3014232, @nemanjai wrote:

This is causing timeouts on a number of multistage builds. I have seen it at least on PPC, SystemZ and AArch64:
PPC: https://lab.llvm.org/buildbot/#/builders/121/builds/11680
PPC: https://lab.llvm.org/buildbot/#/builders/36/builds/12596
SystemZ: https://lab.llvm.org/buildbot/#/builders/8/builds/1821
AArch64: https://lab.llvm.org/buildbot/#/builders/179/builds/1073

The timeout happens when the stage 1 compiler is building lib/Transforms/Coroutines/CoroFrame.cpp
I'll see if I can track down what is causing the infinite loop/recursion.

Thanks for letting me know. I reverted the patch (and the follow-up that fixed a clang test failure at 52832cd917af0), so I'm not holding up those bots while we investigate. I'm trying to reduce Martin's file now, but it is taking a while! If anyone finds a smaller example, please do post it here.

In D110170#3014861, @spatel wrote:

Thanks for letting me know. I reverted the patch (and the follow-up that fixed a clang test failure at 52832cd917af0), so I'm not holding up those bots while we investigate. I'm trying to reduce Martin's file now, but it is taking a while! If anyone finds a smaller example, please do post it here.

I am running a reducer on the CoroFrame.cpp one and it is pretty small so far and still reducing. I'll provide it once it is done in case it turns out to be useful.

Reduced test case at https://pastebin.com/NVYbsRfD

Run opt -O3 -mtriple=powerpc64le-- -disable-output file.ll and it doesn't terminate.

In D110170#3015960, @nemanjai wrote:

Reduced test case at https://pastebin.com/NVYbsRfD

Run opt -O3 -mtriple=powerpc64le-- -disable-output file.ll and it doesn't terminate.

Thanks! It should be easy to find the opposing transform now. I got that test down to an infinite loop with -instcombine and:

declare void @use(i64)

define i64 @t0(i64 %x) {
  %a = ashr i64 %x, 3
  call void @use(i64 %a)
  %tr = trunc i64 %a to i32
  %sh = lshr i32 %tr, 6
  %z = zext i32 %sh to i64
  ret i64 %z
}

spatel mentioned this in rG1cd6b44f267b: [InstCombine] add one-use check to shift-shift transform. Sep 22 2021, 1:32 PM

In D110170#3016261, @spatel wrote:

Thanks! It should be easy to find the opposing transform now. I got that test down to an infinite loop with -instcombine and:

That was easy (I hope)...
We had a transform that was creating extra instructions without checking uses correctly, so we'd end up ping-pong'ing between that and this one. I don't see infinite looping on either of the failure examples posted here after:
1cd6b44f267b

spatel mentioned this in rGc75c5c5f8f37: [CodeGen] update test file to not run the entire LLVM optimizer; NFC. Sep 23 2021, 6:25 AM

spatel added a commit: rGbb9333c3504a: [InstCombine] fold cast of right-shift if high bits are not demanded (2nd try). Sep 23 2021, 6:41 AM

Just a heads-up, I'm seeing timeouts again now with "2nd try" committed.

I'll see if I can pull out a reproducer working on main too.

In D110170#3019717, @uabelho wrote:

Just a heads-up, I'm seeing timeouts again now with "2nd try" committed.

I'll see if I can pull out a reproducer working on main too.

Ok:

opt -o /dev/null -passes='instcombine' hang.ll

with hang.ll being

target datalayout = "n32"

define i32 @f_t15_t01_t09(i40 %x) {
entry:
  store i40 %x, i40* undef, align 1
  %0 = load i40, i40* undef, align 1
  %1 = add i40 %0, 2147483647
  %2 = select i1 undef, i40 %1, i40 %0
  %downscale = ashr i40 %2, 31
  %resize = trunc i40 %downscale to i16
  %resize1 = sext i16 %resize to i32
  %upscale = shl i32 %resize1, 31
  ret i32 %upscale
}

In D110170#3019762, @uabelho wrote:

Ok:
opt -o /dev/null -passes='instcombine' hang.ll
with hang.ll being
target datalayout = "n32"

define i32 @f_t15_t01_t09(i40 %x) {
entry:
  store i40 %x, i40* undef, align 1
  %0 = load i40, i40* undef, align 1
  %1 = add i40 %0, 2147483647
  %2 = select i1 undef, i40 %1, i40 %0
  %downscale = ashr i40 %2, 31
  %resize = trunc i40 %downscale to i16
  %resize1 = sext i16 %resize to i32
  %upscale = shl i32 %resize1, 31
  ret i32 %upscale
}

Thanks! I'll step into this in the debugger now.
Let me know if I should revert.

spatel added a reverting change: rG3c5500907b10: Revert "[InstCombine] fold cast of right-shift if high bits are not demanded…. Sep 24 2021, 7:47 AM

The root bug is in that same block that caused the previous problem.
I think it's time to fix that for good by decomposing it into simpler folds that won't conflict with other transforms.
So I reverted this patch again: 3c5500907b10

spatel mentioned this in rGa47c8e40c734: [InstCombine] fold lshr(trunc(lshr X, C1)) C2. Sep 24 2021, 12:45 PM

spatel mentioned this in rG025a805d7ca2: [InstCombine] match variable names and code comments; NFC. Sep 27 2021, 7:58 AM

spatel mentioned this in rG21429cf43a41: [InstCombine] generalize fold for (trunc (X u>> C1)) u>> C.

spatel mentioned this in rG3fcb00df5dbf: [InstCombine] restrict shift-trunc-shift fold to opposite direction shifts. Sep 30 2021, 12:06 PM

spatel mentioned this in rG3fabd98e5b3e: [InstCombine] fold (trunc (X>>C1)) << C to shift+mask directly. Oct 1 2021, 11:22 AM

spatel mentioned this in rG88a9c1827e8d: [InstCombine] add test for shl + demanded bits; NFC. Oct 3 2021, 7:39 AM

spatel added a commit: rGf32c0fe8e505: [InstCombine] fold cast of right-shift if high bits are not demanded (3rd try).

Hi @spatel,

I noticed a regression in a downstream benchmark that seems to be caused, at least partly, by this patch. Here is a reduced example: https://godbolt.org/z/M9MKjcYPG

From what I can see, there is a quite early run of InstCombine in the O3 pipeline, which happens directly after GlobalOpt without any CSE in between. So in such an early run of InstCombine we trigger transforms based on "one use" that wouldn't have fired if CSE had run before InstCombine. I figure that might be a more general problem and not specific to the rewrites introduced in this patch.

We'll analyse the regression a bit more (maybe other things also contribute to it), but I wanted to mention the above. It makes me curious whether it is a general problem with that early InstCombine run that "one use" checks can be fooled because CSE has not run after GlobalOpt.
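
To make the "one use" concern concrete, here is a toy model of the effect described above. It is not LLVM code and does not reproduce the godbolt example: the instruction representation, the minimal CSE, and all names are hypothetical. It only shows that two textually identical shifts each look single-use until CSE merges them, after which the surviving shift has two uses and a one-use guarded fold would no longer fire.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Toy straight-line IR. An instruction's result id is its index in the
// program. `lhs` is the index of the defining instruction, or -1 when the
// operand is the function argument. `imm` is a constant operand.
struct Inst {
  std::string op;
  int lhs;
  int imm;
};

// Counts how many instructions use each instruction's result.
std::vector<int> useCounts(const std::vector<Inst> &prog) {
  std::vector<int> uses(prog.size(), 0);
  for (const Inst &inst : prog)
    if (inst.lhs >= 0)
      ++uses[inst.lhs];
  return uses;
}

// Minimal CSE: merges instructions with identical (op, lhs, imm) and
// rewrites later operands to refer to the surviving copy.
std::vector<Inst> cse(const std::vector<Inst> &prog) {
  std::map<std::tuple<std::string, int, int>, int> seen;
  std::vector<int> remap(prog.size());
  std::vector<Inst> out;
  for (std::size_t i = 0; i < prog.size(); ++i) {
    Inst inst = prog[i];
    if (inst.lhs >= 0)
      inst.lhs = remap[inst.lhs];
    auto key = std::make_tuple(inst.op, inst.lhs, inst.imm);
    auto it = seen.find(key);
    if (it != seen.end()) {
      remap[i] = it->second; // duplicate: reuse the earlier result
      continue;
    }
    remap[i] = static_cast<int>(out.size());
    seen.emplace(key, remap[i]);
    out.push_back(inst);
  }
  return out;
}

// Two identical shifts of the argument feeding different users.
std::vector<Inst> exampleProgram() {
  return {
      {"lshr", -1, 2}, // %0 = lshr %x, 2
      {"trunc", 0, 6}, // %1 = trunc %0 to i6
      {"and", 1, 15},  // %2 = and %1, 15
      {"lshr", -1, 2}, // %3 = lshr %x, 2   (duplicate of %0)
      {"add", 3, 1},   // %4 = add %3, 1
  };
}
```

Before CSE, useCounts(exampleProgram())[0] is 1, so a fold guarded by a one-use check on the shift would fire. After cse(exampleProgram()) the two shifts are merged and the surviving shift has two uses.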

In D110170#3041159, @bjope wrote:

I noticed a regression in a downstream benchmark that seems to be caused, at least partly, by this patch. Here is a reduced example: https://godbolt.org/z/M9MKjcYPG

From what I can see, there is a quite early run of InstCombine in the O3 pipeline, which happens directly after GlobalOpt without any CSE in between. So in such an early run of InstCombine we trigger transforms based on "one use" that wouldn't have fired if CSE had run before InstCombine. I figure that might be a more general problem and not specific to the rewrites introduced in this patch.

We'll analyse the regression a bit more (maybe other things also contribute to it), but I wanted to mention the above. It makes me curious whether it is a general problem with that early InstCombine run that "one use" checks can be fooled because CSE has not run after GlobalOpt.

Thanks for posting the example. That does seem like a general problem, and it's worth experimenting with the pass manager to see if reordering the passes makes things better or worse.
I'm not sure if we have an IR pass that is responsible for seeing that we have redundant shift ops like in the example. Is that a possible trick for GVN?
Also, I tried running the example through codegen for x86 and AArch64, and they both manage to eliminate the redundant extra shift after legalization. Is it possible that your target is missing a semi-generic SDAG transform?

In D110170#3042818, @spatel wrote:

Thanks for posting the example. That does seem like a general problem, and it's worth experimenting with the pass manager to see if reordering the passes makes things better or worse.
I'm not sure if we have an IR pass that is responsible for seeing that we have redundant shift ops like in the example. Is that a possible trick for GVN?
Also, I tried running the example through codegen for x86 and AArch64, and they both manage to eliminate the redundant extra shift after legalization. Is it possible that your target is missing a semi-generic SDAG transform?

The IR posted in godbolt was a bit reduced, and running the example through codegen gave the same result also for my target.
However, the original IR looked a bit more like this example: https://godbolt.org/z/s8Krzrq36. It shows that the number of instructions in the loop increases from 16 to 19 for x86 when using opt from trunk instead of the 13.0.0 version. And afaict this patch is the main difference.

In D110170#3044059, @bjope wrote:

However, the original IR looked a bit more like this example: https://godbolt.org/z/s8Krzrq36. It shows that the number of instructions in the loop increases from 16 to 19 for x86 when using opt from trunk instead of the 13.0.0 version. And afaict this patch is the main difference.

Ah - the larger example is very interesting if I'm seeing it correctly. We're xor'ing 4 bits from some value?

That could be made significantly shorter in IR or codegen:
https://alive2.llvm.org/ce/z/bWgS_h
https://godbolt.org/z/sP6n13xd3
(Note that the x86 codegen is likely better for a target without popcount! I'll file some bugs tomorrow.)

In D110170#3044059, @bjope wrote:

However, the original IR looked a bit more like this example: https://godbolt.org/z/s8Krzrq36. It shows that the number of instructions in the loop increases from 16 to 19 for x86 when using opt from trunk instead of the 13.0.0 version. And afaict this patch is the main difference.

I agree with your analysis, and I don't mean to dismiss the regression, but we might be able to optimize the larger example even better (either because of or in spite of this patch!).

I filed these bugs:
https://llvm.org/PR52092
https://llvm.org/PR52093
https://llvm.org/PR52094

Revision Contents

Path    Size
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp    20 lines
llvm/test/Transforms/InstCombine/trunc-demand.ll    56 lines

Diff 374018

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp

   case Instruction::Select: {
     if (CanonicalizeSelectConstant(I, 1, DemandedMask) ||
         CanonicalizeSelectConstant(I, 2, DemandedMask))
       return I;

     // Only known if known in both the LHS and RHS.
     Known = KnownBits::commonBits(LHSKnown, RHSKnown);
     break;
   }
-  case Instruction::ZExt:
   case Instruction::Trunc: {
+    // If we do not demand the high bits of a right-shifted and truncated value,
+    // then we may be able to truncate it before the shift.
+    Value *X;
+    const APInt *C;
+    if (match(I->getOperand(0), m_OneUse(m_LShr(m_Value(X), m_APInt(C))))) {
+      // The shift amount must be valid (not poison) in the narrow type, and
+      // it must not be greater than the high bits demanded of the result.
+      if (C->ult(I->getType()->getScalarSizeInBits()) &&
+          C->ule(DemandedMask.countLeadingZeros())) {
+        // trunc (lshr X, C) --> lshr (trunc X), C
+        IRBuilderBase::InsertPointGuard Guard(Builder);
+        Builder.SetInsertPoint(I);
+        Value *Trunc = Builder.CreateTrunc(X, I->getType());
+        return Builder.CreateLShr(Trunc, C->getZExtValue());
+      }
+    }
+  }
+    LLVM_FALLTHROUGH;
+  case Instruction::ZExt: {
     unsigned SrcBitWidth = I->getOperand(0)->getType()->getScalarSizeInBits();

     APInt InputDemandedMask = DemandedMask.zextOrTrunc(SrcBitWidth);
     KnownBits InputKnown(SrcBitWidth);
     if (SimplifyDemandedBits(I, 0, InputDemandedMask, InputKnown, Depth + 1))
       return I;
     assert(InputKnown.getBitWidth() == SrcBitWidth && "Src width changed?");
     Known = InputKnown.zextOrTrunc(BitWidth);

llvm/test/Transforms/InstCombine/trunc-demand.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt < %s -instcombine -S | FileCheck %s

 declare void @use6(i6)
 declare void @use8(i8)

 define i6 @trunc_lshr(i8 %x) {
 ; CHECK-LABEL: @trunc_lshr(
-; CHECK-NEXT: [[S:%.*]] = lshr i8 [[X:%.*]], 2
-; CHECK-NEXT: [[T:%.*]] = trunc i8 [[S]] to i6
-; CHECK-NEXT: [[R:%.*]] = and i6 [[T]], 14
+; CHECK-NEXT: [[TMP1:%.*]] = trunc i8 [[X:%.*]] to i6
+; CHECK-NEXT: [[TMP2:%.*]] = lshr i6 [[TMP1]], 2
+; CHECK-NEXT: [[R:%.*]] = and i6 [[TMP2]], 14
 ; CHECK-NEXT: ret i6 [[R]]
 ;
   %s = lshr i8 %x, 2
   %t = trunc i8 %s to i6
   %r = and i6 %t, 14
   ret i6 %r
 }

+; The 'and' is eliminated.
+
 define i6 @trunc_lshr_exact_mask(i8 %x) {
 ; CHECK-LABEL: @trunc_lshr_exact_mask(
-; CHECK-NEXT: [[S:%.*]] = lshr i8 [[X:%.*]], 2
-; CHECK-NEXT: [[T:%.*]] = trunc i8 [[S]] to i6
-; CHECK-NEXT: [[R:%.*]] = and i6 [[T]], 15
-; CHECK-NEXT: ret i6 [[R]]
+; CHECK-NEXT: [[TMP1:%.*]] = trunc i8 [[X:%.*]] to i6
+; CHECK-NEXT: [[TMP2:%.*]] = lshr i6 [[TMP1]], 2
+; CHECK-NEXT: ret i6 [[TMP2]]
 ;
   %s = lshr i8 %x, 2
   %t = trunc i8 %s to i6
   %r = and i6 %t, 15
   ret i6 %r
 }

+; negative test - a high bit of x is in the result
+
 define i6 @trunc_lshr_big_mask(i8 %x) {
 ; CHECK-LABEL: @trunc_lshr_big_mask(
 ; CHECK-NEXT: [[S:%.*]] = lshr i8 [[X:%.*]], 2
 ; CHECK-NEXT: [[T:%.*]] = trunc i8 [[S]] to i6
 ; CHECK-NEXT: [[R:%.*]] = and i6 [[T]], 31
 ; CHECK-NEXT: ret i6 [[R]]
 ;
   %s = lshr i8 %x, 2
   %t = trunc i8 %s to i6
   %r = and i6 %t, 31
   ret i6 %r
 }

+; negative test - too many uses
+
 define i6 @trunc_lshr_use1(i8 %x) {
 ; CHECK-LABEL: @trunc_lshr_use1(
 ; CHECK-NEXT: [[S:%.*]] = lshr i8 [[X:%.*]], 2
 ; CHECK-NEXT: call void @use8(i8 [[S]])
 ; CHECK-NEXT: [[T:%.*]] = trunc i8 [[S]] to i6
 ; CHECK-NEXT: [[R:%.*]] = and i6 [[T]], 15
 ; CHECK-NEXT: ret i6 [[R]]
 ;
   %s = lshr i8 %x, 2
   call void @use8(i8 %s)
   %t = trunc i8 %s to i6
   %r = and i6 %t, 15
   ret i6 %r
 }

+; negative test - too many uses
+
 define i6 @trunc_lshr_use2(i8 %x) {
 ; CHECK-LABEL: @trunc_lshr_use2(
 ; CHECK-NEXT: [[S:%.*]] = lshr i8 [[X:%.*]], 2
 ; CHECK-NEXT: [[T:%.*]] = trunc i8 [[S]] to i6
 ; CHECK-NEXT: call void @use6(i6 [[T]])
 ; CHECK-NEXT: [[R:%.*]] = and i6 [[T]], 15
 ; CHECK-NEXT: ret i6 [[R]]
 ;
   %s = lshr i8 %x, 2
   %t = trunc i8 %s to i6
   call void @use6(i6 %t)
   %r = and i6 %t, 15
   ret i6 %r
 }

+; Splat vectors are ok.
+
 define <2 x i7> @trunc_lshr_vec_splat(<2 x i16> %x) {
 ; CHECK-LABEL: @trunc_lshr_vec_splat(
-; CHECK-NEXT: [[S:%.*]] = lshr <2 x i16> [[X:%.*]], <i16 5, i16 5>
-; CHECK-NEXT: [[T:%.*]] = trunc <2 x i16> [[S]] to <2 x i7>
-; CHECK-NEXT: [[R:%.*]] = and <2 x i7> [[T]], <i7 1, i7 1>
+; CHECK-NEXT: [[TMP1:%.*]] = trunc <2 x i16> [[X:%.*]] to <2 x i7>
+; CHECK-NEXT: [[TMP2:%.*]] = lshr <2 x i7> [[TMP1]], <i7 5, i7 5>
+; CHECK-NEXT: [[R:%.*]] = and <2 x i7> [[TMP2]], <i7 1, i7 1>
 ; CHECK-NEXT: ret <2 x i7> [[R]]
 ;
   %s = lshr <2 x i16> %x, <i16 5, i16 5>
   %t = trunc <2 x i16> %s to <2 x i7>
   %r = and <2 x i7> %t, <i7 1, i7 1>
   ret <2 x i7> %r
 }

+; The 'and' is eliminated.
+
 define <2 x i7> @trunc_lshr_vec_splat_exact_mask(<2 x i16> %x) {
 ; CHECK-LABEL: @trunc_lshr_vec_splat_exact_mask(
-; CHECK-NEXT: [[S:%.*]] = lshr <2 x i16> [[X:%.*]], <i16 6, i16 6>
-; CHECK-NEXT: [[T:%.*]] = trunc <2 x i16> [[S]] to <2 x i7>
-; CHECK-NEXT: [[R:%.*]] = and <2 x i7> [[T]], <i7 1, i7 1>
-; CHECK-NEXT: ret <2 x i7> [[R]]
+; CHECK-NEXT: [[TMP1:%.*]] = trunc <2 x i16> [[X:%.*]] to <2 x i7>
+; CHECK-NEXT: [[TMP2:%.*]] = lshr <2 x i7> [[TMP1]], <i7 6, i7 6>
+; CHECK-NEXT: ret <2 x i7> [[TMP2]]
 ;
   %s = lshr <2 x i16> %x, <i16 6, i16 6>
   %t = trunc <2 x i16> %s to <2 x i7>
   %r = and <2 x i7> %t, <i7 1, i7 1>
   ret <2 x i7> %r
 }

+; negative test - the shift is too big for the narrow type
+
 define <2 x i7> @trunc_lshr_big_shift(<2 x i16> %x) {
 ; CHECK-LABEL: @trunc_lshr_big_shift(
 ; CHECK-NEXT: [[S:%.*]] = lshr <2 x i16> [[X:%.*]], <i16 7, i16 7>
 ; CHECK-NEXT: [[T:%.*]] = trunc <2 x i16> [[S]] to <2 x i7>
 ; CHECK-NEXT: [[R:%.*]] = and <2 x i7> [[T]], <i7 1, i7 1>
 ; CHECK-NEXT: ret <2 x i7> [[R]]
 ;
   %s = lshr <2 x i16> %x, <i16 7, i16 7>
   %t = trunc <2 x i16> %s to <2 x i7>
   %r = and <2 x i7> %t, <i7 1, i7 1>
   ret <2 x i7> %r
 }

+; High bits could also be set rather than cleared.
+
 define i6 @or_trunc_lshr(i8 %x) {
 ; CHECK-LABEL: @or_trunc_lshr(
-; CHECK-NEXT: [[S:%.*]] = lshr i8 [[X:%.*]], 1
-; CHECK-NEXT: [[T:%.*]] = trunc i8 [[S]] to i6
-; CHECK-NEXT: [[R:%.*]] = or i6 [[T]], -32
+; CHECK-NEXT: [[TMP1:%.*]] = trunc i8 [[X:%.*]] to i6
+; CHECK-NEXT: [[TMP2:%.*]] = lshr i6 [[TMP1]], 1
+; CHECK-NEXT: [[R:%.*]] = or i6 [[TMP2]], -32
 ; CHECK-NEXT: ret i6 [[R]]
 ;
   %s = lshr i8 %x, 1
   %t = trunc i8 %s to i6
   %r = or i6 %t, 32 ; 0b100000
   ret i6 %r
 }

 define i6 @or_trunc_lshr_more(i8 %x) {
 ; CHECK-LABEL: @or_trunc_lshr_more(
-; CHECK-NEXT: [[S:%.*]] = lshr i8 [[X:%.*]], 4
-; CHECK-NEXT: [[T:%.*]] = trunc i8 [[S]] to i6
-; CHECK-NEXT: [[R:%.*]] = or i6 [[T]], -4
+; CHECK-NEXT: [[TMP1:%.*]] = trunc i8 [[X:%.*]] to i6
+; CHECK-NEXT: [[TMP2:%.*]] = lshr i6 [[TMP1]], 4
+; CHECK-NEXT: [[R:%.*]] = or i6 [[TMP2]], -4
 ; CHECK-NEXT: ret i6 [[R]]
 ;
   %s = lshr i8 %x, 4
   %t = trunc i8 %s to i6
   %r = or i6 %t, 60 ; 0b111100
   ret i6 %r
 }

+; negative test - need all high bits to be undemanded
+
 define i6 @or_trunc_lshr_small_mask(i8 %x) {
 ; CHECK-LABEL: @or_trunc_lshr_small_mask(
 ; CHECK-NEXT: [[S:%.*]] = lshr i8 [[X:%.*]], 4
 ; CHECK-NEXT: [[T:%.*]] = trunc i8 [[S]] to i6
 ; CHECK-NEXT: [[R:%.*]] = or i6 [[T]], -8
 ; CHECK-NEXT: ret i6 [[R]]
 ;
   %s = lshr i8 %x, 4
   %t = trunc i8 %s to i6
   %r = or i6 %t, 56 ; 0b111000
   ret i6 %r
 }