This is an archive of the discontinued LLVM Phabricator instance.

Differential D54839

[CodeGen] Enhance machine PHIs optimization
Closed · Public

Authored by anton-afanasyev on Nov 22 2018, 2:13 PM.

Details

Reviewers

RKSimon
spatel
craig.topper
dtemirbulatov
MatzeB
aemerson
qcolombet
atrick

Commits

rG8c8724dd0d12: [CodeGen] Enhance machine PHIs optimization
rL349271: [CodeGen] Enhance machine PHIs optimization

Summary

Make the machine PHI optimization work when the single value register is
reached through several different copies. This is the first step toward fixing
PR38917. The change removes redundant PHIs (see the opt_phis2.mir test), which
makes subsequent optimizations (such as CSE) possible and simpler.

For instance, before this patch, code like this:

%b = COPY %z
...
%a = PHI %bb1, %a; %bb2, %b

could be optimized to:

%a = %b

but code like this:

%c = COPY %z
...
%b = COPY %z
...
%a = PHI %bb1, %a; %bb2, %b; %bb3, %c

would remain unchanged.
With this patch, the latter case is optimized to:

%a = %z
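
As a rough illustration of the check described above, here is a minimal, self-contained C++ sketch on a toy register model. `Kind`, `Instr`, `Defs`, `isSingleValuePhi` and `collapseSummaryExample` are hypothetical names invented for this sketch; they are not LLVM APIs. The real pass works on MachineInstr and MachineRegisterInfo and performs extra checks (subregisters, virtual registers, a size limit on the scanned set) that the sketch leaves out. Registers missing from the map are assumed to be defined by some other, opaque instruction.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Toy stand-ins for machine instructions; these are NOT LLVM types.
enum class Kind { Other, Copy, Phi };

struct Instr {
  Kind kind;
  std::vector<unsigned> srcs; // Copy: one source; Phi: incoming values
};

// Maps a virtual register to its single SSA definition. A register that
// is absent from the map is treated as an opaque, non-COPY, non-PHI value.
using Defs = std::map<unsigned, Instr>;

// Returns true if every incoming value of the PHI defining `dst` resolves,
// through at most one COPY and through other PHIs, to one register.
// `single` must be 0 on entry and receives that register.
bool isSingleValuePhi(const Defs &defs, unsigned dst, unsigned &single,
                      std::set<unsigned> &seen) {
  assert(defs.at(dst).kind == Kind::Phi && "expects a PHI");
  if (!seen.insert(dst).second)
    return true; // this PHI was already scanned
  for (unsigned src : defs.at(dst).srcs) {
    if (src == dst)
      continue;
    auto it = defs.find(src);
    // Look through one COPY and remember the copied-from register.
    if (it != defs.end() && it->second.kind == Kind::Copy) {
      src = it->second.srcs[0];
      it = defs.find(src);
    }
    if (it != defs.end() && it->second.kind == Kind::Phi) {
      if (!isSingleValuePhi(defs, src, single, seen))
        return false;
    } else {
      // Fail only if a second, different external register shows up.
      if (single != 0 && single != src)
        return false;
      single = src;
    }
  }
  return true;
}

// Models the second example with z=1, b=2, c=3, a=4. Returns the register
// the PHI collapses to, or 0 if it does not collapse.
unsigned collapseSummaryExample(unsigned cSource) {
  Defs defs{{2u, Instr{Kind::Copy, {1u}}},         // %b = COPY %z
            {3u, Instr{Kind::Copy, {cSource}}},    // %c = COPY <cSource>
            {4u, Instr{Kind::Phi, {4u, 2u, 3u}}}}; // %a = PHI %a, %b, %c
  unsigned single = 0;
  std::set<unsigned> seen;
  return isSingleValuePhi(defs, 4u, single, seen) ? single : 0u;
}
```

In this model `collapseSummaryExample(1)` yields register 1 (%z), because both copies lead to the same register. Passing any other source for %c makes the check fail, since the PHI then has two distinct external values.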

Diff Detail

Repository: rL LLVM

Event Timeline

anton-afanasyev created this revision. Nov 22 2018, 2:13 PM

Herald added a subscriber: llvm-commits. Nov 22 2018, 2:13 PM

dmgreen added a subscriber: dmgreen. Nov 25 2018, 1:56 PM

anton-afanasyev edited the summary of this revision. Nov 28 2018, 8:13 AM

anton-afanasyev edited the summary of this revision.

I've updated the summary to be clearer about the change. How else can I get a review of this differential?

This looks ok, but I know very little about this pass - @MatzeB @craig.topper any comments?

(Adding more potential reviewers for MI changes)

I'm not too familiar with the history of OptimizePHIs but some things feel odd to me here:

The comments in the pass make it seem like it is designed to catch dataflow around loops, yet the motivation for your changes appears to be simpler phis without loops involved.
I keep wondering why we bother with COPYs here at all. If the COPYs are really trivial COPYs between the same register classes then we really should just remove them when we are still in MachineSSA. Do you know how they are created? Maybe you can simply stop their creation instead?
Skipping COPYs like this is also a bit inconsistent and will miss several cases: you could have more than one COPY in a row, or switch between register classes. The PeepholeOptimizer pass already handles these more complex cases, so maybe we should rather look into running OptimizePHIs after the Peephole Optimizer so those COPYs are optimized? (Not sure, though, whether there are other negative effects of running OptimizePHIs later or PeepholeOpt earlier.)

With all that said, I don't expect the patch to break anything, so no principal objections to committing it...

lib/CodeGen/OptimizePHIs.cpp
123–125 ↗	(On Diff #175072)	I don't understand why COPYs would be more likely in the predecessor blocks of the PHIs (this is before PHI lowering after all). It also doesn't make sense to me to have different behavior depending on where the COPY is; can't we just always look through the COPY and change `SrcReg`?
140 ↗	(On Diff #175072)	This is a nice change!
test/CodeGen/X86/opt_phis2.mir
15–35 ↗	(On Diff #175072)	Things should also work without this block, as you already define all the register classes on the MIR operands themselves.
36–39 ↗	(On Diff #175072)	I suspect you can simply drop this block as well for this test

In D54839#1321539, @MatzeB wrote:

I'm not too familiar with the history of OptimizePHIs but some things feel odd to me here:

The comments in the pass make it seem like it is designed to catch dataflow around loops, yet the motivation for your changes appears to be simpler phis without loops involved.

This pass is more about PHI cycles than about dataflow loops (though the two are related). I've just extended the pass to work with more complex PHI configurations. At first I wanted to rename the function IsSingleValuePHICycle() to something like IsSingleValuePHICycleOrChain(), but then I realized that even before my change it was not only about a single cycle (a PHI configuration could contain several cycles). A more accurate name for the function (before or after my change) would be IsSingleValuePHIGraph() or IsSingleValuePHIConfiguration(). The general idea remains the same: we have several PHIs (possibly just one) connected by a cycle, several cycles, or a chain, with only a single external register. You are right that with this patch the pass can eliminate a single PHI node that has no cycle through other PHI nodes, but one can still regard it as a "PHI loop" to itself through the COPYed register. And you are right that this was my initial motivation (further MachineCSE/GVN simplification), but fortunately the case generalized. It also triggered several improvements in regression tests that contain PHI cycles.
What I really forgot is to add some comments to this function's description to make it clearer. Will that be enough?

I keep wondering why we bother with COPYs here at all. If the COPYs are really trivial COPYs between the same register classes then we really should just remove them when we are still in MachineSSA. Do you know how they are created? Maybe you can simply stop their creation instead?

Skipping COPYs like this is also a bit inconsistent and will miss several cases: you could have more than one COPY in a row, or switch between register classes. The PeepholeOptimizer pass already handles these more complex cases, so maybe we should rather look into running OptimizePHIs after the Peephole Optimizer so those COPYs are optimized? (Not sure, though, whether there are other negative effects of running OptimizePHIs later or PeepholeOpt earlier.)

These COPYs are created during MIR instruction selection; for instance, the LLVM instruction %2 = bitcast <4 x i64> %1 to <8 x i32> translates to %2 = COPY %1. I'm not sure these COPYs could be eliminated during X86 DAG->DAG itself.
I'm not familiar with PeepholeOptimizer; are you sure it is meant to handle such COPYs? I tried llc -run-pass=peephole-opt opt_phis2.mir and saw no COPY propagation on my test sample opt_phis2.mir. Nevertheless, eliminating COPYs is not enough to optimize PHIs in samples like this one. Without the patch this pass cannot even optimize %a = PHI %b, %bb1, %b, %bb2 to %a = COPY %b. COPYs are actually an auxiliary concern in this pass. I explored the MachineCSE pass before writing this patch: it also uses auxiliary copy propagation to help the main optimization (see PerformTrivialCopyPropagation()). In fact the OptimizePHIs pass itself (without my patch) already uses copy propagation, but only to reach the next possible PHI node, not to find a possibly identical register. One can do copy propagation (in MachineSSA) in some preceding pass (which one?), or just add several lines to the passes where it is needed (MachineCSE, OptimizePHIs, ...). After SSA form is broken, such redundant COPYs will be eliminated anyway.
Yes, you're right, this pass looks through only one COPY in a row, but that was true before my patch as well; I've just added the case where several COPYs lead to the same register. I believe more than one COPY in a row is a rare case, though.
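
The "more than one COPY in a row" case discussed above is not handled by this patch. Purely as an illustration of what looking through a whole chain could involve, here is a minimal sketch on a toy map from COPY destination to COPY source. `CopyMap` and `lookThroughCopies` are hypothetical names, not LLVM APIs, and the default step limit of 16 is an arbitrary assumption made for the sketch.

```cpp
#include <cassert>
#include <map>

// Hypothetical toy model: maps a register defined by a plain COPY to the
// register it copies from. Registers absent from the map are not COPYs.
using CopyMap = std::map<unsigned, unsigned>;

// Follows COPYs until a register with a non-COPY definition is reached.
// `maxSteps` bounds the walk so malformed (cyclic) input cannot loop forever.
unsigned lookThroughCopies(const CopyMap &copies, unsigned reg,
                           unsigned maxSteps = 16) {
  for (unsigned i = 0; i < maxSteps; ++i) {
    auto it = copies.find(reg);
    if (it == copies.end())
      break; // reg is not defined by a COPY: stop here
    reg = it->second;
  }
  return reg;
}
```

With %3 = COPY %2 and %2 = COPY %1, `lookThroughCopies` resolves register 3 to register 1, while a limit of one step stops at register 2, which corresponds to the single look-through the pass performs.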

With all that said, I don't expect the patch to break anything, so no principal objections to committing it...

Thank you for the thorough review! I'll remove the redundant if-condition on the COPY location that you noticed.

anton-afanasyev marked 7 inline comments as done. Dec 9 2018, 11:38 PM

anton-afanasyev added inline comments.

lib/CodeGen/OptimizePHIs.cpp
123–125 ↗	(On Diff #175072)	Yes, you are right, this check is redundant! I'll remove it.
test/CodeGen/X86/opt_phis2.mir
15–35 ↗	(On Diff #175072)	Yes, thank you!
36–39 ↗	(On Diff #175072)	Ok.

Updated according to @MatzeB's remarks.

Hi @MatzeB, as for a dedicated pass for COPY propagation, I haven't found a natural place for it after MIR ISel and before OptimizePHIs among the existing passes:

X86 DAG->DAG Instruction Selection
MachineDominator Tree Construction
Local Dynamic TLS Access Clean-up
X86 PIC Global Base Reg Initialization
Expand ISel Pseudo-instructions
X86 Domain Reassignment Pass
Early Tail Duplication
Optimize machine instruction PHIs

I think such a pass could be useful right before OptimizePHIs, possibly making the special handling in OptimizePHIs and MachineCSE unnecessary.
But it is not directly related to this patch, since the COPY check was already in place before the patch.

Hi Matthias, is it ok for LGTM now?

LGTM

This revision is now accepted and ready to land. Dec 14 2018, 3:32 PM

LGTM cheers

Closed by commit rL349271: [CodeGen] Enhance machine PHIs optimization (authored by dinar). Dec 15 2018, 6:40 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                        Size
llvm/trunk/lib/CodeGen/OptimizePHIs.cpp     17 lines
llvm/trunk/test/CodeGen/X86/madd.ll         6 lines
llvm/trunk/test/CodeGen/X86/opt_phis2.mir   72 lines
llvm/trunk/test/CodeGen/X86/sad.ll          4 lines

Diff 178363

llvm/trunk/lib/CodeGen/OptimizePHIs.cpp

[84 lines not shown; context: bool OptimizePHIs::runOnMachineFunction(MachineFunction &Fn) {]
   bool Changed = false;
   for (MachineFunction::iterator I = Fn.begin(), E = Fn.end(); I != E; ++I)
     Changed |= OptimizeBB(*I);

   return Changed;
 }

 /// IsSingleValuePHICycle - Check if MI is a PHI where all the source operands
 /// are copies of SingleValReg, possibly via copies through other PHIs. If
 /// SingleValReg is zero on entry, it is set to the register with the single
 /// non-copy value. PHIsInCycle is a set used to keep track of the PHIs that
-/// have been scanned.
+/// have been scanned. PHIs may be grouped by cycle, several cycles or chains.
 bool OptimizePHIs::IsSingleValuePHICycle(MachineInstr *MI,
                                          unsigned &SingleValReg,
                                          InstrSet &PHIsInCycle) {
   assert(MI->isPHI() && "IsSingleValuePHICycle expects a PHI instruction");
   unsigned DstReg = MI->getOperand(0).getReg();

   // See if we already saw this register.
   if (!PHIsInCycle.insert(MI).second)
[9 lines not shown; context: for (unsigned i = 1; i != MI->getNumOperands(); i += 2) {]
     if (SrcReg == DstReg)
       continue;
     MachineInstr *SrcMI = MRI->getVRegDef(SrcReg);

     // Skip over register-to-register moves.
     if (SrcMI && SrcMI->isCopy() &&
         !SrcMI->getOperand(0).getSubReg() &&
         !SrcMI->getOperand(1).getSubReg() &&
-        TargetRegisterInfo::isVirtualRegister(SrcMI->getOperand(1).getReg()))
-      SrcMI = MRI->getVRegDef(SrcMI->getOperand(1).getReg());
+        TargetRegisterInfo::isVirtualRegister(SrcMI->getOperand(1).getReg())) {
+      SrcReg = SrcMI->getOperand(1).getReg();
+      SrcMI = MRI->getVRegDef(SrcReg);
+    }
     if (!SrcMI)
       return false;

     if (SrcMI->isPHI()) {
       if (!IsSingleValuePHICycle(SrcMI, SingleValReg, PHIsInCycle))
         return false;
     } else {
       // Fail if there is more than one non-phi/non-move register.
-      if (SingleValReg != 0)
+      if (SingleValReg != 0 && SingleValReg != SrcReg)
         return false;
       SingleValReg = SrcReg;
     }
   }
   return true;
 }

 /// IsDeadPHICycle - Check if the register defined by a PHI is only used by
[34 lines not shown; context: for (MachineBasicBlock::iterator]
     unsigned SingleValReg = 0;
     InstrSet PHIsInCycle;
     if (IsSingleValuePHICycle(MI, SingleValReg, PHIsInCycle) &&
         SingleValReg != 0) {
       unsigned OldReg = MI->getOperand(0).getReg();
       if (!MRI->constrainRegClass(SingleValReg, MRI->getRegClass(OldReg)))
         continue;

+      // for the case SingleValReg taken from copy instr
+      MRI->clearKillFlags(SingleValReg);
+
       MRI->replaceRegWith(OldReg, SingleValReg);
       MI->eraseFromParent();
       ++NumPHICycles;
       Changed = true;
       continue;
     }

     // Check for dead PHI cycles.
[15 lines not shown]

llvm/trunk/test/CodeGen/X86/madd.ll

[421 lines not shown]
 ; AVX1-NEXT: addq $16, %rcx
 ; AVX1-NEXT: cmpq %rcx, %rax
 ; AVX1-NEXT: jne .LBB3_1
 ; AVX1-NEXT: # %bb.2: # %middle.block
 ; AVX1-NEXT: vpaddd %xmm0, %xmm2, %xmm3
 ; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm4
 ; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm5
 ; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm2
-; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm6
-; AVX1-NEXT: vpaddd %xmm6, %xmm2, %xmm2
+; AVX1-NEXT: vpaddd %xmm5, %xmm2, %xmm2
 ; AVX1-NEXT: vpaddd %xmm2, %xmm5, %xmm2
 ; AVX1-NEXT: vpaddd %xmm2, %xmm4, %xmm2
 ; AVX1-NEXT: vpaddd %xmm3, %xmm0, %xmm0
 ; AVX1-NEXT: vpaddd %xmm2, %xmm0, %xmm0
 ; AVX1-NEXT: vpaddd %xmm0, %xmm1, %xmm0
 ; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
 ; AVX1-NEXT: vpaddd %xmm1, %xmm0, %xmm0
 ; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
[591 lines not shown]
 ; AVX1-NEXT: addq $32, %rcx
 ; AVX1-NEXT: cmpq %rcx, %rax
 ; AVX1-NEXT: jne .LBB7_1
 ; AVX1-NEXT: # %bb.2: # %middle.block
 ; AVX1-NEXT: vpaddd %xmm0, %xmm2, %xmm3
 ; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm4
 ; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm5
 ; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm2
-; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm6
-; AVX1-NEXT: vpaddd %xmm6, %xmm2, %xmm2
+; AVX1-NEXT: vpaddd %xmm5, %xmm2, %xmm2
 ; AVX1-NEXT: vpaddd %xmm2, %xmm5, %xmm2
 ; AVX1-NEXT: vpaddd %xmm2, %xmm4, %xmm2
 ; AVX1-NEXT: vpaddd %xmm3, %xmm0, %xmm0
 ; AVX1-NEXT: vpaddd %xmm2, %xmm0, %xmm0
 ; AVX1-NEXT: vpaddd %xmm0, %xmm1, %xmm0
 ; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
 ; AVX1-NEXT: vpaddd %xmm1, %xmm0, %xmm0
 ; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
[1,687 lines not shown]

llvm/trunk/test/CodeGen/X86/opt_phis2.mir

# RUN: llc -run-pass opt-phis -march=x86-64 -o - %s | FileCheck %s
# All PHIs should be removed since they can be securely replaced
# by %8 register.
# CHECK-NOT: PHI
--- |
  define void @test() {
    ret void
  }
...
---
name: test
alignment: 4
tracksRegLiveness: true
jumpTable:
  kind: block-address
  entries:
    - id: 0
      blocks: [ '%bb.3', '%bb.2', '%bb.1', '%bb.4' ]
body: |
  bb.0:
    liveins: $edi, $ymm0, $rsi

    %9:gr64 = COPY $rsi
    %8:vr256 = COPY $ymm0
    %7:gr32 = COPY $edi
    %11:gr32 = SAR32ri %7, 31, implicit-def dead $eflags
    %12:gr32 = SHR32ri %11, 30, implicit-def dead $eflags
    %13:gr32 = ADD32rr %7, killed %12, implicit-def dead $eflags
    %14:gr32 = AND32ri8 %13, -4, implicit-def dead $eflags
    %15:gr32 = SUB32rr %7, %14, implicit-def dead $eflags
    %10:gr64_nosp = SUBREG_TO_REG 0, %15, %subreg.sub_32bit
    %16:gr32 = SUB32ri8 %15, 3, implicit-def $eflags
    JA_1 %bb.8, implicit $eflags

  bb.9:
    JMP64m $noreg, 8, %10, %jump-table.0, $noreg :: (load 8 from jump-table)

  bb.1:
    %0:vr256 = COPY %8
    JMP_1 %bb.5

  bb.2:
    %1:vr256 = COPY %8
    JMP_1 %bb.6

  bb.3:
    %2:vr256 = COPY %8
    JMP_1 %bb.7

  bb.4:
    %3:vr256 = COPY %8
    %17:vr128 = VEXTRACTF128rr %8, 1
    VPEXTRDmr %9, 1, $noreg, 12, $noreg, killed %17, 2

  bb.5:
    %4:vr256 = PHI %0, %bb.1, %3, %bb.4
    %18:vr128 = VEXTRACTF128rr %4, 1
    VPEXTRDmr %9, 1, $noreg, 8, $noreg, killed %18, 1

  bb.6:
    %5:vr256 = PHI %1, %bb.2, %4, %bb.5
    %19:vr128 = VEXTRACTF128rr %5, 1
    VMOVPDI2DImr %9, 1, $noreg, 4, $noreg, killed %19

  bb.7:
    %6:vr256 = PHI %2, %bb.3, %5, %bb.6
    %20:vr128 = COPY %6.sub_xmm
    VPEXTRDmr %9, 1, $noreg, 0, $noreg, killed %20, 3

  bb.8:
    RET 0
...

llvm/trunk/test/CodeGen/X86/sad.ll

[301 lines not shown]
 ; AVX1-NEXT: vpaddd %xmm1, %xmm3, %xmm1
 ; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm1, %ymm1
 ; AVX1-NEXT: addq $4, %rax
 ; AVX1-NEXT: jne .LBB1_1
 ; AVX1-NEXT: # %bb.2: # %middle.block
 ; AVX1-NEXT: vpaddd %xmm0, %xmm0, %xmm2
 ; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm3
 ; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm4
-; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm5
-; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm6
-; AVX1-NEXT: vpaddd %xmm6, %xmm5, %xmm5
+; AVX1-NEXT: vpaddd %xmm4, %xmm4, %xmm5
 ; AVX1-NEXT: vpaddd %xmm5, %xmm4, %xmm4
 ; AVX1-NEXT: vpaddd %xmm4, %xmm3, %xmm3
 ; AVX1-NEXT: vpaddd %xmm2, %xmm0, %xmm0
 ; AVX1-NEXT: vpaddd %xmm3, %xmm0, %xmm0
 ; AVX1-NEXT: vpaddd %xmm0, %xmm1, %xmm0
 ; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
 ; AVX1-NEXT: vpaddd %xmm1, %xmm0, %xmm0
 ; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
[1,287 lines not shown]