
Relax the clearance calculating for breaking partial register dependency.
ClosedPublic

Authored by danielcdh on Jun 21 2016, 10:04 AM.

Details

Summary

LLVM assumes that a large clearance will hide the partial register update penalty. But in our experiments, a clearance of 16 is too small. As the inserted XOR is normally fairly cheap, we should use a higher clearance threshold to more aggressively insert the XORs that are necessary to break partial register dependencies.

Diff Detail

Event Timeline

danielcdh updated this revision to Diff 61397.Jun 21 2016, 10:04 AM
danielcdh retitled this revision from to Relax the clearance calculating for breaking partial register dependency..
danielcdh updated this object.
danielcdh added reviewers: mkuper, davidxl, wmi.
danielcdh added a subscriber: llvm-commits.
davidxl edited edge metadata.Jun 21 2016, 10:18 AM

Can you also add the original author who set the current threshold for the review?

mkuper edited edge metadata.Jun 21 2016, 11:29 AM

This mostly makes sense to me. 16 is as much of a magic number as 64, so if this is supported by performance numbers, it's fine.

The problem is that we're using the number of instructions between the write and the read as a proxy for the latency of the critical chain between the write and the read. And it's not a very good proxy, since how correct we are depends on how much ILP the loop (assuming this is a loop-carried dependency) has. So any number we pick here will be just hand-waving.

Adding some Intel people in case they have more input.

I think this is a good change. The cost of inserting an xor when it is not needed is very small. The cost of failing to insert an xor when it IS needed can be huge.

But fixing the threshold isn't good enough for several reasons.
(1) It isn't always possible, because the register may hold a live value. Marina (added) is working on a patch that will fix this problem.
(2) If the instruction has a "real" XMM operand, it is better to choose that register for the undef operand rather than inserting an xor. (Doing so hides the false dependence behind a true dependence that is unavoidable.) This applies to instructions like vcvtsd2ss et al. Marina is planning to work on this also.

-Dave

lib/Target/X86/X86InstrInfo.cpp
5977

minor nit, instruction's --> instructions'

Here is a testcase extracted from an internal benchmark, which runs 2% faster on Sandy Bridge if the clearance threshold is changed from 16 to 64. For this extracted testcase, when the threshold is set at 16, the cvtsi2ss for "g" does not get an xor inserted to break the dependency. As a result, the inner loop takes 13 cycles (compared with 11 cycles if the dependency is broken for all of "r", "g" and "b").

int datas[10000];
int datad[10000];

void foo(float r, float g, float b, int s, int d, int h, int w,
         int *datas, int *datad) __attribute__((noinline));

void foo(float r, float g, float b, int s, int d, int h, int w,
         int *datas, int *datad) {
  int i, j;
  for (i = 0; i < h; i++) {
    int *lines = datas + i * s;
    int *lined = datad + i * d;
    for (j = 0; j < w; j++) {
      int word = *(lines + j);
      int val = (int)(r * ((word >> 8) & 0xff) +
                      g * ((word >> 16) & 0xff) +
                      b * ((word >> 24) & 0xff) + 0.5);
      *((char *)(lined) + j) = val;
    }
  }
}

int main() {
  for (int i = 0; i < 100000; i++) {
    foo(2.0, 3.0, 4.0, 100, 100, 100, 100, datas, datad);
  }
  return 0;
}

I don't have public benchmark results for this change yet. For internal benchmarks, no noticeable code-size change has been observed. The change gives a 2% speedup on the benchmark that motivated this patch and has no performance impact on any other internal benchmarks.

I think this is a good change. The cost of inserting an xor when it is not needed is very small. The cost of failing to insert an xor when it IS needed can be huge.

But fixing the threshold isn't good enough for several reasons.
(1) It isn't always possible, because the register may hold a live value. Marina (added) is working on a patch that will fix this problem.
(2) If the instruction has a "real" XMM operand, it is better to choose that register for the undef operand rather than inserting an xor. (Doing so hides the false dependence behind a true dependence that is unavoidable.) This applies to instructions like vcvtsd2ss et al. Marina is planning to work on this also.

-Dave

It's great that Marina is working on a better fix for this problem. What is the ETA for the fix? If it's coming pretty soon, we should abandon this change. Otherwise, if the change does no harm, let's just update the magic number so that we can reclaim performance on the motivating benchmark. Thoughts?

Thanks,
Dehao

I wasn't suggesting that Marina's fixes should replace this one. Sorry for that confusion. My point was simply that this is just one piece of a more comprehensive solution to the false dependence problem.

I think we should go ahead with this change. But I agree with davidxl that you should get an 'ok' from the original author, if possible.

Thanks,
-Dave

spatel edited edge metadata.Jun 22 2016, 12:47 PM

Since even the code comments acknowledge that this is a repeated magic number, please give it a name. Even better would be to make it a cl::opt, so we can run experiments more easily.

FWIW, I don't see any measurable perf difference for the example test case (compiled with -O2) running on Haswell.

Is there already a regression test to show that we're not adding instructions if we're in MinSize mode?

Is there already a regression test to show that we're not adding instructions if we're in MinSize mode?

I believe we currently add the xors in MinSize mode, and, as I wrote on PR22024, I think it's justified.
I guess it's a question of how -Oz is defined - whether it's "make code smaller at all costs", or (as the manual currently says):

  • -Os: Like -O2 with extra optimizations to reduce code size.
  • -Oz: Like -Os (and thus -O2), but reduces code size further.

My gut feeling is that omitting the xor is not a good performance/size trade-off even for -Oz.

Is there already a regression test to show that we're not adding instructions if we're in MinSize mode?

I believe we currently add the xors in MinSize mode, and, as I wrote on PR22024, I think it's justified.

Yes, I pasted the links before I scrolled through the comments.
In that case, let me negate my question. :)

Is there already a regression test to show that we ARE adding instructions if we're in MinSize mode?

The performance of the attached testcase running on Haswell (to reproduce, build with -O2 -fno-tree-vectorize):

$ time ./good.out
real 0m2.442s
user 0m2.437s
sys 0m0.001s

$ time ./bad.out
real 0m3.480s
user 0m3.475s
sys 0m0.002s

The performance of the attached testcase running on Haswell (to reproduce, build with -O2 -fno-tree-vectorize):

I missed -fno-tree-vectorize in my earlier experiment. With that setting, I am able to reproduce the Haswell win. This is at nominal 4GHz:

lessxor: user   0m2.977s
morexor: user	0m2.068s

I would expect all OoO SSE machines to have the same problem, and testing on AMD Jaguar generally shows that. But in this particular case, performance gets worse. This is at nominal 1.5GHz:

lessxor: user	0m11.916s
morexor: user	0m12.795s

I don't have an explanation for that loss yet. The loop is optimally aligned on a 64-byte boundary. The extra xorps adds 3 bytes, causing the inner loop to grow from an even 80 bytes to 83 bytes. Perhaps the extra bytes require an additional ifetch operation that is somehow slowing down the whole chain?

Note that the partial-update problem is limited to SSE codegen. If we use -mavx, we generate the 'v' versions of the conversion instructions. Those do not have partial register update problems, so we don't need any (v)xorps instructions in the loop. Performance for the AVX versions of the program is slightly better than the best SSE case on both CPUs.

Here's the asm code for the inner loop that I'm testing with:

LBB0_4:                                 ## %for.body6
                                        ##   Parent Loop BB0_2 Depth=1
                                        ## =>  This Inner Loop Header: Depth=2
  movl  (%rbx), %eax
  movzbl  %ah, %edi  # NOREX
  xorps %xmm4, %xmm4  <--- this gets generated with the current '16' clearance setting
  cvtsi2ssl %edi, %xmm4
  mulss %xmm0, %xmm4
  movl  %eax, %edi
  shrl  $16, %edi
  movzbl  %dil, %edi
  xorps %xmm5, %xmm5  <--- this is the added instruction generated by this patch
  cvtsi2ssl %edi, %xmm5
  mulss %xmm1, %xmm5
  addss %xmm4, %xmm5
  shrl  $24, %eax
  xorps %xmm4, %xmm4  <--- this gets generated with the current '16' clearance setting
  cvtsi2ssl %eax, %xmm4
  mulss %xmm2, %xmm4
  addss %xmm5, %xmm4
  cvtss2sd  %xmm4, %xmm4
  addsd %xmm3, %xmm4
  cvttsd2si %xmm4, %eax
  movb  %al, (%rsi)
  addq  $4, %rbx
  incq  %rsi
  decl  %r15d
  jne LBB0_4
danielcdh updated this revision to Diff 61852.Jun 24 2016, 3:47 PM
danielcdh edited edge metadata.

Put the magic number in an option.

spatel added inline comments.Jun 26 2016, 4:59 PM
lib/Target/X86/X86InstrInfo.cpp
61–66

The option is used for both partial reg clearance and undef reg clearance. It should have a more generic name and description, or we should have 2 independent variables. I'm fine either way, but what the patch has in this version is misleading.

5975–5978

Don't hard-code the comment to any specific value since it may change above.

danielcdh updated this revision to Diff 61982.Jun 27 2016, 10:27 AM

update comments and options

Thanks.

Earlier in the thread, Joerg requested more general perf data. Has that request been fulfilled or cancelled?

It would be interesting to know what the optimal setting is for at least one benchmark, so you can leave a comment by the cl::opt definition to justify the default setting.

Also, we don't have to answer the MinSize question that I raised directly in this patch, but my perf data shows that we really shouldn't be adding xors in that situation - we hope these are free, but they are not in all cases. IMO, our documentation for -Oz is wrong. There shouldn't be any room for interpretation about the definition of "minimum"; this is the flag used by devs (embedded, boot code, etc) who require the smallest possible code, and perf shouldn't matter.

danielcdh added a comment.EditedJun 27 2016, 6:28 PM

I just tested the perf impact of this patch on the SPEC CPU INT2006 benchmarks (perf tests were done on Sandy Bridge):

400.perlbench 2.21%
401.bzip2 -0.46%
403.gcc 1.59%
429.mcf 1.79%
445.gobmk 1.52%
456.hmmer 1.63%
458.sjeng 5.62%
462.libquantum 1.54%
464.h264ref -1.28%
471.omnetpp 0.00%
473.astar 2.35%
483.xalancbmk -0.27%
overall 1.29%

About the -Oz issue, I agree that there should be no xor inserted at -Oz. But that's a different issue and should be fixed in a separate patch.

What platform, Ivy Bridge?

David

spatel accepted this revision.Jun 28 2016, 9:04 AM
spatel edited edge metadata.

I just tested the perf impact of this patch on the SPEC CPU INT2006 benchmarks (perf tests were done on Sandy Bridge):

400.perlbench 2.21%
401.bzip2 -0.46%
403.gcc 1.59%
429.mcf 1.79%
445.gobmk 1.52%
456.hmmer 1.63%
458.sjeng 5.62%
462.libquantum 1.54%
464.h264ref -1.28%
471.omnetpp 0.00%
473.astar 2.35%
483.xalancbmk -0.27%
overall 1.29%

That is a surprisingly large gain. This raises the question: if 64 is good, is 128 even better?

About the -Oz issue, I agree that there should be no xor inserted at -Oz. But that's a different issue and should be fixed in a separate patch.

Agreed. Please let me know if you plan to follow-up on that part.

I have no further comments on this patch, so LGTM.

This revision is now accepted and ready to land.Jun 28 2016, 9:04 AM
danielcdh closed this revision.Jun 28 2016, 2:26 PM