Details

Reviewers

qcolombet
chandlerc
craig.topper

Commits

rGb811c1d6a57b: prevent folding a scalar FP load into a packed logical FP instruction (PR22371)
rL229531: prevent folding a scalar FP load into a packed logical FP instruction (PR22371)

Summary

I have low hopes that this patch will meet approval as-is, but I figure a patch proposal is the best way to push to the correct solution.

There are at least 2 separate issues in PR22371 ( http://llvm.org/bugs/show_bug.cgi?id=22371 ):

1. We're folding loads at unaligned addresses into SSE (non-VEX-prefixed) instructions. That part of the bug should be resolved with r227983.
2. We're folding scalar FP load operands (32 or 64 bits) into packed logical FP instructions, which are required to load 128 bits. That's what I'm trying to fix in this patch.

Here's the example codegen that I think we have to prevent:

LCPI0_0:
  .quad	4607182418800017408     ## double 1
...
  cmplesd	%xmm0, %xmm1
  andpd	LCPI0_0(%rip), %xmm1   <--- load 128-bits from a 64-bit location
  movapd	%xmm1, %xmm0
  retq
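
For readers unfamiliar with this idiom, here is a minimal C++ sketch of what the cmplesd / andpd pair computes in the low lane: the compare yields an all-ones or all-zeros mask, and ANDing that mask with the bit pattern of 1.0 gives 1.0 or 0.0. The function names are invented for illustration; only the .quad constant comes from the listing above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit pattern of double 1.0; equal to the .quad 4607182418800017408 constant.
constexpr std::uint64_t kOneBits = 0x3FF0000000000000ULL;

// Models a scalar compare-less-or-equal: all-ones when a <= b, else all-zeros.
// (Hypothetical helper; NaN inputs compare false and give an all-zeros mask.)
inline std::uint64_t compareLessEqualMask(double a, double b) {
  return (a <= b) ? ~0ULL : 0ULL;
}

// Models the packed AND on lane 0 only: mask & bits(1.0), viewed as a double.
inline double maskedOne(double a, double b) {
  std::uint64_t bits = compareLessEqualMask(a, b) & kOneBits;
  double result;
  std::memcpy(&result, &bits, sizeof result);
  assert(result == 0.0 || result == 1.0);
  return result;
}
```

Only the low 64 bits matter for the scalar result, which is why the constant pool entry is a single .quad and why a 128-bit memory operand reads 8 bytes that were never part of the constant.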

The edit of the multiclass defs is simple: change the memory operands in sse12_fp_packed_scalar_logical_alias from scalars to vectors. That's what the hardware packed logical FP instructions define: 128-bit memory operands. How we would ever actually match that pattern, I don't know. And so I have no positive (load folding) test cases in this patch.

There's also an existing bug in the AVX flavors of the defm's: they should be using load operands, not memop* operands. The only difference between those two is that memops have an extra alignment check to match the SSE spec. This part of the bug was extended by r228123 ( http://reviews.llvm.org/rL228123 ). Again, I'm not sure how to test that path. There weren't any tests added with r228123 either, so that makes me feel better.
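
The load / memop distinction can be sketched in C++ as two predicates. This is a simplified model, not the TableGen definitions: the function names are hypothetical, and the allowUnalignedSSE parameter is an assumption suggested by the sse-unaligned-mem attribute used in the new test's RUN line.

```cpp
#include <cassert>

// 'load'-style fragment: matches regardless of alignment. This is what the
// AVX (VEX-encoded) definitions should use.
inline bool matchesLoad(unsigned alignmentBytes) {
  assert(alignmentBytes != 0 && "alignment is expected to be at least 1");
  return true;
}

// 'memop'-style fragment: the same load plus an alignment check that mirrors
// the SSE requirement of 16-byte-aligned memory operands.
// allowUnalignedSSE is an assumed escape hatch for subtargets that tolerate
// unaligned SSE memory operands.
inline bool matchesMemop(unsigned alignmentBytes, bool allowUnalignedSSE) {
  assert(alignmentBytes != 0 && "alignment is expected to be at least 1");
  return allowUnalignedSSE || alignmentBytes >= 16;
}
```

Note that neither predicate looks at the size of the load, which is the separate problem this patch is about.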

The test cases that I'm proposing to add in 'logical-load-fold.ll' cover negative cases for the 4 paths (float/double * sse/avx) of the sse12_fp_packed_scalar_logical_alias multiclass. I haven't found any other way to generate a packed logical FP instruction with a memory operand.

I'm also proposing to XFAIL a test case that wants to fold a scalar load from the stack into a packed logical FP instruction. Although this may be possible, I don't see how we can do this without custom logic to know that it's safe to read the extra bytes from memory, so I extracted that one test into its own file and added comments to explain the situation.

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 19508. (Feb 6 2015, 1:39 PM)

spatel retitled this revision to "prevent folding a scalar FP load into a packed logical FP instruction (PR22371)".

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: chandlerc, qcolombet, craig.topper.

spatel added a subscriber: Unknown Object (MLST).

RKSimon added a subscriber: RKSimon. (Feb 6 2015, 5:20 PM)

hans added a subscriber: hans. (Feb 9 2015, 8:15 PM)

Hi Sanjay,

test/CodeGen/X86/logical-load-fold.ll
Line 9 (On Diff #19508): Do we really care about that? Unless we hit some read protection error, I do not see why reading undefined memory is a problem as long as we use only the part that is defined. It seems that is the case in those cases, isn't it?

In D7474#121658, @qcolombet wrote:

Unless we hit some read protection error, I do not see why reading undefined memory is a problem
as long as we use only the part that is defined. It seems that is the case in those cases, isn’t it?

Hi Quentin,

I think there is a danger beyond just a read protection error. For example, what if there's a denorm FP value in that extra bit of memory that we were never supposed to load?

See PR20358 ( http://llvm.org/bugs/show_bug.cgi?id=20358 ) for a really bad example of what might happen in that case - 19x perf hit.

Also, I'm not a security expert, but I worry that reading any extra mem is some kind of security violation / potential for hackery?
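
To make the denormal concern concrete, here is a small C++ sketch under stated assumptions: it treats the 8 bytes that follow a 64-bit scalar as the upper lane of a 128-bit load and asks whether they decode as a denormal double. The helper names are hypothetical, and the sketch says nothing about which instructions actually pay a penalty for such a value.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterprets 8 raw bytes as a double without undefined behavior.
inline double doubleFromBits(std::uint64_t bits) {
  static_assert(sizeof(double) == sizeof(std::uint64_t), "expects 64-bit double");
  double value;
  std::memcpy(&value, &bits, sizeof value);
  return value;
}

// True when the neighboring 8 bytes, as seen by a packed instruction that
// loaded all 16 bytes, would be classified as a denormal (subnormal) double.
inline bool upperLaneIsDenormal(std::uint64_t neighborBits) {
  return std::fpclassify(doubleFromBits(neighborBits)) == FP_SUBNORMAL;
}
```

Ordinary integer data is enough to trigger this: a neighboring 64-bit integer with the value 1 has a zero exponent field and a nonzero mantissa, so it decodes as the smallest denormal.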

Hi Sanjay,

For example, what if there's a denorm FP value in that extra bit of memory that we were never supposed to load?

That's a good point, we may even generate exceptions! E.g., with a packed division.

I'll have a closer look at the patch.

Thanks,
-Quentin

test/CodeGen/X86/stack-align-vector-load.ll
Line 11 (On Diff #19508): Following the logic that reading undef is evil, shouldn't we just make sure that indeed this test is failing to fold?

spatel added inline comments. (Feb 11 2015, 6:56 AM)

test/CodeGen/X86/stack-align-vector-load.ll
Line 11 (On Diff #19508): Yes, that would be the safer thing, and there's very little upside in folding an FP logic load, so this optimization would probably remain XFAIL forever. I'll update the patch with this change.

Updated patch:

Rather than XFAIL the existing test case, check to make sure we don't fold an oversized load.
r228671 (no test cases?) compounded the bug by adding the AVX scalar FP logical ops to the load folding tables; removed those and added the AVX vector FP logical ops.
There's still a bug in the SSE scalar FP logical ops; added a FIXME so we don't lose track of that.
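
As a rough illustration of the load folding table issue, the C++ sketch below models a lookup that refuses to fold when the load is narrower than what the memory form reads. Everything here is a simplified assumption: the enum, the memBytes field, the flag value, and foldLoad are hypothetical and are not the structures in X86InstrInfo.cpp.

```cpp
// Hypothetical, simplified opcodes; the real table uses X86:: enumerators.
enum Opcode { FsANDPDrr, FsANDPDrm, FvANDPDrr, FvANDPDrm, NoOpcode };

constexpr unsigned TB_ALIGN_16 = 1u << 0;  // illustrative flag value

struct FoldEntry {
  Opcode regForm;     // register-register opcode
  Opcode memForm;     // register-memory opcode it may be folded into
  unsigned flags;     // e.g. TB_ALIGN_16
  unsigned memBytes;  // assumption: bytes the memory form reads
};

constexpr FoldEntry kFoldTable[] = {
    {FsANDPDrr, FsANDPDrm, TB_ALIGN_16, 16},  // scalar alias, still reads 16 bytes
    {FvANDPDrr, FvANDPDrm, TB_ALIGN_16, 16},  // true vector form
};

// Returns the memory form if folding is safe under this model, else NoOpcode.
inline Opcode foldLoad(Opcode regForm, unsigned loadBytes, unsigned alignBytes) {
  for (const FoldEntry &entry : kFoldTable) {
    if (entry.regForm != regForm)
      continue;
    if ((entry.flags & TB_ALIGN_16) && alignBytes < 16)
      return NoOpcode;  // alignment requirement not met
    if (loadBytes != entry.memBytes)
      return NoOpcode;  // size mismatch: folding would read extra bytes
    return entry.memForm;
  }
  return NoOpcode;
}
```

Under this model an 8-byte scalar load is never folded into the Fs* entry even when it is 16-byte aligned, which is the behavior the FIXME is asking for.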

Previous upload was missing the new test file that checks to make sure we're not folding loads in each of the modified tablegen patterns (AVX / SSE + float / double).

Hi Sanjay,

lib/Target/X86/X86InstrFragmentsSIMD.td
Line 374 (On Diff #19777): This is not valid, is it? When this matches, we will read 128 bits from memory, i.e., past what we read for the load32, won't we? Something correct would be load128 -> extract element, though I do not think that happens a lot…

spatel added inline comments. (Feb 12 2015, 3:20 PM)

lib/Target/X86/X86InstrFragmentsSIMD.td
Line 374 (On Diff #19777): I don't understand; we only want to match a 32-bit load that has been extended to fit in the 128-bit register, right? Isn't that what loadf32 guarantees? Perhaps this should be a zero-extend rather than scalar_to_vector though?

qcolombet added inline comments. (Feb 12 2015, 3:29 PM)

lib/Target/X86/X86InstrFragmentsSIMD.td
Line 374 (On Diff #19777): Well, I may certainly be misreading the uses of loadf32_128, but isn't this used to fold the load into the related operation, so that we read 128 bits from memory?

spatel added inline comments. (Feb 13 2015, 8:34 AM)

lib/Target/X86/X86InstrFragmentsSIMD.td
Line 374 (On Diff #19777): I think I understand now: this is only working because it's difficult to match the more complicated pattern. If we somehow managed to match that pattern, then we'd trigger the same bug all over again. So the fundamental problem is that we're creating instruction definitions that don't exist in x86 reality. I see 2 potential fixes:
(1) Pass a null / void memory operand pattern to the existing multiclass:
  defm PS : sse12_fp_packed<opc, !strconcat(OpcodeStr, "ps"), OpNode, FR32, f32, f128mem, [how to specify that nothing works here?], SSEPackedSingle, itins>, PS;
(2) Write a new multiclass that just has a register-register variant; no reg-mem option is possible.

qcolombet added inline comments. (Feb 13 2015, 10:20 AM)

lib/Target/X86/X86InstrFragmentsSIMD.td
Line 374 (On Diff #19777): For #1, I'm not sure this is possible. For #2, that sounds like the right approach. Just make sure that the related opcodes are not used anywhere. If they are, double-check that this is correct, and if it is correct, make sure to have a definition for those and just omit the patterns ([]). That's a lot of "if" :). Thanks.

Hi Quentin,

I went ahead with your first suggestion of changing the PatFrags to match load128 -> extract in this version of the patch. This corrects the buggy codegen via tablegen pattern-matching, but there's still a potential problem coming from the peephole pass and the load folding tables. I'll try to fix that in a subsequent patch.
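
The difference between the old and new pattern fragments can be sketched with a toy DAG in C++. This is only a model under stated assumptions: Node, Kind, and both matcher functions are hypothetical and do not correspond to SelectionDAG classes; they only mirror the shape of loadf32 / loadf64 versus loadf32_128 / loadf64_128.

```cpp
// Hypothetical miniature of a selection DAG node.
enum class Kind { Load, ExtractElement0 };

struct Node {
  Kind kind;
  unsigned loadBytes;   // meaningful for Load nodes only
  const Node *operand;  // meaningful for ExtractElement0 nodes only
};

// Old-style fragment: any scalar load matches, even though the instruction
// it is folded into reads 16 bytes.
inline bool matchesScalarLoad(const Node &node) {
  return node.kind == Kind::Load && node.loadBytes < 16;
}

// New-style fragment: element 0 extracted from a full 128-bit load, so the
// folded memory operand is the same size as what the instruction reads.
inline bool matchesExtractOfVectorLoad(const Node &node) {
  return node.kind == Kind::ExtractElement0 && node.operand != nullptr &&
         node.operand->kind == Kind::Load && node.operand->loadBytes == 16;
}

// Sample nodes: an 8-byte scalar load, a 16-byte vector load, and an
// extract of element 0 from that vector load.
const Node kScalarLoad{Kind::Load, 8, nullptr};
const Node kVectorLoad{Kind::Load, 16, nullptr};
const Node kExtract{Kind::ExtractElement0, 0, &kVectorLoad};
```

With the new fragment, kScalarLoad no longer matches and only kExtract does, which is why the buggy fold disappears from pattern matching while positive test cases become hard to write.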

Hi Sanjay,

Could you add tests for the new patterns?
I.e., some vector extract feeding an and.

Thanks,
-Quentin

In D7474#124867, @qcolombet wrote:

Could you add tests for the new patterns?
I.e., some vector extract feeding an and.

Hi,

This goes back to my initial problem - I don't know how to generate the positive test cases:

We can't bitcast our way there via an integer logic op because those wouldn't be lowered to X86::F[AND,OR,XOR].
We can't coerce an fabs / fneg / fcopysign into this pattern because they load a scalar; fixing that would be my next patch on this path.

There's some hope / danger that we'll start generating more of these nodes for shuffles or if we fix this:
http://llvm.org/bugs/show_bug.cgi?id=22428

...but until then I don't know how to produce IR to match the pattern. Suggestions welcome. :)

...but until then I don't know how to produce IR to match the pattern. Suggestions welcome. :)

I was afraid of that and I do not have any suggestions either.

LGTM then.

This revision is now accepted and ready to land. (Feb 17 2015, 11:04 AM)

Closed by commit rL229531: prevent folding a scalar FP load into a packed logical FP instruction (PR22371) (authored by spatel). (Feb 17 2015, 12:10 PM)

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D11477: fix invalid load folding with SSE/AVX FP logical instructions (PR22371). (Jul 23 2015, 3:11 PM)

spatel mentioned this in rL243361: fix invalid load folding with SSE/AVX FP logical instructions (PR22371). (Jul 27 2015, 5:49 PM)

hans mentioned this in rL243435: Merging r243361. (Jul 28 2015, 9:20 AM)

Diff 20102

llvm/trunk/lib/Target/X86/X86InstrFragmentsSIMD.td

	def loadv16i32 : PatFrag<(ops node:$ptr), (v16i32 (load node:$ptr))>;			def loadv16i32 : PatFrag<(ops node:$ptr), (v16i32 (load node:$ptr))>;
	def loadv8i64 : PatFrag<(ops node:$ptr), (v8i64 (load node:$ptr))>;			def loadv8i64 : PatFrag<(ops node:$ptr), (v8i64 (load node:$ptr))>;

	// 128-/256-/512-bit extload pattern fragments			// 128-/256-/512-bit extload pattern fragments
	def extloadv2f32 : PatFrag<(ops node:$ptr), (v2f64 (extloadvf32 node:$ptr))>;			def extloadv2f32 : PatFrag<(ops node:$ptr), (v2f64 (extloadvf32 node:$ptr))>;
	def extloadv4f32 : PatFrag<(ops node:$ptr), (v4f64 (extloadvf32 node:$ptr))>;			def extloadv4f32 : PatFrag<(ops node:$ptr), (v4f64 (extloadvf32 node:$ptr))>;
	def extloadv8f32 : PatFrag<(ops node:$ptr), (v8f64 (extloadvf32 node:$ptr))>;			def extloadv8f32 : PatFrag<(ops node:$ptr), (v8f64 (extloadvf32 node:$ptr))>;

				// These are needed to match a scalar load that is used in a vector-only
				// math instruction such as the FP logical ops: andps, andnps, orps, xorps.
				// The memory operand is required to be a 128-bit load, so it must be converted
				// from a vector to a scalar.
				def loadf32_128 : PatFrag<(ops node:$ptr),
				(f32 (vector_extract (loadv4f32 node:$ptr), (iPTR 0)))>;
				def loadf64_128 : PatFrag<(ops node:$ptr),
				(f64 (vector_extract (loadv2f64 node:$ptr), (iPTR 0)))>;

	// Like 'store', but always requires 128-bit vector alignment.			// Like 'store', but always requires 128-bit vector alignment.
	def alignedstore : PatFrag<(ops node:$val, node:$ptr),			def alignedstore : PatFrag<(ops node:$val, node:$ptr),
	(store node:$val, node:$ptr), [{			(store node:$val, node:$ptr), [{
	return cast<StoreSDNode>(N)->getAlignment() >= 16;			return cast<StoreSDNode>(N)->getAlignment() >= 16;
	}]>;			}]>;

	// Like 'store', but always requires 256-bit vector alignment.			// Like 'store', but always requires 256-bit vector alignment.
	def alignedstore256 : PatFrag<(ops node:$val, node:$ptr),			def alignedstore256 : PatFrag<(ops node:$val, node:$ptr),
	...
	def memopfsf64 : PatFrag<(ops node:$ptr), (f64 (memop node:$ptr))>;			def memopfsf64 : PatFrag<(ops node:$ptr), (f64 (memop node:$ptr))>;

	// 128-bit memop pattern fragments			// 128-bit memop pattern fragments
	// NOTE: all 128-bit integer vector loads are promoted to v2i64			// NOTE: all 128-bit integer vector loads are promoted to v2i64
	def memopv4f32 : PatFrag<(ops node:$ptr), (v4f32 (memop node:$ptr))>;			def memopv4f32 : PatFrag<(ops node:$ptr), (v4f32 (memop node:$ptr))>;
	def memopv2f64 : PatFrag<(ops node:$ptr), (v2f64 (memop node:$ptr))>;			def memopv2f64 : PatFrag<(ops node:$ptr), (v2f64 (memop node:$ptr))>;
	def memopv2i64 : PatFrag<(ops node:$ptr), (v2i64 (memop node:$ptr))>;			def memopv2i64 : PatFrag<(ops node:$ptr), (v2i64 (memop node:$ptr))>;

				// These are needed to match a scalar memop that is used in a vector-only
				// math instruction such as the FP logical ops: andps, andnps, orps, xorps.
				// The memory operand is required to be a 128-bit load, so it must be converted
				// from a vector to a scalar.
				def memopfsf32_128 : PatFrag<(ops node:$ptr),
				(f32 (vector_extract (memopv4f32 node:$ptr), (iPTR 0)))>;
				def memopfsf64_128 : PatFrag<(ops node:$ptr),
				(f64 (vector_extract (memopv2f64 node:$ptr), (iPTR 0)))>;


	// SSSE3 uses MMX registers for some instructions. They aren't aligned on a			// SSSE3 uses MMX registers for some instructions. They aren't aligned on a
	// 16-byte boundary.			// 16-byte boundary.
	// FIXME: 8 byte alignment for mmx reads is not required			// FIXME: 8 byte alignment for mmx reads is not required
	def memop64 : PatFrag<(ops node:$ptr), (load node:$ptr), [{			def memop64 : PatFrag<(ops node:$ptr), (load node:$ptr), [{
	return cast<LoadSDNode>(N)->getAlignment() >= 8;			return cast<LoadSDNode>(N)->getAlignment() >= 8;
	}]>;			}]>;

	def memopmmx : PatFrag<(ops node:$ptr), (x86mmx (memop64 node:$ptr))>;			def memopmmx : PatFrag<(ops node:$ptr), (x86mmx (memop64 node:$ptr))>;

llvm/trunk/lib/Target/X86/X86InstrInfo.cpp

static const X86OpTblEntry OpTbl2[] = {
{ X86::DIVPDrr, X86::DIVPDrm, TB_ALIGN_16 },		{ X86::DIVPDrr, X86::DIVPDrm, TB_ALIGN_16 },
{ X86::DIVPSrr, X86::DIVPSrm, TB_ALIGN_16 },		{ X86::DIVPSrr, X86::DIVPSrm, TB_ALIGN_16 },
{ X86::DIVSDrr, X86::DIVSDrm, 0 },		{ X86::DIVSDrr, X86::DIVSDrm, 0 },
{ X86::DIVSDrr_Int, X86::DIVSDrm_Int, 0 },		{ X86::DIVSDrr_Int, X86::DIVSDrm_Int, 0 },
{ X86::DIVSSrr, X86::DIVSSrm, 0 },		{ X86::DIVSSrr, X86::DIVSSrm, 0 },
{ X86::DIVSSrr_Int, X86::DIVSSrm_Int, 0 },		{ X86::DIVSSrr_Int, X86::DIVSSrm_Int, 0 },
{ X86::DPPDrri, X86::DPPDrmi, TB_ALIGN_16 },		{ X86::DPPDrri, X86::DPPDrmi, TB_ALIGN_16 },
{ X86::DPPSrri, X86::DPPSrmi, TB_ALIGN_16 },		{ X86::DPPSrri, X86::DPPSrmi, TB_ALIGN_16 },

		// FIXME: We should not be folding Fs* scalar loads into vector
		// instructions because the vector instructions require vector-sized
		// loads. Lowering should create vector-sized instructions (the Fv*
		// variants below) to allow load folding.
{ X86::FsANDNPDrr, X86::FsANDNPDrm, TB_ALIGN_16 },		{ X86::FsANDNPDrr, X86::FsANDNPDrm, TB_ALIGN_16 },
{ X86::FsANDNPSrr, X86::FsANDNPSrm, TB_ALIGN_16 },		{ X86::FsANDNPSrr, X86::FsANDNPSrm, TB_ALIGN_16 },
{ X86::FsANDPDrr, X86::FsANDPDrm, TB_ALIGN_16 },		{ X86::FsANDPDrr, X86::FsANDPDrm, TB_ALIGN_16 },
{ X86::FsANDPSrr, X86::FsANDPSrm, TB_ALIGN_16 },		{ X86::FsANDPSrr, X86::FsANDPSrm, TB_ALIGN_16 },
{ X86::FsORPDrr, X86::FsORPDrm, TB_ALIGN_16 },		{ X86::FsORPDrr, X86::FsORPDrm, TB_ALIGN_16 },
{ X86::FsORPSrr, X86::FsORPSrm, TB_ALIGN_16 },		{ X86::FsORPSrr, X86::FsORPSrm, TB_ALIGN_16 },
{ X86::FsXORPDrr, X86::FsXORPDrm, TB_ALIGN_16 },		{ X86::FsXORPDrr, X86::FsXORPDrm, TB_ALIGN_16 },
{ X86::FsXORPSrr, X86::FsXORPSrm, TB_ALIGN_16 },		{ X86::FsXORPSrr, X86::FsXORPSrm, TB_ALIGN_16 },

		{ X86::FvANDNPDrr, X86::FvANDNPDrm, TB_ALIGN_16 },
		{ X86::FvANDNPSrr, X86::FvANDNPSrm, TB_ALIGN_16 },
		{ X86::FvANDPDrr, X86::FvANDPDrm, TB_ALIGN_16 },
		{ X86::FvANDPSrr, X86::FvANDPSrm, TB_ALIGN_16 },
		{ X86::FvORPDrr, X86::FvORPDrm, TB_ALIGN_16 },
		{ X86::FvORPSrr, X86::FvORPSrm, TB_ALIGN_16 },
		{ X86::FvXORPDrr, X86::FvXORPDrm, TB_ALIGN_16 },
		{ X86::FvXORPSrr, X86::FvXORPSrm, TB_ALIGN_16 },
{ X86::HADDPDrr, X86::HADDPDrm, TB_ALIGN_16 },		{ X86::HADDPDrr, X86::HADDPDrm, TB_ALIGN_16 },
{ X86::HADDPSrr, X86::HADDPSrm, TB_ALIGN_16 },		{ X86::HADDPSrr, X86::HADDPSrm, TB_ALIGN_16 },
{ X86::HSUBPDrr, X86::HSUBPDrm, TB_ALIGN_16 },		{ X86::HSUBPDrr, X86::HSUBPDrm, TB_ALIGN_16 },
{ X86::HSUBPSrr, X86::HSUBPSrm, TB_ALIGN_16 },		{ X86::HSUBPSrr, X86::HSUBPSrm, TB_ALIGN_16 },
{ X86::IMUL16rr, X86::IMUL16rm, 0 },		{ X86::IMUL16rr, X86::IMUL16rm, 0 },
{ X86::IMUL32rr, X86::IMUL32rm, 0 },		{ X86::IMUL32rr, X86::IMUL32rm, 0 },
{ X86::IMUL64rr, X86::IMUL64rm, 0 },		{ X86::IMUL64rr, X86::IMUL64rm, 0 },
{ X86::Int_CMPSDrr, X86::Int_CMPSDrm, 0 },		{ X86::Int_CMPSDrr, X86::Int_CMPSDrm, 0 },
...
{ X86::VDIVPDrr, X86::VDIVPDrm, 0 },		{ X86::VDIVPDrr, X86::VDIVPDrm, 0 },
{ X86::VDIVPSrr, X86::VDIVPSrm, 0 },		{ X86::VDIVPSrr, X86::VDIVPSrm, 0 },
{ X86::VDIVSDrr, X86::VDIVSDrm, 0 },		{ X86::VDIVSDrr, X86::VDIVSDrm, 0 },
{ X86::VDIVSDrr_Int, X86::VDIVSDrm_Int, 0 },		{ X86::VDIVSDrr_Int, X86::VDIVSDrm_Int, 0 },
{ X86::VDIVSSrr, X86::VDIVSSrm, 0 },		{ X86::VDIVSSrr, X86::VDIVSSrm, 0 },
{ X86::VDIVSSrr_Int, X86::VDIVSSrm_Int, 0 },		{ X86::VDIVSSrr_Int, X86::VDIVSSrm_Int, 0 },
{ X86::VDPPDrri, X86::VDPPDrmi, 0 },		{ X86::VDPPDrri, X86::VDPPDrmi, 0 },
{ X86::VDPPSrri, X86::VDPPSrmi, 0 },		{ X86::VDPPSrri, X86::VDPPSrmi, 0 },
{ X86::VFsANDNPDrr, X86::VFsANDNPDrm, 0 },		// Do not fold VFs* loads because there are no scalar load variants for
{ X86::VFsANDNPSrr, X86::VFsANDNPSrm, 0 },		// these instructions. When folded, the load is required to be 128-bits, so
{ X86::VFsANDPDrr, X86::VFsANDPDrm, 0 },		// the load size would not match.
{ X86::VFsANDPSrr, X86::VFsANDPSrm, 0 },		{ X86::VFvANDNPDrr, X86::VFvANDNPDrm, 0 },
{ X86::VFsORPDrr, X86::VFsORPDrm, 0 },		{ X86::VFvANDNPSrr, X86::VFvANDNPSrm, 0 },
{ X86::VFsORPSrr, X86::VFsORPSrm, 0 },		{ X86::VFvANDPDrr, X86::VFvANDPDrm, 0 },
{ X86::VFsXORPDrr, X86::VFsXORPDrm, 0 },		{ X86::VFvANDPSrr, X86::VFvANDPSrm, 0 },
{ X86::VFsXORPSrr, X86::VFsXORPSrm, 0 },		{ X86::VFvORPDrr, X86::VFvORPDrm, 0 },
		{ X86::VFvORPSrr, X86::VFvORPSrm, 0 },
		{ X86::VFvXORPDrr, X86::VFvXORPDrm, 0 },
		{ X86::VFvXORPSrr, X86::VFvXORPSrm, 0 },
{ X86::VHADDPDrr, X86::VHADDPDrm, 0 },		{ X86::VHADDPDrr, X86::VHADDPDrm, 0 },
{ X86::VHADDPSrr, X86::VHADDPSrm, 0 },		{ X86::VHADDPSrr, X86::VHADDPSrm, 0 },
{ X86::VHSUBPDrr, X86::VHSUBPDrm, 0 },		{ X86::VHSUBPDrr, X86::VHSUBPDrm, 0 },
{ X86::VHSUBPSrr, X86::VHSUBPSrm, 0 },		{ X86::VHSUBPSrr, X86::VHSUBPSrm, 0 },
{ X86::Int_VCMPSDrr, X86::Int_VCMPSDrm, 0 },		{ X86::Int_VCMPSDrr, X86::Int_VCMPSDrm, 0 },
{ X86::Int_VCMPSSrr, X86::Int_VCMPSSrm, 0 },		{ X86::Int_VCMPSSrr, X86::Int_VCMPSSrm, 0 },
{ X86::VMAXPDrr, X86::VMAXPDrm, 0 },		{ X86::VMAXPDrr, X86::VMAXPDrm, 0 },
{ X86::VMAXPSrr, X86::VMAXPSrm, 0 },		{ X86::VMAXPSrr, X86::VMAXPSrm, 0 },

llvm/trunk/lib/Target/X86/X86InstrSSE.td

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// SSE 1 & 2 - Logical Instructions			// SSE 1 & 2 - Logical Instructions
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	// Multiclass for scalars using the X86 logical operation aliases for FP.			// Multiclass for scalars using the X86 logical operation aliases for FP.
	multiclass sse12_fp_packed_scalar_logical_alias<			multiclass sse12_fp_packed_scalar_logical_alias<
	bits<8> opc, string OpcodeStr, SDNode OpNode, OpndItins itins> {			bits<8> opc, string OpcodeStr, SDNode OpNode, OpndItins itins> {
	defm V#NAME#PS : sse12_fp_packed<opc, !strconcat(OpcodeStr, "ps"), OpNode,			defm V#NAME#PS : sse12_fp_packed<opc, !strconcat(OpcodeStr, "ps"), OpNode,
	FR32, f32, f128mem, loadf32, SSEPackedSingle, itins, 0>,			FR32, f32, f128mem, loadf32_128, SSEPackedSingle, itins, 0>,
	PS, VEX_4V;			PS, VEX_4V;

	defm V#NAME#PD : sse12_fp_packed<opc, !strconcat(OpcodeStr, "pd"), OpNode,			defm V#NAME#PD : sse12_fp_packed<opc, !strconcat(OpcodeStr, "pd"), OpNode,
	FR64, f64, f128mem, loadf64, SSEPackedDouble, itins, 0>,			FR64, f64, f128mem, loadf64_128, SSEPackedDouble, itins, 0>,
	PD, VEX_4V;			PD, VEX_4V;

	let Constraints = "$src1 = $dst" in {			let Constraints = "$src1 = $dst" in {
	defm PS : sse12_fp_packed<opc, !strconcat(OpcodeStr, "ps"), OpNode, FR32,			defm PS : sse12_fp_packed<opc, !strconcat(OpcodeStr, "ps"), OpNode, FR32,
	f32, f128mem, memopfsf32, SSEPackedSingle, itins>,			f32, f128mem, memopfsf32_128, SSEPackedSingle, itins>, PS;
	PS;

	defm PD : sse12_fp_packed<opc, !strconcat(OpcodeStr, "pd"), OpNode, FR64,			defm PD : sse12_fp_packed<opc, !strconcat(OpcodeStr, "pd"), OpNode, FR64,
	f64, f128mem, memopfsf64, SSEPackedDouble, itins>,			f64, f128mem, memopfsf64_128, SSEPackedDouble, itins>, PD;
	PD;
	}			}
	}			}

	let isCodeGenOnly = 1 in {			let isCodeGenOnly = 1 in {
	defm FsAND : sse12_fp_packed_scalar_logical_alias<0x54, "and", X86fand,			defm FsAND : sse12_fp_packed_scalar_logical_alias<0x54, "and", X86fand,
	SSE_BIT_ITINS_P>;			SSE_BIT_ITINS_P>;
	defm FsOR : sse12_fp_packed_scalar_logical_alias<0x56, "or", X86for,			defm FsOR : sse12_fp_packed_scalar_logical_alias<0x56, "or", X86for,
	SSE_BIT_ITINS_P>;			SSE_BIT_ITINS_P>;

llvm/trunk/test/CodeGen/X86/logical-load-fold.ll

				; RUN: llc < %s -mcpu=x86-64 -mattr=sse2,sse-unaligned-mem | FileCheck %s --check-prefix=SSE2
				; RUN: llc < %s -mcpu=x86-64 -mattr=avx | FileCheck %s --check-prefix=AVX

				; Although we have the ability to fold an unaligned load with AVX
				; and under special conditions with some SSE implementations, we
				; can not fold the load under any circumstances in these test
				; cases because they are not 16-byte loads. The load must be
				; executed as a scalar ('movs*') with a zero extension to
				; 128-bits and then used in the packed logical ('andp*') op.
				; PR22371 - http://llvm.org/bugs/show_bug.cgi?id=22371

				define double @load_double_no_fold(double %x, double %y) {
				; SSE2-LABEL: load_double_no_fold:
				; SSE2: ## BB#0:
				; SSE2-NEXT: cmplesd %xmm0, %xmm1
				; SSE2-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
				; SSE2-NEXT: andpd %xmm1, %xmm0
				; SSE2-NEXT: retq
				;
				; AVX-LABEL: load_double_no_fold:
				; AVX: ## BB#0:
				; AVX-NEXT: vcmplesd %xmm0, %xmm1, %xmm0
				; AVX-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero
				; AVX-NEXT: vandpd %xmm1, %xmm0, %xmm0
				; AVX-NEXT: retq

				%cmp = fcmp oge double %x, %y
				%zext = zext i1 %cmp to i32
				%conv = sitofp i32 %zext to double
				ret double %conv
				}

				define float @load_float_no_fold(float %x, float %y) {
				; SSE2-LABEL: load_float_no_fold:
				; SSE2: ## BB#0:
				; SSE2-NEXT: cmpless %xmm0, %xmm1
				; SSE2-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; SSE2-NEXT: andps %xmm1, %xmm0
				; SSE2-NEXT: retq
				;
				; AVX-LABEL: load_float_no_fold:
				; AVX: ## BB#0:
				; AVX-NEXT: vcmpless %xmm0, %xmm1, %xmm0
				; AVX-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; AVX-NEXT: vandps %xmm1, %xmm0, %xmm0
				; AVX-NEXT: retq

				%cmp = fcmp oge float %x, %y
				%zext = zext i1 %cmp to i32
				%conv = sitofp i32 %zext to float
				ret float %conv
				}

llvm/trunk/test/CodeGen/X86/stack-align.ll

	; RUN: llc < %s -relocation-model=static -mcpu=yonah | FileCheck %s			; RUN: llc < %s -relocation-model=static -mcpu=yonah | FileCheck %s

	; The double argument is at 4(esp) which is 16-byte aligned, allowing us to			; The double argument is at 4(esp) which is 16-byte aligned, but we
	; fold the load into the andpd.			; are required to read in extra bytes of memory in order to fold the
				; load. Bad Things may happen when reading/processing undefined bytes,
				; so don't fold the load.
				; PR22371 / http://reviews.llvm.org/D7474

	target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"			target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
	target triple = "i686-apple-darwin8"			target triple = "i686-apple-darwin8"
	@G = external global double			@G = external global double

	define void @test({ double, double }* byval %z, double* %P) nounwind {			define void @test({ double, double }* byval %z, double* %P) nounwind {
	entry:			entry:
	%tmp3 = load double* @G, align 16 ; <double> [#uses=1]			%tmp3 = load double* @G, align 16 ; <double> [#uses=1]
	%tmp4 = tail call double @fabs( double %tmp3 ) readnone ; <double> [#uses=1]			%tmp4 = tail call double @fabs( double %tmp3 ) readnone ; <double> [#uses=1]
	store volatile double %tmp4, double* %P			store volatile double %tmp4, double* %P
	%tmp = getelementptr { double, double }* %z, i32 0, i32 0 ; <double*> [#uses=1]			%tmp = getelementptr { double, double }* %z, i32 0, i32 0 ; <double*> [#uses=1]
	%tmp1 = load volatile double* %tmp, align 8 ; <double> [#uses=1]			%tmp1 = load volatile double* %tmp, align 8 ; <double> [#uses=1]
	%tmp2 = tail call double @fabs( double %tmp1 ) readnone ; <double> [#uses=1]			%tmp2 = tail call double @fabs( double %tmp1 ) readnone ; <double> [#uses=1]
	; CHECK: andpd{{.*}}4(%esp), %xmm
	%tmp6 = fadd double %tmp4, %tmp2 ; <double> [#uses=1]			%tmp6 = fadd double %tmp4, %tmp2 ; <double> [#uses=1]
	store volatile double %tmp6, double* %P, align 8			store volatile double %tmp6, double* %P, align 8
	ret void			ret void

				; CHECK-LABEL: test:
				; CHECK: movsd {{.*}}G, %xmm{{.*}}
				; CHECK: andpd %xmm{{.*}}, %xmm{{.*}}
				; CHECK: movsd 4(%esp), %xmm{{.*}}
				; CHECK: andpd %xmm{{.*}}, %xmm{{.*}}


	}			}

	define void @test2() alignstack(16) nounwind {			define void @test2() alignstack(16) nounwind {
	entry:			entry:
				; CHECK-LABEL: test2:
	; CHECK: andl{{.*}}$-16, %esp			; CHECK: andl{{.*}}$-16, %esp
	ret void			ret void
	}			}

	; Use a call to force a spill.			; Use a call to force a spill.
	define <2 x double> @test3(<2 x double> %x, <2 x double> %y) alignstack(32) nounwind {			define <2 x double> @test3(<2 x double> %x, <2 x double> %y) alignstack(32) nounwind {
	entry:			entry:
				; CHECK-LABEL: test3:
	; CHECK: andl{{.*}}$-32, %esp			; CHECK: andl{{.*}}$-32, %esp
	call void @test2()			call void @test2()
	%A = fmul <2 x double> %x, %y			%A = fmul <2 x double> %x, %y
	ret <2 x double> %A			ret <2 x double> %A
	}			}

	declare double @fabs(double)			declare double @fabs(double)

	; The pointer is already known aligned, so and x,-16 is eliminable.			; The pointer is already known aligned, so and x,-16 is eliminable.

This is an archive of the discontinued LLVM Phabricator instance.

prevent folding a scalar FP load into a packed logical FP instruction (PR22371)
ClosedPublic
