This is an archive of the discontinued LLVM Phabricator instance.

prevent folding a scalar FP load into a packed logical FP instruction (PR22371)
ClosedPublic

Authored by spatel on Feb 6 2015, 1:39 PM.

Details

Summary

I have low hopes that this patch will meet approval as-is, but I figure a patch proposal is the best way to push to the correct solution.

There are at least 2 separate issues in PR22371 ( http://llvm.org/bugs/show_bug.cgi?id=22371 ):

  1. We're folding loads at unaligned addresses into SSE (non-VEX prefixed) instructions. That part of the bug should be resolved with r227983.
  2. We're folding scalar FP load operands (32 or 64 bit) into packed logical FP instructions which are required to load 128-bits. That's what I'm trying to fix in this patch.

Here's the example codegen that I think we have to prevent:

LCPI0_0:
  .quad	4607182418800017408     ## double 1
...
  cmplesd	%xmm0, %xmm1
  andpd	LCPI0_0(%rip), %xmm1   <--- load 128-bits from a 64-bit location
  movapd	%xmm1, %xmm0
  retq

The edit of the multiclass defs is simple: change the memory operands in sse12_fp_packed_scalar_logical_alias from scalars to vectors. That's what the hardware packed logical FP instructions define: 128-bit memory operands. How we would ever actually match that pattern, I don't know. And so I have no positive (load folding) test cases in this patch.

There's also an existing bug in the AVX flavors of the defm's: they should be using load operands, not memop* operands. The only difference between those two is that memops have an extra alignment check to match the SSE spec. This part of the bug was extended by r228123 ( http://reviews.llvm.org/rL228123 ). Again, I'm not sure how to test that path. There weren't any tests added with r228123 either, so that makes me feel better.

The test cases that I'm proposing to add in 'logical-load-fold.ll' cover negative cases for the 4 paths (float/double * sse/avx) of the sse12_fp_packed_scalar_logical_alias multiclass. I haven't found any other way to generate a packed logical FP instruction with a memory operand.

I'm also proposing to XFAIL a test case that wants to fold a scalar load from the stack into a packed logical FP instruction. Although this may be possible, I don't see how we can do this without custom logic to know that it's safe to read the extra bytes from memory, so I extracted that one test into its own file and added comments to explain the situation.

Diff Detail

Event Timeline

spatel updated this revision to Diff 19508.Feb 6 2015, 1:39 PM
spatel retitled this revision from to prevent folding a scalar FP load into a packed logical FP instruction (PR22371).
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added a subscriber: Unknown Object (MLST).
RKSimon added a subscriber: RKSimon.Feb 6 2015, 5:20 PM
hans added a subscriber: hans.Feb 9 2015, 8:15 PM
qcolombet edited edge metadata.Feb 10 2015, 3:24 PM

Hi Sanjay,

test/CodeGen/X86/logical-load-fold.ll
9 ↗(On Diff #19508)

Do we really care about that?

Unless we hit some read protection error, I do not see why reading undefined memory is a problem as long as we use only the part that is defined.
It seems that is the case in those cases, isn’t it?

Unless we hit some read protection error, I do not see why reading undefined memory is a problem
as long as we use only the part that is defined. It seems that is the case in those cases, isn’t it?

Hi Quentin,

I think there is a danger beyond just a read protection error. For example, what if there's a denorm FP value in that extra bit of memory that we were never supposed to load?

See PR20358 ( http://llvm.org/bugs/show_bug.cgi?id=20358 ) for a really bad example of what might happen in that case - 19x perf hit.

Also, I'm not a security expert, but I worry that reading any extra mem is some kind of security violation / potential for hackery?

Hi Sanjay,

For example, what if there's a denorm FP value in that extra bit of memory that we were never supposed to load?

That's a good point, we may even generate exceptions! E.g., with a packed division.

I'll have a closer look to the patch.

Thanks,
-Quentin

test/CodeGen/X86/stack-align-vector-load.ll
11 ↗(On Diff #19508)

Following the logic that reading undef is evil, shouldn't we just make sure that indeed this test is failing to fold?

spatel added inline comments.Feb 11 2015, 6:56 AM
test/CodeGen/X86/stack-align-vector-load.ll
11 ↗(On Diff #19508)

Yes, that would be the safer thing, and there's very little upside in folding an FP logic load, so this optimization would probably remain XFAIL forever.

I'll update the patch with this change.

spatel updated this revision to Diff 19776.Feb 11 2015, 12:39 PM
spatel edited edge metadata.

Updated patch:

  1. Rather than XFAIL the existing test case, check to make sure we don't fold an oversized load.
  2. r228671 (no test cases?) compounded the bug by adding the AVX FP scalar FP logical ops to the load folding tables; removed those and added the AVX vector FP logical ops.
  3. There's still a bug in the SSE scalar FP logical ops; added a FIXME so we don't lose track of that.
spatel updated this revision to Diff 19777.Feb 11 2015, 12:44 PM

Previous upload was missing the new test file that checks to make sure we're not folding loads in each of the modified tablegen patterns (AVX / SSE + float / double).

Hi Sanjay,

lib/Target/X86/X86InstrFragmentsSIMD.td
374

This is not valid, is it?

When this matches we will read 128-bit from the memory, i.e., pass what we do for the load32. Aren’t we?

Something correct would be load128 -> extract element.
Though I do not think that happens a lot…

spatel added inline comments.Feb 12 2015, 3:20 PM
lib/Target/X86/X86InstrFragmentsSIMD.td
374

I don't understand; we only want to match a 32-bit load that has been extended to fit in the 128-bit register, right? Isn't that what loadf32 guarantees? Perhaps this should be a zero-extend rather than scalar_to_vector though?

qcolombet added inline comments.Feb 12 2015, 3:29 PM
lib/Target/X86/X86InstrFragmentsSIMD.td
374

Well I may certainly misread the uses of loadf32_128, but does not this is used to fold the load in the related operation, thus we read 128-bit in memory, don't we?

spatel added inline comments.Feb 13 2015, 8:34 AM
lib/Target/X86/X86InstrFragmentsSIMD.td
374

I think I understand now: this is only working because it's difficult to match the more complicated pattern. If we somehow managed to match that pattern, then we'd trigger the same bug all over again.

So the fundamental problem is that we're creating instruction definitions that don't exist in x86 reality. I see 2 potential fixes:

(1) Pass a null / void memory operand pattern to the existing multiclass:

defm PS : sse12_fp_packed<opc, !strconcat(OpcodeStr, "ps"), OpNode, FR32,
f32, f128mem, [how to specify that nothing works here?] , SSEPackedSingle, itins>, PS;

(2) Write a new multiclass that just has a register-register variant; no reg-mem option is possible.

qcolombet added inline comments.Feb 13 2015, 10:20 AM
lib/Target/X86/X86InstrFragmentsSIMD.td
374

For #1, I'm not sure this is possible.

For #2, that sounds like the right approach. Just make sure that the related opcode are not used anywhere. If they are double check that this is correct and if it is correct, make sure to have a definition for those, just omit the patterns ([]).

That's a lot of "if" :).

Thanks.

spatel updated this revision to Diff 20087.Feb 17 2015, 8:04 AM

Hi Quentin,

I went ahead with your first suggestion of changing the PatFrags to match load128 -> extract in this version of the patch. This corrects the buggy codegen via tablegen pattern-matching, but there's still a potential problem coming from the peephole pass and the load folding tables. I'll try to fix that in a subsequent patch.

Hi Sanjay,

Could you add tests for the new patterns?
I.e., some vector extract feeding an and.

Thanks,
-Quentin

Could you add tests for the new patterns?
I.e., some vector extract feeding an and.

Hi,

This goes back to my initial problem - I don't know how to generate the positive test cases:

  1. We can't bitcast our way there via an integer logic op because those wouldn't be lowered to X86::F[AND,OR,XOR].
  2. We can't coerce an fabs / fneg / fcopysign into this pattern because they load a scalar; fixing that would be my next patch on this path.

There's some hope / danger that we'll start generating more of these nodes for shuffles or if we fix this:
http://llvm.org/bugs/show_bug.cgi?id=22428

...but until then I don't know how to produce IR to match the pattern. Suggestions welcome. :)

qcolombet accepted this revision.Feb 17 2015, 11:04 AM
qcolombet edited edge metadata.

...but until then I don't know how to produce IR to match the pattern. Suggestions welcome. :)

I was afraid of that and I do not have any suggestions either.

LGTM then.

Q.

This revision is now accepted and ready to land.Feb 17 2015, 11:04 AM
This revision was automatically updated to reflect the committed changes.