I have low hopes that this patch will meet approval as-is, but I figure a patch proposal is the best way to push to the correct solution.
There are at least 2 separate issues in PR22371 ( http://llvm.org/bugs/show_bug.cgi?id=22371 ):
- We're folding loads at unaligned addresses into SSE (non-VEX prefixed) instructions. That part of the bug should be resolved with r227983.
- We're folding scalar FP load operands (32 or 64 bit) into packed logical FP instructions which are required to load 128-bits. That's what I'm trying to fix in this patch.
Here's the example codegen that I think we have to prevent:
LCPI0_0: .quad 4607182418800017408 ## double 1 ... cmplesd %xmm0, %xmm1 andpd LCPI0_0(%rip), %xmm1 <--- load 128-bits from a 64-bit location movapd %xmm1, %xmm0 retq
The edit of the multiclass defs is simple: change the memory operands in sse12_fp_packed_scalar_logical_alias from scalars to vectors. That's what the hardware packed logical FP instructions define: 128-bit memory operands. How we would ever actually match that pattern, I don't know. And so I have no positive (load folding) test cases in this patch.
There's also an existing bug in the AVX flavors of the defm's: they should be using load operands, not memop* operands. The only difference between those two is that memops have an extra alignment check to match the SSE spec. This part of the bug was extended by r228123 ( http://reviews.llvm.org/rL228123 ). Again, I'm not sure how to test that path. There weren't any tests added with r228123 either, so that makes me feel better.
The test cases that I'm proposing to add in 'logical-load-fold.ll' cover negative cases for the 4 paths (float/double * sse/avx) of the sse12_fp_packed_scalar_logical_alias multiclass. I haven't found any other way to generate a packed logical FP instruction with a memory operand.
I'm also proposing to XFAIL a test case that wants to fold a scalar load from the stack into a packed logical FP instruction. Although this may be possible, I don't see how we can do this without custom logic to know that it's safe to read the extra bytes from memory, so I extracted that one test into its own file and added comments to explain the situation.
This is not valid, is it?
When this matches we will read 128-bit from the memory, i.e., pass what we do for the load32. Aren’t we?
Something correct would be load128 -> extract element.
Though I do not think that happens a lot…