When inserting an element that's coming from a vector load or a broadcast
of a vector (or scalar) load, combine the load into the insertps
instruction.
Added test cases.
Differential D3581: Added more insertps optimizations
Closed, Public · Authored by filcab on Apr 30 2014, 8:25 PM.
filcab updated this object.

Comment (filcab):
Pinging the patch. I also changed a CHECK to a CHECK-NOT in a test, instead of removing it.

Comment (Andrea):
Hi Filipe,

I think there is a better way to fix this bug. For example, you can simply add the following ISel patterns to X86InstrSSE.td instead of adding new target combine rules in X86ISelLowering.cpp:

```tablegen
let Predicates = [UseSSE41] in {
  def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1),
                                (loadv4f32 addr:$src2), imm:$src3)),
            (INSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;
  def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1),
                                (X86PShufd (v4f32 (scalar_to_vector
                                  (loadf32 addr:$src2))), (i8 0)),
                                imm:$src3)),
            (INSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;
}

let Predicates = [UseAVX] in {
  def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1),
                                (loadv4f32 addr:$src2), imm:$src3)),
            (VINSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;
  def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1),
                                (X86VBroadcast (loadf32 addr:$src2)),
                                imm:$src3)),
            (VINSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;
}
```

I am pretty sure that the four patterns above would cover all the interesting cases.

Also, you should add tests to verify the AVX codegen as well. Currently your new tests only verify that we do the correct thing with SSE4.1. However, your patch also affects AVX (in fact, you specifically added combine rules for the case where the inserted element comes from a vbroadcast).

-Andrea

Comment (filcab):
Hi Andrea,

Wouldn't the TableGen patterns be problematic when we use the load result more than once? That was the reasoning behind doing it with code and guarding it with one-use checks.

I'll add the AVX tests today.

Filipe

Comment (Andrea):
The ISel pattern matcher will only reduce a DAG sequence if the chain is "profitable to fold". A chain is never profitable, for example, if it contains nodes with more than one use. In the case of your new test function 'insertps_from_vector_load', if I add another use for variable %1, then the following rule will no longer trigger:

```tablegen
def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1),
                              (loadv4f32 addr:$src2), imm:$src3)),
          (INSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;
```

That is because method 'X86DAGToDAGISel::IsProfitableToFold' will eventually return false if the 'loadv4f32' has more than one use (see X86ISelDAGToDAG.cpp, around line 303). Basically, the matcher will verify that all the intermediate nodes between the X86insertps (excluded) and the loadv4f32 (included) have a single use. Given how the ISel matcher works, a load result that is used several times will not be folded, so we will not generate additional loads.
That reasoning is correct. However, the pattern matcher will do the same thing, and will never try to fold the loadv4f32 if it has more than one use. (P.S.: you can verify this yourself by running some experiments: add more uses for the load instruction, then debug ISel by passing -debug to llc.) I hope this helps.

filcab edited edge metadata.

Comment (filcab):
Added the patterns that Andrea mentioned. We still need to have PerformSELECTCombine for the case when we do a […]. The other cases are simplified by just handling them with patterns in X86InstrSSE.td.

Added test cases for SSE4.1, AVX, and AVX2.

Comment (Andrea):
Hi Filipe, I tested your patch and it works for me.
Comment Actions Thanks Andrea. Committed as r209156.
Revision Contents — Diff 9502:
docs/CommandGuide/llvm-symbolizer.rst
include/llvm/CodeGen/MachineOperand.h
include/llvm/DebugInfo/DIContext.h
lib/DebugInfo/DWARFDebugInfoEntry.cpp
lib/Target/PowerPC/PPCAsmPrinter.cpp
lib/Target/X86/X86ISelLowering.h
lib/Target/X86/X86ISelLowering.cpp
lib/Target/X86/X86InstrSSE.td
lib/Transforms/InstCombine/InstructionCombining.cpp
test/Analysis/CostModel/X86/vselect-cost.ll
test/CodeGen/X86/avx-blend.ll
test/CodeGen/X86/avx2.ll
test/CodeGen/X86/blend-msb.ll
test/CodeGen/X86/fold-load-vec.ll
test/CodeGen/X86/sse41-blend.ll
test/CodeGen/X86/sse41.ll
test/DebugInfo/llvm-symbolizer.test
test/Transforms/InstCombine/gepphigep.ll
tools/llvm-symbolizer/LLVMSymbolize.h
tools/llvm-symbolizer/LLVMSymbolize.cpp
tools/llvm-symbolizer/llvm-symbolizer.cpp
It might be useful to have a comment here explaining why you need a shift.
When the source is a memory operand, the Count_S bits of the immediate operand are not used to select a floating-point element from the source memory location; the instruction simply loads the 32-bit value at the given address.
That's why we have to extract the Count_S bits from the immediate operand ourselves and use them as the index for a new, narrower load instruction.