This is an exception to the rule that we should prefer xmm ops to ymm ops. As shown in PR42305:
https://bugs.llvm.org/show_bug.cgi?id=42305
...the store folding opportunity with vextractf128 may result in better perf by reducing the instruction count.
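To illustrate the fold at the source level, here is a minimal C++ sketch (the function name and the exact emitted assembly are assumptions for illustration, not part of this patch). Storing the upper 128-bit lane of a 256-bit vector can fold into a single memory-destination vextractf128, whereas narrowing to xmm ops first would need a register extract plus a separate store:

    #include <immintrin.h>  // requires AVX (-mavx)

    // Hypothetical example: store the high 128-bit lane of a 256-bit vector.
    // With the store-folding opportunity described above, a compiler may emit
    //   vextractf128 $1, %ymm0, (%rdi)
    // (one instruction) rather than an extract into an xmm register followed
    // by a separate vmovups store (two instructions).
    void store_high_lane(float *dst, __m256 v) {
      _mm_storeu_ps(dst, _mm256_extractf128_ps(v, 1));
    }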
llvm/lib/Target/X86/X86ISelLowering.cpp:4819

I wonder if it would be better to use use_begin()/use_end() for the loop. Then you can use getUse().getResNo() to check whether it's the chain or not. That would be more efficient than scanning the operands of the user.
Patch updated:
Use an iterator loop rather than a range loop; that allows checking the result number more efficiently from the SDUse.
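For reference, a minimal sketch of the suggested pattern (the helper name and variable names are assumptions, not the exact patch code), assuming Ld is the SDNode for a load whose value is result 0 and whose chain is result 1:

    #include "llvm/CodeGen/SelectionDAGNodes.h"
    using namespace llvm;

    // Walk the uses via SDUse iterators; getResNo() identifies which result
    // of the load each use consumes, so chain uses can be skipped without
    // scanning the user's operand list.
    static void visitValueUsers(SDNode *Ld) {
      for (SDNode::use_iterator UI = Ld->use_begin(), UE = Ld->use_end();
           UI != UE; ++UI) {
        // For a load, result 0 is the value and result 1 is the chain;
        // skip uses of the chain result.
        if (UI.getUse().getResNo() != 0)
          continue;
        SDNode *User = *UI; // the node that uses the loaded value
        (void)User;         // ... inspect User here ...
      }
    }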
LGTM - cheers
llvm/test/CodeGen/X86/sandybridge-loads.ll:44

Not (very) related - but do we need a variant of this that has an unaligned load:

    %v0 = load <8 x float>, <8 x float>* %a, align 32 ; <--- aligned
    %v1 = load <8 x float>, <8 x float>* %b, align 16 ; <--- unaligned
    store <8 x float> %v0, <8 x float>* %b, align 32  ; <--- aligned
    store <8 x float> %v1, <8 x float>* %a, align 16  ; <--- unaligned
    ret void
llvm/test/CodeGen/X86/sandybridge-loads.ll:44

I think for the purpose of this patch, that's covered by the next test diff (in widen_load-3.ll). And for the purposes of the SandyBridge-specific behavior (the actual CPU feature is -slow-unaligned-mem-32), the test just above here covers it. But I can add that suggestion for completeness.