In SelectionDAGBuilder always translate the fshl and fshr intrinsics to
FSHL and FSHR (or ROTL and ROTR) instead of lowering them to shifts and
ORs. Improve the legalization of FSHL and FSHR to avoid code quality
|70 ms||linux > Polly.Isl/Ast::alias_checks_with_empty_context.ll|
Script: -- : 'RUN: at line 1'; opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-import-jscop-dir=/mnt/disks/ssd0/agent/llvm-project/polly/test/Isl/Ast -polly-codegen-verify -polly-ast -analyze < /mnt/disks/ssd0/agent/llvm-project/polly/test/Isl/Ast/alias_checks_with_empty_context.ll | FileCheck /mnt/disks/ssd0/agent/llvm-project/polly/test/Isl/Ast/alias_checks_with_empty_context.ll
I think SelectionDAGBuilder expanding this is nonsense, so yes. The expansion is also duplicated in two places, which is also bad.
- How should I implement integer expansion for funnel shifts?
This is already done?
- Any advice on fixing any of the code quality regressions in the tests?
You mean generate AND explicitly instead of UREM? I don't think there's any need to do that. Something else in SelectionDAG cleans it up.
Translate fshl/fshr intrinsics to ROTL/ROTR up front whenever possible.
Implement integer promotion for ROTL/ROTR in addition to FSHL/FSHR.
This removes most of the major code quality regressions.
Some tests still crash because I haven't implemented integer expansion.
Short answer: no not yet.
I haven't thought of any clever way to implement ExpandIntegerResult for funnel shifts, so I just want to lower them and run ExpandIntegerResult on the lowered code instead. But I'm not sure how to implement that.
The reg+reg shifts consider enough bits of the register that contains the shift amount to shift by bitwidth * 2 - 1 bits. For example (i64:x << y) == 0 for 64 <= y < 128. The immediate forms do not have this property (i.e. there is no way to encode (i64:x << 64) since the immediate field is 6 bits.
But even for the reg+reg versions, I can't think of how we can utilize this.
That's perfect. It means that for e.q. the fshl_i32 test case, the original code is fine and we don't even need the iseleq to handle the shift-by-zero case:
; CHECK-NEXT: andi. 5, 5, 31 ; CHECK-NEXT: subfic 6, 5, 32 ; CHECK-NEXT: slw 5, 3, 5 ; CHECK-NEXT: srw 4, 4, 6 ; CHECK-NEXT: or 4, 5, 4 ; CHECK-NEXT: iseleq 3, 3, 4 # not needed!
Unfortunately I think we'd need a PowerPC-specific lowering to implement this. (It doesn't matter about the immediate forms because funnel shifts by immediate-0 can be folded away.)
Can anyone suggest what might have caused this regression? We get two presumably redundant movsd instructions:
psrlq $50, %xmm1 movsd %xmm1, %xmm1 # xmm1 = xmm1[0,1] psllq $14, %xmm0 movsd %xmm0, %xmm0 # xmm0 = xmm0[0,1] orpd %xmm1, %xmm0
I looked at the debug output for this case.
t4: v2i64 = BUILD_VECTOR Constant:i64<14>, Constant:i64<14>
t15: v4i32 = BUILD_VECTOR Constant:i32<14>, undef:i32, Constant:i32<14>, undef:i32 t13: v2i64 = bitcast t15
And that then leads to a big mess when legalizing the 'rotl' node.
We convert an x86 vector-shift-by-variable node into an x86-vector-shift-by-constant node very late which alters an x86 MOVSD node to have the same operands.
But we've already processed that node in the last combining stage, so it survives to isel. We could probably zap that in DAGToDAG pre-processing.
This seems like an unlikely pattern/problem, and there are no existing tests that show it, so I don't think we need to hold this patch up for this regression.
I'd recommend C->getAPIntValue().urem(BW) instead - the fuzzer will catch this at some point........
You might be able to use ISD::matchUnaryPredicate to do some of this - so you just need to provide a predicate lambda
TargetLowering::expandROT does this: "// If a rotate in the other direction is supported, use it." Are there tests showing that it's not working for some reason?
No. I think it would need a target-specific lowering to take advantage of the fact that shift by 32 is well defined on PowerPC. Can I leave this for a later patch, and/or could a PowerPC expert volunteer to help with it?
I can't think of a simple micro optimization that would help. For PowerPC it would be better to expand fshl(X, Y, Z) as X << (Z & 31) | Y >> (32 - (Z & 31)) (where the right shift supports shifting by 32) which would be better than either the "before" or the "after" code shown in this diff.