and (shl/srl/sra x, c), mask --> shl (srl/sra x, c1), c2
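To make the intended rewrite concrete, below is a minimal C++ sketch of the srl case with hypothetical constants (a 32-bit value, srl by 20, mask 0xFF0, i.e. LowZBits = 4 and MaskLen = 8, so BitWidth == LowZBits + MaskLen + ShiftAmtC holds); it is an illustration only, not code from the patch. The value of the rewrite is that the outer shl can then be folded into the consuming ALU instruction as an LSL-shifted register operand.

```cpp
#include <cassert>
#include <cstdint>

// Original pattern: and (srl x, 20), 0xFF0 -- the mask is MaskLen = 8 ones
// starting at LowZBits = 4, and 32 == LowZBits + MaskLen + 20.
uint32_t before(uint32_t x) { return (x >> 20) & 0xFF0u; }

// Rewritten pattern: shl (srl x, 24), 4 -- i.e. c1 = 20 + LowZBits and
// c2 = LowZBits; the shl is what later folds into the user instruction
// as its LSL-shifted operand.
uint32_t after(uint32_t x) { return (x >> 24) << 4; }

int main() {
  const uint32_t tests[] = {0u, 0x12345678u, 0xDEADBEEFu, 0xFFFFFFFFu};
  for (uint32_t x : tests)
    assert(before(x) == after(x)); // the two forms agree under these conditions
  return 0;
}
```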
llvm/test/CodeGen/AArch64/shiftregister-from-and.ll
114 | This instruction looks wrong, with the shift being more than 32 for w regs.
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
633 | nit: use uint64_t (the same type as the return value of getZExtValue()) to avoid warnings.
648–672 | nit: To ensure that UBFM/SBFM is used as a right shift (essentially a UBFX) here, add assert(BitWidth - 1 >= NewShiftC && "error message").
655–656 | Relatedly, with 1) N as and (srl x, c), mask and 2) LowZBits == 0, N should be selected as a UBFX by isBitfieldExtractOp before the tablegen pattern (where AArch64DAGToDAGISel::SelectShiftedRegister gets called) applies. Similarly, with and (shl x, c), mask and LowZBits == 0, N is a UBFIZ. So I don't think this should be seen in practice, but an early exit (rather than an assertion) looks okay.
658–659 | For correctness, a similar check is needed when LHSOpcode is ISD::SRL; for srl, the condition BitWidth == LowZBits + MaskLen + ShiftAmtC should be sufficient. For example, https://gcc.godbolt.org/z/YvGG3Pov3 shows an IR at trunk, and with the current patch the optimized code changes the result (not correct): lsr x8, x0, #3; add x0, x1, x8, lsl #1; ret. See the sketch after this table for why the condition matters.
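As a hedged illustration of the correctness point in the 658–659 comment above (constants chosen to match the shape of the quoted assembly, not taken from the Godbolt reproducer): when BitWidth != LowZBits + MaskLen + ShiftAmtC in the srl case, the and clears high bits that the rewritten shift pair keeps, so the two forms can disagree.

```cpp
#include <cstdint>
#include <cstdio>

// Original pattern: and (srl x, 2), 0x6 -- ShiftAmtC = 2, LowZBits = 1,
// MaskLen = 2, so LowZBits + MaskLen + ShiftAmtC = 5, not 64.
uint64_t masked(uint64_t x) { return (x >> 2) & 0x6u; }

// Rewrite as shl (srl x, 3), 1 (an lsr #3 plus a folded lsl #1): this keeps
// every bit above the two-bit field instead of clearing it.
uint64_t rewritten(uint64_t x) { return (x >> 3) << 1; }

int main() {
  uint64_t x = 0x20; // bit 5: cleared by the mask, kept by the rewrite
  printf("masked=%llu rewritten=%llu\n",
         (unsigned long long)masked(x), (unsigned long long)rewritten(x));
  // Prints masked=0 rewritten=8, so the rewrite is only valid when
  // BitWidth == LowZBits + MaskLen + ShiftAmtC.
  return 0;
}
```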
llvm/test/CodeGen/AArch64/shiftregister-from-and.ll
137–139 | nit: i16 will be legalized to i32 [1], and the patch optimizes shiftedreg_from_and_negative_type (trunk: https://gcc.godbolt.org/z/cnac5989e), so this doesn't seem to be a negative test. Probably use a vector like https://gcc.godbolt.org/z/T4szTsEav. [1] llc -O3 -mtriple=aarch64 -debug -print-after-all test.ll gives debug logs, and https://gist.github.com/minglotus/763173f099bec76c720ec9ce7c181471 shows the legalization step.
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
694–699 | Selecting and (shl/srl/sra x, c), mask into shl (srl/sra x, c1), c2 (and folding the shl) removes one instruction, but it could change the pipeline (and hence the latency and throughput) of the instruction. For example, arithmetic operations with a shifted operand may use the M pipeline [1] on Neoverse N1, and the M pipeline is used for all ALU operations with an immediate-shifted operand on Cortex-A57. I wonder if improvements are seen in some benchmarks from this patch? [1] For Neoverse N1, the M pipeline is used for {ADDS, SUBS} with "Arithmetic, LSR/ASR/ROR shift or LSL shift > 4".
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
694–699 | I agree that it may not get any timing improvement except for lsl < 4, and I also don't believe this patch will show an improvement in any real benchmarks. The current shifted-register select function does not account for the issue you mention either. If you are really worried about it, I can limit the lsl constant to < 4, but personally I would prefer not to.
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
694–699 | And for a high-end CPU it's really hard to model cost by execution ports alone: two instructions use two ROB entries, two physical registers, more front-end decode bandwidth, and more code size. At the very least we should not weigh such details when the two forms have the same latency and only some targets may differ in throughput. If we really want to account for these micro-architectural trade-offs, the better approach is to always combine here and, where profitable, split again in the machine combiner, where we already have a detailed scheduling model.
Performance can depend on many factors, and often fewer instructions is a win in itself. Can you separate out the tests and rebase on them, so the patch just shows the differences?
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
694–699 | Thanks for the discussion! It makes sense to me to consider rewriting an ALU op with a shifted operand into two simpler instructions in the machine-combiner pass, and that fewer instructions is a win in itself (regarding David's comment). To give some context on the original question: there are motivating test cases (micro-benchmarks) that want the opposite (i.e., splitting an ALU op with a shifted operand into two simpler instructions); a manual hack using dummy inline assembly (asm volatile("" : "+r" (variable));) shows speedups on Neoverse N1 (which has a single M pipeline that can become a bottleneck), and I was thinking about the machine combiner as one place to implement the split (with some critical-path analysis?) as well.
I'm not too concerned about the fewer-instructions form using the M pipeline, since I concur that fewer instructions makes sense in general (as you and David pointed out), so this transformation looks good to me. Also note that on some AArch64 processors (e.g., Neoverse N1) a logical instruction with an immediate-shifted operand is a net win compared with two instructions, and some AArch64 processors (e.g., Neoverse N2) have more than one M pipeline, so there is less chance of the M pipeline becoming a bottleneck.
Tests on micro-benchmarks (llvm-test-suite, etc.) might give us more confidence (though the numbers depend on the specific processor), but I don't feel strongly about requiring a performance test.
From what I can tell I think the code is OK. LGTM
This doesn't come up often, but I do see it triggering. I'm not sure what the motivating case is but it seemed OK for performance in the tests I ran.
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
634–635 | If RHS isn't used anywhere else:
659 | Some more comments explaining these conditions would be good.