This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Transform shift+and to shift+shift to select more shifted register
Closed, Public

Authored by bcl5980 on Nov 29 2022, 2:30 AM.

Details

Summary

and (shl/srl/sra, x, c), mask --> shl (srl/sra, x, c1), c2
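
For illustration, a minimal standalone C++ sketch of the bit arithmetic behind the srl case (the names LowZBits, MaskLen, and ShiftAmtC are taken from the review comments below; the concrete constants are chosen for the example and are not from the patch):

    #include <cassert>
    #include <cstdint>

    int main() {
      // and (srl x, c), mask --> shl (srl x, c1), c2, where mask is a
      // contiguous run of MaskLen ones starting at bit LowZBits.
      const unsigned ShiftAmtC = 3, LowZBits = 1, MaskLen = 60;
      const uint64_t Mask = ((uint64_t(1) << MaskLen) - 1) << LowZBits;

      uint64_t X = 0x0123456789abcdefULL;
      uint64_t Before = (X >> ShiftAmtC) & Mask;                   // and (srl x, 3), mask
      uint64_t After = (X >> (ShiftAmtC + LowZBits)) << LowZBits;  // shl (srl x, 4), 1
      assert(Before == After && "holds because LowZBits + MaskLen + ShiftAmtC >= 64");
      return 0;
    }

The idea on AArch64 is that the outer shl can then be folded into the shifted-register operand of the instruction that uses the value (e.g. an add with lsl #1), so the and disappears.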

Diff Detail

Event Timeline

bcl5980 created this revision. Nov 29 2022, 2:30 AM
Herald added a project: Restricted Project. · View Herald Transcript · Nov 29 2022, 2:30 AM
bcl5980 requested review of this revision. Nov 29 2022, 2:30 AM
Herald added a project: Restricted Project. · View Herald Transcript · Nov 29 2022, 2:30 AM

I will check in the precommit tests if the tests look good to the reviewers.

dmgreen added inline comments.
llvm/test/CodeGen/AArch64/shiftregister-from-and.ll
115

This instruction looks wrong, with the shift being more than 32 for w regs.

mingmingl added inline comments. Nov 29 2022, 5:37 PM
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
632

nit: use uint64_t (the same type as the return value of getZExtValue()) to avoid warnings.

647–671

nit pick:

To ensure that UBFM/SBFM is used as a right shift (essentially UBFX) here, add assert(BitWidth-1>=NewShiftC && "error message")

654–655

Relatedly, with 1) N as and (srl x, c), mask and 2) LowZBits == 0, N should be selected as UBFX by isBitfieldExtractOp before the tablegen pattern (where AArch64DAGToDAGISel::SelectShiftedRegister gets called) applies. Similarly, with and (shl x, c), mask and LowZBits == 0, N is a UBFIZ.

So I think this shouldn't be seen in practice, but an early exit (not an assertion) looks okay.

657–658

For correctness, a similar check is needed when LHSOpcode is ISD::SRL; meanwhile, for srl, the condition BitWidth == LowZBits + MaskLen + ShiftAmtC should be sufficient.

For example, https://gcc.godbolt.org/z/YvGG3Pov3 shows IR at trunk; with the current patch the optimized code changes the result (which is not correct):

	lsr	x8, x0, #3
	add	x0, x1, x8, lsl #1
	ret
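
To make the failure mode concrete, here is a small standalone C++ sketch (it does not reproduce the IR from the godbolt link; the constants are invented so that LowZBits + MaskLen + ShiftAmtC < BitWidth):

    #include <cstdint>
    #include <cstdio>

    int main() {
      // Same rewrite as in the summary, but the mask is too short:
      // LowZBits + MaskLen + ShiftAmtC = 1 + 8 + 3 = 12 < 64.
      const unsigned ShiftAmtC = 3, LowZBits = 1, MaskLen = 8;
      const uint64_t Mask = ((uint64_t(1) << MaskLen) - 1) << LowZBits;

      uint64_t X = ~0ULL;
      uint64_t Correct = (X >> ShiftAmtC) & Mask;                     // keeps bits 1..8 only
      uint64_t Rewritten = (X >> (ShiftAmtC + LowZBits)) << LowZBits; // keeps bits 1..60
      std::printf("%#llx vs %#llx\n", (unsigned long long)Correct,
                  (unsigned long long)Rewritten);
      // Prints 0x1fe vs 0x1ffffffffffffffe: the rewrite loses the and's upper
      // bound, which is why the extra bit-width check is needed for srl.
      return 0;
    }
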
llvm/test/CodeGen/AArch64/shiftregister-from-and.ll
138–140

nit pick:

i16 will be legalized to i32 [1], and the patch optimizes shiftedreg_from_and_negative_type (trunk: https://gcc.godbolt.org/z/cnac5989e). This doesn't seem to be a negative test. Probably use a vector type instead, like https://gcc.godbolt.org/z/T4szTsEav

[1] llc -O3 -mtriple=aarch64 -debug -print-after-all test.ll gives debug logs, and https://gist.github.com/minglotus/763173f099bec76c720ec9ce7c181471 shows the legalization step.

bcl5980 updated this revision to Diff 478788. Nov 29 2022, 7:44 PM

address comments.

bcl5980 updated this revision to Diff 478789. Nov 29 2022, 7:46 PM

test comments update.

bcl5980 marked 5 inline comments as done. Nov 29 2022, 7:49 PM
bcl5980 added inline comments.
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
657–658

BitWidth <= LowZBits + MaskLen + ShiftAmtC should be better.
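
A quick standalone C++ sanity check of the relaxed condition for the srl case (the constants are chosen for illustration so that LowZBits + MaskLen + ShiftAmtC = 70 > 64; they are not from the patch):

    #include <cassert>
    #include <cstdint>

    int main() {
      // x >> ShiftAmtC already has its top ShiftAmtC bits clear, so mask bits
      // at or above position 64 - ShiftAmtC select nothing; '<=' is enough.
      const unsigned ShiftAmtC = 8, LowZBits = 2, MaskLen = 60;
      const uint64_t Mask = ((uint64_t(1) << MaskLen) - 1) << LowZBits;

      const uint64_t Samples[] = {0ULL, ~0ULL, 0x0123456789abcdefULL,
                                  0xfedcba9876543210ULL};
      for (uint64_t X : Samples) {
        uint64_t Before = (X >> ShiftAmtC) & Mask;
        uint64_t After = (X >> (ShiftAmtC + LowZBits)) << LowZBits;
        assert(Before == After);
      }
      return 0;
    }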

llvm/test/CodeGen/AArch64/shiftregister-from-and.ll
115

Fixed the code and renamed the test to @shiftedreg_from_and_negative_andc5.

bcl5980 updated this revision to Diff 478790. Nov 29 2022, 7:51 PM
bcl5980 marked 2 inline comments as done.

remove redundant condition

mingmingl added inline comments. Nov 30 2022, 10:15 AM
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
689–692

Selecting and (shl/srl/sra, x, c), mask into shl (srl/sra, x, c1), c2 (and folding the shl) simplifies one instruction away, but it could change which pipeline the instruction uses (and its latency and throughput as well).

For example, arithmetic operations with a shifted operand can use the M pipeline [1] on Neoverse N1, and the M pipeline is used for all ALU operations with an immediate-shifted operand on Cortex-A57.

I wonder whether improvements are seen in any benchmarks from this patch?

[1] For Neoverse N1, the M pipeline is used for {ADDS, SUBS} with "Arithmetic, LSR/ASR/ROR shift or LSL shift > 4".

bcl5980 added inline comments. Nov 30 2022, 1:19 PM
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
689–692

I agree that it may not give any timing improvement except for lsl < 4. I also don't believe this patch will show an improvement in any real benchmark. Actually, the current shifted-register selection function also does not account for the issue you mention. If you are really worried about it I can limit the lsl constant to < 4, but personally I don't want to do that.

bcl5980 added inline comments. Nov 30 2022, 1:56 PM
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
689–692

And for a high-end CPU, it's really hard to model performance by execution ports alone. Two instructions use two ROB entries, two physical registers, more front-end decode, and more code size. At the very least we should not weigh those details when the two forms have the same latency and only some targets may differ in throughput. If we really want to consider such micro-architectural trade-offs, the better approach is to always do the fold here and, if needed, split it back apart in the machine combiner, where we already have a detailed scheduling model.

Performance can depend on many factors, and often fewer instructions is a win in itself. Can you separate out the tests and rebase on them, so the diff just shows the codegen differences?

bcl5980 updated this revision to Diff 479503. Dec 1 2022, 7:01 PM

rebase code

mingmingl added inline comments. Dec 4 2022, 10:08 PM
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
689–692

Thanks for the discussion! It makes sense to me to consider rewriting an ALU instruction with a shifted operand into two simpler instructions in the machine-combiner pass, and that fewer instructions is a win in itself (regarding David's comment).

To provide some context for the original question: there are motivating test cases (micro-benchmarks) that want the opposite (i.e., splitting an ALU instruction with a shifted operand into two simpler instructions); a manual hack using dummy inline assembly (asm volatile("" : "+r" (variable));) shows speedups on Neoverse N1 (which has a single M pipeline that can become the bottleneck), and I was thinking about the machine combiner as one place to implement the split (with some critical-path analysis?) as well.
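
For reference, a hedged sketch of the kind of dummy-inline-assembly hack described above (the function and its body are invented for illustration; only the asm volatile("" : "+r" (variable)) idiom comes from the comment, and it assumes a GCC/Clang-style compiler targeting AArch64):

    #include <cstdint>

    uint64_t add_scaled(uint64_t A, uint64_t B) {
      // Without the barrier the compiler typically folds the shift into the
      // add's shifted-register operand: add x0, x1, x0, lsl #2.
      uint64_t T = A << 2;
      // The empty asm emits no code but forces T to be materialized in a
      // register, splitting the sequence into separate lsl and add instructions.
      asm volatile("" : "+r"(T));
      return B + T;
    }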

So how about this patch?

I'm not too concerned about having fewer instructions that use the M pipeline, since I also concur that having fewer instructions makes sense in general (as you and David pointed out). So this transformation looks good to me. Also note that on some AArch64 processors (e.g., Neoverse N1) a logical instruction with an immediate-shifted operand is a net win compared with two instructions, and some AArch64 processors (e.g., Neoverse N2) have more than one M pipeline, so there is less chance of the M pipeline becoming a bottleneck.

Tests on micro-benchmarks (llvm-test-suite, etc.) might give us more confidence (although the numbers depend on the specific processor), but I don't strongly prefer a performance test.

I tried llvm-test-suite, but unfortunately it looks like no test triggers the code.

dmgreen accepted this revision. Dec 6 2022, 1:42 AM

From what I can tell, I think the code is OK. LGTM

This doesn't come up often, but I do see it triggering. I'm not sure what the motivating case is but it seemed OK for performance in the tests I ran.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
633–634

If RHS isn't used anywhere else:
ConstantSDNode *RHSC = dyn_cast<ConstantSDNode>(N.getOperand(1));

658

Some more comments explaining these conditions would be good.

This revision is now accepted and ready to land. Dec 6 2022, 1:42 AM
bcl5980 updated this revision to Diff 480485. Dec 6 2022, 7:29 AM
This revision was landed with ongoing or failed builds. Dec 6 2022, 7:30 AM
This revision was automatically updated to reflect the committed changes.