This is an archive of the discontinued LLVM Phabricator instance.

[CodeGenPrepare] Relax conditions for folding addressing mode into loads/stores
AbandonedPublic

Authored by chill on Feb 13 2023, 2:50 AM.

Details

Summary

The sinking of address computations to their users (loads/stores)
is often blocked by call instructions, which take the address as
a parameter - unless the call is "cold", it's considered a non-foldable use.

Considering the whole call sequence, including passing the arguments,
it is sometimes possible to materialize an address computation directly
into a hard register, in a sense "to fold the addressing mode into the call".

For example, on AArch64 the register-to-register copy
instruction ("C6.2.190 MOV (register)", which would likely by used to pass
a pre-computed address argument, is an alias to "C6.2.207 ORR (shifted register)"
and typically has the same latency and throughput as an "ADD" instruction.

This change tries to allow sinking of more addresses to load/store instructions
by preventing some call instructions from being blockers.

With this change CodeGenPrepare still does sinking only towards memory
loads/stores. It works in synergy with a MachineSink patch in
https://reviews.llvm.org/D145706, which does sinking towards calls.

This patch (together with the others up/down the stack) improves
SPECv6 500.perlbench_r by about 3.26% and the whole
of SPECv6 intrate by about 0.46% (geomean).

Diff Detail

Event Timeline

chill created this revision.Feb 13 2023, 2:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 13 2023, 2:50 AM
chill requested review of this revision.Feb 13 2023, 2:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 13 2023, 2:50 AM
chill edited the summary of this revision. (Show Details)Feb 13 2023, 7:14 AM
chill edited the summary of this revision. (Show Details)Feb 13 2023, 7:54 AM
chill edited the summary of this revision. (Show Details)Feb 13 2023, 8:38 AM
efriedma added a comment.EditedFeb 13 2023, 10:30 AM

If I'm understanding correctly, the point is that we don't want to block sinking if an address computation has multiple uses, where only some are foldable?

Expressing this in terms of "folding into a call" seems confusing and unnecessary; we should just change the logic to allow sinking if we can fold some, but not all, the uses. (Maybe including some sort analysis of whether sinking increases the number of times we perform the address computation at runtime.)

chill added a comment.Feb 14 2023, 4:16 AM

If I'm understanding correctly, the point is that we don't want to block sinking if an address computation has multiple uses, where only some are foldable?

Yes, for when we cannot say we aren't extending live ranges of the address "registers".

Expressing this in terms of "folding into a call" seems confusing and unnecessary;

To me it looks like a straightforward analogy.
For load/stores we have an addressing computation that's essentially free
when it's a part of the load/store instruction, as opposed to it being a separate instruction and using a simple indirect load/store.
For calls we have an addressing computation that's essentially free
when it's a part of the call sequence, as opposed to it being a separate instruction and using a simple register-to-register move.

we should just change the logic to allow sinking if we can fold some, but not all, the uses.
(Maybe including some sort analysis of whether sinking increases the number of times we perform the address computation at runtime.)

IMHO the main issue here is not extending live ranges too much. As for the runtime overhead, sinking a foldable address is assumed to
not increase the execution time, because of the checks in isLegalAddressingMode and (with this patch) in canFoldAddrModeIntoCall.

Perhaps, with PGO, we could use block frequencies in a way that a non-foldable, infrequently executed user does not prevent sinking, much like
the way we currently use Attribute::Cold. However, I don't think that's an alternative to this patch.

For calls we have an addressing computation that's essentially free when it's a part of the call sequence, as opposed to it being a separate instruction and using a simple register-to-register move.

There are a lot of cases where there isn't any extra "mov" that can hide the cost; the most common of those being the case where the value in question is spilled. But I guess I can see the analogy. I'd generally prefer to consider that sort of transform in terms of rematerialization, though. We can compute the cost of remat much more accurately in the register allocator.

IMHO the main issue here is not extending live ranges too much. As for the runtime overhead, sinking a foldable address is assumed to not increase the execution time, because of the checks in isLegalAddressingMode and (with this patch) in canFoldAddrModeIntoCall.

I was thinking more in terms of sinking to arbitrary uses, as opposed to only sinking some uses.

chill edited the summary of this revision. (Show Details)Mar 9 2023, 10:49 AM
chill retitled this revision from [CodeGenPrepare] Fold addressing mode into calls to [CodeGenPrepare] Relax conditions for folding addressing mode into loads/stores.Mar 10 2023, 8:25 AM

For constructs like the given testcase, you don't need to reason about whether it's profitable to fold into a call; cloning a GEP into both sides of an if-else doesn't actually increase execution time, so it's obviously profitable even if you can only fold it on one side of the if-else. Can you add some examples that actually require predicting whether the GEP is as cheap as a mov, and the argument is passed in a register?

llvm/include/llvm/CodeGen/TargetLowering.h
3170

In this case, the addressing mode doesn't actually represent any computation, so it isn't relevant for this transform; when do you expect it to become relevant?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
24358

Counting arguments like this won't do what you want if there are arguments which are passed in multiple registers.

chill added inline comments.May 10 2023, 7:45 AM
llvm/include/llvm/CodeGen/TargetLowering.h
3170

The logic of the AddressingModeMatcher is that it accumulates an addressing mode from partial "expressions". For example an addressing mode
which ends up as [reg + imm] or [reg + reg] would have involved checks for legality of just [reg] here:
https://github.com/llvm/llvm-project/blob/ddfb974d0fca62e3eaeb98b79b5e29738c9082d2/llvm/lib/CodeGen/CodeGenPrepare.cpp#L4926

chill added inline comments.Jun 13 2023, 9:42 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
24358

Indeed.

For now, I'm proposing a different approach, doing the transformation entirely in MachineSink.
The relevant review is https://reviews.llvm.org/D152828

If it doesn't fly, I'll come back to this review and think how to solve this above issue.

chill added a comment.Jun 13 2023, 9:48 AM

For constructs like the given testcase, you don't need to reason about whether it's profitable to fold into a call; cloning a GEP into both sides of an if-else doesn't actually increase execution time, so it's obviously profitable even if you can only fold it on one side of the if-else. Can you add some examples that actually require predicting whether the GEP is as cheap as a mov, and the argument is passed in a register?

These test cases are constructed just to have the addressing computation and its users in separate basic blocks.
In our motivating examples, we have the addressing computation as loop invariant and its users inside a loop body, so
just sinking copies might increase run time.

I have added such test cases in https://reviews.llvm.org/D152828 : sink-and-fold.ll, functions f4 and f5.

chill abandoned this revision.Oct 23 2023, 10:38 AM