This is an archive of the discontinued LLVM Phabricator instance.

[LICM] Promote conditional, loop-invariant memory accesses to scalars
Needs ReviewPublic

Authored by dmilosevic141 on Dec 7 2021, 5:36 AM.

Details

Summary

Promotion of conditional accesses can violate the safety property, which has two parts:

  • The memory may not be dereferenceable on loop entry. In that case, we cannot hoist the load instructions into the preheader basic block.
  • The memory model does not allow us to insert a store along any dynamic path which did not originally have one.

As of D113289, the LICM pass hoists load instructions that are not guaranteed to execute into the preheader basic block, provided the memory is proven dereferenceable on loop entry. Note that that change does not sink the corresponding store instructions, which are not guaranteed to execute, into the exit blocks.
Sinking stores that are not guaranteed to execute directly breaks the second part of the safety property above; more precisely, sinking conditional store instructions may introduce data races or traps. For example, if the conditional store never executes on a given run, another thread may legally access the memory concurrently, and an unconditionally sunk store would race with it. To keep the transformation thread-safe, we need to make sure the sunk store instructions are executed conditionally, depending on whether the value was actually written.
Consider the following source code:

int u;

void f(int a[restrict], int n)
{
	for (int i = 0; i < n; ++i)
		if (a[i])
			++u;
}

To promote the conditional access of u (which includes hoisting the load instruction and sinking the store instruction), this patch allows the LICM pass to insert a simple flag, which is initially down (i.e., false in the preheader). Whenever control flow reaches the conditional store to u, the flag is raised. Finally, the sunk store instruction is executed conditionally (using the llvm.masked.store intrinsic) in the exit blocks, depending on whether the flag was raised.
For the source code above, the following pseudo LLVM IR (only the relevant parts are shown) corresponds to the actual LLVM IR that LICM now generates:

...
entry:
  u.promoted = load u
  br for.cond
for.cond:
  u.flag = phi [ 0, entry ], [ u.flag.next, for.inc ] ; Note that the flag is down if we get here through the preheader basic block.
  inc = phi [ u.promoted, entry ], [ inc.next, for.inc ]
  ...
for.body: 
...
if.then:
  inc.if.then = add inc, 1
  br for.inc
for.inc:
  u.flag.next = phi [ 1, if.then ], [ u.flag, for.body ] ; Note that the flag is raised if we get here through the if.then basic block.
  inc.next = phi [ inc.if.then, if.then ], [ inc, for.body ]
  ...
for.cond.cleanup: ; This is the only exit block.
  u.flag.lcssa = phi [ u.flag, for.cond ] ; Get the flag value.
  inc.lcssa = phi [ inc, for.cond ]
  call void @llvm.masked.store(<1 x i32> inc.lcssa, <1 x i32>* &u, i32 <alignment>, <1 x i1> u.flag.lcssa)
  ret void
}
...
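In source terms, the promotion above is roughly equivalent to the following hand-written C sketch (illustrative only; the names u_promoted and u_flag are invented here, and the final guarded store is what the llvm.masked.store call implements):

int u;

void f(int a[restrict], int n)
{
	int u_promoted = u; /* hoisted load; safe because u is dereferenceable on loop entry */
	_Bool u_flag = 0;   /* flag starts down in the preheader */
	for (int i = 0; i < n; ++i)
		if (a[i]) {
			++u_promoted;
			u_flag = 1; /* flag is raised on the conditional path */
		}
	if (u_flag)         /* conditional store in the exit block */
		u = u_promoted;
}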

This patch addresses potential optimizations proposed in: Missed opportunities for register promotion.

Diff Detail

Event Timeline

dmilosevic141 created this revision. Dec 7 2021, 5:36 AM
dmilosevic141 requested review of this revision. Dec 7 2021, 5:36 AM
Herald added a project: Restricted Project. Dec 7 2021, 5:36 AM
djtodoro added a subscriber: petarj.

Thanks for working on this! This looks like an addition to https://reviews.llvm.org/D113289.

Can you please simplify the summary (e.g., the IR code could be written as pseudo IR)?

llvm/lib/Transforms/Scalar/LICM.cpp
147

IIUC, this should be a target-independent optimization? And GCC generates such code for both x86_64 and AArch64, right?

The generated code will be bigger, so this will impact code size; I guess this needs motivation in terms of (SPEC) benchmarking.

1929

This comment is redundant.

llvm/test/Transforms/LICM/conditional-access-promotion.ll
14–17

not needed

89–107

I guess we don't need these.

dmilosevic141 edited the summary of this revision. Dec 7 2021, 5:54 AM
dmilosevic141 edited the summary of this revision.

Thanks @djtodoro!
I've updated the following:

  • Simplified the summary example.
  • Removed the unnecessary comment from the LICM.cpp file, as well as unnecessary attributes and metadata from the conditional-access-promotion.ll file.
dmilosevic141 marked 3 inline comments as done. Dec 7 2021, 6:36 AM
dmilosevic141 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
147

IIUC, this should be a target-independent optimization? And GCC generates such code for both x86_64 and AArch64, right?

Correct, GCC (excluding the Ofast optimization level, which allows data races) generates such code for both x86_64 and AArch64, which is why this should be target-independent.

The generated code will be bigger, so this will impact code size; I guess this needs motivation in terms of (SPEC) benchmarking.

Sure, will work on that.

ntesic added a subscriber: ntesic. Dec 7 2021, 8:00 AM
reames added a comment. Jan 5 2022, 8:51 AM

Conditional store promotion is one of those transforms I've been thinking about for a long time, and have never quite felt safe implementing.

My high-level concern here is profitability. It's unclear to me that inserting a flag IV and a conditional store outside the loop is generally profitable. It's clearly profitable in many cases, but I'm a bit hesitant on whether it makes sense to do as a canonicalization. I'd love to see some data on the impact of this.

Structure-wise, I strongly encourage you to use predicated stores, not CFG manipulation. LICM does not generally modify the CFG; we can, but the analysis updates are complicated. (See the diamond hoisting code.) I generally think a predicated store instruction is the "right" construct here, as it leads to more obvious optimization outcomes.

However, our masked.load handling for scalars (i.e., single-element vectors) leaves a bit to be desired. I'd started improving it a bit (with this goal in mind), but if you're serious about this, you'd need to spend some time working through the related issues there first.

llvm/lib/Transforms/Scalar/LICM.cpp
1894

See macro comment.

2014

The abnormal exit property applies to classic store promotion too. This is either redundant, or you have some bug you should discuss and fix separately.

2065

Stray change?

2091

This was already checked above.

2093

This set does not make sense. A store only needs to be conditional if we can't find any store to the address which dominates all exits.

You should only need to track some global state here, not anything store-specific.

2219

Please use i1 here, not i8.

Matt added a subscriber: Matt. Jan 7 2022, 7:33 AM
dmilosevic141 marked 2 inline comments as done.

Thanks @reames! Sorry for the delayed response; I haven't been able to fully focus on this topic in the past couple of months.

My high-level concern here is profitability. It's unclear to me that inserting a flag IV and a conditional store outside the loop is generally profitable. It's clearly profitable in many cases, but I'm a bit hesitant on whether it makes sense to do as a canonicalization. I'd love to see some data on the impact of this.

Definitely; hopefully I'll be able to provide the data soon.

Structure-wise, I strongly encourage you to use predicated stores, not CFG manipulation. LICM does not generally modify the CFG; we can, but the analysis updates are complicated. (See the diamond hoisting code.) I generally think a predicated store instruction is the "right" construct here, as it leads to more obvious optimization outcomes.

I hadn't come across predicated store instructions before. At first look, they definitely fit the needs better than CFG manipulation; thanks for the directions.
Here's my vision for using them, please let me know your thoughts:

  • Hoisting the load instructions into the preheader would stay the same (using regular load instructions).
  • An additional flag would still be needed. It would be used the same way (e.g., initialized in the preheader basic block, its value propagated through the basic blocks).
  • Sinking the conditional store instructions into the exit block(s) via CFG manipulation would be replaced with appropriate predicated store instructions sunk into the exit block(s).

The summary example would then look like this:

...
entry:
  u.promoted = load u
  br for.cond
for.cond:
  u.flag = phi [ false, entry ], [ u.flag.next, for.inc ]
  inc = phi [ u.promoted, entry ], [ inc.next, for.inc ]
  ...
for.body: 
...
if.then:
  inc.if.then = add inc, 1
  br for.inc
for.inc:
  u.flag.next = phi [ true, if.then ], [ u.flag, for.body ]
  inc.next = phi [ inc.if.then, if.then ], [ inc, for.body ]
  ...
for.cond.cleanup: ; This is the only exit block.
  u.flag.lcssa = phi [ u.flag, for.cond ] ; Get the flag value.
  inc.lcssa = phi [ inc, for.cond ]
  call @llvm.masked.store(<1x<u_type>> inc.lcssa, <1x<u_type>>* &u, i32 <u_alignment>, <1xi1> u.flag.lcssa)
  ret void
}
...

I'll also take some more time diving deeper into the predicated instructions.

However, our masked.load handling for scalars (i.e., single-element vectors) leaves a bit to be desired. I'd started improving it a bit (with this goal in mind), but if you're serious about this, you'd need to spend some time working through the related issues there first.

I'm guessing the same goes for the masked.store.* intrinsics, so

call @llvm.masked.store(<1x<u_type>> inc.lcssa, <1x<u_type>>* &u, i32 <u_alignment>, <1xi1> u.flag.lcssa)

should be swapped with something more appropriate for scalar values?
To summarize, I've rebased and fixed a few things @reames pointed out.

Herald added a project: Restricted Project. Mar 31 2022, 12:56 AM
dmilosevic141 added inline comments. Mar 31 2022, 12:57 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2014

Thanks for pointing this out! I implemented the loopHasNoAbnormalExits function because I was unsure whether the LICM pass makes sure there are no abnormal loop exits before it does 'classic' (non-conditional) promotion. There was no difference before and after calling this function (i.e., no bugs that this call 'fixes'), except for the unnecessary overhead, which is now removed.

2065

Just something I needed for the set I was using. Reverting this.

2091

Indeed, thanks!

2093

This set was used to propagate the value of the flag through the basic blocks, via the SSAUpdater's interface.
For that purpose, we already had the LoopUses set, so this set was redundant. Thanks!

2219

Thanks!

reames added a comment. May 1 2022, 9:44 AM

@dmilosevic141, your last comment was spot on. I do worry I might have confused you on one point, though. When I said "predicated store" I meant the llvm.masked.store intrinsic. We have another family of vector-predicated intrinsics, but they're a lot less mature at the moment; I'm not exactly sure what their status is.

dmilosevic141 edited the summary of this revision.
dmilosevic141 set the repository for this revision to rG LLVM Github Monorepo.

Thanks @reames!
I took some time to dive deeper into the predicated instructions. So far, two things stand out to me:

  • The ScalarizeMaskedMemIntrin pass, a standard part of the CodeGen pipeline. This pass translates the masked memory intrinsics into chains of basic blocks that load/store the elements one by one; for single-element vectors, the handling looks pretty much identical to the initial version of this patch (see the sketch after this list). Scalarization of the masked memory intrinsics is, however, target-dependent (e.g., for x86, calls with single-element vectors passed in as arguments do get scalarized; for RISC-V, they do not).
  • The Vector Predication Roadmap.
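For a single-element vector, the scalarized form of the masked store amounts to a branch on the one extracted mask bit around a plain scalar store; in C-like terms (a rough sketch with invented names, not the pass's actual output):

	if (u_flag)  /* test the single mask bit */
		u = inc; /* ordinary scalar store in the newly created block */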

Having said that, I'm a bit stuck regarding the next steps towards better handling of single-element vectors, as I'm not sure the ScalarizeMaskedMemIntrin pass is *the* place for that. The Vector Predication Roadmap is a little overwhelming, i.e., I'm not sure which part of it relates to better handling of single-element vectors. If you (or anyone else) have something concrete to point me to, I'd be really thankful. :)
I've updated the patch to use the llvm.masked.store intrinsic instead of the complicated CFG manipulation.