This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Convert phi types
ClosedPublic

Authored by dmgreen on Jun 15 2020, 1:18 AM.

Details

Summary

If a collection of interconnected phi nodes is only ever loaded, stored or bitcast, then we can convert the whole set to the bitcast type, potentially helping to reduce the number of register moves needed as the phis are passed across basic-block boundaries. This has to be done in CodegenPrepare as it naturally straddles basic blocks. Starting from a phi node, it looks through uses and operands for a collection of nodes that, taken together, are bitcast between float and integer types. We record visited phi nodes so that we never have to process them more than once. The whole subgraph is then replaced with the new type. Loads and stores are bitcast to the correct type, and the bitcast should then be folded into the load/store, changing its type.
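The walk described above can be sketched as a standard worklist traversal with a visited set. The following is a simplified stand-alone model, not the actual CodeGenPrepare code; the node kinds and function names are invented for illustration:

```python
from collections import deque

# Toy IR node kinds, just enough to model the phi-graph walk.
PHI, LOAD, STORE, BITCAST, OTHER = "phi", "load", "store", "bitcast", "other"

class Node:
    def __init__(self, kind, operands=(), users=()):
        self.kind = kind
        self.operands = list(operands)  # defs feeding this node
        self.users = list(users)        # uses of this node's value

def collect_convertible_phis(start):
    """Walk outward from a phi through operands and users, collecting the
    interconnected phi set. Returns the set if every reachable node is a
    phi/load/store/bitcast, or None if some other use blocks the rewrite."""
    worklist, visited, phis = deque([start]), set(), set()
    while worklist:
        n = worklist.popleft()
        if n in visited:            # visited set: each node handled once
            continue
        visited.add(n)
        if n.kind == PHI:
            phis.add(n)
            worklist.extend(n.operands)  # phis pull in their inputs...
            worklist.extend(n.users)     # ...and their consumers
        elif n.kind in (LOAD, STORE, BITCAST):
            pass                    # leaves: can be retyped / folded
        else:
            return None             # any other use: give up on the set
    return phis
```

The visited set is what keeps the traversal linear even when the phi graph contains cycles, which phis across loop back-edges routinely do.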

This comes up in the biquad testcase due to the way MVE needs to keep values in integer registers. I have also seen it come up from AArch64 partner example code, where a complicated combination of SROA and inlining produced integer phis where float would have been a better choice. I can't provide an example unfortunately (and you might argue it should be fixed earlier in that case). It also comes up in an X86 atomic test case, which looks better and functionally OK to my untrained eyes. I can make this ARM- or MVE-only if that would be better. I also added extract_element handling, which increased this from a 60% improvement to a 109% improvement.

Diff Detail

Event Timeline

dmgreen created this revision.Jun 15 2020, 1:18 AM
Herald added a project: Restricted Project. Jun 15 2020, 1:18 AM

The "used by stores" part is potentially a little problematic, until we settle https://bugs.llvm.org/show_bug.cgi?id=45152 .

The algorithm seems reasonable.

This needs to happen in CodeGenPrepare due to the interaction with other CGP optimizations?

The "used by stores" part is potentially a little problematic, until we settle https://bugs.llvm.org/show_bug.cgi?id=45152 .

Fascinating. Am I right in saying that this would not be a problem under ARM, and would only actually be a problem for X86-32 in llvm? I was heavily tempted to put this behind a TLI hook before, and this might be an extra reason. Something like shouldConvertPhiType(Src, Dst).

The algorithm seems reasonable.

This needs to happen in CodeGenPrepare due to the interaction with other CGP optimizations?

Yeah. It's related to converting splats to integer. Ideally the whole thing would be done during global-isel style lowering. (I think, although I don't know the details, that it might forget and re-learn type information, so I guess it might get something like this for free). For ISel it's done in CGP though.

Am I right in saying that this would not be a problem under ARM?

Sort of. There are two aspects here. One, what does the hardware do? Two, what do optimizations do? ARM obviously allows arbitrary values in floating-point registers, but this transform would be wrong if we started throwing away nan bits in float constants.

Currently, we essentially implement the model where we preserve NaN bits, except on 32-bit x86 where things sort of explode. So until we actually start making changes, this is unlikely to actually break anything on ARM. Maybe safer to guard this under a target hook so we don't make the situation on 32-bit x86 worse, though.
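For concreteness, the "preserve NaN bits" model can be demonstrated with pure bit reinterpretation between integer and float (the IR-bitcast / register-move view of the transform), which keeps every bit, NaN payload included; an x87 fld/fstp round-trip makes no such promise for signaling NaNs. This sketch deliberately uses a quiet NaN, since the double/float conversions inside Python's struct would themselves quieten a signaling one, which is exactly the class of hazard under discussion:

```python
import struct

def bits_to_float(bits):
    # Reinterpret an i32 bit pattern as a float32 without any FP
    # arithmetic: the moral equivalent of an IR bitcast or a plain
    # register-to-register move.
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def float_to_bits(f):
    return struct.unpack("<I", struct.pack("<f", f))[0]

# A quiet NaN with a nonzero payload: bit reinterpretation round-trips
# it exactly, which is the guarantee the phi-retyping transform leans
# on when it moves values between integer and FP registers.
qnan = 0x7FC00001
assert float_to_bits(bits_to_float(qnan)) == qnan
```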

llvm/lib/CodeGen/CodeGenPrepare.cpp
5783

The list of Defs you're handling here is sort of a heuristic. The transform itself doesn't really care.

I guess the choice of LoadInst and ExtractElementInst here is a reasonable conservative guess. If we wanted to be more aggressive, we'd probably want to try to do some real cost estimate.

dmgreen updated this revision to Diff 272175.Jun 19 2020, 2:02 PM

Added a shouldConvertPhiType TLI hook, which defaults to true but is turned off on 32-bit X86.
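A standalone caricature of what such a hook ends up deciding; the real hook lives on TargetLowering and takes LLVM types, so the names and type sets here are invented for illustration:

```python
# Hypothetical stand-in for the TLI hook: default to allowing the phi
# retyping, but refuse int<->FP conversions on 32-bit x86, where FP
# values can travel through x87 load/store, which does not preserve
# signaling-NaN bits.
INT_TYPES = {"i16", "i32", "i64"}
FP_TYPES = {"f16", "f32", "f64"}

def should_convert_phi_type(is_x86_32, src, dst):
    crosses_fp_boundary = (src in FP_TYPES) != (dst in FP_TYPES)
    if is_x86_32 and crosses_fp_boundary:
        return False
    return True
```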

craig.topper added inline comments.Jun 19 2020, 2:40 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
30752

32-bit with SSE2 isn't different from 64-bit, is it? I'm assuming the real difference is x87 vs SSE for floating point? In which case even SSE or 64-bit still uses x87 for fp80.

efriedma added inline comments.Jun 19 2020, 4:00 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
30752

fp80 doesn't have the same issue; SNaN payloads are preserved by fp80 load and store operations.

On 32-bit with SSE2, we still generate x87 operations to handle return values; the calling convention requires it. Not sure what effect that has on this transform, specifically; maybe it's okay as long as we keep the values in SSE registers within the function.

craig.topper added inline comments.Jun 19 2020, 4:38 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
30752

Thanks for the clarification. Since this patch was about phis I didn't equate the mention of 32-bit mode with the calling convention specifically.

This revision is now accepted and ready to land.Jun 19 2020, 5:00 PM
This revision was automatically updated to reflect the committed changes.

Thanks. I tried to test some X86 a little to see if it seemed OK, and didn't find any issues. I've committed it with an option to turn it off more easily if problems arise.

If we do run into problems, I can add more cost modelling or restrict it to fewer targets.

bjope added a subscriber: bjope.Jun 22 2020, 1:57 AM

Couldn't there be uses in dbg intrinsics as well? What happens to such uses?
Maybe a test case with explicit dbg.value intrinsics could be useful (or maybe it is possible to use an extra RUN-line with -debugify)?

Oh. Yes. I always forget about dbg uses more than I should. I will put together a patch, thanks for the suggestion.

tpopp added a subscriber: tpopp.Jun 22 2020, 4:11 AM

Hi,

I reverted the follow-up change that enabled this pass, as it is causing 2x slowdowns on compilation of some large binaries at Google, and I assume it will affect others as well. It looked possibly related to ThinLTO, but I don't know if that's relevant or not.

Compile time? 2x? That sounds like a lot! It must be hitting some degenerate cases much more than I expected.

Do you have a reproducer? I don't think any of my testing has turned up anything that big. (Although I'm worried now it was lost in the noise of the measurements. 2x should be easy enough to see though). http://llvm-compile-time-tracker.com/ doesn't show anything obvious either. There's perhaps a little movement, but nothing like 2x. Everything is +-0.05% and seems to go largely away with the next build.

I will try to run some more benchmarks, but it would be good to see what is causing a 2x change, exactly.

tpopp added a comment.Jun 23 2020, 5:57 AM

I'm trying to get some help from more experienced coworkers on the California side to see how I can help prove this. I was also off on the 2x slowdown: compilation of one object file is taking slightly under 3x longer and then being stopped by a timeout in infrastructure that I haven't yet found a workaround for. I'm looking into some parts to see if maybe there's an infinite loop, while waiting for insights from others.

Might be stuck in a loop? Thanks for the info, that might help. I had tried to guard against that, but there's a chance it could happen somewhere. Let me know if you can find out what kind of code is causing it to go wrong; that would help get to the bottom of what is wrong.

tbosch added a subscriber: tbosch.Jun 23 2020, 5:25 PM

FYI, working on a reduction. Will have it hopefully tomorrow.

David, I sent you an email with a reproduction.

Thanks. Much appreciated. I'll take a look.

nikic added a subscriber: nikic.Jun 26 2020, 1:41 PM

InstCombine has a transform that, as far as I can see, does essentially the same as this in https://github.com/llvm/llvm-project/blob/903cf140d0118cf0d3f0f6f8967c6a20d9c5be6b/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp#L2298. I'm wondering why it is necessary to duplicate the logic in CGP. Is this because other CGP transformations only enable the transform to apply?

The InstCombine transform also had a series of issues (multiple instances of infinite loops and unprofitable transforms), so D82676 looks familiar :) Comparing some of the logic, I'm also wondering whether this transform is legal for volatile stores (InstCombine guards by isSimple).

InstCombine has a transform that, as far as I can see, does essentially the same as this in https://github.com/llvm/llvm-project/blob/903cf140d0118cf0d3f0f6f8967c6a20d9c5be6b/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp#L2298. I'm wondering why it is necessary to duplicate the logic in CGP. Is this because other CGP transformations only enable the transform to apply?

I had not seen that. Yes, we need to run this after the other transforms in CGP. It is really working around ISel only operating on a single basic block at a time.

I remember trying the original reproducer with the entire opt pipeline after running CGP on it, but the phi had not been converted like it needed. It looks like that version will always start with bitcast(phi(...)), and will not fan out into uses like this one does, or catch phi(bitcast(..)) on its own. This version seems a bit more powerful than that one (except for constants) but needs to start off at the phi to record visited nodes, to make sure we don't keep re-looking at the same values.

They are quite similar, but I'm not sure right now how I would combine them into a single version. That version has to work inside InstCombine, with its limitations, and this version needs to make sure it doesn't become O(n^2) and handles all the cases we need it to. Plus I guess the InstCombine version might be incorrect for X86_32?

The InstCombine transform also had a series of issues (multiple instances of infinite loops and unprofitable transforms), so D82676 looks familiar :) Comparing some of the logic, I'm also wondering whether this transform is legal for volatile stores (InstCombine guards by isSimple).

Good point. That's something I can do something about.
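For reference, the isSimple guard being discussed amounts to something like the following simplified model. LLVM's LoadInst/StoreInst::isSimple() returns true only for non-volatile, non-atomic accesses; the Python here only illustrates the predicate, with invented field names:

```python
# Simplified model of the "simple" memory-access check: a load/store is
# simple when it is neither volatile nor atomic. The phi retyping should
# bail out on anything else, since changing the type of a volatile or
# atomic access is not a behaviour-preserving rewrite.
class MemAccess:
    def __init__(self, volatile=False, atomic=False):
        self.volatile = volatile
        self.atomic = atomic

def is_simple(acc):
    return not acc.volatile and not acc.atomic

def ok_to_retype(acc):
    # Mirrors the isSimple() guard InstCombine uses before its own
    # version of this transform.
    return is_simple(acc)
```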