This is an archive of the discontinued LLVM Phabricator instance.

Lower idempotent RMWs to fence+load
ClosedPublic

Authored by morisset on Sep 19 2014, 4:51 PM.

Details

Summary

I originally tried doing this specifically for X86 in the backend in D5091,
but it was rather brittle and generally running to late to be general.
Furthermore, other targets may want to implement similar optimizations.
So I reimplemented it at the IR-level, fitting it into AtomicExpandPass
as it interacts with that pass (which could not be cleanly done before
at the backend level).

This optimization relies on a new target hook, which is only used by X86
for now, as the correctness of the optimization on other targets remains
an open question. If it is found correct on other targets, it should be
trivial to enable for them.

Details of the optimization are discussed in D5091.

Diff Detail

Repository
rL LLVM

Event Timeline

morisset updated this revision to Diff 13899.Sep 19 2014, 4:51 PM
morisset retitled this revision from to Lower idempotent RMWs to fence+load.
morisset updated this object.
morisset edited the test plan for this revision. (Show Details)
morisset added a reviewer: jfb.
morisset added a subscriber: Unknown Object (MLST).
jfb added inline comments.Sep 22 2014, 2:11 PM
lib/CodeGen/AtomicExpandPass.cpp
96 ↗(On Diff #13899)

I find that this code isn't quite obvious: if the RMW is idempotent then you try to simplify it in a target-defined way, and if that fails you call expandAtomicRMW from simplifyIdempotentRMW. There should be only one place where you call expandAtomicRMW, and it should be here.

Maybe add a virtual shouldSimplifyIdempotentRMWInIR function, or change the control flow here to first try to simplify, if that fails check if you should expand, and if that also fails when keep going (I think this is better).

481 ↗(On Diff #13899)

You could handle min/max/umin/umax but those seem somewhat useless. I guess other optimizations won't do partial evaluation into an atomic instruction, so constant propagation will feed up to the value operand of the atomic but not further, so it may be possible that this optimization would fire?

490 ↗(On Diff #13899)

Merge the two above lines, LLVM usually follows that style.

492 ↗(On Diff #13899)

Could you instead add the LoadInst to the AtomicInsts that you're iterating through? Same as before, I don't really like the repeated logic since it makes the code harder to modify and follow.

lib/Target/X86/X86ISelLowering.cpp
17029 ↗(On Diff #13899)

s/harmfull/harmful/

17030 ↗(On Diff #13899)

Not all primitive types smaller than native width have an appropriate load. Will this still work in these cases (or is it impossible to get here with types that aren't 8/16/32/64)?

test/CodeGen/X86/atomic_idempotent.ll
6 ↗(On Diff #13899)

optimization

47 ↗(On Diff #13899)

Could you also try weird integer sizes?

I also think that testing more RMW operations at least for 32-bit would be good, especially and.

Thanks for the review. Answers inline, and patch next.

lib/CodeGen/AtomicExpandPass.cpp
96 ↗(On Diff #13899)

I've changed the control-flow here, it is simpler (no more redundancy, only one call to expandAtomicRMW), but the control-flow lost a bit of readability. Please tell-me if it looks ok to you.

481 ↗(On Diff #13899)

There are two different optimizations doable for min/max/umin/umax:

  • simplify them when the value operand is the constant INT_MIN, INT_MAX, ...
  • simplify them when the value operand is necessarily the same value as the value already in the memory cell (for some definition of value, already and memory cell...).

It is not exactly clear to me which you want me to do. The first one is rather trivial to add, I would just need to find a way of getting INT_MIN/INT_MAX for any type size (probably not very hard, but I would rather keep it in a separate patch).

The second way would probably have to be a different pass, dealing with nightmarish complexity in the form of aliasing issues and subtle memory model issues (just defining what "current value in a memory location" is basically impossible in C11), and generally I don't want to open that can of worms unless it can be shown as crucial in actual benchmarks (I lost months trying to prove the correctness of simple variants of that class of optimizations with nothing to show for it but a slew of counter-examples and traps, so I am quite cautious about anything that looks like it).

490 ↗(On Diff #13899)

Done.

492 ↗(On Diff #13899)

I tried doing that originally, but it is not obvious how to do it cleanly:

  • adding the LoadInst to AtomicInsts might invalidate the iterator and break everything
  • a goto from that case to the beginning of the function would do the trick, but .. "goto" is not really recommended in LLVM I think
  • adding an extra level of nesting with a while loop would also work, but probably not be extremely readable (and the nesting of the control-flow in runOnFunction is already uncomfortable), and feels a bit overkill.

So I took this approach with a bit of redundancy because the other solutions looked even worse. I agree that it is ugly, I would love suggestions on how to clean it up.

lib/Target/X86/X86ISelLowering.cpp
17029 ↗(On Diff #13899)

Fixed.

17030 ↗(On Diff #13899)

Luckily, LLVM only allows power-of-2 sizes (in bytes) for Atomic RMW operations. So we don't have to bother checking for some abominations like an i24 or i13.

test/CodeGen/X86/atomic_idempotent.ll
6 ↗(On Diff #13899)

Fixed.

47 ↗(On Diff #13899)

I've added tests for and/sub.
However atomic RMWs are not defined/accepted by LLVM for sizes other than power of 2 number of bytes (luckily for my sanity).

morisset updated this revision to Diff 14004.Sep 23 2014, 11:30 AM

Partial cleanup of the control-flow + extra test + fixed formatting/typos.

jfb added inline comments.Sep 23 2014, 1:53 PM
include/llvm/Target/TargetLowering.h
1011 ↗(On Diff #14004)

I still find the interaction between this function and shouldExpandAtomicLoadInIR very confusing: it's not clear from this documentation that the implementation of lowerIdempotentRMWIntoFencedLoad can return a simple load atomic that will then be lowered appropriately if shouldExpandAtomicLoadInIR is true.

Overall the code is more understandable, though. I think updating this documentation is good enough.

lib/CodeGen/AtomicExpandPass.cpp
488 ↗(On Diff #14004)

Leave a FIXME that documents the other optimizations that can be done here.

morisset updated this revision to Diff 14019.Sep 23 2014, 2:02 PM

Add requested comments

jfb accepted this revision.Sep 23 2014, 2:09 PM
jfb edited edge metadata.

I think this looks good, but I'd leave it open for a while to see if others have comments.

This revision is now accepted and ready to land.Sep 23 2014, 2:09 PM
morisset closed this revision.Sep 25 2014, 10:37 AM
morisset updated this revision to Diff 14080.

Closed by commit rL218455 (authored by @morisset).