This is an archive of the discontinued LLVM Phabricator instance.

[AggressiveInstcombine] Conditionally fold saturated fptosi to llvm.fptosi.sat
ClosedPublic

Authored by dmgreen on May 17 2022, 3:03 AM.

Details

Summary

This adds a fold to AggressiveInstCombine that converts smin(smax(fptosi(x))) into a llvm.fptosi.sat, provided that the saturation constants are correct and the cost of the llvm.fptosi.sat is lower.

Unfortunately, a llvm.fptosi.sat cannot always be converted back to a smin/smax/fptosi. The llvm.fptosi.sat intrinsic is more defined than the original, which produces poison if the original fptosi was out of range. The llvm.fptosi.sat will saturate any value, so it needs to be expanded to a fptosi(fpmin(fpmax(x))), which can be worse for code generation depending on the target.
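To make the semantic gap concrete, here is a small standalone C++ model of the two forms. This is a sketch of the semantics only, not LLVM code, and the function names are invented for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Models llvm.fptosi.sat.i8 on a float input: out-of-range values
// saturate to the type bounds, and NaN maps to 0. Defined for every input.
int8_t fptosi_sat_i8(float x) {
  if (std::isnan(x)) return 0;
  if (x <= -128.0f) return -128;
  if (x >= 127.0f) return 127;
  return static_cast<int8_t>(x);
}

// Models the pattern the fold matches: smin(smax(fptosi(x), -128), 127).
// This is only defined when x fits in i32; an out-of-range input makes
// the fptosi poison, which the saturating form above never produces.
int8_t clamp_after_fptosi(float x) {
  int32_t i = static_cast<int32_t>(x);  // models fptosi float -> i32
  return static_cast<int8_t>(std::min(std::max(i, -128), 127));
}
```

For in-range inputs the two agree; only the saturating form is defined on NaN or values outside the i32 range, which is why the intrinsic cannot simply be converted back.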

So this is an RFC change that is conditional on the backend reporting that the llvm.fptosi.sat is cheaper than the original smin+smax+fptosi. This is a change to the way that AggressiveInstCombine has worked in the past. Instead of just being a canonicalization pass, that canonicalization can be dependent on the target in certain specific cases. This concept can also be useful in other cases such as the table-based cttz from D113291 and possibly funnel shifts (although I know the details there less).

Diff Detail

Event Timeline

dmgreen created this revision.May 17 2022, 3:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 17 2022, 3:03 AM
dmgreen requested review of this revision.May 17 2022, 3:03 AM
dmgreen edited the summary of this revision. (Show Details)May 17 2022, 3:03 AM

Do you need the fp sat intrinsic to be visible in IR (better vectorization?) - otherwise might this be better off in CGP?

Yeah - it needs to be earlier than that, before vectorization and preferably unrolling to get the costs correct.

OK - speaking to @spatel offline there was a concern that AIC was becoming a bit of a dumping ground for combines that didn't really fit anywhere else.

It might be that we need to consider a CostDrivenInstCombine pass or something instead? But then I'm not sure how much that will overlap with VectorCombine etc.

spatel added a subscriber: lattner.May 18 2022, 7:53 AM

OK - speaking to @spatel offline there was a concern that AIC was becoming a bit of a dumping ground for combines that didn't really fit anywhere else.

AggressiveInstCombine was intended to be the dumping ground for regular InstCombine. :)

There was concern that InstCombine was taking too long per invocation and was getting added all over the default optimization pipelines just to clean up mess from other passes. So AIC was a place that would only run once or twice in the pipeline where we could offload peepholes to ease compile-time cost. But that hasn't happened - we don't seem to be as collectively concerned about the compiler getting slower as we were when AIC was added. And everyone defaults to patching InstCombine because that guarantees that their fold will run and mesh with more folds. AIC is still only run at -O3 AFAIK.

It might be that we need to consider a CostDrivenInstCombine pass or something instead? But then I'm not sure how much that will overlap with VectorCombine etc.

VectorCombine is definitely a more focused pass. This patch and D113291 do not have vector requirements...but yes, the code starts to look similar once we put TTI cost comparisons into the mix.

In D113291, Craig made the point that we can predicate InstCombine and other pass transforms based on type legality from the DataLayout, so why not on opcode/intrinsic legality too?

So I'm not opposed, but I also don't have a good sense about potential downside of allowing target-specific IR transforms earlier in the optimization pipeline. Clearly, there's demand to do this because we get this kind of request fairly regularly.

@efriedma @lattner - any thoughts?

tschuett added a subscriber: tschuett.EditedMay 18 2022, 8:00 AM

I remember that there was talk about a FloatCombine pass with costs.

I like the distinction between a container CostDrivenInstCombine and special-purpose cost-driven passes: Vector, Float, ...

If you start a modern CostDrivenInstCombine, it would be great to integrate a modern statistic tool to track hit rates of combines.

There's a continual struggle between pushing towards canonical IR vs. pushing towards things we think are cheap on some specific target. Normally, the way we've resolved that is by distinguishing "early" vs. "late" optimizations: we try to push towards a canonical form early on to make the code easier to analyze, then we start doing optimizations like vectorization etc. using target-specific heuristics. AggressiveInstCombine doesn't really have anything to do with "early" vs. "late"; if we want something that runs just before vectorization, we should probably add a dedicated pass.

If vectorization is involved, we have to worry about cost difference between vector and scalar versions. For example, for vectors, we might want to use floating-point min, but for scalars we prefer integer min. Not sure if this is actually true for any target, but worth considering. If we need to deal with situations like that, we can't cleanly query the cost model, so we should prefer some unified approach. For example, attach some range metadata to fptosi.sat, or add some sort of "combiner" to VPlan.

That said, what targets are we worried about here? I guess soft-float targets? For targets with native floats, it's hard for me to imagine "nnan fptosi.sat" is significantly more expensive than "fptosi+min+max". It looks like isel currently doesn't take advantage of nnan, but it probably should.
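The nnan case raised above is what makes the fallback expansion cheap: once NaN is excluded, clamping in the fp domain first guarantees the conversion is in range. A minimal standalone model of that expansion (a sketch, not LLVM's actual lowering; the function name is invented):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Models the fptosi(fpmin(fpmax(x))) expansion of fptosi.sat, valid
// under the nnan assumption: clamp in the fp domain, then a plain
// conversion is always in range. A NaN input would additionally need a
// select to produce 0, which this sketch deliberately omits.
int8_t expand_fptosi_sat_nnan_i8(float x) {
  float clamped = std::fmin(std::fmax(x, -128.0f), 127.0f);
  return static_cast<int8_t>(clamped);
}
```

On targets with cheap fp min/max this expansion is close in cost to the original fptosi+min+max; on soft-float targets each fp operation is a libcall, which is where the regression comes from.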

Thanks for the comments.

There's a continual struggle between pushing towards canonical IR vs. pushing towards things we think are cheap on some specific target. Normally, the way we've resolved that is by distinguishing "early" vs. "late" optimizations: we try to push towards a canonical form early on to make the code easier to analyze, then we start doing optimizations like vectorization etc. using target-specific heuristics. AggressiveInstCombine doesn't really have anything to do with "early" vs. "late"; if we want something that runs just before vectorization, we should probably add a dedicated pass.

It's not really about vectorization - although that is where the biggest gains will come from. It's really about getting all the other cost-based decisions correct throughout the LLVM pipeline. If we don't, then there will always be performance left on the table.

The inliner runs early and has always considered costs. As @spatel / @craig.topper said above, most other passes are directed by the target through the datalayout; that is just a different form of cost modelling.

I had considered moving the code that calculates costs in this patch into a separate function in TTI. AggressiveInstCombine could then just call shouldChangeToFPSat, and that could be overridden by the target. The actual details inside shouldChangeToFPSat needn't come from the cost model, but in the case of fptosi.sat it might make the most sense: the intrinsics are usually custom legalized, and it's not always obvious when they will be profitable.
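The shape of the decision being described could look something like the following standalone sketch. shouldChangeToFPSat is a name floated in the comment, not an existing TTI hook, and the cost struct is invented for illustration:

```cpp
#include <cassert>

// Hypothetical inputs to the proposed hook: the cost of the single
// saturating intrinsic vs. the three instructions it would replace.
// These would come from TTI cost queries in a real implementation.
struct FPToSatCosts {
  int fptosi_sat;          // cost of llvm.fptosi.sat
  int fptosi, smin, smax;  // costs of the pattern being replaced
};

// Only perform the (irreversible) transform when the intrinsic is no
// more expensive than the instructions it replaces.
bool shouldChangeToFPSat(const FPToSatCosts &c) {
  return c.fptosi_sat <= c.fptosi + c.smin + c.smax;
}
```

A target override could then make the call on legality or custom-lowering knowledge rather than raw instruction costs.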

If vectorization is involved, we have to worry about cost difference between vector and scalar versions. For example, for vectors, we might want to use floating-point min, but for scalars we prefer integer min. Not sure if this is actually true for any target, but worth considering. If we need to deal with situations like that, we can't cleanly query the cost model, so we should prefer some unified approach. For example, attach some range metadata to fptosi.sat, or add some sort of "combiner" to VPlan.

There would be other ways to tackle this exact issue, but it doesn't solve the problem in general. I think for operations where there is a vector form but not a scalar form then there should be combiners in VPlan. MULH for example has vector instructions but not a scalar form. There was an example in D88152 of how that could work from a long time ago, but VPlan is still missing a few pieces (and has changed a lot since then).

We would also need to get the cost of smin(smax(fptosi)) correct, but I don't believe we can cost model multiple-element nodes like that well in general. Single instructions are usually easy enough to cost model. Two instructions like zext(load) are possible but it starts to get ugly. Three or more just doesn't work.

And none of that gets D113291 unstuck. Which is more about doing the transform early enough to benefit from other optimizations in the pipeline.

That said, what targets are we worried about here? I guess soft-float targets? For targets with native floats, it's hard for me to imagine "nnan fptosi.sat" is significantly more expensive than "fptosi+min+max". It looks like isel currently doesn't take advantage of nnan, but it probably should.

Yeah - soft-float Thumb1 targets are where I see significant losses from doing this unconditionally. You can always have targets where the integer smin/smax are cheaper than the fp min/max.

So I'm not opposed, but I also don't have a good sense about potential downside of allowing target-specific IR transforms earlier in the optimization pipeline. Clearly, there's demand to do this because we get this kind of request fairly regularly.

Do we know what the downsides presented in the past were? I've heard about an increased need for testing on all targets before, but that has really always been true with datalayout controlled combines. I believe that was more about the core of instcombine though, where more fundamental canonicalizations are happening. This and D113291 I feel are more about higher level irreversible transforms.

So I'm not opposed, but I also don't have a good sense about potential downside of allowing target-specific IR transforms earlier in the optimization pipeline. Clearly, there's demand to do this because we get this kind of request fairly regularly.

Do we know what the downsides presented in the past were? I've heard about an increased need for testing on all targets before, but that has really always been true with datalayout controlled combines. I believe that was more about the core of instcombine though, where more fundamental canonicalizations are happening. This and D113291 I feel are more about higher level irreversible transforms.

The primary downside of target-specific transforms is that it goes against canonicalization: if a combine is expecting a specific form of IR, that form will only show up on specific targets. So we either miss some transforms on some targets, or we write code to match the same thing in each possible form.

It's always been a bit of a spectrum; transforms like inlining are fundamentally driven by heuristics, and those heuristics are going to lead to different IR on different targets. But we want to encourage using canonical forms, even when we sometimes end up transforming from A->B, then later end up transforming B->A in the target's code generator. This shapes the way we define IR to some extent; for example, llvm.cttz has an "is_zero_poison" flag so we can use the same intrinsic on all targets.

This isn't to say we can never make a target-specific decision early, but we should explore alternatives that allow making target-specific decisions later, where possible.
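The is_zero_poison distinction mentioned above can be modeled in a few lines of standalone C++ (a sketch of the semantics, not LLVM code):

```cpp
#include <cassert>
#include <cstdint>

// Models llvm.cttz with is_zero_poison=false: zero input is defined and
// yields the bit width, which is what lets the one intrinsic form be
// used as the canonical representation on every target.
// (__builtin_ctz itself is undefined for 0, matching is_zero_poison=true.)
uint32_t cttz32_zero_defined(uint32_t x) {
  return x == 0 ? 32u : static_cast<uint32_t>(__builtin_ctz(x));
}
```

Targets whose count-trailing-zeros instruction already handles zero lower this directly; others materialize the zero check, but the IR stays canonical either way.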

The primary downside of target-specific transforms is that it goes against canonicalization: if a combine is expecting a specific form of IR, that form will only show up on specific targets. So we either miss some transforms on some targets, or we write code to match the same thing in each possible form.

It's always been a bit of a spectrum; transforms like inlining are fundamentally driven by heuristics, and those heuristics are going to lead to different IR on different targets. But we want to encourage using canonical forms, even when we sometimes end up transforming from A->B, then later end up transforming B->A in the target's code generator. This shapes the way we define IR to some extent; for example, llvm.cttz has an "is_zero_poison" flag so we can use the same intrinsic on all targets.

This isn't to say we can never make a target-specific decision early, but we should explore alternatives that allow making target-specific decisions later, where possible.

OK, I would like to get this and D113291 unstuck if we can. There are other similar problems that would fall into the same camp. I agree that canonicalization can be very useful (and we should be careful not to break it needlessly), but the canonicalization doesn't need to be identical for every target. It is always going to be detrimental if it is - either the costs are too incorrect to make good decisions, or extra optimizations that could occur in the mid-end do not happen where they should. A single (semi-)target-independent canonicalization doesn't seem like a strong enough benefit to sacrifice optimization power or compile time.

For this patch - the transform needs to happen before loop unrolling to get the costs correct. And I feel it is less about canonicalization, as the transform is not reversible; neither form is canonical. There are other cost decisions like inlining, but I believe they would be less likely to be super important. That sounds like AggressiveInstCombine would be a good place for it, given its position in the pipeline: it is early enough to get the following costs correct, but doesn't muddy up InstCombine with target queries.

In summary, the cost of a saturating fptosi vs. plain fptosi+min+max varies so much across targets that we need two canonical forms: one for targets where saturating fptosi with the given types is cheap, and one for targets where it isn't. I guess that conclusion is fine, but we should state it in a comment in the code, so it's clear why we're cost-modeling this.

I'm not sure how D113291 is related to that conclusion, though. The cost of llvm.cttz should be the same as, or cheaper than, a table lookup. Worst case, SelectionDAG emits a table lookup itself. (SelectionDAG currently doesn't have code to emit a table lookup, I think, but that's straightforward to change.)

dmgreen updated this revision to Diff 432854.May 30 2022, 12:41 AM

Add a comment to tryToFPToSat explaining the situation.

efriedma accepted this revision.Jun 7 2022, 3:26 PM

LGTM

This revision is now accepted and ready to land.Jun 7 2022, 3:26 PM
This revision was landed with ongoing or failed builds.Jun 10 2022, 1:36 AM
This revision was automatically updated to reflect the committed changes.
dnsampaio added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
405–407

Hi @dmgreen, long time.
I'm updating our downstream compiler from llvm 12 to llvm 15, and this getCastInstrCost call is doing a rather strange thing: it asks for the cost of a sign extend from i32 to i8, which seems awkward.
Instead of using the fixed Instruction::SExt, shouldn't it be using CastInst::getCastOpcode(IntTy, true, SatTy, true) here?

Reproducer:

@a = global float 0.000000e+00
@b = global i32 0

define i32 @c() {
  %1 = load float, ptr @a
  %2 = fptosi float %1 to i32
  %3 = call i32 @llvm.smin.i32(i32 %2, i32 127)
  %4 = call i32 @llvm.smax.i32(i32 %3, i32 -128)
  store i32 %4, ptr @b
  ret i32 undef
}

declare i32 @llvm.smin.i32(i32, i32)

declare i32 @llvm.smax.i32(i32, i32)

Coming from a creduced code:

float a;
b;
c() {
  b = a;
  if (b > 127)
    b = 127;
  if (b < -128)
    b = -128;
}
dmgreen added inline comments.Jul 10 2023, 1:35 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
405–407

Hello. Hope you are well.

It sounds like the dst and src types are the wrong way around. I think that SatTy should always be smaller than IntTy. I can run a couple of quick tests and will put up a patch for it.
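The direction of the extend can be illustrated with a trivial standalone model (the SatTy/IntTy names mirror the inline comment above; this is a sketch, not LLVM code):

```cpp
#include <cassert>
#include <cstdint>

// The saturated type is the narrower one, so the extend being costed
// widens from it: i8 -> i32 here, never i32 -> i8. A cost query
// modeling this step should therefore use the narrow type as the
// source and the wide type as the destination.
int32_t widen_sat_result(int8_t sat_value) {
  return static_cast<int32_t>(sat_value);  // corresponds to sext i8 -> i32
}
```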