This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE2] Use smarter instruction patterns for lowering UMIN/UMAX with v8i16.
ClosedPublic

Authored by TomHender on Sep 7 2020, 7:24 AM.

Details

Summary

This is my first LLVM patch, so please tell me if there are any process issues.

The main observation for this patch is that we can lower UMIN/UMAX with v8i16 by using unsigned saturated subtractions in a clever way. Previously this operation was lowered by flipping the sign bit of both inputs and of the output, which turns the unsigned minimum/maximum into a signed one.
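
For illustration, here is a minimal sketch of both lowerings written with SSE2 intrinsics (the function names are mine and purely illustrative; the patch itself operates on the SelectionDAG, not on intrinsics):

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// New lowering (this patch): unsigned v8i16 max/min via saturated subtraction.
// umax(a, b) = a + sat_sub(b, a);  umin(a, b) = a - sat_sub(a, b).
static __m128i umax_v8i16(__m128i a, __m128i b) {
  return _mm_add_epi16(a, _mm_subs_epu16(b, a)); // PADDW + PSUBUSW
}
static __m128i umin_v8i16(__m128i a, __m128i b) {
  return _mm_sub_epi16(a, _mm_subs_epu16(a, b)); // PSUBW + PSUBUSW
}

// Previous lowering: flip the sign bit of both inputs, use the signed
// PMAXSW, then flip the sign bit of the result back.
static __m128i umax_v8i16_signflip(__m128i a, __m128i b) {
  const __m128i SignBit = _mm_set1_epi16((short)0x8000); // constant pool load
  __m128i Res = _mm_max_epi16(_mm_xor_si128(a, SignBit),
                              _mm_xor_si128(b, SignBit));
  return _mm_xor_si128(Res, SignBit);
}
```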

We could use the sign-flip trick in reverse for lowering SMIN/SMAX with v16i8 instead. In terms of latency/throughput this is comparable; it just needs one extra move instruction to load the sign-bit constant. It's just that the sign-bit flipping has an increased chance of being optimized further, which is particularly apparent in the "reduce" test cases. However, due to the slight regression in the single-use case, this patch no longer proposes that change.
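
For reference, the reverse trick for v16i8 would look roughly like the following (again a hypothetical sketch with intrinsics, not code from this patch):

```cpp
#include <emmintrin.h>

// Hypothetical sketch: signed v16i8 max via the unsigned PMAXUB by flipping
// the sign bit of the inputs and of the result. The sign-bit constant needs
// an extra constant pool load, which is the regression discussed below.
static __m128i smax_v16i8_signflip(__m128i a, __m128i b) {
  const __m128i SignBit = _mm_set1_epi8((char)0x80);
  __m128i Res = _mm_max_epu8(_mm_xor_si128(a, SignBit),
                             _mm_xor_si128(b, SignBit));
  return _mm_xor_si128(Res, SignBit);
}
```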

Unfortunately this argument also applies in reverse to the new lowering of UMIN/UMAX with v8i16, which regresses the "horizontal-reduce-umax", "horizontal-reduce-umin", "vector-reduce-umin" and "vector-reduce-umax" test cases a bit with this patch. Maybe some extra casework could avoid this. Independent of that, however, I believe that the benefits in the common case of just 1 to 3 chained min/max instructions outweigh the downsides in that specific case.

Diff Detail

Event Timeline

TomHender created this revision.Sep 7 2020, 7:24 AM
TomHender requested review of this revision.Sep 7 2020, 7:24 AM
TomHender edited the summary of this revision. (Show Details)

@RKSimon Can we match the horizontal reductions early during DAG combine and insert the sign bit flip at the top and bottom of the tree and rewrite the intermediate min/maxes to the other type? That would hide them from the lowering code changes.

I don't think we currently do anything x86-specific for pre-SSE41 min/max reductions, and even then we just do the PHMINPOSUW vXi8/vXi16 cases.

I published a followup patch D88026 that fixes the horizontal reduction regression in the way suggested by craig.topper.

RKSimon added inline comments.Sep 29 2020, 10:03 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
26945

(style) Remove braces

llvm/test/CodeGen/X86/vselect-minmax.ll
20

Regressions? Increased instruction count (the movdqa rr on the other side is effectively free) + constant pool load.

TomHender marked an inline comment as done.Sep 29 2020, 11:31 AM

Thank you for your comments.

I am not sure what to do about the single i8 min/max regression. It is also at least 4 bytes larger in code size. The benefit of the change is only that it can enable further simplifications later on (sometimes drastically, as can be seen in the other tests). Considering that this lowering strategy was previously used for the i16 case, that must have been the argument for it. The current patch strives to be just a compromise.

I think the proper solution would be to use some kind of sophisticated heuristic that looks at the surrounding code and chooses a different lowering depending on what is there. For example, one could check whether there are already other min, max or xor operations on the arguments and the result (while looking through shuffles). This would also solve the i16 case more thoroughly. However, this sounds non-trivial to implement, and it is not clear to me whether that is the right place in the code for such logic, or whether it would be worthwhile to pursue at all. What do you think?

As a stopgap - would it be possible to split off the v16i8 cases into a new patch? And update this one with just the v8i16 case.

Additionally, please can you update the UMIN/UMAX costs for v8i16 in X86TTIImpl::getTypeBasedIntrinsicInstrCost? You'll need to add entries to SSE2CostTbl as it uses the default expansion atm. This might affect transform tests as well as analysis/costmodel tests.
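
i.e. something along these lines (the exact cost values below are only illustrative and should be checked against the conventions of the existing table):

```cpp
// Illustrative only - entries of this shape would go into the existing
// SSE2CostTbl used by X86TTIImpl::getTypeBasedIntrinsicInstrCost; the actual
// cost values should reflect the 2-instruction lowering introduced here.
static const CostTblEntry SSE2CostTbl[] = {
  { ISD::UMAX, MVT::v8i16, 2 }, // PSUBUSW + PADDW
  { ISD::UMIN, MVT::v8i16, 2 }, // PSUBUSW + PSUBW
};
```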

TomHender updated this revision to Diff 296048.Oct 4 2020, 7:24 AM
TomHender retitled this revision from [X86][SSE2] Use smarter instruction patterns for lowering UMIN/UMAX with v8i16 and SMIN/SMAX with v16i8. to [X86][SSE2] Use smarter instruction patterns for lowering UMIN/UMAX with v8i16..
TomHender edited the summary of this revision. (Show Details)
TomHender updated this revision to Diff 296049.Oct 4 2020, 7:29 AM

Please can you update the cost model test checks? "ninja check-llvm-analysis-costmodel" should show some failures now - you can regenerate with "llvm\utils\update_analyze_test_checks.py"

You might see diffs under "ninja check-llvm-transforms" as well

This patch is now just about v8i16 and I updated the array in "getTypeBasedIntrinsicInstrCost".

Regarding the remaining regression and the alternative lowering for v16i8, I think I would like to try implementing the cost-analysis-based sign-switching for both v8i16 and v16i8. I have given it some thought and the actual logic seems relatively straightforward: we look at connected sequences of minimums and maximums (where we can even have simple shuffles, xors, adds, subtractions and possibly other operations in between) and then compare the cost of lowering the sequence naively (i.e. as of this patch) with the cost of lowering it after flipping the signedness of all minimums and maximums and flipping the sign bit at every input/output point. If sign switching is cheaper, we rewrite the tree by switching all signed and unsigned operations and adding an xor at each input/output point. Some common patterns I hope to optimize this way include the following:

  • All the horizontal reduction code should automatically fall under this.
  • Calculating the maximum or minimum of many values.
  • The clamp pattern: min(max(x, min_value), max_value), possibly with constant minimum and maximum bounds (sketched after this list).
  • The median pattern: max(min(x, y), min(max(x, y), z))
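
To illustrate the clamp case, a hypothetical sketch with intrinsics (the function and the constant bounds are made up; the point is that the sign-bit flips on the constant operands fold away, leaving only two XORs for the whole chain):

```cpp
#include <emmintrin.h>

// Hypothetical sketch of a sign-switched clamp umin(umax(x, lo), hi) on
// v8i16: flip the sign bit once on the way in and once on the way out and
// use the cheap signed PMAXSW/PMINSW in between. The flips applied to the
// constant bounds fold at compile time, and the flips between the two
// min/max operations cancel out entirely.
static __m128i uclamp_v8i16(__m128i x, unsigned short lo, unsigned short hi) {
  const __m128i SignBit = _mm_set1_epi16((short)0x8000);
  __m128i sx  = _mm_xor_si128(x, SignBit);                 // flip the input
  __m128i slo = _mm_set1_epi16((short)(lo ^ 0x8000));      // pre-flipped bound
  __m128i shi = _mm_set1_epi16((short)(hi ^ 0x8000));      // pre-flipped bound
  __m128i r   = _mm_min_epi16(_mm_max_epi16(sx, slo), shi);
  return _mm_xor_si128(r, SignBit);                        // flip the result
}
```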

The main question, and the reason why I am not already implementing a prototype, is that I don't know where to put the code. I think neither the custom lowering function "X86TargetLowering::LowerOperation" nor "X86TargetLowering::PerformDAGCombine" is a good fit. The first one seems suboptimal because, from my superficial understanding, this method is not supposed to operate on the whole instruction graph but only lower one instruction at a time. The second one is annoying because, as far as I understand LLVM's control flow, we will no longer have explicit minimums or maximums at that point, since they have already been replaced by the earlier "LowerOperation". This would make matching the code patterns much more annoying.

Is there some target specific callback which runs before the lowering and is able/supposed to do tree rewriting like this?

TomHender marked an inline comment as done.Oct 4 2020, 8:46 AM
TomHender updated this revision to Diff 296148.Oct 5 2020, 4:10 AM

Unfortunately I couldn't get the "update_analyze_test_checks.py" script to work no matter what arguments I tried (including --force-update, --opt-binary=T:\llvm-project\llvm\NATIVE\Debug\bin\opt.exe, different quotation mark placement and so on). Most of the time it just did nothing without any error, or it would sometimes error with "ERROR: Unexpected opt name: opt.exe".

Anyway, I updated the test code manually and all the cost model tests are passing now. In the transforms directory the tests "Inline/optimization-remarks-yaml.ll" and "SampleProfile/uncompressed-profile-symbol-list.ll" fail for me, but I suspect that this is not because of my patch.

Do they fail the same way without your patch?

LGTM - @craig.topper any final comments or suggestions on how to proceed for v16i8?

Please can you create an initial patch for the v16i8 cases based on the code you pulled out of this patch? We can discuss how to proceed from that.

RKSimon accepted this revision.Oct 7 2020, 2:02 PM

LGTM for the v8i16 cases

This revision is now accepted and ready to land.Oct 7 2020, 2:02 PM

Can you commit the changes for me?

My GitHub account is ActuallyaDeviloper <ActuallyaDeviloper@users.noreply.github.com>.

yubing added a subscriber: yubing.Jan 12 2021, 6:24 PM