This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add new pass `X86FixupInstTuning` for fixing up machine-instruction selection.
ClosedPublic

Authored by goldstein.w.n on Feb 10 2023, 3:31 PM.

Details

Summary

There are a variety of cases where we want more control over the exact
instruction emitted. This commit creates a new pass to fixup
instructions after the DAG has been lowered. The pass is only meant to
replace instructions that are guranteed to be interchangable, not to
do analysis for special cases.

Handling these instruction changes in in X86ISelLowering of
X86ISelDAGToDAG isn't ideal, as its liable to either break existing
patterns that expected a certain instruction or generate infinite
loops.

Currently, only vpermilps -> vshufps/vshufd is implemented, but
more cases can be added.

Diff Detail

Event Timeline

goldstein.w.n created this revision.Feb 10 2023, 3:31 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 10 2023, 3:31 PM
goldstein.w.n requested review of this revision.Feb 10 2023, 3:31 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 10 2023, 3:31 PM

Please can you split off the patch adding the free domain switch tuning flags? Its been something we've been missing from combineX86ShuffleChain for some time and should be addressed first.

llvm/lib/Target/X86/X86.td
533 ↗(On Diff #496616)

Maybe rename to TuningNoDomainDelay/TuningNoDomainDelayShuffle - bypass could mean many things.....

1005 ↗(On Diff #496616)

Bonnell has no domain penalty either (and REALLY likes the shorted instruction encodings).

Split off domain flags

goldstein.w.n marked 2 inline comments as done.

Please can you split off the patch adding the free domain switch tuning flags? Its been something we've been missing from combineX86ShuffleChain for some time and should be addressed first.

Done, see D143859

I think this should be controlled by a tuning flag tbh, or even better we do this in a later pass and not in dag/isel

I think this should be controlled by a tuning flag tbh, or even better we do this in a later pass and not in dag/isel

Where would you suggest putting it?

Also what is the case for the seperate tuning flag? AFAICT its just better code size and equal or better perf so seems
like clear universal win.

We have a number of cases where a specific instruction is faster on one target than another, or there's no domain switch cost and we can use smaller variants, etc.

This comes to mind as well: https://github.com/llvm/llvm-project/issues/43458

We can use tuning flags, but for many cases its just confusing and matching what our scheduler models already tell us. Plus we end up with many permutations of DAG / isel that don't always work well together, or cause infinite loops etc.

My idea was for a small pass similar to FixupLEA/FixBWI that we can drive from a mixture of the subtarget tuning flags and the scheduler model to decide between various equivalent instruction options based on cost estimates. Shuffle ops give themselves to this the most, but theres probably others we might consider as well.

Replacing a single instruction for another is trivial - it might be feasible to replace a single instruction for multiple instructions as well (HADD/HSUB expansion, Funnel Shifts, etc.).

We have a number of cases where a specific instruction is faster on one target than another, or there's no domain switch cost and we can use smaller variants, etc.

This comes to mind as well: https://github.com/llvm/llvm-project/issues/43458

We can use tuning flags, but for many cases its just confusing and matching what our scheduler models already tell us. Plus we end up with many permutations of DAG / isel that don't always work well together, or cause infinite loops etc.

My idea was for a small pass similar to FixupLEA/FixBWI that we can drive from a mixture of the subtarget tuning flags and the scheduler model to decide between various equivalent instruction options based on cost estimates. Shuffle ops give themselves to this the most, but theres probably others we might consider as well.

Replacing a single instruction for another is trivial - it might be feasible to replace a single instruction for multiple instructions as well (HADD/HSUB expansion, Funnel Shifts, etc.).

I see, that makes sense. I'll take a stab at that.

Move logic to a new pass

goldstein.w.n retitled this revision from [X86] Try to use `{v}shufps` instead of `vpermilps` for common float shuffles. to [X86] Add new pass `X86FixupISel` for fixing up machine-instruction selection..Feb 15 2023, 11:56 PM
goldstein.w.n edited the summary of this revision. (Show Details)

We have a number of cases where a specific instruction is faster on one target than another, or there's no domain switch cost and we can use smaller variants, etc.

This comes to mind as well: https://github.com/llvm/llvm-project/issues/43458

We can use tuning flags, but for many cases its just confusing and matching what our scheduler models already tell us. Plus we end up with many permutations of DAG / isel that don't always work well together, or cause infinite loops etc.

My idea was for a small pass similar to FixupLEA/FixBWI that we can drive from a mixture of the subtarget tuning flags and the scheduler model to decide between various equivalent instruction options based on cost estimates. Shuffle ops give themselves to this the most, but theres probably others we might consider as well.

Replacing a single instruction for another is trivial - it might be feasible to replace a single instruction for multiple instructions as well (HADD/HSUB expansion, Funnel Shifts, etc.).

Done, only added vpermilps -> vshufps/vshufd at the moment, but other replacements should be doable.

pengfei added inline comments.Feb 16 2023, 12:17 AM
llvm/lib/Target/X86/X86FixupISel.cpp
27 ↗(On Diff #497899)

We often use all lower case latters.

69 ↗(On Diff #497899)

Lambda function name should starts with upper latter.

75 ↗(On Diff #497899)

ditto.

86 ↗(On Diff #497899)

Better to use processVPERMILPSri/processVPERMILPSmi

llvm/test/CodeGen/X86/opt-pipeline.ll
205

Is the pass too far away from ISel?

goldstein.w.n added inline comments.Feb 16 2023, 12:21 AM
llvm/test/CodeGen/X86/opt-pipeline.ll
205

I think we want it to run quite after all other instruction transformations.

The important ones are register allocation (so we convert vpermlipsri -> vshufpsrri at a spill) and domainfixup (so we have the correct instructions).

goldstein.w.n marked 4 inline comments as done.
goldstein.w.n edited the summary of this revision. (Show Details)

Fix style

craig.topper added inline comments.
llvm/lib/Target/X86/X86FixupISel.cpp
1 ↗(On Diff #497899)

Referring to ISel and running the pass so far from instruction selection is perhaps misleading.

pengfei added inline comments.Feb 16 2023, 2:46 AM
llvm/lib/Target/X86/X86FixupISel.cpp
1 ↗(On Diff #497899)

I have the same concern, if it is required for vpermli*, maybe just name it FixupVpermli.

RKSimon added inline comments.Feb 16 2023, 4:36 AM
llvm/lib/Target/X86/X86FixupISel.cpp
1 ↗(On Diff #497899)

The hope is to add other instruction combos in the future - given we're trying to fixup for a mixture of tuning/scheduler - maybe X86FixupTuning.cpp ?

12 ↗(On Diff #497908)

What do you mean by analysis? I'm hoping we can use scheduler models in the future to drive some transforms here.

68 ↗(On Diff #497908)

Comments explaining the exact purpose of these transforms would be useful.

69 ↗(On Diff #497908)

(style) use bool instead of auto

goldstein.w.n added inline comments.Feb 16 2023, 9:44 AM
llvm/lib/Target/X86/X86FixupISel.cpp
12 ↗(On Diff #497908)

I mean something like transforming vpermq ymm <-> vshufd ymm should not be put in this file, as its not always valid (requires analyzing the mask to see if no lane crosses and repeated / mask is only in pairs of 2). OTOH vpermilps <-> vshufd are always interchangable. I'll make the logic more clear.

RKSimon added inline comments.Feb 16 2023, 10:05 AM
llvm/lib/Target/X86/X86FixupISel.cpp
12 ↗(On Diff #497908)

Agreed, anything involving valuetracking etc. shouldn't be done this late.

However, I don't see any problem with replacing instructions based on immediates that are part of the instruction - we already do this for load folding, commutation, domain switching etc.

goldstein.w.n marked 2 inline comments as done.Feb 16 2023, 10:29 AM
goldstein.w.n added inline comments.
llvm/lib/Target/X86/X86FixupISel.cpp
1 ↗(On Diff #497899)

Renamed to X86FixupInstTuning.cpp (though FixupTuning was a bit generic about what exactly is being tuned).

12 ↗(On Diff #497908)

Agreed, anything involving valuetracking etc. shouldn't be done this late.

However, I don't see any problem with replacing instructions based on immediates that are part of the instruction - we already do this for load folding, commutation, domain switching etc.

Its not inherently wrong, but my feeling is that belongs in X86ISelLowering/X86ISelDAGToDAG/td. I think it would work for a few instructions, but it seems like the kind of thing that will explode the complexity of the file over time (even the shuffle logic with in X86ISelLowering is extremely complex/convoluted with the clean APIs the DAG has). I think that kind of complexity on a file operation on machine instr will be bug-prone. Generally, we *could* do it here, but I don't think we *should*.

Fix some nits, rename

goldstein.w.n retitled this revision from [X86] Add new pass `X86FixupISel` for fixing up machine-instruction selection. to [X86] Add new pass `X86FixupInstTuning` for fixing up machine-instruction selection..Feb 16 2023, 10:30 AM
Matt added a subscriber: Matt.Feb 20 2023, 1:18 PM

Support for tput/code size decisions.
handle unpck/movhlps

Probably better for this first patch to add the pass and the inital vpermilps -> vshufps/vshufd fold, the scheduler based unpckpd fold can be added in a followup

llvm/lib/Target/X86/X86FixupInstTuning.cpp
85

It'd be better to return Optional<double> instead of 0.0 for failed matches (and return nullopt here)

goldstein.w.n marked an inline comment as done.

Remove unpckpd -> shufps and the target sched info stuff

Probably better for this first patch to add the pass and the inital vpermilps -> vshufps/vshufd fold, the scheduler based unpckpd fold can be added in a followup

Done, version with sched info + unpck transform as at: D144570

llvm/lib/Target/X86/X86FixupInstTuning.cpp
85

Done in new version: D144570

RKSimon accepted this revision.Feb 23 2023, 6:08 PM

LGTM with a few minors - thank you for working on this!

llvm/lib/Target/X86/X86FixupInstTuning.cpp
20

VPERMILPS xmm?

95

Maybe add a TODO for always changing for Os/Oz builds?

101

Are you intending to support the predicated variants as well? Maybe add a TODO?

This revision is now accepted and ready to land.Feb 23 2023, 6:08 PM
goldstein.w.n marked 2 inline comments as done.Feb 23 2023, 8:10 PM
goldstein.w.n added inline comments.
llvm/lib/Target/X86/X86FixupInstTuning.cpp
101

What do you mean? For VPERMILPS all variants are supported I believe.

Add some todos

RKSimon added inline comments.Feb 24 2023, 3:10 AM
llvm/lib/Target/X86/X86FixupInstTuning.cpp
101

X86::VPERMILPSZrikz etc.

e-kud accepted this revision.Feb 24 2023, 7:16 AM

LGTM. I second that masked versions should be handled as well. I thought they have TP equal to perms but I've double checked and it seems that masked shuffles have TP=0.5 comparing to perms.

goldstein.w.n added inline comments.Feb 24 2023, 9:52 AM
llvm/lib/Target/X86/X86FixupInstTuning.cpp
101

Oh sure, Ill add a todo for now and make a new patch for predicate version later today.

Add TODO for masked predicates

LGTM. I second that masked versions should be handled as well. I thought they have TP equal to perms but I've double checked and it seems that masked shuffles have TP=0.5 comparing to perms.

@RKSimon added masked predicates at D144763

RKSimon requested changes to this revision.Feb 24 2023, 3:36 PM

someting I just noticed - please can you ensure you have AVX1 test coverage for the VPERMILPSmi/VPERMILPSYmi cases - we can't fold the VPERMILPSYmi case to the VPSHUDYmi which requires AVX2

llvm/lib/Target/X86/X86FixupInstTuning.cpp
116

VPSHUFDYmi? I think this needs a AVX2 check as well

118

VPSHUFDmi

This revision now requires changes to proceed.Feb 24 2023, 3:36 PM
goldstein.w.n added inline comments.Feb 24 2023, 3:41 PM
llvm/lib/Target/X86/X86FixupInstTuning.cpp
116

Good catch. Do you know the API for checking if an instruction is supported on the target? Or will it need to be manually done case by case?

someting I just noticed - please can you ensure you have AVX1 test coverage for the VPERMILPSmi/VPERMILPSYmi cases - we can't fold the VPERMILPSYmi case to the VPSHUDYmi which requires AVX2

I'll add tests similar to what we have for .unpckpd

Add missing AVX2 check and fix ymm/xmm subsitutions

goldstein.w.n marked 2 inline comments as done.Feb 24 2023, 11:39 PM
goldstein.w.n added inline comments.
llvm/lib/Target/X86/X86FixupInstTuning.cpp
116

VPSHUFDYmi? I think this needs a AVX2 check as well

Done, did just a manual check here, but if you know an API for querying if an opcode is supported that would probably be better going forward.
Added tests for the permilps transforms here: D144779 and the AVX2 requirement is tested.

I don't think there exists such an API. We don't record such information for each instruction.

I don't think there exists such an API. We don't record such information for each instruction.

By this point we've lost that info - its embedded in the isel tables but not much else - so you'll have to do it manually for the AVX1/AVX2 cases

RKSimon accepted this revision.Feb 26 2023, 3:55 AM

LGTM

This revision is now accepted and ready to land.Feb 26 2023, 3:55 AM

Rebase (after D144832, no dep on D143786)

This revision was landed with ongoing or failed builds.Feb 27 2023, 4:54 PM
This revision was automatically updated to reflect the committed changes.