This is an archive of the discontinued LLVM Phabricator instance.

Differential D143787

[X86] Add new pass `X86FixupInstTuning` for fixing up machine-instruction selection.
Closed, Public

Authored by goldstein.w.n on Feb 10 2023, 3:31 PM.

Details

Reviewers

pengfei
RKSimon
e-kud

Commits

rG69a322fed19b: Add new pass `X86FixupInstTuning` for fixing up machine-instruction selection.

Summary

There are a variety of cases where we want more control over the exact
instruction emitted. This commit creates a new pass to fix up
instructions after the DAG has been lowered. The pass is only meant to
replace instructions that are guaranteed to be interchangeable, not to
do analysis for special cases.

Handling these instruction changes in X86ISelLowering or
X86ISelDAGToDAG isn't ideal, as it's liable to either break existing
patterns that expect a certain instruction or generate infinite
loops.

Currently, only vpermilps -> vshufps/vpshufd is implemented, but
more cases can be added.
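
As a rough illustration of why this first replacement is always safe, here is a minimal scalar sketch (not code from the patch) that models the 128-bit register forms of `vpermilps` and `vshufps` and checks that they agree for every immediate when `vshufps` reads the same register twice. The names `Vec4`, `permilps`, `shufps` and `permilpsMatchesShufps` are made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<float, 4>;

// Scalar model of `vpermilps xmm, xmm, imm8`: each 2-bit field of the
// immediate selects one of the four source elements.
Vec4 permilps(const Vec4 &Src, uint8_t Imm) {
  Vec4 Dst{};
  for (unsigned I = 0; I < 4; ++I)
    Dst[I] = Src[(Imm >> (2 * I)) & 3];
  return Dst;
}

// Scalar model of `vshufps xmm, xmm, xmm, imm8`: the two low results are
// selected from the first source, the two high results from the second.
Vec4 shufps(const Vec4 &A, const Vec4 &B, uint8_t Imm) {
  Vec4 Dst{};
  for (unsigned I = 0; I < 4; ++I) {
    const Vec4 &From = I < 2 ? A : B;
    Dst[I] = From[(Imm >> (2 * I)) & 3];
  }
  return Dst;
}

// Returns true if both models agree for all 256 immediates when `vshufps`
// is given the same register for both sources.
bool permilpsMatchesShufps(const Vec4 &Src) {
  for (unsigned Imm = 0; Imm < 256; ++Imm)
    if (permilps(Src, static_cast<uint8_t>(Imm)) !=
        shufps(Src, Src, static_cast<uint8_t>(Imm)))
      return false;
  return true;
}
```

Calling `permilpsMatchesShufps({1.0f, 2.0f, 3.0f, 4.0f})` returns true in this model, which is the property the pass relies on when it rewrites `vpermilps r, i` to `vshufps r, r, i`.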

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

goldstein.w.n created this revision. Feb 10 2023, 3:31 PM

Herald added a project: Restricted Project. Feb 10 2023, 3:31 PM

Herald added subscribers: pengfei, arphaman, hiraditya.

goldstein.w.n requested review of this revision. Feb 10 2023, 3:31 PM

Herald added a subscriber: llvm-commits.

goldstein.w.n added a parent revision: D143786: [X86] Add `TuningPreferShiftShuffle` for when Shifts are preferable to shuffles. Feb 10 2023, 3:33 PM

goldstein.w.n added reviewers: pengfei, RKSimon.

goldstein.w.n added a child revision: D143788: [X86] Add more tests for promoting `blendw` -> `blendd`; NFC.

Harbormaster completed remote builds in B213162: Diff 496616. Feb 10 2023, 4:34 PM

Please can you split off the patch adding the free domain switch tuning flags? It's been something we've been missing from combineX86ShuffleChain for some time and should be addressed first.

llvm/lib/Target/X86/X86.td
533 ↗	(On Diff #496616)	Maybe rename to TuningNoDomainDelay/TuningNoDomainDelayShuffle - bypass could mean many things.
1005 ↗	(On Diff #496616)	Bonnell has no domain penalty either (and REALLY likes the shorter instruction encodings).

Split off domain flags

Harbormaster completed remote builds in B213307: Diff 496797. Feb 12 2023, 2:17 PM

In D143787#4120193, @RKSimon wrote:

Please can you split off the patch adding the free domain switch tuning flags? It's been something we've been missing from combineX86ShuffleChain for some time and should be addressed first.

Done, see D143859

goldstein.w.n removed a parent revision: D143786: [X86] Add `TuningPreferShiftShuffle` for when Shifts are preferable to shuffles. Feb 12 2023, 2:28 PM

I think this should be controlled by a tuning flag tbh, or even better we do this in a later pass and not in dag/isel

In D143787#4122814, @RKSimon wrote:

I think this should be controlled by a tuning flag tbh, or even better we do this in a later pass and not in dag/isel

Where would you suggest putting it?

Also what is the case for the separate tuning flag? AFAICT it's just better code size and equal or better perf, so it seems like a clear universal win.

pengfei added a reviewer: e-kud. Feb 14 2023, 6:28 PM

Rebase

Harbormaster completed remote builds in B213815: Diff 497560. Feb 15 2023, 12:29 AM

We have a number of cases where a specific instruction is faster on one target than another, or there's no domain switch cost and we can use smaller variants, etc.

This comes to mind as well: https://github.com/llvm/llvm-project/issues/43458

We can use tuning flags, but for many cases it's just confusing and duplicates what our scheduler models already tell us. Plus we end up with many permutations of DAG / isel that don't always work well together, or cause infinite loops etc.

My idea was for a small pass similar to FixupLEA/FixBWI that we can drive from a mixture of the subtarget tuning flags and the scheduler model to decide between various equivalent instruction options based on cost estimates. Shuffle ops lend themselves to this the most, but there's probably others we might consider as well.

Replacing a single instruction with another is trivial - it might be feasible to replace a single instruction with multiple instructions as well (HADD/HSUB expansion, Funnel Shifts, etc.).
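
To make the idea above a little more concrete, here is a small, purely hypothetical sketch of choosing between equivalent instruction options from cost estimates. Nothing here is LLVM API: `Candidate`, its fields and `pickCandidate` are invented for illustration, and in a real pass the costs would presumably come from the scheduler model and the domain-crossing rule from a subtarget tuning flag.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical description of one equivalent instruction option.
struct Candidate {
  double ReciprocalThroughput; // lower is better
  unsigned EncodedBytes;       // lower is better
  bool CrossesDomain;          // e.g. an integer shuffle applied to FP data
};

// Picks the index of the preferred option, or nullopt if none is usable.
// Options that cross execution domains are skipped unless the target has no
// domain-crossing penalty. When optimizing for size, only the encoding
// length is compared; otherwise throughput wins and size breaks ties.
std::optional<std::size_t> pickCandidate(const std::vector<Candidate> &Options,
                                         bool DomainCrossIsFree,
                                         bool OptForSize) {
  std::optional<std::size_t> Best;
  for (std::size_t I = 0; I < Options.size(); ++I) {
    const Candidate &C = Options[I];
    if (C.CrossesDomain && !DomainCrossIsFree)
      continue;
    if (!Best) {
      Best = I;
      continue;
    }
    const Candidate &B = Options[*Best];
    bool Better =
        OptForSize
            ? C.EncodedBytes < B.EncodedBytes
            : (C.ReciprocalThroughput < B.ReciprocalThroughput ||
               (C.ReciprocalThroughput == B.ReciprocalThroughput &&
                C.EncodedBytes < B.EncodedBytes));
    if (Better)
      Best = I;
  }
  return Best;
}
```

The earliest option wins ties, so listing the original instruction first means it is only replaced when an alternative is strictly better under the chosen metric.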

In D143787#4128462, @RKSimon wrote:

We have a number of cases where a specific instruction is faster on one target than another, or there's no domain switch cost and we can use smaller variants, etc.

This comes to mind as well: https://github.com/llvm/llvm-project/issues/43458

We can use tuning flags, but for many cases it's just confusing and duplicates what our scheduler models already tell us. Plus we end up with many permutations of DAG / isel that don't always work well together, or cause infinite loops etc.

My idea was for a small pass similar to FixupLEA/FixBWI that we can drive from a mixture of the subtarget tuning flags and the scheduler model to decide between various equivalent instruction options based on cost estimates. Shuffle ops lend themselves to this the most, but there's probably others we might consider as well.

Replacing a single instruction with another is trivial - it might be feasible to replace a single instruction with multiple instructions as well (HADD/HSUB expansion, Funnel Shifts, etc.).

I see, that makes sense. I'll take a stab at that.

Move logic to a new pass

In D143787#4128462, @RKSimon wrote:

We have a number of cases where a specific instruction is faster on one target than another, or there's no domain switch cost and we can use smaller variants, etc.

This comes to mind as well: https://github.com/llvm/llvm-project/issues/43458

We can use tuning flags, but for many cases it's just confusing and duplicates what our scheduler models already tell us. Plus we end up with many permutations of DAG / isel that don't always work well together, or cause infinite loops etc.

My idea was for a small pass similar to FixupLEA/FixBWI that we can drive from a mixture of the subtarget tuning flags and the scheduler model to decide between various equivalent instruction options based on cost estimates. Shuffle ops lend themselves to this the most, but there's probably others we might consider as well.

Replacing a single instruction with another is trivial - it might be feasible to replace a single instruction with multiple instructions as well (HADD/HSUB expansion, Funnel Shifts, etc.).

Done, only added vpermilps -> vshufps/vpshufd at the moment, but other replacements should be doable.

pengfei added inline comments. Feb 16 2023, 12:17 AM

llvm/lib/Target/X86/X86FixupISel.cpp
27 ↗	(On Diff #497899)	We often use all lowercase letters.
69 ↗	(On Diff #497899)	Lambda function names should start with an uppercase letter.
75 ↗	(On Diff #497899)	ditto.
86 ↗	(On Diff #497899)	Better to use processVPERMILPSri/processVPERMILPSmi
llvm/test/CodeGen/X86/opt-pipeline.ll
205	Is the pass too far away from `ISel`?

goldstein.w.n added inline comments. Feb 16 2023, 12:21 AM

llvm/test/CodeGen/X86/opt-pipeline.ll
205	I think we want it to run well after all other instruction transformations. The important ones are register allocation (so we convert `vpermilpsri` -> `vshufpsrri` at a spill) and domainfixup (so we have the correct instructions).

Fix style

Fixed the nits.

craig.topper added a subscriber: craig.topper. Feb 16 2023, 12:32 AM

craig.topper added inline comments.

llvm/lib/Target/X86/X86FixupISel.cpp
1 ↗	(On Diff #497899)	Referring to ISel and running the pass so far from instruction selection is perhaps misleading.

Harbormaster completed remote builds in B214076: Diff 497908. Feb 16 2023, 2:36 AM

pengfei added inline comments. Feb 16 2023, 2:46 AM

llvm/lib/Target/X86/X86FixupISel.cpp
1 ↗	(On Diff #497899)	I have the same concern; if it is required for `vpermil*`, maybe just name it `FixupVpermil`.

RKSimon added inline comments. Feb 16 2023, 4:36 AM

llvm/lib/Target/X86/X86FixupISel.cpp
1 ↗	(On Diff #497899)	The hope is to add other instruction combos in the future - given we're trying to fixup for a mixture of tuning/scheduler - maybe X86FixupTuning.cpp ?
12 ↗	(On Diff #497908)	What do you mean by analysis? I'm hoping we can use scheduler models in the future to drive some transforms here.
68 ↗	(On Diff #497908)	Comments explaining the exact purpose of these transforms would be useful.
69 ↗	(On Diff #497908)	(style) use bool instead of auto

goldstein.w.n added inline comments. Feb 16 2023, 9:44 AM

llvm/lib/Target/X86/X86FixupISel.cpp
12 ↗	(On Diff #497908)	I mean something like transforming `vpermq ymm` <-> `vpshufd ymm` should not be put in this file, as it's not always valid (it requires analyzing the mask to check that nothing crosses lanes and that the mask is repeated / only moves elements in pairs of 2). OTOH `vpermilps <-> vpshufd` are always interchangeable. I'll make the logic more clear.

RKSimon added inline comments. Feb 16 2023, 10:05 AM

llvm/lib/Target/X86/X86FixupISel.cpp
12 ↗	(On Diff #497908)	Agreed, anything involving valuetracking etc. shouldn't be done this late. However, I don't see any problem with replacing instructions based on immediates that are part of the instruction - we already do this for load folding, commutation, domain switching etc.
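
The distinction drawn in this exchange can be sketched with a scalar model. The following is an illustration only, not code from the patch: it models `vpermq ymm` and `vpshufd ymm` on eight 32-bit elements and shows the kind of immediate inspection needed before the first could be rewritten as the second. `Vec8x32`, `permq`, `pshufd` and `permqToPshufdImm` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

using Vec8x32 = std::array<uint32_t, 8>;

// Scalar model of `vpermq ymm, ymm, imm8`, viewing the register as eight
// dwords: each 2-bit field selects any of the four source qwords.
Vec8x32 permq(const Vec8x32 &Src, uint8_t Imm) {
  Vec8x32 Dst{};
  for (unsigned Q = 0; Q < 4; ++Q) {
    unsigned Sel = (Imm >> (2 * Q)) & 3;
    Dst[2 * Q] = Src[2 * Sel];
    Dst[2 * Q + 1] = Src[2 * Sel + 1];
  }
  return Dst;
}

// Scalar model of `vpshufd ymm, ymm, imm8`: the same immediate is applied
// independently to each 128-bit lane.
Vec8x32 pshufd(const Vec8x32 &Src, uint8_t Imm) {
  Vec8x32 Dst{};
  for (unsigned Lane = 0; Lane < 2; ++Lane)
    for (unsigned I = 0; I < 4; ++I)
      Dst[4 * Lane + I] = Src[4 * Lane + ((Imm >> (2 * I)) & 3)];
  return Dst;
}

// Returns the equivalent `vpshufd` immediate, or nullopt if the `vpermq`
// immediate crosses lanes or does not repeat the same pattern in both lanes.
std::optional<uint8_t> permqToPshufdImm(uint8_t Imm) {
  unsigned S0 = Imm & 3, S1 = (Imm >> 2) & 3;
  unsigned S2 = (Imm >> 4) & 3, S3 = (Imm >> 6) & 3;
  // The low lane may only read qwords 0/1, and the high lane must read the
  // matching qwords 2/3.
  if (S0 > 1 || S1 > 1 || S2 != S0 + 2 || S3 != S1 + 2)
    return std::nullopt;
  // Each selected qword becomes a pair of adjacent dword selectors.
  unsigned D = (2 * S0) | ((2 * S0 + 1) << 2) | ((2 * S1) << 4) |
               ((2 * S1 + 1) << 6);
  return static_cast<uint8_t>(D);
}
```

Only 4 of the 256 `vpermq` immediates pass this check, whereas the `vpermilps` rewrite needs no inspection of the immediate at all. That difference is what the "always interchangeable" criterion is getting at.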

goldstein.w.n marked 2 inline comments as done. Feb 16 2023, 10:29 AM

goldstein.w.n added inline comments.

llvm/lib/Target/X86/X86FixupISel.cpp
1 ↗	(On Diff #497899)	Renamed to `X86FixupInstTuning.cpp` (though `FixupTuning` was a bit generic about what exactly is being tuned).
12 ↗	(On Diff #497908)	It's not inherently wrong, but my feeling is that it belongs in X86ISelLowering/X86ISelDAGToDAG/td. I think it would work for a few instructions, but it seems like the kind of thing that will explode the complexity of the file over time (even the shuffle logic within X86ISelLowering is extremely complex/convoluted, even with the clean APIs the DAG has). I think that kind of complexity in a file operating on machine instrs will be bug-prone. Generally, we could do it here, but I don't think we should.

Fix some nits, rename

goldstein.w.n retitled this revision from [X86] Add new pass `X86FixupISel` for fixing up machine-instruction selection. to [X86] Add new pass `X86FixupInstTuning` for fixing up machine-instruction selection. Feb 16 2023, 10:30 AM

Harbormaster completed remote builds in B214198: Diff 498072. Feb 16 2023, 12:24 PM

Matt added a subscriber: Matt. Feb 20 2023, 1:18 PM

Support for tput/code size decisions.
handle unpck/movhlps

goldstein.w.n added a parent revision: D144442: [X86] Add tests for replacing `{v}unpck{l|h}pd` -> `{v}shufps`; NFC. Feb 20 2023, 6:12 PM

goldstein.w.n removed a parent revision: D143859: [X86] Adding tuning flags for int <-> fp domain switching penalties; NFC.

Harbormaster completed remote builds in B214892: Diff 499000. Feb 20 2023, 6:12 PM

Rebase

Harbormaster completed remote builds in B214914: Diff 499025. Feb 20 2023, 10:29 PM

Rebase

Harbormaster completed remote builds in B215173: Diff 499384. Feb 22 2023, 12:26 AM

Probably better for this first patch to add the pass and the initial vpermilps -> vshufps/vpshufd fold; the scheduler-based unpckpd fold can be added in a follow-up.

llvm/lib/Target/X86/X86FixupInstTuning.cpp
85	It'd be better to return Optional<double> instead of 0.0 for failed matches (and return nullopt here)

Remove unpckpd -> shufps and the target sched info stuff

In D143787#4144121, @RKSimon wrote:

Probably better for this first patch to add the pass and the initial vpermilps -> vshufps/vpshufd fold; the scheduler-based unpckpd fold can be added in a follow-up.

Done, the version with sched info + the unpck transform is at: D144570

llvm/lib/Target/X86/X86FixupInstTuning.cpp
85	Done in new version: D144570

goldstein.w.n added a child revision: D144570: [X86] Add support for using Sched/Codesize information to `X86FixupInstTuning` Pass. Feb 22 2023, 9:15 AM

goldstein.w.n removed a child revision: D143788: [X86] Add more tests for promoting `blendw` -> `blendd`; NFC.

Harbormaster completed remote builds in B215286: Diff 499545. Feb 22 2023, 11:56 AM

Rebase

Harbormaster completed remote builds in B215403: Diff 499695. Feb 22 2023, 8:00 PM

LGTM with a few minors - thank you for working on this!

llvm/lib/Target/X86/X86FixupInstTuning.cpp
20	VPERMILPS xmm?
95	Maybe add a TODO for always changing for Os/Oz builds?
101	Are you intending to support the predicated variants as well? Maybe add a TODO?

This revision is now accepted and ready to land. Feb 23 2023, 6:08 PM

goldstein.w.n marked 2 inline comments as done. Feb 23 2023, 8:10 PM

goldstein.w.n added inline comments.

llvm/lib/Target/X86/X86FixupInstTuning.cpp
101	What do you mean? For `VPERMILPS` all variants are supported I believe.

Add some todos

Harbormaster completed remote builds in B215663: Diff 500062. Feb 23 2023, 10:16 PM

RKSimon added inline comments. Feb 24 2023, 3:10 AM

llvm/lib/Target/X86/X86FixupInstTuning.cpp
101	X86::VPERMILPSZrikz etc.

LGTM. I second that masked versions should be handled as well. I thought they had TP equal to perms, but I've double-checked and it seems that masked shuffles have TP=0.5 compared to perms.

goldstein.w.n added inline comments. Feb 24 2023, 9:52 AM

llvm/lib/Target/X86/X86FixupInstTuning.cpp
101	Oh sure, I'll add a TODO for now and make a new patch for the predicated versions later today.

Add TODO for masked predicates

In D143787#4150522, @e-kud wrote:

LGTM. I second that masked versions should be handled as well. I thought they had TP equal to perms, but I've double-checked and it seems that masked shuffles have TP=0.5 compared to perms.

@RKSimon added masked predicates at D144763

Something I just noticed - please can you ensure you have AVX1 test coverage for the VPERMILPSmi/VPERMILPSYmi cases - we can't fold the VPERMILPSYmi case to VPSHUFDYmi, which requires AVX2.

llvm/lib/Target/X86/X86FixupInstTuning.cpp
116	VPSHUFDYmi? I think this needs an AVX2 check as well
118	VPSHUFDmi

This revision now requires changes to proceed. Feb 24 2023, 3:36 PM

goldstein.w.n added inline comments. Feb 24 2023, 3:41 PM

llvm/lib/Target/X86/X86FixupInstTuning.cpp
116	Good catch. Do you know the API for checking if an instruction is supported on the target? Or will it need to be manually done case by case?

In D143787#4151630, @RKSimon wrote:

Something I just noticed - please can you ensure you have AVX1 test coverage for the VPERMILPSmi/VPERMILPSYmi cases - we can't fold the VPERMILPSYmi case to VPSHUFDYmi, which requires AVX2.

I'll add tests similar to what we have for unpckpd

Add missing AVX2 check and fix ymm/xmm substitutions

Harbormaster completed remote builds in B215894: Diff 500375. Feb 24 2023, 11:31 PM

goldstein.w.n added a parent revision: D144779: [X86] Add tests for replacing `{v}permilps` -> `{v}shufps/{v}pshufd`; NFC. Feb 24 2023, 11:34 PM

goldstein.w.n removed a parent revision: D144442: [X86] Add tests for replacing `{v}unpck{l|h}pd` -> `{v}shufps`; NFC. Feb 24 2023, 11:37 PM

goldstein.w.n marked 2 inline comments as done. Feb 24 2023, 11:39 PM

goldstein.w.n added inline comments.

llvm/lib/Target/X86/X86FixupInstTuning.cpp
116	Done, did just a manual check here, but if you know an API for querying if an opcode is supported that would probably be better going forward. Added tests for the `permilps` transforms here: D144779 and the AVX2 requirement is tested.

I don't think there exists such an API. We don't record such information for each instruction.

Rebase

Harbormaster completed remote builds in B215978: Diff 500476. Feb 25 2023, 6:53 PM

RKSimon mentioned this in D143788: [X86] Add more tests for promoting `blendw` -> `blendd`; NFC. Feb 26 2023, 2:46 AM

In D143787#4152348, @pengfei wrote:

I don't think there exists such an API. We don't record such information for each instruction.

By this point we've lost that info - it's embedded in the isel tables but not much else - so you'll have to do it manually for the AVX1/AVX2 cases
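
A manual, case-by-case check of the kind described here might look roughly like the sketch below. It is an illustration under stated assumptions, not the code that landed: `Opcode`, `SubtargetInfo` and `replacementFor` are stand-ins invented for this example rather than the real `X86::` opcode enum or `X86Subtarget`.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-ins for the real opcode enum and subtarget queries.
enum class Opcode { VPERMILPSmi, VPERMILPSYmi, VPSHUFDmi, VPSHUFDYmi };

struct SubtargetInfo {
  bool HasAVX2 = false;
  bool NoDomainDelayShuffle = false;
};

// Returns the integer-domain replacement for a memory-operand `vpermilps`,
// or nullopt if the instruction should be left alone on this subtarget.
std::optional<Opcode> replacementFor(Opcode Opc, const SubtargetInfo &ST) {
  // Switching to the integer shuffle only pays off when crossing execution
  // domains carries no penalty for shuffles.
  if (!ST.NoDomainDelayShuffle)
    return std::nullopt;
  switch (Opc) {
  case Opcode::VPERMILPSmi:
    // The 128-bit `vpshufd` is already available with AVX1.
    return Opcode::VPSHUFDmi;
  case Opcode::VPERMILPSYmi:
    // The 256-bit `vpshufd` requires AVX2, so it is checked by hand.
    if (!ST.HasAVX2)
      return std::nullopt;
    return Opcode::VPSHUFDYmi;
  default:
    return std::nullopt;
  }
}
```

Keeping the feature check next to each opcode pair is verbose, but as noted above there is no per-instruction record to query this late in the pipeline.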

LGTM

This revision is now accepted and ready to land. Feb 26 2023, 3:55 AM

goldstein.w.n mentioned this in D144442: [X86] Add tests for replacing `{v}unpck{l|h}pd` -> `{v}shufps`; NFC. Feb 26 2023, 10:47 AM

goldstein.w.n edited child revisions, added: D144442: [X86] Add tests for replacing `{v}unpck{l|h}pd` -> `{v}shufps`; NFC; removed: D144570: [X86] Add support for using Sched/Codesize information to `X86FixupInstTuning` Pass. Feb 26 2023, 10:52 AM

goldstein.w.n mentioned this in D144779: [X86] Add tests for replacing `{v}permilps` -> `{v}shufps/{v}pshufd`; NFC. Feb 26 2023, 12:06 PM

Rebase

Harbormaster completed remote builds in B216098: Diff 500614. Feb 26 2023, 1:21 PM

Rebase

Harbormaster completed remote builds in B216108: Diff 500626. Feb 26 2023, 3:41 PM

Rebase (after D144832, no dep on D143786)

Harbormaster completed remote builds in B216352: Diff 500962. Feb 27 2023, 4:20 PM

This revision was landed with ongoing or failed builds. Feb 27 2023, 4:54 PM

Closed by commit rG69a322fed19b: Add new pass `X86FixupInstTuning` for fixing up machine-instruction selection. (authored by goldstein.w.n).

This revision was automatically updated to reflect the committed changes.

goldstein.w.n added a commit: rG69a322fed19b: Add new pass `X86FixupInstTuning` for fixing up machine-instruction selection.

RKSimon mentioned this in D148999: [X86] X86FixupInstTunings - add VPERMILPDri -> VSHUFPDrri mapping. Apr 22 2023, 10:25 AM

RKSimon mentioned this in rGe9f9467da063: [X86] X86FixupInstTunings - add VPERMILPDri -> VSHUFPDrri mapping. Apr 23 2023, 3:49 AM

Large Diff

This large diff affects 154 files. Files without inline comments have been collapsed.

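Revision Contents

Path	Size
llvm/lib/Target/X86/CMakeLists.txt	1 line
llvm/lib/Target/X86/X86.h	5 lines
llvm/lib/Target/X86/X86FixupInstTuning.cpp	146 lines
llvm/lib/Target/X86/X86TargetMachine.cpp	1 line
llvm/test/CodeGen/X86/2012-01-12-extract-sv.ll	2 lines
llvm/test/CodeGen/X86/SwizzleShuff.ll	4 lines
llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast.ll	4 lines
llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast_from_memory.ll	4 lines
llvm/test/CodeGen/X86/avx-intrinsics-fast-isel.ll	12 lines
llvm/test/CodeGen/X86/avx-intrinsics-x86-upgrade.ll	8 lines
llvm/test/CodeGen/X86/avx-splat.ll	4 lines
llvm/test/CodeGen/X86/avx-vbroadcast.ll	6 lines
llvm/test/CodeGen/X86/avx-vinsertf128.ll	2 lines
llvm/test/CodeGen/X86/avx-vperm2x128.ll	2 lines
llvm/test/CodeGen/X86/avx2-intrinsics-fast-isel.ll	2 lines
llvm/test/CodeGen/X86/avx512-cvt.ll	4 lines
llvm/test/CodeGen/X86/avx512-intrinsics-fast-isel.ll	4 lines
llvm/test/CodeGen/X86/avx512-intrinsics-upgrade.ll	4 lines
llvm/test/CodeGen/X86/avx512-shuffles/in_lane_permute.ll	8 lines
llvm/test/CodeGen/X86/avx512-shuffles/shuffle.ll	12 lines
llvm/test/CodeGen/X86/avx512-trunc.ll	4 lines
llvm/test/CodeGen/X86/avx512-vec-cmp.ll	2 lines
llvm/test/CodeGen/X86/avx512fp16-mov.ll	6 lines
llvm/test/CodeGen/X86/avx512fp16-mscatter.ll	4 lines
llvm/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll	8 lines
llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll	6 lines
llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll	6 lines
llvm/test/CodeGen/X86/bitcast-int-to-vector-bool.ll	2 lines
llvm/test/CodeGen/X86/buildvec-extract.ll	6 lines
llvm/test/CodeGen/X86/combine-and.ll	4 lines
llvm/test/CodeGen/X86/combine-concatvectors.ll	2 lines
llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll	12 lines
llvm/test/CodeGen/X86/extract-concat.ll	4 lines
llvm/test/CodeGen/X86/extract-store.ll	2 lines
llvm/test/CodeGen/X86/fdiv-combine-vec.ll	8 lines
llvm/test/CodeGen/X86/fmaddsub-combine.ll	28 lines
llvm/test/CodeGen/X86/haddsub-2.ll	4 lines
llvm/test/CodeGen/X86/haddsub-4.ll	32 lines
llvm/test/CodeGen/X86/haddsub-undef.ll	26 lines
llvm/test/CodeGen/X86/haddsub.ll	18 lines
llvm/test/CodeGen/X86/horizontal-reduce-smax.ll	16 lines
llvm/test/CodeGen/X86/horizontal-reduce-smin.ll	16 lines
llvm/test/CodeGen/X86/horizontal-reduce-umax.ll	18 lines
llvm/test/CodeGen/X86/horizontal-reduce-umin.ll	18 lines
llvm/test/CodeGen/X86/horizontal-shuffle-2.ll	10 lines
llvm/test/CodeGen/X86/horizontal-shuffle-3.ll	8 lines
llvm/test/CodeGen/X86/horizontal-shuffle-4.ll	2 lines
llvm/test/CodeGen/X86/horizontal-sum.ll	24 lines
llvm/test/CodeGen/X86/i64-to-float.ll	4 lines
llvm/test/CodeGen/X86/insertelement-var-index.ll	8 lines
llvm/test/CodeGen/X86/known-bits-vector.ll	24 lines
llvm/test/CodeGen/X86/known-signbits-vector.ll	16 lines
llvm/test/CodeGen/X86/masked_store.ll	2 lines
llvm/test/CodeGen/X86/masked_store_trunc.ll	2 lines
llvm/test/CodeGen/X86/masked_store_trunc_ssat.ll	6 lines
llvm/test/CodeGen/X86/masked_store_trunc_usat.ll	6 lines
llvm/test/CodeGen/X86/matrix-multiply.ll	168 lines
llvm/test/CodeGen/X86/oddshuffles.ll	52 lines
llvm/test/CodeGen/X86/opt-pipeline.ll	1 line
llvm/test/CodeGen/X86/packss.ll	4 lines
llvm/test/CodeGen/X86/palignr.ll	2 lines
llvm/test/CodeGen/X86/pr31956.ll	2 lines
llvm/test/CodeGen/X86/pr40730.ll	4 lines
llvm/test/CodeGen/X86/pr40811.ll	4 lines
llvm/test/CodeGen/X86/pr50609.ll	2 lines
llvm/test/CodeGen/X86/rotate_vec.ll	2 lines
llvm/test/CodeGen/X86/scalarize-fp.ll	22 lines
llvm/test/CodeGen/X86/shuffle-of-shift.ll	8 lines
llvm/test/CodeGen/X86/shuffle-of-splat-multiuses.ll	10 lines
llvm/test/CodeGen/X86/sse-fsignum.ll	4 lines
llvm/test/CodeGen/X86/sse-intrinsics-fast-isel.ll	24 lines
llvm/test/CodeGen/X86/sse2-intrinsics-fast-isel.ll	4 lines
llvm/test/CodeGen/X86/sse2-intrinsics-x86-upgrade.ll	4 lines
llvm/test/CodeGen/X86/sse2.ll	8 lines
llvm/test/CodeGen/X86/sse3-avx-addsub-2.ll	22 lines
llvm/test/CodeGen/X86/sse41.ll	44 lines
llvm/test/CodeGen/X86/swizzle-avx2.ll	4 lines
llvm/test/CodeGen/X86/tuning-shuffle-permilps-avx512.ll	81 lines
llvm/test/CodeGen/X86/tuning-shuffle-permilps.ll	55 lines
llvm/test/CodeGen/X86/vec-strict-fptoint-256.ll	18 lines
llvm/test/CodeGen/X86/vec-strict-fptoint-512.ll	12 lines
llvm/test/CodeGen/X86/vec-strict-inttofp-128.ll	24 lines
llvm/test/CodeGen/X86/vec-strict-inttofp-256.ll	16 lines
llvm/test/CodeGen/X86/vec-strict-inttofp-512.ll	32 lines
llvm/test/CodeGen/X86/vec_fp_to_int.ll	32 lines
llvm/test/CodeGen/X86/vec_int_to_fp.ll	24 lines
llvm/test/CodeGen/X86/vec_umulo.ll	4 lines
llvm/test/CodeGen/X86/vector-fshr-256.ll	2 lines
llvm/test/CodeGen/X86/vector-half-conversions.ll	36 lines
llvm/test/CodeGen/X86/vector-interleave.ll	12 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-5.ll	14 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-7.ll	2 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-8.ll	42 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-2.ll	4 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-3.ll	128 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-5.ll	92 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-6.ll	1036 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-7.ll	478 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-8.ll	344 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-3.ll	62 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-7.ll	14 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-8.ll	8 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-2.ll	4 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-3.ll	140 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-4.ll	140 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-5.ll	878 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-6.ll	334 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-7.ll	798 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-8.ll	440 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-3.ll	62 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-5.ll	22 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-7.ll	76 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-6.ll	4 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-8.ll	120 lines
llvm/test/CodeGen/X86/vector-reduce-add-mask.ll	2 lines
llvm/test/CodeGen/X86/vector-reduce-and-cmp.ll	28 lines
llvm/test/CodeGen/X86/vector-reduce-and.ll	28 lines
llvm/test/CodeGen/X86/vector-reduce-fadd.ll	112 lines
llvm/test/CodeGen/X86/vector-reduce-fmax.ll	36 lines
llvm/test/CodeGen/X86/vector-reduce-fmin.ll	36 lines
llvm/test/CodeGen/X86/vector-reduce-fmul.ll	84 lines
llvm/test/CodeGen/X86/vector-reduce-or.ll	28 lines
llvm/test/CodeGen/X86/vector-reduce-smax.ll	12 lines
llvm/test/CodeGen/X86/vector-reduce-smin.ll	12 lines
llvm/test/CodeGen/X86/vector-reduce-umax.ll	12 lines
llvm/test/CodeGen/X86/vector-reduce-umin.ll	12 lines
llvm/test/CodeGen/X86/vector-reduce-xor.ll	28 lines
llvm/test/CodeGen/X86/vector-sext.ll	2 lines
llvm/test/CodeGen/X86/vector-shift-lshr-128.ll	4 lines
llvm/test/CodeGen/X86/vector-shift-lshr-256.ll	6 lines
llvm/test/CodeGen/X86/vector-shift-shl-256.ll	6 lines
llvm/test/CodeGen/X86/vector-shuffle-128-v2.ll	12 lines
llvm/test/CodeGen/X86/vector-shuffle-128-v4.ll	72 lines
llvm/test/CodeGen/X86/vector-shuffle-128-v8.ll	8 lines
llvm/test/CodeGen/X86/vector-shuffle-256-v16.ll	50 lines
llvm/test/CodeGen/X86/vector-shuffle-256-v32.ll	2 lines
llvm/test/CodeGen/X86/vector-shuffle-256-v4.ll	28 lines
llvm/test/CodeGen/X86/vector-shuffle-256-v8.ll	392 lines
llvm/test/CodeGen/X86/vector-shuffle-512-v16.ll	10 lines
llvm/test/CodeGen/X86/vector-shuffle-512-v8.ll	6 lines
llvm/test/CodeGen/X86/vector-shuffle-avx512.ll	4 lines
llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll	10 lines
llvm/test/CodeGen/X86/vector-shuffle-combining-avx2.ll	2 lines
llvm/test/CodeGen/X86/vector-shuffle-combining-avx512f.ll	2 lines
llvm/test/CodeGen/X86/vector-shuffle-combining-ssse3.ll	4 lines
llvm/test/CodeGen/X86/vector-shuffle-combining.ll	84 lines
llvm/test/CodeGen/X86/vector-shuffle-concatenation.ll	16 lines
llvm/test/CodeGen/X86/vector-trunc-ssat.ll	12 lines
llvm/test/CodeGen/X86/vector-trunc-usat.ll	12 lines
llvm/test/CodeGen/X86/vselect-avx.ll	8 lines
llvm/test/CodeGen/X86/x86-interleaved-access.ll	16 lines
llvm/test/CodeGen/X86/zero_extend_vector_inreg_of_broadcast.ll	8 lines
llvm/test/CodeGen/X86/zero_extend_vector_inreg_of_broadcast_from_memory.ll	10 lines
llvm/utils/gn/secondary/llvm/lib/Target/X86/BUILD.gn	1 line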

Diff 500971

llvm/lib/Target/X86/CMakeLists.txt

llvm/lib/Target/X86/X86.h

llvm/lib/Target/X86/X86FixupInstTuning.cpp

This file was added.

				//===-- X86FixupInstTuning.cpp - replace instructions ------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file does a tuning pass replacing slower machine instructions
				// with faster ones. We do this here, as opposed to during normal ISel, as
				// attempting to get the "right" instruction can break patterns. This pass
				// is not meant to search for special cases where an instruction can be
				// transformed to another; it is only meant to do transformations where the
				// old instruction is always replaceable with the new instruction. For example:
				//
				// `vpermq ymm` -> `vpshufd ymm`
				// -- BAD, not always valid (lane cross/non-repeated mask)
				//
				// `vpermilps ymm` -> `vpshufd ymm`
				// -- GOOD, always replaceable
				//
				//===----------------------------------------------------------------------===//

				#include "X86.h"
				#include "X86InstrInfo.h"
				#include "X86Subtarget.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"
				#include "llvm/CodeGen/MachineRegisterInfo.h"

				using namespace llvm;

				#define DEBUG_TYPE "x86-fixup-inst-tuning"

				STATISTIC(NumInstChanges, "Number of instruction changes");

				namespace {
				class X86FixupInstTuningPass : public MachineFunctionPass {
				public:
				static char ID;

				X86FixupInstTuningPass() : MachineFunctionPass(ID) {}

				StringRef getPassName() const override { return "X86 Fixup Inst Tuning"; }

				bool runOnMachineFunction(MachineFunction &MF) override;
				bool processInstruction(MachineFunction &MF, MachineBasicBlock &MBB,
				MachineBasicBlock::iterator &I);

				// This pass runs after regalloc and doesn't support VReg operands.
				MachineFunctionProperties getRequiredProperties() const override {
				return MachineFunctionProperties().set(
				MachineFunctionProperties::Property::NoVRegs);
				}

				private:
				const X86InstrInfo *TII = nullptr;
				const X86Subtarget *ST = nullptr;
				};
				} // end anonymous namespace

				char X86FixupInstTuningPass::ID = 0;

				INITIALIZE_PASS(X86FixupInstTuningPass, DEBUG_TYPE, DEBUG_TYPE, false, false)

				FunctionPass *llvm::createX86FixupInstTuning() {
				return new X86FixupInstTuningPass();
				}

				bool X86FixupInstTuningPass::processInstruction(
				MachineFunction &MF, MachineBasicBlock &MBB,
				MachineBasicBlock::iterator &I) {
				MachineInstr &MI = *I;
				unsigned Opc = MI.getOpcode();
				unsigned NumOperands = MI.getDesc().getNumOperands();

				// `vpermilps r, i` -> `vshufps r, r, i`
				// `vshufps` is always as fast or faster than `vpermilps` and takes 1 less
				// byte of code size.
				auto ProcessVPERMILPSri = [&](unsigned NewOpc) -> bool {
				unsigned MaskImm = MI.getOperand(NumOperands - 1).getImm();
				MI.removeOperand(NumOperands - 1);
				MI.addOperand(MI.getOperand(1));
				MI.setDesc(TII->get(NewOpc));
				MI.addOperand(MachineOperand::CreateImm(MaskImm));
				return true;
				};

				// `vpermilps m, i` -> `vpshufd m, i` iff no domain delay penalty on shuffles.
				// `vpshufd` is always as fast or faster than `vpermilps` and takes 1 less
				// byte of code size.
				auto ProcessVPERMILPSmi = [&](unsigned NewOpc) -> bool {
				// TODO: Might be worth adding bypass delay if -Os/-Oz is enabled as
				// `vpshufd` saves a byte of code size.
				RKSimon (Done): Maybe add a TODO for always changing for Os/Oz builds?
				    if (!ST->hasNoDomainDelayShuffle())
				      return false;
				    MI.setDesc(TII->get(NewOpc));
				    return true;
				  };

				RKSimon (Not Done): Are you intending to support the predicated variants as well? Maybe add a TODO?
				goldstein.w.n (Author, Not Done): What do you mean? For `VPERMILPS` all variants are supported I believe.
				RKSimon (Not Done): X86::VPERMILPSZrikz etc.
				goldstein.w.n (Author, Done): Oh sure, I'll add a TODO for now and make a new patch for the predicated versions later today.
				  // TODO: Add masked predicate execution variants.
				  switch (Opc) {
				  case X86::VPERMILPSri:
				    return ProcessVPERMILPSri(X86::VSHUFPSrri);
				  case X86::VPERMILPSYri:
				    return ProcessVPERMILPSri(X86::VSHUFPSYrri);
				  case X86::VPERMILPSZ128ri:
				    return ProcessVPERMILPSri(X86::VSHUFPSZ128rri);
				  case X86::VPERMILPSZ256ri:
				    return ProcessVPERMILPSri(X86::VSHUFPSZ256rri);
				  case X86::VPERMILPSZri:
				    return ProcessVPERMILPSri(X86::VSHUFPSZrri);
				  case X86::VPERMILPSmi:
				    return ProcessVPERMILPSmi(X86::VPSHUFDmi);
				  case X86::VPERMILPSYmi:
				RKSimon (Done): VPSHUFDYmi? I think this needs an AVX2 check as well
				goldstein.w.n (Author, Done): Good catch. Do you know the API for checking if an instruction is supported on the target? Or will it need to be manually done case by case?
				goldstein.w.n (Author, Done):
				> VPSHUFDYmi? I think this needs an AVX2 check as well
				Done, did just a manual check here, but if you know an API for querying if an opcode is supported that would probably be better going forward. Added tests for the `permilps` transforms here: D144779 and the AVX2 requirement is tested.
				    // TODO: See if there is a more generic way we can test if the replacement
				    // instruction is supported.
				RKSimon (Done): VPSHUFDmi
				    return ST->hasAVX2() ? ProcessVPERMILPSmi(X86::VPSHUFDYmi) : false;
				  case X86::VPERMILPSZ128mi:
				    return ProcessVPERMILPSmi(X86::VPSHUFDZ128mi);
				  case X86::VPERMILPSZ256mi:
				    return ProcessVPERMILPSmi(X86::VPSHUFDZ256mi);
				  case X86::VPERMILPSZmi:
				    return ProcessVPERMILPSmi(X86::VPSHUFDZmi);
				  default:
				    return false;
				  }
				}
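
The register-form rewrite above relies on `vshufps` with both sources set to the same register selecting the same elements as `vpermilps` for any immediate. A minimal scalar sketch of that equivalence for one 128-bit lane follows. The names `Vec4`, `permilps`, `shufps`, and `shufpsMatchesPermilps` are illustrative helpers, not part of the patch or of LLVM.

```cpp
#include <array>
#include <cstdint>

// One 128-bit lane modelled as four floats (illustrative type, not LLVM's).
using Vec4 = std::array<float, 4>;

// Scalar model of `vpermilps dst, src, imm8`: every destination element is
// picked from Src by its own 2-bit field of the immediate.
Vec4 permilps(const Vec4 &Src, uint8_t Imm) {
  Vec4 Dst{};
  for (int I = 0; I < 4; ++I)
    Dst[I] = Src[(Imm >> (2 * I)) & 3];
  return Dst;
}

// Scalar model of `vshufps dst, a, b, imm8`: the low two elements come from
// A, the high two from B, each selected by a 2-bit field of the immediate.
Vec4 shufps(const Vec4 &A, const Vec4 &B, uint8_t Imm) {
  return {A[Imm & 3], A[(Imm >> 2) & 3], B[(Imm >> 4) & 3], B[(Imm >> 6) & 3]};
}

// Checks all 256 immediates: with A == B the two models must agree.
bool shufpsMatchesPermilps() {
  const Vec4 Src = {1.0f, 2.0f, 3.0f, 4.0f};
  for (int Imm = 0; Imm < 256; ++Imm) {
    const uint8_t Mask = static_cast<uint8_t>(Imm);
    if (permilps(Src, Mask) != shufps(Src, Src, Mask))
      return false;
  }
  return true;
}
```

The wider 256-bit and 512-bit forms apply the same selection independently in each 128-bit lane, so the single-lane check is representative of the `Y` and `Z` opcodes in the switch.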

				bool X86FixupInstTuningPass::runOnMachineFunction(MachineFunction &MF) {
				  LLVM_DEBUG(dbgs() << "Start X86FixupInstTuning\n";);
				  bool Changed = false;
				  ST = &MF.getSubtarget<X86Subtarget>();
				  TII = ST->getInstrInfo();
				  for (MachineBasicBlock &MBB : MF) {
				    for (MachineBasicBlock::iterator I = MBB.begin(); I != MBB.end(); ++I) {
				      if (processInstruction(MF, MBB, I)) {
				        ++NumInstChanges;
				        Changed = true;
				      }
				    }
				  }
				  LLVM_DEBUG(dbgs() << "End X86FixupInstTuning\n";);
				  return Changed;
				}
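
The gating discussed in the inline comments for the memory forms can be summarised as a small decision function. This is only a sketch: `Opc`, `SubtargetModel`, and `pickMemFormReplacement` are hypothetical stand-ins for the real opcode enum, `X86Subtarget`, and the `ProcessVPERMILPSmi` lambda. It covers just the VEX 128-bit and 256-bit cases.

```cpp
// Hypothetical stand-ins for the X86 opcodes touched by the memory-form cases.
enum class Opc { VPERMILPSmi, VPERMILPSYmi, VPSHUFDmi, VPSHUFDYmi, Other };

// Hypothetical model of the two subtarget queries the patch relies on.
struct SubtargetModel {
  bool HasAVX2;              // models ST->hasAVX2()
  bool NoDomainDelayShuffle; // models ST->hasNoDomainDelayShuffle()
};

// Returns true and sets New when the rewrite is allowed; otherwise returns
// false and leaves New untouched.
bool pickMemFormReplacement(Opc Old, const SubtargetModel &ST, Opc &New) {
  // Switching to the integer domain is only done when it costs no bypass delay.
  if (!ST.NoDomainDelayShuffle)
    return false;
  switch (Old) {
  case Opc::VPERMILPSmi:
    New = Opc::VPSHUFDmi; // 128-bit VEX vpshufd is available with plain AVX
    return true;
  case Opc::VPERMILPSYmi:
    if (!ST.HasAVX2) // 256-bit integer shuffles need AVX2
      return false;
    New = Opc::VPSHUFDYmi;
    return true;
  default:
    return false;
  }
}
```

In the patch the AVX2 check sits in the `switch` and the domain-delay check inside the lambda; the order is swapped here for brevity, which does not change the result.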

llvm/lib/Target/X86/X86TargetMachine.cpp
llvm/test/CodeGen/X86/2012-01-12-extract-sv.ll
llvm/test/CodeGen/X86/SwizzleShuff.ll
llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast.ll
llvm/test/CodeGen/X86/any_extend_vector_inreg_of_broadcast_from_memory.ll
llvm/test/CodeGen/X86/avx-intrinsics-fast-isel.ll
llvm/test/CodeGen/X86/avx-intrinsics-x86-upgrade.ll
llvm/test/CodeGen/X86/avx-splat.ll
llvm/test/CodeGen/X86/avx-vbroadcast.ll
llvm/test/CodeGen/X86/avx-vinsertf128.ll
llvm/test/CodeGen/X86/avx-vperm2x128.ll
llvm/test/CodeGen/X86/avx2-intrinsics-fast-isel.ll
llvm/test/CodeGen/X86/avx512-cvt.ll
llvm/test/CodeGen/X86/avx512-intrinsics-fast-isel.ll
llvm/test/CodeGen/X86/avx512-intrinsics-upgrade.ll
llvm/test/CodeGen/X86/avx512-shuffles/in_lane_permute.ll
llvm/test/CodeGen/X86/avx512-shuffles/shuffle.ll
llvm/test/CodeGen/X86/avx512-trunc.ll
llvm/test/CodeGen/X86/avx512-vec-cmp.ll
llvm/test/CodeGen/X86/avx512fp16-mov.ll
llvm/test/CodeGen/X86/avx512fp16-mscatter.ll
llvm/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll
llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-sext.ll
llvm/test/CodeGen/X86/bitcast-int-to-vector-bool-zext.ll
llvm/test/CodeGen/X86/bitcast-int-to-vector-bool.ll
llvm/test/CodeGen/X86/buildvec-extract.ll
llvm/test/CodeGen/X86/combine-and.ll
llvm/test/CodeGen/X86/combine-concatvectors.ll
llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll
llvm/test/CodeGen/X86/extract-concat.ll
llvm/test/CodeGen/X86/extract-store.ll
llvm/test/CodeGen/X86/fdiv-combine-vec.ll
llvm/test/CodeGen/X86/fmaddsub-combine.ll
llvm/test/CodeGen/X86/haddsub-2.ll
llvm/test/CodeGen/X86/haddsub-4.ll
llvm/test/CodeGen/X86/haddsub-undef.ll
llvm/test/CodeGen/X86/haddsub.ll
llvm/test/CodeGen/X86/horizontal-reduce-smax.ll
llvm/test/CodeGen/X86/horizontal-reduce-smin.ll
llvm/test/CodeGen/X86/horizontal-reduce-umax.ll
llvm/test/CodeGen/X86/horizontal-reduce-umin.ll
llvm/test/CodeGen/X86/horizontal-shuffle-2.ll
llvm/test/CodeGen/X86/horizontal-shuffle-3.ll
llvm/test/CodeGen/X86/horizontal-shuffle-4.ll
llvm/test/CodeGen/X86/horizontal-sum.ll
llvm/test/CodeGen/X86/i64-to-float.ll
llvm/test/CodeGen/X86/insertelement-var-index.ll
llvm/test/CodeGen/X86/known-bits-vector.ll
llvm/test/CodeGen/X86/known-signbits-vector.ll
llvm/test/CodeGen/X86/masked_store.ll
llvm/test/CodeGen/X86/masked_store_trunc.ll
llvm/test/CodeGen/X86/masked_store_trunc_ssat.ll
llvm/test/CodeGen/X86/masked_store_trunc_usat.ll
llvm/test/CodeGen/X86/matrix-multiply.ll
llvm/test/CodeGen/X86/oddshuffles.ll

llvm/test/CodeGen/X86/opt-pipeline.ll

	; CHECK-NEXT: X86 vzeroupper inserter
	; CHECK-NEXT: MachineDominator Tree Construction
	; CHECK-NEXT: Machine Natural Loop Construction
	; CHECK-NEXT: Lazy Machine Block Frequency Analysis
	; CHECK-NEXT: X86 Byte/Word Instruction Fixup
	; CHECK-NEXT: Lazy Machine Block Frequency Analysis
	; CHECK-NEXT: X86 Atom pad short functions
	; CHECK-NEXT: X86 LEA Fixup
				; CHECK-NEXT: X86 Fixup Inst Tuning
				pengfei (Not Done): Is the pass too far away from `ISel`?
				goldstein.w.n (Author, Done): I think we want it to run well after all other instruction transformations. The important ones are register allocation (so we convert `vpermilpsri` -> `vshufpsrri` at a spill) and the domain fixup (so we have the correct instructions).
	; CHECK-NEXT: Compressing EVEX instrs to VEX encoding when possible
	; CHECK-NEXT: X86 Discriminate Memory Operands
	; CHECK-NEXT: X86 Insert Cache Prefetches
	; CHECK-NEXT: X86 insert wait instruction
	; CHECK-NEXT: Contiguously Lay Out Funclets
	; CHECK-NEXT: StackMap Liveness Analysis
	; CHECK-NEXT: Live DEBUG_VALUE analysis
	; CHECK-NEXT: Machine Sanitizer Binary Metadata