This is an archive of the discontinued LLVM Phabricator instance.

[X86] `lowerBuildVectorAsBroadcast()`: with AVX2, allow i64->XMM broadcasts from constant pool
AbandonedPublic

Authored by lebedev.ri on Apr 6 2022, 9:42 AM.

Details

Reviewers
RKSimon
Summary

While AVX2 can do that, it results in a number of load folding regressions,
and i'm a little lost as to what to do about them. Any hints?

Diff Detail

Unit TestsFailed

Event Timeline

lebedev.ri created this revision.Apr 6 2022, 9:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2022, 9:42 AM
Herald added a subscriber: pengfei. · View Herald Transcript
lebedev.ri requested review of this revision.Apr 6 2022, 9:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2022, 9:42 AM

please can generate the diff with context?

llvm/test/CodeGen/X86/combine-movmsk.ll
239

looks like you need to tweak the check prefixes

lebedev.ri updated this revision to Diff 420930.Apr 6 2022, 9:49 AM

Add more context.

Hm, so we already have cases where we fail to undo broadcast load into a folded load (https://godbolt.org/z/3jzEd91ca), so i'm still unsure if that is a blocker?

lebedev.ri updated this revision to Diff 420989.Apr 6 2022, 1:03 PM
lebedev.ri marked an inline comment as done.

Fix check prefixes in a (single) problematic test.

This is why I don't think we want to perform too much of this in the DAG - we quickly get to cases where the decision between broadcast vs vector load of constants can't be easily determined - value tracking, multiple uses, hoisting, lost folds, spilling etc. all get affected.

A while ago I was investigating the use of VPMOVSX/ZX to reduce the size of the constant pool, and hit many of the same problems. And constant rematerialization would be the same if we ever get to that point.

There's probably some minor further tweaks we can do (more hasOneUse checks?), but really we need to think about performing less in the DAG, and more in later passes.

This is why I don't think we want to perform too much of this in the DAG - we quickly get to cases where the decision between broadcast vs vector load of constants can't be easily determined - value tracking, multiple uses, hoisting, lost folds, spilling etc. all get affected.

A while ago I was investigating the use of VPMOVSX/ZX to reduce the size of the constant pool, and hit many of the same problems. And constant rematerialization would be the same if we ever get to that point.

There's probably some minor further tweaks we can do (more hasOneUse checks?), but really we need to think about performing less in the DAG, and more in later passes.

I see.
Please confirm my understanding, you are suggesting that we should generalize *SET0/*SETALLONES pseudo-instructions
into MATERIALIZE pseudo-instruction, with much the same handling of expanding it post-RA (expandPostRAPseudo())?

I see.
Please confirm my understanding, you are suggesting that we should generalize *SET0/*SETALLONES pseudo-instructions
into MATERIALIZE pseudo-instruction, with much the same handling of expanding it post-RA (expandPostRAPseudo())?

I'm not sure if we'd want to handle them as pseudos, or have a pass that converts vector constant pool loads into broadcasts/materialization etc. in general. Handling AVX512 broadcast folds just makes it more difficult.

All of this needs to be done in conjunction with the foldMemoryOperand stages, and I haven't investigated it much as to how to deal with it all.

lebedev.ri abandoned this revision.Apr 12 2022, 12:38 PM