While AVX2 can do that, it results in a number of load folding regressions,
and I'm a little lost as to what to do about them. Any hints?
Details
- Reviewers
- RKSimon 
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Please can you generate the diff with context?
| llvm/test/CodeGen/X86/combine-movmsk.ll | ||
|---|---|---|
| 239 | looks like you need to tweak the check prefixes | |
Hm, so we already have cases where we fail to turn a broadcast load back into a folded load (https://godbolt.org/z/3jzEd91ca), so I'm still unsure whether that is a blocker?
This is why I don't think we want to perform too much of this in the DAG - we quickly get to cases where the decision between broadcast vs vector load of constants can't be easily determined - value tracking, multiple uses, hoisting, lost folds, spilling etc. all get affected.
A while ago I was investigating the use of VPMOVSX/ZX to reduce the size of the constant pool, and hit many of the same problems. And constant rematerialization would be the same if we ever get to that point.
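To make the size trade-off above concrete, here is a minimal sketch of a toy cost model in C++. It is not LLVM code: `ConstStrategy`, `ConstChoice`, and `pickSmallestPoolEntry` are hypothetical names, and the model assumes 32-bit lanes and only counts constant-pool bytes. It deliberately ignores multiple uses, load folding, hoisting, and spilling, which is exactly why a size-only decision made early in the DAG can turn out wrong.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model (not an LLVM API): ways a vector constant with
// 32-bit lanes could be brought into a register.
enum class ConstStrategy { FullLoad, Broadcast, SignExtend8 };

struct ConstChoice {
  ConstStrategy strategy;
  std::size_t poolBytes; // constant-pool bytes this strategy would need
};

// True if every lane holds the same value, so a single element could be
// stored and broadcast.
inline bool isSplat(const std::vector<std::int32_t> &lanes) {
  assert(!lanes.empty() && "expected at least one lane");
  for (std::int32_t v : lanes)
    if (v != lanes.front())
      return false;
  return true;
}

// True if every lane fits in a signed byte, so the pool could hold i8 lanes
// that are sign-extended on load (the VPMOVSX-style idea).
inline bool fitsInInt8(const std::vector<std::int32_t> &lanes) {
  for (std::int32_t v : lanes)
    if (v < INT8_MIN || v > INT8_MAX)
      return false;
  return true;
}

// Pick the strategy with the smallest constant-pool footprint. Size is the
// only criterion here; use counts and fold opportunities are not modelled.
inline ConstChoice
pickSmallestPoolEntry(const std::vector<std::int32_t> &lanes) {
  assert(!lanes.empty() && "expected at least one lane");
  ConstChoice best{ConstStrategy::FullLoad,
                   lanes.size() * sizeof(std::int32_t)};
  if (isSplat(lanes) && sizeof(std::int32_t) < best.poolBytes)
    best = {ConstStrategy::Broadcast, sizeof(std::int32_t)};
  if (fitsInInt8(lanes) && lanes.size() < best.poolBytes)
    best = {ConstStrategy::SignExtend8, lanes.size()};
  return best;
}
```

In this model an 8-lane splat prefers a broadcast, while eight distinct small values prefer the sign-extended form. Neither answer says anything about whether the resulting load can still be folded into its user, which is the regression under discussion.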
There are probably some minor further tweaks we can do (more hasOneUse checks?), but really we need to think about performing less in the DAG, and more in later passes.
I see.
Please confirm my understanding: you are suggesting that we should generalize the *SET0/*SETALLONES pseudo-instructions
into a MATERIALIZE pseudo-instruction, with much the same handling of expanding it post-RA (expandPostRAPseudo())?
I'm not sure if we'd want to handle them as pseudos, or have a pass that converts vector constant pool loads into broadcasts/materialization etc. in general. Handling AVX512 broadcast folds just makes it more difficult.
All of this needs to be done in conjunction with the foldMemoryOperand stages, and I haven't yet investigated in much detail how to deal with it all.