If a loop has a constant trip count known to be a multiple of MaxVF (times user UF), LV infers that no tail will be generated for any chosen VF. This relies on the chosen VF's being powers of 2 bound by MaxVF, and assumes MaxVF is a power of 2.
Make sure the latter holds, in particular when MaxVF is set by a memory dependence distance which may not be a power of 2.
Details
- Reviewers
fhahn gilr - Commits
- rG840450549c91: [LV] Clamp MaxVF to power of 2.
Diff Detail
Event Timeline
LGTM, thanks!
llvm/test/Transforms/LoopVectorize/memdep-fold-tail.ll | ||
---|---|---|
108 | nit both loop and vectorize.enable metadata not needed? |
llvm/test/Transforms/LoopVectorize/memdep-fold-tail.ll | ||
---|---|---|
108 | they are needed in order to get the loop vectorized, eventually with VF=2 and a folded tail. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5056 | This has caused some problems for us (downstream) and I'm not really sure how to deal with it. Maybe this never happens for in-tree targets, but our target will return 160 for PowerOf2Floor. And I haven't really seen anything that says that TTI.getRegisterBitWidth(true) must return a power of 2 number. Even if we actually want to scalarize operations using i40, doing PowerOf2Floor makes WidestRegister=128. And then we won't get a power-of-2 result from this method, since 128/40 isn't a power of 2 (considering that SmallestType/WidestType may end up not being a power-of-2). And then we trigger the new assert in computeMaxVF. So in some sense this patch limits the types allowed in loops to be power-of-2-sized. And returning something that isn't a power-of-two from TTI.getRegisterBitWidth is now stupid. Maybe there are other ways to ensure MaxVF is a power-of-2, or were these implications for getSmallestAndWidestTypes and getRegisterBitWidth intentional? |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5056 | Ah, thanks! I had not seen that one. |
This has caused some problems for us (downstream) and I'm not really sure how to deal with it.
Maybe this never happens for in-tree targets, but our target will return 160 for PowerOf2Floor. And I haven't really seen anything that says that TTI.getRegisterBitWidth(true) must return a power of 2 number.
Another, perhaps abnormal, thing is that our frontend support i40 types. So SmallestType/WidestType isn't guaranteed to be power-of-2 either.
Even if we actually want to scalarize operations using i40, doing PowerOf2Floor makes WidestRegister=128. And then we won't get a power-of-2 result from this method, since 128/40 isn't a power of 2 (considering that SmallestType/WidestType may end up not being a power-of-2). And then we trigger the new assert in computeMaxVF.
So in some sense this patch limits the types allowed in loops to be power-of-2-sized. And returning something that isn't a power-of-two from TTI.getRegisterBitWidth is now stupid.
Maybe there are other ways to ensure MaxVF is a power-of-2, or were these implications for getSmallestAndWidestTypes and getRegisterBitWidth intentional?