Recently, after D20474, we observed regressions in our internal testing because of changes to induction variable calculations. This patch implements a TargetTransformInfo hook to query the backend for whether splatting a scalar is expensive; it returns true for ARM. This enables vectorizing integer induction phi nodes instead of keeping a scalar and splatting it on each loop iteration.
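For readers following along, the hook pattern being proposed looks roughly like the sketch below. The names and shape here are illustrative only, not the actual LLVM TargetTransformInfo interface touched by this patch:

```cpp
#include <cassert>

// Hypothetical sketch only: names are illustrative, not the real
// TargetTransformInfo API from this patch.
struct TTIBase {
  // Default assumption: splatting a scalar into a vector is cheap.
  virtual bool isSplatExpensive() const { return false; }
  virtual ~TTIBase() = default;
};

// An ARM backend would override the hook: per-iteration VDUPs are
// costly, so the vectorizer should prefer a vector induction variable.
struct ARMTTI : TTIBase {
  bool isSplatExpensive() const override { return true; }
};
```

The vectorizer would then consult the hook when choosing between a widened (vector) IV and a scalar IV that gets broadcast each iteration.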
+Michael, who is the author of the code.
Sam, could you add a testcase for the regression? I'm curious why such a regression is ARM-specific.
Adding @mssimpso, who improved the logic significantly after my original patch. :-)
And yes, as @wmi said, this needs a regression test. Also, could you explain why this is an ARM-specific issue? I want to make sure we're not talking about a case where we should never be using a scalar IV, regardless of platform.
Hi Michael,
The splatting of the indvar seems to be problematic for ARM because a VDUP instruction gets generated, which has a long latency.
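To make the cost concrete, here is a hand-written sketch (my own illustration, not vectorizer output) of the two loop shapes. In the splat-style loop, every vector iteration must broadcast the scalar IV into all lanes, which is the pattern that lowers to VDUP on ARM:

```cpp
#include <cassert>

// Splat style: the scalar IV is broadcast into every lane each
// vector iteration (the VDUP pattern on ARM).
void splat_style(int *out, int n) {
  for (int i = 0; i < n; i += 4) {
    int lanes[4] = {i, i, i, i};        // splat of the scalar IV
    for (int l = 0; l < 4; ++l)
      out[i + l] = (lanes[l] + l) * 3;  // vector-style use of the IV
  }
}

// With a vector IV (or a scalar IV kept out of vector code), no
// per-iteration broadcast is needed.
void scalar_style(int *out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = i * 3;
}
```

Both compute the same values; the difference is purely in how the IV reaches the vector lanes.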
Hi Everyone,
I'm planning to take a look at this case today. I'll update the thread in a few hours.
Matt.
Hi Sam,
I don't think TTI is the right way to fix this. The underlying issue in the test case is that we're vectorizing instructions marked uniform after vectorization. We should be scalarizing them instead.
This is what happens. We first collect the loop uniforms. The GEPs are marked uniform, as well as the arithmetic producing their indices. Because the IVs have no non-uniform users, they too are marked uniform, which all seems correct to me.
Then we vectorize the IVs. When vectorizing them, we know that we should only need scalar versions (Legal->isScalarAfterVectorization is true because the IVs are uniform). However, to keep WidenMap happy (see D23169) we actually vectorize them with the splat method as well. This is not ideal but also not the underlying issue. Because all users of the IVs should also be scalar, it's OK (but not good) that we vectorize them with the splat method since the vector versions would have no users and would be deleted later on in the clean-up phase.
Of course, the problem is that we are actually still vectorizing the uniform users of the IVs, creating uses of the vector IVs that we later can't remove. We shouldn't be doing this. I think the right fix is to have something like the following in each case in vectorizeBlockInLoop (except for PHIs and branches):
  if (Legal->isScalarAfterVectorization(&I)) {
    scalarizeInstruction(&I);
    continue;
  }
If we've already decided that an instruction should remain scalar, we shouldn't vectorize it. Does this make sense? We will probably want to wait to do this until after D23169 has been committed to avoid large increases in compile time. I tested the above approach (without D23169) on your test case and the ugly vector IVs were removed. Would you mind doing the same to see if your performance is restored?
Matt.
Can we please also have this loop explained at a higher level? The IR in the testcase is not really digestible.
It's pretty odd that we don't already do this.
I assume the reason we don't already get a lot of garbage is because we have LICM before LV?
I can reproduce this with something like:
  void foo(int *p, int *q, short k) {
    for (short i = 0; i < 256; ++i)
      p[i + k] = q[i + k - 1] * 5 + 1;
  }
If we only have, say, "i + k" as the IV user, instcombine can clean this up, but it seems like it can't handle more complex cases. Haven't dug into why.
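To sanity-check the reproducer's semantics independent of codegen, it can be wrapped in a small harness; the array sizes and the k value below are my own choices:

```cpp
#include <cassert>

// The reproducer from above, unchanged.
void foo(int *p, int *q, short k) {
  for (short i = 0; i < 256; ++i)
    p[i + k] = q[i + k - 1] * 5 + 1;
}

// Harness: with k = 1 the loop writes p[i + 1] = q[i] * 5 + 1
// for i in [0, 255]; sizes chosen so all accesses are in bounds.
int check() {
  int p[257] = {0};
  int q[256];
  for (int j = 0; j < 256; ++j)
    q[j] = j;
  foo(p, q, 1);
  return p[256];  // last write: q[255] * 5 + 1
}
```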
Great! I hope you don't mind that I took a stab at implementing this approach over in D23889. We should now be able to scalarize instructions that LVL marks scalar after vectorization without increasing code size pre-instcombine. To do this, we can check within the scalarization logic if an instruction is also uniform, and if so, generate only one scalar value per unroll iteration instead of VF values per iteration.
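To illustrate the one-copy-per-unroll-iteration idea (a hand-written sketch, not the vectorizer's actual output): with VF = 4, a uniform value such as a base address is computed once per vector iteration rather than once per lane, so scalarizing it does not grow the code by a factor of VF:

```cpp
#include <cassert>

// Hand-written sketch of a VF = 4 "vectorized" loop where the base
// address computation is uniform across lanes.
void sum_blocks(const int *a, int *out, int n) {
  for (int i = 0; i < n; i += 4) {  // one scalar IV step per vector iter
    const int *base = a + i;        // uniform: a single copy per iteration
    int s = 0;
    for (int l = 0; l < 4; ++l)     // the VF = 4 lanes
      s += base[l];
    out[i / 4] = s;
  }
}
```

A non-uniform value, by contrast, would need one scalar copy per lane (VF copies per unroll iteration), which is where the code-size concern comes from.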