This patch aims to recognize loops like the following and replace them with ctlz, making them countable:

while (n) {
  n >>= 1;
  i++;
  body();
}
use(i);

should be replaced with:

ii = type_bitwidth(n) - ctlz(n);
for (j = 0; j < ii; j++) {
  body();
}
use(ii + i);
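As a concrete illustration, here is the same transformation written out as runnable C. This is a sketch only: it assumes 32-bit unsigned n, uses GCC/Clang's __builtin_clz for ctlz, and guards the n == 0 case explicitly since clz(0) is undefined; with the counter starting at 0, use(ii + i) reduces to use(ii).

```c
#include <stdio.h>

static int body_calls; /* stand-in for body()'s side effects */
static void body(void) { body_calls++; }

/* Original, uncountable form: the trip count depends on the value of n. */
unsigned count_shifts_loop(unsigned n) {
    unsigned i = 0;
    while (n) {
        n >>= 1;
        i++;
        body();
    }
    return i; /* use(i) */
}

/* Transformed, countable form: the trip count is precomputed via ctlz.
 * __builtin_clz(0) is undefined, so the zero case is guarded. */
unsigned count_shifts_ctlz(unsigned n) {
    unsigned ii = n ? 32u - (unsigned)__builtin_clz(n) : 0u;
    for (unsigned j = 0; j < ii; j++)
        body();
    return ii; /* use(ii + i), with i == 0 here */
}

int main(void) {
    for (unsigned n = 0; n < 1000; n++)
        if (count_shifts_loop(n) != count_shifts_ctlz(n))
            printf("mismatch at %u\n", n);
    return 0;
}
```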
Differential D32605: Recognize CTLZ builtin
Authored by evstupac on Apr 27 2017, 11:22 AM.
Event Timeline

Regarding performance: almost all benchmarks build the same.

Hi Evgeny,

I think this is an interesting concept, but I share Joerg's concern. This has to, at least, calculate the cost of CTLZ per architecture and compare it with the cost of the loop. For example, if the loop count is known and short, and the platform has no CTLZ, then it's quite possible that the transformation will render poorer code. It doesn't have to be a very accurate cost, but something needs to be done. You'd also need some more testing on non-x86 platforms, to make sure the decisions are good for all of them.

cheers,
I can guard this, or keep it as is if we consider converting a loop to a countable form to be generally good. In that case the builtin should be expanded at codegen in the most profitable way.

Hi Renato,

I assume that if the CPU has ctlz, the transformation is always profitable. For x86 CPUs that is so.

Thanks,

We could add a TTI callback like we have for popcnt. You could also argue that this is the better canonical form because it can enable other analyses/optimizations (and we should fix the expansion if it is suboptimal).

I also feel like "not a perfect expansion" for a particular CPU is an issue that should be resolved by the CPU owner.
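A minimal sketch of what such a guard could look like, written as plain C rather than actual TTI code; the function names and cost constants here are hypothetical, echoing the instruction-count estimates that come up later in this thread:

```c
/* Hypothetical profitability guard; not LLVM's TTI API. */
static int has_cheap_ctlz(void) {
    return 1; /* stub: think x86_64 or ARMv5T+, which have CLZ */
}

static int should_use_ctlz(int whole_loop_deleted,
                           unsigned expected_iterations) {
    /* If the loop disappears entirely, even a libcall is a win. */
    if (whole_loop_deleted)
        return 1;
    /* With a native instruction, one ctlz beats re-counting bits. */
    if (has_cheap_ctlz())
        return 1;
    /* Without one, the generic expansion costs ~19 instructions,
     * vs. ~3 per iteration (add, shift, branch) for the loop. */
    return expected_iterations * 3 > 19;
}
```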
Guarding is easier. Let's do this only if someone complains about performance.

I agree. If the CPU has it, it will be beneficial. If it doesn't, it is only a useful transformation if the intrinsic can be constant folded.

@joerg, can you clarify what you agree with? It sounds to me like you're worried that the intrinsics will be worse in some cases? ARM (32/64) and x86_64 have CLZ, so it should always be beneficial, but I was just worried I'm missing something important.

cheers,

That's not true. ARMv4, for example, has no clz; it's a V5T feature. That's my point: if the CPU has no direct lowering for the intrinsic, this transform is beneficial only if the resulting intrinsic can be constant folded. But I wonder if we don't catch those cases already with SCEV-based optimisations. If a CPU has no direct lowering, like on ARMv4, it will add a libcall, with a high chance of being more expensive than the optimisation.

The target-independent lowering code emits this for CTLZ when it's not supported. I think the popcount expands to this if it's not supported: http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel. So there shouldn't be a libcall unless the target is changing the default behavior.

  // for now, we do this:
  // x = x | (x >> 1);
  // x = x | (x >> 2);
  // ...
  // x = x | (x >>16);
  // x = x | (x >>32); // for 64-bit input
  // return popcount(~x);
  //
  // Ref: "Hacker's Delight" by Henry Warren

Why?

  // fold (srl (ctlz x), "5") -> x iff x has one bit set (the low bit).
  // select_cc seteq X, 0, sizeof(X), ctlz(X) -> ctlz(X)
  // select_cc seteq X, 0, sizeof(X), ctlz_zero_undef(X) -> ctlz(X)
  // select_cc seteq X, 0, sizeof(X), cttz(X) -> cttz(X)
  // select_cc seteq X, 0, sizeof(X), cttz_zero_undef(X) -> cttz(X)
  ...

Yeah, but even the generic expansion results in ~19 instructions on ARMv4. Compare that to one instruction in the loop and it can hardly be said to be a general win.

There should be at least 3 instructions in the loop: add, shift and branch. For a 32-bit integer it will run 16 iterations on average, so 16 * 3 > 19.

I don't think this transform changes anything about the countability of the loop; SCEV should certainly be able to do the same.
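For reference, the expansion quoted above written out as runnable C: a 32-bit sketch following the Hacker's Delight scheme and the linked bithacks popcount, not the exact code any particular backend emits.

```c
#include <stdint.h>

/* Parallel bit count, per the Stanford bithacks page linked above. */
static uint32_t popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

/* Generic ctlz expansion: smear the highest set bit downwards,
 * then count the zeros that remain above it. ctlz32(0) == 32. */
uint32_t ctlz32(uint32_t x) {
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return popcount32(~x);
}
```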
You are miscounting. The very example you originally gave trades a shift+count based loop for a clz + increment based loop. Naively speaking, ignoring subtleties of the architecture, the udivmodsi4.S implementation in compiler-rt is a good example: if clz can be used, it provides a nice optimization; otherwise the stupid linear checking performs better in the majority of cases. Please also keep in mind that we are talking about potentially executing fewer instructions vs. blowing up .text here.

It does change things. SCEV is unable to get the BECount for a while (n >>= 1) loop. Actually, the BECount should be CTLZ(n) - 1...
If we get rid of the loop, I'm not miscounting.
Is it ok if we apply the optimization only in the case where the whole loop is converted to CTLZ?

That is something SCEV should be able to discover on its own.
Replacing the full loop with the intrinsic is ok. The current default lowering is broken, but improving that is orthogonal. I.e. from a code size perspective, trading the loop for a libcall is still an improvement when using an optimized library version.
Hoisting the computation out of the loop without removing it should be guarded by CPU support for CTLZ, correct.

Could you briefly describe how? The only way to get the BECount is to calculate CTLZ. What should SCEV discover?
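To make the arithmetic concrete, here is a small C cross-check of the claim, assuming 32-bit unsigned n and GCC/Clang's __builtin_clz. Whether this count is exactly SCEV's backedge-taken count depends on how the loop is rotated, so read "CTLZ(n) - 1" above as shorthand for (bitwidth - ctlz(n)) - 1:

```c
#include <assert.h>

/* Times the body of `while (n >>= 1) body();` executes, counted
 * directly by running the loop. */
static unsigned count_by_loop(unsigned n) {
    unsigned c = 0;
    while (n >>= 1)
        c++;
    return c;
}

/* The same count expressed through ctlz: (bitwidth - ctlz(n)) - 1
 * for n > 1, and 0 otherwise (clz(0) is undefined, so guard it). */
static unsigned count_by_ctlz(unsigned n) {
    return n > 1 ? (32u - (unsigned)__builtin_clz(n)) - 1u : 0u;
}

int main(void) {
    for (unsigned n = 0; n < 1u << 20; n++)
        assert(count_by_loop(n) == count_by_ctlz(n));
    return 0;
}
```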
Loop idiom recognition is not the best place to check whether we'll be able to delete a loop or not. If we decide to insert a TTI guard, I'll do it at the beginning.

I keep forgetting about ARMv4... :)

My base point here is that we should avoid arguments like "the issue should be resolved by the CPU owner". For example, no one I know *benchmarks* on ARMv4. I know people that test stuff on it (Joerg, Saleem), but they don't benchmark. How long would this have passed unnoticed? I think we need to take a more conservative view on optimisations, get either an approval or a "don't care" from CPU owners, and *then* let them work out the problems later, if they come.

I particularly care about ARMv7+, and that's why I wanted to make sure Joerg was happy, because he cares about areas I usually don't. And I agree with Joerg: this will have a large impact on ARMv4, not just on performance, but also on code size. This feature *needs* a hook.

cheers,

Just as a side note: this is not only ARMv4. Older SPARC, M68K, SH, IA64, and Alpha all seem to lack an FFS/CLZ instruction, from a cursory check.
Right, just to make sure it's doing what we expect it to: did you try to run this with targets "armv7a" and "armv4t"? They should have different results. Also, having a new test on the ARM side with two RUN lines, one for each, would make the intentions much clearer.

cheers,

Added a test for armv4t and armv7a.
If I get you right, "ctlz_and_other" checks the conversion to a countable loop, and that CPUs that do not support ctlz will not insert it in this case.

I mean: it doesn't translate the loop to use ctlz, but the comment suggests that it still transforms the loop into something better understood. But it doesn't test that part?

Got it.
Just made a follow-up commit, r303212, adding a "!IsCntPhiUsedOutsideLoop" check here.
Previously the code was potentially buggy because, in the IsCntPhiUsedOutsideLoop case, we inserted CTLZ(X >> 1) but relied on the check X != 0 (instead of (X >> 1) != 0).
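To illustrate the corner case the fix addresses (this is my reconstruction in C, not code from the patch): with X == 1, the old guard X != 0 passes, yet the value handed to ctlz is X >> 1 == 0, which a zero-undef ctlz must never see.

```c
#include <stdio.h>

/* Stand-in for a ctlz intrinsic whose result is undefined on zero
 * (as with llvm.ctlz when its is_zero_undef flag is set). */
static unsigned ctlz_zero_undef(unsigned x) {
    if (x == 0) {
        fprintf(stderr, "ctlz called on zero: undefined!\n");
        return 32; /* arbitrary; real hardware may return garbage */
    }
    return (unsigned)__builtin_clz(x);
}

int main(void) {
    unsigned x = 1;
    /* Buggy guard: tests x, but the operand of ctlz is x >> 1. */
    if (x != 0)
        ctlz_zero_undef(x >> 1);   /* x >> 1 == 0 slips through */
    /* Fixed guard: tests the actual operand. */
    if ((x >> 1) != 0)
        ctlz_zero_undef(x >> 1);   /* not reached for x == 1 */
    return 0;
}
```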