This is an archive of the discontinued LLVM Phabricator instance.

[CodeMetrics] Don't let extends of i1 be free.
AbandonedPublic

Authored by jonpa on Oct 17 2018, 7:11 AM.

Download Raw Diff

Details

Reviewers

uweigand
eastig
chandlerc
eli.friedman
fhahn
craig.topper

Summary

I was experimenting with the LoopRotation option -rotation-max-header-size and found that if I increased this a few more loops were vectorized.

This is a threshold that compares against the computed size of the header which is done by calling getUserCost() (instead of getArithmeticInstrCost(), etc). This seems to be a poor-mans cost function, and I looked over the various costs that are actually used for SystemZ.

The only cost I came to question was that for the extension of an i1, which is per default free. I suspect this is wrong on SystemZ, as it is implemented with LOC for int, and branch sequence for floating point.

So far, I have seen some mixed results mostly in favor, but also a regression or two. I also experimented with the SimplifyCFG option -phi-node-folding-threshold=3 (instead of default 2), to handle some case that seemed better transformed by the CFG but were not with the added cost of the i1 extension. This in turn also led to mixed results.

I think this might be worth investigating further. In particular, I wonder if it may be worth fine-tuning exactly what the SimplifyCFG decides to speculate or not. Raising the limit to 3 seems beneficial in some cases at least.

Diff Detail

Event Timeline

jonpa created this revision.Oct 17 2018, 7:11 AM

It seems this is a deliberate choice made here:

// Result of a cmp instruction is often extended (to be used by other
// cmp instructions, logical or return instructions). These are usually
// nop on most sane targets.
if (isa<CmpInst>(CI->getOperand(0)))
  return TTI::TCC_Free;

which confuses me a bit. I guess it is true that if the result of the extension is then used in another compare, that can be optimized away. But if it cannot be optimized away, I'm not aware of any platform where this is free (e.g. on Intel there's the setcc type instructions).

Looking at the commit where (the predecessor of) this code was introduced:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20100111/094226.html
this also says:

Ext of i/fcmp results are mostly optimized away in codegen.

So the rationale is not that those extends are supposed to be cheap (and therefore it makes sense to make the cost different per platform). Rather the rationale is that those extends are typically *optimized away* later.

This leaves me wondering: is there something special about SystemZ where those extends are not usually optimized away, even though they are elsewhere? Or is it still true that they're typically optimized away just like elsewhere, only in this particular use case they aren't -- but then that would be a problem on other platforms, too?

How is this loop handled on Intel, for example?

In D53373#1268516, @uweigand wrote:
It seems this is a deliberate choice made here:
// Result of a cmp instruction is often extended (to be used by other
// cmp instructions, logical or return instructions). These are usually
// nop on most sane targets.
if (isa<CmpInst>(CI->getOperand(0)))
  return TTI::TCC_Free;
which confuses me a bit. I guess it is true that if the result of the extension is then used in another compare, that can be optimized away. But if it cannot be optimized away, I'm not aware of any platform where this is free (e.g. on Intel there's the setcc type instructions).

Looking at the commit where (the predecessor of) this code was introduced:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20100111/094226.html
this also says:

Ext of i/fcmp results are mostly optimized away in codegen.

So the rationale is not that those extends are supposed to be cheap (and therefore it makes sense to make the cost different per platform). Rather the rationale is that those extends are typically *optimized away* later.

This leaves me wondering: is there something special about SystemZ where those extends are not usually optimized away, even though they are elsewhere? Or is it still true that they're typically optimized away just like elsewhere, only in this particular use case they aren't -- but then that would be a problem on other platforms, too?

How is this loop handled on Intel, for example?

It seems that on x86, all these loops are implemented with testl + sets (llc -O3), which IIUC is basically a compare + setcc, which you mentioned.

If I first run opt -O3, the first two loops instead use icmp + select and the output on x86 is instead something like testq + cmovq (the conversions to fp still use testl + sets). So even if "optimized away" would mean "turned into select", that is still not free...

Still not sure then what this is about - will try to ask on the mailing list.

After discussions on llvm-dev, it has become most likely that this code is outdated and wrong. So I have changed the patch to simply remove it. No tests failed.

It would have been somewhat better to do

if (const CastInst *CI = dyn_cast<CastInst>(U)) {
     // Extension of a vector compare may be free.
     Type *Ty = CI->getType();
     if (Ty->isVectorTy() && isa<CmpInst>(CI->getOperand(0))) {
       BooleanContent BoolCt = TLI->getBooleanContents(Ty);
       if ((isa<ZExtInst>(CI) && BoolCt == ZeroOrOneBooleanContent) ||
           (isa<SExtInst>(CI) && BoolCt == ZeroOrNegativeOneBooleanContent))
         return TTI::TCC_Free; 
     }
     if (isa<SExtInst>(CI) || isa<ZExtInst>(CI) || isa<FPExtInst>(CI))
       return static_cast<T *>(this)->getExtCost(CI, Operands.back()); 
   }

, but there is no TLI pointer in TargetTransformInfoImplBase, and since this is a poor-mans version of cost computations it seems like an overkill.

jonpa retitled this revision from [SystemZ] Let extends of i1 be at least TCC_Basic in getUserCost() to [CodeMetrics] Don't let extends of i1 be free..Oct 31 2018, 12:13 AM

jonpa added a subscriber: llvm-commits.

Hi Jonas,

Thank for taking care of this. As this is legacy code without any tests some additional checks, such as runs of the LNT test suite, might be needed. This small change can affect the inliner or other optimisations and introduce regressions. I'll try to run the LNT benchmarks to check. You are welcome to do this as well.

I also suggest to add Chandler as a reviewer. Maybe Eli Friedman too?

Thanks,
Evgeny

I also suggest to add Chandler as a reviewer. Maybe Eli Friedman too?

Sure!

I ran the test suite in benchmarking mode (using http://llvm.org/docs/lnt/quickstart.html) on my Intel laptop (i7 4 cores/8 threads), following all the advice including turning frequency scaling and turbo boost mode off. With 5 + 5 runs, I could not see any runtime improvements / regressions at all (When I did 2 +2 runs while using the laptop for editing I saw just a few minor cases both ways).

This should indicate that this patch would be ok for x86 at least. Of course it wouldn't hurt if someone tried some other benchmarks as well for X86, and hopefully someone could do this also for some other target. @fhahn: Could you perhaps try this on Arm?

Tests?

This patch was never sent to llvm-commits.

Please abandon this Phabricator revision and create a new one following the process:
https://llvm.org/docs/Phabricator.html

And make sure that llvm-commits is subscribed from the beginning so that it follows standard LLVM codereview process.

In D53373#1303836, @lebedev.ri wrote:

Tests?

I first made a test for this using the loop-unroller, but then I realized that I could not find any other tests for getUserCost(), so I removed it again, since it was a bit hacky. See the first version under 'History'. Should I put it back, or is there a better way?

In D53373#1303838, @chandlerc wrote:

This patch was never sent to llvm-commits.

Please abandon this Phabricator revision and create a new one following the process:
https://llvm.org/docs/Phabricator.html

And make sure that llvm-commits is subscribed from the beginning so that it follows standard LLVM codereview process.

OK, please jump to https://reviews.llvm.org/D54742

Revision Contents

Path

Size

include/

llvm/

Analysis/

TargetTransformInfoImpl.h

8 lines

Diff 171866

include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 799 Lines • ▼ Show 20 Lines	if (auto CS = ImmutableCallSite(U)) {
return static_cast<T *>(this)		return static_cast<T *>(this)
->getCallCost(cast<FunctionType>(FTy), CS.arg_size());		->getCallCost(cast<FunctionType>(FTy), CS.arg_size());
}		}

SmallVector<const Value *, 8> Arguments(CS.arg_begin(), CS.arg_end());		SmallVector<const Value *, 8> Arguments(CS.arg_begin(), CS.arg_end());
return static_cast<T *>(this)->getCallCost(F, Arguments);		return static_cast<T *>(this)->getCallCost(F, Arguments);
}		}

if (const CastInst *CI = dyn_cast<CastInst>(U)) {		if (const CastInst *CI = dyn_cast<CastInst>(U))
// Result of a cmp instruction is often extended (to be used by other
// cmp instructions, logical or return instructions). These are usually
// nop on most sane targets.
if (isa<CmpInst>(CI->getOperand(0)))
return TTI::TCC_Free;
if (isa<SExtInst>(CI) \|\| isa<ZExtInst>(CI) \|\| isa<FPExtInst>(CI))		if (isa<SExtInst>(CI) \|\| isa<ZExtInst>(CI) \|\| isa<FPExtInst>(CI))
return static_cast<T *>(this)->getExtCost(CI, Operands.back());		return static_cast<T *>(this)->getExtCost(CI, Operands.back());
}

return static_cast<T *>(this)->getOperationCost(		return static_cast<T *>(this)->getOperationCost(
Operator::getOpcode(U), U->getType(),		Operator::getOpcode(U), U->getType(),
U->getNumOperands() == 1 ? U->getOperand(0)->getType() : nullptr);		U->getNumOperands() == 1 ? U->getOperand(0)->getType() : nullptr);
}		}

int getInstructionLatency(const Instruction *I) {		int getInstructionLatency(const Instruction *I) {
SmallVector<const Value *, 4> Operands(I->value_op_begin(),		SmallVector<const Value *, 4> Operands(I->value_op_begin(),
Show All 33 Lines