This is an archive of the discontinued LLVM Phabricator instance.

[X86] Transform zext+seteq+cmp into shr+lzcnt on btver2 architecture.
AbandonedPublic

Authored by pgousseau on Jul 6 2016, 3:08 AM.

Details

Summary

Hi All,

I would like to propose a change to turn zext+seteq+cmp into shr+lzcnt.
This optimisation is beneficial on Jaguar architecture only, where the lzcnt has a good reciprocal throughput.
Other architectures such as Intel's Haswell/Broadwell or AMD's Bulldozer/PileDriver do not benefit from it.
For this reason the change also add a "HasFastLZCNT" feature which gets enabled for Jaguar.

Diff Detail

Event Timeline

pgousseau updated this revision to Diff 62837.Jul 6 2016, 3:08 AM
pgousseau retitled this revision from to [X86] Transform zext+seteq+cmp into shr+lzcnt on btver2 architecture..
pgousseau updated this object.
pgousseau added a subscriber: llvm-commits.
RKSimon edited edge metadata.Jul 6 2016, 7:08 AM

This seems to be limited to (a==0) and ((a== 0) || (b== 0)) patterns - is that the best way to do this? Can this be easily compounded to support different numbers of tests?

lib/Target/X86/X86Subtarget.h
137

This isn't a processor/cpuid feature, please move this further down to be closer to the other fast/slow characteristic features.

test/CodeGen/X86/lzcnt-zext-cmp.ll
2

Please regenerate with utils/update_llc_test_checks.py

pgousseau updated this revision to Diff 62882.Jul 6 2016, 9:13 AM
pgousseau edited edge metadata.

Following Simon's comments:

  • Move the new flag declaration among declarations with similar meaning.
  • Regenerated test's asserts with update_llc_test_checks.py

This seems to be limited to (a==0) and ((a== 0) || (b== 0)) patterns - is that the best way to do this? Can this be easily compounded to support different numbers of tests?

Yes I was hoping to do the generic case initially, I had a version implemented in X86ISelLowering which was handling the case (ak == 0 || ... || an == 0), but this caused some optimisation opportunities in h264's hot functions to be missed. Not quite sure why this happened, it seems that there are optimisations occurring before Instruction selection that will undo lzcnt + shr optimisations.

spatel edited edge metadata.Jul 6 2016, 9:32 AM

Did you look at doing this in DAGCombine with a TLI hook? PPC already has this or something very close in PPCTargetLowering::LowerSETCC(), so it seems like the code could simply be lifted up from there?

Would x86 targets other than btver2 want to do this transform when optimizing for size?

Did you look at doing this in DAGCombine with a TLI hook? PPC already has this or something very close in PPCTargetLowering::LowerSETCC(), so it seems like the code could simply be lifted up from there?

Would x86 targets other than btver2 want to do this transform when optimizing for size?

I did not know there was a similar change in PPCTargetLowering, thanks for pointing it out, I will investigate this.
For the TLI hook I could add a method 'bool isCTLZFast()', which on X86, I suppose would be something like:
bool isCTLZFast() { return (hasLZCNT() && getCPU() == "btver2");}
It looks a bit less maintainable than the Target feature way though, what do you think?

Regarding the size, I have noticed that openssl does get smaller but I would need to do more testing to ensure this is always the case.

spatel added a comment.Jul 7 2016, 7:29 AM

Did you look at doing this in DAGCombine with a TLI hook? PPC already has this or something very close in PPCTargetLowering::LowerSETCC(), so it seems like the code could simply be lifted up from there?

Would x86 targets other than btver2 want to do this transform when optimizing for size?

I did not know there was a similar change in PPCTargetLowering, thanks for pointing it out, I will investigate this.
For the TLI hook I could add a method 'bool isCTLZFast()', which on X86, I suppose would be something like:
bool isCTLZFast() { return (hasLZCNT() && getCPU() == "btver2");}
It looks a bit less maintainable than the Target feature way though, what do you think?

You should keep the HasFastLZCNT attribute that you already have proposed, and the hook will trigger off of that. We don't want to base transforms on CPU models; that doesn't evolve well.

Did you look at doing this in DAGCombine with a TLI hook? PPC already has this or something very close in PPCTargetLowering::LowerSETCC(), so it seems like the code could simply be lifted up from there?

Would x86 targets other than btver2 want to do this transform when optimizing for size?

I did not know there was a similar change in PPCTargetLowering, thanks for pointing it out, I will investigate this.
For the TLI hook I could add a method 'bool isCTLZFast()', which on X86, I suppose would be something like:
bool isCTLZFast() { return (hasLZCNT() && getCPU() == "btver2");}
It looks a bit less maintainable than the Target feature way though, what do you think?

You should keep the HasFastLZCNT attribute that you already have proposed, and the hook will trigger off of that. We don't want to base transforms on CPU models; that doesn't evolve well.

Makes sense yes thanks. This should be done in 3 patches I think.
The first patch will be NFC, adding the "isCTLZFast" TLI hook, enabling it for PPC and moving the PPC specific code to DAGCombiner.
The second patch will enable "isCTLZFast" for X86.
The third patch will add the transformation for the OR case.
I think this is what you and Simon are suggesting?
Will experiment a bit with it and abandon this review once the first patch is ready.

spatel added a comment.Jul 8 2016, 7:43 AM

Makes sense yes thanks. This should be done in 3 patches I think.
The first patch will be NFC, adding the "isCTLZFast" TLI hook, enabling it for PPC and moving the PPC specific code to DAGCombiner.
The second patch will enable "isCTLZFast" for X86.
The third patch will add the transformation for the OR case.
I think this is what you and Simon are suggesting?

That sounds like a good plan to me.

Abandon this now that D23446 is committed?

pgousseau abandoned this revision.Oct 17 2016, 8:00 AM

Abandon this now that D23446 is committed?

Ah yes! Thanks for reminding me.