This is an archive of the discontinued LLVM Phabricator instance.

[X86]: Match (xor TSize - 1, ctlz) to `bsr` instead of `lzcnt` + `xor`
ClosedPublic

Authored by goldstein.w.n on Jan 11 2023, 12:47 AM.

Details

Summary

Was previously de-optimizating if -march supported lzcnt as there is
no reason to add the extra instruction.

Diff Detail

Event Timeline

goldstein.w.n created this revision.Jan 11 2023, 12:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 11 2023, 12:47 AM
goldstein.w.n requested review of this revision.Jan 11 2023, 12:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 11 2023, 12:47 AM
craig.topper added inline comments.
llvm/lib/Target/X86/X86InstrCompiler.td
2254 ↗(On Diff #488100)

This doesn't produce the correct value for $src being 0. The ctlz would return 16 and the xor would turn that to 31. BSR16rr will produce an undefined value.

I think this is only valid for the zero_undef nodes.

The bsr implementation in hardware also reads the destination register to return it if the other input is zero. This can create unintentional dependencies on older instructions.

bsr is also a multiple uop instruction on some AMD CPUs such as Zen1, 2, and 3. According to uops.info its improved on Zen4.

The bsr implementation in hardware also reads the destination register to return it if the other input is zero. This can create unintentional dependencies on older instructions.

bsr is also a multiple uop instruction on some AMD CPUs such as Zen1, 2, and 3. According to uops.info its improved on Zen4.

Would be fair to not do this on zen1/2/3. Could also make it last use only (where dst==src is almost guranteed) if the dependency
on dst is a major concern, although for many intel x86 impls lzcnt also has a false-dep on dst so its mostly equivilent.

goldstein.w.n added inline comments.Jan 11 2023, 1:53 AM
llvm/lib/Target/X86/X86InstrCompiler.td
2254 ↗(On Diff #488100)

This doesn't produce the correct value for $src being 0. The ctlz would return 16 and the xor would turn that to 31. BSR16rr will produce an undefined value.

I think this is only valid for the zero_undef nodes.

I think you're right although call i32 @llvm.ctlz.i32(i32, i1 true) doesn't seem to match ctlz_zero_undef.
Also taking a look at lzcnt def:

  def LZCNT16rr : I<0xBD, MRMSrcReg, (outs GR16:$dst), (ins GR16:$src),
                    "lzcnt{w}\t{$src, $dst|$dst, $src}",
                    [(set GR16:$dst, (ctlz GR16:$src)), (implicit EFLAGS)]>,
                    XS, OpSize16, Sched<[WriteLZCNT]>;
  def LZCNT16rm : I<0xBD, MRMSrcMem, (outs GR16:$dst), (ins i16mem:$src),
                    "lzcnt{w}\t{$src, $dst|$dst, $src}",
                    [(set GR16:$dst, (ctlz (loadi16 addr:$src))),
                     (implicit EFLAGS)]>, XS, OpSize16, Sched<[WriteLZCNTLd]>;

  def LZCNT32rr : I<0xBD, MRMSrcReg, (outs GR32:$dst), (ins GR32:$src),
                    "lzcnt{l}\t{$src, $dst|$dst, $src}",
                    [(set GR32:$dst, (ctlz GR32:$src)), (implicit EFLAGS)]>,
                    XS, OpSize32, Sched<[WriteLZCNT]>;
...

Which is just set as ctlz so I thought it was just a very confusing overloaded wording in the TD. Do you know what I'm missing?

We have TuningFastLZCNT which should be a good enough predicate to control when NOT to perform this fold - you will need to add test coverage to clz.ll for fast/slow lzcnt

The bsr implementation in hardware also reads the destination register to return it if the other input is zero. This can create unintentional dependencies on older instructions.

bsr is also a multiple uop instruction on some AMD CPUs such as Zen1, 2, and 3. According to uops.info its improved on Zen4.

Would be fair to not do this on zen1/2/3. Could also make it last use only (where dst==src is almost guranteed) if the dependency
on dst is a major concern, although for many intel x86 impls lzcnt also has a false-dep on dst so its mostly equivilent.

The false dependency was fixed on Skylake which released in 2015. Is it safe to assume the majority of users won’t be using a CPU older than that?

Only do for ctlz_zero_undef. Add more tests

We have TuningFastLZCNT which should be a good enough predicate to control when NOT to perform this fold - you will need to add test coverage to clz.ll for fast/slow lzcnt

Done in V2 (I think, used HasFastLZCNT).

The bsr implementation in hardware also reads the destination register to return it if the other input is zero. This can create unintentional dependencies on older instructions.

bsr is also a multiple uop instruction on some AMD CPUs such as Zen1, 2, and 3. According to uops.info its improved on Zen4.

Would be fair to not do this on zen1/2/3. Could also make it last use only (where dst==src is almost guranteed) if the dependency
on dst is a major concern, although for many intel x86 impls lzcnt also has a false-dep on dst so its mostly equivilent.

The false dependency was fixed on Skylake which released in 2015. Is it safe to assume the majority of users won’t be using a CPU older than that?

llvm/lib/Target/X86/X86ISelLowering.cpp
51899

Use curly braces here for consistency with the other blocks.

51923

This needs to be

SDVTList VTs = DAG.getVTList(OpVT, MVT::i32);                                  
Op = DAG.getNode(X86ISD::BSR, dl, VTs, Op);

We have X86ISD::BSR plumbed to also returns flags so it has a second i32 out put for the flags.

craig.topper added inline comments.Jan 11 2023, 4:04 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
51912

You can use C->getZExtValue() since we know the VT is 64-bits or less.

goldstein.w.n marked 3 inline comments as done.Jan 11 2023, 9:50 PM

Fix some nits

pengfei added inline comments.Jan 12 2023, 12:39 AM
llvm/test/CodeGen/X86/clz.ll
968

The FIXME is solved :)
But maybe better to leave a comment the FASTLZCNT is intended.

goldstein.w.n marked an inline comment as done.Jan 12 2023, 9:33 AM
goldstein.w.n added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
51923

This needs to be

SDVTList VTs = DAG.getVTList(OpVT, MVT::i32);                                  
Op = DAG.getNode(X86ISD::BSR, dl, VTs, Op);

We have X86ISD::BSR plumbed to also returns flags so it has a second i32 out put for the flags.

Propegate test changes. Update fixme comment with comment explaining fast-lzcnt

Just a couple of minor comments

llvm/lib/Target/X86/X86ISelLowering.cpp
52076

(style) assert message

52108

if (!C)

52126

(style) remove braces

llvm/test/CodeGen/X86/clz.ll
968

What target can manage 1c latency bsr?

goldstein.w.n marked 4 inline comments as done.

Fix some nits

llvm/lib/Target/X86/X86ISelLowering.cpp
52108

if (!C)

Are the two equivilent? Always thought !C is a zero/non-zero check but nullptr != 0 is techincally possible.

llvm/test/CodeGen/X86/clz.ll
968

What target can manage 1c latency bsr?

Zen4 and I think a few other AMD ones (although more have 1c lzcnt and expensive bsr).

Matt added a subscriber: Matt.Feb 1 2023, 8:23 PM
RKSimon accepted this revision.Feb 6 2023, 8:28 AM

LGTM - cheers

This revision is now accepted and ready to land.Feb 6 2023, 8:28 AM
This revision was landed with ongoing or failed builds.Feb 6 2023, 12:16 PM
This revision was automatically updated to reflect the committed changes.