This is an archive of the discontinued LLVM Phabricator instance.

[X86] X86DAGToDAGISel::matchBEXTRFromAndImm(): if can't use BEXTR, fallback to BZHI (PR43381)
ClosedPublic

Authored by lebedev.ri on Sep 21 2019, 4:46 AM.

Details

Summary

PR43381 notes that while we are good at matching (X >> C1) & C2 as BEXTR/BEXTRI,
we only do that if we either have BEXTRI (TBM),
or if BEXTR is marked as being fast (-mattr=+fast-bextr).
In all other cases we don't match.

But that is mainly only true for AMD CPU's.
However, for all the CPU's for which we have sched models,
the BZHI is always fast (or the sched models are all bad.)

So if we decide that it's unprofitable to emit BEXTR/BEXTRI,
we should fall-back to BZHI if it is available,
and follow-up with the shift.

While it's really tempting to do something because it's cool
it is wise to first thing whether it actually makes sense to do.
We shouldn't just use BZHI because we can, but only it it is beneficial.
In particular, it isn't really worth it if the input is a register,
mask is small, or we can fold a load.
But it is worth it if the mask does not fit into 32-bits.

(careful, i don't know much about intel cpu's my choice of -mcpu may be bad here)
Thus we manage to fold a load:
https://godbolt.org/z/Er0OQz
Or if we'd end up using BZHI anyways because the mask is large:
https://godbolt.org/z/dBJ_5h
But this isn'r actually profitable in general case,
e.g. here we'd increase microop count
(the register renaming is free, mca does not model that there it seems)
https://godbolt.org/z/k6wFoz
Likewise, not worth it if we just get load folding:
https://godbolt.org/z/1M1deG

https://bugs.llvm.org/show_bug.cgi?id=43381

Diff Detail

Event Timeline

lebedev.ri created this revision.Sep 21 2019, 4:46 AM
lebedev.ri edited the summary of this revision. (Show Details)Sep 21 2019, 4:46 AM

Fixup comment, NFC.

Neat! If you have the time, the BZHI bits-to-preserve operand only needs MOVB for initialization. That being said, MOVL probably avoids partial register update stalls, so maybe that’s why you’re seeing a performance gain.

If you have the time, the BZHI bits-to-preserve operand only needs MOVB for initialization. That being said, MOVL probably avoids partial register update stalls, so maybe that’s why you’re seeing a performance gain.

Yes, it is intentional to use 32-bit writes to avoid partial register access.

davezarzycki accepted this revision.Sep 21 2019, 11:04 AM

Thanks for doing this work! I'm not an expert in this source code, so please wait for somebody else to approve it.

This revision is now accepted and ready to land.Sep 21 2019, 11:04 AM
craig.topper added inline comments.Sep 22 2019, 10:59 AM
llvm/test/CodeGen/X86/bmi-x86_64.ll
33

This doesn't look like an obvious improvement. The movq in the original code is basically free. So it was really 2 uops. The new code is 3 uops.

lebedev.ri added inline comments.Sep 22 2019, 11:17 AM
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
3441–3444

So, should we then fallback to BZHI only if we manage to fold the load?

llvm/test/CodeGen/X86/bmi-x86_64.ll
33

That is not what mca says, i guess it's not modelled in those sched models?
Is there any particularly good intel cpu schedule model that is in LLVM i should use as reference?

But yes, that is true.

craig.topper added inline comments.
llvm/test/CodeGen/X86/bmi-x86_64.ll
33

I don't think move elimination is modeled in any of the scheduler models. @andreadb is that right?

davezarzycki added inline comments.Sep 22 2019, 12:02 PM
llvm/test/CodeGen/X86/bmi-x86_64.ll
33

Hi @craig.topper – I think the original bug report about a load not being folded into BZHI is being lost in the noise of this change proposal.

craig.topper added inline comments.Sep 22 2019, 12:20 PM
llvm/test/CodeGen/X86/bmi-x86_64.ll
33

I think this proposed change is too general. The original case is only using BZHI to avoid a MOVABSQ to load the AND mask. So I think it's largely an i64 specific issue and we should probably handle it as such.

lebedev.ri edited the summary of this revision. (Show Details)

Scaling down ambigiousness of the patch: if not using BEXTR, only use BZHI if either the mask is larger than 32-bit, or the load folding will happen.

I'm not sure if it's okay to leave newly-created noted like that if we decide it's not worthwhile folding?

lebedev.ri marked 5 inline comments as done.Sep 22 2019, 12:54 PM

What if we just do the larger than 32-bit mask? Its not clear that making BZHI just to fold a load is an improvement. You have to materialize an immediate instead so the total uops increased.

What if we just do the larger than 32-bit mask? Its not clear that making BZHI just to fold a load is an improvement. You have to materialize an immediate instead so the total uops increased.

Actually, i think it's clearly still not an improvement: https://godbolt.org/z/1M1deG

lebedev.ri edited the summary of this revision. (Show Details)
andreadb added inline comments.Sep 22 2019, 1:35 PM
llvm/test/CodeGen/X86/bmi-x86_64.ll
33

Sorry for the late reply. I only saw the message now.

Move elimination is currently only modelled for BtVer2 (Jaguar allows a limited form of move elimination for cases where the source operand is known to be zero). That being said, BtVer2 does not feature BMI2.

Move elimination is only enabled for models that provide a definition of tablegen class RegisterFile and a definition of IsOptimizableRegisterMove.

What if we just do the larger than 32-bit mask? Its not clear that making BZHI just to fold a load is an improvement. You have to materialize an immediate instead so the total uops increased.

Actually, i think it's clearly still not an improvement: https://godbolt.org/z/1M1deG

@craig.topper adjusted in last update

lebedev.ri requested review of this revision.Sep 22 2019, 2:39 PM
craig.topper accepted this revision.Sep 22 2019, 2:54 PM

LGTM with those comment fixes.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
3485

available*

3487

Stray period at the beginning of the comment.

This revision is now accepted and ready to land.Sep 22 2019, 2:54 PM
lebedev.ri marked 2 inline comments as done.

Fixup comments.

This revision was automatically updated to reflect the committed changes.