This is an archive of the discontinued LLVM Phabricator instance.

X86: Fold masked merge pattern when and-not is not available
ClosedPublic

Authored by MatzeB on Oct 28 2021, 1:48 PM.

Details

Summary

There are two equivalent expressions to compute a "masked merge":
a) (m & x) | (~m & y)
b) ((x ^ y) & m) ^ y

Variant a) is preferable when an and-not instruction is available, and we already optimize for that case (see DAGCombiner::unfoldMaskedMerge). This adds the reverse transformation, TargetLowering::foldMaskedMerge, and uses it in the X86 target when and-not is not available (and-not is part of the BMI extension, so it is not present on older cores). This speeds up cryptographic hash functions; I've seen the pattern in md5 and sha256 implementations.
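
For illustration only (this snippet is not part of the patch, and the function names are made up), the equivalence of the two forms is easy to check in isolation:

// Standalone sanity check of the masked-merge equivalence.
// For every bit: take the bit from x where m is 1, and from y where m is 0.
#include <cassert>
#include <cstdint>

uint32_t maskedMergeAndNot(uint32_t m, uint32_t x, uint32_t y) {
  return (m & x) | (~m & y); // variant a): wants an and-not (ANDN) instruction
}

uint32_t maskedMergeXor(uint32_t m, uint32_t x, uint32_t y) {
  return ((x ^ y) & m) ^ y;  // variant b): three plain ALU operations
}

int main() {
  const uint32_t vals[] = {0u, ~0u, 0xdeadbeefu, 0x12345678u};
  for (uint32_t m : vals)
    for (uint32_t x : vals)
      for (uint32_t y : vals)
        assert(maskedMergeAndNot(m, x, y) == maskedMergeXor(m, x, y));
  return 0;
}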

Diff Detail

Event Timeline

MatzeB created this revision. Oct 28 2021, 1:48 PM
MatzeB requested review of this revision. Oct 28 2021, 1:48 PM
Herald added a project: Restricted Project. Oct 28 2021, 1:48 PM
RKSimon added inline comments. Oct 29 2021, 5:58 AM
llvm/include/llvm/CodeGen/TargetLowering.h
4616 ↗(On Diff #383152)

Description?

llvm/lib/Target/X86/X86ISelLowering.cpp
46637

You don't need to make this part of X86TargetLowering; you have access via:

const TargetLowering &TLI = DAG.getTargetLoweringInfo();

46729

TLI.foldMaskedMerge(N, DAG)

Also, we have TargetLowering::hasAndNot - could we not use that and work to make this a generic combine?

MatzeB added a comment. Edited Oct 29 2021, 1:40 PM

Also, we have TargetLowering::hasAndNot - could we not use that and work to make this a generic combine?

Only some targets implement hasAndNot: AArch64, PowerPC, VE, and X86. Some of them have an and-not-like instruction anyway. On the other hand, when I add hasAndNot to other targets I see failures in recognizing min/max/saturation patterns. So unfortunately implementing this as a generic combine is not really an option; instead I provide a generic helper function here that target authors can start using once they have made sure there are no accidental regressions.
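
As a rough illustration (not the actual patch; the function name, the matching logic, and its structure are made up for this sketch), a hasAndNot-gated fold could look roughly like this:

// Sketch of a hasAndNot-gated masked-merge fold; not the actual patch.
// Only one operand ordering is matched here; a real combine would handle all
// commuted forms and add one-use checks.
#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/TargetLowering.h"
using namespace llvm;

static SDValue foldMaskedMergeSketch(SDNode *N, SelectionDAG &DAG) {
  // Expect N to be the OR at the root of (m & x) | (~m & y).
  if (N->getOpcode() != ISD::OR)
    return SDValue();
  SDValue N0 = N->getOperand(0), N1 = N->getOperand(1);
  if (N0.getOpcode() != ISD::AND || N1.getOpcode() != ISD::AND)
    return SDValue();

  // If the target has an and-not instruction, variant a) is already the
  // preferred form (see DAGCombiner::unfoldMaskedMerge), so do nothing.
  const TargetLowering &TLI = DAG.getTargetLoweringInfo();
  if (TLI.hasAndNot(N1))
    return SDValue();

  // Match the (~m & y) half as the second AND, with the NOT as its first
  // operand, and require the other AND to use the same mask m.
  SDValue M, X, Y;
  SDValue Not = N1.getOperand(0);
  if (Not.getOpcode() == ISD::XOR && isAllOnesConstant(Not.getOperand(1))) {
    M = Not.getOperand(0);
    Y = N1.getOperand(1);
    if (N0.getOperand(0) == M)
      X = N0.getOperand(1);
    else if (N0.getOperand(1) == M)
      X = N0.getOperand(0);
  }
  if (!X)
    return SDValue();

  // (m & x) | (~m & y)  -->  ((x ^ y) & m) ^ y
  SDLoc DL(N);
  EVT VT = N->getValueType(0);
  SDValue Xor = DAG.getNode(ISD::XOR, DL, VT, X, Y);
  SDValue And = DAG.getNode(ISD::AND, DL, VT, Xor, M);
  return DAG.getNode(ISD::XOR, DL, VT, And, Y);
}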

MatzeB updated this revision to Diff 383496. Oct 29 2021, 2:03 PM
MatzeB marked 2 inline comments as done.

Address review feedback.

MatzeB marked an inline comment as done. Oct 29 2021, 2:03 PM
RKSimon added inline comments. Nov 1 2021, 2:54 PM
llvm/include/llvm/CodeGen/TargetLowering.h
4621 ↗(On Diff #383496)

Based on the regressions you saw on other targets, do you have any thoughts on whether any other targets will be able to use this? Otherwise it might make sense to move this into X86ISelLowering.cpp until there's a need to make this generic.

llvm/test/CodeGen/X86/fold-masked-merge.ll
8

Please can you add some i8/i16/i32/i64 test coverage (it might be a good idea to rename the tests to be more descriptive as well).

You probably need to add some negative tests as well? Mismatching not-ops for instance.
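
To make the mismatched case concrete (illustrative C++ only, not the eventual test IR; the function name is made up): when the inverted mask is a different value than the plain AND's mask, the expression is not a masked merge, so rewriting it as ((x ^ y) & m) ^ y would be wrong.

#include <cstdint>

// NOT a masked merge: the plain AND uses m0, but the inverted AND uses m1.
// Folding this to ((x ^ y) & m0) ^ y would change the result whenever m0 != m1.
uint32_t not_a_masked_merge(uint32_t m0, uint32_t m1, uint32_t x, uint32_t y) {
  return (m0 & x) | (~m1 & y);
}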

MatzeB updated this revision to Diff 384609. Nov 3 2021, 4:25 PM
  • pre-commit tests in separate diff.
  • move code to X86ISelLowering.cpp
MatzeB marked an inline comment as done. Nov 3 2021, 4:28 PM
MatzeB added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
4621 ↗(On Diff #383496)

It mostly means other targets have to implement TargetLowering::hasAndNot properly and then adapt their ISel patterns to still trigger where necessary.

I have no immediate plans to enable the code for other targets. ARM, AArch64, and PowerPC feature an and-not instruction anyway and don't need this; at first glance it may help targets like RISCV, WebAssembly, or BCC, but I guess we can leave that for the respective target authors to discover.

OK, I'll move the code to X86ISelLowering.cpp; we can always move it back to a shared location when a second target starts using it.

llvm/test/CodeGen/X86/fold-masked-merge.ll
30–39

Note that this version currently fails because SelectionDAG does not seem to consistently move all operations to i16:

t0: ch = EntryToken
t2: i32,ch = CopyFromReg t0, Register:i32 %0
        t5: i32,ch = CopyFromReg t0, Register:i32 %1
      t19: i32 = and t2, t5
        t8: i32,ch = CopyFromReg t0, Register:i32 %2
            t3: i16 = truncate t2
          t12: i16 = xor t3, Constant:i16<-1>
        t24: i32 = any_extend t12
      t25: i32 = and t8, t24
    t22: i32 = or t19, t25
  t23: i16 = truncate t22
t17: ch,glue = CopyToReg t0, Register:i16 $ax, t23
t18: ch = X86ISD::RET_FLAG t17, TargetConstant:i32<0>, Register:i16 $ax, t17:1

(note the stray t24: i32 = any_extend t12...)

I'll leave the test here, but fixing this is outside the scope of this diff.

MatzeB updated this revision to Diff 384610. Nov 3 2021, 4:35 PM

clang-format

A few minor style comments.

Does anyone else have any comments?

llvm/lib/Target/X86/X86ISelLowering.cpp
46588

(style) remove unnecessary braces

46591

(style)

if (NotOp != And1_L)
  return SDValue();
46606

PerformDAGCombing ?

46610

assert message

46730

(style) remove braces

spatel added inline comments. Nov 8 2021, 9:11 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
46612

I suspect we want to avoid this transform if either operand has more than one use. Please add tests with something like this:

define i32 @masked_merge0_extra_use(i32 %a0, i32 %a1, i32 %a2, i32* %p1) {
  %and0 = and i32 %a0, %a1
  %not = xor i32 %a0, -1
  %and1 = and i32 %not, %a2
  store i32 %and1, i32* %p1 ; extra use of intermediate op
  %or = or i32 %and0, %and1
  ret i32 %or
}
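
A corresponding guard inside the combine could be as simple as the following (sketch only; And0, And1, and NotM are placeholder names, not variables from the patch):

// Sketch: skip the fold when an intermediate value has other users, since
// those nodes must be kept alive anyway and the fold would only add code.
if (!And0.hasOneUse() || !And1.hasOneUse() || !NotM.hasOneUse())
  return SDValue();
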
46726

typo: available

@matze reverse ping?

MatzeB updated this revision to Diff 389273. Nov 23 2021, 11:29 AM
MatzeB marked 7 inline comments as done.

Address review feedback

spatel accepted this revision. Nov 24 2021, 5:55 AM

LGTM

This revision is now accepted and ready to land. Nov 24 2021, 5:55 AM
MatzeB updated this revision to Diff 390486. Nov 29 2021, 3:08 PM
MatzeB updated this revision to Diff 390488. Nov 29 2021, 3:13 PM
This revision was automatically updated to reflect the committed changes.
llvm/test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll