lebedev.ri (Roman Lebedev)
User

Projects

User does not belong to any projects.

User Details

User Since
Oct 27 2012, 6:35 AM (286 w, 2 d)

Recent Activity

Today

lebedev.ri added inline comments to D45976: [InstCombine] Simplify Add with remainder expressions as operands..
Mon, Apr 23, 11:28 AM
lebedev.ri added inline comments to D45733: [DAGCombiner] Unfold scalar masked merge if profitable.
Mon, Apr 23, 10:46 AM
lebedev.ri updated the diff for D45733: [DAGCombiner] Unfold scalar masked merge if profitable.

Update with @spatel's suggested matcher.

Mon, Apr 23, 10:46 AM
lebedev.ri added a comment to D45944: Disallow pointers to const in __sync_fetch_and_xxx.

Please always upload all patches with full context (-U99999)

Mon, Apr 23, 4:34 AM

Yesterday

lebedev.ri added a comment to D45766: [Sema] Add -Wno-self-assign-overloaded.

Ping. At least one of these needs to land.

Sun, Apr 22, 2:34 PM
lebedev.ri created D45931: [ASTMatchers] Don't garble the profiling output when multiple TU's are processed.
Sun, Apr 22, 10:46 AM

Sat, Apr 21

lebedev.ri updated the diff for D45733: [DAGCombiner] Unfold scalar masked merge if profitable.

NFC, rebased.

Sat, Apr 21, 10:57 AM
lebedev.ri updated the diff for D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.

Split tests with variable mask and constant mask into separate files, add a few more tests with different constant mask patterns.

Sat, Apr 21, 10:57 AM
lebedev.ri added a comment to D45862: [InstCombine] Relax restriction in foldSelectInstWithICmp for sake of smaller code size.

I don't think we should be doing any of these select-of-constant transforms in instcombine.

It's worse for code analysis, more IR instructions, and may be detrimental to perf. Think about the cases where a conditional move executes at the same speed as a simple add (Ryzen?) or we have profile data for the compare, so branch prediction is perfect.

There's lots of code that does this kind of thing in the DAG, and that's where I think it belongs (using target hooks as needed). There was some discussion about this on llvm-dev here:
https://groups.google.com/forum/#!topic/llvm-dev/pid_thv2X-A

So I think we should be removing some of these transforms from instcombine rather than adding to them.

Sat, Apr 21, 3:53 AM
lebedev.ri added a comment to D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.

LGTM. You could increase diversity in the constant mask tests by not using a single-string-of-set-bits constant (eg 0x0f0f0f0f instead of 0x0000ffff).
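To make the distinction in that suggestion concrete, here is a minimal C sketch that tells a single contiguous run of set bits (such as 0x0000ffff) apart from a scattered pattern (such as 0x0f0f0f0f). The helper name is_single_run_mask is hypothetical; it is not taken from the patch or from LLVM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: returns true when the set
   bits of m form exactly one contiguous run, e.g. 0x0000ffff or 0x00ff0000. */
bool is_single_run_mask(uint32_t m) {
  if (m == 0)
    return false; /* no set bits, so there is no run at all */
  /* Fill the zeros below the lowest set bit: 0x00ff0000 -> 0x00ffffff. */
  uint32_t filled = m | (m - 1);
  /* A single run now looks like 0...01...1, so adding 1 carries past every
     set bit and leaves nothing in common with the filled value. */
  return (filled & (uint32_t)(filled + 1)) == 0;
}
```

With this check, 0x0000ffff is classified as a single run and 0x0f0f0f0f is not, which is the kind of extra variety the comment asks for in the constant-mask tests.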

Sat, Apr 21, 2:11 AM
lebedev.ri added a comment to D45893: add more "anchors".

(the fact that phabricator just ignores some mails is annoying)

Sat, Apr 21, 1:51 AM

Fri, Apr 20

lebedev.ri added inline comments to D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.
Fri, Apr 20, 11:50 AM
lebedev.ri updated the diff for D45733: [DAGCombiner] Unfold scalar masked merge if profitable.

NFC, rebased on top of the rebased tests, with CFI noise dropped.

Fri, Apr 20, 11:50 AM
lebedev.ri updated the diff for D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.

Get rid of CFI noise now that it is possible after rL330453.

Fri, Apr 20, 11:50 AM
lebedev.ri added a comment to D45862: [InstCombine] Relax restriction in foldSelectInstWithICmp for sake of smaller code size.

Other than that inline comment, i think it would be nice to commit
the baseline tests (as of trunk), so the effect of this proposal could be observed.
Even if the code changes won't land, this would at least document the current behavior.

Fri, Apr 20, 11:01 AM
lebedev.ri added inline comments to D45862: [InstCombine] Relax restriction in foldSelectInstWithICmp for sake of smaller code size.
Fri, Apr 20, 2:58 AM
lebedev.ri added a dependency for D45867: [InstCombine] Unfold masked merge with constant mask: D45664: [InstCombine] Canonicalize variable mask in masked merge .
Fri, Apr 20, 2:14 AM
lebedev.ri added a dependent revision for D45664: [InstCombine] Canonicalize variable mask in masked merge : D45867: [InstCombine] Unfold masked merge with constant mask.
Fri, Apr 20, 2:14 AM
lebedev.ri added a dependency for D45867: [InstCombine] Unfold masked merge with constant mask: D45866: [InstCombine][NFC] Add tests for unfolding masked merge with constant mask.
Fri, Apr 20, 2:14 AM
lebedev.ri added a dependent revision for D45866: [InstCombine][NFC] Add tests for unfolding masked merge with constant mask: D45867: [InstCombine] Unfold masked merge with constant mask.
Fri, Apr 20, 2:14 AM
lebedev.ri created D45867: [InstCombine] Unfold masked merge with constant mask.
Fri, Apr 20, 2:13 AM
lebedev.ri created D45866: [InstCombine][NFC] Add tests for unfolding masked merge with constant mask.
Fri, Apr 20, 2:12 AM
lebedev.ri updated the diff for D45664: [InstCombine] Canonicalize variable mask in masked merge .

Streamlined the matcher, rebased on top of the revisited multi-use tests.

Fri, Apr 20, 2:10 AM
lebedev.ri updated the diff for D45663: [InstCombine][NFC] Add tests for variable mask canonicalization in masked merge.

Revisit multi-use tests.

Fri, Apr 20, 2:09 AM

Thu, Apr 19

lebedev.ri added a comment to D45855: [InstCombine] Support BitTests in ThreeWayComparison. General case, part 1.

Please upload all patches with the full context (-U99999).

Thu, Apr 19, 11:21 PM
lebedev.ri added a comment to D45854: [InstCombine] Support BitTests in ThreeWayComparison. Trivial case.

Please upload all patches with the full context (-U99999).

Thu, Apr 19, 11:21 PM
lebedev.ri added a comment to D45856: [InstCombine] Support BitTests in ThreeWayComparison. General case, part 2.

Please upload all patches with the full context (-U99999).

Thu, Apr 19, 11:21 PM
lebedev.ri added inline comments to D45862: [InstCombine] Relax restriction in foldSelectInstWithICmp for sake of smaller code size.
Thu, Apr 19, 11:21 PM
lebedev.ri added inline comments to D45828: [PatternMatch] Stabilize the matching order of commutative matchers.
Thu, Apr 19, 2:54 PM
lebedev.ri added inline comments to D45828: [PatternMatch] Stabilize the matching order of commutative matchers.
Thu, Apr 19, 10:46 AM
lebedev.ri added a dependency for D45664: [InstCombine] Canonicalize variable mask in masked merge : D45828: [PatternMatch] Stabilize the matching order of commutative matchers.
Thu, Apr 19, 10:34 AM
lebedev.ri added a dependent revision for D45828: [PatternMatch] Stabilize the matching order of commutative matchers: D45664: [InstCombine] Canonicalize variable mask in masked merge .
Thu, Apr 19, 10:34 AM
lebedev.ri updated the diff for D45664: [InstCombine] Canonicalize variable mask in masked merge .

Rebased on top of D45828.

Thu, Apr 19, 10:34 AM
lebedev.ri created D45828: [PatternMatch] Stabilize the matching order of commutative matchers.
Thu, Apr 19, 10:33 AM
lebedev.ri updated the diff for D45663: [InstCombine][NFC] Add tests for variable mask canonicalization in masked merge.
Thu, Apr 19, 10:14 AM
lebedev.ri added inline comments to D45664: [InstCombine] Canonicalize variable mask in masked merge .
Thu, Apr 19, 9:32 AM
lebedev.ri abandoned D45654: [InstCombine][NFC] Add tests for mask canonicalization in masked merge.
Thu, Apr 19, 3:55 AM
lebedev.ri abandoned D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern.

See my comment in D45733. Sorry that I didn't get a chance to review/discuss this sooner.

I doubt the premise that this pattern should exist in the first place, so I suspect we don't want to add logic to transform it. But if it does exist, then and/and/or is the best canonical form for the IR - it's better for bit-tracking analysis and better for codegen.

Thu, Apr 19, 3:54 AM

Wed, Apr 18

lebedev.ri updated the diff for D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.

Revisited the tests once more and added some more complex patterns (with non-trivial 'y' and/or 'm') that i expect could fail to match.

Wed, Apr 18, 11:10 AM
lebedev.ri updated the diff for D45733: [DAGCombiner] Unfold scalar masked merge if profitable.
  • Rebased on top of the revised tests
  • Stop handling cases with constant mask. instcombine should unfold them.
Wed, Apr 18, 11:10 AM
lebedev.ri added a dependency for D45733: [DAGCombiner] Unfold scalar masked merge if profitable: D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.
Wed, Apr 18, 11:06 AM
lebedev.ri added a dependent revision for D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding: D45733: [DAGCombiner] Unfold scalar masked merge if profitable.
Wed, Apr 18, 11:06 AM
lebedev.ri added inline comments to D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.
Wed, Apr 18, 10:57 AM
lebedev.ri added a comment to D45733: [DAGCombiner] Unfold scalar masked merge if profitable.

I don't think it will be possible to check that until after the instcombine part has landed, so ok, at least for now i will stop unfolding [constant mask] in dagcombine.

While there, any hint re pattern matchers for this code?

Unfortunately, DAG nodes don't have any equivalent match() infrastructure like IR that I know of.

Boo :(

Wed, Apr 18, 10:01 AM
lebedev.ri added a comment to D45733: [DAGCombiner] Unfold scalar masked merge if profitable.

Yeah, that is the question i'm having. I did look at mca output.

Here is what MCA says about that for -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a75


Or is this a scheduling info problem?

Cool - a chance to poke at llvm-mca! (cc @andreadb and @courbet)

First thing I see is that it's harder to get the sequence we're after on x86 using the basic source premise:

int andandor(int x, int y)  {
  __asm volatile("# LLVM-MCA-BEGIN ands");
  int r = (x & 42) | (y & ~42);
  __asm volatile("# LLVM-MCA-END ands");
  return r;
}

int xorandxor(int x, int y) {
  __asm volatile("# LLVM-MCA-BEGIN xors");
  int r = ((x ^ y) & 42) ^ y;
  __asm volatile("# LLVM-MCA-END xors");
  return r;
}

...because the input param register doesn't match the output result register. We'd have to hack that in asm...or put the code in a loop, but subtract the loop overhead somehow. Things work/look alright to me other than that.

I simply stored the lhs and rhs sides of the // CHECK lines from aarch64's @in32_constmask in two local files,
ran llvm-mca on each of them, and diffed the output; no clang was involved.

Wed, Apr 18, 9:30 AM
lebedev.ri added a comment to D45733: [DAGCombiner] Unfold scalar masked merge if profitable.

If the mask is constant, right now i always unfold it.

Let me make sure I understand. The fold in question is:

%n0 = xor i4 %x, %y
%n1 = and i4 %n0, C1
%r  = xor i4 %n1, %y
=>
%mx = and i4 %x, C1
%my = and i4 %y, ~C1
%r = or i4 %mx, %my

Yes.
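As a sanity check of the fold quoted above, here is a small C sketch that compares the xor/and/xor form against the and/and/or form for every 4-bit x, y and mask, mirroring the i4 types in the IR. The function names are illustrative only and do not come from the patch.

```c
#include <stdint.h>

/* xor/and/xor form of the masked merge: ((x ^ y) & m) ^ y. */
uint32_t merge_xor(uint32_t x, uint32_t y, uint32_t m) {
  return ((x ^ y) & m) ^ y;
}

/* and/and/or form of the masked merge: (x & m) | (y & ~m). */
uint32_t merge_andor(uint32_t x, uint32_t y, uint32_t m) {
  return (x & m) | (y & ~m);
}

/* Exhaustively compares both forms over all 4-bit values of x, y and the
   mask c, keeping only the low 4 bits of each result.
   Returns the number of mismatches found. */
int count_i4_mismatches(void) {
  int mismatches = 0;
  for (uint32_t x = 0; x < 16; ++x)
    for (uint32_t y = 0; y < 16; ++y)
      for (uint32_t c = 0; c < 16; ++c)
        if ((merge_xor(x, y, c) & 0xF) != (merge_andor(x, y, c) & 0xF))
          ++mismatches;
  return mismatches;
}
```

The two forms agree bit by bit: where a mask bit is 1, (x ^ y) ^ y yields the bit of x; where it is 0, 0 ^ y yields the bit of y. That is exactly what the and/and/or form selects.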

Wed, Apr 18, 8:12 AM
lebedev.ri created D45766: [Sema] Add -Wno-self-assign-overloaded.
Wed, Apr 18, 5:52 AM
lebedev.ri retitled D45744: [libFuzzer] Add experimental feature to not use AFL's deferred forkserver. from Add experimental feature to not use AFL's deferred forkserver. to [libFuzzer] Add experimental feature to not use AFL's deferred forkserver..
Wed, Apr 18, 4:36 AM

Tue, Apr 17

lebedev.ri added a comment to D45736: [SimplifyLibcalls] Replace locked IO with unlocked IO.

to gain better speed

Out of curiosity, what are the motivational cases, benchmarks?

if we know, that there is no fork or pthread_create calls in the current module.

I'm not sure this is sufficient.
I'll be super surprised if this could be done outside of a LTO build with no dynamic linking.

Motivation? I tried to "getchar and putchar" a 2 MB file, and using the unlocked variants I got a time difference of around 0.1 s.

I believe there are more reasons to apply such optimizations, if possible. If we can have faster IO code, why not?

Why not sufficient?
Can you explain it more?
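For context on what "unlocked" means here, below is a minimal C sketch of a character-by-character copy using the POSIX getc_unlocked/putc_unlocked calls. It is not what the patch does: the patch would substitute the unlocked calls directly, whereas this sketch takes each stream's lock once around the loop with flockfile so that it stays safe even if other threads exist. The function name copy_stream_unlocked is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Copies everything from in to out one character at a time and returns the
   number of characters copied. Each stream's lock is taken once for the
   whole loop instead of once per getc/putc call. */
long copy_stream_unlocked(FILE *in, FILE *out) {
  assert(in != NULL && out != NULL);
  long copied = 0;
  int ch;
  flockfile(in);
  flockfile(out);
  while ((ch = getc_unlocked(in)) != EOF) {
    if (putc_unlocked(ch, out) == EOF)
      break; /* write error: stop copying */
    ++copied;
  }
  funlockfile(out);
  funlockfile(in);
  return copied;
}
```

Dropping the flockfile/funlockfile pair is only safe when no other thread can touch the same streams, which is the condition the review is questioning.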

Tue, Apr 17, 3:26 PM
lebedev.ri added inline comments to D45733: [DAGCombiner] Unfold scalar masked merge if profitable.
Tue, Apr 17, 3:20 PM
lebedev.ri added a comment to D45736: [SimplifyLibcalls] Replace locked IO with unlocked IO.

to gain better speed

Tue, Apr 17, 2:26 PM
lebedev.ri updated the summary of D45733: [DAGCombiner] Unfold scalar masked merge if profitable.
Tue, Apr 17, 11:54 AM
lebedev.ri created D45733: [DAGCombiner] Unfold scalar masked merge if profitable.
Tue, Apr 17, 11:54 AM
lebedev.ri updated the diff for D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.

Revised tests, dropped vectors for now.

Tue, Apr 17, 11:54 AM
lebedev.ri added a comment to D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.

Don't know what changes are planned here, but this is on the right track. We want to have coverage of the possible canonical IR variations for various targets.

I'm working on that right now, got it working, maybe will update this + post the dagcombiner part (that is where i should have put it, right?) in a few hours.

Tue, Apr 17, 9:30 AM
lebedev.ri added inline comments to D45601: Warn on bool* to bool conversion.
Tue, Apr 17, 5:28 AM

Mon, Apr 16

lebedev.ri added a comment to D45685: [Sema] Add -wtest global flag that silences -Wself-assign for overloaded operators..

Uuuh, the fact that phab posts the top-postings, but silently ignores inline replies is annoying.

Mon, Apr 16, 12:41 PM
lebedev.ri added a comment to D45685: [Sema] Add -wtest global flag that silences -Wself-assign for overloaded operators..

I'm not sure this is a practical direction to pursue - though perhaps
others disagree.

Mon, Apr 16, 12:07 PM
lebedev.ri added a comment to D45685: [Sema] Add -wtest global flag that silences -Wself-assign for overloaded operators..

I don't understand the spelling of this option. You spell it -wtest (because rjmccall suggested that spelling); but for my money it should be spelled -Wno-self-assign-nonfield.

You did read the documentation change, right?

Mon, Apr 16, 11:18 AM
lebedev.ri updated the diff for D45685: [Sema] Add -wtest global flag that silences -Wself-assign for overloaded operators..

Actually make it -wtest (all lowercase).

Mon, Apr 16, 8:10 AM
lebedev.ri updated the diff for D45685: [Sema] Add -wtest global flag that silences -Wself-assign for overloaded operators..
  • Don't mis-spell the name of the flag.
Mon, Apr 16, 6:44 AM
lebedev.ri added a comment to D44883: [Sema] Extend -Wself-assign and -Wself-assign-field to warn on overloaded self-assignment (classes).

There are several options:

  1. @rjmccall's idea: -wtest (lowercase), which in this case will disable that new code in BuildOverloadedBinOp(). i quite like it actually.
  2. split it up like i had in the first revision - `-Wself-assign-builtin, -Wself-assign-field-builtin; -Wself-assign-overloaded, -Wself-assign-field-overloaded`
    • we could just assume that BuildOverloadedBinOp() implies overloaded,
    • or check that the particular operator is non-trivial
  3. ???

@rjmccall, @thakis, @dblaikie, @aaron.ballman, @brooksmoses, @chandlerc

I'm going to go ahead and look into 1., since it does not seem there will be any consensus in a timely manner.
Mon, Apr 16, 6:41 AM
lebedev.ri created D45685: [Sema] Add -wtest global flag that silences -Wself-assign for overloaded operators..
Mon, Apr 16, 6:38 AM
lebedev.ri updated the diff for D45664: [InstCombine] Canonicalize variable mask in masked merge .

Rebased

Mon, Apr 16, 3:37 AM
lebedev.ri updated the diff for D45663: [InstCombine][NFC] Add tests for variable mask canonicalization in masked merge.

Add one more test file, with all the variations with inverted operands, deduplicated.
Originally generated with


While some of those aren't really a masked merge, they could be reduced further.

Mon, Apr 16, 3:37 AM
lebedev.ri added a comment to D44883: [Sema] Extend -Wself-assign and -Wself-assign-field to warn on overloaded self-assignment (classes).

There are several options:

  1. @rjmccall's idea: -wtest (lowercase), which in this case will disable that new code in BuildOverloadedBinOp(). i quite like it actually.
  2. split it up like i had in the first revision - `-Wself-assign-builtin, -Wself-assign-field-builtin; -Wself-assign-overloaded, -Wself-assign-field-overloaded`
    • we could just assume that BuildOverloadedBinOp() implies overloaded,
    • or check that the particular operator is non-trivial
  3. ???
Mon, Apr 16, 3:06 AM
lebedev.ri added a comment to D44883: [Sema] Extend -Wself-assign and -Wself-assign-field to warn on overloaded self-assignment (classes).

Re false-positives - at least two [post-]reviewers need to agree on the way forward (see previous comments, mail thread), and then i will implement it.

Mon, Apr 16, 2:11 AM
lebedev.ri added a comment to D44883: [Sema] Extend -Wself-assign and -Wself-assign-field to warn on overloaded self-assignment (classes).

I have noticed two things when attempting to release LLVM with this revision internally at Google:

  1. It's catching real bugs, all in constructors where someone wrote "member_ = member_" when they meant "member_ = member".

Nice, just like the one that caused me to write this :)

Mon, Apr 16, 12:01 AM

Sun, Apr 15

lebedev.ri added a comment to D45631: [InstCombine] Simplify 'xor'/'add' to 'or' if no common bits are set..

LGTM.

Thank you for the review!

Sun, Apr 15, 11:35 AM
lebedev.ri added inline comments to D45631: [InstCombine] Simplify 'xor'/'add' to 'or' if no common bits are set..
Sun, Apr 15, 8:18 AM
lebedev.ri updated the diff for D45631: [InstCombine] Simplify 'xor'/'add' to 'or' if no common bits are set..

Adjusted based on @spatel review.
Thank you for the review!

Sun, Apr 15, 8:18 AM
lebedev.ri updated the diff for D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern.

Rebased, slightly cleaned up the matcher by using m_CombineAnd().

Sun, Apr 15, 6:24 AM
lebedev.ri updated the diff for D45664: [InstCombine] Canonicalize variable mask in masked merge .

Slightly clean up the matcher by using m_CombineAnd(),
which allows using m_c_And() when looking for the A part.

Sun, Apr 15, 6:05 AM
lebedev.ri updated the diff for D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern.

nullptr-init Mc too.
Does not appear to matter, but i have already hit the bug when
i forgot to do that to M, so better safe than sorry.

Sun, Apr 15, 3:49 AM
lebedev.ri added a dependency for D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern: D45664: [InstCombine] Canonicalize variable mask in masked merge .
Sun, Apr 15, 3:37 AM
lebedev.ri added a dependent revision for D45664: [InstCombine] Canonicalize variable mask in masked merge : D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern.
Sun, Apr 15, 3:37 AM
lebedev.ri updated the diff for D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern.

Rebased on top of the less controversial D45664, reducing the diff's size.

Sun, Apr 15, 3:37 AM
lebedev.ri added a dependent revision for D45663: [InstCombine][NFC] Add tests for variable mask canonicalization in masked merge: D45664: [InstCombine] Canonicalize variable mask in masked merge .
Sun, Apr 15, 3:11 AM
lebedev.ri added a dependency for D45664: [InstCombine] Canonicalize variable mask in masked merge : D45663: [InstCombine][NFC] Add tests for variable mask canonicalization in masked merge.
Sun, Apr 15, 3:11 AM
lebedev.ri created D45663: [InstCombine][NFC] Add tests for variable mask canonicalization in masked merge.
Sun, Apr 15, 3:11 AM
lebedev.ri created D45664: [InstCombine] Canonicalize variable mask in masked merge .
Sun, Apr 15, 3:11 AM
lebedev.ri updated the diff for D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern.

Drop duplicate tests.
I have just realized that the non-const mask is not canonicalized yet either (https://godbolt.org/g/NKPHGF), so i'll rework this slightly.

Sun, Apr 15, 2:27 AM
lebedev.ri updated the diff for D45654: [InstCombine][NFC] Add tests for mask canonicalization in masked merge.

Drop duplicate tests.

Sun, Apr 15, 2:27 AM

Sat, Apr 14

lebedev.ri updated the diff for D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern.

Tidy up comments and the isNonCanonicalMask().

Sat, Apr 14, 8:51 AM
lebedev.ri added a dependency for D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern: D45654: [InstCombine][NFC] Add tests for mask canonicalization in masked merge.
Sat, Apr 14, 8:14 AM
lebedev.ri added a dependent revision for D45654: [InstCombine][NFC] Add tests for mask canonicalization in masked merge: D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern.
Sat, Apr 14, 8:14 AM
lebedev.ri created D45655: [InstCombine][RFC] Canonicalize constant mask in masked merge mattern.
Sat, Apr 14, 8:14 AM
lebedev.ri created D45654: [InstCombine][NFC] Add tests for mask canonicalization in masked merge.
Sat, Apr 14, 8:14 AM

Fri, Apr 13

lebedev.ri retitled D45631: [InstCombine] Simplify 'xor'/'add' to 'or' if no common bits are set. from [InstCombine] Simplify 'xor'/'and' to 'or' if no common bits are set. to [InstCombine] Simplify 'xor'/'add' to 'or' if no common bits are set..
Fri, Apr 13, 3:04 PM
lebedev.ri updated the diff for D45631: [InstCombine] Simplify 'xor'/'add' to 'or' if no common bits are set..

Show that add is also handled.
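The identity behind this simplification can be checked exhaustively on small values. The C sketch below walks all 8-bit pairs with no common set bits and confirms that add, xor and or give the same result; the function name is illustrative and not from the patch.

```c
#include <stdint.h>

/* Returns the number of 8-bit pairs (a, b) with no common set bits for
   which a + b, a ^ b and a | b do not all agree. */
int count_disjoint_mismatches(void) {
  int mismatches = 0;
  for (uint32_t a = 0; a < 256; ++a) {
    for (uint32_t b = 0; b < 256; ++b) {
      if ((a & b) != 0)
        continue; /* only pairs with disjoint bits qualify */
      uint32_t sum = a + b; /* no carries can occur, so this fits in 8 bits */
      if (sum != (a ^ b) || sum != (a | b))
        ++mismatches;
    }
  }
  return mismatches;
}
```

This follows from a + b = (a ^ b) + 2 * (a & b) and a | b = (a ^ b) | (a & b): when a & b is zero, all three operations collapse to the same value.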

Fri, Apr 13, 3:04 PM
lebedev.ri planned changes to D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.

Need to think more about the patterns/tests. Not needed *yet* anyway.

Fri, Apr 13, 1:11 PM
lebedev.ri created D45631: [InstCombine] Simplify 'xor'/'add' to 'or' if no common bits are set..
Fri, Apr 13, 10:42 AM
lebedev.ri added a comment to D45601: Warn on bool* to bool conversion.

...
This makes me wary of making this a compiler diagnostic, but clang-tidy may still be a reasonable place for this functionality to live.

Fri, Apr 13, 6:07 AM
lebedev.ri added inline comments to D45601: Warn on bool* to bool conversion.
Fri, Apr 13, 4:50 AM
lebedev.ri added a comment to D45615: [builtins] __builtin_dump_struct : added more types format.

Tests?

Fri, Apr 13, 2:56 AM
lebedev.ri added a comment to D45601: Warn on bool* to bool conversion.

@hiraditya I personally don't like when i'm being told so, but i'd like to see some numbers...
Please run this on some big C++ project (LLVM (but you'll have to enable this diag specifically), google chrome, ???), and analyse the results.

Fri, Apr 13, 2:52 AM

Thu, Apr 12

lebedev.ri updated the diff for D45539: [InstCombine]: foldSelectICmpAndAnd(): and is commutative.

Make foldSelectICmpAndAnd() much smaller, as suggested by @spatel.

Thu, Apr 12, 1:56 PM
lebedev.ri added a comment to D45539: [InstCombine]: foldSelectICmpAndAnd(): and is commutative.

Side note: this patch / commit should not be labeled 'NFC'.

Thu, Apr 12, 1:45 PM
lebedev.ri retitled D45539: [InstCombine]: foldSelectICmpAndAnd(): and is commutative from [InstCombine][NFC]: foldSelectICmpAndAnd(): and is commutative to [InstCombine]: foldSelectICmpAndAnd(): and is commutative.
Thu, Apr 12, 1:45 PM
lebedev.ri updated the diff for D45563: [X86][AArch64][NFC] Add tests for masked merge unfolding.

Actually test aarch64 in aarch64 test file, oops.

Thu, Apr 12, 7:44 AM