Page MenuHomePhabricator

Type legalizer for masked gather/scatter intrinsics
ClosedPublic

Authored by delena on Oct 11 2015, 4:08 AM.

Details

Summary

Full type legalizer that works with all vectors length - from 2 to 16, (i32, i64, float, double).

This intrinsic, for example
void @llvm.masked.scatter.v2f32(<2 x float>%data , <2 x float*>%ptrs , i32 align , <2 x i1>%mask )
requires type widening for data and type promotion for mask.

Diff Detail

Repository
rL LLVM

Event Timeline

delena updated this revision to Diff 37052.Oct 11 2015, 4:08 AM
delena retitled this revision from to Type legalizer for masked gather/scatter intrinsics.
delena updated this object.
delena added reviewers: hfinkel, spatel, qcolombet.
delena set the repository for this revision to rL LLVM.
delena added a subscriber: llvm-commits.
delena updated this revision to Diff 38113.Oct 22 2015, 5:03 AM
delena added reviewers: mkuper, zansari.

Ping.

spatel resigned from this revision.Oct 22 2015, 8:01 AM
spatel removed a reviewer: spatel.

Sorry, but I haven't looked at scatter/gather at all, so I don't think I can provide any help with this patch.

mbodart added inline comments.Oct 29 2015, 1:43 PM
../lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
503 ↗(On Diff #38113)

minor typos: "ther" => "the"

And a little below, and in several other places in this file/change set: "Legalized the chain result ..." => "Legalize the chain result ..."

1235 ↗(On Diff #38113)

This is the first time I've looked at vector type legalization, so there's something I don't quite understand. A masked store (PromoteIntOp_MSTORE) is in some sense just a degenerative case of a masked scatter. So intuitively it would seem PromoteIntOp_MSTORE would have a more simple implementation than PromoteIntOp_MSCATTER. But here it seems the reverse is true. Why does PromoteIntOp_MSTORE try to legalize both the mask and data simultaneously, while for scatter we just do the one operand? Could PromoteIntOp_MSTORE be simplified? If not, what is the essential difference?

../lib/CodeGen/SelectionDAG/LegalizeTypes.h
744 ↗(On Diff #38113)

Can you please change the parameter name from WidenVT to NVT, so that it matches this comment and the code in LegalizeVectorTypes.cpp?

Also a typo: defalut => default

746–751 ↗(On Diff #38113)

The implementation of UseExistingVal gives it an ambiguous meaning. If a widening operation is an even multiple of the original size, then the full original operand is replicated to fill out the new value. Otherwise, only its first element is replicated. Was that the intent? If so, it seems like a difficult interface to use robustly.

An enum with possible values of Zext, Splat or Undef would make the widening interface
more clear, assuming the Splat ambiguity is resolved to consistently replicate all original values.

But as a further question, when would a larger vector length not be a multiple of a smaller one?
Aren't they all powers of 2?

../lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2704–2714 ↗(On Diff #38113)

Now that ModifyToType supports zero fill, why aren't we using it here (and in WidenVecOp_MSTORE)?

3663–3664 ↗(On Diff #38113)

I would think we want this moved below the check for "if (InVT == NVT)".

../lib/Target/X86/X86ISelLowering.cpp
1595–1596 ↗(On Diff #38113)

Why would 512-bit gathers and scatters have different operation actions here?

19510 ↗(On Diff #38113)

It's unfortunate that we have to duplicate much of the functionality of the generic type modifier in CodeGen's ModifyToType.

Is there a way to unify them?

If not, please add a source comment describing how this implementation differs from the one in CodeGen.

../test/CodeGen/X86/masked_gather_scatter.ll
2 ↗(On Diff #38113)

Have you tried your changes with gathers/scatters targetting 32-bit X86?

Poking around in the X86 test area, it seems all triples use x86_64 for gather/scatter tests.

delena marked 3 inline comments as done.Nov 2 2015, 5:07 AM
delena added inline comments.
../lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
1235 ↗(On Diff #38113)

I wanted to simplify MLOAD / MSTORE but it requires additional changes in X86ISelLowering.cpp. I can do this in a separate patch.

../lib/CodeGen/SelectionDAG/LegalizeTypes.h
746–751 ↗(On Diff #38113)

The implementation of UseExistingVal gives it an ambiguous meaning. If a widening operation is an even multiple of the original size, then the full original operand is replicated to fill out the new value. Otherwise, only its first element is replicated. Was that the intent? If so, it seems like a difficult interface to use robustly.

When I extend vector of indices, I want to add existing values (replicated small vector or replicated first element - does not matter). The gather/scatter will be faster, as far as I know (I'll check it again). But now I think that may be this is too X86 specific?
May be I should fill indices with "undef" and then replace these "undefs" in target specific part?

An enum with possible values of Zext, Splat or Undef would make the widening interface more clear

I thought about this, but I don't see too many enums in LLVM. All enums have global senсe, not per function.

But as a further question, when would a larger vector length not be a multiple of a smaller one?

The original type may not be power of 2. It is type legalizer, it should be able to deal with any type.
< 3 x i64 > -> < 4 x i64 >

../lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2704–2714 ↗(On Diff #38113)

I'll simplify MSTORE in a separate patch.

../lib/Target/X86/X86ISelLowering.cpp
1595–1596 ↗(On Diff #38113)

The SCATTER is more problematic than GATHER. MSCATTER node returns only the Chain.
The VPSCATTER instruction in X86 zeroes mask operand and I should specify it as "return value".

19510 ↗(On Diff #38113)

I simplified the X86 version. I call it ExtendToType. It works with legal types only and optimized for X86.

../test/CodeGen/X86/masked_gather_scatter.ll
2 ↗(On Diff #38113)

I tried before, saw that it works. I added tests for 32-bit.

delena updated this revision to Diff 38897.Nov 2 2015, 5:12 AM
delena removed a reviewer: zansari.

Updated the patch according to Mitch's comments.

mbodart added inline comments.Nov 3 2015, 9:57 AM
../include/llvm/CodeGen/SelectionDAGNodes.h
2117–2119 ↗(On Diff #38897)

This new assertion is fine.

But why are we removing the assertion that the mask's element type is i1?

../lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
1235 ↗(On Diff #38897)

OK

../lib/CodeGen/SelectionDAG/LegalizeTypes.h
746–752 ↗(On Diff #38897)

I agree that this optimization is too X86 specific, and CodeGen should simply extend with undef values.

If you want to proceed with those X86-specific extensions, please do so in a separate change set.

That will simplify this change set, and render the other comments here moot for now.
If it turns out that replication is needed, we will want to define its behavior more crisply.

../lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2704–2715 ↗(On Diff #38897)

That's fine. But then can you please add a "FIX ME" comment here indicating
such a change is desired?

2739 ↗(On Diff #38897)

A simple source comment here to the effect "// Zero extend the mask" would help readers remember the meaning of "true".

2754 ↗(On Diff #38897)

Another instance of "Legalized" => "Legalize".

../lib/Target/X86/X86ISelLowering.cpp
11732 ↗(On Diff #38897)

Unrelated to your change, I'm finding the optimizations of MVT::i1 here confusing as there
are no checks for SubVecVT.getVectorNumElements(). The code appears to be making the
assumption that when IdxVal is 0, or half the full vector length, then SubVec is exactly half the
size of Vec. Why is that not being checked?

It's also unclear to me how VSHLI/VSHRI operate on a vector of MVT::i1 elements.
When operating on a vector of i32 or i64 elements, I would have thought these
instructions perform a bit shift of each individual element (not cross element).
But it seems the code here is trying to shift the whole i1 vector left and right,
as if it is one big integer. Maybe that's how VSHLI/VSHRI are defined to behave,
I don't know. Can you clarify?

19484 ↗(On Diff #38897)

Another case where this new Undef creation should probably be moved below the InVT == NVT early return.

../test/CodeGen/X86/masked_gather_scatter.ll
170 ↗(On Diff #38897)

I don't understand how SKX can use 32-bit indices here (i.e., vpgatherdd instead of vpgatherqd), when targetting x86_64. How does lowering of the @llvm.masked.gather.v8i32 calls know that only the low 32-bits of the <8 x i32*> pointer values are needed?
Is there some analysis of the insertelement;shufflevector/getelementptr instructions which feed to < 8 x i32*> pointers?

delena updated this revision to Diff 39058.Nov 3 2015, 11:25 AM

Extending indices does not require replication. I simplified the functions that modify values to a new type.

delena added inline comments.Nov 3 2015, 11:48 AM
../lib/CodeGen/SelectionDAG/LegalizeTypes.h
746–750 ↗(On Diff #39058)

I removed "Slpat". I don't need it.
"Zext" extension is not target specific, at least for the mask.

../lib/Target/X86/X86ISelLowering.cpp
11732 ↗(On Diff #39058)

There are the special SHIFT instructions for mask vector KSHIFTR, KSHIFTL (in AVX-512)

When we insert v8i1 into v16i1, the index should be 8 (I'll add an assertion) or 0.
If v16i1 is allzero, it's enough to shift the input vector left-right.

../test/CodeGen/X86/masked_gather_scatter.ll
170 ↗(On Diff #39058)

The base address in %rdi. (In %edi for 32-bit). In %ymm0 we have only indices. The real address of each element is "base +index*scale" -> %rdi + ymm[i] * 4

delena marked an inline comment as done.Nov 5 2015, 5:28 AM
delena added inline comments.
../include/llvm/CodeGen/SelectionDAGNodes.h
2117–2119 ↗(On Diff #39058)

Mask types are not always legal. When <2xi1> is illegal, the type legalizer promotes it to <2 x i64>.
I create the node with <2 x i64> mask and then handle it in X86 code.

mbodart added inline comments.Nov 6 2015, 4:04 PM
../include/llvm/CodeGen/SelectionDAGNodes.h
2117–2119 ↗(On Diff #39058)

Allowing non-i1 element types, even temporarily, seems like it breaks the definition of MGATHER/MSCATTER. Do we need to update the LLVM documentation to allow simd masks here? Or is it the intent to require i1 masks throughout LLVM IR, and only allow simd masks while we perform type legalization during during DAG selection? If the latter, that still seems a bit risky as type legalization can use utilities in the non-target-specific parts of CodeGen, and they would not like to see a non-i1 mask element. That's just paranoid speculation on my part. I don't know whether DAG selection is in general allowed to take such liberties.

A related question: Why isn't <2xi1> legalized to a full mask register size (<64xi1>, <32xi1>, <16xi1> or <8xi1>) instead of <2xi64>?
Is this just inherent in type legalization of Nxi1 vectors, or is it needed for an AVX2 simd mask?

../test/CodeGen/X86/masked_gather_scatter.ll
170 ↗(On Diff #39058)

OK, thanks.

delena updated this revision to Diff 41020.Nov 24 2015, 3:57 AM

I added a function WidenTargetBoolean() for mask legalization in all masked operations load/store/gather/scatter.

Hi Elena,

I added just a few more comments in line.

regards,

  • mitch
../include/llvm/CodeGen/SelectionDAGNodes.h
2116 ↗(On Diff #41020)

Not your change, but minor typo: PathThru => PassThru

../lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
521 ↗(On Diff #41020)

Is there any particular reason that the mask operand legalization is handled in a different place for MLOAD and MGATHER? For MLOAD it is processed during PromoteIntRes_MLOAD, while for MGATHER it is processed when legalizing its operands. The MGATHER method seems cleaner, but I'm curious as to why MLOAD does it differently. Also note that for MLOAD, the call to PromoteTargetBoolean is conditioned on whether the type mismatches, while in PromoteIntOp_MGATHER, the call is unconditional. I don't know if these inconsistencies are innocuous from a functionality standpoint, but they certainly make it difficult to understand the requirements.

1239 ↗(On Diff #41020)

There appears to be an extraneous blank line here.

../lib/CodeGen/SelectionDAG/LegalizeTypes.cpp
1115 ↗(On Diff #41020)

It seems important here that the value returned by GetPromotedInteger matches the promotion operation (e.g.. sign extension) performed by PromoteTargetBoolean.
How is that guaranteed?

../lib/CodeGen/SelectionDAG/LegalizeTypes.h
177 ↗(On Diff #41020)

Should that be "of ValVT", not "if"?

../lib/Target/X86/X86ISelLowering.cpp
19738 ↗(On Diff #41020)

It seems odd to me to allow a masked store/scatter where the value element size differs from the memory element size. Is this an unavoidable result of type legalization? What are the semantics of such an operation? If the value is i64 and the memory is i32, apparently we just store the low 32 bits of each i64 element. Is that the expected behavior? Do we ever have to worry about a size difference in the opposite direction, e.g. storing an i32 to an i64?

../test/CodeGen/X86/masked_gather_scatter.ll
1244 ↗(On Diff #41020)

Please keep the last <3 x i32> all on the same line.

delena updated this revision to Diff 41614.Dec 2 2015, 5:31 AM
delena marked 5 inline comments as done.

Fixed WidenTargetBoolean() and added more tests to check that the mask is sign-extended.
Thanks to Mitch for catching this.

Addressed all other issues.

Sorry for adding a line to X86InstrAVX512.td, just one of the new tests triggered the failure.
I'm trying to reproduce the same failure on a standalone test to make a separate commit.

mbodart added inline comments.Dec 2 2015, 5:06 PM
../lib/CodeGen/SelectionDAG/LegalizeTypes.cpp
1115 ↗(On Diff #41020)

Using SExt unconditionally seems incorrect.
Don't we want to make the choice of ZExt vs SExt dependent on getBooleanContents?
Perhaps we should add getExtendForVectorContent, similar to the existing scalar getExtendForContent.

delena added inline comments.Dec 3 2015, 3:55 AM
../lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
521 ↗(On Diff #41020)

No reason. Just the old code. I removed the mask promotion and all tests still pass.

1239 ↗(On Diff #41020)

I did not see any blank line. I'll run dos2unix on all files again.

../lib/CodeGen/SelectionDAG/LegalizeTypes.cpp
1115 ↗(On Diff #41020)

Fixed! Thank you.

../lib/Target/X86/X86ISelLowering.cpp
19738 ↗(On Diff #41020)

This is the case of promotion of v2i32 to v2i64. Only MemVT keeps the original VT. I'm actually "redo" the TypeLegalizer's work.
The type legalizer promoted v2i32 to v2i64 and I retrieve v2i32 with shuffle {0, 2, -1, -1} and then widen the result to v4i32.
I'll add comments.

delena updated this revision to Diff 42651.Dec 13 2015, 12:16 AM

Updated revision according to Mitch's comments.

Hi Elena,

I added few more minor comments and a question, but in general it LGTM.

regards,

  • mitch
../lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3216 ↗(On Diff #42651)

Redundant assert, already checked a few lines above.

../lib/Target/X86/X86ISelLowering.cpp
19849 ↗(On Diff #42651)

Minor mechanics question: should we first check if Mask's element type is i1, and if so, avoid inserting the TRUNCATE?

19946 ↗(On Diff #42651)

For a 32-bit target, we can gather 16 elements with a single gather.
Though the comment says 8 is the minimum number of elements in
a gather, which is true, the code also seems to be treating this as
the maximum number. How are 16-element gathers supported?

26980 ↗(On Diff #42651)

It is hard to interpret this comment, and thus the correctness of this routine,
without any additional context.
Why is a mask always truncated at this point?
How do you know that it is truncated down to
the element size of the SIGN_EXTEND_INREG's operand?
Please add some supporting comments answering these kinds of questions.

../test/CodeGen/X86/masked_gather_scatter.ll
2–4 ↗(On Diff #42651)

I would think we also want a test for i386 and avx512vl.

delena marked an inline comment as done.Dec 15 2015, 12:24 AM
delena added inline comments.
../lib/Target/X86/X86ISelLowering.cpp
19849 ↗(On Diff #42651)

getNode(ISD::TRUNCATE..) does this optimization.

19946 ↗(On Diff #42651)

We always can sign-extend the indices. VGATHERQPS is supported in 32-bit mode.

26980 ↗(On Diff #42651)

I added more words here:

// Gather and Scatter instructions use k-registers for masks. The type of
// the masks is v*i1. So the mask will be truncated anyway.
// The SIGN_EXTEND_INREG my be dropped.
../test/CodeGen/X86/masked_gather_scatter.ll
2–4 ↗(On Diff #42651)

I added.

This revision was automatically updated to reflect the committed changes.