Page MenuHomePhabricator

AVX512: VMOVDQU8/16/32/64 (load) intrinsic implementation.
ClosedPublic

Authored by igorb on Jan 13 2016, 2:16 AM.

Details

Summary

AVX512: VMOVDQU8/16/32/64 (load) intrinsic implementation.

LowerOperationWrapper() for X86 target was changed in this patch.
The reason is the following: we have Load Intrinsics on avx512 with 64-bit mask (the <64 x i8> case). I64 requires legalization on 32-bit mode.
All other intrinsics with 64-bit mask are lowered during type legalization and it works perfect.
But the Load Intrinsic is legalized and converted to a Load SDNode. And the Load SDNode has 2 results - value and chain.
That's why we decided to customize the LowerOperationWrapper() - in order to push more than one result.
But, in some cases, the original node has one result and the new node comes with two. This is the case of SINT_TO_FP with store-load (fild).
In this case we just drop the second result (the chain).

Diff Detail

Repository
rL LLVM

Event Timeline

igorb updated this revision to Diff 44719.Jan 13 2016, 2:16 AM
igorb retitled this revision from to AVX512: VMOVDQU8/16/32/64 (load) intrinsic implementation..
igorb updated this object.
igorb added reviewers: delena, AsafBadouh.
igorb set the repository for this revision to rL LLVM.
igorb added a subscriber: llvm-commits.
igorb updated this object.Jan 14 2016, 6:35 AM
igorb added a reviewer: DavidKreitzer.
mbodart edited edge metadata.Jan 14 2016, 3:33 PM

I'm not sure I understand why this is just now becoming an issue.

Is the need for an X86-specific override of LowerOperationWrapper driven by an existing problem
with SINT_TO_FP lowering, or by a problem that is only exposed when adding the new masked load intrinsics?

Is the chain being dropped for both SINT_TO_FP and masked loads, or just one of them.

What are the safety consequences of dropping the chain, wrt losing an ordering dependence?

And why is I64 write mask legalization fine for most masked intrinsics, but not the masked loads?
Is it because none of the other existing I64-masked intrinsics produce an additional chain result?

A concrete example or two, showing the DAG snippets during legalization , would be helpful.

  • mitch
lib/Target/X86/X86ISelLowering.h
689 ↗(On Diff #44719)

Please add the "override" indicator.

lib/Target/X86/X86InstrAVX512.td
2753 ↗(On Diff #44719)

Can you please explain why these patterns are being deleted?

test/CodeGen/X86/avx512-intrinsics.ll
6643–6644 ↗(On Diff #44719)

This kind of test is OK for now.
But note that the second vmovdqu32 is completely redundant.
And the last vmovdqu32's source operand could be replaced by %zmm0.
So the test may need maintenance as optimizations improve.

This same pattern is used in many test functions in these test files,
and of course they will all have the same issue.

igorb updated this revision to Diff 45104.Jan 17 2016, 4:43 AM
igorb edited edge metadata.
igorb marked 2 inline comments as done.

Thanks for the review!

igorb added a comment.Jan 17 2016, 4:44 AM

I'm not sure I understand why this is just now becoming an issue.

Is the need for an X86-specific override of LowerOperationWrapper driven by an existing problem
with SINT_TO_FP lowering, or by a problem that is only exposed when adding the new masked load intrinsics?

The problem exposed only when adding new masked load intrinsics. In previous implementation only one value was taken ( chain was dropped ).

Is the chain being dropped for both SINT_TO_FP and masked loads, or just one of them.

Only for SINT_TO_FP ( or any other similar nodes).

What are the safety consequences of dropping the chain, wrt losing an ordering dependence?

In SINT_TO_FP store->load chain preserved, as i understand LOAD node chain could be dropped.

And why is I64 write mask legalization fine for most masked intrinsics, but not the masked loads?
Is it because none of the other existing I64-masked intrinsics produce an additional chain result?

Yes, masked load intrinsic (packed bytes) operand type legalization is the first one that produce additional chain result.

A concrete example or two, showing the DAG snippets during legalization , would be helpful.

 SINT_TO_FP DAG snippet

 t6: f64 = sint_to_fp t5
    t5: i64 = build_pair t2, t4
      t2: i32,ch = CopyFromReg t0, Register:i32 %vreg0
        t0: ch = EntryToken
      t4: i32,ch = CopyFromReg t0, Register:i32 %vreg1

Transformed to

  t13: f64,ch = X86ISD::FILD<LD8[FixedStack0]> t11, FrameIndex:i32<0>, ValueType:ch:i64
    t11: ch = store<ST8[FixedStack0](align=4)> t0, t5, FrameIndex:i32<0>, undef:i32
      t0: ch = EntryToken
      t5: i64 = build_pair t2, t4
        t2: i32,ch = CopyFromReg t0, Register:i32 %vreg0
        t4: i32,ch = CopyFromReg t0, Register:i32 %vreg1                 
  --------------------------------------------------------
masked loads snippet

  t12: v64i8,ch = llvm.x86.avx512.mask.loadu.b.512<LD64[%x0](align=1)> t0, TargetConstant:i32<4681>, t3, t5, t17
    t0: ch = EntryToken
    t3: i32,ch = load<LD4[FixedStack-1](align=16)> t0, FrameIndex:i32<-1>, undef:i32
    t5: v64i8,ch = CopyFromReg t0, Register:v64i8 %vreg0
    t17: i64,ch = load<LD8[FixedStack-2](align=4)> t0, FrameIndex:i32<-2>, undef:i32
    
Transformed to
    
  t30: v64i8,ch = masked_load<LD64[%x0](align=1)> t0, t3, t29, t5
    t0: ch = EntryToken
    t3: i32,ch = load<LD4[FixedStack-1](align=16)> t0, FrameIndex:i32<-1>, undef:i32
    t29: v64i1 = concat_vectors t27, t28
      t27: v32i1 = bitcast t24
        t24: i32 = extract_element t17, Constant:i32<0>
          t17: i64,ch = load<LD8[FixedStack-2](align=4)> t0, FrameIndex:i32<-2>, undef:i32
      t28: v32i1 = bitcast t26
        t26: i32 = extract_element t17, Constant:i32<1>
    t5: v64i8,ch = CopyFromReg t0, Register:v64i8 %vreg0
lib/Target/X86/X86InstrAVX512.td
2753 ↗(On Diff #44719)

This intrinsics is handled by DAG Legalization pass (X86ISelLowering.cpp , lowerINTRINSIC_W_CHAIN() function)

Hi Asaf,

Thanks for the answers.

Note that SelectionDAGLegalize::LegalizeOp, in the case of custom lowering, already handles
the case of multiple results, albeit with this FIXME comment:

// FIXME: The handling for custom lowering with multiple results is
// a complete mess.

So it is not clear to me why the new masked load intrinsics require additional changes,
especially since we have existing masked load intrinsics with a chain result.
I'm guessing it's because legalization of the I64 mask operand on IA32 does not go through
this SelectionDAGLegalize::LegalizeOp path. Is that correct?

Assuming we do need an X86 override of LowerOperationWrapper, then my next thought
would be to try and fix the FILD issue in X86TargetLowering::FP_TO_INTHelper. But I don't
know how to express there, in Selection Dag representation, that only one result is needed.
If you know how to do that, then that would be a preferable solution.

If we really need LowerOperationWrapper to discard the FILD's Chain operand, then I think
we need to make it clear in that routine that it should only be happening for the FILD case,
as in general it is not safe to simply drop a result. So a source comment there describing the
situation is needed. And when we do need to drop a result, I think there should be
an assertion checking exactly for the FILD case. That is, we should only ever drop
one result, and it should be a Chain, and if reasonable, we should check that the retained
result is an FILD. This will make it clear to developers why this code is needed, and
thus make it easier to modify in the future.

Thanks for updating the test functions!
One other thing I noticed about the test files is that most of them do not test for a 32-bit target.
That doesn't need to be added for this change set. But as I64 masking has interesting behavior
on IA32, it would be good to beef up testing there.

regards,

  • mitch
AsafBadouh edited edge metadata.Jan 19 2016, 9:10 AM

I think you meant "Hi Igor"
:)

Oops! Thanks for pointing that out Asaf.

Yes, thanks Igor

delena edited edge metadata.Jan 19 2016, 12:16 PM
delena added a subscriber: delena.
  • Elena

Hi Asaf,

Thanks for the answers.

Note that SelectionDAGLegalize::LegalizeOp, in the case of custom lowering, already handles
the case of multiple results, albeit with this FIXME comment:

// FIXME: The handling for custom lowering with multiple results is
// a complete mess.

So it is not clear to me why the new masked load intrinsics require additional changes,
especially since we have existing masked load intrinsics with a chain result.
I'm guessing it's because legalization of the I64 mask operand on IA32 does not go through
this SelectionDAGLegalize::LegalizeOp path. Is that correct?

The existing mask load intrinsics does not have i64 operand. (You mean @llvm.masked.load(), right).
They go to promote/widening.

Assuming we do need an X86 override of LowerOperationWrapper, then my next thought
would be to try and fix the FILD issue in X86TargetLowering::FP_TO_INTHelper. But I don't
know how to express there, in Selection Dag representation, that only one result is needed.
If you know how to do that, then that would be a preferable solution.

This workaround is not necessary. There are many places in the code, where we convert a one-result-node to two-result-node.
It happens every time, when we use memory for node transformation. In all these cases we drop the second result. We even don't check that the second result is a chain. See ReplaceAllUsesWith() code:

void SelectionDAG::ReplaceAllUsesWith(SDNode *From, const SDValue *To) {

if (From->getNumValues() == 1)  // Handle the simple case efficiently.
  return ReplaceAllUsesWith(SDValue(From, 0), To[0]);

If we really need LowerOperationWrapper to discard the FILD's Chain operand, then I think
we need to make it clear in that routine that it should only be happening for the FILD case,
as in general it is not safe to simply drop a result. So a source comment there describing the
situation is needed. And when we do need to drop a result, I think there should be
an assertion checking exactly for the FILD case. That is, we should only ever drop
one result, and it should be a Chain, and if reasonable, we should check that the retained
result is an FILD. This will make it clear to developers why this code is needed, and
thus make it easier to modify in the future.

Thanks for updating the test functions!
One other thing I noticed about the test files is that most of them do not test for a 32-bit target.
That doesn't need to be added for this change set. But as I64 masking has interesting behavior
on IA32, it would be good to beef up testing there.

regards,

  • mitch

Thanks for the explanations (and patience)!

As dropping the chain seems to be common practice, and the source comment in LowerOperationWrapper describes that, I am fine with the changes.

So, LGTM.

  • mitch

Thanks everybody for the review!

This revision was automatically updated to reflect the committed changes.