This is an archive of the discontinued LLVM Phabricator instance.

AVX512 : Integer Truncate with/without saturation support
ClosedPublic

Authored by igorb on Jul 15 2015, 6:20 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

igorb updated this revision to Diff 29779.Jul 15 2015, 6:20 AM
igorb retitled this revision from to AVX512 : Integer Truncate with/without saturation support.
igorb updated this object.
igorb added reviewers: delena, AsafBadouh.
igorb set the repository for this revision to rL LLVM.
igorb added a subscriber: llvm-commits.
ab added a subscriber: ab.Jul 15 2015, 7:49 AM
ab added inline comments.
include/llvm/IR/IntrinsicsX86.td
5294

So, this file is starting to get pretty unwieldy. Why not use multiclasses?

include/llvm/Target/TargetSelectionDAG.td
494–495

I'm not a fan of this intermediate state. Can you add the same comment that's above ld/st ("Don't use this directly..."), and/or move this somewhere else?

Basically I'd prefer masked_load and masked_store to be adjacent.

496

Indentation?

lib/Target/X86/X86ISelLowering.cpp
1371–1372

From my understanding, BWI doesn't imply VLX (even though SKX has both). So, shouldn't these be predicated by hasVLX() as well? I mention the same issue in another comment, so I might be missing something.

12488

"vpmovqwb" ?

12489–12497

I see that 512 bit vectors are handled in l12497, so the added block is only useful for smaller VTRUNCs. Two questions:

  • should l12498 also check hasBWI() for WB? Otherwise we'll get 512 bit BW VTRUNCs even without BWI, no? (not selectable)
  • should l12491-12494 also check hasVLX() when hasBWI()? Otherwise we'll get 128/256 bit BW VTRUNCs even without VLX, no? (Also not selectable, I think?)

So, would it make sense to replace l12491-12494 with:

if (Subtarget->hasVLX() && !InVT.is512BitVector() &&
    (InVT.getVectorElementType() != MVT::i16 || Subtarget->hasBWI())
15225–15226

How about:

necessary:
- casting ...
- extend ...
15231

capitalization? extend -> Extend (here and elsewhere)

But I'm not convinced it's a good idea to reuse the same function. Perhaps it's different enough to warrant another function for masking truncstores?

15242–15256

Would it be wrong to do this all the time, even for non-truncstores?

So that you can remove extendMask completely, and just decide based on MaskVT/Mask.getVT() ?

15248
}
else{

->

} else {
15267

Why? I see the patterns match X86select, but what's wrong with ISD::VSELECT?

15405

true -> /*ExtendMask=*/true

24128

unnecessary whitespace, -> if (TLI

lib/Target/X86/X86InstrAVX512.td
5786–5788

Multiclass? !add(0x10) or something for the opcode.

lib/Target/X86/X86InstrFragmentsSIMD.td
117–124

Vtunc -> Vtrunc

Why "Op"?

Also, is there a better place for this? Perhaps with the other SDTPs (or the lone one in this file)?

lib/Target/X86/X86IntrinsicsInfo.h
178

What about the saturating _mem variants? If you add them, would that let you remove the avx512_trunc_mr_lowering patterns?

test/CodeGen/X86/avx512-trunc-ext.ll
0

Do you mind splitting the file in trunc and ext tests?

9

readnone isn't useful; also, can you use an attribute group?

delena added inline comments.Jul 15 2015, 11:14 AM
include/llvm/IR/IntrinsicsX86.td
5294

Ahmed, we don't want to change format of this file now. We need to add hundreds of intrinsics and we generate these lines. Multiclasses will require manual code. It will significantly slow down the process.

hfinkel added inline comments.
include/llvm/IR/IntrinsicsX86.td
5294

Could you contribute the code that you use to generate the intrinsic definitions?

I'm not entirely sure that a multiclass would help here (as I recall, TableGen does not have a 'map' data structure, and that's probably what you want here). However, to the extent possible, we should have human-maintainable code. TableGen alone might not currently be the best tool for this right now, but if so, we should have a detailed understanding of why.

delena added inline comments.Jul 16 2015, 1:35 AM
include/llvm/IR/IntrinsicsX86.td
5294

We can't contribute the generator. It is an informal semiautomatic process based on parsing intrinscs headers. But it significantly simplifies our life when we need to add such amount of intrinsics.

ab added inline comments.Jul 16 2015, 3:14 AM
include/llvm/IR/IntrinsicsX86.td
5294

At least some of it looks easily multiclass-able, no? For instance, the _mem_ variants, or the saturation kind. That's already a factor of 6. We can also have a multiclass for the size variants. So, even without being smart with types, you can do:

multiclass int_avx512_mask_pmov_base<string SatKind, string Size,
                                     LLVMType InVT, LLVMType OutVT, LLVMType ScVT> {
  def int_x86_avx512_mask_pmov#SatKind#_#NAME#_#Size :
          GCCBuiltin<"__builtin_ia32_pmov"#SatKind#NAME#Size#"_mask">,
          Intrinsic<[OutVT],
                    [InVT, OutVT, ScVT],
                    [IntrNoMem]>;
  def int_x86_avx512_mask_pmov#SatKind#_#NAME#_mem_#Size :
          GCCBuiltin<"__builtin_ia32_pmov"#SatKind#NAME#Size#"mem_mask">,
          Intrinsic<[],
                    [llvm_ptr_ty, InVT, ScVT],
                    [IntrReadWriteArgMem]>;
}

multiclass int_x86_avx512_mask_pmov_base_sized<string Size, LLVMType InVT, LLVMType OutVT,
                                               LLVMType ScVT> {
  defm NAME#"" : int_avx512_mask_pmov_base<  "", Size, InVT, OutVT, ScVT>;
  defm NAME#"" : int_avx512_mask_pmov_base< "s", Size, InVT, OutVT, ScVT>;
  defm NAME#"" : int_avx512_mask_pmov_base<"us", Size, InVT, OutVT, ScVT>;
}

multiclass int_x86_avx512_mask_pmov_all<LLVMType ScVT128, LLVMType InVT128, LLVMType OutVT128,
                                        LLVMType ScVT256, LLVMType InVT256, LLVMType OutVT256,
                                        LLVMType ScVT512, LLVMType InVT512, LLVMType OutVT512> {
  defm NAME : int_x86_avx512_mask_pmov_base_sized<"128", InVT128, OutVT128, ScVT128>;
  defm NAME : int_x86_avx512_mask_pmov_base_sized<"256", InVT256, OutVT256, ScVT256>;
  defm NAME : int_x86_avx512_mask_pmov_base_sized<"512", InVT512, OutVT512, ScVT512>;
}

let TargetPrefix = "x86" in {
  defm qb : int_x86_avx512_mask_pmov_all<llvm_i8_ty,  llvm_v2i64_ty,  llvm_v16i8_ty,
                                         llvm_i8_ty,  llvm_v4i64_ty,  llvm_v16i8_ty,
                                         llvm_i8_ty,  llvm_v8i64_ty,  llvm_v16i8_ty>;
  defm qw : int_x86_avx512_mask_pmov_all<llvm_i8_ty,  llvm_v2i64_ty,  llvm_v8i16_ty,
                                         llvm_i8_ty,  llvm_v4i64_ty,  llvm_v8i16_ty,
                                         llvm_i8_ty,  llvm_v8i64_ty,  llvm_v8i16_ty>;
  defm qd : int_x86_avx512_mask_pmov_all<llvm_i8_ty,  llvm_v2i64_ty,  llvm_v4i32_ty,
                                         llvm_i8_ty,  llvm_v4i64_ty,  llvm_v4i32_ty,
                                         llvm_i8_ty,  llvm_v8i64_ty,  llvm_v8i32_ty>;
  defm db : int_x86_avx512_mask_pmov_all<llvm_i8_ty,  llvm_v4i32_ty,  llvm_v16i8_ty,
                                         llvm_i8_ty,  llvm_v8i32_ty,  llvm_v16i8_ty,
                                         llvm_i16_ty, llvm_v16i32_ty, llvm_v16i8_ty>;
  defm dw : int_x86_avx512_mask_pmov_all<llvm_i8_ty,  llvm_v4i32_ty,  llvm_v8i16_ty,
                                         llvm_i8_ty,  llvm_v8i32_ty,  llvm_v8i16_ty,
                                         llvm_i16_ty, llvm_v16i32_ty, llvm_v16i16_ty>;
  defm wb : int_x86_avx512_mask_pmov_all<llvm_i8_ty,  llvm_v8i16_ty,  llvm_v16i8_ty,
                                         llvm_i16_ty, llvm_v16i16_ty, llvm_v16i8_ty,
                                         llvm_i32_ty, llvm_v32i16_ty, llvm_v32i8_ty>;
}

Which makes the differences more obvious, I think (though it's not ideal either) (and I think we can handle types by faking maps (using !cast<>(##)) and just passing the scalar types).

I admit trunc/ext are one of the worst examples though (but only because of the widening of the smaller types, I think). For instance, a few hundred lines above, the arithmetic instructions seem even more easily factorizable.

ab added a comment.Jul 16 2015, 3:15 AM

(Note that you can also reduce noise by passing ValueTypes, and building LLVMTypes in the base multiclass. I hit another tblgen bug while trying that though, I'll have to look into it some more ;)

igorb updated this revision to Diff 29913.Jul 16 2015, 8:18 AM
igorb marked 14 inline comments as done.

Ahmed,Thank you very much for your review! I have updated the patch according to your comments

igorb added inline comments.Jul 16 2015, 11:21 AM
lib/Target/X86/X86ISelLowering.cpp
15231

extendMask flag removed

15267

ISD::VSELECT can be lower and combined later to different sequence ( broadcast for example), to prevent this behavior i use X86ISD::SELECT

15405

case removed

lib/Target/X86/X86InstrAVX512.td
5786–5788

I think that in this case a multiclass would not help here, it will not decrease the code lines numbers and most probably make code more difficult for understanding.

lib/Target/X86/X86IntrinsicsInfo.h
178

Yes , you are correct, it will also simplify the implementation . But in order to do this i need to extend StoreSDNode and MaskedStoreSDNode class to hold additional information - saturation type.
Do you think it is a good idea to extend these classes?

delena added inline comments.Jul 16 2015, 11:23 AM
lib/Target/X86/X86ISelLowering.cpp
15242

space before {

15264

Please add a comment why do you need X86ISD::SELECT here

16098

You do not use IntrData and Subtarget in this function.

16109

remove 1 empty line

16116

remove spaces

16129

indentation

igorb updated this revision to Diff 30106.Jul 18 2015, 11:51 PM
igorb marked 6 inline comments as done.

Elena,Thank you very much for your review! I have updated the patch according to your comments.

delena added inline comments.Jul 21 2015, 2:13 AM
lib/Target/X86/X86ISelLowering.cpp
15267

X86ISD::SELECT

We can't use ISD::VSELECT here because it is not allays "Legal" for the destination type.

16095

1-2 lines of comments before function, please.

16269

remove {}

24127

The comments here are also should be fixed as bellow.

24128

remove {}

24243

Comments:
The truncating store is legal in some cases. For example, the following SKX instructions are designated for truncate store.
In this case we don't need any further transformations.

24244

remove {}

igorb updated this revision to Diff 30256.Jul 21 2015, 7:25 AM
igorb marked 7 inline comments as done.
delena edited edge metadata.Jul 21 2015, 8:17 AM

Please fix 2 more typos and you can commit this patch.

lib/Target/X86/X86ISelLowering.cpp
15267

always

15268

requires (2 places)

This revision was automatically updated to reflect the committed changes.
igorb updated this revision to Diff 30580.Jul 24 2015, 10:06 AM
igorb edited edge metadata.
igorb removed rL LLVM as the repository for this revision.
igorb marked 2 inline comments as done.

new diff uploaded.
previous commit was reverted (tablegen failed, Sparc target).
New changes was made, part of implementation was moved from TargetSelectionDAG.td to X86InstrFragmentsSIMD.td