The clang part: https://reviews.llvm.org/D38672
lib/IR/AutoUpgrade.cpp:1284

Can you write this more like the following code? I think it's clearer.

```cpp
void decodeVSHUF64x2FamilyMask(MVT VT, unsigned Imm,
                               SmallVectorImpl<int> &ShuffleMask) {
  unsigned NumLanes = VT.getSizeInBits() / 128;
  unsigned NumElementsInLane = 128 / VT.getScalarSizeInBits();
  unsigned ControlBitsMask = NumLanes - 1;
  unsigned NumControlBits = NumLanes / 2;

  for (unsigned l = 0; l != NumLanes; ++l) {
    unsigned LaneMask = (Imm >> (l * NumControlBits)) & ControlBitsMask;
    // We actually need the other source.
    if (l >= NumLanes / 2)
      LaneMask += NumLanes;
    for (unsigned i = 0; i != NumElementsInLane; ++i)
      ShuffleMask.push_back(LaneMask * NumElementsInLane + i);
  }
}
```
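For reference, here is a minimal standalone sketch (my own illustration, not part of the patch) of what that decode produces for a 512-bit v8i64 shuffle. The MVT queries are replaced with hard-coded values, and the immediate 0x1B is just a hypothetical example.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const unsigned VectorBits = 512, ScalarBits = 64; // pretend the MVT is v8i64
  const unsigned Imm = 0x1B;                        // hypothetical example immediate

  unsigned NumLanes = VectorBits / 128;             // 4 x 128-bit lanes
  unsigned NumElementsInLane = 128 / ScalarBits;    // 2 x i64 per lane
  unsigned ControlBitsMask = NumLanes - 1;          // 0b11
  unsigned NumControlBits = NumLanes / 2;           // 2 immediate bits per destination lane

  std::vector<int> ShuffleMask;
  for (unsigned l = 0; l != NumLanes; ++l) {
    unsigned LaneMask = (Imm >> (l * NumControlBits)) & ControlBitsMask;
    // Destination lanes in the upper half read from the second source.
    if (l >= NumLanes / 2)
      LaneMask += NumLanes;
    for (unsigned i = 0; i != NumElementsInLane; ++i)
      ShuffleMask.push_back(LaneMask * NumElementsInLane + i);
  }

  // For Imm = 0x1B this prints: 6 7 4 5 10 11 8 9
  for (int M : ShuffleMask)
    std::printf("%d ", M);
  std::printf("\n");
  return 0;
}
```

Each two-bit field of the immediate picks one of the four 128-bit source lanes, and destination lanes in the upper half read from the second source, which is why NumLanes is added to the selector there.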
lib/Target/X86/X86ISelLowering.cpp:29084

You should do this in combineBitcastForMaskedOp. It was already created to fix shuffles to work with masking.
Please can you add suitable -fast-isel.ll tests that match the code generated on the clang side in the avx512*-builtins.c tests.
Also, it's really pedantic but please can you prefix your patch subject titles with "[X86][AVX512]" - it makes it easier to spot them in all the llvm-commits/cfe-commits emails.
lib/Target/X86/X86ISelLowering.cpp:29084

That function is called only if the select operand is a bitcast, but in the 64x2 intrinsics there is no bitcast, so it won't be called.
For the 64-bit case it's probably still better to do the conversion from combineSelect, similar to what we do for the bitcasted case. It's somewhat unusual to walk the users in DAG combine.
Though it might make sense to turn every X86ISD::VPERM2X128 that we can into SHUF128, regardless of whether it's masked, assuming they have the same latency and port bindings. That would open up the XMM16-XMM31 register space for them.
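As a rough illustration of the equivalence being discussed (my own sketch, not from the patch; the helper names are made up), both instructions can express the same 128-bit lane selection, but only the EVEX-encoded form can address the upper sixteen YMM registers:

```cpp
// Sketch only; requires AVX512F + AVX512VL, e.g. -mavx512f -mavx512vl.
#include <immintrin.h>

// Low 128 bits from A's lane 0, high 128 bits from B's lane 0, via the
// VEX-encoded vperm2f128 (limited to ymm0-ymm15).
__m256d via_vperm2f128(__m256d A, __m256d B) {
  return _mm256_permute2f128_pd(A, B, 0x20);
}

// The same lane selection via the EVEX-encoded vshuff64x2 (can also use
// ymm16-ymm31): imm bit 0 picks A's lane for the low half, imm bit 1 picks
// B's lane for the high half.
__m256d via_vshuff64x2(__m256d A, __m256d B) {
  return _mm256_shuffle_f64x2(A, B, 0x0);
}
```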
I believe the X86ISelLowering.cpp changes are no longer necessary after r317403 and r317410. I've taught lowering to prefer the EVEX encoded SHUF instructions by default and then EVEX->VEX will turn them back into VPERM2F128 if they don't end up being masked, using broadcast loads, or using XMM16-31.
I think there's still a lost load folding opportunity since the SHUF instructions aren't commutable like the VPERM instructions. We can probably recover that with some isel patterns if we want.
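To make the commutability point concrete (again my own sketch, not from the patch): with VPERM2F128 a source swap can be compensated for in the immediate, which is what lets a load feeding either operand be folded, whereas VSHUFF64X2's immediate always ties the low half of the result to the first source.

```cpp
// Sketch only; compile with e.g. -mavx.
#include <immintrin.h>

// Low half = A's lane 0, high half = B's lane 0.
__m256d perm(__m256d A, __m256d B) {
  return _mm256_permute2f128_pd(A, B, 0x20);
}

// The same result with the sources swapped and the swap folded into the
// immediate; this is the rewrite that makes load folding work on either side.
__m256d perm_swapped(__m256d A, __m256d B) {
  return _mm256_permute2f128_pd(B, A, 0x02);
}

// vshuff64x2 has no such rewrite in general: its immediate always takes the
// low half of the result from the first source and the high half from the
// second, so swapping the sources changes the result.
```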
The X86ISelLowering.cpp changes are no longer necessary; these are the changes without them. @craig.topper
test/CodeGen/X86/avx512-intrinsics-fast-isel.ll:7

The first 3 tests should be corrected in the same way as noted in the comments on the clang part.