For more details about these instructions, please refer to the latest ISE document: https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
clang/lib/Basic/Targets/X86.cpp | ||
---|---|---|
1059 | alphabetical order. | |
llvm/test/CodeGen/X86/avxvnniint16-intrinsics.ll | ||
3 | X64,CHECK | |
4 | X86,CHECK | |
llvm/test/CodeGen/X86/stack-folding-int-avxvnniint16.ll | ||
3 | Is this required? | |
3 | Don't need avx2 | |
llvm/test/MC/Disassembler/X86/avx-vnni-int16.txt | ||
1 ↗ | (On Diff #539822) | Remove blank line. |
llvm/test/MC/Disassembler/X86/x86-64-avx-vnni-int16.txt | ||
1–2 ↗ | (On Diff #539822) | Remove this. |
Address comments, fix lit fails, align naming convention in IntrinsicsX86.td
llvm/test/CodeGen/X86/avxvnniint16-intrinsics.ll | ||
---|---|---|
3 | This couldn't help merging the CHECKs here. Do we need it? |
llvm/test/CodeGen/X86/avxvnniint16-intrinsics.ll | ||
---|---|---|
4 | I thought the common prefix had to be first? But I might be wrong |
llvm/test/CodeGen/X86/avxvnniint16-intrinsics.ll | ||
---|---|---|
4 | You are right 👍 |
We see a crash bisected to this patch about using an illegal instruction.
Here's the CPUInfo for the machine:
CPU info: current cpu id: 22 total 32(physical cores 16) (assigned logical cores 32) (assigned physical cores 16) (assigned_sockets:2 of 2) (8 cores per cpu, 2 threads per core) family 6 model 45 stepping 7 microcode 0x71a, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, vzeroupper, avx, aes, clmul, ht, tsc, tscinvbit, tscinv, clflush AvgLoads: 0.30, 0.10, 0.18 CPU Model and flags from /proc/cpuinfo: model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d Online cpus: 0-31 Offline cpus: BIOS frequency limitation: <Not Available> Frequency switch latency (ns): 20000 Available cpu frequencies: <Not Available> Current governor: schedutil Core performance/turbo boost: <Not Available>
I don't see avxvnniint16 in the flags list nor avx2. So, this (relatively new) instruction shouldn't be generated for this machine. Any ideas on why this might be happening?
As far as I can see from the patch, the only way to generate avxvnniint16 instructions is to call its specific intrinsics explicitly. And we will check compiling options in FE before allowing to call the intrinsics. We do have an optimization to generate vnni instructions without intrinsics, but we haven't extend it to avxvnniint16 so far.
So I don't know what's wrong in your case, could you provide a reproducer for your problem?
I've investigated what is going on. With this patch, we are now passing in +avxvnniint16 into machine attributes. With that attribute, we now generate an instruction which is illegal on sandybridge machine:
0x3013f2af: jmpq 0x3013f09b 0x3013f2b4: mov %rax,%rdi 0x3013f2b7: and $0xfffffffffffffff0,%rdi => 0x3013f2bb: vpbroadcastd %xmm0,%ymm2 0x3013f2c0: vpbroadcastd %xmm1,%ymm3
The instruction vpbroadcastd %xmm0,%ymm2 requires AVX2 CPU flag: https://www.felixcloutier.com/x86/vpbroadcast. However, the machine has only AVX flag.
This is the complete mattr generated:
!3 = !{!"-mattr=-prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,-xsaves,-avx512fp16,-sm4,+sse4.1,-avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,-invpcid,+64bit,-xsavec,-avx512vpopcntdq,+cmov,-avx512vp2intersect,-avx512cd,-movbe,-avxvnniint8,-avx512er,-amx-int8,-kl,-sha512,-avxvnni,-rtm,-adx,-avx2,-hreset,-movdiri,-serialize,-vpclmulqdq,-avx512vl,-uintr,-clflushopt,-raoint,-cmpccxadd,-bmi,-amx-tile,+sse,-gfni,+avxvnniint16,-amx-fp16,+xsaveopt,-rdrnd,-avx512f,-amx-bf16,-avx512bf16,-avx512vnni,+cx8,-avx512bw,+sse3,-pku,-fsgsbase,-clzero,-mwaitx,-lwp,-lzcnt,-sha,-movdir64b,-wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,-bmi2,-fma,+popcnt,-avxifma,-f16c,-avx512bitalg,-rdpru,-clwb,+mmx,+sse2,-rdseed,-avx512vbmi2,-prefetchi,-rdpid,-fma4,-avx512vbmi,-shstk,-vaes,-waitpkg,-sgx,+fxsr,-avx512dq,-sse4a"}
I've confirmed if we changed to -avxvnniint16 we do not generate vpbroadcastd.
W.r.t. how we get the machine attributes generated through our front-end:
if (!sys::getHostCPUFeatures(Features)) return std::move(mattr); // Fill mattr with default values. mattr.reserve(Features.getNumItems()); for (auto &I : Features) { std::string attr(I.first()); mattr.emplace_back(std::string(I.second ? "+" : "-") + attr); }
So, the problem is in getHostCPUFeatures, possibly this line from the patch :
Features["avxvnniint16"] = HasLeaf7Subleaf1 && ((EDX >> 10) & 1) && HasAVXSave;.
Does this patch help
diff --git a/llvm/lib/TargetParser/Host.cpp b/llvm/lib/TargetParser/Host.cpp index 1141df09307c..11a6879fb76a 100644 --- a/llvm/lib/TargetParser/Host.cpp +++ b/llvm/lib/TargetParser/Host.cpp @@ -1769,7 +1769,7 @@ bool sys::getHostCPUFeatures(StringMap<bool> &Features) { Features["amx-tile"] = HasLeaf7 && ((EDX >> 24) & 1) && HasAMXSave; Features["amx-int8"] = HasLeaf7 && ((EDX >> 25) & 1) && HasAMXSave; bool HasLeaf7Subleaf1 = - MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX); + HasLeaf7 && EAX >= 1 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX); Features["sha512"] = HasLeaf7Subleaf1 && ((EAX >> 0) & 1); Features["sm3"] = HasLeaf7Subleaf1 && ((EAX >> 1) & 1); Features["sm4"] = HasLeaf7Subleaf1 && ((EAX >> 2) & 1);
Can you try to help us understand what is happening?
Sandy Bridge doesn't use leaf 7 at all, but has leaves after it. I thought it should always return 0 for all EAX, EBX, ECX, EDX. The EAX value for Leaf 7 contains how many subleaves of leaf 7 exist.
The documentation says that invalid subleaves of leaf 7 should return all 0s. So we thought it was safe to check the bits of sub leaf 1 even if eax from subleaf 0 doesn't say subleaf 1 is supported.
Can you capture the values of EAX, EBX, ECX, and EDX after the two calls to getX86CpuIDAndInfoEx that have 0x7 as the first argument? Maybe there's a bug in CPUID on Sandy Bridge.
Can you capture the values of EAX, EBX, ECX, and EDX after the two calls to getX86CpuIDAndInfoEx that have 0x7 as the first argument? Maybe there's a bug in CPUID on Sandy Bridge.
Sure, on the original code before the patch you suggested right?
The two calls are:
bool HasLeaf7 = MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x0, &EAX, &EBX, &ECX, &EDX); + llvm::errs() << "Before setting fsgsbase the value for EAX: " << EAX + << " EBX: " << EBX << " ECX: " << ECX << " EDX: " << EDX + << "\n"; Features["fsgsbase"] = HasLeaf7 && ((EBX >> 0) & 1); .... bool HasLeaf7Subleaf1 = MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX); + llvm::errs() << "Before setting sha512 the value for EAX: " << EAX + << " EBX: " << EBX << " ECX: " << ECX << " EDX: " << EDX + << "\n"; Features["sha512"] = HasLeaf7Subleaf1 && ((EAX >> 0) & 1); ... we set avxvnniint16 after this
Takes a while to get a build on this machine, should have the output soon.
Thanks @anna and @craig.topper
I think we can dump the value with the simple code
$ cat cpuid.c #include <stdio.h> #include <cpuid.h> int main() { unsigned int info[4]; for (int i = 0; i < 2; ++i) { __get_cpuid_count(7, i, info, info + 1, info + 2, info + 3); printf("%08x\n", info[0]); printf("%08x\n", info[1]); printf("%08x\n", info[2]); printf("%08x\n", info[3]); } } $ clang cpuid.c && ./a.out
@craig.topper here is the output:
Before setting fsgsbase the value for EAX: 0 EBX: 0 ECX: 0 EDX: 2617246720 // this is after the HasLeaf7 calculation Before setting sha512 the value for EAX: 0 EBX: 0 ECX: 0 EDX: 2617246720 // this is after the HasLeaf7Subleaf1 calculation
So, with your patch HasLeaf7Subleaf1 is 0 as EAX is 0. Pls let me know if you need any additional diagnostics output (we actually lose access to the machine on friday, since it is being retired!).
The documentation says that invalid subleaves of leaf 7 should return all 0s. So we thought it was safe to check the bits of sub leaf 1 even if eax from subleaf 0 doesn't say subleaf 1 is supported.
This means the CPUID doesn't satisfy the documentation since EDX != 0 for SubLeaf1?
The identical EDX value looks dubious to me. Could you compile and run above code and paste the result here? Thanks!
Interestingly all of the bits set in EDX are features that were things that were added in microcode patches in the wake of vulnerabilities like Spectre and Meltdown. Maybe the microcode patch forgot to check the subleaf since there was no subleaf implemented when sandy bridge was originally made.
I think my patch is the correct fix given that information. I'll post a patch for review shortly.
alphabetical order.