This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add AVX-VNNI-INT16 instructions.
ClosedPublic

Authored by FreddyYe on Jul 12 2023, 7:11 PM.

Diff Detail

Event Timeline

FreddyYe created this revision.Jul 12 2023, 7:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 12 2023, 7:11 PM
FreddyYe requested review of this revision.Jul 12 2023, 7:11 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJul 12 2023, 7:11 PM
pengfei added inline comments.Jul 13 2023, 5:31 AM
clang/lib/Basic/Targets/X86.cpp
1074

alphabetical order.

llvm/test/CodeGen/X86/avxvnniint16-intrinsics.ll
3

X64,CHECK

4

X86,CHECK

llvm/test/CodeGen/X86/stack-folding-int-avxvnniint16.ll
3

Is this required?

3

Don't need avx2

llvm/test/MC/Disassembler/X86/avx-vnni-int16.txt
1 ↗(On Diff #539822)

Remove blank line.

llvm/test/MC/Disassembler/X86/x86-64-avx-vnni-int16.txt
1–2 ↗(On Diff #539822)

Remove this.

FreddyYe updated this revision to Diff 540244.Jul 13 2023, 6:12 PM
FreddyYe marked 7 inline comments as done.

Address comments, fix lit fails, align naming convention in IntrinsicsX86.td

llvm/test/CodeGen/X86/avxvnniint16-intrinsics.ll
3

This couldn't help merging the CHECKs here. Do we need it?

craig.topper added inline comments.Jul 13 2023, 8:25 PM
llvm/test/CodeGen/X86/avxvnniint16-intrinsics.ll
4

I thought the common prefix had to be first? But I might be wrong

craig.topper added inline comments.Jul 13 2023, 8:30 PM
llvm/lib/Target/X86/X86InstrSSE.td
8303

This needs to be indented 1 character more

8306

This needs to be indented 1 character more so that it looks nested under the set

pengfei added inline comments.Jul 13 2023, 8:49 PM
llvm/test/CodeGen/X86/avxvnniint16-intrinsics.ll
4

You are right 👍

FreddyYe updated this revision to Diff 540849.Jul 16 2023, 8:13 PM
FreddyYe marked 4 inline comments as done.

Address comments.

FreddyYe updated this revision to Diff 540896.Jul 17 2023, 1:16 AM

Rename disassembler tests and remove -x86-asm-syntax=intel

FreddyYe updated this revision to Diff 540908.Jul 17 2023, 1:48 AM

Remove -check-prefix=CHECK

FreddyYe updated this revision to Diff 540953.Jul 17 2023, 4:35 AM

Remove #include <stddef.h>

FreddyYe retitled this revision from Add AVX-VNNI-INT16 instructions. to [X86] Add AVX-VNNI-INT16 instructions..Jul 17 2023, 4:41 AM
RKSimon added inline comments.Jul 17 2023, 5:33 AM
clang/lib/Headers/avxvnniint16intrin.h
27

doxygen descriptions?

llvm/test/MC/Disassembler/X86/avx-vnni-int16-64.txt
4

try to use some x86_64-- specific registers to improve test coverage

FreddyYe updated this revision to Diff 541494.Jul 18 2023, 6:13 AM
FreddyYe marked 2 inline comments as done.

Address comments.

FreddyYe updated this revision to Diff 541796.Jul 18 2023, 5:28 PM

Add missing in doxygen

ping... Anyone help accept?

pengfei accepted this revision.Jul 19 2023, 10:38 PM

LGTM.

This revision is now accepted and ready to land.Jul 19 2023, 10:38 PM
skan accepted this revision.Jul 19 2023, 11:19 PM
This revision was landed with ongoing or failed builds.Jul 19 2023, 11:31 PM
This revision was automatically updated to reflect the committed changes.
anna added a subscriber: anna.Jul 28 2023, 11:51 AM

We see a crash bisected to this patch about using an illegal instruction.
Here's the CPUInfo for the machine:

CPU info:
current cpu id: 22
total 32(physical cores 16) (assigned logical cores 32) (assigned physical cores 16) (assigned_sockets:2 of 2) (8 cores per cpu, 2 threads per core) family 6 model 45 stepping 7 microcode 0x71a, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, vzeroupper, avx, aes, clmul, ht, tsc, tscinvbit, tscinv, clflush
AvgLoads: 0.30, 0.10, 0.18
CPU Model and flags from /proc/cpuinfo:
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Online cpus: 0-31
Offline cpus:
BIOS frequency limitation: <Not Available>
Frequency switch latency (ns): 20000
Available cpu frequencies: <Not Available>
Current governor: schedutil
Core performance/turbo boost: <Not Available>

I don't see avxvnniint16 in the flags list nor avx2. So, this (relatively new) instruction shouldn't be generated for this machine. Any ideas on why this might be happening?

pengfei added a comment.EditedJul 28 2023, 7:47 PM

We see a crash bisected to this patch about using an illegal instruction.
Here's the CPUInfo for the machine:

CPU info:
current cpu id: 22
total 32(physical cores 16) (assigned logical cores 32) (assigned physical cores 16) (assigned_sockets:2 of 2) (8 cores per cpu, 2 threads per core) family 6 model 45 stepping 7 microcode 0x71a, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, vzeroupper, avx, aes, clmul, ht, tsc, tscinvbit, tscinv, clflush
AvgLoads: 0.30, 0.10, 0.18
CPU Model and flags from /proc/cpuinfo:
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Online cpus: 0-31
Offline cpus:
BIOS frequency limitation: <Not Available>
Frequency switch latency (ns): 20000
Available cpu frequencies: <Not Available>
Current governor: schedutil
Core performance/turbo boost: <Not Available>

I don't see avxvnniint16 in the flags list nor avx2. So, this (relatively new) instruction shouldn't be generated for this machine. Any ideas on why this might be happening?

As far as I can see from the patch, the only way to generate avxvnniint16 instructions is to call its specific intrinsics explicitly. And we will check compiling options in FE before allowing to call the intrinsics. We do have an optimization to generate vnni instructions without intrinsics, but we haven't extend it to avxvnniint16 so far.
So I don't know what's wrong in your case, could you provide a reproducer for your problem?

anna added a comment.Aug 1 2023, 12:39 PM

We see a crash bisected to this patch about using an illegal instruction.
Here's the CPUInfo for the machine:

CPU info:
current cpu id: 22
total 32(physical cores 16) (assigned logical cores 32) (assigned physical cores 16) (assigned_sockets:2 of 2) (8 cores per cpu, 2 threads per core) family 6 model 45 stepping 7 microcode 0x71a, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, vzeroupper, avx, aes, clmul, ht, tsc, tscinvbit, tscinv, clflush
AvgLoads: 0.30, 0.10, 0.18
CPU Model and flags from /proc/cpuinfo:
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Online cpus: 0-31
Offline cpus:
BIOS frequency limitation: <Not Available>
Frequency switch latency (ns): 20000
Available cpu frequencies: <Not Available>
Current governor: schedutil
Core performance/turbo boost: <Not Available>

I don't see avxvnniint16 in the flags list nor avx2. So, this (relatively new) instruction shouldn't be generated for this machine. Any ideas on why this might be happening?

As far as I can see from the patch, the only way to generate avxvnniint16 instructions is to call its specific intrinsics explicitly. And we will check compiling options in FE before allowing to call the intrinsics. We do have an optimization to generate vnni instructions without intrinsics, but we haven't extend it to avxvnniint16 so far.
So I don't know what's wrong in your case, could you provide a reproducer for your problem?

I've investigated what is going on. With this patch, we are now passing in +avxvnniint16 into machine attributes. With that attribute, we now generate an instruction which is illegal on sandybridge machine:

 0x3013f2af:	jmpq   0x3013f09b
   0x3013f2b4:	mov    %rax,%rdi
   0x3013f2b7:	and    $0xfffffffffffffff0,%rdi
=> 0x3013f2bb:	vpbroadcastd %xmm0,%ymm2
   0x3013f2c0:	vpbroadcastd %xmm1,%ymm3

The instruction vpbroadcastd %xmm0,%ymm2 requires AVX2 CPU flag: https://www.felixcloutier.com/x86/vpbroadcast. However, the machine has only AVX flag.

This is the complete mattr generated:

!3 = !{!"-mattr=-prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,-xsaves,-avx512fp16,-sm4,+sse4.1,-avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,-invpcid,+64bit,-xsavec,-avx512vpopcntdq,+cmov,-avx512vp2intersect,-avx512cd,-movbe,-avxvnniint8,-avx512er,-amx-int8,-kl,-sha512,-avxvnni,-rtm,-adx,-avx2,-hreset,-movdiri,-serialize,-vpclmulqdq,-avx512vl,-uintr,-clflushopt,-raoint,-cmpccxadd,-bmi,-amx-tile,+sse,-gfni,+avxvnniint16,-amx-fp16,+xsaveopt,-rdrnd,-avx512f,-amx-bf16,-avx512bf16,-avx512vnni,+cx8,-avx512bw,+sse3,-pku,-fsgsbase,-clzero,-mwaitx,-lwp,-lzcnt,-sha,-movdir64b,-wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,-bmi2,-fma,+popcnt,-avxifma,-f16c,-avx512bitalg,-rdpru,-clwb,+mmx,+sse2,-rdseed,-avx512vbmi2,-prefetchi,-rdpid,-fma4,-avx512vbmi,-shstk,-vaes,-waitpkg,-sgx,+fxsr,-avx512dq,-sse4a"}

I've confirmed if we changed to -avxvnniint16 we do not generate vpbroadcastd.

W.r.t. how we get the machine attributes generated through our front-end:

if (!sys::getHostCPUFeatures(Features))
      return std::move(mattr);
  
    // Fill mattr with default values.
    mattr.reserve(Features.getNumItems());
    for (auto &I : Features) {
      std::string attr(I.first());
      mattr.emplace_back(std::string(I.second ? "+" : "-") + attr);
    }

So, the problem is in getHostCPUFeatures, possibly this line from the patch :
Features["avxvnniint16"] = HasLeaf7Subleaf1 && ((EDX >> 10) & 1) && HasAVXSave;.

We see a crash bisected to this patch about using an illegal instruction.
Here's the CPUInfo for the machine:

CPU info:
current cpu id: 22
total 32(physical cores 16) (assigned logical cores 32) (assigned physical cores 16) (assigned_sockets:2 of 2) (8 cores per cpu, 2 threads per core) family 6 model 45 stepping 7 microcode 0x71a, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, vzeroupper, avx, aes, clmul, ht, tsc, tscinvbit, tscinv, clflush
AvgLoads: 0.30, 0.10, 0.18
CPU Model and flags from /proc/cpuinfo:
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Online cpus: 0-31
Offline cpus:
BIOS frequency limitation: <Not Available>
Frequency switch latency (ns): 20000
Available cpu frequencies: <Not Available>
Current governor: schedutil
Core performance/turbo boost: <Not Available>

I don't see avxvnniint16 in the flags list nor avx2. So, this (relatively new) instruction shouldn't be generated for this machine. Any ideas on why this might be happening?

As far as I can see from the patch, the only way to generate avxvnniint16 instructions is to call its specific intrinsics explicitly. And we will check compiling options in FE before allowing to call the intrinsics. We do have an optimization to generate vnni instructions without intrinsics, but we haven't extend it to avxvnniint16 so far.
So I don't know what's wrong in your case, could you provide a reproducer for your problem?

I've investigated what is going on. With this patch, we are now passing in +avxvnniint16 into machine attributes. With that attribute, we now generate an instruction which is illegal on sandybridge machine:

 0x3013f2af:	jmpq   0x3013f09b
   0x3013f2b4:	mov    %rax,%rdi
   0x3013f2b7:	and    $0xfffffffffffffff0,%rdi
=> 0x3013f2bb:	vpbroadcastd %xmm0,%ymm2
   0x3013f2c0:	vpbroadcastd %xmm1,%ymm3

The instruction vpbroadcastd %xmm0,%ymm2 requires AVX2 CPU flag: https://www.felixcloutier.com/x86/vpbroadcast. However, the machine has only AVX flag.

This is the complete mattr generated:

!3 = !{!"-mattr=-prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,-xsaves,-avx512fp16,-sm4,+sse4.1,-avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,-invpcid,+64bit,-xsavec,-avx512vpopcntdq,+cmov,-avx512vp2intersect,-avx512cd,-movbe,-avxvnniint8,-avx512er,-amx-int8,-kl,-sha512,-avxvnni,-rtm,-adx,-avx2,-hreset,-movdiri,-serialize,-vpclmulqdq,-avx512vl,-uintr,-clflushopt,-raoint,-cmpccxadd,-bmi,-amx-tile,+sse,-gfni,+avxvnniint16,-amx-fp16,+xsaveopt,-rdrnd,-avx512f,-amx-bf16,-avx512bf16,-avx512vnni,+cx8,-avx512bw,+sse3,-pku,-fsgsbase,-clzero,-mwaitx,-lwp,-lzcnt,-sha,-movdir64b,-wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,-bmi2,-fma,+popcnt,-avxifma,-f16c,-avx512bitalg,-rdpru,-clwb,+mmx,+sse2,-rdseed,-avx512vbmi2,-prefetchi,-rdpid,-fma4,-avx512vbmi,-shstk,-vaes,-waitpkg,-sgx,+fxsr,-avx512dq,-sse4a"}

I've confirmed if we changed to -avxvnniint16 we do not generate vpbroadcastd.

W.r.t. how we get the machine attributes generated through our front-end:

if (!sys::getHostCPUFeatures(Features))
      return std::move(mattr);
  
    // Fill mattr with default values.
    mattr.reserve(Features.getNumItems());
    for (auto &I : Features) {
      std::string attr(I.first());
      mattr.emplace_back(std::string(I.second ? "+" : "-") + attr);
    }

So, the problem is in getHostCPUFeatures, possibly this line from the patch :
Features["avxvnniint16"] = HasLeaf7Subleaf1 && ((EDX >> 10) & 1) && HasAVXSave;.

Does this patch help

diff --git a/llvm/lib/TargetParser/Host.cpp b/llvm/lib/TargetParser/Host.cpp
index 1141df09307c..11a6879fb76a 100644
--- a/llvm/lib/TargetParser/Host.cpp
+++ b/llvm/lib/TargetParser/Host.cpp
@@ -1769,7 +1769,7 @@ bool sys::getHostCPUFeatures(StringMap<bool> &Features) {
   Features["amx-tile"]   = HasLeaf7 && ((EDX >> 24) & 1) && HasAMXSave;
   Features["amx-int8"]   = HasLeaf7 && ((EDX >> 25) & 1) && HasAMXSave;
   bool HasLeaf7Subleaf1 =
-      MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
+      HasLeaf7 && EAX >= 1 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
   Features["sha512"]     = HasLeaf7Subleaf1 && ((EAX >> 0) & 1);
   Features["sm3"]        = HasLeaf7Subleaf1 && ((EAX >> 1) & 1);
   Features["sm4"]        = HasLeaf7Subleaf1 && ((EAX >> 2) & 1);
anna added a comment.Aug 2 2023, 7:01 AM

We see a crash bisected to this patch about using an illegal instruction.
Here's the CPUInfo for the machine:

CPU info:
current cpu id: 22
total 32(physical cores 16) (assigned logical cores 32) (assigned physical cores 16) (assigned_sockets:2 of 2) (8 cores per cpu, 2 threads per core) family 6 model 45 stepping 7 microcode 0x71a, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, vzeroupper, avx, aes, clmul, ht, tsc, tscinvbit, tscinv, clflush
AvgLoads: 0.30, 0.10, 0.18
CPU Model and flags from /proc/cpuinfo:
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Online cpus: 0-31
Offline cpus:
BIOS frequency limitation: <Not Available>
Frequency switch latency (ns): 20000
Available cpu frequencies: <Not Available>
Current governor: schedutil
Core performance/turbo boost: <Not Available>

I don't see avxvnniint16 in the flags list nor avx2. So, this (relatively new) instruction shouldn't be generated for this machine. Any ideas on why this might be happening?

As far as I can see from the patch, the only way to generate avxvnniint16 instructions is to call its specific intrinsics explicitly. And we will check compiling options in FE before allowing to call the intrinsics. We do have an optimization to generate vnni instructions without intrinsics, but we haven't extend it to avxvnniint16 so far.
So I don't know what's wrong in your case, could you provide a reproducer for your problem?

I've investigated what is going on. With this patch, we are now passing in +avxvnniint16 into machine attributes. With that attribute, we now generate an instruction which is illegal on sandybridge machine:

 0x3013f2af:	jmpq   0x3013f09b
   0x3013f2b4:	mov    %rax,%rdi
   0x3013f2b7:	and    $0xfffffffffffffff0,%rdi
=> 0x3013f2bb:	vpbroadcastd %xmm0,%ymm2
   0x3013f2c0:	vpbroadcastd %xmm1,%ymm3

The instruction vpbroadcastd %xmm0,%ymm2 requires AVX2 CPU flag: https://www.felixcloutier.com/x86/vpbroadcast. However, the machine has only AVX flag.

This is the complete mattr generated:

!3 = !{!"-mattr=-prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,-xsaves,-avx512fp16,-sm4,+sse4.1,-avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,-invpcid,+64bit,-xsavec,-avx512vpopcntdq,+cmov,-avx512vp2intersect,-avx512cd,-movbe,-avxvnniint8,-avx512er,-amx-int8,-kl,-sha512,-avxvnni,-rtm,-adx,-avx2,-hreset,-movdiri,-serialize,-vpclmulqdq,-avx512vl,-uintr,-clflushopt,-raoint,-cmpccxadd,-bmi,-amx-tile,+sse,-gfni,+avxvnniint16,-amx-fp16,+xsaveopt,-rdrnd,-avx512f,-amx-bf16,-avx512bf16,-avx512vnni,+cx8,-avx512bw,+sse3,-pku,-fsgsbase,-clzero,-mwaitx,-lwp,-lzcnt,-sha,-movdir64b,-wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,-bmi2,-fma,+popcnt,-avxifma,-f16c,-avx512bitalg,-rdpru,-clwb,+mmx,+sse2,-rdseed,-avx512vbmi2,-prefetchi,-rdpid,-fma4,-avx512vbmi,-shstk,-vaes,-waitpkg,-sgx,+fxsr,-avx512dq,-sse4a"}

I've confirmed if we changed to -avxvnniint16 we do not generate vpbroadcastd.

W.r.t. how we get the machine attributes generated through our front-end:

if (!sys::getHostCPUFeatures(Features))
      return std::move(mattr);
  
    // Fill mattr with default values.
    mattr.reserve(Features.getNumItems());
    for (auto &I : Features) {
      std::string attr(I.first());
      mattr.emplace_back(std::string(I.second ? "+" : "-") + attr);
    }

So, the problem is in getHostCPUFeatures, possibly this line from the patch :
Features["avxvnniint16"] = HasLeaf7Subleaf1 && ((EDX >> 10) & 1) && HasAVXSave;.

Does this patch help

diff --git a/llvm/lib/TargetParser/Host.cpp b/llvm/lib/TargetParser/Host.cpp
index 1141df09307c..11a6879fb76a 100644
--- a/llvm/lib/TargetParser/Host.cpp
+++ b/llvm/lib/TargetParser/Host.cpp
@@ -1769,7 +1769,7 @@ bool sys::getHostCPUFeatures(StringMap<bool> &Features) {
   Features["amx-tile"]   = HasLeaf7 && ((EDX >> 24) & 1) && HasAMXSave;
   Features["amx-int8"]   = HasLeaf7 && ((EDX >> 25) & 1) && HasAMXSave;
   bool HasLeaf7Subleaf1 =
-      MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
+      HasLeaf7 && EAX >= 1 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
   Features["sha512"]     = HasLeaf7Subleaf1 && ((EAX >> 0) & 1);
   Features["sm3"]        = HasLeaf7Subleaf1 && ((EAX >> 1) & 1);
   Features["sm4"]        = HasLeaf7Subleaf1 && ((EAX >> 2) & 1);

Yes, @craig.topper that works! Thanks! Could you pls land the patch, if possible?

We see a crash bisected to this patch about using an illegal instruction.
Here's the CPUInfo for the machine:

CPU info:
current cpu id: 22
total 32(physical cores 16) (assigned logical cores 32) (assigned physical cores 16) (assigned_sockets:2 of 2) (8 cores per cpu, 2 threads per core) family 6 model 45 stepping 7 microcode 0x71a, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, vzeroupper, avx, aes, clmul, ht, tsc, tscinvbit, tscinv, clflush
AvgLoads: 0.30, 0.10, 0.18
CPU Model and flags from /proc/cpuinfo:
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Online cpus: 0-31
Offline cpus:
BIOS frequency limitation: <Not Available>
Frequency switch latency (ns): 20000
Available cpu frequencies: <Not Available>
Current governor: schedutil
Core performance/turbo boost: <Not Available>

I don't see avxvnniint16 in the flags list nor avx2. So, this (relatively new) instruction shouldn't be generated for this machine. Any ideas on why this might be happening?

As far as I can see from the patch, the only way to generate avxvnniint16 instructions is to call its specific intrinsics explicitly. And we will check compiling options in FE before allowing to call the intrinsics. We do have an optimization to generate vnni instructions without intrinsics, but we haven't extend it to avxvnniint16 so far.
So I don't know what's wrong in your case, could you provide a reproducer for your problem?

I've investigated what is going on. With this patch, we are now passing in +avxvnniint16 into machine attributes. With that attribute, we now generate an instruction which is illegal on sandybridge machine:

 0x3013f2af:	jmpq   0x3013f09b
   0x3013f2b4:	mov    %rax,%rdi
   0x3013f2b7:	and    $0xfffffffffffffff0,%rdi
=> 0x3013f2bb:	vpbroadcastd %xmm0,%ymm2
   0x3013f2c0:	vpbroadcastd %xmm1,%ymm3

The instruction vpbroadcastd %xmm0,%ymm2 requires AVX2 CPU flag: https://www.felixcloutier.com/x86/vpbroadcast. However, the machine has only AVX flag.

This is the complete mattr generated:

!3 = !{!"-mattr=-prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,-xsaves,-avx512fp16,-sm4,+sse4.1,-avx512ifma,+xsave,-avx512pf,+sse4.2,-tsxldtrk,-ptwrite,-widekl,-sm3,-invpcid,+64bit,-xsavec,-avx512vpopcntdq,+cmov,-avx512vp2intersect,-avx512cd,-movbe,-avxvnniint8,-avx512er,-amx-int8,-kl,-sha512,-avxvnni,-rtm,-adx,-avx2,-hreset,-movdiri,-serialize,-vpclmulqdq,-avx512vl,-uintr,-clflushopt,-raoint,-cmpccxadd,-bmi,-amx-tile,+sse,-gfni,+avxvnniint16,-amx-fp16,+xsaveopt,-rdrnd,-avx512f,-amx-bf16,-avx512bf16,-avx512vnni,+cx8,-avx512bw,+sse3,-pku,-fsgsbase,-clzero,-mwaitx,-lwp,-lzcnt,-sha,-movdir64b,-wbnoinvd,-enqcmd,-prefetchwt1,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,-bmi2,-fma,+popcnt,-avxifma,-f16c,-avx512bitalg,-rdpru,-clwb,+mmx,+sse2,-rdseed,-avx512vbmi2,-prefetchi,-rdpid,-fma4,-avx512vbmi,-shstk,-vaes,-waitpkg,-sgx,+fxsr,-avx512dq,-sse4a"}

I've confirmed if we changed to -avxvnniint16 we do not generate vpbroadcastd.

W.r.t. how we get the machine attributes generated through our front-end:

if (!sys::getHostCPUFeatures(Features))
      return std::move(mattr);
  
    // Fill mattr with default values.
    mattr.reserve(Features.getNumItems());
    for (auto &I : Features) {
      std::string attr(I.first());
      mattr.emplace_back(std::string(I.second ? "+" : "-") + attr);
    }

So, the problem is in getHostCPUFeatures, possibly this line from the patch :
Features["avxvnniint16"] = HasLeaf7Subleaf1 && ((EDX >> 10) & 1) && HasAVXSave;.

Does this patch help

diff --git a/llvm/lib/TargetParser/Host.cpp b/llvm/lib/TargetParser/Host.cpp
index 1141df09307c..11a6879fb76a 100644
--- a/llvm/lib/TargetParser/Host.cpp
+++ b/llvm/lib/TargetParser/Host.cpp
@@ -1769,7 +1769,7 @@ bool sys::getHostCPUFeatures(StringMap<bool> &Features) {
   Features["amx-tile"]   = HasLeaf7 && ((EDX >> 24) & 1) && HasAMXSave;
   Features["amx-int8"]   = HasLeaf7 && ((EDX >> 25) & 1) && HasAMXSave;
   bool HasLeaf7Subleaf1 =
-      MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
+      HasLeaf7 && EAX >= 1 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
   Features["sha512"]     = HasLeaf7Subleaf1 && ((EAX >> 0) & 1);
   Features["sm3"]        = HasLeaf7Subleaf1 && ((EAX >> 1) & 1);
   Features["sm4"]        = HasLeaf7Subleaf1 && ((EAX >> 2) & 1);

Yes, @craig.topper that works! Thanks! Could you pls land the patch, if possible?

Can you try to help us understand what is happening?

Sandy Bridge doesn't use leaf 7 at all, but has leaves after it. I thought it should always return 0 for all EAX, EBX, ECX, EDX. The EAX value for Leaf 7 contains how many subleaves of leaf 7 exist.

The documentation says that invalid subleaves of leaf 7 should return all 0s. So we thought it was safe to check the bits of sub leaf 1 even if eax from subleaf 0 doesn't say subleaf 1 is supported.

Can you capture the values of EAX, EBX, ECX, and EDX after the two calls to getX86CpuIDAndInfoEx that have 0x7 as the first argument? Maybe there's a bug in CPUID on Sandy Bridge.

anna added a comment.Aug 2 2023, 10:36 AM

Can you capture the values of EAX, EBX, ECX, and EDX after the two calls to getX86CpuIDAndInfoEx that have 0x7 as the first argument? Maybe there's a bug in CPUID on Sandy Bridge.

Sure, on the original code before the patch you suggested right?
The two calls are:

 bool HasLeaf7 =
        MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x0, &EAX, &EBX, &ECX, &EDX);
+   llvm::errs() << "Before setting fsgsbase the value for EAX: " << EAX
+                     << " EBX: " << EBX << " ECX: " << ECX << "  EDX: " << EDX
+                     << "\n";
  
    Features["fsgsbase"]   = HasLeaf7 && ((EBX >>  0) & 1);
....
bool HasLeaf7Subleaf1 =
        MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
+   llvm::errs() << "Before setting sha512 the value for EAX: " << EAX
+                     << " EBX: " << EBX << " ECX: " << ECX << "  EDX: " << EDX
+                     << "\n";
    Features["sha512"]     = HasLeaf7Subleaf1 && ((EAX >> 0) & 1);
...
we set avxvnniint16 after this

Takes a while to get a build on this machine, should have the output soon.

pengfei added a comment.EditedAug 2 2023, 8:01 PM

Thanks @anna and @craig.topper
I think we can dump the value with the simple code

$ cat cpuid.c
#include <stdio.h>
#include <cpuid.h>

int main() {
  unsigned int info[4];
  for (int i = 0; i < 2; ++i) {
    __get_cpuid_count(7, i, info, info + 1, info + 2, info + 3);
    printf("%08x\n", info[0]);
    printf("%08x\n", info[1]);
    printf("%08x\n", info[2]);
    printf("%08x\n", info[3]);
  }
}

$ clang cpuid.c && ./a.out
anna added a comment.EditedAug 2 2023, 8:01 PM

Can you capture the values of EAX, EBX, ECX, and EDX after the two calls to getX86CpuIDAndInfoEx that have 0x7 as the first argument? Maybe there's a bug in CPUID on Sandy Bridge.

Sure, on the original code before the patch you suggested right?
The two calls are:

 bool HasLeaf7 =
        MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x0, &EAX, &EBX, &ECX, &EDX);
+   llvm::errs() << "Before setting fsgsbase the value for EAX: " << EAX
+                     << " EBX: " << EBX << " ECX: " << ECX << "  EDX: " << EDX
+                     << "\n";
  
    Features["fsgsbase"]   = HasLeaf7 && ((EBX >>  0) & 1);
....
bool HasLeaf7Subleaf1 =
        MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
+   llvm::errs() << "Before setting sha512 the value for EAX: " << EAX
+                     << " EBX: " << EBX << " ECX: " << ECX << "  EDX: " << EDX
+                     << "\n";
    Features["sha512"]     = HasLeaf7Subleaf1 && ((EAX >> 0) & 1);
...
we set avxvnniint16 after this

Takes a while to get a build on this machine, should have the output soon.

@craig.topper here is the output:

Before setting fsgsbase the value for EAX: 0 EBX: 0 ECX: 0  EDX: 2617246720 // this is after the HasLeaf7 calculation
Before setting sha512 the value for EAX: 0 EBX: 0 ECX: 0  EDX: 2617246720 // this is after the HasLeaf7Subleaf1 calculation

So, with your patch HasLeaf7Subleaf1 is 0 as EAX is 0. Pls let me know if you need any additional diagnostics output (we actually lose access to the machine on friday, since it is being retired!).

The documentation says that invalid subleaves of leaf 7 should return all 0s. So we thought it was safe to check the bits of sub leaf 1 even if eax from subleaf 0 doesn't say subleaf 1 is supported.

This means the CPUID doesn't satisfy the documentation since EDX != 0 for SubLeaf1?

Can you capture the values of EAX, EBX, ECX, and EDX after the two calls to getX86CpuIDAndInfoEx that have 0x7 as the first argument? Maybe there's a bug in CPUID on Sandy Bridge.

Sure, on the original code before the patch you suggested right?
The two calls are:

 bool HasLeaf7 =
        MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x0, &EAX, &EBX, &ECX, &EDX);
+   llvm::errs() << "Before setting fsgsbase the value for EAX: " << EAX
+                     << " EBX: " << EBX << " ECX: " << ECX << "  EDX: " << EDX
+                     << "\n";
  
    Features["fsgsbase"]   = HasLeaf7 && ((EBX >>  0) & 1);
....
bool HasLeaf7Subleaf1 =
        MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
+   llvm::errs() << "Before setting sha512 the value for EAX: " << EAX
+                     << " EBX: " << EBX << " ECX: " << ECX << "  EDX: " << EDX
+                     << "\n";
    Features["sha512"]     = HasLeaf7Subleaf1 && ((EAX >> 0) & 1);
...
we set avxvnniint16 after this

Takes a while to get a build on this machine, should have the output soon.

@craig.topper here is the output:

Before setting fsgsbase the value for EAX: 0 EBX: 0 ECX: 0  EDX: 2617246720 // this is after the HasLeaf7 calculation
Before setting sha512 the value for EAX: 0 EBX: 0 ECX: 0  EDX: 2617246720 // this is after the HasLeaf7Subleaf1 calculation

So, with your patch HasLeaf7Subleaf1 is 0 as EAX is 0. Pls let me know if you need any additional diagnostics output (we actually lose access to the machine on friday, since it is being retired!).

The documentation says that invalid subleaves of leaf 7 should return all 0s. So we thought it was safe to check the bits of sub leaf 1 even if eax from subleaf 0 doesn't say subleaf 1 is supported.

This means the CPUID doesn't satisfy the documentation since EDX != 0 for SubLeaf1?

The identical EDX value looks dubious to me. Could you compile and run above code and paste the result here? Thanks!

Can you capture the values of EAX, EBX, ECX, and EDX after the two calls to getX86CpuIDAndInfoEx that have 0x7 as the first argument? Maybe there's a bug in CPUID on Sandy Bridge.

Sure, on the original code before the patch you suggested right?
The two calls are:

 bool HasLeaf7 =
        MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x0, &EAX, &EBX, &ECX, &EDX);
+   llvm::errs() << "Before setting fsgsbase the value for EAX: " << EAX
+                     << " EBX: " << EBX << " ECX: " << ECX << "  EDX: " << EDX
+                     << "\n";
  
    Features["fsgsbase"]   = HasLeaf7 && ((EBX >>  0) & 1);
....
bool HasLeaf7Subleaf1 =
        MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
+   llvm::errs() << "Before setting sha512 the value for EAX: " << EAX
+                     << " EBX: " << EBX << " ECX: " << ECX << "  EDX: " << EDX
+                     << "\n";
    Features["sha512"]     = HasLeaf7Subleaf1 && ((EAX >> 0) & 1);
...
we set avxvnniint16 after this

Takes a while to get a build on this machine, should have the output soon.

@craig.topper here is the output:

Before setting fsgsbase the value for EAX: 0 EBX: 0 ECX: 0  EDX: 2617246720 // this is after the HasLeaf7 calculation
Before setting sha512 the value for EAX: 0 EBX: 0 ECX: 0  EDX: 2617246720 // this is after the HasLeaf7Subleaf1 calculation

So, with your patch HasLeaf7Subleaf1 is 0 as EAX is 0. Pls let me know if you need any additional diagnostics output (we actually lose access to the machine on friday, since it is being retired!).

The documentation says that invalid subleaves of leaf 7 should return all 0s. So we thought it was safe to check the bits of sub leaf 1 even if eax from subleaf 0 doesn't say subleaf 1 is supported.

This means the CPUID doesn't satisfy the documentation since EDX != 0 for SubLeaf1?

Interestingly all of the bits set in EDX are features that were things that were added in microcode patches in the wake of vulnerabilities like Spectre and Meltdown. Maybe the microcode patch forgot to check the subleaf since there was no subleaf implemented when sandy bridge was originally made.

I think my patch is the correct fix given that information. I'll post a patch for review shortly.

Can you capture the values of EAX, EBX, ECX, and EDX after the two calls to getX86CpuIDAndInfoEx that have 0x7 as the first argument? Maybe there's a bug in CPUID on Sandy Bridge.

Sure, on the original code before the patch you suggested right?
The two calls are:

 bool HasLeaf7 =
        MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x0, &EAX, &EBX, &ECX, &EDX);
+   llvm::errs() << "Before setting fsgsbase the value for EAX: " << EAX
+                     << " EBX: " << EBX << " ECX: " << ECX << "  EDX: " << EDX
+                     << "\n";
  
    Features["fsgsbase"]   = HasLeaf7 && ((EBX >>  0) & 1);
....
bool HasLeaf7Subleaf1 =
        MaxLevel >= 7 && !getX86CpuIDAndInfoEx(0x7, 0x1, &EAX, &EBX, &ECX, &EDX);
+   llvm::errs() << "Before setting sha512 the value for EAX: " << EAX
+                     << " EBX: " << EBX << " ECX: " << ECX << "  EDX: " << EDX
+                     << "\n";
    Features["sha512"]     = HasLeaf7Subleaf1 && ((EAX >> 0) & 1);
...
we set avxvnniint16 after this

Takes a while to get a build on this machine, should have the output soon.

@craig.topper here is the output:

Before setting fsgsbase the value for EAX: 0 EBX: 0 ECX: 0  EDX: 2617246720 // this is after the HasLeaf7 calculation
Before setting sha512 the value for EAX: 0 EBX: 0 ECX: 0  EDX: 2617246720 // this is after the HasLeaf7Subleaf1 calculation

So, with your patch HasLeaf7Subleaf1 is 0 as EAX is 0. Pls let me know if you need any additional diagnostics output (we actually lose access to the machine on friday, since it is being retired!).

The documentation says that invalid subleaves of leaf 7 should return all 0s. So we thought it was safe to check the bits of sub leaf 1 even if eax from subleaf 0 doesn't say subleaf 1 is supported.

This means the CPUID doesn't satisfy the documentation since EDX != 0 for SubLeaf1?

Interestingly all of the bits set in EDX are features that were things that were added in microcode patches in the wake of vulnerabilities like Spectre and Meltdown. Maybe the microcode patch forgot to check the subleaf since there was no subleaf implemented when sandy bridge was originally made.

I think my patch is the correct fix given that information. I'll post a patch for review shortly.

Thanks Craig! That makes sense to me.