This is an archive of the discontinued LLVM Phabricator instance.

[X86] Make _Int instructions the preferred instruction for the assembly parser and the disassembler to remove inconsistencies between VEX and EVEX.
Closed, Public

Authored by craig.topper on Apr 9 2019, 12:44 AM.

Details

Summary

Many of our instructions have both a _Int form used by intrinsics and a form
used by other IR constructs. In the EVEX space the _Int versions usually cover
all the capabilities, including broadcasting and rounding, while the other version
only covers simple register/register or register/load forms. For this reason,
in EVEX the non-intrinsic form is usually marked isCodeGenOnly=1.
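
As an illustration of the capability gap described above (a hedged sketch in LLVM's AT&T syntax, not part of the original summary), the plain form below is all the non _Int pattern has to cover, while EVEX-only extras such as embedded rounding go through the _Int pattern:

vaddss %xmm2, %xmm1, %xmm0              # simple register/register form
vaddss {rn-sae}, %xmm2, %xmm1, %xmm0    # EVEX embedded rounding, covered by the _Int version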

In the VEX encoding space we were less consistent, but usually the _Int version
was the isCodeGenOnly version.

This commit makes the VEX instructions match the EVEX instructions. This was
done by manually studying the AsmMatcher table, so it's possible I missed some
cases, but we should be closer now.

I'm thinking about using the isCodeGenOnly bit to simplify the EVEX2VEX
tablegen code that disambiguates the _Int and non _Int versions. Currently it
checks register class sizes and the Record that the memory operands come from. I have
some other changes I was looking into for D59266 that may break the memory check.

I had to make a few scheduler hacks to keep the _Int versions from being treated
differently from the non _Int versions.

I'm not sure what to do about the int-to-fpu-forwarding-2.s tests. That seems to be
an issue with the fact that we don't model the tied input on the non _Int
instructions in SSE.

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper created this revision. Apr 9 2019, 12:44 AM
Herald added a project: Restricted Project.

> Many of our instructions have both a _Int form used by intrinsics and a form used by other IR constructs.

Is there any documentation, a comment, a mail thread somewhere that explains why this is the way it is?
I.e. why do those _Int variants need to exist? (are they temporary, or are they here to stay?)

The mca(?) regression is troubling.

>> Many of our instructions have both a _Int form used by intrinsics and a form used by other IR constructs.
>
> Is there any documentation, a comment, a mail thread somewhere that explains why this is the way it is?
> I.e. why do those _Int variants need to exist? (are they temporary, or are they here to stay?)
>
> The mca(?) regression is troubling.

There may be a workaround to fix that regression. I plan to send a comment where I describe that workaround.

>> Many of our instructions have both a _Int form used by intrinsics and a form used by other IR constructs.
>
> Is there any documentation, a comment, a mail thread somewhere that explains why this is the way it is?
> I.e. why do those _Int variants need to exist? (are they temporary, or are they here to stay?)
>
> The mca(?) regression is troubling.

The _Int instructions use VR128 register classes while the non _Int versions use the smaller FR32/FR64 register classes. For unary operations like cvtss2sd, the hardware instruction reads two source registers: one of them determines the input for the operation being performed, and the other is only used to define the final upper bits. For the _Int versions we model this with 2 inputs. For the non _Int versions we only model one of the inputs for the legacy SSE encoding. For the VEX encoding we do model both inputs and set one to IMPLICIT_DEF. We have to do this for VEX since the operands are "tied", which means the register allocator must assign a register for the IMPLICIT_DEF.

For the SSE instructions we do have a late pass that knows these instructions have a "partialRegUpdate" and will insert a dependency-breaking XOR based on how long it's been since that register was last written. For VEX we try to reassign the register to the oldest register we can find, and if that doesn't work we use an XOR to break the dependency.
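
As a hedged sketch of what that looks like in emitted SSE code (illustrative only; the exact placement heuristics live in that late pass):

xorps    %xmm0, %xmm0      # dependency-breaking zero idiom inserted before the partial write
cvtss2sd %xmm1, %xmm0      # writes the low 64 bits of %xmm0 and merges the rest from its old value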

For binary instructions like addss, the lower bits are calculated by adding the lower bits of both sources. The upper bits of the output are defined by the upper bits of the first source register. For _Int we use VR128 for both sources and the destination. For non _Int we use FR32/FR64 for both sources and the output.
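
Spelled out at the assembly level (a sketch; the bit ranges are as documented in the ISA manuals):

addss  %xmm1, %xmm0          # xmm0[31:0] = xmm0[31:0] + xmm1[31:0]; xmm0[127:32] unchanged
vaddss %xmm2, %xmm1, %xmm0   # xmm0[31:0] = xmm1[31:0] + xmm2[31:0]; xmm0[127:32] copied from xmm1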

If we were to merge them it would require a bunch of COPY_TO_REGCLASS conversions to be added to the isel patterns. I fear it would have weird effects on the coalescer and on how the register allocator calculates spill slot sizes. Pure scalar float code coming from C that doesn't use intrinsics would be pessimized: after isel each instruction would produce a VR128, it would be copied to FR32, and then it would be copied back to VR128 for the next instruction. The register coalescer would merge those copies and only VR128 would exist. Then any spills would use a 128-bit spill slot even though we don't care about the upper bits.
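
A rough sketch of the churn being described, in hypothetical MIR-style pseudocode (not actual compiler output):

%a:vr128 = ADDSSrr_Int %x:vr128, %y:vr128   ; the merged instruction produces a VR128
%b:fr32  = COPY %a                          ; copy down to FR32 for a scalar user
%c:vr128 = COPY %b                          ; copy back up to VR128 for the next instruction
; after coalescing only the VR128 live range survives, so any spill becomes 128 bits wide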

I do think we should look into fixing the unary non _Int instructions to list their pass-through input and assign it to IMPLICIT_DEF like we do for VEX.

The llvm-mca change does reflect the data the scheduler would see when the user used intrinsics. So it's not exactly a "regression". It's showing an existing difference in modeling between the _Int and non _Int instructions, even though they use the same encoding and the same hardware.

The reason why there is a regression is because this patch adds an extra input operand to the following instructions:

cvtsi2ssl  %ecx, %xmm0
cvtsi2sdl  %ecx, %xmm0

For example:

cvtsi2ssl       %ecx, %xmm0     # <MCInst #775 CVTSI2SSrr_Int
                                #  <MCOperand Reg:142>
                                #  <MCOperand Reg:142>
                                #  <MCOperand Reg:25>>

XMM0 is now an in/out operand. Before this patch, XMM0 was only an output register.
This is equivalent to introducing a false dependency on the output register. That is what causes the extra latency in those two mca tests.

There is a way to work around the problem introduced by the presence of that extra register read.
We can force that extra read to always have zero-latency by introducing a special "ReadAdvance" definition.

--- X86Schedule.td      (revision 357997)
+++ X86Schedule.td      (working copy)
@@ -22,6 +22,7 @@
 // This SchedRead describes a bypass delay caused by data being moved from the
 // integer unit to the floating point unit.
 def ReadInt2Fpu : SchedRead;
+def ReadAfterInt2Fpu : SchedRead;

For BdVer2 ReadAfterInt2Fpu would be defined as:

def : ReadAdvance<ReadAfterInt2Fpu, 13>;

For BtVer2 it would be defined as follows:

def : ReadAdvance<ReadAfterInt2Fpu, 7>;
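
(My reading of the generic scheduling machinery, not something spelled out in this thread: a ReadAdvance of N cycles tells the model that the operand is read N cycles late, which hides up to N cycles of the producer's latency on that edge. Schematically, with hypothetical names:)

def : ReadAdvance<SomeSchedRead, 7>;   // hypothetical: reads tagged SomeSchedRead happen 7 cycles late, hiding up to 7 cycles of producer latency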

The last step is to add ReadAfterInt2Fpu to the schedule read/write list of SSE CVTSI*_Int definitions.
In my experimental change, I had to change multiclass sse12_cvt_sint_3addr by adding an extra param (see below):

Index: X86InstrSSE.td
===================================================================
--- X86InstrSSE.td      (revision 357997)
+++ X86InstrSSE.td      (working copy)
@@ -1026,6 +989,7 @@
 multiclass sse12_cvt_sint_3addr<bits<8> opc, RegisterClass SrcRC,
                     RegisterClass DstRC, X86MemOperand x86memop,
                     string asm, X86FoldableSchedWrite sched,
+                    SchedRead ReadAdv = ReadDefault,
                     bit Is2Addr = 1> {

That required the following changes too:

+let Predicates = [UseAVX] in {
+defm VCVTSI2SS : sse12_cvt_sint_3addr<0x2A, GR32, VR128,
+          i32mem, "cvtsi2ss{l}", WriteCvtI2SS, ReadDefault, 0>, XS, VEX_4V;
+defm VCVTSI642SS : sse12_cvt_sint_3addr<0x2A, GR64, VR128,
+          i64mem, "cvtsi2ss{q}", WriteCvtI2SS, ReadDefault, 0>, XS, VEX_4V, VEX_W;
+defm VCVTSI2SD : sse12_cvt_sint_3addr<0x2A, GR32, VR128,
+          i32mem, "cvtsi2sd{l}", WriteCvtI2SD, ReadDefault, 0>, XD, VEX_4V;
+defm VCVTSI642SD : sse12_cvt_sint_3addr<0x2A, GR64, VR128,
+          i64mem, "cvtsi2sd{q}", WriteCvtI2SD, ReadDefault, 0>, XD, VEX_4V, VEX_W;
+}
+let Constraints = "$src1 = $dst" in {
+  defm CVTSI2SS : sse12_cvt_sint_3addr<0x2A, GR32, VR128,
+                        i32mem, "cvtsi2ss{l}", WriteCvtI2SS, ReadAfterInt2Fpu>, XS;
+  defm CVTSI642SS : sse12_cvt_sint_3addr<0x2A, GR64, VR128,
+                        i64mem, "cvtsi2ss{q}", WriteCvtI2SS, ReadAfterInt2Fpu>, XS, REX_W;
+  defm CVTSI2SD : sse12_cvt_sint_3addr<0x2A, GR32, VR128,
+                        i32mem, "cvtsi2sd{l}", WriteCvtI2SD, ReadAfterInt2Fpu>, XD;
+  defm CVTSI642SD : sse12_cvt_sint_3addr<0x2A, GR64, VR128,
+                        i64mem, "cvtsi2sd{q}", WriteCvtI2SD, ReadAfterInt2Fpu>, XD, REX_W;
+}

This was enough to avoid the regression on BtVer2.

There is one last change required for BdVer2:

===================================================================
--- X86ScheduleBdVer2.td        (revision 357997)
+++ X86ScheduleBdVer2.td        (working copy)
@@ -898,7 +899,8 @@
   let Latency = 13;
   let NumMicroOps = 2;
 }
-def : InstRW<[PdWriteCVTSI642SDrr_CVTSI642SSrr_CVTSI2SDr_CVTSI2SSrr], (instrs CVTSI642SDrr, CVTSI642SSrr, CVTSI2SDrr, CVTSI2SSrr)>;
+def : InstRW<[PdWriteCVTSI642SDrr_CVTSI642SSrr_CVTSI2SDr_CVTSI2SSrr, ReadAfterInt2Fpu], (instrs CVTSI642SDrr, CVTSI642SSrr, CVTSI2SDrr, CVTSI2SSrr,
+                                                                              CVTSI642SDrr_Int, CVTSI642SSrr_Int, CVTSI2SDrr_Int, CVTSI2SSrr_Int)>;

Basically, we need to explicitly pass ReadAfterInt2Fpu to that InstRW. Otherwise the regression would not go away.

Tbh, I don't know if there is another way to fix that regression.
If we want to fix it, then this may be a way to do it.

I hope it helps.

Architecturally that read really does exist. It's not a false dependency. That read defines the upper bits of the result. The fact that AVX and SSE were different before this patch seems like a bug. It looks like with your proposed change they would still be different. That doesn't seem right.

> Architecturally that read really does exist. It's not a false dependency. That read defines the upper bits of the result. The fact that AVX and SSE were different before this patch seems like a bug. It looks like with your proposed change they would still be different. That doesn't seem right.

Right. Sorry. The upper bits of the result are unmodified for the SSE variants. So yes, it was a bug before, and that read does exist in practice.

andreadb accepted this revision. Apr 10 2019, 4:56 AM

>> Architecturally that read really does exist. It's not a false dependency. That read defines the upper bits of the result. The fact that AVX and SSE were different before this patch seems like a bug. It looks like with your proposed change they would still be different. That doesn't seem right.
>
> Right. Sorry. The upper bits of the result are unmodified for the SSE variants. So yes, it was a bug before, and that read does exist in practice.

So, I found out the reason why the modified BtVer2 test was expecting a higher IPC.
I am specifically talking about the code comment from that test that says `# Throughput for the SSE code snippets below should tend to 1.00 IPC.`

The microbenchmark that was used to measure the actual throughput on the target pre-initialized XMM0 to all-zeroes.
It shouldn't have done that, because AMD processors (at least since AMDFam15h) implement a "register merge optimization" based on knowledge of zero bits in XMM registers (see below).

Quoting the AMDFam16h SOG:

2.11 XMM Register Merge Optimization
The AMD Family 16h processor implements an XMM register merge optimization.
The processor keeps track of XMM registers whose upper portions have been cleared to zeros. This information
can be followed through multiple operations and register destinations until non-zero data is written into a
register. For certain instructions, this information can be used to bypass the usual result merging for the upper
parts of the register.

Instructions CVTSI2SS and CVTSI2SD are listed by that document as instructions that can benefit from that register merge optimization.

So... I have rerun my original microbenchmark. This time, I made sure not to set XMM0 to all-zeroes.
This is what I've got:

cycles:           105415644                                       ( +- 0.20% )
instructions:     26000303         #   0.25 insn per cycle        ( +- 0.00% )
micro-opcodes:    51640304         #   0.49 uOps per cycle        ( +- 0.01% )

So, yes. The test should not have expected a 1.00 IPC. It was a bug.
It should have been 0.25 IPC instead (which is what we would get with your patch).
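
For reference, the measured kernel would look roughly like the following (an assumed reconstruction, not the actual benchmark source). With %xmm0 left non-zero, every conversion has to merge into the previous contents of %xmm0, so the chain serializes on that merge dependency, which is where the ~0.25 IPC comes from:

.rept 1000
cvtsi2ssl %ecx, %xmm0    # each iteration merges into the prior value of %xmm0
.endr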


I also noticed that this same optimization is done by Fam15h processors (so, it applies to Piledriver). That same paragraph can be found in AMD Fam15h SOG - Section 5.5 Partial-Register Writes.

@lebedev.ri can probably verify those numbers for BdVer2.

Again, sorry for the confusion caused by my previous post. I was trying to be useful but I failed... (I keep forgetting about those SSE partial writes).

This patch looks good to me. Hopefully Roman will be able to verify those numbers for BdVer2.

-Andrea

This revision is now accepted and ready to land. Apr 10 2019, 4:56 AM

> I also noticed that this same optimization is done by Fam15h processors (so, it applies to Piledriver). That same paragraph can be found in AMD Fam15h SOG - Section 5.5 Partial-Register Writes.
>
> @lebedev.ri can probably verify those numbers for BdVer2.

Sorry, I lost track here. What's the exact methodology and the test?
Apply this patch and then $ perf stat ./bin/llvm-exegesis -opcode-name=CVTSI2SSrr_Int -mode=inverse_throughput -num-repetitions=1000000 ?

This revision was automatically updated to reflect the committed changes.