This is an archive of the discontinued LLVM Phabricator instance.

[X86][MCA] Address other issues with MULX reported in PR51495.
ClosedPublic

Authored by andreadb on Aug 25 2021, 2:13 PM.

Details

Summary

It turns out that SchedWrite WriteIMulH was always assigned to the low half of the result of a MULX (rather than to the high half).

To avoid confusion, this patch swaps the two MULX writes in the tablegen definition of MULX32/64.
That way, write names better describe what they actually refer to; this also avoids further complications, if in future we decide to reuse the same MulH writes to also model other scalar integer multiply instructions.
I also had to swap the latency values for the two MULX writes to make sure that the change is effectively an NFC. In fact, none of the existing x86 tests were affected by this small refactoring.

This patch also fixes a bug in MCA: a wrong latency value was propagated for instructions that perform multiple writes to a same register.
This last issue was found by Roman while testing MULX on targets that define a different latency for the Low/High part of the result.

Diff Detail

Event Timeline

andreadb created this revision.Aug 25 2021, 2:13 PM
andreadb requested review of this revision.Aug 25 2021, 2:13 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 25 2021, 2:13 PM
andreadb edited the summary of this revision. (Show Details)Aug 25 2021, 2:15 PM
lebedev.ri added a comment.EditedAug 25 2021, 2:23 PM
This comment has been deleted.
llvm/lib/MCA/HardwareUnits/RegisterFile.cpp
293

This doesn't compile for me

/repositories/llvm-project/llvm/lib/MCA/HardwareUnits/RegisterFile.cpp:293:34: error: use of undeclared identifier 'RegISterMappings'; did you mean 'RegisterMappings'?
    const WriteRef &OtherWrite = RegISterMappings[RegID].first;
                                 ^~~~~~~~~~~~~~~~
                                 RegisterMappings
/repositories/llvm-project/llvm/include/llvm/MCA/HardwareUnits/RegisterFile.h:190:32: note: 'RegisterMappings' declared here
  std::vector<RegisterMapping> RegisterMappings;
                               ^

I think we should rename WriteIMulH/WriteIMulH to better convey that it is for the low half of the multiplicative result.

I think we should rename WriteIMulH/WriteIMulH to better convey that it is for the low half of the multiplicative result.

So, rather than swapping the position of the two writes, you suggest to rename WriteIMulH into something like WriteIMulLo?

llvm/lib/MCA/HardwareUnits/RegisterFile.cpp
293

I will fix it. Sorry.

I think we should rename WriteIMulH/WriteIMulH to better convey that it is for the low half of the multiplicative result.

So, rather than swapping the position of the two writes, you suggest to rename WriteIMulH into something like WriteIMulLo?

No, i mean in addition to the current diff, also rename the WriteIMulH.

I think we should rename WriteIMulH/WriteIMulH to better convey that it is for the low half of the multiplicative result.

So, rather than swapping the position of the two writes, you suggest to rename WriteIMulH into something like WriteIMulLo?

No, i mean in addition to the current diff, also rename the WriteIMulH.

With this patch, WriteIMulH now correctly references the high half.

If you think that the name should be changed then what name do you suggest to use?

lebedev.ri accepted this revision.Aug 25 2021, 3:13 PM

I think we should rename WriteIMulH/WriteIMulH to better convey that it is for the low half of the multiplicative result.

So, rather than swapping the position of the two writes, you suggest to rename WriteIMulH into something like WriteIMulLo?

No, i mean in addition to the current diff, also rename the WriteIMulH.

With this patch, WriteIMulH now correctly references the high half.

If you think that the name should be changed then what name do you suggest to use?

Ah, hmm, i think i got fooled by overrides in znver3 model.
Looking at this again, i believe this is correct as-is.
I will fix Zen3 model afterwards.

LG
@RKSimon ?

This revision is now accepted and ready to land.Aug 25 2021, 3:13 PM
andreadb updated this revision to Diff 368753.Aug 25 2021, 3:25 PM

Address review comment.

andreadb marked an inline comment as done.Aug 25 2021, 3:25 PM
RKSimon accepted this revision.Aug 26 2021, 3:07 AM

LGTM

lebedev.ri added a comment.EditedAug 26 2021, 5:25 AM

I believe there is some other llvm-mca bug, because now i can not fix znver3 since llvm-mca simply hangs on existing tests (ninja check-llvm-tools-llvm-mca) after:

diff --git a/llvm/lib/Target/X86/X86ScheduleZnver3.td b/llvm/lib/Target/X86/X86ScheduleZnver3.td
index c2be9ec6085d..be07c069aae1 100644
--- a/llvm/lib/Target/X86/X86ScheduleZnver3.td
+++ b/llvm/lib/Target/X86/X86ScheduleZnver3.td
@@ -617,45 +617,11 @@ defm : Zn3WriteResIntPair<WriteIMul16, [Zn3Multiplier], 3, [3], 3, /*LoadUOps=*/
 defm : Zn3WriteResIntPair<WriteIMul16Imm, [Zn3Multiplier], 4, [4], 2>; // Integer 16-bit multiplication by immediate.
 defm : Zn3WriteResIntPair<WriteIMul16Reg, [Zn3Multiplier], 3, [1], 1>; // Integer 16-bit multiplication by register.
 defm : Zn3WriteResIntPair<WriteIMul32, [Zn3Multiplier], 3, [3], 2>;    // Integer 32-bit multiplication.
-defm : Zn3WriteResIntPair<WriteMULX32, [Zn3Multiplier], 4, [1], 2>;    // Integer 32-bit Unsigned Multiply Without Affecting Flags.
-
-def Zn3MULX32rr : SchedWriteRes<[Zn3Multiplier]> {
-  let Latency = 4;
-  let ResourceCycles = [1];
-  let NumMicroOps = 2;
-}
-def : InstRW<[Zn3MULX32rr, WriteIMulH], (instrs MULX32rr)>;
-
-def Zn3MULX32rm : SchedWriteRes<[Zn3AGU012, Zn3Load, Zn3Multiplier]> {
-  let Latency = !add(Znver3Model.LoadLatency, Zn3MULX32rr.Latency);
-  let ResourceCycles = [1, 1, 2];
-  let NumMicroOps = Zn3MULX32rr.NumMicroOps;
-}
-def : InstRW<[Zn3MULX32rm, WriteIMulHLd,
-              ReadDefault, ReadDefault, ReadDefault, ReadDefault, ReadDefault,
-              ReadAfterLd], (instrs MULX32rm)>;
-
+defm : Zn3WriteResIntPair<WriteMULX32, [Zn3Multiplier], 3, [1], 2>;    // Integer 32-bit Unsigned Multiply Without Affecting Flags.
 defm : Zn3WriteResIntPair<WriteIMul32Imm, [Zn3Multiplier], 3, [1], 1>; // Integer 32-bit multiplication by immediate.
 defm : Zn3WriteResIntPair<WriteIMul32Reg, [Zn3Multiplier], 3, [1], 1>; // Integer 32-bit multiplication by register.
 defm : Zn3WriteResIntPair<WriteIMul64, [Zn3Multiplier], 3, [3], 2>;    // Integer 64-bit multiplication.
-defm : Zn3WriteResIntPair<WriteMULX64, [Zn3Multiplier], 4, [1], 2>;    // Integer 32-bit Unsigned Multiply Without Affecting Flags.
-
-def Zn3MULX64rr : SchedWriteRes<[Zn3Multiplier]> {
-  let Latency = 4;
-  let ResourceCycles = [1];
-  let NumMicroOps = 2;
-}
-def : InstRW<[Zn3MULX64rr, WriteIMulH], (instrs MULX64rr)>;
-
-def Zn3MULX64rm : SchedWriteRes<[Zn3AGU012, Zn3Load, Zn3Multiplier]> {
-  let Latency = !add(Znver3Model.LoadLatency, Zn3MULX64rr.Latency);
-  let ResourceCycles = [1, 1, 2];
-  let NumMicroOps = Zn3MULX64rr.NumMicroOps;
-}
-def : InstRW<[Zn3MULX64rm, WriteIMulHLd,
-              ReadDefault, ReadDefault, ReadDefault, ReadDefault, ReadDefault,
-              ReadAfterLd], (instrs MULX64rm)>;
-
+defm : Zn3WriteResIntPair<WriteMULX64, [Zn3Multiplier], 3, [1], 2>;    // Integer 32-bit Unsigned Multiply Without Affecting Flags.
 defm : Zn3WriteResIntPair<WriteIMul64Imm, [Zn3Multiplier], 3, [1], 1>; // Integer 64-bit multiplication by immediate.
 defm : Zn3WriteResIntPair<WriteIMul64Reg, [Zn3Multiplier], 3, [1], 1>; // Integer 64-bit multiplication by register.
 defm : Zn3WriteResInt<WriteIMulHLd, [], !add(4, Znver3Model.LoadLatency), [], 0>;  // Integer multiplication, high part.

I believe there is some other llvm-mca bug, because now i can not fix znver3 since llvm-mca simply hangs on existing tests (ninja check-llvm-tools-llvm-mca) after:

diff --git a/llvm/lib/Target/X86/X86ScheduleZnver3.td b/llvm/lib/Target/X86/X86ScheduleZnver3.td
index c2be9ec6085d..be07c069aae1 100644
--- a/llvm/lib/Target/X86/X86ScheduleZnver3.td
+++ b/llvm/lib/Target/X86/X86ScheduleZnver3.td
@@ -617,45 +617,11 @@ defm : Zn3WriteResIntPair<WriteIMul16, [Zn3Multiplier], 3, [3], 3, /*LoadUOps=*/
 defm : Zn3WriteResIntPair<WriteIMul16Imm, [Zn3Multiplier], 4, [4], 2>; // Integer 16-bit multiplication by immediate.
 defm : Zn3WriteResIntPair<WriteIMul16Reg, [Zn3Multiplier], 3, [1], 1>; // Integer 16-bit multiplication by register.
 defm : Zn3WriteResIntPair<WriteIMul32, [Zn3Multiplier], 3, [3], 2>;    // Integer 32-bit multiplication.
-defm : Zn3WriteResIntPair<WriteMULX32, [Zn3Multiplier], 4, [1], 2>;    // Integer 32-bit Unsigned Multiply Without Affecting Flags.
-
-def Zn3MULX32rr : SchedWriteRes<[Zn3Multiplier]> {
-  let Latency = 4;
-  let ResourceCycles = [1];
-  let NumMicroOps = 2;
-}
-def : InstRW<[Zn3MULX32rr, WriteIMulH], (instrs MULX32rr)>;
-
-def Zn3MULX32rm : SchedWriteRes<[Zn3AGU012, Zn3Load, Zn3Multiplier]> {
-  let Latency = !add(Znver3Model.LoadLatency, Zn3MULX32rr.Latency);
-  let ResourceCycles = [1, 1, 2];
-  let NumMicroOps = Zn3MULX32rr.NumMicroOps;
-}
-def : InstRW<[Zn3MULX32rm, WriteIMulHLd,
-              ReadDefault, ReadDefault, ReadDefault, ReadDefault, ReadDefault,
-              ReadAfterLd], (instrs MULX32rm)>;
-
+defm : Zn3WriteResIntPair<WriteMULX32, [Zn3Multiplier], 3, [1], 2>;    // Integer 32-bit Unsigned Multiply Without Affecting Flags.
 defm : Zn3WriteResIntPair<WriteIMul32Imm, [Zn3Multiplier], 3, [1], 1>; // Integer 32-bit multiplication by immediate.
 defm : Zn3WriteResIntPair<WriteIMul32Reg, [Zn3Multiplier], 3, [1], 1>; // Integer 32-bit multiplication by register.
 defm : Zn3WriteResIntPair<WriteIMul64, [Zn3Multiplier], 3, [3], 2>;    // Integer 64-bit multiplication.
-defm : Zn3WriteResIntPair<WriteMULX64, [Zn3Multiplier], 4, [1], 2>;    // Integer 32-bit Unsigned Multiply Without Affecting Flags.
-
-def Zn3MULX64rr : SchedWriteRes<[Zn3Multiplier]> {
-  let Latency = 4;
-  let ResourceCycles = [1];
-  let NumMicroOps = 2;
-}
-def : InstRW<[Zn3MULX64rr, WriteIMulH], (instrs MULX64rr)>;
-
-def Zn3MULX64rm : SchedWriteRes<[Zn3AGU012, Zn3Load, Zn3Multiplier]> {
-  let Latency = !add(Znver3Model.LoadLatency, Zn3MULX64rr.Latency);
-  let ResourceCycles = [1, 1, 2];
-  let NumMicroOps = Zn3MULX64rr.NumMicroOps;
-}
-def : InstRW<[Zn3MULX64rm, WriteIMulHLd,
-              ReadDefault, ReadDefault, ReadDefault, ReadDefault, ReadDefault,
-              ReadAfterLd], (instrs MULX64rm)>;
-
+defm : Zn3WriteResIntPair<WriteMULX64, [Zn3Multiplier], 3, [1], 2>;    // Integer 32-bit Unsigned Multiply Without Affecting Flags.
 defm : Zn3WriteResIntPair<WriteIMul64Imm, [Zn3Multiplier], 3, [1], 1>; // Integer 64-bit multiplication by immediate.
 defm : Zn3WriteResIntPair<WriteIMul64Reg, [Zn3Multiplier], 3, [1], 1>; // Integer 64-bit multiplication by register.
 defm : Zn3WriteResInt<WriteIMulHLd, [], !add(4, Znver3Model.LoadLatency), [], 0>;  // Integer multiplication, high part.

I'll see if I can reproduce it. Thanks for reporting it.

@lebedev.ri I finally found out what the problem was. Luckily the fix was simple, and I was able to test it using your modified scheduling model.

That bug is fixed by this commit:

[llvm] 1eb7536 - [MCA][RegisterFile] Consistently update the PRF in the presence of multiple writes to the same register.

Please let me know if that works for you too.

Apologies for all the issues caused by this change.

Thanks,
-Andrea

@lebedev.ri I finally found out what the problem was. Luckily the fix was simple, and I was able to test it using your modified scheduling model.

That bug is fixed by this commit:

[llvm] 1eb7536 - [MCA][RegisterFile] Consistently update the PRF in the presence of multiple writes to the same register.

Please let me know if that works for you too.

Apologies for all the issues caused by this change.

Thanks,
-Andrea

Thank you!