This is an archive of the discontinued LLVM Phabricator instance.

[X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s
ClosedPublic

Authored by andreadb on Aug 31 2018, 7:56 AM.

Details

Summary

A ReadAdvance was incorrectly added to the list of SchedReadWrite for the following opcodes:

sqrtss
sqrtsd
rsqrtss
rcpss

As a consequence, a wrong operand latency was computed for the register operand used by the folded load.

This patch removes the wrong ReadAdvance, and updates the llvm-mca test cases. Now the llvm-mca timeline report shows correct timings for those unary fp SSE1/SSE2 instructions.

Please let me know if okay to commit.

Thanks,
-Andrea

Diff Detail

Repository
rL LLVM

Event Timeline

andreadb created this revision.Aug 31 2018, 7:56 AM

I think this requires an understanding of the intent of ReadAfterLd:

// Instructions with folded loads need to read the memory operand immediately,
// but other register operands don't have to be read until the load is ready.
// These operands are marked with ReadAfterLd.

...that D51534 did not. That's because a broadcast only has one source operand, so ReadAfterLd doesn't even make sense on that instruction?

In this case, we have 2 source operands:

  1. The loaded value that we're doing the math on.
  2. The unchanging vector lanes of the second source (destination) register.

The patch has the intended effect of making the math op depend on the load operand, but it's not clear to me what is or should be happening in a case like this on skylake:

Trunk:

[0,0]     DeeeeeeeeeER.   dppd	$1, %xmm1, %xmm2 <--- long latency, but pipelined
[0,1]     D=eE-------R.   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=eeeeeeeeeER   rsqrtss	(%rax), %xmm2 <--- wrong: this can't start executing before %rax is loaded

Apply this patch (remove ReadAfterLd:)

[0,0]     DeeeeeeeeeER .   dppd	$1, %xmm1, %xmm2
[0,1]     D=eE-------R .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D==eeeeeeeeeER   rsqrtss	(%rax), %xmm2  <--- is this right? the calc can begin before xmm2 is known?

But with AVX the 2nd source is explicit, and ReadAfterLd has a different effect:

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

No ReadAfterLd:

[0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known

Hi Sanjay,

Thanks for the feedback.

I think this requires an understanding of the intent of ReadAfterLd:

// Instructions with folded loads need to read the memory operand immediately,
// but other register operands don't have to be read until the load is ready.
// These operands are marked with ReadAfterLd.

...that D51534 did not. That's because a broadcast only has one source operand, so ReadAfterLd doesn't even make sense on that instruction?

Not only it didn't make any sense. It was even harmful because it was decreasing the use latency of the register used as the base address for the folded load by 'ReadAfterLd' cycles..

In this case, we have 2 source operands:

  1. The loaded value that we're doing the math on.
  2. The unchanging vector lanes of the second source (destination) register.

I think you are getting confused. These are not instructions with 2 input operands.
These are just SSE1/SSE2 unary operations (one def, and one use; see below the tablegen definition).

def m : I<opc, MRMSrcMem, (outs RC:$dst), (ins x86memop:$src1)

The patch has the intended effect of making the math op depend on the load operand, but it's not clear to me what is or should be happening in a case like this on skylake:

Trunk:

[0,0]     DeeeeeeeeeER.   dppd	$1, %xmm1, %xmm2 <--- long latency, but pipelined
[0,1]     D=eE-------R.   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=eeeeeeeeeER   rsqrtss	(%rax), %xmm2 <--- wrong: this can't start executing before %rax is loaded

Apply this patch (remove ReadAfterLd:)

[0,0]     DeeeeeeeeeER .   dppd	$1, %xmm1, %xmm2
[0,1]     D=eE-------R .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D==eeeeeeeeeER   rsqrtss	(%rax), %xmm2  <--- is this right? the calc can begin before xmm2 is known?

Here rsqrtss cannot start if %rax is not available.
The value of %xmm2 is not needed because a) it is not an input register, and b) false dependencies on %xmm2 are eliminated by the register renamer.

But with AVX the 2nd source is explicit, and ReadAfterLd has a different effect:

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

Here the use latency of %xmm2 is decreased by a number of cycles specified by the read-advance.
From the look of that timeline, you are probably testing for Skylake. Skylake defines ReadAfterLd as follows:

def : ReadAdvance<ReadAfterLd, 5>;

So the vrsqrtss can assume that %xmm2 is available 5 cycles in advance (i.e the use latency is decreased by 5 cycles).

No ReadAfterLd:

[0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known

My patch doesn't modify/remove that ReadAfterLd. It would be wrong.
As I wrote, this patch only affects SSE1/2 definitions.

I hope this helps
-Andrea

Hi Sanjay,

Thanks for the feedback.

I think this requires an understanding of the intent of ReadAfterLd:

// Instructions with folded loads need to read the memory operand immediately,
// but other register operands don't have to be read until the load is ready.
// These operands are marked with ReadAfterLd.

...that D51534 did not. That's because a broadcast only has one source operand, so ReadAfterLd doesn't even make sense on that instruction?

Not only it didn't make any sense. It was even harmful because it was decreasing the use latency of the register used as the base address for the folded load by 'ReadAfterLd' cycles..

In this case, we have 2 source operands:

  1. The loaded value that we're doing the math on.
  2. The unchanging vector lanes of the second source (destination) register.

I think you are getting confused. These are not instructions with 2 input operands.
These are just SSE1/SSE2 unary operations (one def, and one use; see below the tablegen definition).

Sorry. You are right. The output register is only partially updated.
Forget about my other comment on the register renamer; if the processor doesn't keep track of XMM registers whose upper portions have been cleared to zeros, then the instruction has to wait on the previous value of the destination register.
This is quite tricky to model for llvm-mca without adding extra domain knowledge.

That being said, the ReadAfterLd is still wrong as it applies to the register containing the base address of $src1.

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

lib/Target/X86/X86InstrSSE.td
2725 โ†—(On Diff #163533)

Intrinsic version needs updating as well? Although AFAICT this can't be tested within llvm-mca....

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

Yep.

I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.

lib/Target/X86/X86InstrSSE.td
2725 โ†—(On Diff #163533)

I think the intrinsic version is fine because it uses three operands, and ReadAdvance correctly applies to the register operand (i.e. $src1), and not the memory operand.
Also, src1 is tied to $dst.

It would not be easy to test it. As you wrote, we cannot write an llvm-mca since it is a "codegen only" instruction...

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

Yep.

I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.

Raised bug https://bugs.llvm.org/show_bug.cgi?id=38813.

spatel added a comment.Sep 3 2018, 7:10 AM

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

Yep.

I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.

Raised bug https://bugs.llvm.org/show_bug.cgi?id=38813.

Thanks! I think everyone agrees that the ReadAfterLd is not behaving as intended on the SSE instructions. I am still wondering about the AVX case (and the SSE intrinsic is the same?):
As noted, this is testing with llvm-mca with skylake CPU:

With ReadAfterLd:

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

No ReadAfterLd:

[0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known

Do we have confirmation from hardware testing or Intel docs that the 1st timeline is correct? We can punt this question to another patch to not delay this one if needed, but I'd like to understand what the hardware does in this case.

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

Yep.

I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.

Raised bug https://bugs.llvm.org/show_bug.cgi?id=38813.

Thanks! I think everyone agrees that the ReadAfterLd is not behaving as intended on the SSE instructions. I am still wondering about the AVX case (and the SSE intrinsic is the same?):
As noted, this is testing with llvm-mca with skylake CPU:

With ReadAfterLd:

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

No ReadAfterLd:

[0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known

Do we have confirmation from hardware testing or Intel docs that the 1st timeline is correct? We can punt this question to another patch to not delay this one if needed, but I'd like to understand what the hardware does in this case.

VRSQRTSS depends on XMM2, so it delayed by vdppd.
XMM2 doesn't have to necessarily be ready immediately when VRSQRTSS starts executing. That is because the uOp for the folded load is executed first, and then the loaded value is forwarded to the pipeline that executes VRSQRTSS. Only at that point, XMM2 has to be available.

I don't have a skylake to microbenchmark that sequence. However, (in my mental model) it kind of makes sense to have a ReadAfterLd there. I would be a "strange" otherwise.

spatel accepted this revision.Sep 3 2018, 7:50 AM

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

Yep.

I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.

Raised bug https://bugs.llvm.org/show_bug.cgi?id=38813.

Thanks! I think everyone agrees that the ReadAfterLd is not behaving as intended on the SSE instructions. I am still wondering about the AVX case (and the SSE intrinsic is the same?):
As noted, this is testing with llvm-mca with skylake CPU:

With ReadAfterLd:

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

No ReadAfterLd:

[0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known

Do we have confirmation from hardware testing or Intel docs that the 1st timeline is correct? We can punt this question to another patch to not delay this one if needed, but I'd like to understand what the hardware does in this case.

VRSQRTSS depends on XMM2, so it delayed by vdppd.
XMM2 doesn't have to necessarily be ready immediately when VRSQRTSS starts executing. That is because the uOp for the folded load is executed first, and then the loaded value is forwarded to the pipeline that executes VRSQRTSS. Only at that point, XMM2 has to be available.

I don't have a skylake to microbenchmark that sequence. However, (in my mental model) it kind of makes sense to have a ReadAfterLd there. I would be a "strange" otherwise.

Note: we can observe similar behavior with btver2 as the CPU model with llvm-mca (because the ReadAfterLd is on the default instruction definition):

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     .DeeE------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     . D====eeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3

And I was thinking the opposite - it seems strange to me that hardware would devote resources to make this case faster, but maybe it's not as difficult as I'm imagining.

We don't need to hold this patch up before we answer that, so LGTM.

This revision is now accepted and ready to land.Sep 3 2018, 7:50 AM
This revision was automatically updated to reflect the committed changes.