This is an archive of the discontinued LLVM Phabricator instance.

[x86] allow single source horizontal op matching (PR39195)
ClosedPublic

Authored by spatel on Oct 8 2018, 1:03 PM.

Details

Summary

This is intended to restore horizontal codegen to what it looked like before IR demanded elements improved in:
rL343727

As noted in PR39195:
https://bugs.llvm.org/show_bug.cgi?id=39195
...horizontal ops can be worse for performance than a shuffle+regular binop, so I've added a TODO. Ideally, we'd solve that in a machine instruction pass, but a quicker solution may be adding a 'HasSlowHorizontalOp' feature/bug bit to deal with it here in the DAG.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Oct 8 2018, 1:03 PM

Hi Sanjay,

On Jaguar only, those unary HADD are going to be as fast as the SHUFFLE+ADD sequence.
In terms of overall latency, both sequences are pretty much equivalent.

HADD has a worse throughput than the SHUFFLE+ADD sequence (1 IPC) mainly because it can only execute on pipe0. SHUFFLE+ADD gives more flexibility to the HW scheduler.
The biggest advantage on Jaguar is that HADD is not microcoded. The XMM variant is fast-path single, which allows us to achieve a better throughput from the decoders (w.r.t. the SHUFFLE+ADD).

I don't see it as a big problem if we start "regressing" this particular case on Jaguar.

I don't have a problem with aggressively selecting HADD at ISel stage, provided that we "undo" that canonicalization in a later (machine combiner?) pass.
Using HADD is not just slow for Intel, it is going to be slow for other AMD processors too. Similarly to what we do for other instructions (CMOV/LEA) which may be further expanded later on.

The problem with having a rule in the machine combiner is that we need to account for register pressure and block frequency too. Essentially, we need a (not too trivial) cost model there; simply comparing code snippets in term of throughput and latency is probably not enough at that stage.
We could have a post-RA pass (before we run the post-RA scheduler) that decides when it is profitable to revert the HADD canonicalization and expand it back to a shuffle+add.

Just my opinion.

spatel added a comment.Oct 9 2018, 5:18 AM

I don't see it as a big problem if we start "regressing" this particular case on Jaguar.

Ok - that's not the impression I got reading PR39195. @dyung - how important is this pattern?

I don't have a problem with aggressively selecting HADD at ISel stage, provided that we "undo" that canonicalization in a later (machine combiner?) pass.
Using HADD is not just slow for Intel, it is going to be slow for other AMD processors too. Similarly to what we do for other instructions (CMOV/LEA) which may be further expanded later on.

Yes, I agree. I even filed a bug. :)
https://bugs.llvm.org/show_bug.cgi?id=26859

So 2 options for moving forward:

  1. Allow this transform as shown here because it is mostly just restoring the behavior of last week. Follow that up with a subtarget feature to prevent the transform (not ideal, but the alternative 'undo' is much harder).
  2. Limit this transform to 'optsize' right now because it's a size win in all cases.

So 2 options for moving forward:

  1. Allow this transform as shown here because it is mostly just restoring the behavior of last week. Follow that up with a subtarget feature to prevent the transform (not ideal, but the alternative 'undo' is much harder).
  2. Limit this transform to 'optsize' right now because it's a size win in all cases.

I'd vote for (1) for this patch - optsize + HasFastHorinzontalOp might be necessary depending on how soon we can agree on a scheduler model driven mechanism that re-expands HADD later on (per Andrea's suggestion - but hopefully we can discuss that at the devmtg)

So 2 options for moving forward:

  1. Allow this transform as shown here because it is mostly just restoring the behavior of last week. Follow that up with a subtarget feature to prevent the transform (not ideal, but the alternative 'undo' is much harder).
  2. Limit this transform to 'optsize' right now because it's a size win in all cases.

I'd vote for (1) for this patch - optsize + HasFastHorinzontalOp might be necessary depending on how soon we can agree on a scheduler model driven mechanism that re-expands HADD later on (per Andrea's suggestion - but hopefully we can discuss that at the devmtg)

Same.
I am okay with (1) for now.

RKSimon accepted this revision.Oct 9 2018, 10:04 AM

LGTM - thanks

This revision is now accepted and ready to land.Oct 9 2018, 10:04 AM
dyung added a comment.Oct 9 2018, 11:32 AM

I don't see it as a big problem if we start "regressing" this particular case on Jaguar.

Ok - that's not the impression I got reading PR39195. @dyung - how important is this pattern?

I will defer to Andrea's expertise in this area. I was under the impression that generating the horizontal add/sub instructions was preferred for these cases (especially for btver2), but if that is not the case, I can update our test to not expect it to be generated.

spatel added a comment.Oct 9 2018, 1:39 PM

I don't see it as a big problem if we start "regressing" this particular case on Jaguar.

Ok - that's not the impression I got reading PR39195. @dyung - how important is this pattern?

I will defer to Andrea's expertise in this area. I was under the impression that generating the horizontal add/sub instructions was preferred for these cases (especially for btver2), but if that is not the case, I can update our test to not expect it to be generated.

Thanks. I think we're all in agreement: we have 2 wrongs making it right (less h-ops) for most CPUs (but not Jaguar) currently. This patch will remove 1 of those wrongs, but there's little regression potential because it's just restoring what happened before (h-ops are unusually good for Jaguar). I should have the follow-up patch to try to make every CPU happy posted soon after this is committed.

This revision was automatically updated to reflect the committed changes.