Allow target to decide when to cluster loads/stores in misched
ClosedPublic

Authored by rampitec on Sep 11 2017, 10:30 AM.

Details

Summary

MachineScheduler when clustering loads or stores checks if base
pointers point to the same memory. This check is done through
comparison of base registers of two memory instructions. This
works fine when instructions have separate offset operand. If
they require a full calculated pointer such instructions can
never be clustered according to such logic.

Changed shouldClusterMemOps to accept base registers as well and
let it decide what to do about it.

Diff Detail

Repository
rL LLVM
rampitec created this revision.Sep 11 2017, 10:30 AM

I'd suggest changing the name to doMemOpsHaveSameBase[Ptr]

rampitec updated this revision to Diff 114638.Sep 11 2017, 10:54 AM
rampitec retitled this revision from Add target callback areMemOpsHaveSameBasePtr to Add target callback doMemOpsHaveSameBasePtr.

Renamed callback as suggested by Brian.
Cleanup the test.

arsenm added inline comments.Sep 11 2017, 11:03 AM
include/llvm/Target/TargetInstrInfo.h
1094–1095 ↗(On Diff #114638)

This needs more clarification on how it's different from getMemOpBaseRegImmOfs.

rampitec updated this revision to Diff 114639.Sep 11 2017, 11:12 AM
rampitec marked an inline comment as done.

Updated comment as suggested.

hfinkel added inline comments.
lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗(On Diff #114639)

Can you make *this* the default implementation?

rampitec added inline comments.Sep 11 2017, 8:44 PM
lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗(On Diff #114639)

This will result in both improvements an regressions across the board as it will increase amount of clustering dramatically. Having no access to all HW and benchmark suites and targets I left the default behavior intact. In particular I had to limit clustering in amdgpu below to maintain reasonable average performance, which I cannot do for all other targets.

I.e. I would be more than happy to make it default behavior, but I think much more people should agree to spend their time tuning it after for their targets.

What we can do, if this is eventually clonned to the majority of targets requiring clustering, then change the default.

What others think? Is this a generally desired default behaviour?

hfinkel added inline comments.Sep 11 2017, 9:12 PM
lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗(On Diff #114639)

If we generally adopted your current methodology, then we'd very rarely get any improvements to the common components. We need to do things that generally make sense. People are encouraged to benchmark them on a variety of targets, and we'll often give them a heads up (on llvm-dev and/or pinging target code owners), but otherwise, they'll notice post-commit if things regress. That having been said...

As currently defined by this patch, this implementation is how I'd recommend implementing the target-independent code for the callback. The question is then: is this the right callback to add? Is it too general? Do you think that, some tuning aside, this is the right thing to do on most/all relevant targets? Do we actually want this callback or do we want just a Boolean switch (this behavior vs. the old behavior)? [I can certainly imagine hardware store combining where the base register actually matters, and I can also imagine hardware where the address is all that matters, but I'd assume most modern hardware wants the latter.]

rampitec added inline comments.Sep 11 2017, 9:33 PM
lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗(On Diff #114639)

In general there are several questionable things about it if we generalize it for all targets:

  1. It does not respect offset distance, which may be important. I.e. some targets may only want it if they hit the same cacheline. It does not apply to amdgpu because we always want to form so called "memory clause", a batch of memory instructions.
  1. It is not a fastest code out there as it uses GetUnderlyingObject.
  1. Some targets may still want to implement it for the sake of further combining loads, but relying on memory operands that may not be generally possible to do.
  1. That is unclear if all targets maintain reasonable memory operands for this to be useful, however it still compares base registers as its first resort.

All of that does not prohibit it to be a default implementation though if can be overwritten.

vpykhtin accepted this revision.Sep 12 2017, 9:07 AM

SIInstrInfo::doMemOpsHaveSameBasePtr looks generic enough to be default indeed but should we commit this first and then make doMemOpsHaveSameBasePtr default so we could rollback to this one in case of severe regressions? LGTM by the way.

This revision is now accepted and ready to land.Sep 12 2017, 9:07 AM

SIInstrInfo::doMemOpsHaveSameBasePtr looks generic enough to be default indeed but should we commit this first and then make doMemOpsHaveSameBasePtr default so we could rollback to this one in case of severe regressions? LGTM by the way.

Good point. I will create a separate patch to switch the default implementation.

rampitec marked 4 inline comments as done.
rampitec added inline comments.
lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗(On Diff #114639)

D37755 moves AMDGPU implementation into core.

hfinkel added inline comments.Sep 13 2017, 2:05 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗(On Diff #114639)

To comment on the conditions:

In general there are several questionable things about it if we generalize it for all targets:

  1. It does not respect offset distance, which may be important. I.e. some targets may only want it if they hit the same cacheline. It does not apply to amdgpu because we always want to form so called "memory clause", a batch of memory instructions.

This is fair, although the current clustering logic does not consider this.

  1. It is not a fastest code out there as it uses GetUnderlyingObject.

We already essentially call this once per MMO-carrying instruction in ScheduleDAGInstrs.cpp. Doing it again here does not seem so bad. Unless we're now doing it O(N^2) times. Are we?

  1. Some targets may still want to implement it for the sake of further combining loads, but relying on memory operands that may not be generally possible to do.
  2. That is unclear if all targets maintain reasonable memory operands for this to be useful, however it still compares base registers as its first resort.

Targets should maintain reasonable MMOs, and as far as I know, all (in tree) targets now do.

360 ↗(On Diff #114639)

D37755 moves AMDGPU implementation into core.

Thanks for posting the follow-up patch, but I don't think that we should do that. Looking at how AArch64 uses the existing implementation, I do think that we should rename the callback to be something specific to store clustering.

As far as I can tell from the discussion in D18376, the store clustering, as used by AArch64, is specifically setup to better enable the AArch64LoadStoreOptimizer, and that pass is specifically looking for stores that share the same base register (within some small window, which is why this helps).

As a result, I think it would be good to make the callback specific to store clustering.

rampitec marked an inline comment as done.Sep 13 2017, 8:57 AM
rampitec added inline comments.
lib/Target/AMDGPU/SIInstrInfo.cpp
360 ↗(On Diff #114639)

Current implementation is O(N). Agree on the rest of conditions comments too.

Then it looks like for the targets in the tree only AMDGPU and AArch64 call createLoadClusterDAGMutation and createStoreClusterDAGMutation, and AArch needs the current implementation as it uses it for further combining. So I agree, D37755 is premature.

I am not sure about renaming though as both targets need both load and store clustering. For AMDGPU we definitely want it for both loads and stores.

  • When seeing the name doMemOpsHaveSameBasePtr I would expect this to do exactly BaseReg1 == BaseReg2. Can you explain how you happen to have cases with different base pointers that still reference the same object?
  • Given that this callback is very specific to the selectiondag clustering mutator, maybe we better add a callback to createLoadClusterDAGMutation/createStoreClusterDAGMutation rather than in TargetInstructionInfo. That way you could also choose a better name like shouldCluster.
  • We should indeed not change the behavior for aarch64, as the mutation mainly exists there to enable formation of load/store double instructions which require the same base pointer.
  • When seeing the name doMemOpsHaveSameBasePtr I would expect this to do exactly BaseReg1 == BaseReg2. Can you explain how you happen to have cases with different base pointers that still reference the same object?

That can only work if instruction has base register and offset operands. AMDGPU flat_load and flat_store instructions do not, their address operand is a single register. If you need to load two values from the same base pointer you do:

flat_load %ptr1
%ptr2 = %ptr1 + offset
flat_load %ptr2

  • Given that this callback is very specific to the selectiondag clustering mutator, maybe we better add a callback to createLoadClusterDAGMutation/createStoreClusterDAGMutation rather than in TargetInstructionInfo. That way you could also choose a better name like shouldCluster.

I cannot directly inherit LoadClusterMutation or StoreClusterMutation, it is defined in the MachineScheduler.cpp, and then if I move it to common interface I do not want to duplicate the code in both. I can pass a lambda/functor into create(Load|Store)ClusterDAGMutation as a last argument. Is that what you have in mind?
As an alternative I can just drop this portion of code from BaseMemOpClusterMutation::clusterNeighboringMemOps() completely:

if (MemOpRecords[Idx].BaseReg != MemOpRecords[Idx+1].BaseReg) {
  ClusterLength = 1;
  continue;
}

and let TII->shouldClusterMemOps() decide. Maybe it is better. Thoughts?

  • We should indeed not change the behavior for aarch64, as the mutation mainly exists there to enable formation of load/store double instructions which require the same base pointer.

Agree. I am about to close D37755.

  • When seeing the name doMemOpsHaveSameBasePtr I would expect this to do exactly BaseReg1 == BaseReg2. Can you explain how you happen to have cases with different base pointers that still reference the same object?

That can only work if instruction has base register and offset operands. AMDGPU flat_load and flat_store instructions do not, their address operand is a single register. If you need to load two values from the same base pointer you do:

flat_load %ptr1
%ptr2 = %ptr1 + offset
flat_load %ptr2

  • Given that this callback is very specific to the selectiondag clustering mutator, maybe we better add a callback to createLoadClusterDAGMutation/createStoreClusterDAGMutation rather than in TargetInstructionInfo. That way you could also choose a better name like shouldCluster.

I cannot directly inherit LoadClusterMutation or StoreClusterMutation, it is defined in the MachineScheduler.cpp, and then if I move it to common interface I do not want to duplicate the code in both. I can pass a lambda/functor into create(Load|Store)ClusterDAGMutation as a last argument. Is that what you have in mind?

While you could just as well move the class declaration into a header an extra function argument is probably easiest.

As an alternative I can just drop this portion of code from BaseMemOpClusterMutation::clusterNeighboringMemOps() completely:

if (MemOpRecords[Idx].BaseReg != MemOpRecords[Idx+1].BaseReg) {
  ClusterLength = 1;
  continue;
}

and let TII->shouldClusterMemOps() decide. Maybe it is better. Thoughts?

Indeed, that sounds like the best/easiest solution so far!

  • We should indeed not change the behavior for aarch64, as the mutation mainly exists there to enable formation of load/store double instructions which require the same base pointer.

Agree. I am about to close D37755.

rampitec planned changes to this revision.Sep 13 2017, 1:57 PM

As an alternative I can just drop this portion of code from BaseMemOpClusterMutation::clusterNeighboringMemOps() completely:

if (MemOpRecords[Idx].BaseReg != MemOpRecords[Idx+1].BaseReg) {
  ClusterLength = 1;
  continue;
}

and let TII->shouldClusterMemOps() decide. Maybe it is better. Thoughts?

Indeed, that sounds like the best/easiest solution so far!

I will do so then. Looks cleaner to me.

rampitec updated this revision to Diff 115121.Sep 13 2017, 2:39 PM
rampitec retitled this revision from Add target callback doMemOpsHaveSameBasePtr to Allow target to decide when to cluster loads/stores in misched.
rampitec edited the summary of this revision. (Show Details)

Changed to let shouldClusterMemOps decide as discussed.

This revision is now accepted and ready to land.Sep 13 2017, 2:39 PM
MatzeB accepted this revision.Sep 13 2017, 2:43 PM

LGTM

This revision was automatically updated to reflect the committed changes.