This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix cluster size threshold calculation
Abandoned · Public

Authored by foad on Jan 7 2020, 5:17 AM.

Details

Summary

The intention was to limit the size of a cluster to 16 bytes, but it was
testing NumLoads, which is the number of loads/stores already in the
cluster, i.e. *before* adding the current one. As a result it would
happily have clustered two dwordx4 loads, giving a total cluster size of
32 bytes.
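
A minimal sketch of the off-by-one, with illustrative names (LoadSizeInBytes and MaxClusterBytes are not the actual identifiers in SIInstrInfo.cpp):

    // Buggy check: NumLoads counts only the loads already in the cluster,
    // not the one currently being considered. With 16-byte dwordx4 loads,
    // the second load passes: 1 * 16 <= 16, giving a 32-byte cluster.
    bool BuggyCluster = NumLoads * LoadSizeInBytes <= MaxClusterBytes;

    // Fixed check: include the load being added before comparing.
    // The second dwordx4 load is now rejected: (1 + 1) * 16 > 16.
    bool FixedCluster = (NumLoads + 1) * LoadSizeInBytes <= MaxClusterBytes;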

Diff Detail

Event Timeline

foad created this revision. Jan 7 2020, 5:17 AM
Herald added a project: Restricted Project. Jan 7 2020, 5:17 AM

Unit tests: pass. 61291 tests passed, 0 failed and 736 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: pass.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Don't we *want* clusters that large, and even larger?

Consider some code that loads an array-of-structures (AoS). We really want to cluster that as aggressively as possible, to increase the chance of lowest-level cache hits on successive instructions? I would say the method is *very* inexact :)

The comment talks about not wanting to drive register pressure up too much. That's a legitimate concern, but this approach here seems to be quite wrong to me. The scheduler ought to track register pressure properly, and that's where the knowledge about whether to break clusters based on register pressure should be.

foad added a comment. Jan 9 2020, 7:05 AM

> Don't we *want* clusters that large, and even larger?

Maybe :-)

> Consider some code that loads an array-of-structures (AoS). We really want to cluster that as aggressively as possible, to increase the chance of lowest-level cache hits on successive instructions? I would say the method is *very* inexact :)

> The comment talks about not wanting to drive register pressure up too much. That's a legitimate concern, but this approach here seems to be quite wrong to me. The scheduler ought to track register pressure properly, and that's where the knowledge about whether to break clusters based on register pressure should be.

shouldClusterMemOps runs as part of a DAG mutation that inserts "cluster" edges in the DAG, before we try to schedule the DAG. So yes, you could argue that shouldClusterMemOps should aspire to cluster as much as possible, and that it should be up to the scheduler proper to worry about register pressure and decide whether or not to schedule those mem ops contiguously. The scheduler does already track register pressure, so it should be able to make this kind of decision, but I don't know how well it works in practice.
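
For context, a minimal sketch of how a target typically wires this up, assuming the standard LLVM MachineScheduler hooks (MyTargetMachine is a hypothetical target, not the actual AMDGPU code):

    #include "llvm/CodeGen/MachineScheduler.h"
    using namespace llvm;

    // The load-cluster mutation is registered when the scheduler is created.
    // Before scheduling begins, the mutation walks the DAG and calls the
    // target's shouldClusterMemOps hook to decide where to insert the
    // artificial "cluster" edges that keep mem ops adjacent.
    ScheduleDAGInstrs *
    MyTargetMachine::createMachineScheduler(MachineSchedContext *C) const {
      ScheduleDAGMILive *DAG = createGenericSchedLive(C);
      DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
      return DAG;
    }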

Is this needed anymore?

foad added a comment. Mar 24 2020, 3:40 AM

> Is this needed anymore?

Since D73292 landed this is equivalent to fixing the FIXME at the end of SIInstrInfo::shouldClusterMemOps. I'll rebase it accordingly. As for whether it's actually needed, I think only benchmarks can tell, and I don't have any up-to-date numbers at the moment.

foad updated this revision to Diff 252257. Mar 24 2020, 3:41 AM

Rebase.

arsenm accepted this revision. Mar 24 2020, 8:47 AM

LGTM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:481

Extra parens around LHS
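
Presumably this refers to a redundant set of parentheses on the left-hand side of a comparison; illustratively, with hypothetical names (the expression at line 481 isn't reproduced in this archive):

    return ((NumLoads + 1) * LoadSize) <= MaxClusterBytes; // extra parens around LHS
    return (NumLoads + 1) * LoadSize <= MaxClusterBytes;   // preferred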

This revision is now accepted and ready to land. Mar 24 2020, 8:47 AM
foad abandoned this revision. Jul 22 2020, 12:56 AM

Abandoning as there have been other changes to SIInstrInfo::shouldClusterMemOps since I wrote this patch.