This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Limit load clustering to 16 bytes instead of 4 instructions
Closed Public

Authored by tstellarAMD on Mar 24 2016, 9:34 AM.

Details

Summary

This helps prevent load clustering from drastically increasing register
pressure by trying to cluster 4 SMRDx8 loads together. The limit of 16
bytes was chosen because that appears to have been the original intent
of setting the limit to 4 instructions, but more analysis could show
that a different limit is better.
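
As a rough illustration of the change described above, the following standalone C++ sketch contrasts a count-based limit with a byte-based one. It is not the patch itself: `LoadDesc`, `shouldClusterByCount`, `shouldClusterByBytes`, and the two constants are hypothetical names, and the sketch assumes `NumLoads` counts the loads the cluster would contain once the candidate load is added.

```cpp
// Hypothetical stand-in for a load instruction; only the size of the
// loaded result matters for this heuristic.
struct LoadDesc {
  unsigned DstSizeInBytes; // e.g. 4 for a one-dword load, 32 for an SMRDx8 load
};

// Illustrative constants taken from the summary: the old limit was
// 4 instructions, the new limit is 16 bytes.
constexpr unsigned MaxClusterInsts = 4;
constexpr unsigned MaxClusterBytes = 16;

// Old behaviour (sketch): allow clustering purely by instruction count,
// regardless of how many registers each load defines.
constexpr bool shouldClusterByCount(unsigned NumLoads) {
  return NumLoads <= MaxClusterInsts;
}

// New behaviour (sketch): allow clustering only while the estimated total
// number of bytes loaded by the cluster stays within the byte limit.
// Assumption: every load in the cluster is as wide as the candidate load.
constexpr bool shouldClusterByBytes(const LoadDesc &Candidate,
                                    unsigned NumLoads) {
  return NumLoads * Candidate.DstSizeInBytes <= MaxClusterBytes;
}
```

Under these assumptions, four one-dword loads (4 x 4 = 16 bytes) still cluster, while two SMRDx8 loads (2 x 32 = 64 bytes) do not, even though a count-based check would have allowed four of them (128 bytes of results).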

This yields small decreases in register usage with shader-db, but
also helps avoid a large increase in register usage when lane mask
tracking is enabled in the machine scheduler, because lane mask tracking
enables more opportunities for load clustering.

shader-db stats:

2379 shaders in 477 tests
Totals:
SGPRS: 49744 -> 48600 (-2.30 %)
VGPRS: 34120 -> 34076 (-0.13 %)
Code Size: 1282888 -> 1283184 (0.02 %) bytes
LDS: 28 -> 28 (0.00 %) blocks
Scratch: 495616 -> 492544 (-0.62 %) bytes per wave
Max Waves: 6843 -> 6853 (0.15 %)
Wait states: 0 -> 0 (0.00 %)

Event Timeline

tstellarAMD retitled this revision to AMDGPU/SI: Limit load clustering to 16 bytes instead of 4 instructions.
tstellarAMD updated this object.
tstellarAMD added reviewers: nhaehnle, arsenm.
tstellarAMD added a subscriber: llvm-commits.
arsenm accepted this revision. Mar 24 2016, 9:36 AM
arsenm edited edge metadata.

LGTM

This revision is now accepted and ready to land. Mar 24 2016, 9:36 AM
This revision was automatically updated to reflect the committed changes.