This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Improve codegen of extractelement/insertelement in some cases
ClosedPublic

Authored by jpages on May 25 2022, 9:21 AM.

Details

Summary

This patch improves the codegen of extractelement and insertelement for vector containing 8 elements.
Before, a dag combine transformation was generating a sequence of 8 select/cmp.
This patch changes the upper limit for this transformation and the movrel
instruction will eventually be used instead.
Extractlement/insertelement for vectors containing less than 8 elements are unchanged.

This patch is saving some instructions in these cases.
This case was identified in several game shaders.

Diff Detail

Event Timeline

jpages created this revision.May 25 2022, 9:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 25 2022, 9:21 AM
jpages requested review of this revision.May 25 2022, 9:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 25 2022, 9:21 AM
jpages retitled this revision from [AMDGPU] Improve codegen of extractelement in some cases to [AMDGPU] Improve codegen of extractelement/insertelement in some cases.May 25 2022, 9:24 AM
jpages edited the summary of this revision. (Show Details)
jpages updated this revision to Diff 432090.May 25 2022, 12:46 PM
jpages added a reviewer: foad.

Updated a testcase

Any performance numbers? The 8 element case was driven by a specific customer program and the performance of the cmp/select was better than movrel.

foad added a comment.May 26 2022, 6:51 AM

Any performance numbers? The 8 element case was driven by a specific customer program and the performance of the cmp/select was better than movrel.

I don't know why that would be. Maybe the performance characteristics are different on GFX10+ compared to GFX9.

Also on GFX10+ sgpr usage does not affect occupancy, so perhaps the heuristic could be tweaked to make it more likely to use s_movrel (not v_movrel) on GFX10+.

Any performance numbers? The 8 element case was driven by a specific customer program and the performance of the cmp/select was better than movrel.

I don't know why that would be. Maybe the performance characteristics are different on GFX10+ compared to GFX9.

Also on GFX10+ sgpr usage does not affect occupancy, so perhaps the heuristic could be tweaked to make it more likely to use s_movrel (not v_movrel) on GFX10+.

I will try to get some performance numbers on specific games. Do you know if this performance problem was specific to an architecture?
Like Jay said, I could tweak the heuristic for this and only generate it for GFX10+.

Any performance numbers? The 8 element case was driven by a specific customer program and the performance of the cmp/select was better than movrel.

I don't know why that would be. Maybe the performance characteristics are different on GFX10+ compared to GFX9.

Also on GFX10+ sgpr usage does not affect occupancy, so perhaps the heuristic could be tweaked to make it more likely to use s_movrel (not v_movrel) on GFX10+.

I will try to get some performance numbers on specific games. Do you know if this performance problem was specific to an architecture?
Like Jay said, I could tweak the heuristic for this and only generate it for GFX10+.

Real cases that was tested with were divergent, so it was v_movrel_b32. A performance of s_movrel_b32 could be different. Unfortunately we do not know here if that is divergent or not.

rampitec added a comment.EditedMay 26 2022, 9:45 AM

Any performance numbers? The 8 element case was driven by a specific customer program and the performance of the cmp/select was better than movrel.

I don't know why that would be. Maybe the performance characteristics are different on GFX10+ compared to GFX9.

Also on GFX10+ sgpr usage does not affect occupancy, so perhaps the heuristic could be tweaked to make it more likely to use s_movrel (not v_movrel) on GFX10+.

I will try to get some performance numbers on specific games. Do you know if this performance problem was specific to an architecture?
Like Jay said, I could tweak the heuristic for this and only generate it for GFX10+.

All the tests were done on gfx9 flavors. It is also worth noting gfx9 does not have movrel, but uses s_set_gpr_idx_on instead, which may be the difference.

I.e. it may be linked to FeatureMovrel.

Any performance numbers? The 8 element case was driven by a specific customer program and the performance of the cmp/select was better than movrel.

I don't know why that would be. Maybe the performance characteristics are different on GFX10+ compared to GFX9.

Also on GFX10+ sgpr usage does not affect occupancy, so perhaps the heuristic could be tweaked to make it more likely to use s_movrel (not v_movrel) on GFX10+.

I will try to get some performance numbers on specific games. Do you know if this performance problem was specific to an architecture?
Like Jay said, I could tweak the heuristic for this and only generate it for GFX10+.

All the tests were done on gfx9 flavors. It is also worth noting gfx9 does not have movrel, but uses s_set_gpr_idx_on instead, which may be the difference.

I.e. it may be linked to FeatureMovrel.

I measured the performance with/without the patch on the original game in which we spotted this issue.
It's a small improvement of about 0.5% in the current version compared to without the patch. This was measured on GFX1030.

It seems to be a little bit faster with the patch, I can't really test on gfx9 with this game.

If it's better to keep the initial version for gfx9, I could:

  • Only do this patch for GFX10+
  • Some flag?
piotr added a subscriber: piotr.May 31 2022, 6:55 AM

Any performance numbers? The 8 element case was driven by a specific customer program and the performance of the cmp/select was better than movrel.

I don't know why that would be. Maybe the performance characteristics are different on GFX10+ compared to GFX9.

Also on GFX10+ sgpr usage does not affect occupancy, so perhaps the heuristic could be tweaked to make it more likely to use s_movrel (not v_movrel) on GFX10+.

I will try to get some performance numbers on specific games. Do you know if this performance problem was specific to an architecture?
Like Jay said, I could tweak the heuristic for this and only generate it for GFX10+.

All the tests were done on gfx9 flavors. It is also worth noting gfx9 does not have movrel, but uses s_set_gpr_idx_on instead, which may be the difference.

I.e. it may be linked to FeatureMovrel.

I measured the performance with/without the patch on the original game in which we spotted this issue.
It's a small improvement of about 0.5% in the current version compared to without the patch. This was measured on GFX1030.

It seems to be a little bit faster with the patch, I can't really test on gfx9 with this game.

If it's better to keep the initial version for gfx9, I could:

  • Only do this patch for GFX10+
  • Some flag?

Since the current heuristic was driven by gfx9 that doesn't have movrel, Stas's suggestion to make it conditional on FeatureMovrel makes perfect sense to me.

jpages updated this revision to Diff 433214.May 31 2022, 3:54 PM

Rebased, keep the current version for GFX9 and only apply this patch on architectures with movrel.

Thanks for the comments!

rampitec accepted this revision.May 31 2022, 4:04 PM
This revision is now accepted and ready to land.May 31 2022, 4:04 PM

I wouldn’t really expect there to be a performance difference between gpr indexing and movrel requences. Gfx8 has both so you could directly compare there

jpages added a comment.Jun 1 2022, 9:56 AM

I wouldn’t really expect there to be a performance difference between gpr indexing and movrel requences. Gfx8 has both so you could directly compare there

I'm using a game for my benchmarks and unfortunately this is a ray-tracing title, I can't really test it on GFX8 :(
These tests were performed on GFX1030 for this reason.

If you still have access to the benchmarks done on GFX8, I could try to test again with this patch.

This revision was landed with ongoing or failed builds.Jun 2 2022, 2:10 PM
This revision was automatically updated to reflect the committed changes.