This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix scalar operand folding bug that causes SHOC performance regression
ClosedPublic

Authored by alex-t on Dec 30 2018, 11:17 AM.

Details

Summary

SIFoldOperands::foldInstOperand iterates over the
operand uses calling the function that changes def-use iteratorson the
way. As a result loop exits immediately when def-use iterator is
changed. Hence, the operand is folded to the very first use instruction
only. This makes VGPR live along the whole basic block and increases
register pressure significantly. The performance drop observed in SHOC
DeviceMemory test is caused by this bug.

Proposed fix: collect uses to separate container for further processing
in another loop.

Tests: CodeGen/AMDGPU
SHOC performance testing.

Diff Detail

Event Timeline

alex-t created this revision.Dec 30 2018, 11:17 AM

generally seems fine to me.
Would it be reasonable/useful to have a lit test that somewhat represents what we observed in the DeviceMemory test ?
if you think the fdiv32-to-rcp-folding.ll adequately covers it, then thats fine by me.

ronlieb accepted this revision.Dec 31 2018, 9:16 AM

LGTM, pending what you decide about adding another lit test.

This revision is now accepted and ready to land.Dec 31 2018, 9:16 AM
alex-t retitled this revision from [AMDGPU] Fix scalar operand folding. to [AMDGPU] Fix scalar operand folding bug that causes SHOC performance regression.Jan 3 2019, 10:06 AM
alex-t edited the summary of this revision. (Show Details)
alex-t edited the summary of this revision. (Show Details)
alex-t edited the summary of this revision. (Show Details)Jan 3 2019, 10:49 AM
This revision was automatically updated to reflect the committed changes.
arsenm added a comment.Jan 3 2019, 2:18 PM

You can avoid the separate container by incrementing the iterator before the transform as is done in other places