This is an archive of the discontinued LLVM Phabricator instance.

MachineCSE: Add a target query for the LookAheadLimit heurisitic
ClosedPublic

Authored by tstellarAMD on May 4 2015, 8:09 AM.

Details

Summary

This heurisitc is used to determine whether or not to CSE physical register defs. My understanding is that this pass will only CSE a physical register def that is within 'LookAheadLimit' instructions of a common expression.

I'm adding a target query for this because the R600 backend needs this LookAheadLimit to be MAX_UINT.

I'm not sure what the original intent of this heuristic was and why the value is 5, so another possibility would be to remove it entirely.

Diff Detail

Repository
rL LLVM

Event Timeline

tstellarAMD retitled this revision from to MachineCSE: Add a target query for the LookAheadLimit heurisitic.
tstellarAMD updated this object.
tstellarAMD edited the test plan for this revision. (Show Details)
tstellarAMD set the repository for this revision to rL LLVM.
tstellarAMD added a subscriber: Unknown Object (MLST).
MatzeB edited edge metadata.May 4 2015, 10:32 AM

Hi Tom,

I don't know the history here, but as this does scan forward for each instruction of the basic block it looks like a way to avoid quadratic runtime behavior for (corner) cases with thousands of instructions in a basic block. I think it is no problem to go to a much higher limit than 5. But why go completely boundless, do you need a guarantee here that the CSE is happening?

  • Matthias

Hi Tom,

I don't know the history here, but as this does scan forward for each instruction of the basic block it looks like a way to avoid quadratic runtime behavior for (corner) cases with thousands of instructions in a basic block. I think it is no problem to go to a much higher limit than 5. But why go completely boundless, do you need a guarantee here that the CSE is happening?

Yes, I would like a guarantee that CSE is happening. For AMD GPUs, there is a control register (m0), which is used to clamp memory addresses to avoid out of bound reads and writes. Before each load/store instruction, we emit: s_mov_b32 m0, -1 (-1 disables address clamping) and then rely on MachineCSE to eliminate all the unnecessary moves.

MatzeB accepted this revision.May 4 2015, 3:11 PM
MatzeB edited edge metadata.

Hi Tom,

I don't know the history here, but as this does scan forward for each instruction of the basic block it looks like a way to avoid quadratic runtime behavior for (corner) cases with thousands of instructions in a basic block. I think it is no problem to go to a much higher limit than 5. But why go completely boundless, do you need a guarantee here that the CSE is happening?

Yes, I would like a guarantee that CSE is happening. For AMD GPUs, there is a control register (m0), which is used to clamp memory addresses to avoid out of bound reads and writes. Before each load/store instruction, we emit: s_mov_b32 m0, -1 (-1 disables address clamping) and then rely on MachineCSE to eliminate all the unnecessary moves.

I assume SelectionDAG CSE isn't enough for you because it does not cover multiple blocks. I'm not happy about the quadratic runtime resulting from an unbounded search, but as this is a target specific option, the patch LGTM.

This revision is now accepted and ready to land.May 4 2015, 3:11 PM
hfinkel added inline comments.May 5 2015, 2:59 PM
include/llvm/Target/TargetInstrInfo.h
1242 ↗(On Diff #24880)

Can we just say something like, "The default lookahead is small to prevent unprofitable quadratic behavior." Next year, no one will know to whom "I" refers without a lot of work ;)

This revision was automatically updated to reflect the committed changes.