Computing the remaining latency can be very expensive especially
on graphs of N nodes where the number of edges approaches N^2.
This reduces the compile time of a pathological case with the
AMDGPU backend from ~7.5 seconds to ~3 seconds. This test case has
a basic block with 2655 stores, each with somewhere between 500
and 1500 successors and predecessors.
I suggest adding a brief comment here. I think this is just a quick check to bypass computeRemLatency, and ultimately we're trying to determine whether the current cycle plus remaining latency is greater than the critical path in the scheduling region.