Computing the remaining latency can be very expensive especially
on graphs of N nodes where the number of edges approaches N^2.
This reduces the compile time of a pathological case with the
AMDGPU backend from ~7.5 seconds to ~3 seconds. This test case has
a basic block with 2655 stores, each with somewhere between 500
and 1500 successors and predecessors.