As originally implemented, the rematerialization of valid instructions across
the suspend point would iterate 4 times, meaning that up to 4 instructions could
be rematerialized.
This implementation changes that approach to instead build a graph of
rematerializable instructions, then move all of them. This is faster than the
original approach and is not limited to an arbitrary limit.
Since RematNode is only used in RematGraph. I feel slightly better to make RematNode a private class definition for RematGraph.