The entire algorithm operates per basic-block, so for cache locality
it should be better to re-optimize a basic-block immediately rather than
in a separate loop.
I don't have performance measurements.
Change-Id: I85106570bd623c4ff277faaa50ee43258e1ddcc5