This contains two changes that reduce the time spent in WQM, with the
intention of reducing bandwidth required by VMEM loads:
- Sampling instructions by themselves don't need to run in WQM, only their coordinate inputs need it (unless of course there is a dependent sampling instruction). The initial scanInstructions step is modified accordingly.
- When switching back from WQM to Exact, switch back as soon as possible. This affects the logic in processBlock.
This should always be a win or at worst neutral.
There are also some cleanups (e.g. remove unused ExecExports) and some new
debugging output.
I think it would be simpler to just write to the ostream directly instead of building a string and printing it.