We were previously doing it after LTO, which did have the desired effect
of having the un-exported symbols marked as private extern in the final
output binary, but doing it before LTO creates more optimization
opportunities.
One observable difference is that LTO can now elide un-exported symbols
entirely, so they may not even be present as private externs in the
output.
This is also what ld64 implements.
Instead of setting didCompile on every loop iteration, could we store the result of lto->compile() in a local variable and set didCompile based on whether that result is empty? It's minor, but it feels cleaner that way.
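Roughly what I have in mind, as a standalone sketch rather than a drop-in patch. ObjFile and BitcodeCompiler here are simplified stand-ins, the `pending` member is hypothetical, and I'm assuming compile() returns a vector of object file pointers; the function signature is also just for illustration.

```cpp
#include <vector>

// Hypothetical stand-ins for the real lld types, so the sketch compiles alone.
struct ObjFile {};

struct BitcodeCompiler {
  std::vector<ObjFile *> pending; // hypothetical: files LTO would produce
  // Assumption: compile() returns the object files produced by LTO,
  // or an empty vector if there was nothing to compile.
  std::vector<ObjFile *> compile() { return pending; }
};

// Returns true if LTO produced at least one object file.
bool compileBitcodeFiles(BitcodeCompiler &lto,
                         std::vector<ObjFile *> &inputFiles) {
  // Keep the result in a local so it can be inspected after the loop.
  std::vector<ObjFile *> compiled = lto.compile();
  for (ObjFile *file : compiled)
    inputFiles.push_back(file);
  // didCompile is derived once from the result instead of being
  // assigned on every iteration.
  return !compiled.empty();
}
```

The loop body then only does the insertion, and the "did anything compile" question is answered in one place.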