The DeviceRTL bc files are compiled with -O3, the same as the static library.
LG, assuming this doesn't break anything anymore. We used to have problems with definitions getting optimized out, but that seems to be fixed. The plan is still to remove this in favor of the static library and LTO, but this should improve things until we make that change.
I don't have any graphs, but most applications will see some performance gain when using a more optimized runtime library. I've looked at XSBench, RSBench, MiniQMC, and SU3Bench. Is there a reason O3 is not desirable? It should only slightly increase LLVM build times, and that small saving is hardly worth slower execution times.
When I compared miniQMC kernel performance with and without LTO, the difference came from the bc files (slower) being compiled with O1 while the static library used by LTO (faster) was compiled with O3. The difference was about 30% on a kernel I was monitoring.
To reduce the variation in compilation options, it is better to just use O3.
For a long time, we could not switch to O3 because the backend rejected kernels compiled with O3. This issue has been resolved, so I changed the bc compilation to O3.