Are we better off handling this by just hard-coding the bit patterns and then using movgr2fr.w and movgr2frh.w? I don't know the exact latency of fcvt.d.s, but plain moves should be a bit faster.
nit: "will be added later"
    lu52i.d $a0, $zero, 0x3ff
    movgr2fr.d $fa0, $a0
to save one instruction. The combination of lu52i.d and movgr2fr.d can always load $2^k$ as an f64 for all integer k in $[0, 1023]$.
But as it's already approved, it can be done in a later revision.
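For anyone wanting to sanity-check the $2^k$ claim, here is a rough Python model of the two instructions. It is only a sketch: `lu52i_d`, `movgr2fr_d` and `pow2_imm` are hypothetical helper names, and the model assumes lu52i.d places its 12-bit immediate in bits 63:52 and keeps the low 52 bits of the source register, which is my reading of the ISA manual. With `$zero` as the source the mantissa is all zeros, so the immediate is just the biased exponent `k + 1023`.

```python
import struct

MASK52 = (1 << 52) - 1


def lu52i_d(rj: int, si12: int) -> int:
    """Model of lu52i.d: immediate goes to bits 63:52, low 52 bits come from rj."""
    return ((si12 & 0xFFF) << 52) | (rj & MASK52)


def movgr2fr_d(bits: int) -> float:
    """Model of movgr2fr.d: reinterpret a 64-bit integer pattern as an f64."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def pow2_imm(k: int) -> int:
    """Immediate for lu52i.d that yields 2**k: the biased exponent k + 1023."""
    if not 0 <= k <= 1023:
        raise ValueError("k outside the range discussed in the review")
    return k + 1023


# 0x3ff is the immediate from the suggested sequence and produces 1.0.
one = movgr2fr_d(lu52i_d(0, 0x3FF))
```

The largest immediate needed here is 1023 + 1023 = 2046 (0x7fe), which still fits in a signed 12-bit field, so no extra instruction is required for any k in the stated range.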
Wow, that's some serious simplification. I don't think I've seen anything like this recently. Agreed, this optimization is better done in a new patch, as it's more of a peephole optimization and not deeply related to the generic handling done here.