I think whatever problem the gluing was fixing has long since been fixed. We don't have any of the restrictions on FP stack stuff that existed back when this was first added.
I had to change which type we use for FILD in BuildFILD when X86 was enabled because most of the isel patterns block f32/f64 instructions when SSE1/SSE2 are enabled. So I needed to use the f80 pattern, but this shouldn't have an effect on the generated code since there is only one FILD instruction anyway. We already use f80 explicitly in other places.
I wonder if it would make sense to parallelize this. I think we can shift the v4i64 right by 32, truncate to v4i32, and use sitofp to convert that part to double. Then multiply that by 2^32 in double. That should all be lossless.
Then for the bottom 32 bits we can mask with 0xffffffff, OR the result with the bit pattern of the double 2^52, and subtract 2^52 from it as a double. This should also be lossless, since the low 32 bits land in the mantissa of 2^52.
Then we just add the two double vectors together, which should be the only part that does any rounding.
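
To make the sequence above concrete, here is a rough scalar model of a single lane in C++. It is only a sketch of the idea, not the actual lowering: the function name `i64_to_f64_split` is made up for illustration, and it assumes IEEE-754 binary64 doubles, round-to-nearest, and no x87 excess precision.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of one lane of the proposed v4i64 -> v4f64 sequence.
// Assumes IEEE-754 binary64 doubles and the default rounding mode.
double i64_to_f64_split(std::int64_t x) {
    const std::uint64_t ux = static_cast<std::uint64_t>(x);

    // High half: shift right by 32, truncate to 32 bits, reinterpret as signed
    // (models trunc to i32 followed by sitofp), then scale by 2^32.
    // Both the conversion and the multiply are exact.
    const std::int32_t hi = static_cast<std::int32_t>(static_cast<std::uint32_t>(ux >> 32));
    const double hi_d = static_cast<double>(hi) * 4294967296.0;  // 2^32

    // Low half: mask to 32 bits and OR into the mantissa of 2^52
    // (bit pattern 0x4330000000000000). The result is the double 2^52 + lo.
    const std::uint64_t lo = ux & 0xffffffffULL;
    const std::uint64_t bits = lo | 0x4330000000000000ULL;
    double lo_d;
    std::memcpy(&lo_d, &bits, sizeof lo_d);
    lo_d -= 4503599627370496.0;  // subtract 2^52, exact

    // The final add is the only step that can round.
    return hi_d + lo_d;
}
```

Because both halves are exact, the single add rounds the true value once, so under these assumptions the result should match a direct signed conversion for every input, including values above 2^53 that are not exactly representable.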