[X86] Type legalize v2f32 loads by using an f64 load and a scalar_to_vector.

Authored by craig.topper on Tue, Sep 25, 11:47 PM.



On 64-bit targets the generic legalize will use an i64 load and a scalar_to_vector for us. But on 32-bit targets i64 isn't legal and the generic legalizer will end up emitting two 32-bit loads. We have DAG combines that try to put those two loads back together with pretty good success.

This patch instead uses f64 to avoid the splitting entirely. I've made it do the same for 64-bit mode for consistency.

There are a few things in here that look like regressions in 32-bit mode, but I believe they bring us closer to the 64-bit mode codegen. And that the 64-bit mode code could be better. I think those issues should be looked at separately.

Diff Detail

craig.topper created this revision.Tue, Sep 25, 11:47 PM

Is it worth trying to improve the SSE1 codegen or just go with the SSE2+ f64 solution?

I"m open to suggestions of how to improve the SSE1 codegen.

I"m open to suggestions of how to improve the SSE1 codegen.

Its tricky - much of the movlps/movhps selection is based around either i64 or f64 elements.

I'm a bit worried about how easy it will be to fix some of these regressions for similar reasons - have you looked at follow up fixes for these yet?

craig.topper added inline comments.Sat, Sep 29, 9:38 AM
20 ↗(On Diff #167050)

For this case we would need to decide if it makes sense to split the load into 2 scalar loads when both elements are extracted separately.

174 ↗(On Diff #167050)

This regression is because DAGCombiner::visitEXTRACT_ELEMENT explicitly avoids splitting a load until after op legalization. So we form a shuffle first and then we can't recover.

I just checked to see if InstCombine would let this sequence through in the first place and it looks like it will widen the 2f32 to v8f32 and then shuffle the single element into place. Same as what was DAGCombine did. This seems not great. Why aren't we recognizing that we don't need the other elements of the v2f32 load?

spatel added inline comments.Sat, Sep 29, 12:14 PM
174 ↗(On Diff #167050)

Would there be codegen problems if we always scalarize an extractelement of a vector load with no other uses in instcombine?

define float @load_extract(<4 x float>* %p) {
  %v = load <4 x float>, <4 x float>* %p
  %s = extractelement <4 x float> %v, i32 0
  ret float %s


define float @load_extract(<4 x float>* %p) {
  %bc = bitcast <4 x float>* %p to float*
  %s = load float, float* %bc
  ret float %s

This would require an address offset (gep) in the general case.

I believe the only regressions caused by this are already issues in 64-bit mode. Are we concerned about 32-bit mode regressions here? Or can we take this and try to improve these issues as a follow up?

174 ↗(On Diff #167050)

I'm not sure.

RKSimon accepted this revision.Thu, Oct 11, 4:19 AM

LGTM as long as all the regressions are documented somewhere so we don't lose track

896 ↗(On Diff #167050)

Please can you align the arguments into the columns </pedantic>

174 ↗(On Diff #167050)

A mixture of XFormVExtractWithShuffleIntoLoad and EltsFromConsecutiveLoads would probably help here.

This revision is now accepted and ready to land.Thu, Oct 11, 4:19 AM
This revision was automatically updated to reflect the committed changes.