This is an archive of the discontinued LLVM Phabricator instance.

v2f32 ops
Needs ReviewPublic

Authored by jonpa on Feb 23 2017, 5:25 AM.

Details

Summary

A few days ago I described this issue on llvm-commits along with patch:

Hi,

I found that on SystemZ, for v2f32, four and not two scalar operations are emitted. This is because the v2f32 type is widened, which is good in cases of memory-only operations for instance. There is however no fp32 vector support on z13, so these will always be scalarized. If this is done after type-legalization, four and not two operations are produced, which is particularly bad in case of fp32 divide inside a vectorized loop.

In order to fix this, my patch unrolls these operations before type legalization (with a target DAG combine). They must also have the operation action of 'Expand', since otherwise other DAGCombiner methods may re-vectorized them again. This happened in reduceBuildVecConvertToConvertBuildVec(), where I also needed to add a call to TLI.isOperationLegalOrCustom(Opcode, VT), to check on the *result* type, which would in this case be v2f32. It would not work to
mark all the operand VT's of SINT_TO_FP as 'Expand', because this is only true for the v2f32 result case.

I do get some failing regression tests, which I am not sure about:

Failing Tests (3):

LLVM :: CodeGen/ARM/vdup.ll
LLVM :: CodeGen/X86/2009-02-26-MachineLICMBug.ll
LLVM :: CodeGen/X86/cvtv2f32.ll

Is it ok to add the check for the result type per what I did?

/Jonas

In short, this is something I did for the SystemZ target, but because I had to add an extra check in reduceBuildVecConvertToConvertBuildVec(), three tests failed.

I have now tried to regenerate the tests, but only one of them was successfully regenerated. I include that test diff, plus the two output diffs of the two remaining test cases.

I hope that either the tests have been improved as expected (which I can't tell for myself for sure), or that someone gives me a pointer on how to change the patch.

Thanks

/Jonas

Diff Detail

Event Timeline

jonpa created this revision.Feb 23 2017, 5:25 AM

On platforms where the widened operation is legal or custom-lowered, we want to use that operation. (For example, if v4f32 uitofp is custom, we also want to use that lowering for v2f32.)

It looks like x86 in particular is legalizing v2f32 uint_to_fp nodes earlier than it should, so the DAGCombine can't trigger after type legalization.

It looks like on ARM, we can't perform the transform after type legalization because we end up with a 32-bit sitofp (because ARM doesn't have 16-bit registers). The "SrcVT != InVT" check could be extended to handle this case.

That said, if you don't want to mess with other targets, maybe you could change the new check in DAGCombine to "if (LegalTypes && !TLI.isOperationLegalOrCustom(Opcode, VT))"?

test/CodeGen/X86/cvtv2f32.ll
21 ↗(On Diff #89502)

This looks worse.

31 ↗(On Diff #89502)

This... is arguably better, I guess, but you're not really reaching it in any principled manner.

test_CodeGen_ARM_vdup.diff
83 ↗(On Diff #89502)

The old code was loading directly to a vector register (vld1 is a splat load, vmovl is a sign-extend, and vcvt is the int->float conversion). The new code is loading to an integer register (ldrsh), moving to an fp register (vmov), converting to fp (vcvt), then splatting the result (vdup). The new code is slightly worse. (Ultimately, we want to do the splat before the int->fp conversion here because we can fold the splat into the load.)

test_CodeGen_X86_MLICMbug.diff
58 ↗(On Diff #89502)

This looks worse.

jonpa updated this revision to Diff 89633.Feb 24 2017, 3:03 AM

Thanks for help with the test regressions!

That said, if you don't want to mess with other targets, maybe you could change the new check in DAGCombine to "if (LegalTypes && !TLI.isOperationLegalOrCustom(Opcode, VT))"?

I don't think that would work, becuase this problem arose as an infinite loop *before* type legalization (the point was to call DAG.UnrollVectorOp() before type legalization with only two elements).

On platforms where the widened operation is legal or custom-lowered, we want to use that operation. (For example, if v4f32 uitofp is custom, we also want to use that lowering for v2f32.)

I tried this idea, by adding a check for also the widened VT, but that didn't work. In vdup.ll,
this optimization is needed to make a 'v4f32 = sint_to_fp' node. This is however marked for 'Expand', so TLI.isOperationLegalOrCustom() doesn't work.

Instead, I made my check more precise, to handle only exactly the case where before type legalization, if the result is going to be widened, and both the narrow and wider ops are 'Expand', then don't vectorize it (return SDValue()). This seems to work better, with no regressions.

This seems even possibly even like a general DAG heuristic, to actually scalarize early in this case. Or is it not?

I also realized that I don't have to call setOperationAction(Op, MVT::v2f32, Expand) at all, since this is true already per default for all vector ops in SystemZ.

RKSimon edited edge metadata.Feb 24 2017, 8:25 AM

What is stopping you from implementing SystemZTargetLowering::ReplaceNodeResults to handle these?

Please can you keep the x86/arm regressions in your patch.

jonpa added a comment.Mar 1 2017, 4:45 AM

What is stopping you from implementing SystemZTargetLowering::ReplaceNodeResults to handle these?

The problem I see is with the FP_TO_XINT nodes. The fp->v2i64 are legal, which is correct in the case of v2f64->v2i64. This also means in practice that v2f32 -> v2i64 is legal which it is not, but this is basically ok since there is no vector support for f32. Unfortunately, there is no way to specify that just v2f32->v2iXX should be custom, based on the operand type.

This is a problem there does not seem to be a solution to at the moment - it depends on checking the operand type in this case, while CustomWidenLowerNode() just checks for the result VT.

It might be possible to mark all FP_TO_XINT nodes of all integer vector types as custom, but that would also affect other things, such as cost functions and what not, so I am not sure that would be worth bothering with.

I could apply the patch as it is or move that transformation to ReplaceNodeResults() and maybe just ignore the FP_TO_XINT nodes for now, since fp32 is not a high priority.

Please can you keep the x86/arm regressions in your patch.

The test regressions are gone because I avoided it with the tighter check - sorry I didn't mention that more clearly.

As I said before, if anything I would like to try to make this a general rule that if a node is going to get widened and then expanded, then it should be expanded if possilbe before type legalization (widening). Otherwise unnecessary scalar operations will be built for the undef elements of the widened vector.

jonpa added a comment.Mar 5 2017, 11:33 PM

The main reason that ReplaceNodeResults() cannot be used in this case, is really because it is run after type-legalization and the widening performed there on these operations. Then the VT has changed to v4f32 and it is not possible anymore to expand to just two ops.

RKSimon resigned from this revision.Apr 7 2018, 9:19 AM