
[ARM] Stop using scalar FP instructions in integer-only MVE mode.

Authored by simon_tatham on Jun 28 2019, 9:34 AM.



If you compile with -mattr=+mve (enabling integer MVE instructions
but not floating-point ones), then the scalar FP registers exist:
it's legal to move values in and out of them and to load and store
them, but not to do arithmetic on them.

In D60708, the calls to addRegisterClass in ARMISelLowering that
enable use of the scalar FP registers became conditionalised on
Subtarget->hasFPRegs() instead of Subtarget->hasVFP2Base(), so
that loads, stores and moves of those registers would work. But I
didn't realise that this would also enable all the operations on those
types by default.

Now, if the target doesn't have basic VFP, we follow up those
addRegisterClass calls by turning back off all the nontrivial
operations you can perform on f32 and f64. That causes several
knock-on failures, which are fixed by allowing the VMOVDcc and
VMOVScc instructions to be selected even if all you have is
HasFPRegs, and adjusting several checks for 'is this a double in a
single-precision-only world?' to the more general 'is this any FP type
we can't do arithmetic on?'. Between those, the whole of the
float-ops.ll and fp16-instructions.ll tests can now run in
MVE-without-FP mode and generate correct-looking code.

One odd side effect is that I had to relax the check lines in that
test so that they permit test functions like add_f to be generated
as tailcalls to software FP library functions, instead of ordinary
calls. Doing that is entirely legal, but the mystery is why this is
the first RUN line that's needed the relaxation: on the usual kind of
non-FP target, no tailcalls ever seem to be generated. Going by the
llc messages, I think SoftenFloatResult must be perturbing the code
generation in some way, but that's as much as I can guess.


Event Timeline

simon_tatham created this revision.Jun 28 2019, 9:34 AM

I still can't say I'm totally convinced by this way of using addRegisterClass for everything and then trying to stop the compiler from ever using them. When would we want to use a floating point load, for example? But I'm not sure enough about what a sensible alternative would be, so let's at least try it and see. This certainly fixes things. If it does keep causing headaches we can always change it later!

Can you add a test that does some MVE expansion of floating point operations, for floats and halves? Just a simple "fadd" using +mve should be fine to show things are working.
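[Editor's note: a minimal sketch of what such a test could look like. The function names, RUN line and CHECK patterns below are illustrative assumptions, not taken from the actual patch; the expected output should be regenerated from real llc output.]

```llvm
; RUN: llc -mtriple=thumbv8.1m.main-none-eabi -mattr=+mve %s -o - | FileCheck %s

; With +mve but no FP, a scalar fadd should become a softfloat libcall.
; CHECK-LABEL: add_f32:
; CHECK: __aeabi_fadd
define float @add_f32(float %a, float %b) {
entry:
  %sum = fadd float %a, %b
  ret float %sum
}

; CHECK-LABEL: add_f16:
define half @add_f16(half %a, half %b) {
entry:
  %sum = fadd half %a, %b
  ret half %sum
}
```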

605 ↗(On Diff #207076)

Should this be behind hasFPRegs64 instead?

617 ↗(On Diff #207076)

How come we don't have to do the same for fullfp16 to work?

5 ↗(On Diff #207076)

These should be thumbv8.1m.main-none-eabi

268 ↗(On Diff #207076)

This is worse than before by the looks of it? We move things into fp registers just to move them out again.

simon_tatham marked 3 inline comments as done.Jul 1 2019, 9:31 AM
simon_tatham added inline comments.
605 ↗(On Diff #207076)

I'm not sure it should, actually. FPRegs64 is a confusing feature, and quite likely should have been named something different (which is my fault, if so).

As far as I can see the only architectural instruction actually conditional on that feature is the scalar FP move instruction that copies a d-register (e.g. vmov.f64 d0,d1). It doesn't affect loads and stores of d-regs. So I think it shouldn't affect this decision.

On the other hand, I'm pretty sure you're right that I should be separately deciding whether to setAllExpand f32 and f64 based on different feature queries.

617 ↗(On Diff #207076)

I think because the fp16 legality boundary happens in a different place.

With the MVE-integer-only level of support for f32 or f64, you have registers of the right size which you can load and store, but you can't do arithmetic on them. But with the not-full fp16 support, you don't even have a set of registers the right size – there are no loads and stores of HPRs.

268 ↗(On Diff #207076)

Perhaps that's true, but I'd rather fix the correctness first and get a set of regression tests passing, and then we can worry about recovering the performance with those tests in place to prevent breaking correctness again.

Especially since in the general case it's not really clear how you should choose which kind of register to use for a value. Keeping it in an FP register is obviously wasteful in this case, but in another case where register pressure is high, might it save a spill or two? I think getting the right answers in cases larger than this simple one might not be trivial.

Revised patch fixes those two target triples in the tests, and makes the setAllExpand calls for f32 and f64 conditional on different things. Also, to make that less cumbersome, I've moved a few re-enabling setOperationAction calls into setAllExpand itself which otherwise had to be run after every single call.

This version of the patch is intended to apply _before_ D63937 rather than after it, because this way round it turned out easier to get the tests to pass in the intermediate state.

dmgreen accepted this revision.Jul 2 2019, 3:49 AM

LGTM with one nit.

599 ↗(On Diff #207361)

This can still be the same if? To drop a level of indentation.

268 ↗(On Diff #207076)

My worry is that this will mean every floating point load becomes a vldr, which just ends up being moved into a gpr. This probably isn't a huge deal for performance, as you will likely always be calling a __aeabi_fadd type function, but the codesize would increase quite a bit. I couldn't think of a reason why you _would_ want to load using a vldr (at least it would be fairly uncommon). I imagine that almost every operation would actually be done on a gpr for floats.

This revision is now accepted and ready to land.Jul 2 2019, 3:49 AM
dmgreen added inline comments.Jul 2 2019, 3:50 AM
268 ↗(On Diff #207076)

Forgot to say. I agree that working is better than not working. Something we may have to adjust in the future though.

This revision was automatically updated to reflect the committed changes.