This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Change VDUP type to i32 for MVE
ClosedPublic

Authored by dmgreen on Mar 17 2020, 8:54 AM.

Details

Summary

The MVE VDUP instruction take a GPR and splats into every lane of a vector register. Unlike NEON we do not have a VDUPLANE equivalent instruction. Previously a VDUP to a v4f32/v8f16 would be represented as a (v4f32 VDUP f32:$x), which would mean the instruction pattern needs to add a COPY_TO_REGCLASS to the GPR.

Instead this now converts that earlier during an ISel DAG combine, converting (VDUP x) to (VDUP (bitcast x)). This can allow instruction selection to tell that the pattern needs to be an i32, which in one of the testcases allows it to use ldr (or specifically ldm) over (vldr;vmov).

Whilst being simple enough for floats, I cannot see a target independent BITCAST equivalent for getting a half into a i32. This uses a VMOVrh ARMISD node, which doesn't know the same tricks.

Diff Detail

Event Timeline

dmgreen created this revision.Mar 17 2020, 8:54 AM

I haven't been following the MVE work that closely, but changing the operand type of MVE vdup makes sense. My one concern here is the potential for confusion due to the opcode; VDUP for NEON and MVE have the same opcode and result type, but the operand types are different. Doesn't really matter much for isel patterns, but could be confusing for writing target-specific combines.

Yes, PerformVMOVhrCombine could be improved to handle more cases, independent of what happens here.

I haven't been following the MVE work that closely, but changing the operand type of MVE vdup makes sense. My one concern here is the potential for confusion due to the opcode; VDUP for NEON and MVE have the same opcode and result type, but the operand types are different. Doesn't really matter much for isel patterns, but could be confusing for writing target-specific combines.

Thanks for taking a look. It is indeed interesting trying to keep two entirely independent vector architectures happy in the same backend. It seems mostly OK so far, and they have helpfully shared quite a bit of code. The creation of VDUP is one such place, where we only had to modify the existing code a little. I didn't change that code to force the type where they are created to try and keep it cleaner, and it can happen in multiple places. Hence the fold in PerformVDUPCombine.

I think I'd prefer to keep the same opcode between the two archs, unless you have some better suggestion? Even with the two input types, we can probably keep the logic in PerformVDUPCombine separate.

Yes, PerformVMOVhrCombine could be improved to handle more cases, independent of what happens here.

I was hoping someone would be able to say "just use an X", and I was missing something obvious. But I will look into some VMOVhr combines, at least for the loads here.

efriedma accepted this revision.Mar 19 2020, 11:21 AM

Okay; if you've considered it and think it's best to keep the opcode the same, that's fine. LGTM

I was hoping someone would be able to say "just use an X"

No, there isn't any target-independent way to bitcast between i32 and f16.

This revision is now accepted and ready to land.Mar 19 2020, 11:21 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMar 20 2020, 3:13 AM