Page MenuHomePhabricator

[AArch64][SVE] Add DAG combines to improve SVE fixed type FP_EXTEND lowering
Changes PlannedPublic

Authored by bsmith on Sep 22 2021, 5:57 AM.

Details

Summary

Currently in some cases FP_EXTEND lowering for SVE fixed types can
introduce EXTRACT_SUBVECTOR nodes between purely fixed types. Lowering
these is less than ideal as it may involve going through memory.

Instead, we can push an extra extend into a load feeding the FP_EXTEND,
so as to ensure the load gets split into the same number of components
as the FP_EXTEND. The extra truncate required to do this can then be
combined with an extend that is introduced during FP_EXTEND lowering.

The net result of this is that the extends produced by the FP_EXTEND
lowering are folded into an extending load, and requirement for
EXTRACT_SUBVECTOR nodes is removed.

Diff Detail

Event Timeline

bsmith created this revision.Sep 22 2021, 5:57 AM
bsmith requested review of this revision.Sep 22 2021, 5:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2021, 5:57 AM
bsmith updated this revision to Diff 374243.Sep 22 2021, 8:19 AM
  • Fix mistake in bail condition when logic was reversed.
sdesmalen added inline comments.Sep 23 2021, 3:18 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15265

DAGCombiner::visitFP_EXTEND tries to do something similar, but unfortunately it is a bit too conservative. Look for:

// fold (fpext (load x)) -> (fpext (fptrunc (extload x)))

It seems logical to reuse that code, rather than reimplementing the same thing here. Perhaps you can use something like isVectorLoadExtDesirable instead of isLoadExtLegal to determine whether to fold this.

bsmith added inline comments.Sep 23 2021, 4:18 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15265

There's actually a subtle difference here, that existing DAG combine is ultimately trying to fold the fpext into the load, creating an FP extending load, which is not something we can do. The DAG combine below is not trying to remove the fpext, but rather ensure the load is of the same size as the fpext so that is gets split in the same way.

I'm also not sure it makes sense for this to be in the generic DAG combine. The only reason this is sensible to do, is that our lowering of FP_EXTEND introduces an integer extend, which can then be folded into the integer truncate this DAG combine introduces. In absence of that fact, this DAG combine would likely just make things worse. It's also only really required because we can't sensibly lower extract_subvector.

Matt added a subscriber: Matt.Sep 24 2021, 9:59 AM
sdesmalen added inline comments.Wed, Sep 29, 7:16 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15265

Okay, I see what you mean.

I'm also not sure it makes sense for this to be in the generic DAG combine. The only reason this is sensible to do, is that our lowering of FP_EXTEND introduces an integer extend, which can then be folded into the integer truncate this DAG combine introduces. In absence of that fact, this DAG combine would likely just make things worse. It's also only really required because we can't sensibly lower extract_subvector.

If I understand it correctly, the target-specific part about this is how SVE's FCVT instruction works, i.e. by interpreting the input as unpacked, hence the need to bitcast/zero-extend, which becomes difficult to express in a generic way (in e.g. a DAGCombine).

15271

is there a reason this is specific to fixed-width vectors? It seems the same problem exists for:

%op1 = load <vscale x 8 x half>, <vscale x 8 x half>* %a
%res = fpext <vscale x 8 x half> %op1 to <vscale x 8 x float>
store <vscale x 8 x float> %res, <vscale x 8 x float>* %b
15292–15300

nit: is this loop maybe better expressed with llvm::any_of ?

15326

does the second operand actually matter, or can it be anything/undef?

bsmith added inline comments.Wed, Sep 29, 8:53 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15271

For the scalable case the generated extract_subvector doesn't require a trip through memory, instead it just does an unpack lo/hi pair which is much more reasonable.

(In fact this optimization actually makes things worse for scalable, we end up splitting the load but then combining it back together again with a UZP1, meaning the extract_subvectors are still present, but now also with a split load + recombine).

15292–15300

I'm not sure we can actually do that in this case. Using any_of would automatically dereference the use_iterator which calls getUser(), we also need to be able to call getUse() on the iterator.

bsmith updated this revision to Diff 375913.Wed, Sep 29, 8:59 AM
bsmith marked an inline comment as done.
  • Don't check 2nd operand of UZP1, it isn't used in this case.
  • Redo test changes after recent test reformatting.
bsmith planned changes to this revision.Tue, Oct 5, 3:34 AM