This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add X86ISD::SUBV_BROADCAST_LOAD and begin removing X86ISD::SUBV_BROADCAST (PR38969)
ClosedPublic

Authored by RKSimon on Dec 4 2020, 2:33 AM.

Details

Summary

Subvector broadcasts are only load instructions, yet X86ISD::SUBV_BROADCAST treats them more generally, requiring a lot of fallback tablegen patterns.

This initial patch replaces constant vector lowering inside lowerBuildVectorAsBroadcast with direct X86ISD::SUBV_BROADCAST_LOAD loads which helps us merge a number of equivalent loads/broadcasts.

As well as general plumbing/analysis additions for SUBV_BROADCAST_LOAD, I needed to wrap SelectionDAG::makeEquivalentMemoryOrdering so it can handle result chains from non generic LoadSDNode nodes.

Later patches will continue to replace X86ISD::SUBV_BROADCAST usage.

Diff Detail

Event Timeline

RKSimon created this revision.Dec 4 2020, 2:33 AM
RKSimon requested review of this revision.Dec 4 2020, 2:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 4 2020, 2:33 AM

Any thoughts?

yubing added a subscriber: yubing.Dec 9 2020, 4:37 AM

The changes on the tests look good to me. But I'm not sure the changes on the code. I'd like to see opinions from other reviewers.

llvm/test/CodeGen/X86/vector-trunc.ll
119 ↗(On Diff #309493)

Seems vmovaps better here.

RKSimon added inline comments.Dec 9 2020, 6:57 AM
llvm/test/CodeGen/X86/vector-trunc.ll
119 ↗(On Diff #309493)

Yes - I should have explained - LowerBUILD_VECTOR doesn't recognise undef upper elements until AFTER its tried lowerBuildVectorAsBroadcast, and because we have multiple uses SimplifyDemandedVectorElts won't touch it.

I'll adjust LowerBUILD_VECTOR separately and update this patch.

yubing added inline comments.Dec 10 2020, 12:59 AM
llvm/test/CodeGen/X86/subvector-broadcast.ll
826

Hi, Simon. I have a question about previous code.
Before ISEL, there are the following nodes:
t42: i32 = X86ISD::Wrapper TargetConstantPool:i32<<8 x i32> <i32 1, i32 0, i32 2, i32 0, i32 3, i32 0, i32 4, i32 0>> 0
t35: v8i32,ch = load<(load 32 from constant-pool)> t0, t42, undef:i32
t36: v16i32 = X86ISD::SUBV_BROADCAST t35
But they are not morphed into vbroadcasti64x4. Did you know why?

RKSimon added inline comments.Dec 10 2020, 2:24 AM
llvm/test/CodeGen/X86/subvector-broadcast.ll
826

Its because the original ymm load (t35) has multiple uses so the pattern in tablegen doesn't match.

yubing added inline comments.Dec 10 2020, 4:59 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
44573

Hi, Simon. If we have a load and subv_broadcast_load which has the same SRC, the load will be combined into a extract_subv from subv_broadcast_load. But If some optimization delete subv_broadcast_load's other users and make it has only one user i.e. extract_subv, will extract_subv(subv_broadcast_load) roll back to a simple load?

RKSimon added inline comments.Dec 10 2020, 5:44 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
44573

Yes the new SimplifyDemandedVectorElts code at line 38024 should handle that.

yubing added inline comments.Dec 12 2020, 3:42 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
44573

Besides, if there are two subv_broadcast_load(e.g. one is 4xi32->8xi32, another is 4xi32->16xi32) which has the same SRC, the subv_broadcast_load(4xi32->8xi32) is not combined into extract_subv(subv_broadcast_load(4xi32->16xi32)) with your current patch. Should we create a combinesubv_broadcast_load() procedure to handle with such case?

@ga4 = global <4 x i32> zeroinitializer, align 8
@gb4 = global <8 x i32> zeroinitializer, align 8
@gc4 = global <16 x i32> zeroinitializer, align 8
define void @main(<4 x i32> %a, <8 x i32> %b, <16 x i32> %c) {
entry:
  %0 = add <4 x i32> %a, <i32 1, i32 2, i32 3, i32 4>
  %1 = add <8 x i32> %b, <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>
  %2 = and <8 x i32> %1, <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>
  %3 = add <16 x i32> %c, <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>
  %4 = and <16 x i32> %3, <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>
  store <4 x i32> %0, <4 x i32>* @ga4, align 8
  store <8 x i32> %2, <8 x i32>* @gb4, align 8
  store <16 x i32> %4, <16 x i32>* @gc4, align 8
  ret void
}
RKSimon added inline comments.Dec 12 2020, 4:36 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
44573

Sure - I can look at adding that as part of this patch, although I don't think the existing X86ISD::SUBV_BROADCAST code does a good job of this either.

RKSimon updated this revision to Diff 311457.Dec 13 2020, 9:55 AM

Add merging of different width X86ISD::SUBV_BROADCAST_LOAD by reusing the existing X86ISD::VBROADCAST_LOAD combine.

yubing added inline comments.Dec 14 2020, 2:31 AM
llvm/test/CodeGen/X86/vector-trunc.ll
119 ↗(On Diff #309493)

Eh, Simon. Why we lower constant BuildVector into Broadcast only when it has more than one use? I saw that lowerBuildVectorAsBroadcast(...) calls isFoldableUseOfShuffle(...) which returns true if it has one use.

RKSimon added inline comments.Dec 14 2020, 4:37 AM
llvm/test/CodeGen/X86/vector-trunc.ll
119 ↗(On Diff #309493)

Because otherwise it interferes with folding the vector loads into instructions, resulting in increased register pressure. You can see the effect by editing isFoldableUseOfShuffle to just return false.

craig.topper added inline comments.Dec 14 2020, 2:45 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
8975–8976

Shouldn't this just be OldChain.hasAnyUseOfValue()? We shouldn't assume "1" and should get it from the SDValue. Not sure what to do with the 1 on the line above.

RKSimon updated this revision to Diff 311852.Dec 15 2020, 3:05 AM

rebase + update new makeEquivalentMemoryOrdering wrapper to directly take old/new memop chains

RKSimon added inline comments.Dec 15 2020, 3:11 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
8975–8976

Nice catch - I've explicitly changed this to take the chains directly and used the SDValue::use_empty() wrapper to hasAnyUseOfValue

RKSimon added inline comments.Dec 16 2020, 3:07 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
8975–8976

@craig.topper Do you think we would be better off just replacing the old makeEquivalentMemoryOrdering(LoadSDNode*,SDValue) version entirely with the new makeEquivalentMemoryOrdering(SDValue,SDValue) one?

craig.topper added inline comments.Dec 16 2020, 10:55 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
8971

Can you assert that the types for the inputs are MVT::Other?

llvm/lib/Target/X86/X86ISelLowering.cpp
38145

Ld.getValue(1)?

RKSimon updated this revision to Diff 312303.Dec 16 2020, 2:32 PM

rebased - assert correct Token type and use SDValue::getValue()

I think this looks good to me. What about you @yubing?

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
8975–8976

We can look at removing it as a follow up.

I think this looks good to me. What about you @yubing?

It looks good to me as well.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 17 2020, 2:25 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.