This is an archive of the discontinued LLVM Phabricator instance.

[TTI] Allow passing ArrayRef of context instructions (NFC).
Needs Review · Public

Authored by fhahn on Aug 30 2022, 12:41 PM.

Details

Summary

TTI-based alternative to D132872.

This is to allow the SLP vectorizer to pass extra context instructions
for a vector bundle.

This patch just adds the plumbing; follow-up patches linked in the stack
actually make use of the new functionality.

See discussion at D132872.
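
Roughly, the shape of the plumbing (a sketch, not the exact diff; the OperandValueInfo parameter names are assumed, the rest is taken from the comments and build logs below): the single context-instruction parameter of getArithmeticInstrCost becomes an ArrayRef, so a caller like the SLP vectorizer can pass all scalars of a bundle.

  // Before: a single optional context instruction.
  InstructionCost getArithmeticInstrCost(
      unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
      TTI::OperandValueInfo Opd1Info, TTI::OperandValueInfo Opd2Info,
      ArrayRef<const Value *> Args = ArrayRef<const Value *>(),
      const Instruction *CxtI = nullptr) const;

  // After: an ArrayRef of context instructions, e.g. all scalars of an SLP bundle.
  InstructionCost getArithmeticInstrCost(
      unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
      TTI::OperandValueInfo Opd1Info, TTI::OperandValueInfo Opd2Info,
      ArrayRef<const Value *> Args = ArrayRef<const Value *>(),
      ArrayRef<const Instruction *> CxtIs = ArrayRef<const Instruction *>()) const;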

Diff Detail

Event Timeline

fhahn created this revision. · Aug 30 2022, 12:41 PM
Herald added a project: Restricted Project. · View Herald Transcript · Aug 30 2022, 12:41 PM
fhahn requested review of this revision. · Aug 30 2022, 12:41 PM
Herald added a project: Restricted Project. · View Herald Transcript · Aug 30 2022, 12:41 PM
ABataev added inline comments. · Aug 30 2022, 12:59 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1083

= None

2290

= None

llvm/include/llvm/CodeGen/BasicTTIImpl.h
824

= None
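
These three inline comments suggest shortening the new default argument; illustratively (parameter name CxtIs taken from the later comments and build logs), the suggestion amounts to something like:

  ArrayRef<const Instruction *> CxtIs = None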

This seems to mess up the interface to TTI quite a lot. Are there any other cases than the SLP vectorizer where we would pass a vector of Instructions?

The CxtI only has to be a context. It gets a bit fuzzy, but could we just pass the first instruction if it is similar enough to the other instructions in the TreeEntry? It looks like the first item is already passed in at the moment.

Also, on a conceptual level - mul's are expensive, addition is relatively cheap. Would it make sense to try and mark the fadd as cheap by looking at the operands? (When I've tried in the past the performance wasn't great).

fhahn updated this revision to Diff 456981. · Aug 31 2022, 8:50 AM
fhahn marked 3 inline comments as done.

This seems to mess up the interface to TTI quite a lot. Are there any other cases than the SLP vectorizer where we would pass a vector of Instructions?

Yeah the new argument is specifically to support SLP's use case. I don't think other passes are in a similar situation at the moment. There's also a version that keeps the logic in SLP: D132872, but @ABataev argued to have this generally available.

The CxtI only has to be a context. It gets a bit fuzzy, but could we just pass the first instruction if it is similar enough to the other instructions in the TreeEntry? It looks like the first item is already passed in at the moment.

I think all instructions in a TreeEntry should be very similar in almost all cases (same opcode). But here we need to specifically look at the users to determine if the users of all instructions in the bundle will allow fusion.

Now, while spelling this out, maybe we could instead fuse eligible FMUL + FADD/FSUB TreeEntry nodes directly to a single FMULADD/SUB TreeEntry instead of checking for fusion opportunities for the vector version? @ABataev do you think that would be easily doable?
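
For illustration, a hypothetical sketch of the per-bundle check described above; "Bundle" and the helper name are invented here and are not part of the patch. Fusion is only interesting if every instruction in the bundle is used exclusively by fadd/fsub users that permit contraction.

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/STLExtras.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  static bool allBundleUsersAllowFusion(ArrayRef<const Instruction *> Bundle) {
    return all_of(Bundle, [](const Instruction *I) {
      return all_of(I->users(), [](const User *U) {
        auto *UI = dyn_cast<Instruction>(U);
        // Only fadd/fsub users with the 'contract' fast-math flag allow fusion.
        return UI &&
               (UI->getOpcode() == Instruction::FAdd ||
                UI->getOpcode() == Instruction::FSub) &&
               UI->hasAllowContract();
      });
    });
  }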

Also, on a conceptual level - mul's are expensive, addition is relatively cheap. Would it make sense to try and mark the fadd as cheap by looking at the operands? (When I've tried in the past the performance wasn't great).

I think when I tried this a while ago in the other direction it turned out less profitable.

This seems to mess up the interface to TTI quite a lot. Are there any other cases than the SLP vectorizer where we would pass a vector of Instructions?

Yeah the new argument is specifically to support SLP's use case. I don't think other passes are in a similar situation at the moment. There's also a version that keeps the logic in SLP: D132872, but @ABataev argued to have this generally available.

Maybe add a specific function which returns a bool indicating whether it is preferable to use FMA instead?

The CxtI only has to be a context. It gets a bit fuzzy, but could we just pass the first instruction if it is similar enough to the other instructions in the TreeEntry? It looks like the first item is already passed in at the moment.

I think all instructions in a TreeEntry should be very similar in almost all cases (same opcode). But here we need to specifically look at the users to determine if the users of all instructions in the bundle will allow fusion.

Now, while spelling this out, maybe we could instead fuse eligible FMUL + FADD/FSUB TreeEntry nodes directly to a single FMULADD/SUB TreeEntry instead of checking for fusion opportunities for the vector version? @ABataev do you think that would be easily doable?

Everything is doable, it is just a question of time. Need to adjust the cost somehow, add a flag (probably!) to the node(s) for possible "FMAsation" and change the codegen to emit FMA instead of fmul+fadd/fsub.
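
A hypothetical sketch of the codegen part of that plan (the helper is invented for illustration and does not exist in the patch): for a flagged entry, emit a single llvm.fmuladd call instead of fmul + fadd.

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Intrinsics.h"
  using namespace llvm;

  // Given the vectorized fmul operands and the other fadd operand, build one
  // fused multiply-add instead of the separate fmul + fadd pair.
  static Value *emitFusedMulAdd(IRBuilder<> &Builder, Value *MulOp0,
                                Value *MulOp1, Value *AddOp) {
    return Builder.CreateIntrinsic(Intrinsic::fmuladd, {MulOp0->getType()},
                                   {MulOp0, MulOp1, AddOp});
  }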

Also, on a conceptual level - mul's are expensive, addition is relatively cheap. Would it make sense to try and mark the fadd as cheap by looking at the operands? (When I've tried in the past the performance wasn't great).

I think when I tried this a while ago in the other direction it turned out less profitable.

This seems to mess up the interface to TTI quite a lot. Are there any other cases than the SLP vectorizer where we would pass a vector of Instructions?

Yeah the new argument is specifically to support SLP's use case. I don't think other passes are in a similar situation at the moment. There's also a version that keeps the logic in SLP: D132872, but @ABataev argued to have this generally available.

Maybe add a specific function which returns a bool indicating whether it is preferable to use FMA instead?

I think the issue here is that it is not as simple as asking a boolean question.

We need to adjust both the scalar and vector costs, depending on whether either can use FMAs. I think if we support this in TTI, then it should be integrated into the existing APIs. If we add a new interface just geared at the SLP use case, general TTI users won't benefit anyway, and then IMO it would be better to keep SLP logic in SLPVectorizer.cpp, at least initially.
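
Purely as an illustration (none of these names exist in the patch): FMA formation can lower either side of SLP's scalar-vs-vector cost comparison, which is why a single "prefer FMA?" hook is not enough.

  #include "llvm/Support/InstructionCost.h"
  using namespace llvm;

  static bool vectorizationStillProfitable(
      InstructionCost ScalarFMul, InstructionCost ScalarFAdd,
      InstructionCost ScalarFMA, InstructionCost VecFMul,
      InstructionCost VecFAdd, InstructionCost VecFMA,
      bool ScalarFuses, bool VectorFuses, unsigned VF) {
    // Scalar side: cost of one fmul+fadd pair per lane (fused or not), times lanes.
    InstructionCost ScalarCost =
        (ScalarFuses ? ScalarFMA : ScalarFMul + ScalarFAdd) * VF;
    // Vector side: cost of the single vector pair, possibly fused.
    InstructionCost VectorCost = VectorFuses ? VecFMA : VecFMul + VecFAdd;
    return VectorCost < ScalarCost;
  }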

The CxtI only has to be a context. It gets a bit fuzzy, but could we just pass the first instruction if it is similar enough to the other instructions in the TreeEntry? It looks like the first item is already passed in at the moment.

I think all instructions in a TreeEntry should be very similar in almost all cases (same opcode). But here we need to specifically look at the users to determine if the users of all instructions in the bundle will allow fusion.

Now, while spelling this out, maybe we could instead fuse eligible FMUL + FADD/FSUB TreeEntry nodes directly to a single FMULADD/SUB TreeEntry instead of checking for fusion opportunities for the vector version? @ABataev do you think that would be easily doable?

Everything is doable, it is just a question of time. Need to adjust the cost somehow, add a flag (probably!) to the node(s) for possible "FMAsation" and change the codegen to emit FMA instead of fmul+fadd/fsub.

Right, the question is what the best path forward is to incrementally improve the situation without adding too much churn until we know the cost-based decision works well for a range of targets.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1083

In the line above, an explicit ArrayRef constructor is used. I updated the code here to do the same.

2290

No default arg needed here, it seems; I removed it.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
824

In the line above, an explicit ArrayRef constructor is used. I updated the code here to do the same.

This seems to mess up the interface to TTI quite a lot. Are there any other cases than the SLP vectorizer where we would pass a vector of Instructions?

Yeah the new argument is specifically to support SLP's use case. I don't think other passes are in a similar situation at the moment. There's also a version that keeps the logic in SLP: D132872, but @ABataev argued to have this generally available.

Maybe add a specific function which returns a bool indicating whether it is preferable to use FMA instead?

I think the issue here is that it is not as simple as asking a boolean question.

We need to adjust both the scalar and vector costs, depending on whether either can use FMAs. I think if we support this in TTI, then it should be integrated into the existing APIs.

Agree, that's why I thought it is better to make it part of TTI.

If we add a new interface just geared at the SLP use case, general TTI users won't benefit anyway, and then IMO it would be better to keep SLP logic in SLPVectorizer.cpp, at least initially.

We already have SLP specific functions (at least for now) in TTI.

The CxtI only has to be a context. It gets a bit fuzzy, but could we just pass the first instruction if it is similar enough to the other instructions in the TreeEntry? It looks like the first item is already passed in at the moment.

I think all instructions in a TreeEntry should be very similar in almost all cases (same opcode). But here we need to specifically look at the users to determine if the users of all instructions in the bundle will allow fusion.

Now, while spelling this out, maybe we could instead fuse eligible FMUL + FADD/FSUB TreeEntry nodes directly to a single FMULADD/SUB TreeEntry instead of checking for fusion opportunities for the vector version? @ABataev do you think that would be easily doable?

Everything is doable, it is just a question of time. Need to adjust the cost somehow, add a flag (probably!) to the node(s) for possible "FMAsation" and change the codegen to emit FMA instead of fmul+fadd/fsub.

Right, the question is what the best path forward is to incrementally improve the situation without adding too much churn until we know the cost-based decision works well for a range of targets.

The cost still needs to be adjusted before we do the actual replacement.

wjschmidt added inline comments. · Sep 7 2022, 12:13 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
2292

When applying this patch series, I see compilation errors like this:

In file included from /localdisk2/schmidtw/llvm-project/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h:22,
                 from /localdisk2/schmidtw/llvm-project/llvm/lib/Target/Lanai/LanaiTargetMachine.cpp:17:
/localdisk2/schmidtw/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h: In instantiation of ‘llvm::InstructionCost llvm::TargetTransformInfo::Model<T>::getArithmeticInstrCost(unsigned int, llvm::Type*, llvm::TargetTransformInfo::TargetCostKind, llvm::TargetTransformInfo::OperandValueInfo, llvm::TargetTransformInfo::OperandValueInfo, llvm::ArrayRef<const llvm::Value*>, llvm::ArrayRef<const llvm::Instruction*>) [with T = llvm::LanaiTTIImpl]’:
/localdisk2/schmidtw/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:2308:3: required from here
/localdisk2/schmidtw/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:2314:46: error: cannot convert ‘llvm::ArrayRef<const llvm::Instruction*>’ to ‘const llvm::Instruction*’
 2314 |                                        Args, CxtIs);
      |                                              ^~~~~
      |                                              |
      |                                              llvm::ArrayRef<const llvm::Instruction*>
In file included from /localdisk2/schmidtw/llvm-project/llvm/lib/Target/Lanai/LanaiTargetMachine.cpp:17:
/localdisk2/schmidtw/llvm-project/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h:98:26: note: initializing argument 7 of ‘llvm::InstructionCost llvm::LanaiTTIImpl::getArithmeticInstrCost(unsigned int, llvm::Type*, llvm::TargetTransformInfo::TargetCostKind, llvm::TargetTransformInfo::OperandValueInfo, llvm::TargetTransformInfo::OperandValueInfo, llvm::ArrayRef<const llvm::Value*>, const llvm::Instruction*)’
   98 |       const Instruction *CxtI = nullptr) {
      |       ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~

This occurs for Lanai, SystemZ, PowerPC, Hexagon, BPF, and NVPTX.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
884

When applying this patch series, I see compilation errors for various targets stemming from this. Example:

In file included from /localdisk2/schmidtw/llvm-project/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h:23,
                 from /localdisk2/schmidtw/llvm-project/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp:15:
/localdisk2/schmidtw/llvm-project/llvm/include/llvm/CodeGen/BasicTTIImpl.h: In instantiation of ‘llvm::InstructionCost llvm::BasicTTIImplBase<T>::getArithmeticInstrCost(unsigned int, llvm::Type*, llvm::TargetTransformInfo::TargetCostKind, llvm::TargetTransformInfo::OperandValueInfo, llvm::TargetTransformInfo::OperandValueInfo, llvm::ArrayRef<const llvm::Value*>, llvm::ArrayRef<const llvm::Instruction*>) [with T = llvm::WebAssemblyTTIImpl]’:
/localdisk2/schmidtw/llvm-project/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp:60:45: required from here
/localdisk2/schmidtw/llvm-project/llvm/include/llvm/CodeGen/BasicTTIImpl.h:884:11: error: cannot convert ‘llvm::ArrayRef<const llvm::Instruction*>’ to ‘const llvm::Instruction*’
 884 |           CxtIs);
     |           ^~~~~
     |           |
     |           llvm::ArrayRef<const llvm::Instruction*>
/localdisk2/schmidtw/llvm-project/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp:57:24: note: initializing argument 7 of ‘llvm::InstructionCost llvm::WebAssemblyTTIImpl::getArithmeticInstrCost(unsigned int, llvm::Type*, llvm::TargetTransformInfo::TargetCostKind, llvm::TargetTransformInfo::OperandValueInfo, llvm::TargetTransformInfo::OperandValueInfo, llvm::ArrayRef<const llvm::Value*>, const llvm::Instruction*)’
  57 |     const Instruction *CxtI) {
     |     ~~~~~~~~~~~~~~~~~~~^~~~

Appears for WebAssembly, Lanai, SystemZ, Hexagon, NVPTX, PowerPC, BPF, and AMDGPU.
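
Presumably each of these targets still declares the old final parameter in its getArithmeticInstrCost override. A signature-only sketch of the kind of update they would need, using WebAssembly as the example from the error above (the Op1Info/Op2Info parameter names and the defaults are assumptions; the body is unchanged apart from forwarding CxtIs instead of CxtI):

  // In WebAssemblyTargetTransformInfo.h (sketch):
  InstructionCost getArithmeticInstrCost(
      unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
      TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,
      ArrayRef<const Value *> Args = ArrayRef<const Value *>(),
      ArrayRef<const Instruction *> CxtIs = ArrayRef<const Instruction *>());
      // was: const Instruction *CxtI = nullptr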