Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
This seems to mess up the interface to TTI quite a lot. Are there any other cases than the SLP vectorizer where we would pass a vector of Instructions?
The CxtI only has to be a context. It gets a bit fuzzy, but could we just pass the first instruction if it is similar enough to the other instructions in the TreeEntry? It looks like the first item is already passed in at the moment.
Also, on a conceptual level - mul's are expensive, addition is relatively cheap. Would it make sense to try and mark the fadd as cheap by looking at the operands? (When I've tried in the past the performance wasn't great).
Yeah, the new argument is specifically there to support SLP's use case. I don't think other passes are in a similar situation at the moment. There's also a version that keeps the logic in SLP: D132872, but @ABataev argued for making this generally available.
The CxtI only has to be a context. It gets a bit fuzzy, but could we just pass the first instruction if it is similar enough to the other instructions in the TreeEntry? It looks like the first item is already passed in at the moment.
I think all instructions in a TreeEntry should be very similar in almost all cases (same opcode). But here we need to specifically look at the users to determine if the users of all instructions in the bundle will allow fusion.
Now while spelling this out, maybe we could instead fuse eligible FMUL + FADD/FSUB TreeEntry nodes directly to a single FMULADD/SUB TreeEntry instead of checking for fusion opportunities for the vector version? @ABataev do you think that would be easily doable?
Also, on a conceptual level - mul's are expensive, addition is relatively cheap. Would it make sense to try and mark the fadd as cheap by looking at the operands? (When I've tried in the past the performance wasn't great).
I think when I tried this a while ago in the other direction it turned out less profitable.
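To make the point about users concrete: passing only the first instruction as context cannot answer whether every lane's multiply feeds a fusable add or subtract. The following is a minimal sketch of a per-bundle check, not the patch's implementation. `Inst`, `Opcode`, `canFuseWithUser` and `bundleCanFuse` are hypothetical stand-ins rather than LLVM types, and the rule used (single user, user is FADD/FSUB, both allow contraction) is a simplifying assumption.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-ins for IR opcodes; these are not LLVM's real types.
enum class Opcode { FMul, FAdd, FSub, Other };

struct Inst {
  Opcode Op;
  bool AllowContract;              // models the 'contract' fast-math flag
  std::vector<const Inst *> Users; // instructions that consume this value
};

// Assumed rule: an fmul can fuse when it has exactly one user, that user is
// an fadd or fsub, and both instructions allow contraction.
bool canFuseWithUser(const Inst &Mul) {
  if (Mul.Op != Opcode::FMul || !Mul.AllowContract || Mul.Users.size() != 1)
    return false;
  const Inst *U = Mul.Users.front();
  return (U->Op == Opcode::FAdd || U->Op == Opcode::FSub) && U->AllowContract;
}

// The bundle is only treated as fusable if every member passes the check,
// which is why a single context instruction is not enough.
bool bundleCanFuse(const std::vector<const Inst *> &Bundle) {
  return !Bundle.empty() &&
         std::all_of(Bundle.begin(), Bundle.end(),
                     [](const Inst *I) { return canFuseWithUser(*I); });
}
```

A bundle where one multiply has a second, non-arithmetic user is rejected here even though its first element looks fusable on its own.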
Maybe add a specific function that returns a bool indicating whether it is preferable to use FMA instead?
Now while spelling this out, maybe we could instead fuse eligible FMUL + FADD/FSUB TreeEntry nodes directly to a single FMULADD/SUB TreeEntry instead of checking for fusion opportunities for the vector version? @ABataev do you think that would be easily doable?
Everything is doable, it is just a question of time. Need to adjust the cost somehow, add a flag (probably!) to the node(s) for possible "FMAsation" and change the codegen to emit FMA instead of fmul+fadd/fsub.
I think the issue here is that it is not as simple as asking a boolean question.
We need to adjust both the scalar and vector costs, depending on whether either can use FMAs. I think if we support this in TTI, then it should be integrated into the existing APIs. If we add a new interface just geared at the SLP use case, general TTI users won't benefit anyways and then IMO it would be better to keep SLP logic in SLPVectorizer.cpp, at least initially.
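A small worked example may help show why a single boolean does not capture this. The sketch below compares a scalar and a vector estimate for `Lanes` multiply-add pairs, where each side may or may not fuse independently. `OpCosts`, `pairCost` and `vectorMinusScalarCost` are hypothetical names, the cost values are made up for illustration, and the sign convention (negative favors vectorizing) is this sketch's own assumption.

```cpp
// Hypothetical per-operation costs; real values would come from the target.
struct OpCosts {
  int Mul;
  int Add;
  int Fma;
};

// Cost of one multiply followed by one add/sub, fused when allowed.
int pairCost(const OpCosts &C, bool CanFuse) {
  return CanFuse ? C.Fma : C.Mul + C.Add;
}

// Difference between one vector pair and Lanes scalar pairs.
// Negative values favor vectorizing in this sketch.
int vectorMinusScalarCost(const OpCosts &Scalar, const OpCosts &Vector,
                          unsigned Lanes, bool ScalarCanFuse,
                          bool VectorCanFuse) {
  int ScalarTotal = static_cast<int>(Lanes) * pairCost(Scalar, ScalarCanFuse);
  int VectorTotal = pairCost(Vector, VectorCanFuse);
  return VectorTotal - ScalarTotal;
}
```

With unit costs and two lanes, the difference is -2 when neither side fuses, 0 when only the scalar code fuses, and -1 when both fuse, so the decision depends on both answers at once.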
Everything is doable, it is just a question of time. Need to adjust the cost somehow, add a flag (probably!) to the node(s) for possible "FMAsation" and change the codegen to emit FMA instead of fmul+fadd/fsub.
Right, the question is what the best path forward is to incrementally improve the situation without adding too much churn until we know the cost-based decision works well for a range of targets.
llvm/include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
1083 | In the inline above, an explicit ArrayRef constructor is used. I updated the code here to do the same. | |
2290 | No default arg needed here it seems, I removed it. | |
llvm/include/llvm/CodeGen/BasicTTIImpl.h | ||
824 | In the inline above, an explicit ArrayRef constructor is used. I updated the code here to do the same. |
Agree, that's why I thought it is better to make it part of TTI.
If we add a new interface just geared at the SLP use case, general TTI users won't benefit anyways and then IMO it would be better to keep SLP logic in SLPVectorizer.cpp, at least initially.
We already have SLP specific functions (at least for now) in TTI.
Right, the question is what the best path forward is to incrementally improve the situation without adding too much churn until we know the cost-based decision works well for a range of targets.
The cost still needs to be adjusted before we do the actual replacement.
llvm/include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
2292 | When applying this patch series, I see compilation errors like this: In file included from /localdisk2/schmidtw/llvm-project/llvm/lib/Target/Lanai/LanaiTargetTransformInfo.h:22, from /localdisk2/schmidtw/llvm-project/llvm/lib/Target/Lanai/LanaiTargetMachine.cpp:17: | ^~~~~ | | | llvm::ArrayRef<const llvm::Instruction*> 98 | const Instruction *CxtI = nullptr) { | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~ This occurs for Lanai, SystemZ, PowerPC, Hexagon, BPF, and NVPTX. | |
llvm/include/llvm/CodeGen/BasicTTIImpl.h | ||
884 | When applying this patch series, I see compilation errors for various targets stemming from this. Example: In file included from /localdisk2/schmidtw/llvm-project/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h:23, from /localdisk2/schmidtw/llvm-project/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp:15: 884 | CxtIs); | ^~~~~ | | | llvm::ArrayRef<const llvm::Instruction*> /localdisk2/schmidtw/llvm-project/llvm/lib/Target/WebAssembly/WebAssemblyTargetT 57 | const Instruction *CxtI) { | ~~~~~~~~~~~~~~~~~~~^~~~ Appears for WebAssembly, Lanai, SystemZ, Hexagon, NVPTX, PowerPC, BPF, and AMDGPU. |
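One plausible reading of these diagnostics is that the listed targets still declare the hook with a single `const Instruction *CxtI` parameter, which hides the updated base declaration, so the forwarded call that passes an `ArrayRef` no longer type-checks. The sketch below reproduces that shape with a CRTP base in plain C++. `CostModelBase`, `DefaultTarget`, `UpdatedTarget`, `getArithmeticCost` and `costViaDerived` are hypothetical names, and `std::vector` stands in for `ArrayRef`; this is an illustration of the suspected cause, not the actual LLVM code.

```cpp
#include <vector>

struct Instruction {}; // hypothetical stand-in, not llvm::Instruction

using ContextList = std::vector<const Instruction *>;

// CRTP base that forwards to the most-derived implementation.
template <typename Derived> struct CostModelBase {
  // Default implementation with the new list-of-contexts parameter.
  int getArithmeticCost(unsigned Opcode, const ContextList &CxtIs) {
    return 1;
  }

  // Forwarding call site: name lookup starts in Derived.
  int costViaDerived(unsigned Opcode, const ContextList &CxtIs) {
    return static_cast<Derived *>(this)->getArithmeticCost(Opcode, CxtIs);
  }
};

// A target without its own override picks up the base default.
struct DefaultTarget : CostModelBase<DefaultTarget> {};

// A target that kept the old declaration
//   int getArithmeticCost(unsigned Opcode, const Instruction *CxtI = nullptr);
// would hide the base overload, and costViaDerived would fail to compile
// because a list cannot convert to a single pointer. Updating the override
// to the new parameter type resolves that.
struct UpdatedTarget : CostModelBase<UpdatedTarget> {
  int getArithmeticCost(unsigned Opcode, const ContextList &CxtIs) {
    return CxtIs.empty() ? 2 : 1;
  }
};
```

If this reading is right, each affected target's override would need the same signature change as the base, or a compatibility overload that accepts the list.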