Add clamp combine. Source is fminnum(fmaxnum(Val, K0), K1) or
fmaxnum(fminnum(Val, K1), K0) with K0=0.0 and K1=1.0 or fmed3
intrinsic with 0.0 and 1.0 as two out of three operands.
Details
- Reviewers
foad arsenm - Commits
- rG0b34ffe4a61e: AMDGPU/GlobalISel: Add clamp combine
Diff Detail
Event Timeline
Missing tests for f16 cases
llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp | ||
---|---|---|
187 ↗ | (On Diff #300312) | Check this first around the whole section? |
207–216 ↗ | (On Diff #300312) | These look like two distinct apply actions to me |
224–228 ↗ | (On Diff #300312) | getConstantFPVRegVal |
233 ↗ | (On Diff #300312) | This should be implied by the matcher opcode |
234 ↗ | (On Diff #300312) | Early return on intrinsic check |
235–246 ↗ | (On Diff #300312) | I think the flow of the DAG version is easier to handle, swapping the values and then checking isClampZeroToOne at the end |
247 ↗ | (On Diff #300312) | Typo remainig |
254 ↗ | (On Diff #300312) | Should get this from the MachineIRBuilder |
256–260 ↗ | (On Diff #300312) | I think something is confused here. IEEE mode only changes signaling nan behavior. I also think trying to handle IEEE=0 correctly with snans is a fools errand since *no* operations will behave correctly. The DAG combine for this does not consider IEEE mode |
275 ↗ | (On Diff #300312) | Should not construct a new MachineIRBuilder, there should be one existing already in the combiner |
llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp | ||
---|---|---|
207–216 ↗ | (On Diff #300312) | They share same match action. If we want to split them to matchMinMaxToClamp and matchMinMaxToMed3 we might end up matching min/max pattern twice. |
235–246 ↗ | (On Diff #300312) | That swaps operands and might produce different result with IEEE=1 |
256–260 ↗ | (On Diff #300312) | IEEE=0 does not treat SNaN differently and needs no SNaN checks. With IEEE=0 DX10Clamp=1 (shaders want all possible x2, x4, div2, and clamp modifiers) there is no need for any nan check. Other operation don't really mention that backend needs to check for special values. They do have different behavior when input is SNaN but correctness in not mentioned. What do we have to do force correct behavior for IEEE=0? |
Is it possible to separate this into two different combines:
- min(max(...)) or max(min(...)) -> med(x, K0, K1)
- med(x, 0.0, 1.0) -> clamp(x)
?
llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp | ||
---|---|---|
235–246 ↗ | (On Diff #300312) | Are you saying the DAG version is broken? |
256–260 ↗ | (On Diff #300312) | With IEEE=0, no operations quiet signalling nans. All operations behave incorrectly and pass through the snan instead of quieting it. The lack of quieting the snan is a correctness issue. It would be a massive amount of work to actually implement the correct behavior and manually quiet snan inputs to implement the correct generic operation semantics, which we will probably never do. |
llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp | ||
---|---|---|
235–246 ↗ | (On Diff #300312) | DAG version should also check for SNaN (This is for IEEE=true) alongside DX10Clamp. swapping is fine for QNaN. |
256–260 ↗ | (On Diff #300312) |
Is this for llvm-ir? Would it then be acceptable to only perform combine if nnan flag (check using isKnownNeverNaN) is present on the instruction with IEEE=false? |
llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp | ||
---|---|---|
256–260 ↗ | (On Diff #300312) | Yes. Officially the IR description pretends snans do not exist, but really they do and need to be quieted by FP canonicalizing FP operations as per IEEE754. I'm not sure it even makes sense to try to get correct snan behavior with IEEE=0 since it's never going to be implemented. The comment here absolutely needs to make clear the distinction between snan and qnan handling Getting IEEE=true correct is far more interesting/important. |
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp | ||
---|---|---|
218 | I would expect the helper above to return the two constants directly rather than having to requery the registers | |
227–229 | Should swap these conditions so the cheap checks are first | |
330–336 | There's already a helper for this, getDefIgnoringCopies/getDefSrcRegIgnoringCopies |
llvm/lib/Target/AMDGPU/AMDGPUCombine.td | ||
---|---|---|
68 | Can you use the standard register_matchdata for this? |
Can you use the standard register_matchdata for this?