This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/GlobalISel: Lower G_FREM
ClosedPublic

Authored by Petar.Avramovic on Jul 22 2020, 7:07 AM.

Details

Summary

Add custom lower for G_FREM.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2020, 7:07 AM
arsenm added inline comments.Jul 22 2020, 7:12 AM
llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
2737

buildFDiv? These are all dropping the flags too

2738

buildFFloor?

2738

Is this a correct handling of frem? The AMDGPU dag expansion uses ISD::FTRUNC, but I'm not sure that was ever correct

I tried fixing the existing one to use ffloor instead of ftrunc; OpenCL conformance still fails when I plug frem into fmod

Petar.Avramovic marked an inline comment as done.Jul 22 2020, 8:22 AM
Petar.Avramovic added inline comments.
llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
2738

The G_FPTRUNC complains about src and dst being same size, I hit

assert(DstTy.getSizeInBits() < SrcTy.getSizeInBits() &&
       "invalid widening trunc");

from the variable name I thought that FFloor could work but I guess that it works only when operands have same sign. (btw vulkan cts tests where I saw this passed).
Dag expansion seems correct from the description of fmod/frem.
This generic instruction here should discard digits after decimal point, do we have such instruction?

arsenm added inline comments.Jul 22 2020, 8:27 AM
llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
2738

ISD::FTRUNC is G_INTRINSIC_TRUNC

I'm not really clear on what frem really is, or if it's really supposed to be the same as OpenCL fmod

Preserve flags, and use G_INTRINSIC_TRUNC.

I don't think copy the DAG path was necessarily the right choice. The correct thing to do might be to make the DAG path use floor? Is either even correct if this fails conformance?

I don't think this should go in generic code unless we're more sure this is the correct operation

foad added a comment.EditedJul 23 2020, 3:20 AM

I don't think copy the DAG path was necessarily the right choice. The correct thing to do might be to make the DAG path use floor? Is either even correct if this fails conformance?

I don't think this should go in generic code unless we're more sure this is the correct operation

I am convinced that trunc (not floor) is what you need here to implement IR's frem instruction, where the result has the same sign as the dividend (same as the C library fmod).

See also the OpenCL fmod spec which is pretty clear on this: http://man.opencl.org/fmod.html

foad accepted this revision.Jul 23 2020, 7:00 AM

LGTM if Matt has no further comments.

This revision is now accepted and ready to land.Jul 23 2020, 7:00 AM
arsenm requested changes to this revision.Jul 23 2020, 7:03 AM

I'd still like to understand why this is failing conformance if I use frem for opencl fmod. My current suspicion is the fsub + fmul really needs to be an FMA

This revision now requires changes to proceed.Jul 23 2020, 7:03 AM
foad added a comment.Jul 23 2020, 7:07 AM

I'd still like to understand why this is failing conformance if I use frem for opencl fmod.

What is the alternative to using frem, that passes conformance?

I'd still like to understand why this is failing conformance if I use frem for opencl fmod.

What is the alternative to using frem, that passes conformance?

A huge expansion that involves loops:
https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/ocml/src/remainderF_base.h#L38

foad added a comment.Jul 23 2020, 7:26 AM

I'd still like to understand why this is failing conformance if I use frem for opencl fmod.

What is the alternative to using frem, that passes conformance?

A huge expansion that involves loops:
https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/ocml/src/remainderF_base.h#L38

Then it needs debugging. Perhaps there are cases where the simple expansion gives fmod(x,y)==y, even though the result is supposed to have magnitude strictly less than y. Or perhaps it doesn't handle nans or infinities correctly.

I'd still like to understand why this is failing conformance if I use frem for opencl fmod.

What is the alternative to using frem, that passes conformance?

A huge expansion that involves loops:
https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/ocml/src/remainderF_base.h#L38

Then it needs debugging. Perhaps there are cases where the simple expansion gives fmod(x,y)==y, even though the result is supposed to have magnitude strictly less than y. Or perhaps it doesn't handle nans or infinities correctly.

The errors aren't small, and aren't just edge cases:
ERROR: fmod: inf ulp error at {-0x1.7a1ba8p+111 (0xf73d0dd4), -0x1.5b9526p-97 (0x8f2dca93)}: *-0x1.c5f348p-98 vs. inf (0x7f800000) at index: 3

ERROR: fmod: -inf ulp error at {0x1.80bb0ep+70 (0x62c05d87), 0x1.08e51ap-82 (0x1684728d)}: *0x1.9d1d8cp-83 vs. -inf (0xff800000) at index: 0

ERROR: fmod: -134961856.000000 ulp error at {-0x1.f47464p-69 (0x9d7a3a32), -0x1.9bdef4p-97 (0x8f4def7a)}: *-0x1.682a78p-98 vs. -0x1.17eep-94 (0x908bf700) at index: 0

ERROR: fmod: inf ulp error at {-0x1.50c0b6p+67 (0xe128605b), 0x1.80b7ep-90 (0x12c05bf0)}: *-0x1.fa594p-91 vs. inf (0x7f800000) at index: 3

ERROR: fmod: -672311475662299076755456.000000 ulp error at {0x1.f8807ep+111 (0x777c403f), -0x1.7711aap+32 (0xcfbb88d5)}: *0x1.b2ab38p+30 vs. -0x1.1cbc28p+86 (0xea8e5e14) at index: 1

ERROR: fmod: 20258841443692227182914624580021649408.000000 ulp error at {-0x1.0107a4p+41 (0xd40083d2), -0x1.144c6ap-86 (0x948a2635)}: *-0x1.85d4acp-87 vs. 0x1.e7b6cp+13 (0x4673db60) at index: 2

ERROR: fmod: -29506071830531670016.000000 ulp error at {0x1.804b2cp+9 (0x44402596), -0x1.477f18p-57 (0xa323bf8c)}: *0x1.41dc88p-57 vs. -0x1.997aap-16 (0xb7ccbd50) at index: 4

ERROR: fmod: 211623838063271919251058575015936.000000 ulp error at {0x1.4d90a4p+41 (0x5426c852), 0x1.96784ap-65 (0x1f4b3c25)}: *0x1.61a4fp-68 vs. 0x1.4de23p+16 (0x47a6f118) at index: 3

ERROR: fmod: -inf ulp error at {0x1.9aedb8p+83 (0x694d76dc), 0x1.4c23f2p-119 (0x42611f9)}: *0x1.bba984p-120 vs. -inf (0xff800000) at index: 0

ERROR: fmod: 100959080964579999364158242467109404672.000000 ulp error at {0x1.aa412p+26 (0x4cd52090), -0x1.92be06p-101 (0x8d495f03)}: *0x1.59267p-101 vs. 0x1.2fd00cp+2 (0x4097e806) at index: 0

ERROR: fmod: inf ulp error at {-0x1.6ebdap+99 (0xf1375ed0), -0x1.fe72bp-36 (0xadff3958)}: *-0x1.39174p-38 vs. inf (0x7f800000) at index: 0

ERROR: fmod: -101767765295104.000000 ulp error at {-0x1.7f0c94p+21 (0xca3f864a), 0x1.8df2eap-29 (0x3146f975)}: *-0x1.191792p-29 vs. -0x1.723aap-6 (0xbcb91d50) at index: 4

ERROR: fmod: 5685162310369280.000000 ulp error at {-0x1.8ccd58p-14 (0xb8c666ac), 0x1.b15ed6p-71 (0x1c58af6b)}: *-0x1.b6d34p-74 vs. 0x1.432ap-45 (0x29219500) at index: 2

ERROR: fmod: inf ulp error at {-0x1.5e45dep+81 (0xe82f22ef), -0x1.4bae4cp-112 (0x87a5d726)}: *-0x1.56f99p-114 vs. inf (0x7f800000) at index: 1

ERROR: fmod: -inf ulp error at {0x1.cd2da6p+126 (0x7ee696d3), -0x1.647a26p-40 (0xabb23d13)}: *0x1.03ad2ap-40 vs. -inf (0xff800000) at index: 3

ERROR: fmod: -50753958115442425856.000000 ulp error at {-0x1.abb154p+118 (0xfad5d8aa), 0x1.8eb61ap+61 (0x5e475b0d)}: *-0x1.eeecp+52 vs. -0x1.602d24p+94 (0xeeb01692) at index: 0

ERROR: fmod: 65524330144892614344704.000000 ulp error at {0x1.ce604ap+11 (0x45673025), 0x1.0bbbaap-65 (0x1f05ddd5)}: *0x1.e25edp-66 vs. 0x1.bc0298p-14 (0x38de014c) at index: 1

ERROR: fmod: -inf ulp error at {0x1.c98fd6p+122 (0x7ce4c7eb), 0x1.fe4a44p-116 (0x5ff2522)}: *0x1.98780cp-116 vs. -inf (0xff800000) at index: 0

ERROR: fmod: inf ulp error at {-0x1.3132dep+84 (0xe998996f), -0x1.a91dcep-110 (0x88d48ee7)}: *-0x1.9270eap-110 vs. inf (0x7f800000) at index: 4

ERROR: fmod: -inf ulp error at {0x1.24c0e6p+36 (0x51926073), -0x1.a5528ap-122 (0x82d2a945)}: *0x1.86bd68p-122 vs. -inf (0xff800000) at index: 1

ERROR: fmod: 328137422309548672426878384939712118784.000000 ulp error at {-0x1.f16f6cp+68 (0xe1f8b7b6), 0x1.34d296p-58 (0x229a694b)}: *-0x1.5fa6ap-61 vs. 0x1.edb9fp+43 (0x5576dcf8) at index: 2

ERROR: fmod: -355087763374080.000000 ulp error at {0x1.b2ab98p+95 (0x6f5955cc), -0x1.ffc7fcp+45 (0xd67fe3fe)}: *0x1.fe7558p+45 vs. -0x1.42f35p+70 (0xe2a179a8) at index: 7

ERROR: fmod: -22169001879878959779282944.000000 ulp error at {-0x1.92d1bap-15 (0xb84968dd), -0x1.a4b14cp-100 (0x8dd258a6)}: *-0x1.38fb4p-103 vs. -0x1.25678p-42 (0xaa92b3c0) at index: 1

ERROR: fmod: inf ulp error at {-0x1.080adep+92 (0xed84056f), -0x1.c2f17cp-77 (0x996178be)}: *-0x1.cd9d5p-79 vs. inf (0x7f800000) at index: 0

ERROR: fmod: inf ulp error at {-0x1.17f76ap+95 (0xef0bfbb5), -0x1.4e0fe4p-108 (0x89a707f2)}: *-0x1.a2f8c8p-109 vs. inf (0x7f800000) at index: 2

ERROR: fmod: -6874084436590301837445300224.000000 ulp error at {0x1.697b38p+54 (0x5ab4bd9c), 0x1.f3988cp-39 (0x2c79cc46)}: *0x1.b4c1ap-40 vs. -0x1.6361cp+29 (0xce31b0e0) at index: 1

ERROR: fmod: -8738995383377592320.000000 ulp error at {0x1.496b64p+85 (0x6a24b5b2), 0x1.bae466p+24 (0x4bdd7233)}: *0x1.7a58cp+20 vs. -0x1.e51c98p+59 (0xdd728e4c) at index: 1

ERROR: fmod: 343065079314668060672.000000 ulp error at {-0x1.21b74ap-47 (0xa810dba5), 0x1.fdb82ep-117 (0x57edc17)}: *-0x1.606e5ap-117 vs. 0x1.298fcp-72 (0x1b94c7e0) at index: 4

ERROR: fmod: -inf ulp error at {0x1.8cdda2p+60 (0x5dc66ed1), 0x1.aeea96p-109 (0x957754b)}: *0x1.6ac0ep-112 vs. -inf (0xff800000) at index: 2

ERROR: fmod: -2991471016484578263040000.000000 ulp error at {-0x1.6a71bep+46 (0xd6b538df), -0x1.4091d2p-34 (0xaea048e9)}: *-0x1.7398e8p-36 vs. -0x1.3cbbfcp+22 (0xca9e5dfe) at index: 3

ERROR: fmod: -inf ulp error at {0x1.ec3bf8p+71 (0x63761dfc), -0x1.36870ep-68 (0x9d9b4387)}: *0x1.06b884p-68 vs. -inf (0xff800000) at index: 0

ERROR: fmod: -1561864113370783809536.000000 ulp error at {0x1.c99b3ap+127 (0x7f64cd9d), -0x1.8f9088p+58 (0xdcc7c844)}: *0x1.aed24p+55 vs. -0x1.52acep+102 (0xf2a95670) at index: 0

foad added a comment.Jul 23 2020, 8:10 AM

The errors aren't small, and aren't just edge cases:

The ones that return +/- inf are because the division overflows. The others look like rounding error in the division when the result of x/y is large but doesn't overflow - this can easily lead to a result with the wrong sign, or with magnitude larger than y. I don't think it's realistic to try to fix these problems with an inline expansion. It really needs a library function.

I suppose the question is: is this patch still a useful default implementation of frem?

The errors aren't small, and aren't just edge cases:

The ones that return +/- inf are because the division overflows. The others look like rounding error in the division when the result of x/y is large but doesn't overflow - this can easily lead to a result with the wrong sign, or with magnitude larger than y. I don't think it's realistic to try to fix these problems with an inline expansion. It really needs a library function.

I suppose the question is: is this patch still a useful default implementation of frem?

For something that doesn't work perfectly, I don't think it belongs in the generic code. It would be more palatable to keep this in AMDGPU to match the DAG behavior

The errors aren't small, and aren't just edge cases:

The ones that return +/- inf are because the division overflows. The others look like rounding error in the division when the result of x/y is large but doesn't overflow - this can easily lead to a result with the wrong sign, or with magnitude larger than y. I don't think it's realistic to try to fix these problems with an inline expansion. It really needs a library function.

I suppose the question is: is this patch still a useful default implementation of frem?

For something that doesn't work perfectly, I don't think it belongs in the generic code. It would be more palatable to keep this in AMDGPU to match the DAG behavior

Maybe it should also fail to legalize if it's not afn?

foad added a comment.Jul 24 2020, 2:00 AM

The errors aren't small, and aren't just edge cases:

The ones that return +/- inf are because the division overflows. The others look like rounding error in the division when the result of x/y is large but doesn't overflow - this can easily lead to a result with the wrong sign, or with magnitude larger than y. I don't think it's realistic to try to fix these problems with an inline expansion. It really needs a library function.

I suppose the question is: is this patch still a useful default implementation of frem?

For something that doesn't work perfectly, I don't think it belongs in the generic code. It would be more palatable to keep this in AMDGPU to match the DAG behavior

Sounds reasonable.

Petar.Avramovic edited the summary of this revision. (Show Details)

Switch to custom lowering and update to match changes in dag custom lowering for frem (and use same lowering for s16 also).

foad accepted this revision.Aug 6 2020, 6:06 AM

Looks OK to me but please wait to hear from @arsenm too.

llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h
1643–1649

Maybe put this declaration next to buildFAdd / buildFSub ?

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
623–625

Does this need to be conditional on ST.has16BitInsts ?

1775–1776

Use buildIntrinsicTrunc?

arsenm added inline comments.Aug 6 2020, 7:09 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
623–625

It doesn't strictly have to be, but it would produce a better result to force promotion to 32-bit first

Addressed review comments.

foad added inline comments.Aug 7 2020, 6:15 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
623–625

I assume Matt meant to force promotion to 32-bit first if the subtarget doesn't have 16-bit instructions. Compared to the previous version of your patch, the code for fast_frem_f16 has got better for CI but worse for VI.

Force promotion to 32-bit first when subtarget doesn't have 16-bit instruction.

foad accepted this revision.Aug 7 2020, 6:35 AM
This revision was not accepted when it landed; it landed in state Needs Review.Aug 10 2020, 1:18 AM
This revision was automatically updated to reflect the committed changes.