This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Add an enhanced conversion from i64 to f32.
ClosedPublic

Authored by hliao on Jul 30 2021, 10:57 AM.

Diff Detail

Event Timeline

hliao created this revision.Jul 30 2021, 10:57 AM
hliao requested review of this revision.Jul 30 2021, 10:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 30 2021, 10:57 AM

This is a minor code quality improvement and helps reduce the conversion code length by half.

arsenm added inline comments.Jul 30 2021, 11:13 AM
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2527

Why is zero undef OK? The high half could be all zeroes

2565

This is redundant since it should assert in the getNode

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2100–2101

You need to assign these constants to variables. The evaluation order of function arguments isn't defined so this could result in a different instruction ordering depending on the host compiler

2123–2124

You can just do {Dst} directly to the buildInstr call

hliao added inline comments.Jul 30 2021, 11:37 AM
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2527

For the all 0 case, the following 'select' will return 32, so that all hi bits are shifted out. In that case, only the low 32 bits need converting.

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2123–2124

I remembered I tried that but still cannot compile it. Let me try again

hliao marked an inline comment as done.Jul 30 2021, 11:46 AM
hliao added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2123–2124

Still cannot compile. I noticed that 'buildIntrinsic' has two prototypes. {Dst} make that ambiguous for the compiler to choose.

hliao updated this revision to Diff 363175.Jul 30 2021, 11:57 AM

Revise following comments and pre-merge checks.

hliao updated this revision to Diff 363177.Jul 30 2021, 12:04 PM

Update part of tests.

arsenm added inline comments.Jul 31 2021, 6:38 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2123–2124

This ambiguity isn't great. Using ArrayRef<Register>{Dst} works

foad added inline comments.Aug 2 2021, 2:50 AM
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2455–2456

Can you rewrite this whole block comment to only describe what the code does now?

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2071

Move this inside the "if (Signed)" case.

llvm/test/CodeGen/AMDGPU/GlobalISel/cvt_f32_ubyte.ll
1085–1097

Not related to your patch, but we should generate v_cvt_f32_ubyte0 here, shouldn't we?

llvm/test/CodeGen/AMDGPU/sint_to_fp.i64.ll
1–2

Can you autogenerate the checks for this file and uint_to_fp.i64.ll and then rebase?

arsenm added inline comments.Aug 2 2021, 6:23 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/cvt_f32_ubyte.ll
1085–1097

Yes, but nothing is trying to reduce the bitwidth of anything right now

foad added inline comments.Aug 2 2021, 7:55 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2105

Can't you buildCTLZ here and let the legalizer worry about how to handle the ZERO case? In particular it would be better to do v_ffbh followed by v_min_u32 with 32, instead of a compare and select.

hliao updated this revision to Diff 363733.Aug 3 2021, 7:26 AM

Rvise the patch following reviewers' comment.

hliao marked 3 inline comments as done.Aug 3 2021, 7:32 AM
hliao added inline comments.
llvm/test/CodeGen/AMDGPU/GlobalISel/cvt_f32_ubyte.ll
1085–1097

the constant folding on those intrinsics is not supported yet. You may notice that the unsigned byte conversion is already simplified a lot due to the general SDNode used, where the only missing part if LDEXP(v, 0). Once we teach the combiner to understand ffbh_i32 and ldexp, we should get the expected result. I also found other minor issues related, will prepare patches to improve them soon.

hliao added inline comments.Aug 3 2021, 8:46 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2105

I reused part of the original code and didn't try whether we support CLTZ without undefined zero behavior. We may try that later.

hliao updated this revision to Diff 363797.Aug 3 2021, 10:12 AM

use CTLZ directly.

hliao marked an inline comment as done.Aug 3 2021, 10:13 AM
hliao added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2105

Yeah, that works. I could use CTLZ directly.

arsenm accepted this revision.Aug 3 2021, 5:13 PM

LGTM with nit

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2482

Subtarget is already available in the class

This revision is now accepted and ready to land.Aug 3 2021, 5:13 PM
foad added a comment.Aug 4 2021, 12:49 AM

Please rebase on D107429 or similar. Without that, I can't see the full ISA for either the old or the new conversion sequences.

hliao updated this revision to Diff 364084.Aug 4 2021, 6:35 AM

Reuse 'Subtarget' in the target lowering. Rebase against the main branch again.

hliao updated this revision to Diff 364091.Aug 4 2021, 6:55 AM

Rebase onto D107429.

foad accepted this revision.Aug 4 2021, 9:21 AM

Patch looks OK thanks, just some minor comments inline.

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
2102–2103

You could use buildUMin(S32, LS, MaxShAmt) here instead of compare+select.

2105

Nit: I think LLVM style is to put braces around the "else" part as well, if there are braces around the "if" part.

llvm/test/CodeGen/AMDGPU/uint_to_fp.i64.ll
42–43

Why does this suddenly start using valu instructions for uniform values?

foad added inline comments.Aug 4 2021, 9:32 AM
llvm/test/CodeGen/AMDGPU/sint_to_fp.i64.ll
63–64

I think you might be able to shave off another instruction from this sequence with something like:

v_alignbit v0, v1, v2, 31 ; extract bits 62..31
v_ashrrev v3, 31, v2 ; duplicate sign bit 32 times
v_xor v0, v0, v3 ; mask is 0 where bits 62..31 match sign bit
v_ffbh_u32 v0, v0 ; count how many of the high bits from 62..31 match the sign bit
v_min_u32 v0, 32, v0 ; clamp to 32

Now v0 is the shift amount for the v_lshlrev_b64.

This revision was landed with ongoing or failed builds.Aug 4 2021, 12:33 PM
This revision was automatically updated to reflect the committed changes.
hliao added inline comments.Aug 4 2021, 3:23 PM
llvm/test/CodeGen/AMDGPU/sint_to_fp.i64.ll
63–64

see D107507 for further enhancement. I choose another sequence with less instruction because v_alignbit is not available on pre-GCN targets as well as SALU. Also, the final 32-bit integer conversion is also revised inspired by D107474. Overall, it helps reduce the uitofp by 1 insn or 2 (with D107474). and sitofp by 2 insn.