This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add intrinsic support for gpr<->fpr flavors of fixed-point converts
Needs ReviewPublic

Authored by rmcclure on Jun 6 2022, 2:58 PM.

Details

Summary

This patch adds patterns to generate (for example) "UCVTF Dd, Wn, #imm" from a call to "aarch64.neon.vcvtfxu2fp.f64.i32" (and similar for other fixed-point converts).

Diff Detail

Event Timeline

rmcclure created this revision.Jun 6 2022, 2:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 6 2022, 2:58 PM
rmcclure requested review of this revision.Jun 6 2022, 2:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 6 2022, 2:58 PM

One of the existing tests (do_stuff in CodeGen/AArch64/arm64-fixed-point-scalar-cvt-dagcombine.ll) fails with this change, because we now generate an fmov followed by a gpr->fpr ucvtf, instead of the expected fpr->fpr ucvtf. Of course, other tests show that we now avoid an unnecessary fmov after this change.
Is there a preferred method for making a better decision for when to use the gpr->fpr or fpr->fpr flavor of these instructions?
I see the AArch64AdvSIMDScalarPass pass which converts certain GPR ops into AdvSIMD scalar ops when it would save on copies, which sounds mildly similar to the problem I'd like to solve here. Would it be reasonable to leverage that pass for this purpose?

Hello - Can you give more details about why we want these to be added? Are we missing ACLE intrinsics that clang doesn't have, or are some intrinsics broken? I don't see definitions for vcvtd_n_f64_s32 in the ACLE for example.

Hello - Can you give more details about why we want these to be added? Are we missing ACLE intrinsics that clang doesn't have, or are some intrinsics broken? I don't see definitions for vcvtd_n_f64_s32 in the ACLE for example.

The main goal is just to expose these instructions to anyone who might want them, since they expose functionality that isn't available with the existing intrinsic support.

I agree that vcvtd_n_f64_s32 doesn't exist, but vcvth_n_s64_f16 does, for example, which is where things get a bit more interesting.

In mainline, LLVM currently produces something like the following:

fcvtzs  h0, h0, #10
fmov    x0, d0

This technically matches what the ACLE lists as the generated instruction (FCVTSZ Hd,Hn,#n), but that seems wrong to me - it drops the sign information from the result (e.g. converting -1.0f produces a positive result).
For comparison, after this change, LLVM generates:

fcvtzs  x0, h0, #10

which preserves the sign information. It also supports results that exceed the range of a 16-bit integer, which also seems sensible.

As two other points of reference:
For the example above, GCC also generates fcvtzs x0, h0, #10
For vcvth_s64_f16 (the fp->integer flavor), mainline LLVM (and GCC) generates fcvtzs x0, h0, which technically violates the ACLE spec in the same way I described above (it expects FCVTZ Hd,Hn), so making the fixed-point<->fp converts behave similarly seems reasonable.

I agree that the LLVM intrinsics should convert between the exact types listed in the signature of the intrinsic, but we need to make sure we still have an intrinsic that produces "scvtf h0, h0, #16" etc.

This patch also seems like it's changing a few too many things at once... can you split the patch into pieces that are more easy to review?

For something like "llvm.aarch64.neon.vcvtfp2fxs.i32.f32", I guess there are two possible instructions with effectively equivalent semantics. Ideally, the compiler is just clever enough to figure out the best one from the context. Given the way SelectionDAG isel works, you probably have to do some sort of post-isel fixup like AArch64AdvSIMDScalarPass .

In terms of what clang generates, we have to follow the NEON spec; we can't use an instruction that produces a different result just because it's more useful. So if the spec says the conversion produces a 16-bit integer, we have to produce a 16-bit integer. That said, it's a little suspicious that the spec says vcvth_n_s16_f16, vcvth_n_s32_f16, and vcvth_n_s64_f16 all produce exactly the same instruction; maybe there's a spec bug? @dmgreen @fpetrogalli can you confirm whether the spec is right?

@dmgreen @fpetrogalli can you confirm whether the spec is right?

Thanks for the info. I've asked internally if anyone has a clear memory. It does look like GCC produces the x/w variants: https://godbolt.org/z/1GqeGv9xn

The main goal is just to expose these instructions to anyone who might want them, since they expose functionality that isn't available with the existing intrinsic support.

OK I see. My main reason for asking was whether the fp_to_si(fmul(x, C)) form would be acceptable to your use-case. Compared to the @llvm.aarch64.neon... intrinsics, whilst not always perfectly identical, do have certain benefits. It depends what the user is after, but the plain IR instructions benefit from all the constantfolding/range analysis/vectorization/etc that can happen in the mid-end, where the neon intrinsics often remain as black-boxes to optimizations. The intrinsics would probably be more accurately specified as fptosi_sat(fmul(x, C)), so long as the constants were precise, but I'm not sure if there is lowering for that yet.

If the fptosi_sat(fmul(x, C)) form is precisely equivalent to the intrinsics, my opinion would be to remove the @llvm.aarch64.neon.vcvt.. intrinsics entirely and reply on pure codegen. You always run into the possibility that the compiler may produce a worse result by mis-optimizing, but the chances of improvement usually outweigh down sides. (I'm not sure if it works in all cases though, if 2^16 can't be represented as a fp16).

we need to make sure we still have an intrinsic that produces "scvtf h0, h0, #16" etc.

That particular example is easy, since there's only one instruction form that does the int16_t -> float16_t conversion.
For the cases where there are multiple choices, are you saying there should be a way to force a particular form, even if it is suboptimal? e.g. something to force generation of scvtf s0, w0, #10 instead of scvtf s0, s0, #10, even if the source is already in an FPR?

For something like "llvm.aarch64.neon.vcvtfp2fxs.i32.f32", I guess there are two possible instructions with effectively equivalent semantics. Ideally, the compiler is just clever enough to figure out the best one from the context. Given the way SelectionDAG isel works, you probably have to do some sort of post-isel fixup like AArch64AdvSIMDScalarPass .

Yes, this was my thought process, too. If AArch64AdvSIMDScalarPass is a reasonable place to add that "cleverness", I'm happy to add it there.

In terms of what clang generates, we have to follow the NEON spec; we can't use an instruction that produces a different result just because it's more useful. So if the spec says the conversion produces a 16-bit integer, we have to produce a 16-bit integer.

Agreed! My claim is that the spec is currently inconsistent here (e.g. vcvth_n_s64_f16 should return an int64_t, but the instruction used in the spec returns an int16_t). This change causes LLVM to match the source/return type specified in the spec, rather than the instruction specified in the spec (as mentioned, this causes us to match LLVM's behavior for the integer <-> fp converts, where the spec is similarly inconsistent).

This patch also seems like it's changing a few too many things at once... can you split the patch into pieces that are more easy to review?

Hm... the patch effectively has 3 components:

  1. adding patterns to map the instrinsics to the instructions
  2. relocates the fp_to_si(fmul(x, C)) (and similar) patterns
  3. adds tests for the new patterns

All 3 components seem mostly inseparable (the existence of the patterns in #2 are protected by tests, so I'm loathe to remove them).
Having said that, I could separate the "fixed-point -> fp" and "fp -> fixed-point" changes, but given how similar the two directions are to each other, I thought it made more sense to keep them together than to separate them.

My main reason for asking was whether the fp_to_si(fmul(x, C)) form would be acceptable to your use-case.

Unfortunately, that form isn't sufficient for my use-case. As you mentioned, there are ranges of fixed-point placement (2^16, and even higher, since scvtf h0, x0, #n supports an immediate up to 64).
Also, my experience is that some optimization pass tends to change the sequence into a form that the backend no longer recognizes (this probably applied more to the fdiv(si_to_fp(x), C) patterns).

Ping! Any further thoughts on this change? To summarize the opens:

  1. arm64-fixed-point-scalar-cvt-dagcombine.ll has a failure because we now select a different flavor of the convert than the test expects. My plan here would be to temporarily disable the test, and submit a subsequent change that lets LLVM select the ~best flavor depending on which register file the inputs/outputs prefer to be in
  2. This technically violates the ACLE spec for these intrinsics. I believe the ACLE spec should be changed, but since LLVM already violates the spec for the int <-> fp converts, I don't believe that this change should be gated by the ACLE spec change.

Hello. I think the first step needs to be to update the ACLE specification, to make sure the changes are the correct approach to take. The specification is at https://github.com/ARM-software/acle and accepts pull requests. With GCC implementing the other semantics for the fp16 scalar vector intrinsics, there should be a high chance such a change would be accepted.