This is an archive of the discontinued LLVM Phabricator instance.

[X86][FP16] Change the order of the operands in complex FMA intrinsics to allow swap between the mul operands.
ClosedPublic

Authored by pengfei on Sep 12 2021, 4:36 AM.

Diff Detail

Event Timeline

pengfei created this revision.Sep 12 2021, 4:36 AM
pengfei requested review of this revision.Sep 12 2021, 4:36 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptSep 12 2021, 4:36 AM

Does gcc use the same builtin name? Our general policy is to have the same interface as gcc if we have a builtin. So if gcc has these builtins the should work the same way.

For IFMA I think I made them commutable by swapping the operands between the builtin and the internal intrinsic using the handling for the X86IntrinsicTable

Does gcc use the same builtin name? Our general policy is to have the same interface as gcc if we have a builtin. So if gcc has these builtins the should work the same way.

No. We don't sync with GCC on the builtin name during the development. We had a disscussion and decided to not keep them aligned due to 1) target specific builtins are compiler private names that no need to keep it compatible with other compilers; and 2) we already differentiate the target builtins with GCC long ago on the naming, masking etc. Currently, regardless the name, GCC uses the same C, A, B order with our existing implementation. https://gitlab.com/x86-gcc/gcc/-/blob/users/intel/liuhongt/independentfp16_wip/gcc/config/i386/avx512fp16intrin.h#L6672

For IFMA I think I made them commutable by swapping the operands between the builtin and the internal intrinsic using the handling for the X86IntrinsicTable

Is this one https://github.com/llvm/llvm-project/commit/80c8b80919e0049da32f018d98e4d75ff562cfa8? Do you mean I should add Commutative in IntrinsicsX86.td too? What's this flag used for? I saw we only add them in a few intrinsics.

Does gcc use the same builtin name? Our general policy is to have the same interface as gcc if we have a builtin. So if gcc has these builtins the should work the same way.

No. We don't sync with GCC on the builtin name during the development. We had a disscussion and decided to not keep them aligned due to 1) target specific builtins are compiler private names that no need to keep it compatible with other compilers; and 2) we already differentiate the target builtins with GCC long ago on the naming, masking etc. Currently, regardless the name, GCC uses the same C, A, B order with our existing implementation. https://gitlab.com/x86-gcc/gcc/-/blob/users/intel/liuhongt/independentfp16_wip/gcc/config/i386/avx512fp16intrin.h#L6672

I thought we were pretty consistent on names with gcc for most of sse and avx and most of avx512. The names aren't completely private occasionally users due try to use them. If we happen to have the same name we should have the same behavior to avoid confusion.

For IFMA I think I made them commutable by swapping the operands between the builtin and the internal intrinsic using the handling for the X86IntrinsicTable

Is this one https://github.com/llvm/llvm-project/commit/80c8b80919e0049da32f018d98e4d75ff562cfa8? Do you mean I should add Commutative in IntrinsicsX86.td too? What's this flag used for? I saw we only add them in a few intrinsics.

I was referring to 47e14ead54d229aa8ac5436594e91bfb8943230e for AVX512IFMA

The flag in IntrinsicsX86.td is used if the int_x86_* is used in the isel pattern, but I don't know if we have any of those any more.

Does gcc use the same builtin name? Our general policy is to have the same interface as gcc if we have a builtin. So if gcc has these builtins the should work the same way.

No. We don't sync with GCC on the builtin name during the development. We had a disscussion and decided to not keep them aligned due to 1) target specific builtins are compiler private names that no need to keep it compatible with other compilers; and 2) we already differentiate the target builtins with GCC long ago on the naming, masking etc. Currently, regardless the name, GCC uses the same C, A, B order with our existing implementation. https://gitlab.com/x86-gcc/gcc/-/blob/users/intel/liuhongt/independentfp16_wip/gcc/config/i386/avx512fp16intrin.h#L6672

I thought we were pretty consistent on names with gcc for most of sse and avx and most of avx512. The names aren't completely private occasionally users due try to use them. If we happen to have the same name we should have the same behavior to avoid confusion.

I'm not so optimistic. I had a coarse-grained statistics on the use of x86 builtins in Clang and GCC. It shows Clang only defines 2/3 of GCC's builtins and 1/4 of Clang builtins have different names with GCC's. Command below:

cat gcc/config/i386/*.h | grep -o "\b__builtin_ia32_\w\+" |sort|uniq|tee gcc.txt|wc -l
2788

ls clang/lib/Headers/*.h |grep -v fp16 |xargs cat |grep -o "\b__builtin_ia32_\w\+" |sort|uniq|tee clang.txt|wc -l
1808

comm -12 gcc.txt clang.txt |wc -l
1347

Regarding this case, we already have a different name with GCC, I think it worthwhile to use a different order for the swapping optimization.
With a bit research on AVX512IFMA, I found:

  1. The use of C, A, B order in GCC is not consistent on its AVX512IFMA builtins. It supposes GCC should change to A, B, C order if considering consistency;
  2. We aren't consistent on AVX512IFMA builtins with GCC either due to the use of select.

By the way, GCC folks told me GCC has ability to specify arbitrary operands that can be commutative. But I found both SDNode and MI only have ability on the first 2 operands, which is insufficient for instruction like CFMA. Do you know if we have other mechanism for commutable operands?

Regarding this case, we already have a different name with GCC, I think it worthwhile to use a different order for the swapping optimization.
With a bit research on AVX512IFMA, I found:

  1. The use of C, A, B order in GCC is not consistent on its AVX512IFMA builtins. It supposes GCC should change to A, B, C order if considering consistency;
  2. We aren't consistent on AVX512IFMA builtins with GCC either due to the use of select.

Do we have any builtins with the same name as gcc but different operands/behaviours? Those are the only ones that I'd be worried about.

By the way, GCC folks told me GCC has ability to specify arbitrary operands that can be commutative. But I found both SDNode and MI only have ability on the first 2 operands, which is insufficient for instruction like CFMA. Do you know if we have other mechanism for commutable operands?

Doesn't X86InstrInfo::findCommutedOpIndices handle these cases?

Does gcc use the same builtin name? Our general policy is to have the same interface as gcc if we have a builtin. So if gcc has these builtins the should work the same way.

No. We don't sync with GCC on the builtin name during the development. We had a disscussion and decided to not keep them aligned due to 1) target specific builtins are compiler private names that no need to keep it compatible with other compilers; and 2) we already differentiate the target builtins with GCC long ago on the naming, masking etc. Currently, regardless the name, GCC uses the same C, A, B order with our existing implementation. https://gitlab.com/x86-gcc/gcc/-/blob/users/intel/liuhongt/independentfp16_wip/gcc/config/i386/avx512fp16intrin.h#L6672

I thought we were pretty consistent on names with gcc for most of sse and avx and most of avx512. The names aren't completely private occasionally users due try to use them. If we happen to have the same name we should have the same behavior to avoid confusion.

I'm not so optimistic. I had a coarse-grained statistics on the use of x86 builtins in Clang and GCC. It shows Clang only defines 2/3 of GCC's builtins and 1/4 of Clang builtins have different names with GCC's. Command below:

cat gcc/config/i386/*.h | grep -o "\b__builtin_ia32_\w\+" |sort|uniq|tee gcc.txt|wc -l
2788

ls clang/lib/Headers/*.h |grep -v fp16 |xargs cat |grep -o "\b__builtin_ia32_\w\+" |sort|uniq|tee clang.txt|wc -l
1808

comm -12 gcc.txt clang.txt |wc -l
1347

Not implementing something that gcc does at least gives a compile error if some tries to use the gcc name so I’m fine with that.

Using a different name in clang for something gcc also has is kinda silly. But I guess we’re worse at this then I thought.

Using the same name and having different behavior should be avoided if the it won’t give a compile error.

Regarding this case, we already have a different name with GCC, I think it worthwhile to use a different order for the swapping optimization.
With a bit research on AVX512IFMA, I found:

  1. The use of C, A, B order in GCC is not consistent on its AVX512IFMA builtins. It supposes GCC should change to A, B, C order if considering consistency;
  2. We aren't consistent on AVX512IFMA builtins with GCC either due to the use of select.

By the way, GCC folks told me GCC has ability to specify arbitrary operands that can be commutative. But I found both SDNode and MI only have ability on the first 2 operands, which is insufficient for instruction like CFMA. Do you know if we have other mechanism for commutable operands?

That’s true for SDNode, but you can always manually add more isel patterns. We do this for FMA3.

MI uses two target specific hooks in X86InstrInfo.cpp. findCommutedOpIndices and commuteInstructionImpl are the names if I remember right.

It seems in this patch the builtins interface is aligned to intrinsics interface. Since AVX512FP16 is pretty new, I assume nobody is using the GCC builtin. Can we ask GCC guys change their builtin interface?

pengfei updated this revision to Diff 372400.Sep 13 2021, 9:59 PM

Allow MI operands of complex FMA to be commutable.

Do we have any builtins with the same name as gcc but different operands/behaviours? Those are the only ones that I'd be worried about.

I think it's rare in existing intrinsics. 1) The builtins are always straightforward passed the arguments in intrinsics in the same order; 2) We do have a few that the the order between builtins and arguments is different, I checked several manually and didn't find any problem.
I think it's a good question for new enabled ISAs, I will check FP16 when GCC patches landed.

Using a different name in clang for something gcc also has is kinda silly. But I guess we’re worse at this then I thought.

Using the same name and having different behavior should be avoided if the it won’t give a compile error.

We may diversify a few on the names with GCC in FP16, I think we can fix them once GCC landed as a nice to have.

For IFMA I think I made them commutable by swapping the operands between the builtin and the internal intrinsic using the handling for the X86IntrinsicTable

I think this is a simply approach if we want to keep the builtin in a special C, A, B order. But I don't prefer this way because we need to swapped the order 3 times, intrinsic to builtin, builtin to SDNode, SDNode to MI. Which may result confusion for future developer.

It seems in this patch the builtins interface is aligned to intrinsics interface. Since AVX512FP16 is pretty new, I assume nobody is using the GCC builtin. Can we ask GCC guys change their builtin interface?

I think we should make an agreement with them. Just found we are using different order on cfmul builtins already.

Forgot to say thanks @craig.topper and @RKSimon for the information.

craig.topper added inline comments.Sep 22 2021, 10:34 AM
clang/lib/Headers/avx512fp16intrin.h
2972

Why is this intrinsic written like this? We can't evaluate macro arguments like A and U twice. It can cause surprising results.

pengfei added inline comments.Sep 22 2021, 6:03 PM
clang/lib/Headers/avx512fp16intrin.h
2972

Thanks Craig! You are right, we can't evaluate them twice. But we have to use them twice, because we want to provide user the same functionality as former FMA intrinsics. Maybe we need to define and use temp variables.

craig.topper added inline comments.Sep 22 2021, 6:09 PM
clang/lib/Headers/avx512fp16intrin.h
2972

Is it possible to just have single builtin call and create all the IR from CGBuiltin?

pengfei added inline comments.Sep 22 2021, 6:10 PM
clang/lib/Headers/avx512fp16intrin.h
2972

Yeah, that's a good idea. Thanks!

By the way, we synced with GCC and we are using the same order in the builtins now. https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=db3b96df03fdbe2fb770729501e2e9b65e66c2da;hp=ed643e9f171e99b0aa1453b3f29ed1103e9b5c80
We still have some different builtin names due to historical reasons, e.g. __builtin_ia32_vfmulcsh_mask_round (gcc), __builtin_ia32_vfcmulcsh_mask (llvm). I think it's OK for now.
Can we let this patch in? I will solve the multi evaluation problem in another patch.

This revision is now accepted and ready to land.Sep 22 2021, 7:04 PM

Thanks Craig!

This revision was landed with ongoing or failed builds.Sep 22 2021, 8:03 PM
This revision was automatically updated to reflect the committed changes.