This is an archive of the discontinued LLVM Phabricator instance.

[compiler-rt][aarch64] Add SME ABI support routines.
ClosedPublic

Authored by sdesmalen on Jun 29 2023, 12:03 AM.

Details

Summary

When compiling for SME and using the attributes to use PSTATE.ZA,
Clang will emit calls to SME ABI support routines to save and
restore ZA state.

Diff Detail

Event Timeline

sdesmalen created this revision.Jun 29 2023, 12:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 29 2023, 12:03 AM
sdesmalen requested review of this revision.Jun 29 2023, 12:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 29 2023, 12:03 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript
compiler-rt/lib/builtins/aarch64/sme-abi.S
15

I cannot see any uses of RDSVL_X15_1.

40

Is calling __arm_sme_state strictly necessary? Not sure if I've missed something but it looks like we only abort based on the value of TPIDR2_EL0 or the block to which it points. Which means we only abort when SME is known available and so can read SVCR directly?

41

Is masking necessary here? given you're only testing a single bit where you could use x0 directly.

47

Not sure if this merit in this idea but given you don't need to preserve state I wondered if a normal branch would be better here. This way any backtrace will show the parent SME ABI function, which might be more meaningful? This kind of going in hand with my following comments about not needing to persevere x29 and x30 within those ABI functions.

55–56

This looks weird. Shouldn't x16 already contain the contents of CPTR_EL3 plus you already know SME is not available by this point of execution.

62

Can this be hex so it's easier to understand what's happening?

66

You already know SME is available by this point. Looking at the code I think you it can be simplified to:

if SME not available goto A:
do SME stuff
ret;

A:
x0 = x1= 0;
reet
74

Is this necessary?

75

The calling convention says x14-x18 are free and looking at the code you're not using x17 or x18 so perhaps don't need to split x12?

89

Using addsvl here will remove the need for rdsvl and free up another register.

103

Is this necessary? I cannot see any legitimate changes of these registers.

104

As above there shouldn't be a need to use x12.

123

As above, can you use addsvl here?

142–143

By this point you already know SME is available and this the branch will never be taken.

165

Can you use xzr directly?

MaskRay added a subscriber: MaskRay.

Add code owner @compnerd

compiler-rt/lib/builtins/aarch64/sme-abi.S
7

I feel that this guard is not necessary. we should just not compile these files for non-aarch64.

rsandifo-arm added inline comments.Jul 12 2023, 2:07 AM
compiler-rt/lib/builtins/aarch64/sme-abi.S
40

True, but it might be more future-proof to assume that “SME is available” is a sufficient but not necessary condition for “TPIDR2_EL0 is available”. So SME ⇒ TPIDR2_EL0 is safe but TPIDR2_EL0 ⇒ SME seems less safe.

40

Since this function can change VL, it should save the old VG value on the stack: https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst#changes-in-vector-length-alpha

compnerd added inline comments.Jul 25 2023, 7:58 AM
compiler-rt/lib/builtins/aarch64/sme-abi.S
7

I believe that CMake is already setup that way. __arch64__ may also be spelt _M_ARM64 or __arm64__. I think that doing what @MaskRay is suggesting here and dropping the guard is the better thing.

47

IIRC we did have some macros for function names. Please do not call abort here but rather with macro wrapping it to account for __USER_LABEL_PREFIX__.

Can't we compile these instructions using the just-built clang and therefore directly use the real assembly mnemonics?

sdesmalen updated this revision to Diff 557178.Sep 21 2023, 7:36 AM
sdesmalen marked 20 inline comments as done.

Thanks everyone for your comments. I've tried to address all of them. It did mean rewriting large parts of the routines and adding some new functionality.

Changes:

  • Avoid spilling/filling of x29 and x30, mostly by making the calls to do_abort a tail call.
  • Added a new file sme-abi-init.c which adds code to test at compiler-rt startup whether SME is available on the platform.

Can't we compile these instructions using the just-built clang and therefore directly use the real assembly mnemonics?

We could, but there is no requirement that compiler-rt must be built with a compiler that supports these mnemonics, it's perfectly fine to build compiler-rt with an older compiler.

compiler-rt/lib/builtins/aarch64/sme-abi.S
7

Removing it makes sense. I mostly just copied what was done in other files.

40

Thanks, I didn't realise that was required!

47

Not sure if this merit in this idea but given you don't need to preserve state I wondered if a normal branch would be better here. This way any backtrace will show the parent SME ABI function, which might be more meaningful? This kind of going in hand with my following comments about not needing to persevere x29 and x30 within those ABI functions.

I think that you're right that we can use b instead of bl for the other ABI functions, but for do_abort specifically we'll need to use bl, as the unwinder may need to come back to this point to restore VG, as @rsandifo-arm pointed out.

74

I've tried to remove unnecessary spills/fills of x29/x30 (which mostly comes from tail call optimising do_abort)

75

I think it did this because only x12-x15 can be used for the ldr za addressing mode. If we rename some of the registers and use x17/x18, then spilling x12 is indeed unnecessary. Anyway, I've tried to address this in the latest revision.

Matt added a subscriber: Matt.Sep 21 2023, 3:37 PM
nsz added inline comments.Sep 25 2023, 6:35 AM
compiler-rt/cmake/builtin-config-ix.cmake
42

AT_HWCAP2 and HWCAP2_SME are linux ABI (cannot change and no need to check if they are defined in the libc headers on a linux platform), only the getauxval symbol needs a check.

on non-linux platforms with SME support the flag may not be in AT_HWCAP2 nor at 1<<23 bit.

actually getauxval is the wrong symbol: calling that from compiler-rt violates iso c and posix link namespace so it can break conforming code.

glibc has __getauxval for this purpose (i guess bionic, musl etc needs to add this symbol too to allow checking auxval bits from a compiler runtime)

compiler-rt/lib/builtins/aarch64/sme-abi-init.c
6 ↗(On Diff #557178)

why unsigned long and not say bool?

9 ↗(On Diff #557178)

why a separate value when it is always the same as __aarch64_has_sme and always used together?

i think it is enough to add a separate symbol when there is logic that sets them differently (or used in different places but then the definitions should be in different TU for static linking).

24 ↗(On Diff #557178)

i'm not sure if weakref is a good idea.

in case of static linking weakref does not cause link of the definition from a library archive so libc can only define it in a crt.o, which then cannot be optimized away when SME is not in use.

in case of shared compiler-rt lib this is ok i guess (with the caveat that -Wl,-as-needed would not link the shared lib with the definition, but since we expect the definition in libc.so that can be assumed to be linked for other reasons)

sdesmalen updated this revision to Diff 557485.Sep 29 2023, 8:30 AM
sdesmalen marked 4 inline comments as done.

Addressed @nsz's review comments, specifically:

  • Merged __aarch64_has_sme and __aarch64_has_tpidr2_el0 into a single __aarch64_has_sme_and_tpidr2_el0
  • Also changed the type of this variable from unsigned long -> _Bool
  • Test for AT_HWCAP2 instead of getauxval(AT_HWCAP2) in CMake config
  • Use __getauxval() instead of getauxval()
  • No longer use weak ref for __aarch64_sme_accessible

We should move this onto GitHub PRs.

Anyway, having a static initializer seems like a bad idea. It seems wasteful to force static initializers to be generated & run just so we can check if we have SME, which in reality is never going to be false if we're in these routines. @rsandifo-arm can we relax this check requirement?

We should move this onto GitHub PRs.

Anyway, having a static initializer seems like a bad idea. It seems wasteful to force static initializers to be generated & run just so we can check if we have SME, which in reality is never going to be false if we're in these routines. @rsandifo-arm can we relax this check requirement?

The requirement exists for streaming-compatible calls to normal functions.

A streaming-compatible function that calls printf (say) must force PSTATE.SM to 0 before the call and then restore it to its original value after the call. To do that, it needs to read the current PSTATE.SM. But in general, streaming-compatible functions are agnostic not just about whether the processor is in streaming mode, but also about whether SME is present at all. E.g. it's possible to compile a streaming-compatible function with base -march=armv8-a (without even SVE).

So a general streaming-compatible function can't read SVCR unconditionally. There needs to be a “safe” way of reading PSTATE.SM without knowing whether SME is present.

nsz added a comment.Oct 3 2023, 5:54 AM

A streaming-compatible function that calls printf (say) must force PSTATE.SM to 0 before the call and then restore it to its original value after the call. To do that, it needs to read the current PSTATE.SM. But in general, streaming-compatible functions are agnostic not just about whether the processor is in streaming mode, but also about whether SME is present at all. E.g. it's possible to compile a streaming-compatible function with base -march=armv8-a (without even SVE).

yeah and the unwinder has to deal with SM and ZA states too and must work without SME (and often not part of the libc so does not have direct information about the presence of SME).

so the libc needs an ABI to detect if SME is accessible, if this ABI is fast then no global initializer is needed for caching the result in principle:
e.g. the sme runtime functions can be ifunc dispatched (works on glibc) or the libc can set a flag at a fixed offset from TP.

i think it makes sense to have a generic solution that a platform may override with optimizations if necessary.

sdesmalen updated this revision to Diff 557608.Oct 5 2023, 3:48 AM
  • Use ldrb instead of ldr for loading _Bool (this solves relocation errors)
  • Use /*AT_HWCAP2*/26 and /*HWCAP2_SME*/(1 << 23) directly without checking for support at compiler-rt build time. We might otherwise incorrectly assume SME is not available based on the target we compile for, even when the eventual runtime supports it.
paulwalker-arm added inline comments.Oct 6 2023, 7:02 AM
compiler-rt/cmake/Modules/AddCompilerRT.cmake
315 ↗(On Diff #557608)

The next block suggests this can be just type?

compiler-rt/lib/builtins/aarch64/sme-abi-init.c
18 ↗(On Diff #557608)

If missing defines is a concern the following idiom is better because it means we'll use the correct value if for some reason it is changed.

#ifndef AT_HWCAP2                                                                   
#define AT_HWCAP2 26                                                                
#endif

I can see this is the style used by lib/builtins/cpu_model.c also.

38 ↗(On Diff #557608)

It is defined somewhere what 90 means? and thus why that's the chosen priority.

compiler-rt/lib/builtins/aarch64/sme-abi.S
7

Is it worth pointing to the ABI specification as documentation for these routines?

14

RDSVL_X14_1 looks unused?

30

I ask this below but can we not assume SVE is available?

44

My dwarf knowledge is not great but is this safe? I mean you're essentially marking uninitialised data until you actually store the value of VG. What happens if we start unwinding from inside the call to __arm_sme_state? Is it possible to store the result of cntd as part of the prologue or can we not assume SVE is available?

70–71

Are the uses of SYMBOL_NAME required here? You're defining the name above so I'm assuming it cannot be changed by the time you get here?

120

As above, is SYMBOL_NAME required here?

aemerson added inline comments.Oct 6 2023, 6:54 PM
compiler-rt/lib/builtins/aarch64/sme-abi.S
70–71

For cross-platform code, this is needed so that the right prefix gets added to the symbol name. For example on Darwin it prepends "_".

sdesmalen updated this revision to Diff 557647.Oct 9 2023, 5:41 AM
sdesmalen marked 6 inline comments as done.

Addressed review comments:

  • Use #ifndef <macro>, #define <macro> for HWCAP macros.
  • Removed SYMBOL_NAME(..) for cases where mangling is not allowed (the public ABI routines)
  • Call cntd earlier to populate VG value on the stack.
  • Use manual CFI expression for VG if the compiler does not support SME assembler syntax.
compiler-rt/lib/builtins/aarch64/sme-abi-init.c
38 ↗(On Diff #557608)

https://clang.llvm.org/docs/AttributeReference.html#constructor

Specifically:

The priority is given as an integer constant expression between 101 and 65535 (inclusive). Priorities outside of that range are reserved for use by the implementation. A lower value indicates a higher priority of initialization.

I interpreted this as priorities between the range 101 and 65535 being user-space initialisation, meaning that a value of 90 will mean this constructor is called as part of the compiler's (implementation, i.e.) internal runtime and executed before any user-space initialisation.

compiler-rt/lib/builtins/aarch64/sme-abi.S
30

At runtime, yes, but the compiler may not be able to support the asm syntax.

44

Good point, I think this is indeed unsafe. We can assume that SVE is available at runtime, so we can execute cntd earlier.

paulwalker-arm accepted this revision.Oct 11 2023, 6:41 AM

A couple of extra optimisations but otherwise this looks good to me.

compiler-rt/lib/builtins/aarch64/sme-abi.S
54–55

Can you just tbz to 2 and remove the need for the extra branch?

147

Looking at the docs both x17 and x18 are available? Which means if you use x17 to hold REG_TPIDR2_EL0 then you can remove this extra load.

This revision is now accepted and ready to land.Oct 11 2023, 6:41 AM
sdesmalen updated this revision to Diff 557684.Oct 11 2023, 8:30 AM

Removed unnecessary branch and load (addressing @paulwalker-arm's comments)

This revision was automatically updated to reflect the committed changes.

This commit seems be causing build failures on green dragon:
https://green.lab.llvm.org/green/job/clang-san-iossim/19806/consoleFull#25062429549ba4694-19c4-4d7e-bec5-911270d8a58c
https://green.lab.llvm.org/green/job/clang-stage1-RA/lastFailedBuild/consoleFull#20418663849ba4694-19c4-4d7e-bec5-911270d8a58c

/var/folders/09/r4vw4v8n5kb67jl66zvlbljw0000gn/T/sme-abi-e48b7e.s:55:3: error: ADR/ADRP relocations must be GOT relative
  adrp x16, ___aarch64_has_sme_and_tpidr2_el0
  ^
/var/folders/09/r4vw4v8n5kb67jl66zvlbljw0000gn/T/sme-abi-e48b7e.s:55:3: error: unknown AArch64 fixup kind!
  adrp x16, ___aarch64_has_sme_and_tpidr2_el0
  ^
/var/folders/09/r4vw4v8n5kb67jl66zvlbljw0000gn/T/sme-abi-e48b7e.s:56:3: error: unknown AArch64 fixup kind!
  ldrb w16, [x16, :lo12:___aarch64_has_sme_and_tpidr2_el0]
  ^
/var/folders/09/r4vw4v8n5kb67jl66zvlbljw0000gn/T/sme-abi-e48b7e.s:105:3: error: ADR/ADRP relocations must be GOT relative
  adrp x14, ___aarch64_has_sme_and_tpidr2_el0
  ^
/var/folders/09/r4vw4v8n5kb67jl66zvlbljw0000gn/T/sme-abi-e48b7e.s:105:3: error: unknown AArch64 fixup kind!
  adrp x14, ___aarch64_has_sme_and_tpidr2_el0
  ^
/var/folders/09/r4vw4v8n5kb67jl66zvlbljw0000gn/T/sme-abi-e48b7e.s:106:3: error: unknown AArch64 fixup kind!
  ldrb w14, [x14, :lo12:___aarch64_has_sme_and_tpidr2_el0]
  ^
/var/folders/09/r4vw4v8n5kb67jl66zvlbljw0000gn/T/sme-abi-e48b7e.s:144:3: error: ADR/ADRP relocations must be GOT relative
  adrp x14, ___aarch64_has_sme_and_tpidr2_el0
  ^
/var/folders/09/r4vw4v8n5kb67jl66zvlbljw0000gn/T/sme-abi-e48b7e.s:144:3: error: unknown AArch64 fixup kind!
  adrp x14, ___aarch64_has_sme_and_tpidr2_el0
  ^
/var/folders/09/r4vw4v8n5kb67jl66zvlbljw0000gn/T/sme-abi-e48b7e.s:145:3: error: unknown AArch64 fixup kind!
  ldrb w14, [x14, :lo12:___aarch64_has_sme_and_tpidr2_el0]
  ^

We are seeing build problems because of this change on some of our systems. I am suspecting there might be multiple problems, but let's start at the very beginning with:

When compiling for SME ...

New files sme-abi.S and aarch64/sme-abi-init.c get built depending on successful compilation of:

builtin_check_c_compiler_source(COMPILER_RT_HAS_ASM_SME
"
asm(\".arch armv9-a+sme\");
asm(\"smstart\");
")

This check depends on the host compiler and assembler du jour and whether that can assemble SME or not.

That doesn't seem like the way we want to do architecture extensions checks. But perhaps I am missing something @sdesmalen / @paulwalker-arm ?

We are seeing build problems because of this change on some of our systems. I am suspecting there might be multiple problems, but let's start at the very beginning with:

When compiling for SME ...

New files sme-abi.S and aarch64/sme-abi-init.c get built depending on successful compilation of:

builtin_check_c_compiler_source(COMPILER_RT_HAS_ASM_SME
"
asm(\".arch armv9-a+sme\");
asm(\"smstart\");
")

This check depends on the host compiler and assembler du jour and whether that can assemble SME or not.

That doesn't seem like the way we want to do architecture extensions checks. But perhaps I am missing something @sdesmalen / @paulwalker-arm ?

Are you using the latest top-of-tree? We should only build these files when targeting a platform with sys/auxv.h or baremetal, see https://github.com/llvm/llvm-project/pull/69423.

Are you using the latest top-of-tree? We should only build these files when targeting a platform with sys/auxv.h or baremetal, see https://github.com/llvm/llvm-project/pull/69423.

Yes, I am building ToT OSS Clang/LLVM. I see that check:

if(COMPILER_RT_HAS_ASM_SME AND (COMPILER_RT_HAS_AUXV OR COMPILER_RT_BAREMETAL_BUILD))

COMPILER_RT_HAS_ASM_SME is true and since I see auxv.h on our systems, COMPILER_RT_HAS_AUXV is true.
And it must be true, because I see compilation of sme-abi.S which is where we have problems (that's probably another problem).
The first problem is that I don't expect COMPILER_RT_HAS_ASM_SME to be true on our system. It happens to be true, because a GNU as is picked up that is happy to assemble SME, but I don't think that is the check that we should be doing.

Can this be reverted? OSS Clang/LLVM builds are failing on our AArch64 platforms.

The first problem is that I don't expect COMPILER_RT_HAS_ASM_SME to be true on our system. It happens to be true, because a GNU as is picked up that is happy to assemble SME, but I don't think that is the check that we should be doing.

If the toolchain you're using to build compiler-rt for the selected target supports assembling SME instructions, then there shouldn't be a problem assembling sme-abi.S. A similar approach is taken for the LSE builtins.

What exactly is the problem that you're encountering with building sme-abi.S?

The first problem is that I don't expect COMPILER_RT_HAS_ASM_SME to be true on our system. It happens to be true, because a GNU as is picked up that is happy to assemble SME, but I don't think that is the check that we should be doing.

If the toolchain you're using to build compiler-rt for the selected target supports assembling SME instructions, then there shouldn't be a problem assembling sme-abi.S. A similar approach is taken for the LSE builtins.

This results in completely arbitrary results. On one ubuntu system binutils 2.34 is picked up that doesn't support SME, on another system binutils 2.38 is picked up which apparantly does support it.
My question is if the binutils "du jour" should be determining this, or something else?

I have indeed seen the precedent for LSE, but I am not sure I agree with that one either (but I don't mind at the moment).

What exactly is the problem that you're encountering with building sme-abi.S?

We are getting a build failure:

compiler-rt/lib/builtins/aarch64/sme-abi.S: Assembler messages:
compiler-rt/lib/builtins/aarch64/sme-abi.S:38: Error: bad register expression

Apparently the GNU assembler doesn't like this:

.cfi_offset vg, -16

I have not yet looked what this is about, if this GNU assembler is not supporting this or if something else is going on...

My question is why we are compiling this in the first place.

Just wanted to add that I am running vanilla Ubuntu 20.04.5 LTS and Ubuntu 22.04.1 LTS.

I really want to get my builds going again. I will revert this on Monday.

My question is why we are compiling this in the first place.

As far as I understand it compiler-rt should provide any runtime support that Clang supports. In this case, when building a toolchain for AArch64, Clang supports the SME ACLE intrinsics and so compiler-rt should also provide the corresponding SME ABI routines. If for some reason the toolchain used to build compiler-rt is not 'new' enough to support the SME assembly, CMake emits a warning that the SME ABI routines are not built. In this case it seems that the issue here is that assembler we use to build compiler-rt supports SME but apparently not the SVE vg in the .cfi_offset directive.

Can you try replacing .cfi_offset vg, -16 by .cfi_offset 46, -16 and let me know if that solves the build issues you encountered?

Looking at the discussion of initialization, was there any consideration of lazily initializing the "__aarch64_has_sme_and_tpidr2_el0" bit? I mean, I see an explanation for why the bit is cached, but I don't see any constraint on when that computation happens.

nsz added a comment.Nov 8 2023, 1:46 AM

Looking at the discussion of initialization, was there any consideration of lazily initializing the "__aarch64_has_sme_and_tpidr2_el0" bit? I mean, I see an explanation for why the bit is cached, but I don't see any constraint on when that computation happens.

the first use can be in an async signal handler and getauxval is not documented to be async signal safe (i'd expect it to work in practice, but there is no guarantee).

and i think a lot of registers have to be preserved (including sve regs which can cause stack overflow on a large sve implementation) as well as errno (getauxval clobbers it in case the kernel does not set AT_HWCAP2).

(you would also need an extra conditional branch to check for init and keep the is-initialized and has-sme bits in the same word to avoid atomics in the hot path)