This is an archive of the discontinued LLVM Phabricator instance.

RFC: [openmp] Provide an assembly implementation of __kmp_invoke_microtask on ARM
ClosedPublic

Authored by mstorsjo on Nov 25 2022, 4:35 AM.

Details

Summary

This fixes passing an arbitrarily large number of arguments to
microtasks, fixing the misc_bugs/many-microtask-args.c testcase on
ARM.

I'm not entirely sure if this works on older arm architectures though;
it's been tested on armv7 (Linux and Windows, with both thumb and arm
modes).

Diff Detail

Event Timeline

mstorsjo created this revision.Nov 25 2022, 4:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 25 2022, 4:35 AM
mstorsjo requested review of this revision.Nov 25 2022, 4:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 25 2022, 4:35 AM

Not trying to end up with more comments than assembly, but it might be the easiest way to do it :)

openmp/runtime/src/z_Linux_asm.S
1425

You might want to add "for when we call pkfn below".

Since AAPCS32 says align to 4, unless at a public interface where it should be 8. Which includes calling a function.
(which evidently you knew but I did not)

Also worth saying which is the extra register.

1427

What does this load? I think p_argv and exit_frame.

1448

r3 is the arg count. So I think you are rounding up for the case where the count is an odd number, which would leave sp at 4 byte alignment? But if that's the case why do you then multiply by 8 not 4.

Because you divided by 2 in the first place? Making the calculation produce a number of 8 byte slots not 4?

Tell me where I'm going wrong there.

1461

This section you doing:

while (argc != 0) {
   copy and argument to the stack
   argc--
}

call pkfn

Correct?

1463

What does this ldr r2 and the ldr r3 below setup?

2069

And this is an elf/coff difference? No need for a equivalent coff line?

Just BTW - the assembly here isn't entirely written from scratch, but I took the existing aarch64 implementation and rewrote it, so most of the style/structure matches that.

I'll try to add comments to clarify the details that you pointed out. Thanks for the review!

openmp/runtime/src/z_Linux_asm.S
1425

Sure, can do.

1427

Yes, that's correct - I'll add a comment.

1448

Exactly - we want to adjust the stack down 4*argc bytes, but rounded up to an even increment of 8 bytes.

1461

Yes, pretty much.

1463

When calling a function, the first 4 arguments go in r0-r3 - r0/r1 are filled with &gtid and &tid, while r2/r3 should get the first arguments among the regular ones. If argc > 2, then we copy the rest of them onto the stack. In the case of aarch64, we'd copy incoming arguments into x2-x7 and the rest on the stack - here we only need to handle r2/r3 separately.

2069

.size is a ELF-specific assembly directive (and object file feature) - COFF doesn't store per-symbol sizes.

mstorsjo updated this revision to Diff 478025.Nov 25 2022, 2:41 PM

Added more comments on the assembly, simplified a few bits about the stack allocation calculation and setting up of the frame pointer.

mstorsjo updated this revision to Diff 478027.Nov 25 2022, 2:55 PM

Updated to use the assembly version on Windows on ARM too.

DavidSpickett added inline comments.Nov 29 2022, 5:46 AM
openmp/runtime/src/z_Linux_asm.S
1447

What is the -2 in argc-2? (or rather what would it be)

1452

This 1 here, is it part of the rounding or is it the extra 8 bytes for gtid and tid?

mstorsjo added inline comments.Nov 29 2022, 6:13 AM
openmp/runtime/src/z_Linux_asm.S
1447

We pass 2 arguments out of argc in registers (while the other 2 register parameters are taken by tid and gtid). In the case of aarch64 with 8 register parameters, they'd need space for argc-6 arguments on the stack.

1452

This is the extra 8 bytes for gtid and tid.

DavidSpickett added inline comments.Nov 29 2022, 6:27 AM
openmp/runtime/src/z_Linux_asm.S
1452

Got it, that's why you say above that we're over allocating. Because of that we're ok with rounding down (rounding up would need some +1 somewhere).

DavidSpickett added a comment.EditedNov 29 2022, 6:30 AM

Assembly looks good to me.

Now we need some way to tell if earlier Arm needs to be supported. Is there other existing Arm assembly for OpenMP? That would give a clue, might have to go to the git blame to find the intent when it was first added. I'm no expert on the very old architectures but I can help check it for things that look "new" or "old" (for lack of better words).

openmp/runtime/src/z_Linux_asm.S
1447

How about:

// We strictly need 4*(argc-2) bytes (2 arguments are passed in registers)
// but allocate 4*argc for....

Assembly looks good to me.

Thanks for reviewing!

Now we need some way to tell if earlier Arm needs to be supported. Is there other existing Arm assembly for OpenMP? That would give a clue, might have to go to the git blame to find the intent when it was first added. I'm no expert on the very old architectures but I can help check it for things that look "new" or "old" (for lack of better words).

No, as far as I can see, there's no other existing assembly for ARM here.

AFAIK this implementation should be interworking-safe on armv5 and above, but wouldn't necessarily work as-is for armv4t. It probably won't assemble in thumb1 mode either.

openmp/runtime/src/z_Linux_asm.S
1447

Thanks, that sounds good to me, I'll amend the patch accordingly.

The oldest architecture distro I know of for Linux is Debian (https://wiki.debian.org/ArmEabiPort) which has now moved to v5. The last supported LTS for armv4t ran out June 2022 (https://www.debian.org/releases/stretch/).

https://clang.llvm.org/docs/OpenMPSupport.html doesn't mention Arm at all. Perhaps it should.

Landing this seems like a low risk to me.

Anyone from the OpenMP side want to check my logic there? @jdoerfert perhaps?

mstorsjo updated this revision to Diff 478708.Nov 29 2022, 1:33 PM

Updated the comment as suggested.

Rebased on top of other recent related pushed commits; use the assembly instead of the C implementation on Windows (just like on aarch64), remove the XFAIL from the testcase that gets fixed by the assembly.

jdoerfert accepted this revision.Nov 30 2022, 10:46 AM

https://clang.llvm.org/docs/OpenMPSupport.html doesn't mention Arm at all. Perhaps it should.

It probably should, feel free to change that. Also openmp.llvm.org if you want to provide more information.

Landing this seems like a low risk to me.

Agreed.

Anyone from the OpenMP side want to check my logic there? @jdoerfert perhaps?

I did not check "the logic" as I am not familiar with arm but given that the test works and the assembly was proof read I think this is good to go.

This revision is now accepted and ready to land.Nov 30 2022, 10:46 AM
tianshilei1992 added inline comments.
openmp/runtime/src/z_Linux_asm.S
1429

It is reported (https://github.com/llvm/llvm-project/issues/60370) compile error encountered when compiling this. Can you take a look at it?