Page MenuHomePhabricator

[clang,ARM] Initial ACLE intrinsics for MVE.

Authored by simon_tatham on Sep 4 2019, 5:48 AM.



This commit sets up the infrastructure for auto-generating <arm_mve.h>
and doing clang-side code generation for the builtins it relies on,
and demonstrates that it works by implementing a representative sample
of the ACLE intrinsics, more or less matching the ones introduced in
LLVM IR by D67158.

Like NEON, that header file will provide a set of vector types like
uint16x8_t and C functions with names like vaddq_u32(). Unlike NEON,
the ACLE spec for <arm_mve.h> includes a polymorphism system, so that
you can write plain vaddq() and disambiguate by the vector types you
pass to it.

Unlike the corresponding NEON code, I've arranged to make every user-
facing ACLE intrinsic into a clang builtin, and implement all the code
generation inside clang. So <arm_mve.h> itself contains nothing but
typedefs and function declarations, with the latter all using the new
__attribute__((__clang_builtin)) system to arrange that the user-
facing function names correspond to the right internal BuiltinIDs.

So the new MveEmitter tablegen system specifies the full sequence of
IRBuilder operations that each user-facing ACLE intrinsic should
translate into. Where possible, the ACLE intrinsics map to standard IR
operations such as vector-typed add and fadd; where no standard
representation exists, I call down to the sample IR intrinsics
introduced in an earlier commit.

Doing it like this means that you get the polymorphism for free just
by using attribute((overloadable)): the clang overload resolution
decides which function declaration is the relevant one, and _then_ its
BuiltinID is looked up, so by the time we're doing code generation,
that's all been resolved by the standard system. It also means that
you get really nice error messages if the user passes the wrong
combination of types: clang will show the declarations from the header
file and explain why each one doesn't match.

(The obvious alternative approach would be to have wrapper functions
in <arm_mve.h> which pass their arguments to the underlying builtins.
But that doesn't work in the case where one of the arguments has to be
a constant integer: the wrapper function can't pass the constantness
through. So you'd have to do that case using a macro instead, and then
use C11 _Generic to handle the polymorphism. Then you have to add
horrible workarounds because _Generic requires even the untaken
branches to type-check successfully, and then if the user gets the
types wrong, the error message is totally unreadable!)

Diff Detail

Event Timeline

simon_tatham created this revision.Sep 4 2019, 5:48 AM

Updated along with the patches it depends on. In particular, D67159 now needs a list of valid uses of the `__clang_arm_mve_alias`, so this patch now has to have an extra piece of Tablegen backend that generates that list.

Sorry. I wasn't ignoring this (sort-of), I just knew that you were on holiday and this is a bit of a big one.

But I like this. I still have to take a deeper look into the main tablegen parts, but it looks very powerful.

I presume from here adding new intrinsics is mostly a case of adding to (with perhaps some minor parts in defs and maybe some custom bits here and there).


Do we have any/should we have some tests for these errors?


These are not used yet, right? They are not meant for the vldr gathers, those don't have polymorphic names (if I'm reading this list of intrinsics right). These are for vidup's as the comment says?


These tests all run -O3, the entire pass pipeline. I see quite a few tests in the same folder do the same thing, but in this case we will be adding quite a few tests. Random mid-end optimizations may well keep on altering them.

Is it possible to use -disable-O0-optnone and pipe them into opt -mem2reg? What would that do to the codegen, would it be a lot more verbose than it is now?

Also could/should they be using clang_cc1?


POLYMORPHIC isn't used here. Same for vcvt below (Is there a polymorphic vcvt?)


Its probably worth trying to fill in tests for most types.


Is this needed still? It seems like something that should be handled in clang-format/emacs directly.


The gathers are really a Vector of Pointers, in a way. But in the intrinsics (and IR), they are just treated as a vector of ints, so I presume a new type is not needed? We may (but not yet) want to make use of the llvm masked gathers. We would have to add codegen support for them first though (which I don't think we have plans for in the near term).



I would guess the same for other places that use snake_case.


Clang format really made a mess of this, didn't it.

Is this is old license? Should it be updated to the new one. I imagine that these generated headers might well have been missed in the switchover.

dmgreen added inline comments.Oct 2 2019, 11:12 AM



Would making this always east const be simpler? Do we generate anything with inner pointer types? Should there be a whitespace before the "const " in Name += "const "?


These don't need to be virtual and override, they can just be override. Same elsewhere.


How come this doesn't have/need a llvm_name? Are they just all handled manually?



Do you know how much this helps? Do the benefits outweigh the costs in the complexity of this code?

And do you know, once we have added a lot of instruction how long this will take to run (I know, difficult question to answer without doing it first, but is it roughly O(n log n)? And memory will never be excessive? A very unscientific test I just ran made it seem pretty quick, but that was obviously without the number of intrinsics we will have in the end). We should try and ensure that we don't make this a compile time burden for llvm.


Is it worth giving this a different name to more obviously distinguish it from llvm::Value?


What is this doing?



And why a list, out of interest?


Do you envision this may require float const or vector casts in the future?


Some of these could use StringRef's more liberally, here and elsewhere.


Can this use (something like):
bool Constant = llvm::all_of(kv.second, [&](const auto &OI) { return OI.ParamValues[i] == kv.second[0].ParamValues[i]; } ?

simon_tatham marked 20 inline comments as done.Oct 4 2019, 8:29 AM
simon_tatham added inline comments.

By the time we finish implementing all the intrinsics, there will be tests for these errors. The intrinsics that need them aren't in this preliminary commit, though.


I've defined here the complete list of polymorphic name types that will be needed for all the MVE intrinsics. I haven't included an example of every single one in this preliminary set, though.

Yes, I think PNT_WB in particular is only used for the vidup family.


The immediate problem is that if you use any form of clang | opt | FileCheck command line, then says 'skipping non-FileChecked line', because it doesn't support anything more complicated than a two-stage pipeline consisting of clang and FileCheck.

I've enhanced update_cc_test_checks to handle that case, in D68406, and respun these tests accordingly. But if that patch doesn't get approval then I'll have to rethink this again.

(For vld24.c I ended up running opt -sroa -early-cse to avoid the IR becoming noticeably more verbose. The rest worked fine with just mem2reg, though.)

Patching to support more complex pipelines, it seems to work OK: most codegen changes are trivial ones, such as modifiers not appearing on IR operations (call instead of tail call, plain shl in place of shl nuw). Only vld24 becomes significantly more complicated: for that one file I had to run opt -mem2reg -sroa -early-cse instead.


Oh yes, now I look on Stack Overflow there is a way to make emacs stop indenting things inside namespaces. Thanks for the prod :-)


Yes, I'm working so far on the assumption that we don't need to represent scatter/gather address vectors as a special vector-of-pointers type.

(If nothing else, it would be a bit strange for the 64-bit versions, where the vector element isn't even the same size as a pointer.)

If auto-generation of gather loads during vectorization turns out to need a special representation, then I suppose we'll have to rethink.


There shouldn't be a whitespace, because clang-format agrees that the right spacing is char *const p and not char * const p :-)

But, hmm, now you mention it, it may well be that we never actually need a pointer-to-pointer type in the current state of the ACLE spec. Previous versions did have pointers to pointers, for written-back base addresses in contiguous loads/stores, but those intrinsics were removed in later drafts. So perhaps you're right that this is now unnecessary complexity.


Yes: the multi-vector types like int32x4x2_t are only used by the vld2/vst4 family of intrinsics, and those are the ones I implemented using handwritten codegen in C++, precisely because they use struct types that are needed by only a tiny set of intrinsics, so it seemed pointless to build a huge amount of extra very general machinery into this tablegen system when it was quicker to just handwrite the results for that handful of intrinsics.

I'll add a comment explaining that.


I don't know for sure whether this merging system is worth it. It's hard to judge without having implemented the full set of intrinsics.

I expect it to be algorithmically more or less O(n log n). My inner data-structures nerd (ok, not very inner) isn't very fond of the string-keyed std::map that groups together all the related pieces of code, on the grounds that for string-keyed structures tries are asymptotically better than the easy approach of doing O(log n) strcmps, but I think this case is unlikely to run into the pathological case of the latter, so it should be fine.

(If there were a trie available in include/llvm/ADT, I'd love to use it, but I didn't find one...)

So I think it shouldn't be excessively slow at Tablegen time; and since its effect is to reduce the amount of C++ that has to be processed in a single function by a full optimizing compiler later in the clang build, my instinct is that it's likely to save time overall.


I agree, although it's tricky to think of a good name. I've gone for Result.


Hmm, apparently nothing – if I remove it, tablegen gives identical output. Well spotted. Possibly it was a relic of some earlier idea I had of how to do the two-pass parameter allocation.


Just on the general principle of 'don't demand more capabilities from your data structure than you actually need' – I didn't need indexed random access in this case, and a list is the simplest structure that supports constant-time append and iteration. I can make it a vector if you prefer. (Perhaps the constant factors are better.)


I certainly don't expect to need float casts, because they're destructive of information (so using one should never be the right codegen strategy for any MVE intrinsic), and also because floats never need widening to fit in an Arm register: all the vector instructions that deal with scalar floats (which as far as I can remember is just the lane get/set family) can handle each float in its existing format. (Unlike <32 bit ints, for which it will be very common that the underlying LLVM IR intrinsic finds it easier to deal in register-sized i32, so the clang codegen will need to widen and narrow appropriately.)

Bitcasting between vector types is a possibility, I suppose. We can always add another case in here if we turn out to need it.


Seems likely! I've updated it to the same license that's at the top of this source file itself.


Can't say kv.second[0] when kv.second is a set rather than a vector, but apart from that, yes, that seems to work fine.

Updated for (I think) all review comments so far.

This is looking good to me. My understanding is that is has some dependencies? The llvm side will likely needed to go in first, plus a couple of clang patches?


OK. That's fair.


Yeah, they look OK. Thanks for making the change. It looks like a useful feature being added.


Tests for bottom too would be good to see too. In general, more tests are better.


Are these tests still using old names? I'm not missing something about how these are generated, am I?


Sounds good.


Ah, I see. My brain and clang format often disagree about where the whitespace around a * should be :)


Yeah, OK. Makes sense. I was thinking about it from the other way around; it doesn't need to insert into the middle of the list anywhere, so my default would be a vector. The spatial locality can be very nice.

No opinion which is really better though. Whatever you think.

Rebased to current master, and updated to match the renamed IR intrinsics.

Minor update for the renamed vcvt intrinsic in my latest version of D67158.

dmgreen accepted this revision.Oct 23 2019, 4:38 AM

I've read this through again and it looks good. If no one else has any issues, then LGTM.

This revision is now accepted and ready to land.Oct 23 2019, 4:38 AM

(Rebased to current master; no change)

This revision was automatically updated to reflect the committed changes.

Hmm.. Let me take a look. There's a different error on the same build now, but I think it's just hiding this one.

I'll also try and fix the tests that are failing in places too, if I can.

I've hopefully fixed the build in rG7b3de1e81197, but it's hard to tell for sure with the other error.

I also fixed the tests in rG78700ef8866d (and worked out how to remove change-id's from commit messages).

Let us know if anything else looks broken.

thakis added a subscriber: thakis.Oct 24 2019, 11:21 PM
thakis added inline comments.

nit: These files are listed alphabetically.

simon_tatham marked an inline comment as done.Oct 25 2019, 1:25 AM
simon_tatham added inline comments.
rnk added a subscriber: rnk.Jan 17 2020, 4:34 PM

ArmMveAliasValid is the second slowest function to compile in all of clang, according to ClangBuildAnalyzer:

We shouldn't spend 28 CPU seconds per rebuild on a switch. Please do something to speed it up. Generally, it is best to emit tables instead of code whenever possible.

rnk added a comment.Jan 18 2020, 1:02 PM

I posted a patch to speed things up:
The numbers are different, since 9 seconds is measured in a local linux build with no debug info, vs. 28s I measured when building at work, on Windows, with debug info.