This is an archive of the discontinued LLVM Phabricator instance.

[Clang][AArch64] Inline assembly support for the ACLE type 'data512_t'.
ClosedPublic

Authored by labrinea on Jan 5 2021, 9:12 AM.

Details

Summary

When generating code to access inline assembly operands, clang passes them either by value or by reference depending on the type of data. Input operands that are either scalars or scalarizable aggregates are loaded into registers before the inline assembly call. Similarly, output operands of the same kind are stored to memory following the inline assembly call. To perform such loads and stores, clang has to bitcast the operand type to an integer first. This does not work for the ACLE type data512_t, which is essentially the aggregate type { [8 x i64] }: we could in theory use i512 and let the backend deal with it, but clang wouldn't be able to emit the store, as there is no qualified type for such a large integer:

(from clang/test/CodeGen/X86/x86_64-PR42672.c)

// Check Clang reports an error if attempting to return a big structure via a register.
void big_struct(void) {
#ifdef IMPOSSIBLE_BIG
  struct {
    long long int v1, v2, v3, v4;
  } str;
  asm("nop"
      : "=r"(str));
#endif
}
73 // CHECK-IMPOSSIBLE_BIG: impossible constraint in asm: can't store value into a register
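
For contrast, here is a minimal sketch (assuming an LP64 target) of a small scalarizable aggregate that is already accepted today: the 8-byte struct is bitcast to i64, so the output operand fits in a single register.

void small_struct(void) {
  struct {
    int a, b;
  } s;
  // The 8-byte aggregate is scalarizable: clang bitcasts it to i64,
  // so "=r" is accepted and the store back to s follows the asm call.
  asm("nop"
      : "=r"(s));
}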

Since clang's preference for scalarizable aggregates is to pass them by value, we need a way to tell whether a very large scalarizable aggregate can be handled by the backend. In that case we should pass it by reference instead of emitting a compilation error for large output operands. Input operands are not a problem, as clang currently loads them into registers only if they are at most 64 bits wide and a power of two; anything else is passed by reference. This patch adds a target hook to determine whether an aggregate value of a given size can be handled by the backend.
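
As an illustration of what the patch aims to enable, here is a hedged sketch: the wrapper name is hypothetical, data512_t is assumed to follow its arm_acle.h definition as a struct of eight uint64_t, and the exact lowering of the operand (by reference, or as an i512 value in later revisions of this patch) is discussed below. The point is that the 512-bit "=r" output operand no longer hits the "impossible constraint" diagnostic shown above.

#include <stdint.h>

/* Assumed to mirror the ACLE definition in arm_acle.h:
   a 512-bit aggregate wrapping eight 64-bit elements. */
typedef struct { uint64_t val[8]; } data512_t;

/* Hypothetical wrapper: LD64B fills all eight lanes of the 512-bit
   output operand in one asm statement. With the patch, clang accepts
   the "=r" output instead of diagnosing an impossible constraint. */
void load512(data512_t *out, const void *addr)
{
    __asm__ volatile("ld64b %0,[%1]"
                     : "=r"(*out)
                     : "r"(addr)
                     : "memory");
}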

Diff Detail

Event Timeline

labrinea requested review of this revision.Jan 5 2021, 9:12 AM
labrinea created this revision.
Herald added a project: Restricted Project.Jan 5 2021, 9:12 AM
Matt added a subscriber: Matt.Jun 24 2021, 11:00 AM
labrinea updated this revision to Diff 355908.Jul 1 2021, 10:13 AM
labrinea retitled this revision from [Clang] Inline assembly support for the ACLE type 'data512_t'. to [Clang][AArch64] Inline assembly support for the ACLE type 'data512_t'..
labrinea edited the summary of this revision. (Show Details)
labrinea added a reviewer: momchil.velikov.

I'm confused what your goal here is, exactly. The point of allowing 512-bit inline asm operands is presumably to allow writing efficient code involving inline asm... but you're intentionally destroying any potential efficiency by forcing it to be passed/returned in memory. If the user wanted to do that, they could just use an "m" constraint.

It looks like SelectionDAG currently crashes if you try to pass an array as an inline asm operand, but that should be possible to fix, I think.

I have explained in the description why I am doing this: i512 is not a qualified type, so it is not possible to emit the store instruction required for output operands (line 2650 in the original code of clang/lib/CodeGen/CGStmt.cpp). As I said, clang already has tests in place for this case (clang/test/CodeGen/X86/x86_64-PR42672.c - function big_struct), so I don't see how I am destroying the efficient codegen, which only applies to small integers (because they have a qualified type). Can you suggest a better solution?

Regarding the Selection DAG, my patches https://reviews.llvm.org/D94096 and https://reviews.llvm.org/D94097 are adding support for this use case in the backend. @t.p.northover has raised a concern there too, so maybe my original set of patches (including a dedicated IR type) in the RFC https://lists.llvm.org/pipermail/llvm-dev/2020-November/146860.html were a better fit?

The part I'm confused about is that you're forcing it to use "*r". At the IR level, LLVM handles something like call void asm sideeffect "#$0", "r"([8 x i64] %c) fine. You'll have to do a bit of work to teach clang to emit that, but it shouldn't be that hard. I think you can deal with it on the isel end with some relatively small changes to D94097.

If you discard my patch and look at the codegen for __asm__ volatile ("st64b %0,[%1]" : : "r" (*input), "r" (addr) : "memory" );, which uses the struct foo as an input operand, you'll see that clang is already passing it by reference. All I am doing is making this behavior consistent for output operands too. Whether llvm can deal with indirect asm register operands or not is a separate story (see llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:8740). I think that making clang emit what you suggested (passing [8 x i64] by value) is inevitably going to be inelegant, in a similar way to the previous revision of this patch. Moreover, taking this route entails introducing more inelegant changes in D94097 (workarounds for MVT::i64x8 in getCopyToParts() of the same file I previously mentioned). I have been trying all of the above without success and can continue my efforts for a little longer, but in my honest opinion I don't see the benefit.
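
Spelled out as a compilable sketch (the function name is hypothetical), the input-operand case referred to above looks roughly like this; it already works without the patch because the large aggregate is passed indirectly:

struct foo { unsigned long long x[8]; };

/* Input operand: the dereferenced 512-bit aggregate is too large to
   scalarize, so clang passes its address ("*r") rather than a value. */
void store(struct foo *input, void *addr)
{
    __asm__ volatile("st64b %0,[%1]"
                     :
                     : "r"(*input), "r"(addr)
                     : "memory");
}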

but in my honest opinion I don't see the benefit.

The problem is, there isn't really any point to supporting "register" operands in this state. LLVM will never optimize an indirect register into a direct register, so we're guaranteed to generate an ld64b just before the inline asm block for inputs, and an st64b just after the inline asm block for outputs. At that point, it's not really any better than something like __asm__ volatile ("ld64b x0, [%0]; st64b x0,[%1]" : : "r" (input), "r" (output) : "memory", "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7" );.

That is, unless we care about source compatibility with some other compiler that supports this, I guess.
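
For comparison, the hand-written equivalent mentioned above, expanded into a compilable sketch (the wrapper name is hypothetical):

/* The asm block performs the ld64b and st64b itself and simply clobbers
   x0-x7, so no 512-bit operand is needed; the argument above is that an
   indirect ("*r") 512-bit operand buys nothing over this form. */
void copy512(const void *input, void *output)
{
    __asm__ volatile("ld64b x0, [%0]; st64b x0,[%1]"
                     :
                     : "r"(input), "r"(output)
                     : "memory", "x0", "x1", "x2", "x3",
                       "x4", "x5", "x6", "x7");
}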

Ok, I've tried a few things. If we add a couple of new target hooks, we can make clang pass both input and output asm operands by value as the type { [8 x i64] }, avoiding the integer conversion. One issue with that is that the inline asm verifier asserts if an inline asm statement returns a struct with a single result (struct return types are meant to carry multiple results). By making adjustments to the existing target hook adjustInlineAsmType() we can even alter the asm operand type and make it [8 x i64], for example, if that's preferable. Adding new calls to this hook without removing the existing ones would look ugly though, and restructuring them is challenging given the complexity of the 400-line function CodeGenFunction::EmitAsmStmt, which needs tidying up.

Unfortunately this is only half of the story: by choosing an aggregate type for the asm operands we allow InstCombine (at -O1 and above) to turn the load/store instructions before/after the inline asm statement into insert/extract element plus smaller loads/stores. I see two problems with that. Firstly, the information that the load/store comes from an inline asm operand gets lost by the time the SelectionDAG processes those nodes, and so we cannot use a target hook to select a special value type for them (as discussed in D94097 we want to narrow down the MVT specialization for an llvm type to only apply to asm operands and not universally). Moreover, having insert/extract element is pointless when the backend expects a load/store of MVT::i64x8 for custom lowering.

All that said, I think the best choice is to use i512 for the asm operands, since llvm cannot optimize that. The only change in clang's user-visible behavior is that large aggregate output operands will no longer be diagnosed, like in the example in the description; instead we'll be passing them by reference, which is what is already happening with input operands anyway.

Firstly, the information that the load/store comes from an inline asm operand gets lost by the time the SelectionDAG processes those nodes, and so we cannot use a target hook to select a special value type for them (as discussed in D94097 we want to narrow down the MVT specialization for an llvm type to only apply to asm operands and not universally).

We don't want a special value type for the load/store operations feeding into an inline asm, I think? For an input, we just want to convert the final insertelement to i64x8, using something along the lines of REG_SEQUENCE. This means we won't use an ld64b to load the registers, but I think that's what we want; in general, the input registers won't come from some contiguous hunk of memory. For example, say someone wrote something like this:

struct foo { unsigned long long x[8]; };
void store(int *in, void *addr)
{
    struct foo x = { in[0], in[1], in[4], in[16], in[25], in[36], in[49], in[64] };
    __asm__ volatile ("st64b %0,[%1]" : : "r" (x), "r" (addr) : "memory" );
}

Intuitively, I would expect this to compile to a sequence of ldr, followed by st64b. But you're expecting this should compile to a sequence of ldr, followed by a sequence of stp, followed by an ld64b, followed by an st64b?

For this particular example, if we pass the asm operands as i512, the compiler generates the following, which doesn't look bad:

ldpsw	x2, x3, [x0]
ldrsw	x4, [x0, #16]
ldrsw	x5, [x0, #64]
ldrsw	x6, [x0, #100]
ldrsw	x7, [x0, #144]
ldrsw	x8, [x0, #196]
ldrsw	x9, [x0, #256]
//APP
st64b	x2, [x1]
//NO_APP

Looking at the IR, it seems that SROA gets in the way. It loads all eight i32 values and constructs the i512 operand by performing bitwise operations on them. So I was wrong to say that the load of an i512 value won't get optimized.

labrinea updated this revision to Diff 360208.Jul 20 2021, 11:03 AM

This revision uses i512 to pass the asm operands by value. I've explained in my last comment the challenges we would have faced had we chosen [8 x i64].

This revision is now accepted and ready to land.Jul 26 2021, 3:14 PM
This revision was landed with ongoing or failed builds.Jul 31 2021, 1:53 AM
This revision was automatically updated to reflect the committed changes.