This is an archive of the discontinued LLVM Phabricator instance.

Adjust coercion of aggregates on RenderScript
ClosedPublic

Authored by pirama on Jul 26 2016, 10:47 AM.

Details

Summary

In RenderScript, the size of the argument or return value emitted in the
IR is expected to be the same as the size of the corresponding qualified
type. For ARM and AArch64, the coercion performed by Clang can
change the parameter or return value to a type whose size is different
(usually larger) than the original aggregate type. Specifically, this
can happen in the following cases:

  • Aggregate parameters of size <= 64 bytes and return values smaller than 4 bytes on ARM
  • Aggregate parameters and return values smaller than 16 bytes on AArch64

This patch coerces the cases above to an integer array that is the same
size and alignment as the original aggregate. A new field is added to
TargetInfo to detect a RenderScript target and limit this coercion just
to that case.

Tests added to test/CodeGen/renderscript.c

Diff Detail

Repository
rL LLVM

Event Timeline

pirama updated this revision to Diff 65553.Jul 26 2016, 10:47 AM
pirama retitled this revision from to Adjust coercion of aggregates on RenderScript.
pirama updated this object.
pirama added a reviewer: rsmith.
pirama added subscribers: llvm-commits, srhines.
pirama updated this revision to Diff 65555.Jul 26 2016, 10:51 AM

Fix typo in test

Are you aware of how inefficient the resulting ABI is once it hits CodeGen? For example, using [3 x i8] will waste 3 full 64-bit registers for that struct. struct { char arr[16] }; barely avoids crashing the backend, it's so bad (LLVM demotes it to sret at the last moment).

I suppose the real question is why do RenderScript passes need the types to have the same size, and have you really considered all other options? This ABI mangling would be an absolute last resort if I was trying to add support.

test/CodeGen/renderscript.c
138–139 ↗(On Diff #65555)

Shouldn't these be sret? (And above).

Are you aware of how inefficient the resulting ABI is once it hits CodeGen? For example, using [3 x i8] will waste 3 full 64-bit registers for that struct. struct { char arr[16] }; barely avoids crashing the backend, it's so bad (LLVM demotes it to sret at the last moment).

I suppose the real question is why do RenderScript passes need the types to have the same size, and have you really considered all other options? This ABI mangling would be an absolute last resort if I was trying to add support.

The size requirement is imposed by the RenderScript runtime. To run a compute kernel, the RenderScript runtime iterates over its input and output buffers using a stride equal to the size of the underlying type and invokes the kernel function on entries in the buffer. Disagreement between the kernel and the runtime on the sizes of these types can lead to incorrect output.

An alternative we considered is for the runtime to duplicate this coercion logic, but that would be complex to implement in the runtime and harder to maintain.

test/CodeGen/renderscript.c
138–139 ↗(On Diff #65555)

Yes, these should be sret. I was lazy and did not check the whole signature. I'll upload an update shortly.

pirama updated this revision to Diff 65613.Jul 26 2016, 2:55 PM

Update 'CHECK's in test to check the entire function signature

To run a compute kernel, the RenderScript runtime iterates over its input and output buffers using a stride equal to the size of the underlying type and invokes the kernel function on entries in the buffer. Disagreement between the kernel and the runtime on the sizes of these types can lead to incorrect output.

What language is the runtime written in? I assume LLVM-aware C++ (so it inspects the llvm::Function's parameters before JITing some appropriate loop to do the iteration code)? I think it would be best to pass the stride information through a side channel rather than constraining the front-end to produce sub-optimal code (perhaps a global variable with a special name related to the kernel?).

The runtime dictating the ABI like that seems backwards to me. Ideally you'd start with some given set of functions, decide what the fastest way to run them is, and make sure the runtime is capable of supporting that.

Cheers.

Tim.

To run a compute kernel, the RenderScript runtime iterates over its input and output buffers using a stride equal to the size of the underlying type and invokes the kernel function on entries in the buffer. Disagreement between the kernel and the runtime on the sizes of these types can lead to incorrect output.

What language is the runtime written in? I assume LLVM-aware C++ (so it inspects the llvm::Function's parameters before JITing some appropriate loop to do the iteration code)? I think it would be best to pass the stride information through a side channel rather than constraining the front-end to produce sub-optimal code (perhaps a global variable with a special name related to the kernel?).

The runtime is written in a combination of LLVM-aware C++ and raw LLVM bitcode. The problem is that the source language is LLVM bitcode that was/is already lowered. Once Clang has decided to expand these types, the actual backend can't determine the original data size. If we could go back and side-channel the size information, we certainly would, but I don't think that is possible right now. I can talk to the rest of the RS team to see if they would mind passing the information differently going forward, but existing code/devices have to work this way.

The runtime dictating the ABI like that seems backwards to me. Ideally you'd start with some given set of functions, decide what the fastest way to run them is, and make sure the runtime is capable of supporting that.

Cheers.

Tim.

I can talk to the rest of the RS team to see if they would mind passing the information differently going forward,

It seems like it would improve RenderScript if they could. I'm not sure how common arguments like this are, but pessimizing them would be a shame.

but existing code/devices have to work this way.

If the RenderScript runtime changes, that happens without putting this patch into Clang, doesn't it?

If you've got IR already then Clang isn't involved, and trying to generate IR to run on an older runtime isn't going to work for many other reasons besides this (LLVM IR isn't compatible like that).

I can talk to the rest of the RS team to see if they would mind passing the information differently going forward,

It seems like it would improve RenderScript if they could. I'm not sure how common arguments like this are, but pessimizing them would be a shame.

but existing code/devices have to work this way.

If the RenderScript runtime changes, that happens without putting this patch into Clang, doesn't it?

Actually no. We could fix it for some things going forward, but the Clang that they use will always need some form of this patch for generating IR to run on older devices.

If you've got IR already then Clang isn't involved, and trying to generate IR to run on an older runtime isn't going to work for many other reasons besides this (LLVM IR isn't compatible like that).

Clang has been involved in generating RS IR for the past 6 years. This patch is merely upstreaming an existing Android-only modification to Clang. I believe this is the last such instance of an Android-specific patch. As for LLVM IR compatibility, RS has a bitcode translator that converts between the present-day in-memory IR representation and LLVM 3.2 bitcode (among other things). That part will remain outside of upstream LLVM because it is somewhat separable. This change, however, cannot be made in such a way that it can be layered on top of a pristine version of upstream Clang, hence our reason for upstreaming it.

Actually no. We could fix it for some things going forward, but the Clang that they use will always need some form of this patch for generating IR to run on older devices.

So do you have any kind of deprecation policy for older RS runtimes? It would be good to know that we won't have to live with this hack forever (and have some kind of comment in the code telling us when it can be nuked).

Actually no. We could fix it for some things going forward, but the Clang that they use will always need some form of this patch for generating IR to run on older devices.

So do you have any kind of deprecation policy for older RS runtimes? It would be good to know that we won't have to live with this hack forever (and have some kind of comment in the code telling us when it can be nuked).

There is no deprecation policy for old runtimes because we support everything back to the original release of RS at this point. Any change made here would apply only going forward (again assuming that the team agrees that this needs to go away, and that they aren't willing to put up with the performance issues of having it forever). Considering that all the code here is bracketed by RenderScript checks, it seems like it should be easy to remove any such section when an appropriate time comes. We can add more obvious comments that these checks are specifically for RS ABI issues (at least through Android N at this point).

t.p.northover accepted this revision.Jul 27 2016, 10:01 AM
t.p.northover added a reviewer: t.p.northover.

Oh well, I expect I've been responsible for worse hacks (and will be again in the future). Given the constraints, it seems unavoidable.

This revision is now accepted and ready to land.Jul 27 2016, 10:01 AM
pirama updated this revision to Diff 65785.Jul 27 2016, 12:07 PM
pirama edited edge metadata.

Clarify comment.

Tim, thanks for the review and acceptance. I've slightly updated the comment.

This revision was automatically updated to reflect the committed changes.