This is an archive of the discontinued LLVM Phabricator instance.

[OpenCL] Allow targets emit optimized pipe functions for power of 2 type sizes
AbandonedPublic

Authored by yaxunl on Aug 4 2017, 9:51 AM.

Details

Summary

Currently Clang emits call of __read_pipe_2 or __read_pipe_4 for OpenCL read_pipe builtin,
with appended type size and alignment arguments, where 2 or 4 indicates the original
number of arguments.

For certain targets (e.g. amdgpu), there are optimized version of __read_pipe_2/__read_pipe_4
when the type size and alignment has the same power of 2 value. It is desired that Clang
emits a different function for these cases.

This patch let Clang emits __read_pipe_2_N for such cases where N is the size in bytes of
the type. (N = 1,2,4,8,...,128), so that the target runtime can use an optimized version of
read_pipe.

The same with __read_pipe_4, __write_pipe_2 and __wirte_pipe_4.

This optimization is controlled by TargetCodeGenInfo::hasOptimizedPipeBuiltin, which returns
false by default. Each target can override this function to turn on this optimization.

Diff Detail

Event Timeline

yaxunl created this revision.Aug 4 2017, 9:51 AM
yaxunl edited the summary of this revision. (Show Details)Aug 4 2017, 9:52 AM
bader edited edge metadata.Aug 7 2017, 3:27 AM

Hi Sam,

What do you think about implementing this optimization in target specific optimization pass? Since size/alignment is saved as function parameter in LLVM IR, the optimization can be done in target specific components w/o adding additional conditions to generic library.

Thanks,
Alexey

yaxunl added a comment.Aug 7 2017, 8:36 AM

Hi Sam,

What do you think about implementing this optimization in target specific optimization pass? Since size/alignment is saved as function parameter in LLVM IR, the optimization can be done in target specific components w/o adding additional conditions to generic library.

Thanks,
Alexey

Hi Alexey,

The optimization of the power-of-2 type size is implemented as a library function. Our backend lacks the capability to link in library code at ISA level, so linking of the optimized library function has to be done before any target-specific passes. It seems the only place to do this is Clang codegen since Clang/llvm does not support target-specific pre-linking passes.

Anastasia edited edge metadata.Aug 7 2017, 10:00 AM

Hi Sam,

What do you think about implementing this optimization in target specific optimization pass? Since size/alignment is saved as function parameter in LLVM IR, the optimization can be done in target specific components w/o adding additional conditions to generic library.

Thanks,
Alexey

Hi Alexey,

The optimization of the power-of-2 type size is implemented as a library function. Our backend lacks the capability to link in library code at ISA level, so linking of the optimized library function has to be done before any target-specific passes. It seems the only place to do this is Clang codegen since Clang/llvm does not support target-specific pre-linking passes.

My general feeling is that it doesn't look like a generic enough change for the frontend. Even though it is implemented in a generic way, not every target might have a special support for the power of 2 size and also if there is such a support not every implementation would handle it as a library function. But I can see that perhaps LLVM is missing flexibility in the flow to accommodate these needs. Any change we could try to extend the compilation flow such that this target specific optimization could happen before the IR linking?

b-sumner edited edge metadata.Aug 8 2017, 6:12 AM

Hi Sam,

What do you think about implementing this optimization in target specific optimization pass? Since size/alignment is saved as function parameter in LLVM IR, the optimization can be done in target specific components w/o adding additional conditions to generic library.

Thanks,
Alexey

Hi Alexey,

The optimization of the power-of-2 type size is implemented as a library function. Our backend lacks the capability to link in library code at ISA level, so linking of the optimized library function has to be done before any target-specific passes. It seems the only place to do this is Clang codegen since Clang/llvm does not support target-specific pre-linking passes.

My general feeling is that it doesn't look like a generic enough change for the frontend. Even though it is implemented in a generic way, not every target might have a special support for the power of 2 size and also if there is such a support not every implementation would handle it as a library function. But I can see that perhaps LLVM is missing flexibility in the flow to accommodate these needs. Any change we could try to extend the compilation flow such that this target specific optimization could happen before the IR linking?

It is trivial to implement the small number of specialized functions this patch adds in terms of the general one if desired, and the general one can continue to be handled as it had been.

We had actually proposed a patch (sorry I don't have the reference handy) to add general mechanism for targets to introduce pre-link passes, but it was not accepted. We can try again, but I don't really expect more progress.

Hi Sam,

What do you think about implementing this optimization in target specific optimization pass? Since size/alignment is saved as function parameter in LLVM IR, the optimization can be done in target specific components w/o adding additional conditions to generic library.

Thanks,
Alexey

Hi Alexey,

The optimization of the power-of-2 type size is implemented as a library function. Our backend lacks the capability to link in library code at ISA level, so linking of the optimized library function has to be done before any target-specific passes. It seems the only place to do this is Clang codegen since Clang/llvm does not support target-specific pre-linking passes.

My general feeling is that it doesn't look like a generic enough change for the frontend. Even though it is implemented in a generic way, not every target might have a special support for the power of 2 size and also if there is such a support not every implementation would handle it as a library function. But I can see that perhaps LLVM is missing flexibility in the flow to accommodate these needs. Any change we could try to extend the compilation flow such that this target specific optimization could happen before the IR linking?

It is trivial to implement the small number of specialized functions this patch adds in terms of the general one if desired, and the general one can continue to be handled as it had been.

We had actually proposed a patch (sorry I don't have the reference handy) to add general mechanism for targets to introduce pre-link passes, but it was not accepted. We can try again, but I don't really expect more progress.

It would be nice to understand why it has not been accepted and whether we could try to argument using this case as an example. It seems like a useful feature for toolchains with the IR linking.

Hi Sam,

What do you think about implementing this optimization in target specific optimization pass? Since size/alignment is saved as function parameter in LLVM IR, the optimization can be done in target specific components w/o adding additional conditions to generic library.

Thanks,
Alexey

Hi Alexey,

The optimization of the power-of-2 type size is implemented as a library function. Our backend lacks the capability to link in library code at ISA level, so linking of the optimized library function has to be done before any target-specific passes. It seems the only place to do this is Clang codegen since Clang/llvm does not support target-specific pre-linking passes.

My general feeling is that it doesn't look like a generic enough change for the frontend. Even though it is implemented in a generic way, not every target might have a special support for the power of 2 size and also if there is such a support not every implementation would handle it as a library function. But I can see that perhaps LLVM is missing flexibility in the flow to accommodate these needs. Any change we could try to extend the compilation flow such that this target specific optimization could happen before the IR linking?

It is trivial to implement the small number of specialized functions this patch adds in terms of the general one if desired, and the general one can continue to be handled as it had been.

We had actually proposed a patch (sorry I don't have the reference handy) to add general mechanism for targets to introduce pre-link passes, but it was not accepted. We can try again, but I don't really expect more progress.

It would be nice to understand why it has not been accepted and whether we could try to argument using this case as an example. It seems like a useful feature for toolchains with the IR linking.

The original review is here:

https://reviews.llvm.org/D20682

To cite the reason why it was rejected:

"I fundamentally do not believe that the TargetMachine should be involved in fixing language semantics issues with a "pre linking" pass at the LLVM level. Why? Because there is nothing "target" about this.

There needs to be a fundamentally more principled way of handling this at the language and frontend level IMO."

However, until now, we could not find "a fundamentally more principled way of handling this at the language and frontend level".

bader added a comment.Aug 8 2017, 11:46 AM

@rsmith do you have an opinion on what would be the right place for the kind of proposed optimization?
It looks like it can be implemented as target independent optimization, acting only for target with specified properties - in this case target must provide required built-in functions.

John, do you have any comments? Thanks.

rjmccall edited edge metadata.Aug 11 2017, 7:55 PM

Could you just implement this in SimplifyLibCalls? I assume there's some way to fill in TargetLibraryInfo appropriately for a platform. Is that too late for your linking requirements?

yaxunl added a comment.EditedAug 14 2017, 7:01 AM

Could you just implement this in SimplifyLibCalls? I assume there's some way to fill in TargetLibraryInfo appropriately for a platform. Is that too late for your linking requirements?

Both the optimized and generic versions of __read_pipe function contains call of other library functions and are complicate enough not to be generated programmatically. amdgpu target does not have the capability to link in library code after LLVM codegen. The linking has to be done before SimplifyLibCalls.

bader added a comment.Aug 14 2017, 7:43 AM

Could you just implement this in SimplifyLibCalls? I assume there's some way to fill in TargetLibraryInfo appropriately for a platform. Is that too late for your linking requirements?

Both the optimized and generic versions of __read_pipe function contains call of other library functions and are complicate enough not to be generated programmatically. amdgpu target does not have the capability to link in library code after LLVM codegen. The linking has to be done before SimplifyLibCalls.

If I understand correctly, SimplifyLibCalls is LLVM IR transformation, so it works before linking and LLVM codegen (e.g. InstCombine passes run this transformation). This pass is doing something similar to what you are trying to achieve for __read_pipe builti-ins: pow(2.0, x) -> llvm.exp2(x).

Could you just implement this in SimplifyLibCalls? I assume there's some way to fill in TargetLibraryInfo appropriately for a platform. Is that too late for your linking requirements?

Both the optimized and generic versions of __read_pipe function contains call of other library functions and are complicate enough not to be generated programmatically. amdgpu target does not have the capability to link in library code after LLVM codegen. The linking has to be done before SimplifyLibCalls.

If I understand correctly, SimplifyLibCalls is LLVM IR transformation, so it works before linking and LLVM codegen (e.g. InstCombine passes run this transformation). This pass is doing something similar to what you are trying to achieve for __read_pipe builti-ins: pow(2.0, x) -> llvm.exp2(x).

Thanks. I will take a look.

yaxunl abandoned this revision.Sep 8 2017, 8:32 AM

We implemented this optimization through some target specific llvm pass.