This is an archive of the discontinued LLVM Phabricator instance.

Use -fno-unit-at-a-time and -funit-at-a-time
AbandonedPublic

Authored by rampitec on Nov 8 2016, 5:50 PM.

Details

Summary

Current clang's behaviour -funit-at-a-time, which means IPO optimizations.
IPO is not an appropriate mode in particular for building a library module.
Therefor enable use of -fno-unit-at-a-time and pass it to PassManagerBuilder.

Diff Detail

Event Timeline

rampitec updated this revision to Diff 77297.Nov 8 2016, 5:50 PM
rampitec retitled this revision from to Use -fno-unit-at-a-time and -funit-at-a-time.
rampitec updated this object.
rampitec added reviewers: nhaustov, tejohnson.
rampitec set the repository for this revision to rL LLVM.
rampitec added a subscriber: llvm-commits.
tejohnson edited edge metadata.Nov 9 2016, 8:09 AM

I've added Richard Smith as a reviewer, as he is more familiar with the Clang FE. What is the specific issue requiring disabling unit at a time?

I've added Richard Smith as a reviewer, as he is more familiar with the Clang FE. What is the specific issue requiring disabling unit at a time?

Thank you. The issue appeared while building opencl library. Under unit-at-a-time mode PassManagerBuilder adds FunctionAttrs pass (createPostOrderFunctionAttrsLegacyPass). This pass has removeConvergentAttrs function which removes unused convergent attribute from functions. Functions in the library are unused, so it has removed the attribute in particular from sub_group_barrier, which then lead to OpenCL conformance subgroups test to hang.

I've added Richard Smith as a reviewer, as he is more familiar with the Clang FE. What is the specific issue requiring disabling unit at a time?

Thank you. The issue appeared while building opencl library. Under unit-at-a-time mode PassManagerBuilder adds FunctionAttrs pass (createPostOrderFunctionAttrsLegacyPass). This pass has removeConvergentAttrs function which removes unused convergent attribute from functions. Functions in the library are unused, so it has removed the attribute in particular from sub_group_barrier, which then lead to OpenCL conformance subgroups test to hang.

Added Justin Lebar, who implemented this change. I am not familiar with convergent attributes and OpenCL barriers, but should the visibility of any exported function in a dynamic library prevent such an optimization?

jlebar edited edge metadata.Nov 11 2016, 9:33 AM

Thank you. The issue appeared while building opencl library

First, can you provide a testcase that demonstrates the problem?

Second, I am pretty sure this is not the change that you want to make to solve your problem, see below.

This pass has removeConvergentAttrs function which removes unused convergent attribute from functions.

clang adds the "convergent" attribute to every function it emits. llvm then removes this attribute from functions that do not contain convergent instructions or call convergent functions.

This should be a safe optimization to perform when building a library. It should not remove the attribute from functions because they are unused.

It's entirely possible there's a bug in the pass, though; I can have a look if you can provide a testcase. Another possibility is that your barrier function, which presumably calls some llvm intrinsic, is getting "convergent" removed because the llvm intrinsic is not marked as convergent. That would be working as intended for the pass -- the solution there would be to properly annotate the intrinsic.

Added Justin Lebar, who implemented this change. I am not familiar with convergent attributes and OpenCL barriers, but should the visibility of any exported function in a dynamic library prevent such an optimization?

This is not a dynamic library. This is just library in .bc form for early linking.

The full testcase is rather big and requires a lot of setup, but I have created a really small one, sb1.cl:

#include <opencl-c.h>

__attribute__((overloadable, always_inline, convergent)) void
sub_group_barrier(cl_mem_fence_flags flags, memory_scope scope)
{
    if (flags)
        atomic_work_item_fence(flags, memory_order_acq_rel, scope);
}

The file opencl-c.h is from clang, and note, that attribute convergent is added because of the contents of this file:

#define __ovld __attribute__((overloadable))
#define __conv __attribute__((convergent))
void    __ovld __conv sub_group_barrier(cl_mem_fence_flags flags, memory_scope scope);

Now I compile it to produce a library .bc module:

clang -o sb1.bc -c -emit-llvm -x cl -cl-std=CL2.0 -target amdgcn-amd-amdhsa-opencl -O3 -I <path to llvm>/llvm/tools/clang/lib/Headers sb1.cl

That is the result:

; Function Attrs: alwaysinline nounwind
define void @_Z17sub_group_barrierj12memory_scope(i32 %flags, i32 %scope) local_unnamed_addr #0 {

attributes #0 = { alwaysinline nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+fp64-denormals,-fp32-denormals" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+fp64-denormals,-fp32-denormals" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #2 = { nounwind }

Convergent attribute is missing. You may argue that atomic_work_item_fence does not have convergent attribute, but it should not in fact. Then I can remove a body of the sub_group_barrier completely so it does not call anything at all, result is still the same.

If I use -fno-unit-at-a-time option, then attribute is in place:

; Function Attrs: alwaysinline convergent nounwind
define void @_Z17sub_group_barrierj12memory_scope(i32 %flags, i32 %scope) #0 {

Also note that besides functions attribute pass PassManagerBuilder adds a lot of other passes under unit-a-time-mode which are not expected to run in a non-whole program mode, we have there IPSCCP, GlobalOptimizer, DeadArgElimination etc. All of that is not supposed to run on a library module.

jlebar added a comment.EditedNov 11 2016, 1:42 PM

This is not a dynamic library. This is just library in .bc form for early linking.

OK. This does not change anything about my earlier comment, however.

[Sorry, we midair'ed on this one; reading your comment above now.]

Convergent attribute is missing. You may argue that atomic_work_item_fence does not have convergent attribute, but it should not in fact. Then I can remove a body of the sub_group_barrier completely so it does not call anything at all, result is still the same.

Ah, I think I understand your problem.

The problem is that you are explicitly marking this function as convergent, and then are surprised that LLVM is removing the convergent attribute.

Is that right?

This is WAI for how we have interpreted the convergent attribute in clang and llvm. Maybe our intended behavior is wrong, we can discuss it. But first I want to be really clear that this has nothing to do with running IPOs, or compiling a bitcode library. You would have exactly the same problem, and exactly the same bug, if your testcase were the whole program. Does that make sense?

As I said above:

clang adds the "convergent" attribute to every function it emits. llvm then removes this attribute from functions that do not contain convergent instructions or call convergent functions.

This is the reason that you are seeing what you are seeing: llvm is removing the "convergent" attribute from this function because it does not contain any convergent instructions. Again, it has nothing to do with bitcode libraries. Indeed, if we made the change proposed here, it would likely not fix your bug. As soon as you link this bitcode library into another library and run the IPOs over the combined result, it's going to again remove the convergent attribute on this function.

You may argue that atomic_work_item_fence does not have convergent attribute, but it should not in fact.

Can you elaborate on why atomic_work_item_fence should not have the convergent attribute but sub_group_barrier should be convergent?

Specifically, the convergent attribute prevents additional control-flow dependencies from being added to a callsite. Are you saying that it is incorrect to convert

sub_group_barrier();
if (a) { ... }
else { ... }

into

if (a) { sub_group_barrier(); ... }
else { sub_grouop_barrier(); ... }

but it *is* ok to perform the same transformation on atomic_work_item_fence? Why?

Then I can remove a body of the sub_group_barrier completely so it does not call anything at all, result is still the same.

Yes, there is currently no way for you to say "this empty function must be convergent."

Can you explain why this is an important feature?

Also note that besides functions attribute pass PassManagerBuilder adds a lot of other passes under unit-a-time-mode which are not expected to run in a non-whole program mode, we have there IPSCCP, GlobalOptimizer, DeadArgElimination etc. All of that is not supposed to run on a library module.

If it's OK with you, I'd like to focus on the convergent issue first. Once we get that resolved, Richard Smith should be able to speak to your larger question of funit-at-a-time.

Ah, I think I understand your problem.
The problem is that you are explicitly marking this function as convergent, and then are surprised that LLVM is removing the convergent attribute.
Is that right?

Right. Note, it is marked in the standard header. I can remove attribute from the implementation, it does not change anything.

This is WAI for how we have interpreted the convergent attribute in clang and llvm. Maybe our intended behavior is wrong, we can discuss it. But first I want to be really clear that this has nothing to do with running IPOs, or compiling a bitcode library. You would have exactly the same problem, and exactly the same bug, if your testcase were the whole program. Does that make sense?

In fact when OpenCL conformance test subgroups is compiled in the whole program mode. It does hang right now and passes if I compile library to preserve convergent attribute on sub_group_barrier.

As I said above:

clang adds the "convergent" attribute to every function it emits. llvm then removes this attribute from functions that do not contain convergent instructions or call convergent functions.

This statement confuses me, that is not what I observe:

void foo(void)
{
}

So you say clang will add convergent attribute here and llvm will remove it. Let's see:

clang -o sb0.bc -c -emit-llvm -x cl -cl-std=CL2.0 -target amdgcn-amd-amdhsa-opencl -O0 sb0.cl

I'm using -O0, so llvm does not optimize (note, if I use -O0 for original sub_group_barrier implementation then attribute is preserved as well, so I'm sure -O0 does not run this optimization). Result:

; Function Attrs: nounwind
define void @foo() #0 {
entry:
  ret void
}

attributes #0 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+fp64-denormals,-fp32-denormals" "unsafe-fp-math"="false" "use-soft-float"="false" }

I do not see convergent attribute added.

This is the reason that you are seeing what you are seeing: llvm is removing the "convergent" attribute from this function because it does not contain any convergent instructions. Again, it has nothing to do with bitcode libraries. Indeed, if we made the change proposed here, it would likely not fix your bug. As soon as you link this bitcode library into another library and run the IPOs over the combined result, it's going to again remove the convergent attribute on this function.

I can tell you for sure that after linking with actual main module and optimizing in whole mode this really fixes the bug ;)

You may argue that atomic_work_item_fence does not have convergent attribute, but it should not in fact.

Can you elaborate on why atomic_work_item_fence should not have the convergent attribute but sub_group_barrier should be convergent?
Specifically, the convergent attribute prevents additional control-flow dependencies from being added to a callsite. Are you saying that it is incorrect to convert

sub_group_barrier();
if (a) { ... }
else { ... }

into

if (a) { sub_group_barrier(); ... }
else { sub_grouop_barrier(); ... }

but it *is* ok to perform the same transformation on atomic_work_item_fence? Why?

I suppose so, because fence does not act as a barrier, just makes memory visible.

Then I can remove a body of the sub_group_barrier completely so it does not call anything at all, result is still the same.

Yes, there is currently no way for you to say "this empty function must be convergent."
Can you explain why this is an important feature?

That is not important. That was just to show that lack of attribute on fence is not a problem per-se.

Also note that besides functions attribute pass PassManagerBuilder adds a lot of other passes under unit-a-time-mode which are not expected to run in a non-whole program mode, we have there IPSCCP, GlobalOptimizer, DeadArgElimination etc. All of that is not supposed to run on a library module.

If it's OK with you, I'd like to focus on the convergent issue first. Once we get that resolved, Richard Smith should be able to speak to your larger question of funit-at-a-time.

OK, even though I believe the latter is even more important.

I think I see the source of confusion. You say if a function does not call anything convergent, it is not convergent. So this sub_group_barrier shall do two things: execution barrier and fence. Barrier is convergent and fence is not. But since that is a subgroup barrier and we have SIMT we do not need to issue any instructions to make sure execution will come to the convergence point simultaneously, so barrier is omitted. We only need to make sure it is not cloned into a divergent CFG, and we want to rely on function's attribute for this purpose.

Now setting aside this issue, I still believe that in general there shall be a way to compile in a non IPO mode, and in general IPO is not suitable for a library module.

I think I see the source of confusion. You say if a function does not call anything convergent, it is not convergent. So this sub_group_barrier shall do two things: execution barrier and fence. Barrier is convergent and fence is not. But since that is a subgroup barrier and we have SIMT we do not need to issue any instructions to make sure execution will come to the convergence point simultaneously, so barrier is omitted. We only need to make sure it is not cloned into a divergent CFG, and we want to rely on function's attribute for this purpose.

Now setting aside this issue, I still believe that in general there shall be a way to compile in a non IPO mode, and in general IPO is not suitable for a library module.

In what way is this different from the IPO we do on single source files at -O2? It should be conservatively correct.

This statement confuses me, that is not what I observe:

Sorry, you're right. This is what clang does in cuda mode specifically.

You may want to do the same thing in opencl mode. In CUDA if you have

void wrapper() { my_convergent_barrier_fn(); }

we treat wrapper() as convergent. This is important because otherwise you cannot meaningfully write wrappers around convergent functions (without using attribute((convergent)), I guess, but that doesn't exist in CUDA; it's a clang thing). I don't know if you want the same behavior in opencl or not.

But since that is a subgroup barrier and we have SIMT we do not need to issue any instructions to make sure execution will come to the convergence point simultaneously, so barrier is omitted.

OK. What you want then is to call a llvm intrinsic that is a nop, but is convergent. Then this pass will do the right thing.

Perhaps the semantics in opencl are that clang must insert a call to this special intrinsic at the beginning of every function that is annotated as attribute((convergent)).

Perhaps the semantics in opencl are that clang must insert a call to this special intrinsic at the beginning of every function that is annotated as attribute((convergent)).

Why do not we just rely on an attribute? Why does it exist if we cannot use it?

In what way is this different from the IPO we do on single source files at -O2? It should be conservatively correct.

Such IPO has to be limited to non-external functions only then. That does not seem to be the case.

Why do not we just rely on an attribute? Why does it exist if we cannot use it?

You can use the attribute. It is just not implemented correctly in clang's opencl frontend. The correct way to implement it is as I explained.

To elaborate on why the suggestion I gave is the right thing to do: LLVM will inline convergent functions. After inlining a convergent function, the fact that it is convergent is lost. Therefore llvm can transform the body of the function, introducing a divergence. That will clearly be incorrect.

However, if you add a convergent nop intrinsic to the body of any function that is marked with the convergent attribute, this will behave correctly even in the face of inlining.

In what way is this different from the IPO we do on single source files at -O2? It should be conservatively correct.

Such IPO has to be limited to non-external functions only then. That does not seem to be the case.

They should be correctly limited in the case of externally visible functions. Otherwise -O2 would be failing. E.g. GlobalOpt that you mentioned earlier is only going to remove unreferenced functions that are local or otherwise discardable (linkonce, available_externally linkages types).

Do you have an example where one of these passes is doing the wrong thing?

They should be correctly limited in the case of externally visible functions. Otherwise -O2 would be failing. E.g. GlobalOpt that you mentioned earlier is only going to remove unreferenced functions that are local or otherwise discardable (linkonce, available_externally linkages types).
Do you have an example where one of these passes is doing the wrong thing?

I have checked and it looks like these optimizations limit themselves to externally visible objects, at least for the part I have checked. That is clearly not the case with convergence attribute removal. I can fix the issue with the attribute differently though.

Then it is still probably worth enabling use of these options, because we already have them and we already have functionality they have to perform. We just do not connect one to another.

However, if you add a convergent nop intrinsic to the body of any function that is marked with the convergent attribute, this will behave correctly even in the face of inlining.

That is a better solution, agree. Marking all functions as convergent seems to be an overkill to me, but I have created a discardable barrier intrinsic and used it directly in the sub_group_barrier implementation. As expected that helped conformance test to pass.

That said I believe that still makes sense to submit the current change just to connect options we already have defined to their intended functionality.

jlebar added a comment.EditedNov 13 2016, 3:21 AM

I have created a discardable barrier intrinsic and used it directly in the sub_group_barrier implementation. As expected that helped conformance test to pass.

You still plan to modify clang to insert this intrinsic during codegen at the top of every function marked with the convergent attribute, right?

You folks need to figure out the semantics for the convergent attribute in opencl, but I think it makes very little sense to require that convergent functions both (a) have the attribute and (b) have the nop barrier intrinsic. You should choose one or the other approach.

Marking all functions as convergent seems to be an overkill to me

Again, I urge you to think about this in terms of the semantics of your programming language. Am I allowed to wrap a convergent function? (People do this in CUDA all the time.) If I am allowed to do so, must I mark the wrapper (and its wrapper, and so on) with the convergent attribute? (I know little about opencl, but I notice that "convergent" doesn't appear anywhere in the opencl 2.0 spec. This suggests maybe you don't expect end-users to use this attribute in their code?)

Only once you have defined the semantics of the language will you be able to make the correct decision about whether an approach is overkill or not. But, as someone who spent a few months of his life tracking down bugs due to our CUDA support not handling this exact issue properly, I am very concerned with the above quote, because it sounds to me like you're judging solutions before figuring out what your language is supposed to do.

That said I believe that still makes sense to submit the current change just to connect options we already have defined to their intended functionality.

I don't think I agree. Use of the -fno-unit-at-a-time option could end up hiding other real issues (e.g. in your case with the convergent attribute), and also disable optimizations that are perfectly legal in those cases (e.g. global opt, etc). If someone wants to disable the option, it would again be worthwhile to have a discussion as to why.

I looked at the history of the clang support that gives the current error, which is from 2009, and the comment was "We could just warn about -fno-unit-at-a-time, but in practice people using it probably aren't going to get what they want out of clang." I still think that is true today, since this disables a whole lot, and looking at e.g. the gcc documentation for the option it only seems to disable a few things for very specific legacy code circumstances (the gcc docs say the option is there for compatibility, and it disables reordering of top-level functions for which new code should use attributes, e.g.).

I tried to look at the history of when/why DisableUnitAtATime was added to llvm, but wasn't able to find the commit log either via git (it was moved from a different file) or via google searches.

I have created a discardable barrier intrinsic and used it directly in the sub_group_barrier implementation. As expected that helped conformance test to pass.

You still plan to modify clang to insert this intrinsic during codegen at the top of every function marked with the convergent attribute, right?

No, not really. It is inserted manually at the top of the sub_group_barrier implementation. It is still needed to be there because otherwise the attribute will be removed from sub_group_barrier (and one more wrapper of it by the way). Other library convergent functions do not need it because they already have calls to something convergent, this is only sub_group_barrier special. It may happen of course we will identify more cases over the time.

You folks need to figure out the semantics for the convergent attribute in opencl, but I think it makes very little sense to require that convergent functions both (a) have the attribute and (b) have the nop barrier intrinsic. You should choose one or the other approach.

They have attribute and such barrier call. In CUDA you add attribute to every function automatically, in this case library functions are marked explicitly. In both cases there must be a convergent call inside or else IPO will remove that attribute.

Marking all functions as convergent seems to be an overkill to me

Again, I urge you to think about this in terms of the semantics of your programming language. Am I allowed to wrap a convergent function?

Yes.

If I am allowed to do so, must I mark the wrapper (and its wrapper, and so on) with the convergent attribute?

Yes.

This suggests maybe you don't expect end-users to use this attribute in their code?

I do not think we want to require people to use it, there is nothing in the language standard to require it. When we do library we can take care of it manually.
One drawback I see here is a relatively big user function (not a wrapper) calling a barrier. If a call site would be cloned into a divergent control flow it may create problem. In such situation it should make sense to add convergent attribute to such user function. When I'm speaking about the overkill I mean I would expect such marking to be added bottom to top in the call stack, not bulk added and then removed top to bottom. Was this considered?

rampitec edited edge metadata.Nov 13 2016, 10:34 AM
rampitec added a subscriber: b-sumner.
mehdi_amini requested changes to this revision.Nov 13 2016, 10:51 AM
mehdi_amini added a reviewer: mehdi_amini.
mehdi_amini added a subscriber: mehdi_amini.

I planned for a long time to *remove* the "unit_at_a_time" option in the PassManager builder, I haven't read anything compelling in this thread that justifies disabling IPO right now.
If there is a bug with a particular IPO when optimizing a library, then the bug should be addressed correctly, not by disabling the IPO.

I tried to look at the history of when/why DisableUnitAtATime was added to llvm, but wasn't able to find the commit log either via git (it was moved from a different file) or via google searches.

IIRC, it was a compatibility mode from the early days to be able to match what GCC was doing.

This revision now requires changes to proceed.Nov 13 2016, 10:51 AM

I tried to look at the history of when/why DisableUnitAtATime was added to llvm, but wasn't able to find the commit log either via git (it was moved from a different file) or via google searches.

IIRC, it was a compatibility mode from the early days to be able to match what GCC was doing.

Here is the GCC doc: https://gcc.gnu.org/onlinedocs/gcc-4.2.1/gcc/Optimize-Options.html

-funit-at-a-time
Parse the whole compilation unit before starting to produce code. This allows some extra optimizations to take place but consumes more memory (in general). There are some compatibility issues with unit-at-a-time mode:
enabling unit-at-a-time mode may change the order in which functions, variables, and top-level asm statements are emitted, and will likely break code relying on some particular ordering. The majority of such top-level asm statements, though, can be replaced by section attributes. The fno-toplevel-reorder option may be used to keep the ordering used in the input file, at the cost of some optimizations.
unit-at-a-time mode removes unreferenced static variables and functions. This may result in undefined references when an asm statement refers directly to variables or functions that are otherwise unused. In that case either the variable/function shall be listed as an operand of the asm statement operand or, in the case of top-level asm statements the attribute used shall be used on the declaration.
Static functions now can use non-standard passing conventions that may break asm statements calling functions directly. Again, attribute used will prevent this behavior.
As a temporary workaround, -fno-unit-at-a-time can be used, but this scheme may not be supported by future releases of GCC.

Enabled at levels -O, -O2, -O3, -Os.
rampitec abandoned this revision.Nov 13 2016, 12:45 PM