This is an archive of the discontinued LLVM Phabricator instance.

[FuncSpec] Support specialising recursive functions
ClosedPublic

Authored by SjoerdMeijer on Jul 21 2021, 1:08 AM.

Details

Summary

This adds support for specialising recursive functions. For example, consider this program:

int Global = 1;
void recursiveFunc(int *arg) {
  if (*arg < 4) {
    print(*arg);
    int Temp = *arg + 1;
    recursiveFunc(&Temp);
  }
}
void main() {
  recursiveFunc(&Global);
}

after 3 iterations of function specialisation, followed by inlining of the specialised versions of recursiveFunc, the main function looks like this:

void main() {
  print(1);
  print(2);
  print(3);
}

To support this, the following has been added:

  • An update of the solver and the state of the new specialised functions,
  • An optimisation to propagate constant stack values after each iteration of function specialisation, which is necessary for the next iteration to recognise the constant values and trigger.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Jul 21 2021, 1:08 AM
SjoerdMeijer requested review of this revision.Jul 21 2021, 1:08 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2021, 1:08 AM

The example in the summary looks a little bit scary at first glance. I thought it might specialize 1000 times if the code looked like this:

int Global = 1;
void recursiveFunc(int *arg) {
  if (*arg < 1000) {
    print(*arg);
    int Temp = *arg + 1;
    recursiveFunc(&Temp);
  }
}
void main() {
  recursiveFunc(&Global);
}

And I found that it is controlled by FuncSpecializationMaxIters which is 1 by default.
And my question is: what would be different if we don't change the value for FuncSpecializationMaxIters?

llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
Line 74: mark this with static.

Line 84: maybe we need a better name, such as getPromotableAlloca.

Thanks for looking at this!
I have inlined my responses to some queries below, and will now address the code comments.

The example in the summary looks a little bit scary at first glance. I thought it might specialize 1000 times if the code looked like this:

int Global = 1;
void recursiveFunc(int *arg) {
  if (*arg < 1000) {
    print(*arg);
    int Temp = *arg + 1;
    recursiveFunc(&Temp);
  }
}
void main() {
  recursiveFunc(&Global);
}

And I found that it is controlled by FuncSpecializationMaxIters which is 1 by default.

Exactly: this is controlled by the option -function-specialization-max-iters, which indeed defaults to 1, but in the new tests you'll see that higher values are used to do more iterations and specialise recursive functions.

And my question is: what would be different if we don't change the value for FuncSpecializationMaxIters?

I have no plan to change the default number of iterations at this point. First, I wanted to get this feature in, so that it can be controlled with this option. Later, when we start looking at enabling the function specialisation pass by default and need to look at compile times again, we can revisit what would be acceptable. Or perhaps we need an "aggressive" option that sets the number of iterations to some value. But yeah, getting function specialisation enabled by default is going to be difficult, and I didn't want to make it even more difficult by considering changing the default at this point.

Addressed comments.

I have no plan to change the default number of iterations at this point.

Yeah, neither do I. If the default behavior doesn't change, we don't need to measure the numbers again, which saves time.

Or perhaps we do need a "aggressive" options, that sets the number of iterations to some value.

This is the problem. Since the number of recursive functions that get specialized may increase with FuncSpecializationMaxIters, it may be scary to users (maybe other compiler engineers tuning performance). I think the problem comes from there being no restriction between iterations.
One solution may be to add a structure like Set<Set<Function*>> to record the functions specialized, and a structure Set<Function*, Argument*> to record the functions specialized for a specific argument. Then we could give a higher penalty to the recursive cases for the same argument. The design of the data structures may not be the best, but I think it states my concern. Simply put, I don't want the number of specialized functions to grow linearly and without bound as FuncSpecializationMaxIters increases.

I am happy to look into that. Shall we do that as a follow up? This gives us the baseline that we can experiment with and get numbers for, then see if we can improve that.

I would prefer to implement it in this diff, since I think it is better to make the patch self-contained. We could still experiment with this patch without troubling others.

Okay, if we want to do this here, then we need clarity on the exact problem, which is lacking at the moment. You mentioned "scary" a few times, but there's nothing scary here. ;-) First of all, this is an extra option that people need to buy into, so changes won't bother anyone.

About the suspected problem:

I think the problem may come from there is no restriction between iterations.

I don't think this is the case. In subsequent iterations, exactly the same cost model and heuristics are applied. Thus, I expect that if a function was not specialised before, it won't be specialised in a later iteration, except for recursive functions, which need to be triggered by a very specific optimisation between iterations.

Thus, without exact clarity on the problem, this feels like premature optimisation to me, so we can do two things:

  • I guess it's up to me to show with performance numbers whether there is a problem or not. Before doing that, I would appreciate it if you could elaborate on the suspected problems (just to make sure I understand them).
  • If you're happy with my explanation thus far, and given that this is opt-in anyway, we can defer the compile-time analysis that I need to do anyway when I look into enabling function specialisation by default.

First of all, this is an extra option that people need to buy into, so changes won't bother anyone.

Agreed. But that's not a good reason to do things arbitrarily just because it is not the default behavior.

About the suspected problem:

I think the problem may come from there is no restriction between iterations.

I don't think this is the case. In subsequent iterations, exactly the same cost model and heuristics are applied. Thus, I expect that if a function was not specialised before, it won't be specialised in a later iteration, except for recursive functions, which need to be triggered by a very specific optimisation between iterations.

Yeah, the recursive functions are the problem.

Okay, if we want to do this here, then we need clarity on the exact problem,

Let me give a more formal description.
(1) Call the set of functions newly generated in the i-th iteration Fs[i], with Fs[0] the initial set of functions. The number of functions generated in the i-th iteration is then |Fs[i]|.
(2) The penalty to specialize a function would be Penalty(F) * NumSpecialized[i], where NumSpecialized[i] = |Fs[1]| + |Fs[2]| + ... + |Fs[i]|.
(3) The bonus to specialize a function would be Bonus(F, ArgNo). For a normal recursive function F and its specialized version SF, it is normal that Bonus(F, ArgNo) == Bonus(SF, ArgNo) and Penalty(F) == Penalty(SF). This means SF may be specialized again if NumSpecialized[i] is not yet large enough.

From this we can see that the total number of specialized functions is controlled by NumSpecialized, which increases linearly.
But here is the problem: as implemented, Bonus(F, ArgNo) increases exponentially with the depth of loops. So we may get into trouble if we meet a recursive function with deeply nested loops.
What we want to do is add the iteration count i to the cost model, i.e. Penalty(F, i).

For example,

; opt -function-specialization -func-specialization-max-iters=100  -S %s
@Global = internal constant i32 1, align 4

define internal void @recursiveFunc(i32* nocapture readonly %arg) {
  %temp = alloca i32, align 4
  %arg.load = load i32, i32* %arg, align 4
  %arg.cmp = icmp slt i32 %arg.load, 10000
  br i1 %arg.cmp, label %loop1, label %ret.block

loop1:
  br label %loop2

loop2:
  br label %loop3

loop3:
  br label %loop4

loop4:
  br label %block6

block6:
  call void @print_val(i32 %arg.load)
  %arg.add = add nsw i32 %arg.load, 1
  store i32 %arg.add, i32* %temp, align 4
  call void @recursiveFunc(i32* nonnull %temp)
  br label %loop4.end

loop4.end:
  %exit_cond1 = call i1 @exit_cond()
  br i1 %exit_cond1, label %loop4, label %loop3.end

loop3.end:
  %exit_cond2 = call i1 @exit_cond()
  br i1 %exit_cond2, label %loop3, label %loop2.end

loop2.end:
  %exit_cond3 = call i1 @exit_cond()
  br i1 %exit_cond3, label %loop2, label %loop1.end

loop1.end:
  %exit_cond4 = call i1 @exit_cond()
  br i1 %exit_cond4, label %loop1, label %ret.block

ret.block:
  ret void
}

define i32 @main() {
  call void @recursiveFunc(i32* nonnull @Global)
  ret i32 0
}

declare dso_local void @print_val(i32)
declare dso_local i1 @exit_cond()

I guess I would be happy if recursiveFunc got specialized fewer than 4 times, even when we set func-specialization-max-iters to 100.

I guess it's up to me to show with performance numbers if there is a problem

I prefer to analyse problems from the cost-model side instead of running large workloads every time, since there are too many parameters and the patterns are very complex, so problems in the code can easily be missed. Although fixing bugs is very common, I think it is better to avoid problems once we have noticed them.

Thanks for elaborating on this!
See my comments inline.

Let me give a more formal description.
(1) Call the set of functions newly generated in the i-th iteration Fs[i], with Fs[0] the initial set of functions. The number of functions generated in the i-th iteration is then |Fs[i]|.
(2) The penalty to specialize a function would be Penalty(F) * NumSpecialized[i], where NumSpecialized[i] = |Fs[1]| + |Fs[2]| + ... + |Fs[i]|.
(3) The bonus to specialize a function would be Bonus(F, ArgNo). For a normal recursive function F and its specialized version SF, it is normal that Bonus(F, ArgNo) == Bonus(SF, ArgNo) and Penalty(F) == Penalty(SF). This means SF may be specialized again if NumSpecialized[i] is not yet large enough.

From this we can see that the total number of specialized functions is controlled by NumSpecialized, which increases linearly.

100% agreed so far. This is indeed how things work at the moment to recursively/linearly specialise recursive functions.

But here is the problem: as implemented, Bonus(F, ArgNo) increases exponentially with the depth of loops. So we may get into trouble if we meet a recursive function with deeply nested loops.
What we want to do is add the iteration count i to the cost model, i.e. Penalty(F, i).

Ok, thanks, I am going to look into this!

For example (the same IR module as quoted above):

I guess I would be happy if recursiveFunc got specialized fewer than 4 times, even when we set func-specialization-max-iters to 100.

I am looking into this now. But now that we have cleared up the problem in previous messages, i.e. we are on the same page about how things work for recursive functions, I think we need to discuss what the cost-model changes would look like, because in this example the numbers are arbitrary and a rationale is missing:

I guess I would be happy if recursiveFunc got specialized fewer than 4 times, even when we set func-specialization-max-iters to 100.

Why 4 times? In other words, what is the metric here? Is that code-size or something else?

4 times is an arbitrary number. I thought the best would be only 1 time, but I worried that would be too strict, so I said 4 times. Now I think we should be more strict, so it would be better to specialize it only once.

how things work for recursive functions, I think we need to discuss what the cost-model changes would look like

I think we need to handle recursive functions specially, since it is common that a recursive function f and its specialized version f' have the same cost and bonus. We need to record which functions have been specialized (and for which argument). Then, when we are about to specialize a function for a specific argument we have already specialized, we could forbid it or give it a much higher penalty. The details are missing, but the key point is that we need to handle recursive functions specially instead of tuning or refactoring the cost model.

4 times is an arbitrary number. I thought the best would be only 1 time, but I worried that would be too strict, so I said 4 times. Now I think we should be more strict, so it would be better to specialize it only once.

I can't follow any of this. There are only arbitrary numbers here; I can't follow the decision making because no rationale is given.
Taking your example, specialising 4 times, and after cleanup passes like inline, instcombine, and simplifycfg, this gives exactly the code I would expect.
And that's the problem with this discussion: it is based on an idea, not on any data that we can verify. For *this* example, I don't see which high-level criterion is the problem:

  • compile-times,
  • code-size,
  • performance.

how things work for recursive functions, I think we need to discuss what the cost-model changes would look like

I think we need to handle recursive functions specially, since it is common that a recursive function f and its specialized version f' have the same cost and bonus. We need to record which functions have been specialized (and for which argument). Then, when we are about to specialize a function for a specific argument we have already specialized, we could forbid it or give it a much higher penalty. The details are missing, but the key point is that we need to handle recursive functions specially instead of tuning or refactoring the cost model.

To help this discussion and make it concrete, can you take your own example and point out exactly on which of the code-generation criteria I mentioned before you would like to specialise this only once?

The problem I have with this discussion is that it is based on an idea that I cannot verify with data. While I understand the general idea, and implementing a penalty is easy, I simply don't know what that penalty should be. I could put a random/arbitrary number in, but why would we do that? That's why I still prefer the approach of getting this foundation in place, because it is opt-in, and then we can tune it further.

Hi, for this example:

(the same IR module as quoted above)

If we run bin/opt -function-specialization -func-specialization-max-iters=10 -S %s, we find that recursiveFunc gets specialized 18 times. I believe we agree that recursiveFunc shouldn't be specialized that many times.
Then I ran bin/opt -function-specialization -func-specialization-max-iters=10 -inline -instcombine -simplifycfg -S %s and got:

; ModuleID = 'RecursiveFunc.ll'
source_filename = "RecursiveFunc.ll"

@Global = internal constant i32 1, align 4
@funcspec.arg = internal constant i32 2
@funcspec.arg.3 = internal constant i32 3
@funcspec.arg.5 = internal constant i32 4
@funcspec.arg.7 = internal constant i32 5
@funcspec.arg.9 = internal constant i32 6
@funcspec.arg.11 = internal constant i32 7
@funcspec.arg.13 = internal constant i32 8
@funcspec.arg.15 = internal constant i32 9
@funcspec.arg.17 = internal constant i32 10
@funcspec.arg.19 = internal constant i32 11

define internal void @recursiveFunc(i32* nocapture readonly %arg) {
  %temp = alloca i32, align 4
  %arg.load = load i32, i32* %arg, align 4
  %arg.cmp = icmp slt i32 %arg.load, 10000
  br i1 %arg.cmp, label %loop1, label %ret.block

loop1:                                            ; preds = %loop1.end, %0
  br label %loop2

loop2:                                            ; preds = %loop2.end, %loop1
  br label %loop3

loop3:                                            ; preds = %loop3.end, %loop2
  br label %loop4

loop4:                                            ; preds = %loop4, %loop3
  call void @print_val(i32 %arg.load)
  %arg.add = add nsw i32 %arg.load, 1
  store i32 %arg.add, i32* %temp, align 4
  call void @recursiveFunc(i32* nonnull %temp)
  %exit_cond1 = call i1 @exit_cond()
  br i1 %exit_cond1, label %loop4, label %loop3.end

loop3.end:                                        ; preds = %loop4
  %exit_cond2 = call i1 @exit_cond()
  br i1 %exit_cond2, label %loop3, label %loop2.end

loop2.end:                                        ; preds = %loop3.end
  %exit_cond3 = call i1 @exit_cond()
  br i1 %exit_cond3, label %loop2, label %loop1.end

loop1.end:                                        ; preds = %loop2.end
  %exit_cond4 = call i1 @exit_cond()
  br i1 %exit_cond4, label %loop1, label %ret.block

ret.block:                                        ; preds = %loop1.end, %0
  ret void
}

define i32 @main() {
  br label %loop1.i

loop1.i:                                          ; preds = %loop1.end.i, %0
  br label %loop2.i

loop2.i:                                          ; preds = %loop2.end.i, %loop1.i
  br label %loop3.i

loop3.i:                                          ; preds = %loop3.end.i, %loop2.i
  br label %loop4.i

loop4.i:                                          ; preds = %recursiveFunc.2.exit.i, %loop3.i
  call void @print_val(i32 1)
  br label %loop1.i.i

loop1.i.i:                                        ; preds = %loop1.end.i.i, %loop4.i
  br label %loop2.i.i

loop2.i.i:                                        ; preds = %loop2.end.i.i, %loop1.i.i
  br label %loop3.i.i

loop3.i.i:                                        ; preds = %loop3.end.i.i, %loop2.i.i
  br label %loop4.i.i

loop4.i.i:                                        ; preds = %recursiveFunc.4.exit.i.i, %loop3.i.i
  call void @print_val(i32 2)
  br label %loop1.i.i.i

loop1.i.i.i:                                      ; preds = %loop1.end.i.i.i, %loop4.i.i
  br label %loop2.i.i.i

loop2.i.i.i:                                      ; preds = %loop2.end.i.i.i, %loop1.i.i.i
  br label %loop3.i.i.i

loop3.i.i.i:                                      ; preds = %loop3.end.i.i.i, %loop2.i.i.i
  br label %loop4.i.i.i

loop4.i.i.i:                                      ; preds = %recursiveFunc.6.exit.i.i.i, %loop3.i.i.i
  call void @print_val(i32 3)
  br label %loop1.i.i.i.i

loop1.i.i.i.i:                                    ; preds = %loop1.end.i.i.i.i, %loop4.i.i.i
  br label %loop2.i.i.i.i

loop2.i.i.i.i:                                    ; preds = %loop2.end.i.i.i.i, %loop1.i.i.i.i
  br label %loop3.i.i.i.i

loop3.i.i.i.i:                                    ; preds = %loop3.end.i.i.i.i, %loop2.i.i.i.i
  br label %loop4.i.i.i.i

loop4.i.i.i.i:                                    ; preds = %recursiveFunc.8.exit.i.i.i.i, %loop3.i.i.i.i
  call void @print_val(i32 4)
  br label %loop1.i.i.i.i.i

loop1.i.i.i.i.i:                                  ; preds = %loop1.end.i.i.i.i.i, %loop4.i.i.i.i
  br label %loop2.i.i.i.i.i

loop2.i.i.i.i.i:                                  ; preds = %loop2.end.i.i.i.i.i, %loop1.i.i.i.i.i
  br label %loop3.i.i.i.i.i

loop3.i.i.i.i.i:                                  ; preds = %loop3.end.i.i.i.i.i, %loop2.i.i.i.i.i
  br label %loop4.i.i.i.i.i

loop4.i.i.i.i.i:                                  ; preds = %recursiveFunc.10.exit.i.i.i.i.i, %loop3.i.i.i.i.i
  call void @print_val(i32 5)
  br label %loop1.i.i.i.i.i.i

loop1.i.i.i.i.i.i:                                ; preds = %loop1.end.i.i.i.i.i.i, %loop4.i.i.i.i.i
  br label %loop2.i.i.i.i.i.i

loop2.i.i.i.i.i.i:                                ; preds = %loop2.end.i.i.i.i.i.i, %loop1.i.i.i.i.i.i
  br label %loop3.i.i.i.i.i.i

loop3.i.i.i.i.i.i:                                ; preds = %loop3.end.i.i.i.i.i.i, %loop2.i.i.i.i.i.i
  br label %loop4.i.i.i.i.i.i

loop4.i.i.i.i.i.i:                                ; preds = %recursiveFunc.12.exit.i.i.i.i.i.i, %loop3.i.i.i.i.i.i
  call void @print_val(i32 6)
  br label %loop1.i.i.i.i.i.i.i

loop1.i.i.i.i.i.i.i:                              ; preds = %loop1.end.i.i.i.i.i.i.i, %loop4.i.i.i.i.i.i
  br label %loop2.i.i.i.i.i.i.i

loop2.i.i.i.i.i.i.i:                              ; preds = %loop2.end.i.i.i.i.i.i.i, %loop1.i.i.i.i.i.i.i
  br label %loop3.i.i.i.i.i.i.i

loop3.i.i.i.i.i.i.i:                              ; preds = %loop3.end.i.i.i.i.i.i.i, %loop2.i.i.i.i.i.i.i
  br label %loop4.i.i.i.i.i.i.i

loop4.i.i.i.i.i.i.i:                              ; preds = %recursiveFunc.14.exit.i.i.i.i.i.i.i, %loop3.i.i.i.i.i.i.i
  call void @print_val(i32 7)
  br label %loop1.i.i.i.i.i.i.i.i

loop1.i.i.i.i.i.i.i.i:                            ; preds = %loop1.end.i.i.i.i.i.i.i.i, %loop4.i.i.i.i.i.i.i
  br label %loop2.i.i.i.i.i.i.i.i

loop2.i.i.i.i.i.i.i.i:                            ; preds = %loop2.end.i.i.i.i.i.i.i.i, %loop1.i.i.i.i.i.i.i.i
  br label %loop3.i.i.i.i.i.i.i.i

loop3.i.i.i.i.i.i.i.i:                            ; preds = %loop3.end.i.i.i.i.i.i.i.i, %loop2.i.i.i.i.i.i.i.i
  br label %loop4.i.i.i.i.i.i.i.i

loop4.i.i.i.i.i.i.i.i:                            ; preds = %recursiveFunc.16.exit.i.i.i.i.i.i.i.i, %loop3.i.i.i.i.i.i.i.i
  call void @print_val(i32 8)
  br label %loop1.i.i.i.i.i.i.i.i.i

loop1.i.i.i.i.i.i.i.i.i:                          ; preds = %loop1.end.i.i.i.i.i.i.i.i.i, %loop4.i.i.i.i.i.i.i.i
  br label %loop2.i.i.i.i.i.i.i.i.i

loop2.i.i.i.i.i.i.i.i.i:                          ; preds = %loop2.end.i.i.i.i.i.i.i.i.i, %loop1.i.i.i.i.i.i.i.i.i
  br label %loop3.i.i.i.i.i.i.i.i.i

loop3.i.i.i.i.i.i.i.i.i:                          ; preds = %loop3.end.i.i.i.i.i.i.i.i.i, %loop2.i.i.i.i.i.i.i.i.i
  br label %loop4.i.i.i.i.i.i.i.i.i

loop4.i.i.i.i.i.i.i.i.i:                          ; preds = %recursiveFunc.18.exit.i.i.i.i.i.i.i.i.i, %loop3.i.i.i.i.i.i.i.i.i
  call void @print_val(i32 9)
  br label %loop1.i.i.i.i.i.i.i.i.i.i

loop1.i.i.i.i.i.i.i.i.i.i:                        ; preds = %loop1.end.i.i.i.i.i.i.i.i.i.i, %loop4.i.i.i.i.i.i.i.i.i
  br label %loop2.i.i.i.i.i.i.i.i.i.i

loop2.i.i.i.i.i.i.i.i.i.i:                        ; preds = %loop2.end.i.i.i.i.i.i.i.i.i.i, %loop1.i.i.i.i.i.i.i.i.i.i
  br label %loop3.i.i.i.i.i.i.i.i.i.i

loop3.i.i.i.i.i.i.i.i.i.i:                        ; preds = %loop3.end.i.i.i.i.i.i.i.i.i.i, %loop2.i.i.i.i.i.i.i.i.i.i
  br label %loop4.i.i.i.i.i.i.i.i.i.i

loop4.i.i.i.i.i.i.i.i.i.i:                        ; preds = %loop4.i.i.i.i.i.i.i.i.i.i, %loop3.i.i.i.i.i.i.i.i.i.i
  call void @print_val(i32 10)
  call void @recursiveFunc(i32* nonnull @funcspec.arg.19)
  %exit_cond1.i.i.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond1.i.i.i.i.i.i.i.i.i.i, label %loop4.i.i.i.i.i.i.i.i.i.i, label %loop3.end.i.i.i.i.i.i.i.i.i.i

loop3.end.i.i.i.i.i.i.i.i.i.i:                    ; preds = %loop4.i.i.i.i.i.i.i.i.i.i
  %exit_cond2.i.i.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond2.i.i.i.i.i.i.i.i.i.i, label %loop3.i.i.i.i.i.i.i.i.i.i, label %loop2.end.i.i.i.i.i.i.i.i.i.i

loop2.end.i.i.i.i.i.i.i.i.i.i:                    ; preds = %loop3.end.i.i.i.i.i.i.i.i.i.i
  %exit_cond3.i.i.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond3.i.i.i.i.i.i.i.i.i.i, label %loop2.i.i.i.i.i.i.i.i.i.i, label %loop1.end.i.i.i.i.i.i.i.i.i.i

loop1.end.i.i.i.i.i.i.i.i.i.i:                    ; preds = %loop2.end.i.i.i.i.i.i.i.i.i.i
  %exit_cond4.i.i.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond4.i.i.i.i.i.i.i.i.i.i, label %loop1.i.i.i.i.i.i.i.i.i.i, label %recursiveFunc.18.exit.i.i.i.i.i.i.i.i.i

recursiveFunc.18.exit.i.i.i.i.i.i.i.i.i:          ; preds = %loop1.end.i.i.i.i.i.i.i.i.i.i
  %exit_cond1.i.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond1.i.i.i.i.i.i.i.i.i, label %loop4.i.i.i.i.i.i.i.i.i, label %loop3.end.i.i.i.i.i.i.i.i.i

loop3.end.i.i.i.i.i.i.i.i.i:                      ; preds = %recursiveFunc.18.exit.i.i.i.i.i.i.i.i.i
  %exit_cond2.i.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond2.i.i.i.i.i.i.i.i.i, label %loop3.i.i.i.i.i.i.i.i.i, label %loop2.end.i.i.i.i.i.i.i.i.i

loop2.end.i.i.i.i.i.i.i.i.i:                      ; preds = %loop3.end.i.i.i.i.i.i.i.i.i
  %exit_cond3.i.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond3.i.i.i.i.i.i.i.i.i, label %loop2.i.i.i.i.i.i.i.i.i, label %loop1.end.i.i.i.i.i.i.i.i.i

loop1.end.i.i.i.i.i.i.i.i.i:                      ; preds = %loop2.end.i.i.i.i.i.i.i.i.i
  %exit_cond4.i.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond4.i.i.i.i.i.i.i.i.i, label %loop1.i.i.i.i.i.i.i.i.i, label %recursiveFunc.16.exit.i.i.i.i.i.i.i.i

recursiveFunc.16.exit.i.i.i.i.i.i.i.i:            ; preds = %loop1.end.i.i.i.i.i.i.i.i.i
  %exit_cond1.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond1.i.i.i.i.i.i.i.i, label %loop4.i.i.i.i.i.i.i.i, label %loop3.end.i.i.i.i.i.i.i.i

loop3.end.i.i.i.i.i.i.i.i:                        ; preds = %recursiveFunc.16.exit.i.i.i.i.i.i.i.i
  %exit_cond2.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond2.i.i.i.i.i.i.i.i, label %loop3.i.i.i.i.i.i.i.i, label %loop2.end.i.i.i.i.i.i.i.i

loop2.end.i.i.i.i.i.i.i.i:                        ; preds = %loop3.end.i.i.i.i.i.i.i.i
  %exit_cond3.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond3.i.i.i.i.i.i.i.i, label %loop2.i.i.i.i.i.i.i.i, label %loop1.end.i.i.i.i.i.i.i.i

loop1.end.i.i.i.i.i.i.i.i:                        ; preds = %loop2.end.i.i.i.i.i.i.i.i
  %exit_cond4.i.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond4.i.i.i.i.i.i.i.i, label %loop1.i.i.i.i.i.i.i.i, label %recursiveFunc.14.exit.i.i.i.i.i.i.i

recursiveFunc.14.exit.i.i.i.i.i.i.i:              ; preds = %loop1.end.i.i.i.i.i.i.i.i
  %exit_cond1.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond1.i.i.i.i.i.i.i, label %loop4.i.i.i.i.i.i.i, label %loop3.end.i.i.i.i.i.i.i

loop3.end.i.i.i.i.i.i.i:                          ; preds = %recursiveFunc.14.exit.i.i.i.i.i.i.i
  %exit_cond2.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond2.i.i.i.i.i.i.i, label %loop3.i.i.i.i.i.i.i, label %loop2.end.i.i.i.i.i.i.i

loop2.end.i.i.i.i.i.i.i:                          ; preds = %loop3.end.i.i.i.i.i.i.i
  %exit_cond3.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond3.i.i.i.i.i.i.i, label %loop2.i.i.i.i.i.i.i, label %loop1.end.i.i.i.i.i.i.i

loop1.end.i.i.i.i.i.i.i:                          ; preds = %loop2.end.i.i.i.i.i.i.i
  %exit_cond4.i.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond4.i.i.i.i.i.i.i, label %loop1.i.i.i.i.i.i.i, label %recursiveFunc.12.exit.i.i.i.i.i.i

recursiveFunc.12.exit.i.i.i.i.i.i:                ; preds = %loop1.end.i.i.i.i.i.i.i
  %exit_cond1.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond1.i.i.i.i.i.i, label %loop4.i.i.i.i.i.i, label %loop3.end.i.i.i.i.i.i

loop3.end.i.i.i.i.i.i:                            ; preds = %recursiveFunc.12.exit.i.i.i.i.i.i
  %exit_cond2.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond2.i.i.i.i.i.i, label %loop3.i.i.i.i.i.i, label %loop2.end.i.i.i.i.i.i

loop2.end.i.i.i.i.i.i:                            ; preds = %loop3.end.i.i.i.i.i.i
  %exit_cond3.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond3.i.i.i.i.i.i, label %loop2.i.i.i.i.i.i, label %loop1.end.i.i.i.i.i.i

loop1.end.i.i.i.i.i.i:                            ; preds = %loop2.end.i.i.i.i.i.i
  %exit_cond4.i.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond4.i.i.i.i.i.i, label %loop1.i.i.i.i.i.i, label %recursiveFunc.10.exit.i.i.i.i.i

recursiveFunc.10.exit.i.i.i.i.i:                  ; preds = %loop1.end.i.i.i.i.i.i
  %exit_cond1.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond1.i.i.i.i.i, label %loop4.i.i.i.i.i, label %loop3.end.i.i.i.i.i

loop3.end.i.i.i.i.i:                              ; preds = %recursiveFunc.10.exit.i.i.i.i.i
  %exit_cond2.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond2.i.i.i.i.i, label %loop3.i.i.i.i.i, label %loop2.end.i.i.i.i.i

loop2.end.i.i.i.i.i:                              ; preds = %loop3.end.i.i.i.i.i
  %exit_cond3.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond3.i.i.i.i.i, label %loop2.i.i.i.i.i, label %loop1.end.i.i.i.i.i

loop1.end.i.i.i.i.i:                              ; preds = %loop2.end.i.i.i.i.i
  %exit_cond4.i.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond4.i.i.i.i.i, label %loop1.i.i.i.i.i, label %recursiveFunc.8.exit.i.i.i.i

recursiveFunc.8.exit.i.i.i.i:                     ; preds = %loop1.end.i.i.i.i.i
  %exit_cond1.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond1.i.i.i.i, label %loop4.i.i.i.i, label %loop3.end.i.i.i.i

loop3.end.i.i.i.i:                                ; preds = %recursiveFunc.8.exit.i.i.i.i
  %exit_cond2.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond2.i.i.i.i, label %loop3.i.i.i.i, label %loop2.end.i.i.i.i

loop2.end.i.i.i.i:                                ; preds = %loop3.end.i.i.i.i
  %exit_cond3.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond3.i.i.i.i, label %loop2.i.i.i.i, label %loop1.end.i.i.i.i

loop1.end.i.i.i.i:                                ; preds = %loop2.end.i.i.i.i
  %exit_cond4.i.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond4.i.i.i.i, label %loop1.i.i.i.i, label %recursiveFunc.6.exit.i.i.i

recursiveFunc.6.exit.i.i.i:                       ; preds = %loop1.end.i.i.i.i
  %exit_cond1.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond1.i.i.i, label %loop4.i.i.i, label %loop3.end.i.i.i

loop3.end.i.i.i:                                  ; preds = %recursiveFunc.6.exit.i.i.i
  %exit_cond2.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond2.i.i.i, label %loop3.i.i.i, label %loop2.end.i.i.i

loop2.end.i.i.i:                                  ; preds = %loop3.end.i.i.i
  %exit_cond3.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond3.i.i.i, label %loop2.i.i.i, label %loop1.end.i.i.i

loop1.end.i.i.i:                                  ; preds = %loop2.end.i.i.i
  %exit_cond4.i.i.i = call i1 @exit_cond()
  br i1 %exit_cond4.i.i.i, label %loop1.i.i.i, label %recursiveFunc.4.exit.i.i

recursiveFunc.4.exit.i.i:                         ; preds = %loop1.end.i.i.i
  %exit_cond1.i.i = call i1 @exit_cond()
  br i1 %exit_cond1.i.i, label %loop4.i.i, label %loop3.end.i.i

loop3.end.i.i:                                    ; preds = %recursiveFunc.4.exit.i.i
  %exit_cond2.i.i = call i1 @exit_cond()
  br i1 %exit_cond2.i.i, label %loop3.i.i, label %loop2.end.i.i

loop2.end.i.i:                                    ; preds = %loop3.end.i.i
  %exit_cond3.i.i = call i1 @exit_cond()
  br i1 %exit_cond3.i.i, label %loop2.i.i, label %loop1.end.i.i

loop1.end.i.i:                                    ; preds = %loop2.end.i.i
  %exit_cond4.i.i = call i1 @exit_cond()
  br i1 %exit_cond4.i.i, label %loop1.i.i, label %recursiveFunc.2.exit.i

recursiveFunc.2.exit.i:                           ; preds = %loop1.end.i.i
  %exit_cond1.i = call i1 @exit_cond()
  br i1 %exit_cond1.i, label %loop4.i, label %loop3.end.i

loop3.end.i:                                      ; preds = %recursiveFunc.2.exit.i
  %exit_cond2.i = call i1 @exit_cond()
  br i1 %exit_cond2.i, label %loop3.i, label %loop2.end.i

loop2.end.i:                                      ; preds = %loop3.end.i
  %exit_cond3.i = call i1 @exit_cond()
  br i1 %exit_cond3.i, label %loop2.i, label %loop1.end.i

loop1.end.i:                                      ; preds = %loop2.end.i
  %exit_cond4.i = call i1 @exit_cond()
  br i1 %exit_cond4.i, label %loop1.i, label %recursiveFunc.1.exit

recursiveFunc.1.exit:                             ; preds = %loop1.end.i
  ret i32 0
}

declare dso_local void @print_val(i32)

declare dso_local i1 @exit_cond()

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #0

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #0

attributes #0 = { argmemonly nofree nosync nounwind willreturn }

hmmmm, I guess it must not be your expected code, right?

For *this* example, I don't see which high level criteria is the problem:
compile-times,
code-size,
performance.

I think all of them would be affected. The impact on code-size is obvious. For the compile-time, it would take a long time to compile if we set the argument func-specialization-max-iters to 2000.

That's why I still prefer the approach of getting this foundation in place, because it is opt-in, then we can tune this further.

I prefer to use a small program for analysis. My workflow to optimize would be:

Run big workloads -> analyse the hot part -> extract a small example from the hot part -> optimize for the hot part -> prove the effects with the big workloads.

It is hard to optimize big workloads directly since many details are missing, so I think it is better to talk about a small example if we can. Generally it takes a lot of time to extract a small program from a workload and find the insight, but if we can talk about the small example directly, it is much better.

I can't follow any of this. There are only arbitrary numbers here. I can't follow the decision making behind this, because no rationale is given.

My bad. I mean we shouldn't specialize recursiveFunc an arbitrary number of times. The limit number may need more discussion.

If we run bin/opt -function-specialization -func-specialization-max-iters=10 -S %s, we could find that recursiveFunc would get specialized 18 times.

This is not true. It will get specialised 10 times.

Then I run bin/opt -function-specialization -func-specialization-max-iters=10 -inline -instcombine -simplifycfg -S %s, I got:
<SNIP>
hmmmm, I guess it must not be your expected code, right?

Hmmm, sorry, this is exactly what I expect. Look at the generated code:

_main:                                  ; @main
      stp     x20, x19, [sp, #-32]!           ; 16-byte Folded Spill
      stp     x29, x30, [sp, #16]             ; 16-byte Folded Spill
      adrp    x19, _funcspec.arg.17@PAGE
      add     x19, x19, _funcspec.arg.17@PAGEOFF
LBB1_1:
      mov     w0, #1
      bl      _print_val
LBB1_2:
      mov     w0, #2
      bl      _print_val
LBB1_3:
      mov     w0, #3
      bl      _print_val
LBB1_4:
      mov     w0, #4
      bl      _print_val
LBB1_5:
      mov     w0, #5
      bl      _print_val
LBB1_6:
      mov     w0, #6
      bl      _print_val
LBB1_7:
      mov     w0, #7
      bl      _print_val
LBB1_8:
      mov     w0, #8
      bl      _print_val
LBB1_9:
      mov     w0, #9
      bl      _print_val
      mov     x0, x19
      bl      _recursiveFunc
      bl      _exit_cond
      tbnz    w0, #0, LBB1_9
      bl      _exit_cond
      tbnz    w0, #0, LBB1_9
      bl      _exit_cond
      tbnz    w0, #0, LBB1_9
      bl      _exit_cond
      tbnz    w0, #0, LBB1_9
      bl      _exit_cond
      tbnz    w0, #0, LBB1_8
      bl      _exit_cond
      tbnz    w0, #0, LBB1_8
      bl      _exit_cond
      tbnz    w0, #0, LBB1_8
      bl      _exit_cond
      tbnz    w0, #0, LBB1_8
      bl      _exit_cond
      tbnz    w0, #0, LBB1_8
      bl      _exit_cond
      tbnz    w0, #0, LBB1_7
      bl      _exit_cond
      tbnz    w0, #0, LBB1_7
      bl      _exit_cond
      tbnz    w0, #0, LBB1_7
      bl      _exit_cond
      tbnz    w0, #0, LBB1_7
      bl      _exit_cond
      <SNIP: a few more>

That seems fairly optimal to me, so I don't know why we wouldn't be doing this, or what we would need to do differently.

I think all of them would be affected. The impact on code-size is obvious. For the compile-time, it would take a long time to compile if we set the argument func-specialization-max-iters to 2000.

I am sorry, but this is an unreasonable number. There is no point in discussing such a big number, because the default will never be even close to that, and if you do want to set it very high, then that's opt-in and your responsibility.

If we run bin/opt -function-specialization -func-specialization-max-iters=10 -S %s, we could find that recursiveFunc would get specialized 18 times.

This is not true. It will get specialised 10 times.

Sorry, I counted the suffix number directly. But it still doesn't make sense to me: recursiveFunc shouldn't be specialized so many times. I guess we are in consensus on this.

Then I run bin/opt -function-specialization -func-specialization-max-iters=10 -inline -instcombine -simplifycfg -S %s, I got:
<SNIP>
hmmmm, I guess it must not be your expected code, right?

Hmmm, sorry, this is exactly what I expect. Look at the generated code:

<SNIP: same generated code as quoted above>

That seems fairly optimal to me, so I don't know why we wouldn't be doing this, or what we would need to do differently.

hmmmm, the generated code does not look good enough to me. There are too many redundancies in the main function.
The key point I want to make is that we shouldn't specialize recursiveFunc so many times. If you feel it is OK, then how about setting func-specialization-max-iters to 20 or 100? It should be easy to agree that there should be a limit for recursive functions.

I think all of them would be affected. The impact on code-size is obvious. For the compile-time, it would take a long time to compile if we set the argument func-specialization-max-iters to 2000.

I am sorry, but this is an unreasonable number. There is no point in discussing such a big number, because the default will never be even close to that, and if you do want to set it very high, then that's opt-in and your responsibility.

I just want to note that the compile time increases significantly with the parameter func-specialization-max-iters.
For the above example, I got the following results by using opt to compile
(compile options: -function-specialization -func-specialization-max-iters=<N> -inline -instcombine -simplifycfg):

func-specialization-max-iters | compile time (seconds)
                            1 | 0.0096
                           10 | 0.1253
                           20 | 0.4127
                           50 | 2.4648
                          100 | 5.9211

hmm, that does not look good.


Since the discussion involves different perspectives on reviewing a patch, it may be helpful to hear the opinions of other reviewers: @fhahn @jaykang10 @chill

Matt added a subscriber: Matt.Jul 28 2021, 2:57 PM

Looks like we agree after all! :-) Because I fully agree with this:

The key point I want to make is that we shouldn't specialize recursiveFunc so many times. If you feel it is OK, then how about setting func-specialization-max-iters to 20 or 100?

An early version of the function specialisation pass had a so-called "aggressive mode". This meant that it was (still only) doing 10 iterations. That's roughly what I envision it could be. But to be clear, I expect it to be a (very) low number, simply because when we start to specialise more, it's going to cost more (compile-time). There's indeed no way around that, and that's why we agree on this. The cost of specialising more is not a problem with the implementation of this patch (but again, simply because of the resulting additional functions/instructions).

About what the value of -func-specialization-max-iters= should be exactly, I don't know yet. I would like to defer that question to some point later, when I pick up enabling function specialisation. I would like to see if we can get it enabled first with -func-specialization-max-iters=1, and afterwards see if we can increase it, or whether it needs to remain opt-in.

Like I said, I think we agree. That's why I would like to see first if we can finish this review together, which I think would be a nicer result than getting other people involved at this point.

It looks like the different opinions come from different perspectives. I prefer to look into the details.

Or let me ask another question: do you think it matters that recursive functions get specialized many times when func-specialization-max-iters increases? If yes, I think I could fix it in successive patches. It shouldn't be too hard.

Apart from this, the overall patch looks good to me. If we can get consensus on the above question, I will look into the implementation details.

So yes, I think we also agree on this. Let me summarise this then:

  • At the moment the number of specialised recursive functions grows linearly by increasing func-specialization-max-iters. But that's by design, to support recursive functions.
  • And that obviously comes at a compile-time cost. That's why, realistically, in its current form func-specialization-max-iters will never be a big number (say e.g. more than 20).
  • Since compile-time is directly controlled by func-specialization-max-iters, that will remain at 1 for the moment. Users can opt-in and pay the cost if they want. Our ambition is to set it to something higher, but that depends on the next steps.

Next, and crucially, to address your points:

  • 100% agreed of course: if we can reduce compile times, or come up with better cost-modelling, we should do that!
  • I think it will be very easy to add that as a follow up. For example, passing an iteration number to the bonus/penalty calculation function is trivial. We just need to know what exactly the changes to the cost-model should be. I think this data is missing at the moment, see the next point.
  • The 10 iterations for the "aggressive mode" was not a coincidence: this was chosen to completely get rid of a recursive function in exchange2 from SPEC. When we start looking into recursive functions and increasing the max iterations, we will still want to support that case, and hopefully others too to make it general, and then find a good compile-time trade-off.

So yes, I am happy to support you or look into any improvements myself in this area to improve things if we can.

Yeah, now things are getting clearer. The overall patch looks good to me. I will try to reduce the number of times a recursive function gets specialized when func-specialization-max-iters increases in follow-up patches.

llvm/lib/Transforms/Utils/SCCPSolver.cpp
1259–1261 ↗(On Diff #360772)

When would PI be nullptr? It looks like we didn't update the solver in time.

SjoerdMeijer added inline comments.Aug 3 2021, 12:38 AM
llvm/lib/Transforms/Utils/SCCPSolver.cpp
1259–1261 ↗(On Diff #360772)

Yeah, that is a good point.
I have looked into this, but did not yet get to the bottom of it. Either way, I thought that simply returning is more graceful than an assert (which does not even trigger in a release build), especially if it is a matter of not updating the solver in time. I thought the solver should be robust against these things, and thus that returning is better.

ChuanqiXu added inline comments.Aug 3 2021, 3:04 AM
llvm/lib/Transforms/Utils/SCCPSolver.cpp
1259–1261 ↗(On Diff #360772)

It looks like this diff triggers the assertion and simply removes it. If that is the case, we need more explanation for removing it. Although there are redundant assertions, we can't remove them arbitrarily.
If this diff doesn't trigger the assertion, I think we could drop this change and leave it to follow-up patches (like making SCCPSolver more robust against not updating everything in the beginning).

I have removed the changes in the SCCP solver, and have restored the cleanup of the ssa_copy intrinsics that are introduced by the solver.

Removing these intrinsics was already necessary for supporting recursive functions, but it is also the easiest way to keep the solver's internal state sane for these intrinsics while invoking it iteratively.

SjoerdMeijer added inline comments.Aug 3 2021, 12:55 PM
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
532

This is now removeSSACopy, see line 188.

ChuanqiXu accepted this revision.Aug 3 2021, 6:58 PM

It looks good now. Although I agree that it is better to handle it in SCCPSolver, it could be done by an independent refactoring patch.

This revision is now accepted and ready to land.Aug 3 2021, 6:58 PM

Thanks for the help with this patch, and that sounds like a good plan to me.

This revision was automatically updated to reflect the committed changes.