This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Limit Complex Addressing mode by number of BasicBlocks to traverse
AbandonedPublic

Authored by skatkov on Nov 14 2018, 1:53 AM.

Details

Summary

The finding the combined complex addressing mode in some cases becomes
expensive. This CL limits the work by number of basic blocks to traverse.

This is a fix for PR39625.

Diff Detail

Event Timeline

skatkov created this revision.Nov 14 2018, 1:53 AM
wmi added a subscriber: wmi.Nov 15 2018, 7:47 PM
bjope added a comment.Nov 20 2018, 2:37 PM

Basically I think the code looks ok. However, I'm not so familiar with this algorithm so it is hard to comment about the actual solution.

My understanding is that you introduce a threshold, and if the size of the TraverseOrder vector grows past the threshold we bail out from findCommon. So what is the impact of this?
I assume it means that we limit the amount of "Complex Addressing mode" optimizations somehow. Is this limit only hit for "large" programs? When compiling a program that hits the threshold, do we lose some optimizations or will the amount of Complex Addressing mode optimizations in such a program be reduced significantly?

How did you choose the current threshold? (that is probably something people want to know when trying to finetune this 4 years from now, so it could be nice to say something about it in the commit msg even if it just is an estimate)

Basically I think the code looks ok. However, I'm not so familiar with this algorithm so it is hard to comment about the actual solution.

My understanding is that you introduce a threshold, and if the size of the TraverseOrder vector grows past the threshold we bail out from findCommon. So what is the impact of this?
I assume it means that we limit the amount of "Complex Addressing mode" optimizations somehow. Is this limit only hit for "large" programs? When compiling a program that hits the threshold, do we lose some optimizations or will the amount of Complex Addressing mode optimizations in such a program be reduced significantly?

How did you choose the current threshold? (that is probably something people want to know when trying to finetune this 4 years from now, so it could be nice to say something about it in the commit msg even if it just is an estimate)

Hi Bjorn, thank you for looking into this patch.

The optimization itself does the following:
Let's we have load from pointer p. We traverse all paths from p to original pointer skipping phi nodes and selects. Let's we found that p can be actually p.1, p.2, .. p.N. Each of p.i = gv.i + base.i + index.i * scale.i + offset.i.
If for all i: all parts are the same we apply optimization (move pointer computation close to loop to create a complex addressing load.
If for all i the parts are different only by one field gv, base or index then we go to complex case which is targeted by this patch.

Say the difference in the base. Generally we want to build a number of phi nodes to have right base at load.
TraverseOrder size is actually approximately the number of basic blocks between load's BB and all p.i's BB traversing be predecessors from load's BB.
The best benefit you get if all Phi nodes for base are already exists, so you will not create new Phi nodes but probably removes redundant ones (to get p).

The number 100 has no specific meaning. It is some threshold which is big enough to accept more or less big CFG.
The PR39625 contains a test which seems a corner case for this optimization. The distance between load and original p.i reaches the values about 16000 there. I did several runs with different values of this threshold and it choose 100 as giving the reasonable compile time even in debug build.

I hope it helps.

bjope added subscribers: haicheng, chandlerc.

I don't have any objections to the patch myself, but I don't really have the knowledge to understand if this could be bad for some important use cases either. A user can ofcourse override the threshold if needed, so I guess that makes this patch a little bit less "dangerous".

It looks like @chandlerc and @haicheng has contributed to this pass in the past. Can perhaps one of you help out and take a quick look at this to see if introducing this threshold is a reasonable fix?

Thank you, Bjorn.

I guess @john.brawn would be best person to review this patch because he reviewed an original implementation.

FYI: in parallel there is an idea how to simplify algorithm significantly. I'll try to implement it...

john.brawn accepted this revision.Nov 26 2018, 7:40 AM

Looks OK as a fix for the reported bug, with one minor nitpick. Taking a look at PR39625 it looks like there's an underlying problem where we insert a load of useless placeholders, e.g. for

declare void @otherfn()

define i32 @fn(i32* %arg1, i32* %arg2) {
entry:
  %gep1 = getelementptr i32, i32* %arg1, i32 4
  %gep2 = getelementptr i32, i32* %arg1, i32 8
  br i1 undef, label %a1, label %b1

a1:
  call void @otherfn()
  br label %middle

b1:
  call void @otherfn()
  br label %middle

middle:
  br i1 undef, label %a2, label %b2

a2:
  call void @otherfn()
  br label %end

b2:
  call void @otherfn()
  br label %end

end:
  %phi = phi i32* [ %gep1, %a2 ], [ %gep2, %b2 ]
  %val = load i32, i32* %phi, align 4
  ret i32 %val
}

we insert a total of 9 placeholders, but actually we need only one and the rest get deleted later. So I think there's something that could be done to improve that, but that's no reason not to do this fix.

lib/CodeGen/CodeGenPrepare.cpp
3095

This should have a << "\n" on the end.

This revision is now accepted and ready to land.Nov 26 2018, 7:40 AM

Looks OK as a fix for the reported bug, with one minor nitpick. Taking a look at PR39625 it looks like there's an underlying problem where we insert a load of useless placeholders, e.g. for

declare void @otherfn()

define i32 @fn(i32* %arg1, i32* %arg2) {
entry:
  %gep1 = getelementptr i32, i32* %arg1, i32 4
  %gep2 = getelementptr i32, i32* %arg1, i32 8
  br i1 undef, label %a1, label %b1

a1:
  call void @otherfn()
  br label %middle

b1:
  call void @otherfn()
  br label %middle

middle:
  br i1 undef, label %a2, label %b2

a2:
  call void @otherfn()
  br label %end

b2:
  call void @otherfn()
  br label %end

end:
  %phi = phi i32* [ %gep1, %a2 ], [ %gep2, %b2 ]
  %val = load i32, i32* %phi, align 4
  ret i32 %val
}

we insert a total of 9 placeholders, but actually we need only one and the rest get deleted later. So I think there's something that could be done to improve that, but that's no reason not to do this fix.

This is exactly an improvement I wrote in the comment above...

Hi @john.brawn, could you please take a look at https://reviews.llvm.org/D54932. It makes a compile time improvement and I guess we will not need this patch for a while.

skatkov abandoned this revision.Nov 29 2018, 12:44 AM

abandon in favor of D54932.