Many passes stuggle with huge basic blocks. While we definitely should address
the issues in the passes in the first place, often we discover such places too
late when the compiler 'hangs'. This patch adds a 'fuse' to prevent this from
happening by splitting huge basic blocks. It shouldn't happen in usual
scenarios, but should help us in corner cases that occur here and there. This
does not apply for -O3, as at -O3 we're usually willing to sacrifice compile
time for the sake of any extra performance.
Details
Diff Detail
- Repository
- rL LLVM
- Build Status
Buildable 16411 Build 16411: arc lint + arc unit
Event Timeline
BB->size() has multiple problems: one, it's linear time, and two, it doesn't ignore debug info.
CodeGenPrepare doesn't run at -O0.
BB->size() has multiple problems: one, it's linear time, and two, it doesn't ignore debug info.
What would be a better way to get BB size?
CodeGenPrepare doesn't run at -O0.
Correct, in the current shape the patch aims only at -Os. Do you think it would be better to have a separate pass for it and schedule it at O0 too?
Thanks,
Michael
What would be a better way to get BB size?
Given your algorithm, I don't understand why you need to get the BB size in the first place; you can just iterate over every block, and split when you hit the threshold.
Correct, in the current shape the patch aims only at -Os. Do you think it would be better to have a separate pass for it and schedule it at O0 too?
The key question is whether we're using SelectionDAG ISel, since the blocks will be un-split after ISel anyway. We don't use SelectionDAG ISel for most blocks at -O0, but we for a few. Maybe not worth worrying about.
llvm/lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
491 | You probably need to skip over all PHI nodes/landingpads/etc. so you don't try to split a block in an impossible place. And I think some Windows exception-handling blocks can't be split? (It would be unusual to have a block with over 1000 PHI nodes, but not impossible.) And like I mentioned before, you need to skip debug into intrinsics. This should have a testcase to exercise this logic (you can mess with the threshold to keep the testcase small). |
- Rewrite algorithm to Avoid using BB->size().
- Skip PHI-nodes and EH-pads.
- Don't split on terminators.
- Add tests.
Hi Eli,
Thanks for the input, please take a look at the updated patch. For now I didn't introduce a logic to skip debug intrinsics - is it needed from a correctness point of view? I've seen test-cases with thousands of llvm.dbg.value intrinsics that would also benefit from splitting (provided it's legal). Also, I wonder if we actually should only do that on -Os - maybe it would make sense to do that on -O3 too with a significantly high threshold. What do you think?
Thanks,
Michael
Adrian convinced me that we need to properly ignore debug-info intrinsics to guarantee the same code generation with and without debug info. I've updated the patch.
Thanks,
Michael
I don't know how I feel about this patch. It looks like it's papering over a huge problem, which is basically the fact that some passes down the road are either quadratic or linear with large constant factor.
If we have examples of the broken passes, maybe we should consider whether it's feasible to fix them instead of applying this hack here?
llvm/lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
218–221 | How was this picked? |
That said, the worklist algorithm you picked seems correct.
Another question that I have for you is: how is this going to work when GlobalIsel will become the standard?
My understanding is that cross-BB codegen will remove the need for CodegenPrepare, so if we start relying on BBs to be split we need to preserve this functionality somehow.
I don't know how I feel about this patch. It looks like it's papering over a huge problem, which is basically the fact that some passes down the road are either quadratic or linear with large constant factor.
If we have examples of the broken passes, maybe we should consider whether it's feasible to fix them instead of applying this hack here?
We do have examples of such passes, and we should fix them, I agree. However, I don't consider this patch as a fix for them. The purpose of this patch is to prevent compiler hangs in future - even when we fix the known issues, there is no guarantee that there are no more places like them. When we specifically want to look for such problematic spots, we can always set the option to -1, but by default it will just save us from "compiler hangs" (at least, from those coming from backend).
How was this picked?
The threshold was picked kind of randomly with the intention to make it high enough so that it doesn't affect usual cases, and in the bad cases the compiler still finishes in a reasonable time. Also, when "bad" situation occurs we still should be able to identify the problematic pass by unusually high relative time consumed.
Another question that I have for you is: how is this going to work when GlobalIsel will become the standard?
My understanding is that cross-BB codegen will remove the need for CodegenPrepare, so if we start relying on BBs to be split we need to preserve this functionality somehow.
It probably wouldn't work for GlobalISel (at least, to my understanding), but since GlobalISel operates on a wider scope, I expect that non-linearities would be discovered and fixed much quicker.
Thanks,
Michael
this does not feel like the right solution, but even if it is, this will hurt us; we *commonly* have shaders with single basic blocks exceeding 1000 instructions, and splitting them could sabotage scheduling quite badly. in the worst case, it could guarantee spilling, by splitting the block at a point that creates too many live values between the top and bottom.
How was this picked?