This pass seems to have very little effect because all it does is hoist
some instructions, but it is followed later in the codegen pipeline by
the IR CodeSinking pass which does the opposite.
Details
- Reviewers
- nhaehnle 
- Group Reviewers
- Restricted Project 
- Commits
- rGe301e071ba1a: [AMDGPU] Remove IR SpeculativeExecution pass from codegen pipeline
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
| llvm/test/CodeGen/AMDGPU/select-opt.ll | ||
|---|---|---|
| 144 | In this function SpeculativeExecution hoisted %cmp2 into %if0, which got in the way of CSE removing %cmp2 altogether. With this patch the resulting code is much simpler. | |
Can you explain why it was added (D20304 does not say) or what it does that does not get undone by code sinking?
I remember sorely needing this for a long time, and it was added specifically for GPUs. I think we would need compelling benchmarks that it doesn't do anything. It's possible SimplifyCFG has gotten more aggressive since this was added?
I've looked at the effect of this patch on our corpus of 10000 graphics shaders (compiled for gfx1030 using a frontend that includes CodeSinking in the IR optimization pipeline).
Out of a total of ~8 million instructions, the instruction frequency deltas look like this:
-89 2935 2846 v_lshl_add_u32 -9 3447 3438 v_mad_u64_u32 -6 59035 59029 s_load_dwordx4 -4 115017 115013 s_clause -3 56723 56720 s_load_dwordx8 -3 61717 61714 s_mov_b64 -2 1332 1330 v_cmpx_gt_u32 -1 2609 2608 v_cmpx_eq_u32 1 11908 11909 v_cmp_eq_u32 1 257 258 v_ceil_f32 1 5859 5860 v_or_b32 2 1468 1470 v_and_or_b32 2 25853 25855 v_and_b32 2 2924 2926 v_cmp_gt_u32 3 15607 15610 s_and_saveexec_b64 3 6897 6900 s_load_dwordx16 4 1211934 1211938 v_mul_f32 5 14578 14583 s_cbranch_vccnz 5 23221 23226 s_branch 5 301760 301765 s_waitcnt 5 34280 34285 s_and_b64 5 9682 9687 v_cmp_ngt_f32 5 98695 98700 s_buffer_load_dword 9 12325 12334 v_mul_lo_u32 24 760 784 v_subrev_nc_u32 29 187496 187525 v_mov_b32 101 19167 19268 v_lshlrev_b32 144 43422 43566 v_add_nc_u32
For each instruction the 3 numbers are: (delta from A to B), (A = # of occurrences before this patch), (B = # of occurrences after this patch).
So there are maybe 100 shift/add pairs that are no longer combined to v_lshl_add_u32 due to ending up in different basic blocks, but I have looked at a couple of examples and I don't there is anything systematically wrong there, it is just bad luck (and globalisel might fix it anyway by being able to pattern match across basic blocks). Apart from that the diffs seem to be way down in the noise.
The test diff looks surprisingly small. Intuitively speaking, the most interesting instructions for speculative execution are loads. You wouldn't necessarily see a change in the instruction frequency there...
Maybe I should have said up front that the SpeculativeExecution pass does not actually do any speculative execution! All it does is "hoists instructions to enable speculative execution", and most of that hoisting is undone by the code sinking pass that we run later.
If I remove CodeSinking but don't remove SpeculativeExecution then I see bad effects where we start with:
bb:
  %gep1 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 1
  %load1 = load volatile i32, i32 addrspace(5)* %gep1, align 4and then SpeculativeExecution hoists the gep:
  %gep1 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 1
  br i1 %cmp, label %bb, label %ret
bb:
  %load1 = load volatile i32, i32 addrspace(5)* %gep1, align 4and then isel can no longer match the base-plus-constant addressing mode for the load.
Ping! I think this is close to being a no-brainer. All SpeculativeExecution does is to hoist some instructions, and virtually all of them are later sunk by the IR code sinking pass.
I've looked at the effect of this patch on our corpus of 10000 graphics shaders (compiled for gfx1030 using a frontend that includes CodeSinking in the IR optimization pipeline).
I redid this analysis and the effect is even smaller than I thought. Only 1 of the 10000 pipelines is affected, and the instruction frequency delta (out of ~900 instructions in that pipeline) is:
2 15291 15293 v_mul_lo_u32 -2 6442 6440 v_or_b32 -1 44265 44264 v_add_nc_u32
In this function SpeculativeExecution hoisted %cmp2 into %if0, which got in the way of CSE removing %cmp2 altogether. With this patch the resulting code is much simpler.