This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Remove IR SpeculativeExecution pass from codegen pipeline
ClosedPublic

Authored by foad on Jul 21 2022, 5:31 AM.

Details

Summary

This pass seems to have very little effect because all it does is hoist
some instructions, but it is followed later in the codegen pipeline by
the IR CodeSinking pass which does the opposite.

Diff Detail

Event Timeline

foad created this revision.Jul 21 2022, 5:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2022, 5:31 AM
foad requested review of this revision.Jul 21 2022, 5:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2022, 5:31 AM
foad added inline comments.Jul 21 2022, 5:33 AM
llvm/test/CodeGen/AMDGPU/select-opt.ll
144

In this function SpeculativeExecution hoisted %cmp2 into %if0, which got in the way of CSE removing %cmp2 altogether. With this patch the resulting code is much simpler.

I doubt we actually want to remove this

foad added a comment.Jul 21 2022, 5:59 AM

I doubt we actually want to remove this

Can you explain why it was added (D20304 does not say) or what it does that does not get undone by code sinking?

I doubt we actually want to remove this

Can you explain why it was added (D20304 does not say) or what it does that does not get undone by code sinking?

I remember sorely needing this for a long time, and it was added specifically for GPUs. I think we would need compelling benchmarks that it doesn't do anything. It's possible SimplifyCFG has gotten more aggressive since this was added?

foad added a comment.Jul 21 2022, 7:44 AM

I've looked at the effect of this patch on our corpus of 10000 graphics shaders (compiled for gfx1030 using a frontend that includes CodeSinking in the IR optimization pipeline).

Out of a total of ~8 million instructions, the instruction frequency deltas look like this:

-89	2935	2846	v_lshl_add_u32
-9	3447	3438	v_mad_u64_u32
-6	59035	59029	s_load_dwordx4
-4	115017	115013	s_clause
-3	56723	56720	s_load_dwordx8
-3	61717	61714	s_mov_b64
-2	1332	1330	v_cmpx_gt_u32
-1	2609	2608	v_cmpx_eq_u32
1	11908	11909	v_cmp_eq_u32
1	257	258	v_ceil_f32
1	5859	5860	v_or_b32
2	1468	1470	v_and_or_b32
2	25853	25855	v_and_b32
2	2924	2926	v_cmp_gt_u32
3	15607	15610	s_and_saveexec_b64
3	6897	6900	s_load_dwordx16
4	1211934	1211938	v_mul_f32
5	14578	14583	s_cbranch_vccnz
5	23221	23226	s_branch
5	301760	301765	s_waitcnt
5	34280	34285	s_and_b64
5	9682	9687	v_cmp_ngt_f32
5	98695	98700	s_buffer_load_dword
9	12325	12334	v_mul_lo_u32
24	760	784	v_subrev_nc_u32
29	187496	187525	v_mov_b32
101	19167	19268	v_lshlrev_b32
144	43422	43566	v_add_nc_u32

For each instruction the 3 numbers are: (delta from A to B), (A = # of occurrences before this patch), (B = # of occurrences after this patch).

So there are maybe 100 shift/add pairs that are no longer combined to v_lshl_add_u32 due to ending up in different basic blocks, but I have looked at a couple of examples and I don't there is anything systematically wrong there, it is just bad luck (and globalisel might fix it anyway by being able to pattern match across basic blocks). Apart from that the diffs seem to be way down in the noise.

foad edited the summary of this revision. (Show Details)Jul 21 2022, 7:45 AM

The test diff looks surprisingly small. Intuitively speaking, the most interesting instructions for speculative execution are loads. You wouldn't necessarily see a change in the instruction frequency there...

I’d expect increase in select instructions

foad added a comment.Jul 22 2022, 7:27 AM

The test diff looks surprisingly small. Intuitively speaking, the most interesting instructions for speculative execution are loads. You wouldn't necessarily see a change in the instruction frequency there...

Maybe I should have said up front that the SpeculativeExecution pass does not actually do any speculative execution! All it does is "hoists instructions to enable speculative execution", and most of that hoisting is undone by the code sinking pass that we run later.

If I remove CodeSinking but don't remove SpeculativeExecution then I see bad effects where we start with:

bb:
  %gep1 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 1
  %load1 = load volatile i32, i32 addrspace(5)* %gep1, align 4

and then SpeculativeExecution hoists the gep:

  %gep1 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 1
  br i1 %cmp, label %bb, label %ret
bb:
  %load1 = load volatile i32, i32 addrspace(5)* %gep1, align 4

and then isel can no longer match the base-plus-constant addressing mode for the load.

foad added a comment.Aug 2 2022, 7:54 AM

Ping! I think this is close to being a no-brainer. All SpeculativeExecution does is to hoist some instructions, and virtually all of them are later sunk by the IR code sinking pass.

I've looked at the effect of this patch on our corpus of 10000 graphics shaders (compiled for gfx1030 using a frontend that includes CodeSinking in the IR optimization pipeline).

I redid this analysis and the effect is even smaller than I thought. Only 1 of the 10000 pipelines is affected, and the instruction frequency delta (out of ~900 instructions in that pipeline) is:

2 15291 15293 v_mul_lo_u32
-2 6442 6440 v_or_b32
-1 44265 44264 v_add_nc_u32
nhaehnle accepted this revision.Aug 2 2022, 9:15 AM

Makes sense to me.

This revision is now accepted and ready to land.Aug 2 2022, 9:15 AM
This revision was landed with ongoing or failed builds.Aug 2 2022, 9:35 AM
This revision was automatically updated to reflect the committed changes.