This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][CostModel] Refine cost model for control-flow instructions.
ClosedPublic

Authored by dfukalov on Feb 16 2021, 11:21 AM.

Details

Summary

Added cost estimation for switch instruction, updated costs of branches, fixed
phi cost.
Had to increase -amdgpu-unroll-threshold-if default value since conditional
branch cost (size) was corrected to higher value.
Test renamed to "control-flow.ll".

Removed redundant code in X86TTIImpl::getCFInstrCost() and
PPCTTIImpl::getCFInstrCost().

Diff Detail

Unit TestsFailed

Event Timeline

dfukalov created this revision.Feb 16 2021, 11:21 AM
dfukalov requested review of this revision.Feb 16 2021, 11:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 16 2021, 11:21 AM
Herald added a subscriber: wdng. · View Herald Transcript

AMDGPU estimates look reasonable, but bumping unroll threshold twice looks suspicious. Is that to accommodate some loops with switches?
I am afraid this can have quite unpredictable impact on overall performance. Did you run any perf tests?

It seems to me this threshold bump partially compensated by cbr cost increase in all cases of unroll loops with ifs, where it is multiplicated by trip count.
This threshold bumped because of test/CodeGen/AMDGPU/unroll.ll, where started to fail

; CHECK-LABEL: @unroll_for_if
; CHECK: entry:
; CHECK-NEXT: getelementptr
; CHECK-NEXT: store
; CHECK-NEXT: getelementptr
; CHECK-NEXT: store
; CHECK-NOT: br
define amdgpu_kernel void @unroll_for_if(i32 addrspace(5)* %a) {
entry:
  br label %for.body
for.body:                                         ; preds = %entry, %for.inc
  %i1 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %and = and i32 %i1, 1
  %tobool = icmp eq i32 %and, 0
  br i1 %tobool, label %for.inc, label %if.then
if.then:                                          ; preds = %for.body
  %0 = sext i32 %i1 to i64
  %arrayidx = getelementptr inbounds i32, i32 addrspace(5)* %a, i64 %0
  store i32 0, i32 addrspace(5)* %arrayidx, align 4
  br label %for.inc
for.inc:                                          ; preds = %for.body, %if.then
  %inc = add nuw nsw i32 %i1, 1
  %cmp = icmp ult i32 %inc, 48
  br i1 %cmp, label %for.body, label %for.end

for.end:                                          ; preds = %for.cond
  ret void
}

since cbr code size cost increased (needed increase to 250) plus phi became non-free (+50 to 300).
Perhaps at this time we should set cbr code size estimation not 4 but 3 (2 exec mask manipulations) and collect more statistics.

It seems to me this threshold bump partially compensated by cbr cost increase in all cases of unroll loops with ifs, where it is multiplicated by trip count.
This threshold bumped because of test/CodeGen/AMDGPU/unroll.ll, where started to fail

; CHECK-LABEL: @unroll_for_if
; CHECK: entry:
; CHECK-NEXT: getelementptr
; CHECK-NEXT: store
; CHECK-NEXT: getelementptr
; CHECK-NEXT: store
; CHECK-NOT: br
define amdgpu_kernel void @unroll_for_if(i32 addrspace(5)* %a) {
entry:
  br label %for.body
for.body:                                         ; preds = %entry, %for.inc
  %i1 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %and = and i32 %i1, 1
  %tobool = icmp eq i32 %and, 0
  br i1 %tobool, label %for.inc, label %if.then
if.then:                                          ; preds = %for.body
  %0 = sext i32 %i1 to i64
  %arrayidx = getelementptr inbounds i32, i32 addrspace(5)* %a, i64 %0
  store i32 0, i32 addrspace(5)* %arrayidx, align 4
  br label %for.inc
for.inc:                                          ; preds = %for.body, %if.then
  %inc = add nuw nsw i32 %i1, 1
  %cmp = icmp ult i32 %inc, 48
  br i1 %cmp, label %for.body, label %for.end

for.end:                                          ; preds = %for.cond
  ret void
}

since cbr code size cost increased (needed increase to 250) plus phi became non-free (+50 to 300).
Perhaps at this time we should set cbr code size estimation not 4 but 3 (2 exec mask manipulations) and collect more statistics.

Can you do some performance testing please?

dfukalov updated this revision to Diff 336553.Apr 9 2021, 1:18 PM

Updated threshold and conditional branch cost-size after performance testing.

rampitec accepted this revision.Apr 9 2021, 1:36 PM

Perf numbers look reasonable to be now.

This revision is now accepted and ready to land.Apr 9 2021, 1:36 PM
This revision was landed with ongoing or failed builds.Apr 9 2021, 11:20 PM
This revision was automatically updated to reflect the committed changes.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp