This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] vector cost model add cost to extract i1
ClosedPublic

Authored by RolandF on Jul 20 2023, 11:53 AM.

Details

Summary

Try to avoid some unprofitable predication on PPC. Recognize in the cost model that computing on i1 values will require extra mask or compare operation.

Diff Detail

Event Timeline

RolandF created this revision.Jul 20 2023, 11:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 20 2023, 11:53 AM
RolandF requested review of this revision.Jul 20 2023, 11:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 20 2023, 11:53 AM
RolandF added a subscriber: power-llvm-team.
RolandF updated this revision to Diff 543043.Jul 21 2023, 1:32 PM

Oops C++ fail and missed some tests.

shchenz added inline comments.Jul 21 2023, 8:40 PM
llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
738

I must miss something. I compile following cases:

define i1 @ext_ext_or_reduction_v4i1(<4 x i1> %z) {
  %z1 = extractelement <4 x i1> %z, i32 1
  ret i1 %z1
}
define i32 @ext_ext_or_reduction_v4i32(<4 x i32> %z) {
  %z1 = extractelement <4 x i32> %z, i32 1
  ret i32 %z1
}

Seems llc generates exactly same instructions for them at both pwr9 and pwr8. So don't understand why here we need extra cost for i1 types, the i1 type must have some kinds of users?

RolandF added inline comments.Jul 24 2023, 8:17 AM
llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
738

You will see the extra overhead if you compile the provided test case. But looking at your example, notice that the value produced has 64 or 32 live bits - there are garbage bits in the results. The return performs no calculation, it just functions like a copy, but the user would still have to get rid of those bits. The uses will be scalar code - if we try to charge those with the overhead we will make the scalar loop cost more too, plus scalar code will already have the masking code there.

shchenz accepted this revision.Jul 24 2023, 7:26 PM

I think the extra cost considering for i1 type looks reasonable.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
738

I see. And pwr8 seems uses more rotate/clear instructions than pwr9 does as pwr9 has vextubrx. I guess we don't need to model that detail for scalar operations?

llvm/test/Transforms/LoopVectorize/PowerPC/predcost.ll
2

nit: I run the bugpoint and get following narrow down input:

target datalayout = "e-m:e-Fn32-i64:64-n32:64-S128-v256:256:256-v512:512:512"
target triple = "powerpc64le-unknown-linux-gnu"

define dso_local void @_tc(ptr nocapture noundef %aaa, i64 noundef %bbb) local_unnamed_addr {
entry:
  br label %for.body

for.cond.cleanup.loopexit:                        ; preds = %for.inc
  ret void

for.body:                                         ; preds = %for.inc, %entry
  %i.08 = phi i64 [ %inc, %for.inc ], [ 0, %entry ]
  %arrayidx = getelementptr inbounds i8, ptr %aaa, i64 %i.08
  %0 = load i8, ptr %arrayidx, align 1
  %cmp1 = icmp eq i8 %0, 0
  br i1 %cmp1, label %if.then, label %for.inc

if.then:                                          ; preds = %for.body
  store i8 32, ptr %arrayidx, align 1
  br label %for.inc

for.inc:                                          ; preds = %if.then, %for.body
  %inc = add nuw nsw i64 %i.08, 1
  %exitcond.not = icmp eq i64 %inc, %bbb
  br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body
}
This revision is now accepted and ready to land.Jul 24 2023, 7:26 PM
RolandF added inline comments.Jul 26 2023, 11:41 AM
llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
738

For pwr8 there might be 1 added instruction or there might be 2. Uses in arithmetic/logical operations should only need 1. So I could use either 1 or 2 and 1 is enough to prevent the unprofitable case.

llvm/test/Transforms/LoopVectorize/PowerPC/predcost.ll
2

thanks, will update on commit

This revision was landed with ongoing or failed builds.Aug 14 2023, 2:04 PM
This revision was automatically updated to reflect the committed changes.