This is an archive of the discontinued LLVM Phabricator instance.

[CostModel] Remove VF from IntrinsicCostAttributes
ClosedPublic

Authored by dmgreen on Jan 23 2021, 12:11 PM.

Details

Summary

getIntrinsicInstrCost takes an IntrinsicCostAttributes holding various parameters of the intrinsic being costed. It can be called with a scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction (RetTy==Vector, VF==1), or from the vectorizer with a scalar type and vector width (RetTy==Scalar, VF>1). RetTy==Vector with VF>1 is considered an error. Both of the vector modes are expected to be treated the same, but because this is confusing, many backends end up getting it wrong.

Instead of trying to work with those two values separately, this removes the VF parameter and instead widens the RetTy/ArgTys by VF when called from the vectorizer. This keeps things simpler, but does require some other modifications to keep things consistent.
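
To illustrate the direction (a minimal sketch, not the code from this patch; the helper name and call shape are illustrative assumptions), the idea is that callers such as the vectorizer widen the types themselves before building the attributes, so the cost model only ever sees the final vector types:

// Sketch: widen a scalar type by VF before constructing the cost attributes.
// FixedVectorType::get is the standard LLVM API; widenByVF is a hypothetical helper.
#include "llvm/IR/DerivedTypes.h"

static llvm::Type *widenByVF(llvm::Type *Ty, unsigned VF) {
  if (VF <= 1 || Ty->isVoidTy() || Ty->isVectorTy())
    return Ty; // nothing to widen, or already a vector
  return llvm::FixedVectorType::get(Ty, VF);
}

// A caller would then pass widenByVF(ScalarRetTy, VF) and widened argument
// types into IntrinsicCostAttributes, instead of a scalar type plus a VF.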

For most backends this looks like an improvement (or they were not using getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code from c230965ccf36af5c88c working. ARM removes the fix from dfac521da1b90db683, WebAssembly happens to get a fixup for an SLP cost issue, and both X86 and AArch64 now seem to be using better costs from the vectorizer.

Diff Detail

Event Timeline

dmgreen created this revision.Jan 23 2021, 12:11 PM
dmgreen requested review of this revision.Jan 23 2021, 12:11 PM
Herald added a project: Restricted Project. Jan 23 2021, 12:11 PM
Herald added subscribers: aheejin, wdng.

Thanks for working on this. Last week I was really puzzled when looking at this code and I wasn't sure if there was a good reason for it, so it's good to see it get fixed!

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
781

Why are you specially calling out uadd_sat and others?

llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll
11–13

Are there changes to these costs because it would previously calculate scalarization cost + 2 * cost(sadd.sat.i16), where it can now calculate the cost as cost(sadd.sat.v2i16)?

sdesmalen removed a subscriber: sdesmalen.
dmgreen added inline comments.Jan 25 2021, 12:46 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
781

Because of intrinsicHasPackedVectorBenefit above. For most other intrinsics it will just end up calling BaseT::getIntrinsicInstrCost.
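
For context, a rough sketch of the kind of gating being referred to (not the actual AMDGPU code; the exact list of intrinsics lives in intrinsicHasPackedVectorBenefit in the backend):

// Sketch only: a few intrinsics are candidates for packed (e.g. v2f16/v2i16)
// lowering and get special-cased costs; everything else falls back to
// BaseT::getIntrinsicInstrCost.
#include "llvm/IR/Intrinsics.h"

static bool hasPackedVectorBenefitSketch(llvm::Intrinsic::ID ID) {
  switch (ID) {
  case llvm::Intrinsic::fma:
  case llvm::Intrinsic::uadd_sat:
  case llvm::Intrinsic::usub_sat:
  case llvm::Intrinsic::sadd_sat:
  case llvm::Intrinsic::ssub_sat:
    return true;
  default:
    return false;
  }
}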

llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll
11–13

Yes, I believe so. At least, the old cost was from Base::getIntrinsicInstrCost, as the overrides from AArch64TargetTransformInfo were not taking effect. I'm not 100% sure how it was calculated, but that sounds very plausible.

https://reviews.llvm.org/D95292 updated the sadd.sat costs.

sdesmalen added inline comments.Jan 25 2021, 1:00 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
781

Is this change then to keep the cost the same as before this patch? I would have expected NElts not to need any change; maybe the cost was just wrong before.

llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll
11–13

Ah, I hadn't seen that patch yet; that explains things. Given that D95292 hasn't landed, is this patch missing a dependency?

samparker added inline comments.Jan 25 2021, 3:38 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
58–59

Could it help to further reduce complexity in TTI if we moved the vectorizer-specific changes into the vectorizer? (Assuming that the constructors are one of the ways in which TTI is used by the vectorizer.)

tlively resigned from this revision.Jan 26 2021, 12:24 AM

I'm happy to see that this patch improves the WebAssembly code, but I am not familiar enough with the relevant code to comment on the rest of the patch. No concerns from me.

dmgreen updated this revision to Diff 319522.Jan 27 2021, 3:50 AM
dmgreen added inline comments.
llvm/lib/Analysis/TargetTransformInfo.cpp
58–59

I have given it a try, and ended up adjusting some of the other constructors at the same time. It doesn't feel like the cleanest thing in the world, but it keeps the code in the vectorizer. Let me know what you think.
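
As a rough illustration of what keeping this in one place could look like (a sketch under the assumption that the widening happens when the attributes are built from a scalar call plus a vectorization factor; this is not the exact constructor added by the patch, and the type name is a stand-in):

// Sketch: build widened return/argument types from a scalar call and a VF at
// construction time, so the rest of TTI never needs to see the VF.
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instructions.h"

struct IntrinsicCostAttributesSketch {
  llvm::Type *RetTy;
  llvm::SmallVector<llvm::Type *, 4> ArgTys;

  IntrinsicCostAttributesSketch(const llvm::CallInst &CI, unsigned VF) {
    auto Widen = [VF](llvm::Type *Ty) -> llvm::Type * {
      if (VF <= 1 || Ty->isVoidTy() || Ty->isVectorTy())
        return Ty;
      return llvm::FixedVectorType::get(Ty, VF);
    };
    RetTy = Widen(CI.getType());
    for (const llvm::Use &Arg : CI.args())
      ArgTys.push_back(Widen(Arg.get()->getType()));
  }
};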

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
781

Yeah, I think the cost was incorrect before for vector instructions, but correct for the vectorizer's costing of vectorized scalar instructions (because it was looking at RetTy and ignoring VF). I was hoping someone from AMDGPU could take a look and check this is OK. I tried some f16 FMA and ssat routines on a gfx900 and they all ended up with the same vectorization, which is a good sign at least.

sdesmalen accepted this revision.Feb 2 2021, 3:56 AM

Maybe you want to give it a day for someone to confirm the changes to AMDGPUTargetTransformInfo.cpp, but otherwise LGTM, thanks!

This revision is now accepted and ready to land.Feb 2 2021, 3:56 AM
dmgreen added a reviewer: Restricted Project.Feb 2 2021, 9:40 AM

Thanks.

arsenm accepted this revision.Feb 2 2021, 9:43 AM

^ Thanks for taking a look :)

I'll commit this hopefully tomorrow.

samparker added inline comments.Feb 3 2021, 2:40 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
58–59

Cheers Dave!

This revision was automatically updated to reflect the committed changes.
jsji added a subscriber: jsji.EditedFeb 8 2021, 9:15 AM

Looks like this is causing an assert in the test-suite when built with CMake on PowerPC.
The current PowerPC buildbots are using Make, so this was not triggered (-fenable-matrix was not tested); we will update that soon.

@dmgreen Can you have a quick look? Thanks.

Reduced source and command line to reproduce.

.../build/bin/clang -cc1 -fenable-matrix -triple=powerpc64le-unknown-linux-gnu -O3 -emit-llvm matrix-types-spec-f4acb3.reduced.cpp

$ cat matrix-types-spec-f4acb3.reduced.cpp
template <typename a, int b, int c, int d> void e() {
  a f, g, h;
  using i = a __attribute__((matrix_type(b, d)));
  auto j = __builtin_matrix_column_major_load(&g, b, c, b);
  auto k = __builtin_matrix_column_major_load(&f, c, d, c);
  i l = j * k;
  __builtin_matrix_column_major_store(l, &h, b);
}
template <typename, int b, int c, int> void m() { e<int, b, c, 1>(); }
int main() { m<double, 10, 1, 3>(); }
jsji added a comment.Feb 8 2021, 11:20 AM

Reduced IR:

opt bugpoint-reduced-simplified.ll -inline

target datalayout = "E-m:e-i64:64-n32:64-v256:256:256-v512:512:512"
target triple = "powerpc64-unknown-linux-gnu"

$_Z1mIdLi10ELi1ELi3EEvv = comdat any
$_Z1eIiLi10ELi1ELi1EEvv = comdat any

define void @_Z1mIdLi10ELi1ELi3EEvv() local_unnamed_addr #1 comdat {
entry:
  call void @_Z1eIiLi10ELi1ELi1EEvv()
  ret void
}

define void @_Z1eIiLi10ELi1ELi1EEvv() local_unnamed_addr #1 comdat {
entry:
  %matrix1 = call <1 x i32> @llvm.matrix.column.major.load.v1i32(i32* nonnull align 4 undef, i64 1, i1 false, i32 1, i32 1)
  %0 = call <10 x i32> @llvm.matrix.multiply.v10i32.v10i32.v1i32(<10 x i32> undef, <1 x i32> %matrix1, i32 10, i32 1, i32 1)
  ret void
}

declare <1 x i32> @llvm.matrix.column.major.load.v1i32(i32* nocapture, i64, i1 immarg, i32 immarg, i32 immarg) #2
declare <10 x i32> @llvm.matrix.multiply.v10i32.v10i32.v1i32(<10 x i32>, <1 x i32>, i32 immarg, i32 immarg, i32 immarg) #3

Hello. Thanks for the report. It looks like there is still a scalarization overhead calculation that assumes the VF still means something, but that doesn't hold for these matrix intrinsics (their return and argument vectors can have different element counts, as in the reduced IR above).

Let me try and put together something to fix it.

jsji added a comment.Feb 8 2021, 11:37 AM

Hello. Thanks for the report. It looks like there is still a scalarization overhead calculation that assumes the VF still means something, but that doesn't hold for these matrix intrinsics!

Let me try and put together something to fix it.

Thanks Dave.

OK. I've put up D96287 for review. It required more surgery than I was hoping in order to get it right. If you need to revert this in the meantime then feel free, and I can re-commit the two once the follow-up is ready.

jsji added a comment.Feb 8 2021, 6:16 PM

OK. I've put up D96287 for review. It required more surgery than I was hoping in order to get it right. If you need to revert this in the meantime then feel free, and I can re-commit the two once the follow-up is ready.

Thanks Dave for the prompt action! Considering the time needed for review, I reverted it first in 92028062413907e1c10b16e7c338e0745a11a051.