Also remove some redundancy because the source and result
types of any multiply are always the same.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
The problem here is that having narrowed:
%2:_(<2 x s64>) = G_UMULH %0:_, %1:_
into:
%3:_(<2 x s32>), %4:_(<2 x s32>) = G_UNMERGE_VALUES %0:_(<2 x s64>) %5:_(<2 x s32>), %6:_(<2 x s32>) = G_UNMERGE_VALUES %1:_(<2 x s64>) %7:_(<2 x s32>) = G_MUL %3:_, %5:_ %8:_(<2 x s32>) = G_MUL %4:_, %5:_ %9:_(<2 x s32>) = G_MUL %3:_, %6:_ %10:_(<2 x s32>) = G_UMULH %3:_, %5:_ %11:_(<2 x s32>), %12:_(<2 x s1>) = G_UADDO %8:_, %9:_ %13:_(<2 x s32>) = G_ZEXT %12:_(<2 x s1>) %14:_(<2 x s32>), %15:_(<2 x s1>) = G_UADDO %11:_, %10:_ %16:_(<2 x s32>) = G_ZEXT %15:_(<2 x s1>) %17:_(<2 x s32>) = G_ADD %13:_, %16:_ %18:_(<2 x s32>) = G_MUL %4:_, %6:_ %19:_(<2 x s32>) = G_UMULH %4:_, %5:_ %20:_(<2 x s32>) = G_UMULH %3:_, %6:_ %21:_(<2 x s32>), %22:_(<2 x s1>) = G_UADDO %18:_, %19:_ %23:_(<2 x s32>) = G_ZEXT %22:_(<2 x s1>) %24:_(<2 x s32>), %25:_(<2 x s1>) = G_UADDO %21:_, %20:_ %26:_(<2 x s32>) = G_ZEXT %25:_(<2 x s1>) %27:_(<2 x s32>) = G_ADD %23:_, %26:_ %28:_(<2 x s32>), %29:_(<2 x s1>) = G_UADDO %24:_, %17:_ %30:_(<2 x s32>) = G_ZEXT %29:_(<2 x s1>) %31:_(<2 x s32>) = G_ADD %27:_, %30:_ %32:_(<2 x s32>) = G_UMULH %4:_, %6:_ %33:_(<2 x s32>) = G_ADD %32:_, %31:_
we need a way to merge %28:_(<2 x s32>) (a vector of the low parts of each element of the result) and %33:_(<2 x s32>) (same for the high parts) into a <2 x s64> result. I don't know how to do that. G_MERGE_VALUES only works on scalars, though I don't see why it couldn't be extended to work element-wise on vectors that all have the same length. And I can't see any existing helper functions in LegalizerHelper.cpp that do what I want either.
Afaik %3:_(<2 x s32>), %4:_(<2 x s32>) = G_UNMERGE_VALUES %0:_(<2 x s64>) is not yet defined to work like that and should probably be forbidden to make such instruction in builder. Only allow scalar split and vector split to elements or sub-vectors.
Desired unmerge works if done step by step:
%5:_(s64), %6:_(s64) = G_UNMERGE_VALUES %0:_(<2 x s64>) %7:_(s32), %8:_(s32) = G_UNMERGE_VALUES %5:_(s64) %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %6:_(s64) %3:_(<2 x s32>) = G_BUILD_VECTOR %5:_(s64), %7:_(s32) %4:_(<2 x s32>) = G_BUILD_VECTOR %6:_(s64), %8:_(s32)
Similar for merge: merge s64 elements first then build <2 x s64> vector.
But <2 x s32> instructions will be scalarized, bit shift packing into <2 x s16> is most probably slower.
Why not go with scalarize at the start?
Why not go with scalarize at the start?
Good idea. I will abandon this and change AMDGPU's legalization rules instead.
Ugh, it is apprently legal but it means something different. See comments in D111132.
Why? For other targets? Maybe, but I'm not motivated to work on it until there's a concrete need for it.
Yes. Plus I'm not entirely comfortable with legalization only working if you have the magic ordering of legalizer rules. I've lost a lot of time fighting legalizer rule ordering issues
My preferred way of implementing this would be to extend G_MERGE_VALUES so that it works elementwise on vectors:
%2:_(<3 x s64>) = G_MERGE_VALUES %0:_(<3 x s32>), %1:_(<3 x s32>) // %0 is three low halves, %1 is three high halves
But that requires G_UNMERGE_VALUES to be able to do the converse:
%1:_(<3 x s32>), %2:_(<3 x s32>) = G_UNMERGE_VALUES %0:_(<3 x s64>) // %1 is three low halves, %2 is three high halves
But this was already allowed and apparently meant something different (a G_BITCAST plus the converse of G_CONCAT_VECTORS?) which is how I got sidetracked into D111132.