This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][TTI] Add cost table entry for trunc over vector of integers.
ClosedPublic

Authored by mingmingl on Aug 27 2022, 1:14 AM.

Details

Summary
  1. Tablegen patterns exist to use 'xtn' and 'uzp1' for trunc [1]. Cost table entries are updated based on the actual number of {xtn, uzp1} instructions generated.
  2. Without this, an IR instruction like trunc <8 x i16> %v to <8 x i8> is considered free and might be sinked to other basic blocks. As a result, the sinked 'trunc' is in a different basic block with its (usually not-free) vector operand and misses the chance to be combined during instruction selection. (examples in [2])
  3. It's a lot of effort to teach CodeGenPrepare.cpp to sink the operand of trunc without introducing regressions, since the instruction to compute the operand of trunc could be faster (e.g., throughput) than the instruction corresponding to "trunc (bin-vector-op".
    • For instance in [3], sinking %1 (as trunc operand) into bb.1 and bb.2 means to replace 2 xtn with 2 shrn (shrn has a throughput of 1 and only utilize v1 pipeline), which is not necessarily good, especially since ushr result needs to be preserved for store operation in bb.0. Meanwhile, it's too optimistic (for CodeGenPrepare pass) to assume machine-cse will always be able to de-dup shrn from various basic blocks into one shrn.

[1] Pattern for {v8i16->v8i8, v4i32->v4i16, v2i64->v2i32} and pattern for concat (trunc, trunc) -> uzp1

[2] pattern for trunc(umin(X, 255)) -> UQXTRN v8i8 and other {u,s}x{min,max} pattern for v8i16 operands), and pattern for shrn (v8i16->v8i8, v2i64->v2i32)

[3]

; instruction latency / throughput / pipeline on `neoverse-n1`
bb.0:
  %1 = lshr <8 x i16> %10, <i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4>   ; ushr, latency 2, throughput 1, pipeline V1
  %2 = trunc <8 x i16> %1 to <8 x i8>  ; xtn, latency 2, throughput 2, pipeline V
  %3 = store <8 x i8> %1, ptr %addr
  br cond i1 cond, label bb.1, label bb.2
    
bb.1:
  %4 = trunc <8 x i16> %1 to <8 x i8> ; xtn
    
bb.2:
  %5 = trunc <8 x i16> %1 to <8 x i8> ; xtn

Diff Detail

Event Timeline

mingmingl created this revision.Aug 27 2022, 1:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 27 2022, 1:14 AM
mingmingl requested review of this revision.Aug 27 2022, 1:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 27 2022, 1:14 AM
mingmingl retitled this revision from [AArch64][TTI] Set the cost of XTN to 1 for 2xi64 (to 2xi32) and 8xi16 (to 8xi8). to [AArch64][TTI] Set the cost of XTN to 1 for 2xi64 (to 2xi32) and 8xi16 (to8xi8)..Aug 27 2022, 1:31 AM
mingmingl edited the summary of this revision. (Show Details)
mingmingl added a reviewer: dmgreen.
mingmingl edited the summary of this revision. (Show Details)
mingmingl edited the summary of this revision. (Show Details)Aug 27 2022, 1:39 AM

This certainly sounds sensible. There are some tests in llvm/test/Analysis/CostModel/AArch64/cast.ll that suggest these costs are already 1, but that seems to not apply in all situations. I don't really understand why at the moment, I would guess that something is incorrectly marking it as free, but only some of the time.

Can you add a test case to show where it does give a difference in the costs?

Specify datalayout explicitly for llvm/test/Analysis/CostModel/AArch64/cast.ll in this patch, and in its diffbase D132856 to demonstrate that lacking datalayout could affect the accuracy of cost-analysis test.

mingmingl added a comment.EditedAug 29 2022, 10:53 AM

This certainly sounds sensible. There are some tests in llvm/test/Analysis/CostModel/AArch64/cast.ll that suggest these costs are already 1, but that seems to not apply in all situations.

This is a fairly good point. With 'gdb opt' debugging, the direct cause that costs are already 1 is that`cast.ll` doesn't explicitly specify a data-layout (that provides legal integer types) and opt won't guess from mtriple either. D132856 sets the data-layout to aarch64 le and updated tests passed. (not 100% sure if opt should make this 'guess', or data-layout should be mandatory in tests; the latter was discussed in this llvm-dev thread)

The longer story is, the code executes in the following flow without a data-layout

  1. AArch64TTIImpl::getCastInstrCost falls back to call BaseT::getCastInstrCost (callsite) after aarch64-specific cost-table look-up gives no result.
  2. BaseT::getCastInstrCost first calls BaseT::getCastInstrCost (callsite) and gets cost '1' so it continues at line 970
    • Missing data-layout in cast.ll means DataLayout:: LegalIntWidths vector is empty and thereby no integer type is legal, so 'trunc' is not free and this is where 1 is returned -> note this is NOT the '1' observed by llvm/test/Analysis/CostModel/AArch64/cast.l (3 explains why)
  1. BaseT::getCastInstrCost continues to get the legalization cost (which is the 1 relied on by cast.ll), and returns here

I don't really understand why at the moment, I would guess that something is incorrectly marking it as free, but only some of the time.

Can you add a test case to show where it does give a difference in the costs?

Of course! Created D132856 to show what a passing test looks like with aarch64 little-endian layout specified, and updated this patch to show the diff of added entries.

As a collateral result of specifying data-layout and added entries (in this patch), 'trunc <16 x i16> undef to <16 x i8>' cost now becomes 3

I wonder if the fact that trunc <16 x i16> undef to <16 x i8> is over-costly indicates an entry "{ ISD::TRUNCATE, MVT::v16i8, MVT::v16i16, 1 }," should be added in this same patch? Also i might try adding 'data-layout' for other aarch64 cost tests and see what's the diff as a next step.

mingmingl planned changes to this revision.Aug 29 2022, 5:18 PM

https://reviews.llvm.org/D132889 sweeps tests in llvm/test/Analysis/CostModel/AArch64 by adding aarch64-le target layout and updating tests -> only cast.ll and arith-overflow.ll stand out.

Given the split approach of BaseTTIImpl is used as a fallback when entry is not present in the table, it's likely that updating a few entries (as current patch does) to the correct number may still cause inaccuracies (in terms of how they are used in split approach and affecting the estimation of wider types).

Planning to add more entries (using https://gcc.godbolt.org/z/q8qodd147 as a template to get a better idea of codegen for different trunc), not sure if there is a minimum set of table entries that (works well with split approach and thereby) get all (at least all in cast.ll) combinations of trunc as accurate as possible.

Meanwhile would appreciate feedbacks on this plan! (assuming i'm on the right track to attribute cost to trunc for typical aarch64 data layout :) )

mingmingl retitled this revision from [AArch64][TTI] Set the cost of XTN to 1 for 2xi64 (to 2xi32) and 8xi16 (to8xi8). to [AArch64][TTI] Add cost table entry for trunc over vector of integers..
mingmingl edited the summary of this revision. (Show Details)

Added more table entries for trunc of vector integers. In this way, costs are specified explicitly rather than depending on the 'split half' approach in BaseTTI.

Updated arith-overflow.ll as a result of more table entries.

https://reviews.llvm.org/D132889 sweeps tests in llvm/test/Analysis/CostModel/AArch64 by adding aarch64-le target layout and updating tests -> only cast.ll and arith-overflow.ll stand out.

Given the split approach of BaseTTIImpl is used as a fallback when entry is not present in the table, it's likely that updating a few entries (as current patch does) to the correct number may still cause inaccuracies (in terms of how they are used in split approach and affecting the estimation of wider types).

Planning to add more entries (using https://gcc.godbolt.org/z/q8qodd147 as a template to get a better idea of codegen for different trunc), not sure if there is a minimum set of table entries that (works well with split approach and thereby) get all (at least all in cast.ll) combinations of trunc as accurate as possible.

Meanwhile would appreciate feedbacks on this plan! (assuming i'm on the right track to attribute cost to trunc for typical aarch64 data layout :) )

Ended up adding entries for 'trunc' over vector integers rather than relying on split approach of BaseTTI.
The numerical value is based on the number of {xtn, uzp1} in actual codegen (https://gist.github.com/minglotus-6/438e49494fe3d26876933141f889c2ac has godbolt links for many 'trunc'). 'arith-overflow.ll' is updated accordingly.

Besides, trunc <4 x i64> %var to <4 x i8> seems suboptimal -> https://gcc.godbolt.org/z/b36oEr11d gives 2 xtn + 1 uzp1 while 1 uzp1 + 1 xtn (4 x i64 -> 4 x i32, then 4 x i32 -> 4 x i8) should be sufficient IIUC. Use '3' (from actual codegen) and added a FIXME for it.

dmgreen accepted this revision.Sep 1 2022, 2:07 AM

The new values LGTM. Thanks.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1611

It looks like it is already 2 here: https://godbolt.org/z/T4rTqf1Tx. I guess it is not very reliable at the moment.

This revision is now accepted and ready to land.Sep 1 2022, 2:07 AM
mingmingl updated this revision to Diff 457283.Sep 1 2022, 8:35 AM
mingmingl marked an inline comment as done.

Changes:

  1. Update the cost of trunc <4 x i64> to <4 x i8> from 3 to 2. Filed https://github.com/llvm/llvm-project/issues/57502 to track the differences in codegen.
  2. Re-format affected table lines so columns are aligned. Only <4 x i64> -> <4 x i8> entry is updated in cast.ll and all other lines remain unchanged, compared with reviewed version (i.e., no inadvertent change due to re-formatting)
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1611

I filed https://github.com/llvm/llvm-project/issues/57502 to track this difference and took the liberty to update this number to 2 in this patch.

Re-format the table entries so columns are aligned -> only <4 x i64> -> <4 x i8> entry is updated in cast.ll as a result of this entry update, compared with reviewed version.

Please let me know if it's more idiomatic to use the higher one (when the codegen is not very reliable and the diff is small). Will hold this patch for a few more days and possibly commit on next Monday if the change from 3 to 2 looks good.

Thanks for reviews!

dmgreen added inline comments.Sep 2 2022, 1:15 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1611

I like removing the FIXME, but if the score is somewhere between 2 and 3, then 3 is probably the better option. It seems to be 3 in more cases where the result gets used, too.

mingmingl updated this revision to Diff 457618.Sep 2 2022, 9:41 AM
mingmingl marked an inline comment as done.

Update the cost of "trunc <4 x i64> to <4 x i8>" to 3 based on discussions.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1611

Got it, thanks! Updated this entry to 3 and going to commit.

This revision was landed with ongoing or failed builds.Sep 2 2022, 10:07 AM
This revision was automatically updated to reflect the committed changes.