This is an archive of the discontinued LLVM Phabricator instance.

[SLP][RISCV] Account for offset folding in getPointersChainCost
ClosedPublic

Authored by luke on May 2 2023, 7:51 AM.

Details

Summary

For a GEP in a pointer chain, if:

  1. a pointer chain is unit-strided
  2. the base pointer wasn't folded and is sitting in a register somewhere
  3. the distance between the GEP and the base pointer is small enough and can be folded into the addressing mode of the using load/store

Then we can exclude that GEP from the total cost of the pointer chain,
as it will likely be folded away.

In order to check if 3) holds, we need to know the type of memory access
being made by the users of the pointer chain. For that, we need to pass
along a new argument to getPointersChainCost. (Using the source pointer
type of the GEP isn't accurate, see https://reviews.llvm.org/D149889 for
more details).

Also note that 2) is currently an assumption, and could be modelled more
accurately.

This prevents some unprofitable cases from being SLP vectorized on
RISC-V by making the scalar costs cheaper and closer to the actual
codegen.

For now the getPointersChainCost hook is duplicated for RISC-V to prevent
disturbing other targets, but could be merged back in and shared with
other targets in a following patch.

Diff Detail

Event Timeline

luke created this revision.May 2 2023, 7:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 2 2023, 7:51 AM
luke requested review of this revision.May 2 2023, 7:51 AM
luke added inline comments.May 2 2023, 7:56 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4960

I removed this because it seems to be subsumed by the extra check in the base implementation. It's 100% equivalent as it will now cost for GEPs that don't fit into the addressing mode, but that should be more accurate right?

luke updated this revision to Diff 518736.May 2 2023, 7:59 AM

Remove fixme

ABataev added inline comments.May 2 2023, 8:02 AM
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
1057

I

1069

Add a comment with the param name for the true argument

luke updated this revision to Diff 518741.May 2 2023, 8:12 AM

Address comments

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4960

Typo, it's *not* 100% equivalent

I think this patch requires some cost analysis tests.

reames requested changes to this revision.May 2 2023, 1:17 PM
reames added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
323

Rephrase: is the type of the memory access.

327

Rename: AccessTy

(This matches the naming convention we use elsewhere for this concept.)

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
1066

I don't see anything here which requires the stride to be equal to the type size. What prevents the set of pointers from e.g. advancing by 64 bytes where the access type is 8 bytes? (i.e. what prevents this from being a strided access with non-unit stride in RISCV terms?)

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4950

KnownStride, and KnownUniform are not the same condition. I don't think your code change in the general model actually matches what you removed here.

I'd suggest by starting with a RISCV specific hook on your heuristic, and then we can merge in a post commit. I think the RISCV version is going to be more restrictive.

This revision now requires changes to proceed.May 2 2023, 1:17 PM
luke added inline comments.May 2 2023, 4:00 PM
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
1066

From what I understand, PointersChainInfo::isUniformStride() implies unit stride pointers. The only user of getPointersChainCost is in SLP's getGEPCostDiff, and the only time getKnownUniformStrided() is created is in case 2) here, which is a wide vector load:

// Here we differentiate two cases: (1) when Ptrs represent a regular
// vectorization tree node (as they are pointer arguments of scattered
// loads) or (2) when Ptrs are the arguments of loads or stores being
// vectorized as plane wide unit-stride load/store since all the
// loads/stores are known to be from/to adjacent locations.
assert(E->State == TreeEntry::Vectorize &&
       "Entry state expected to be Vectorize here.");
if (isa<LoadInst, StoreInst>(VL0)) {
  // Case 2: estimate costs for pointer related costs when vectorizing to
  // a wide load/store.
  // Scalar cost is estimated as a set of pointers with known relationship
  // between them.
  // For vector code we will use BasePtr as argument for the wide load/store
  // but we also need to account all the instructions which are going to
  // stay in vectorized code due to uses outside of these scalar
  // loads/stores.
  ScalarCost = TTI->getPointersChainCost(
      Ptrs, BasePtr, TTI::PointersChainInfo::getKnownUniformStrided(),
      ScalarTy, CostKind);

I agree this is non-obvious, I was considering renaming isUniformStride() to isUnitStride() or something more strict but deferred to keep the patch small. I can do this is a parent patch or alternatively leave a comment explaining this.

luke added a comment.May 2 2023, 4:04 PM

I think this patch requires some cost analysis tests.

I agree, although getPointersChainInfo doesn't get called anywhere useful for opt -passes="print<cost-model>", so the only way I can think of verifying the cost is via checking the debug output from SLP. Is there a better way to go about this?

luke updated this revision to Diff 522241.May 15 2023, 9:53 AM

Restore x86 hook

luke updated this revision to Diff 522556.May 16 2023, 5:20 AM

Rebase ontop of D150662 and add cost test case

luke updated this revision to Diff 522599.May 16 2023, 7:04 AM

Fix rebase

luke added a comment.May 16 2023, 9:35 AM

Doesn't seem to be any changes in x86 code on the llvm-test-suite:

Tests: 2432
Metric: size..text

Program                                        size..text                      
                                               results.head results.patch diff 
 Bitcode/Be...ral_grid/halide_bilateral_grid   59937.00     59937.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-2d     593.00       593.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-11    1185.00      1185.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-12     465.00       465.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-13     497.00       497.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-14     385.00       385.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-15    1633.00      1633.00       0.0%
 SingleSour...e/execute/GCC-C-execute-loop-2     561.00       561.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-2b     561.00       561.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-2e     769.00       769.00       0.0%
 Bitcode/Be...hmarks/Halide/blur/halide_blur   35809.00     35809.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-2f     417.00       417.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-2g     417.00       417.00       0.0%
 SingleSour...e/execute/GCC-C-execute-loop-3     417.00       417.00       0.0%
 SingleSour.../execute/GCC-C-execute-loop-3b     449.00       449.00       0.0%
                            Geomean difference      nan          nan       0.0%
         size..text                      
run    results.head results.patch    diff
count  2.432000e+03  2.432000e+03  2432.0
mean   1.692725e+04  1.692725e+04  0.0   
std    1.559107e+05  1.559107e+05  0.0   
min    3.530000e+02  3.530000e+02  0.0   
25%    3.850000e+02  3.850000e+02  0.0   
50%    5.130000e+02  5.130000e+02  0.0   
75%    4.765000e+03  4.765000e+03  0.0   
max    7.177889e+06  7.177889e+06  0.0
luke added a comment.May 16 2023, 10:02 AM

On sqlite3, -O2, there are 12 fewer VF=2 sequences vectorized on RISC-V:

$ grep -E "vsetivli\s+zero, 2, e" sqlite.head.s | wc -l
378
$ grep -E "vsetivli\s+zero, 2, e" sqlite.patch.s | wc -l
366

Most of them are in the form of something like:

	sb	a0, 5(s4)
	sb	s6, 6(s4)
	addi	s4, s4, 1
	vsetivli	zero, 2, e8, mf8, ta, ma
	vmv.v.i	v8, 0
	vse8.v	v8, (s4)

Which are now being emitted as scalar:

	sb	a0, 5(s3)
	sb	s6, 6(s3)
	sb	zero, 1(s3)
	sb	zero, 2(s3)

There are no differences in the code generated on x86. I've attached the RISC-V outputs if anyone wants to see the changes.


luke updated this revision to Diff 523736.May 19 2023, 5:12 AM

Isolate changes to RISCVTargetTransformInfo

luke marked 6 inline comments as done.May 19 2023, 5:15 AM
luke added inline comments.
llvm/test/Transforms/SLPVectorizer/RISCV/getpointerschaincost.ll
1

Apologies in advance for testing with the debug output, I can't think of another way to get access to getPointersChainCost.

luke retitled this revision from [SLP] Don't cost pointers that can be folded in getPointersChainCost to [SLP][RISCV] Account for offset folding in getPointersChainCost.May 19 2023, 5:16 AM
luke edited the summary of this revision. (Show Details)
luke added inline comments.May 19 2023, 5:19 AM
llvm/test/Transforms/SLPVectorizer/RISCV/getpointerschaincost.ll
56

Also worth noting, I tried to come up with a test case where only some of the pointers were folded and some weren't, but couldn't find a sane way to do so.
Namely for RISC-V, we need a chain of pointers that are unit-strided, but is also somehow long enough that the offset overflows 2^12.

ABataev added inline comments.May 19 2023, 6:29 AM
llvm/test/Transforms/SLPVectorizer/RISCV/getpointerschaincost.ll
1

-pass-remarks-output= inbstead, check some of the remarks_... tests in SLPVectorizer tests directory as an example. Also, better to precommit new tests separately

The RISCV and API extension bits now look good to me. Once Alexey's happy with the SLP bits, you're good to go.

Looking at the structure of the patch - thanks for the scope reduction - I realized this is a special case of pointer difference. The general form of this would be to compute the offset between each GEP and the base. If that difference fits in the scalar addressing range, that GEP has zero cost. Otherwise, it has cost.

This does raise the point that considering a constant offset GEP as zero cost is actually wrong. If the offsets are 0, and UINT_MAX, that's not a zero cost GEP on RISCV.

The only machinery I know for this in LLVM is BasicAAResult::DecomposedGEP . We could potentially reuse some of that.

There's also an interaction with the ptradd proposal here. With simpler GEPs, the subtraction becomes trivial.

This is definite definitely future work through. And probably not future work actually worth doing, at least for the general case. :)

luke added a comment.May 19 2023, 8:01 AM

This does raise the point that considering a constant offset GEP as zero cost is actually wrong. If the offsets are 0, and UINT_MAX, that's not a zero cost GEP on RISCV.

I guess with cost modelling there's not a strict definition of "wrong", but whether it could be more accurate. Whilst working on this locally I changed that assumption to check if it's foldable, but intentionally left it out to simplify things. It seems like an edge case where the approximation gets it right most of the time.

luke updated this revision to Diff 524198.May 22 2023, 2:13 AM
luke marked an inline comment as done.

Convert test to remarks test

This revision was not accepted when it landed; it landed in state Needs Review.May 22 2023, 5:55 AM
This revision was automatically updated to reflect the committed changes.