This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Don't pass the instruction pointer from getMemInstScalarizationCost.
ClosedPublic

Authored by jonpa on Sep 24 2018, 6:42 AM.

Details

Reviewers

hfinkel
uweigand
javed.absar
MatzeB
fhahn
jonpa

Summary

While using D52351 (which stops adding the operand scalarization overhead cost when the target keeps addresses in GPRs), I discovered that some vector loads now got a zero cost. This was because the scalar load can be folded into its user, e.g. as one of the operands of an add. The problem is that the load can only be folded in the scalar version of the loop, not if the load is vectorized.

I think the simplest solution is to not pass the instruction pointer to getMemoryOpCost() from getMemInstScalarizationCost. Only if that is passed does the SystemZ implementation consider the folding of the load into the user.
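To illustrate the effect described above, here is a minimal toy model in C++. It is not the SystemZ implementation or the real TTI interface: `ToyInst`, `toyMemoryOpCost` and `toyScalarizedCost` are hypothetical names, and the costs 0 and 1 are assumed values chosen only to show how an optional instruction pointer can turn a scalarized load's cost into zero.

```cpp
#include <cassert>

// Hypothetical stand-in for an instruction; not an LLVM type.
struct ToyInst {
  bool IsLoad;
  bool UserCanFoldLoad; // e.g. a scalar add that accepts a memory operand
};

// Toy analogue of a target cost hook with an optional instruction pointer.
// Without the pointer there is no context, so no folding is assumed.
unsigned toyMemoryOpCost(const ToyInst *I = nullptr) {
  if (I && I->IsLoad && I->UserCanFoldLoad)
    return 0; // load assumed to be folded into its scalar user
  return 1;   // plain scalar memory operation (assumed unit cost)
}

// Toy analogue of the memory-op term of a scalarization cost:
// VF copies of the scalar operation.
unsigned toyScalarizedCost(const ToyInst &I, unsigned VF, bool PassInst) {
  assert(VF > 1 && "Scalarization cost of instruction implies vectorization.");
  unsigned ScalarCost = PassInst ? toyMemoryOpCost(&I) : toyMemoryOpCost();
  return VF * ScalarCost;
}
```

In this sketch, passing the instruction for a foldable load at VF 4 yields 4 * 0 = 0, while omitting it yields 4 * 1 = 4.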

Event Timeline

jonpa created this revision. Sep 24 2018, 6:42 AM

Herald added a reviewer: javed.absar. Sep 24 2018, 6:42 AM

Herald added subscribers: jsji, kbarton, nemanjai.

A simpler solution seems to be to not pass the instruction pointer when making this query for a scalarized memory instruction, if VF > 1. This is equivalent, and it doesn't appear to break any tests. In theory the instruction pointer could be useful here, but there appears to be no use for it at the moment.

I can imagine that on SystemZ, if the user is also scalarized, the scalarized load is in fact folded in the vectorized loop as well. I suspect this may happen for i64 multiply. However, this is NFC on SPEC compared to using the 'Scalarized' parameter.

The test has an IV increment of 2, which makes it scalarized. Is this clear enough?

Do you agree this is simpler and better?
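For readers less used to IR, the loop in the test corresponds roughly to the C++ below. This is a hand-written approximation, not something taken from the revision: the unused IR arguments `%data` and `%s` are omitted, and signed overflow of the accumulator is ignored.

```cpp
#include <cstdint>

// Approximate C++ counterpart of the IR function @fun in the test.
// The loop is in do-while form: the body runs before the bound is checked.
int32_t fun(const int32_t *Src, int64_t n) {
  int32_t acc = 0;
  int64_t iv = 0;
  do {
    acc += Src[iv]; // the load feeds an add, so scalar code could fold it
    iv += 2;        // increment of 2: the loads are not consecutive
  } while (iv < n);
  return acc;
}
```

Because the induction variable advances by 2, consecutive iterations do not read adjacent elements, which is what leads to the load being scalarized in this test (interleaved accesses are disabled on the RUN line).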

jonpa retitled this revision from "[TargetTransformInfo] Pass a new argument 'Scalarized' to getMemoryOpCost." to "[LoopVectorizer] Don't pass the instruction pointer from getMemInstScalarizationCost." Oct 5 2018, 7:46 AM

jonpa edited the summary of this revision.

ping!

Ping!

Since this only affects SystemZ, I suppose no one objects to this then?

uweigand added inline comments. Oct 30 2018, 6:31 AM

lib/Transforms/Vectorize/LoopVectorize.cpp, line 5262:
I think it would be good to add a comment why no instruction pointer is passed. Otherwise this LGTM.

Added a comment as suggested.
Committed in r345603.

This revision is now accepted and ready to land. Oct 30 2018, 7:40 AM

jonpa closed this revision. Oct 30 2018, 7:41 AM

Revision Contents

Path	Size
lib/Transforms/Vectorize/LoopVectorize.cpp	3 lines
test/Transforms/LoopVectorize/SystemZ/load-scalarization-cost-1.ll	26 lines

Diff 167263

lib/Transforms/Vectorize/LoopVectorize.cpp

Context not available.

 unsigned LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
                                                                  unsigned VF) {
+  assert(VF > 1 && "Scalarization cost of instruction implies vectorization.");
   Type *ValTy = getMemInstValueType(I);
   auto SE = PSE.getSE();

Context not available.

   Cost += VF *
           TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(), Alignment,
-                              AS, I);
+                              AS);

Inline comment (uweigand, Done): I think it would be good to add a comment why no instruction pointer is passed. Otherwise this LGTM.

   // Get the overhead of the extractelement and insertelement instructions
   // we might create due to scalarization.
Context not available.

test/Transforms/LoopVectorize/SystemZ/load-scalarization-cost-1.ll

This file was added.

; RUN: opt -mtriple=s390x-unknown-linux -mcpu=z13 -loop-vectorize \
; RUN:   -force-vector-width=4 -debug-only=loop-vectorize \
; RUN:   -enable-interleaved-mem-accesses=false -disable-output < %s 2>&1 \
; RUN:   | FileCheck %s
; REQUIRES: asserts
;

define i32 @fun(i64* %data, i64 %n, i64 %s, i32* %Src) {
entry:
  br label %for.body

for.body:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
  %acc = phi i32 [ 0, %entry ], [ %acc_next, %for.body ]
  %gep = getelementptr inbounds i32, i32* %Src, i64 %iv
  %ld = load i32, i32* %gep
  %acc_next = add i32 %acc, %ld
  %iv.next = add nuw nsw i64 %iv, 2
  %cmp110.us = icmp slt i64 %iv.next, %n
  br i1 %cmp110.us, label %for.body, label %for.end

for.end:
  ret i32 %acc_next

; CHECK: Found an estimated cost of 4 for VF 4 For instruction: %ld = load i32, i32* %gep
}