This is an archive of the discontinued LLVM Phabricator instance.

Differential D103629

[AArch64] Cost-model i8 vector loads/stores
ClosedPublic

Authored by SjoerdMeijer on Jun 3 2021, 9:17 AM.

Details

Reviewers

dmgreen
fhahn
david-arm
zatrazz
asavonic
efriedma

Commits

rGee752134ace3: [AArch64] Cost-model i8 vector loads/stores

Summary

Loads of e.g. <4 x i8> vectors were modeled as extremely expensive. And while we don't have a load instruction that supports this, it isn't that expensive to create a vector of i8 elements. This tweaks the cost model and enables SLP vectorisation of my motivating case loadi8.ll that I have added.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

SjoerdMeijer created this revision. Jun 3 2021, 9:17 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls. Jun 3 2021, 9:17 AM

SjoerdMeijer requested review of this revision. Jun 3 2021, 9:17 AM

Herald added a project: Restricted Project. Jun 3 2021, 9:17 AM

Harbormaster completed remote builds in B107476: Diff 349581. Jun 3 2021, 9:18 AM

SjoerdMeijer added inline comments. Jun 3 2021, 9:21 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1469	I was also wondering if this was just a bug, because what we are doing here is `NumVecElts * 2 * NumVecElts * 2`. For a `<4 x i8>` that results in a cost of 64. If this was intentional, then I don't think I follow it.
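
To make the arithmetic in this comment concrete, here is a minimal standalone sketch of the pre-patch formula. The function name `oldSmallI8VectorCost` is hypothetical and exists only for illustration; the expression itself is the one quoted from the code under review.

```cpp
#include <cassert>

// Hypothetical helper (not an LLVM API) mirroring the pre-patch expression
//   NumVectorizableInstsToAmortize * NumVecElts * 2
// with NumVectorizableInstsToAmortize = NumVecElts * 2.
unsigned oldSmallI8VectorCost(unsigned NumVecElts) {
  assert(NumVecElts > 0 && "expected a non-empty vector");
  unsigned NumVectorizableInstsToAmortize = NumVecElts * 2;
  // "2 instructions per vector element", scaled by the amortization term,
  // so the result grows quadratically with the element count.
  return NumVectorizableInstsToAmortize * NumVecElts * 2;
}
```

Under this formula `oldSmallI8VectorCost(4)` is 64 and `oldSmallI8VectorCost(2)` is 16, which matches the old `<4 x i8>` and `<2 x i8>` load costs in the cost-model tests of this revision.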

sdesmalen added a subscriber: sdesmalen. Jun 3 2021, 9:28 AM

sdesmalen added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1465	Should the "else" case still be the original high cost? (or some other high cost)

SjoerdMeijer marked an inline comment as not done. Jun 3 2021, 9:47 AM

SjoerdMeijer added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1465	Thanks for taking a look! I think this is the expensive case that we want to get more correct. It's expensive if the vector is smaller than some magic number, which we check with `< ProfitableNumElements`. The "else" case is the cheap case, for which we will return `LT.first`. I think this makes sense, but will double check, and let me know if I missed something here.

efriedma wrote:

> And while we don't have a load instruction that supports this

If <4 x i8> loads matter, we should probably convert them to a 32-bit load followed by a zip1, which should have a cost of 2. (Or possibly 3 on big-endian, I guess.) Basically the inverse of LowerTruncateVectorStore.

dmgreen added a reviewer: asavonic. Jun 3 2021, 10:18 PM

dmgreen added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1459	Prior to D102938, this wasn't true and seems to still not be very true in general: https://godbolt.org/z/7KMrEqcMW Although the adds can be removed. The serialized ld1s won't be very cheap though, on many CPUs. A factor of two might be enough to show they are expensive, but there would probably be some cases where performance was worse. As Eli says, optimizing the 4 x i8 case at least using a 32-bit load and a shuffle sounds like a good idea.
1469	My rough understanding was that you really don't want the vectorizer to produce

  <4 x i8> load
  <4 x i16> zext

You want to make sure it's at least 8x:

  <8 x i8> load
  <8 x i16> zext

That way you don't serialize the load/extend, using d and q reg instructions as expected. So the costs are deliberately high: high enough to prevent the scalarization and cross-register-bank moves. It may be higher than the cost of the individual instructions, but that is what you want to steer the vectorizer profitably.

SjoerdMeijer added inline comments. Jun 4 2021, 12:07 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1459	> Prior to D102938, this wasn't true and seems to still not be very true in general: https://godbolt.org/z/7KMrEqcMW

While there are some variants, the trend is still roughly 2 * #elements.

> A factor of two might be enough to show they are expensive.

Yep, so that's what this patch does. Correct me if I am wrong, but it looks like we agree on that.

> As Eli says, optimizing the 4 x i8 case at least using a 32bit load and a shuffle sounds like a good idea.

Yep, that's a nice suggestion, will look into that first.

SjoerdMeijer added inline comments. Jun 4 2021, 12:15 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1469	There are probably a lot of different cases. When types are all the same width, yes, you want to go for a wider vector. But in case of mixed types, where e.g. a smaller type is accumulated in a bigger, vectorisation is still profitable (or can be) and we might want to pay the overhead of constructing a vector for the smaller type.

Matt added a subscriber: Matt. Jun 4 2021, 8:21 AM

In D103629#2797291, @efriedma wrote:

And while we don't have a load instruction that supports this

If <4 x i8> loads matter, we should probably convert them to a 32-bit load followed by a zip1, which should have a cost of 2. (Or possibly 3 on big-endian, I guess.) Basically the inverse of LowerTruncateVectorStore.

Question about this. I will keep looking a bit longer because my zip1-fu is not so strong, but I was struggling to see what the codegen would look like. For an example like this:

define <4 x i32> @f(<4 x i8>* %a, <4 x i32> %b) {
  %x = load <4 x i8>, <4 x i8>* %a
  %y = sext <4 x i8> %x to <4 x i32>
  %z = add <4 x i32> %y, %b
  ret <4 x i32> %z
}

I am failing to see how with something like

fmov s0, w0
zip1.8d v0, v0, v0

I would get the bytes sign extended and in the right place with zip1 for the 128-bit add.

The two-instruction sequence leaves the bits in the right positions for a <4 x i16>. If you need a <4 x i32>, you need another zip. If you need sign-extension, you need to sshr the result or something like that. So

  %x = load <4 x i8>, <4 x i8>* %a
  %y = sext <4 x i8> %x to <4 x i32>

would be four instructions total.
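
The following is a small host-side C++ model of one reading of that sequence for the `<4 x i16>` case: a 32-bit load, a zip1 of the register with itself, and an arithmetic shift right by 8 for the sign extension. It is only a sketch under stated assumptions: little-endian lane order is assumed, and the helper names (`zip1_8b`, `lane16`, `sshr`, `loadV4I8SextToV4I16`) are hypothetical stand-ins, not LLVM or NEON intrinsic APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using Bytes8 = std::array<std::uint8_t, 8>; // models a 64-bit "d" register

// Models "zip1 Vd.8b, Vn.8b, Vm.8b": interleave the low four bytes of each
// source, producing n0 m0 n1 m1 n2 m2 n3 m3.
Bytes8 zip1_8b(const Bytes8 &N, const Bytes8 &M) {
  Bytes8 R{};
  for (int I = 0; I < 4; ++I) {
    R[2 * I] = N[I];
    R[2 * I + 1] = M[I];
  }
  return R;
}

// Reads lane `Lane` of the register viewed as a little-endian <4 x i16>.
std::int16_t lane16(const Bytes8 &V, int Lane) {
  std::uint16_t U =
      static_cast<std::uint16_t>(V[2 * Lane] | (V[2 * Lane + 1] << 8));
  std::int16_t S;
  std::memcpy(&S, &U, sizeof S); // bit-for-bit reinterpretation
  return S;
}

// Portable arithmetic shift right, standing in for "sshr".
std::int16_t sshr(std::int16_t V, int Shift) {
  return V < 0 ? static_cast<std::int16_t>(~(~V >> Shift))
               : static_cast<std::int16_t>(V >> Shift);
}

// Hypothetical model: load four i8 values and sign-extend them to <4 x i16>.
std::array<std::int16_t, 4> loadV4I8SextToV4I16(const std::uint8_t *P) {
  assert(P != nullptr);
  Bytes8 V{};
  std::memcpy(V.data(), P, 4); // step 1: 32-bit load into the low lanes
  Bytes8 Z = zip1_8b(V, V);    // step 2: each byte now sits in the high half
                               //         of its own 16-bit lane
  std::array<std::int16_t, 4> Out{};
  for (int I = 0; I < 4; ++I)
    Out[I] = sshr(lane16(Z, I), 8); // step 3: shift down, replicating the sign
  return Out;
}
```

In this model the bytes `0x01, 0xFF, 0x7F, 0x80` come out as `1, -1, 127, -128`. Going on to `<4 x i32>` would need one more widening step, which is where the count of four instructions comes from.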

In D103629#2799380, @efriedma wrote:

The two-instruction sequence leaves the bits in the right positions for a <4 x i16>. If you need a <4 x i32>, you need another zip. If you need sign-extension, you need to sshr the result or something like that. So "%x = load <4 x i8>, <4 x i8>* %a %y = sext <4 x i8> %x to <4 x i32>" would be four instructions total.

Ah, thanks, makes sense! I was stuck with the idea that this would just require 2 instructions (for an example like this), which I didn't get.

SjoerdMeijer mentioned this in D104782: [AArch64] Custom lower <4 x i8> loads. Jun 23 2021, 6:33 AM

SjoerdMeijer mentioned this in rG51e434fc2590: [AArch64] Custom lower <4 x i8> loads. Jun 25 2021, 1:54 AM

SjoerdMeijer mentioned this in rG79c98279b6cd: [SLP][AArch64] Precommit test for D103629, checking <4 x i8> loads. NFC. Jun 25 2021, 3:05 AM

With the codegen optimised for i8 loads in D105110 so that it only takes a few instructions to materialise i8 loads, it's time to return to the cost-modelling part here.

I got rid of the logic that calculates the cost and have replaced it with a look-up table that is based on this new codegen.

SjoerdMeijer added a reviewer: efriedma. Jul 1 2021, 6:53 AM

dmgreen added inline comments. Jul 1 2021, 7:53 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1456	I think you are adding too many types. It's not a truncating store every time that VT != LT.second. It just means that it is type legalized to a different type.

A v32i8 is just 2 v16i8 loads, so should give LT={2, MVT::v16i8} and should get a cost of 2 (which is LT.first). Any larger type follows the same pattern of splitting until it is a legal type. So v64i8 would be LT={4, MVT::v16i8}, so cost=4. v16i8 and v8i8 are legal, so get {1, MVT::v16i8} or {1, MVT::v8i8} and cost 1 (which is LT.first again). For v4i8 the LT.second will be v4i16 and LT.first will be 1 still (I think).

I was expecting it to just be <4 x i8> costs, which are now 2 (or 3 if we want to worry about big endian costs, which we don't usually do AFAIU). One for the f32 load and one for the sshll (or xtn+f32 store). The cost of any other truncation would be measured from the trunc instruction.
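
The legalization examples in this comment can be reproduced with a toy model. This is a rough sketch, not LLVM's `getTypeLegalizationCost`: the `VecTy` struct and `legalize` function are hypothetical, and the model assumes power-of-two integer vectors with elements of at most 64 bits, split down to 128 bits or promoted up to 64 bits.

```cpp
#include <cassert>
#include <utility>

// Toy stand-in for an MVT: a fixed-width integer vector type.
struct VecTy {
  unsigned NumElts;
  unsigned ScalarBits;
  bool operator==(const VecTy &O) const {
    return NumElts == O.NumElts && ScalarBits == O.ScalarBits;
  }
};

// Hypothetical approximation of the {LT.first, LT.second} pair:
//  - wider than 128 bits: split in halves until 128 bits wide;
//  - exactly 64 or 128 bits: already legal;
//  - narrower than 64 bits: promote the elements until 64 bits wide.
std::pair<unsigned, VecTy> legalize(VecTy Ty) {
  assert(Ty.NumElts > 0 && Ty.ScalarBits > 0 && Ty.ScalarBits <= 64);
  unsigned Splits = 1;
  while (Ty.NumElts * Ty.ScalarBits > 128) {
    Ty.NumElts /= 2; // each split halves the vector and doubles the op count
    Splits *= 2;
  }
  while (Ty.NumElts * Ty.ScalarBits < 64)
    Ty.ScalarBits *= 2; // element promotion, e.g. v4i8 -> v4i16
  return {Splits, Ty};
}
```

With this model, `legalize({32, 8})` gives `{2, v16i8}`, `legalize({64, 8})` gives `{4, v16i8}`, and `legalize({4, 8})` gives `{1, v4i16}`, in line with the cases listed above. Only the promoted cases change the scalar width, which is why they are the ones that need special costs.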

Harbormaster completed remote builds in B111972: Diff 355862. Jul 1 2021, 8:11 AM

SjoerdMeijer added inline comments. Jul 1 2021, 11:05 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1456	Okay, that's fair enough. I don't like this `ProfitableNumElements` business here, and would like to see that go. It's wrong for the loads, so will have to adjust that. And when I do that, I don't want to add a sort of special case for the loads and keep the `ProfitableNumElements` for the stores and that logic for the `getNumElements() < ProfitableNumElements` business. So that explains that a little bit.

I think we have 2 cases:

1. A type is legal, or a type is legal and is split up. In both cases, like you explained, the cost is just `LT.first`.
2. A type is truncated or extended. For an extend of `<4 x i8>` the cost is 2 if it's extended to i16 (1 load, 1 sshll), and the cost is 3 if it is extended to i32 (1 load, 2 sshll). So the cost is not 2 in all cases. And while we are at it, I added costs for some smaller and larger vectors.

The check to see if it's a trunc/extend is indeed wrong. I will need to look, but I am guessing that checking if `getScalarSizeInBits()` is the same for `VT` or `LT.second`, or something along those lines, will cover that. This allows us to do a lookup for a trunc/extend (case 2), or return `LT.first` (case 1).

This fixes the trunc or ext check.

As a result, there was this check in llvm/test/Analysis/CostModel/AArch64/store.ll in the previous revision:

Found an estimated cost of 64 for instruction: store <32 x i8> undef, <32 x i8>* undef, align 4

but that cost is now back to 2, i.e. it remains unchanged with this patch.

Harbormaster completed remote builds in B112163: Diff 356127. Jul 2 2021, 2:11 AM

dmgreen added inline comments. Jul 2 2021, 8:59 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1456	I still think there are more types here than should be needed (or possible to reach). The truncation tests you point to are not very relevant here. We are just dealing with the cost of a load, as in these tests: https://godbolt.org/z/oK58Gcez5

(If you want to add costs for a combined store(trunc) or ext(load), then that is a different issue that needs to detect the trunc/ext instruction, like we do in MVE. I don't think it's needed here though.)

For i8 vector types that are a different size when legalized, we are either talking about v2i8, v3i8 or v4i8. I'm pretty sure anything larger will be legalized to a v8i8 or larger, so still an i8. v4i8 is the one you added better lowering for recently. It will be legalized to a v4i16. v2i8 will legalize to a v2i32 and is still a bit messy.

So I think loads and stores are roughly the same cost, and most of the cases are not needed. The new code can probably be something like:

  if (useNeonVector(Ty) &&
      Ty->getScalarSizeInBits() != LT.second.getScalarSizeInBits()) {
    // v4i8 types are lowered to scalar load/store and sshll/xtn.
    if (VT == MVT::v4i8)
      return 2;
    // Otherwise we need to scalarize
    return cast<FixedVectorType>(Ty)->getNumElements() * 2;
  }

I think that may get v2i16 costs more correct too, which is a nice benefit :)
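
The suggested logic can be tried outside LLVM with a small standalone model. This is only a sketch: `MemOpQuery` and `neonMemOpCost` are hypothetical names, and the legalized scalar width and split count are passed in by hand rather than computed by the target lowering.

```cpp
#include <cassert>

// Hypothetical, simplified inputs: the IR vector type (element count and
// scalar width) plus what legalization reports for it, i.e. the scalar width
// of LT.second and the operation count LT.first.
struct MemOpQuery {
  unsigned NumElts;
  unsigned ScalarBits;
  unsigned LegalScalarBits;
  unsigned LTFirst;
};

unsigned neonMemOpCost(const MemOpQuery &Q) {
  assert(Q.NumElts > 0 && Q.LTFirst > 0);
  // A changed scalar width means the elements were promoted, so the memory
  // operation is effectively an extending load or a truncating store.
  if (Q.ScalarBits != Q.LegalScalarBits) {
    // v4i8: one scalar load/store plus one sshll/xtn.
    if (Q.NumElts == 4 && Q.ScalarBits == 8)
      return 2;
    // Otherwise assume scalarization at two instructions per element.
    return Q.NumElts * 2;
  }
  // Legal or split types cost one operation per legal part.
  return Q.LTFirst;
}
```

For example, `neonMemOpCost({4, 8, 16, 1})` is 2 and `neonMemOpCost({2, 8, 32, 1})` is 4, which agrees with the updated `<4 x i8>` and `<2 x i8>` expectations in the cost-model tests; `{2, 16, 32, 1}` also gives 4, the v2i16 change mentioned at the end of the comment.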

Alright, let's go for that then.

Thanks. Nice change

llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll
11	Can check for <4 x i8> specifically
llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll
3–4	This comment looks old now.

This revision is now accepted and ready to land. Jul 5 2021, 2:36 AM

Harbormaster completed remote builds in B112415: Diff 356451. Jul 5 2021, 2:42 AM

Cheers, will fix that before committing.

Closed by commit rGee752134ace3: [AArch64] Cost-model i8 vector loads/stores (authored by SjoerdMeijer). Jul 5 2021, 3:25 AM

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rGee752134ace3: [AArch64] Cost-model i8 vector loads/stores.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	26 lines
llvm/test/Analysis/CostModel/AArch64/cast.ll	6 lines
llvm/test/Analysis/CostModel/AArch64/mem-op-cost-model.ll	8 lines
llvm/test/Analysis/CostModel/AArch64/store.ll	16 lines
llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll	7 lines
llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll	8 lines
llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll	28 lines
llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll	64 lines

Diff 356463

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

@@ 1,417 lines not shown @@ bool AArch64TTIImpl::useNeonVector(const Type *Ty) const {
   return isa<FixedVectorType>(Ty) && !ST->useSVEForFixedLengthVectors();
 }

 InstructionCost AArch64TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Ty,
                                                 MaybeAlign Alignment,
                                                 unsigned AddressSpace,
                                                 TTI::TargetCostKind CostKind,
                                                 const Instruction *I) {
+  EVT VT = TLI->getValueType(DL, Ty, true);
   // Type legalization can't handle structs
-  if (TLI->getValueType(DL, Ty, true) == MVT::Other)
+  if (VT == MVT::Other)
     return BaseT::getMemoryOpCost(Opcode, Ty, Alignment, AddressSpace,
                                   CostKind);

   auto LT = TLI->getTypeLegalizationCost(DL, Ty);
   if (!LT.first.isValid())
     return InstructionCost::getInvalid();

   // TODO: consider latency as well for TCK_SizeAndLatency.
@@ 10 lines not shown @@ if (ST->isMisaligned128StoreSlow() && Opcode == Instruction::Store &&
     // practice on inlined block copy code.
     // We make such stores expensive so that we will only vectorize if there
     // are 6 other instructions getting vectorized.
     const int AmortizationCost = 6;

     return LT.first * 2 * AmortizationCost;
   }

+  // Check truncating stores and extending loads.
   if (useNeonVector(Ty) &&
-      cast<VectorType>(Ty)->getElementType()->isIntegerTy(8)) {
-    unsigned ProfitableNumElements;
-    if (Opcode == Instruction::Store)
-      // We use a custom trunc store lowering so v.4b should be profitable.
-      ProfitableNumElements = 4;
-    else
-      // We scalarize the loads because there is not v.4b register and we
-      // have to promote the elements to v.2.
-      ProfitableNumElements = 8;
-
-    if (cast<FixedVectorType>(Ty)->getNumElements() < ProfitableNumElements) {
-      unsigned NumVecElts = cast<FixedVectorType>(Ty)->getNumElements();
-      unsigned NumVectorizableInstsToAmortize = NumVecElts * 2;
-      // We generate 2 instructions per vector element.
-      return NumVectorizableInstsToAmortize * NumVecElts * 2;
-    }
+      Ty->getScalarSizeInBits() != LT.second.getScalarSizeInBits()) {
+    // v4i8 types are lowered to scalar a load/store and sshll/xtn.
+    if (VT == MVT::v4i8)
+      return 2;
+    // Otherwise we need to scalarize.
+    return cast<FixedVectorType>(Ty)->getNumElements() * 2;
   }

   return LT.first;
 }

 InstructionCost AArch64TTIImpl::getInterleavedMemoryOpCost(
     unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,
     Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
     bool UseMaskForCond, bool UseMaskForGaps) {
   assert(Factor >= 2 && "Invalid interleave factor");
   auto *VecVTy = cast<FixedVectorType>(VecTy);
@@ 525 lines not shown @@

llvm/test/Analysis/CostModel/AArch64/cast.ll

@@ 695 lines not shown @@ ;
   ret i32 undef
 }

 define i32 @load_extends() #0 {
 ; CHECK-LABEL: 'load_extends'
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %loadi8 = load i8, i8* undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %loadi16 = load i16, i16* undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %loadi32 = load i32, i32* undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %loadv2i8 = load <2 x i8>, <2 x i8>* undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 64 for instruction: %loadv4i8 = load <4 x i8>, <4 x i8>* undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %loadv2i8 = load <2 x i8>, <2 x i8>* undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %loadv4i8 = load <4 x i8>, <4 x i8>* undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %loadv8i8 = load <8 x i8>, <8 x i8>* undef
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %loadv2i16 = load <2 x i16>, <2 x i16>* undef
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %loadv2i16 = load <2 x i16>, <2 x i16>* undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %loadv4i16 = load <4 x i16>, <4 x i16>* undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %loadv2i32 = load <2 x i32>, <2 x i32>* undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %loadv4i32 = load <4 x i32>, <4 x i32>* undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %loadnxv2i32 = load <vscale x 2 x i32>, <vscale x 2 x i32>* undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %loadnxv4i32 = load <vscale x 4 x i32>, <vscale x 4 x i32>* undef
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %r0 = sext i8 %loadi8 to i16
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %r1 = zext i8 %loadi8 to i16
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %r2 = sext i8 %loadi8 to i32
@@ 109 lines not shown @@

llvm/test/Analysis/CostModel/AArch64/mem-op-cost-model.ll

@@ 43 lines not shown @@
 ; CHECK-SVE-256: Cost Model: Found an estimated cost of 1 for instruction:
 ; CHECK-SVE-512: Cost Model: Found an estimated cost of 1 for instruction:
   store <8 x i8> %val, <8 x i8>* %ptr
   ret void
 }

 define <4 x i8> @load4(<4 x i8>* %ptr) {
 ; CHECK: 'Cost Model Analysis' for function 'load4':
-; CHECK-NEON: Cost Model: Found an estimated cost of 64 for instruction:
-; CHECK-SVE-128: Cost Model: Found an estimated cost of 64 for instruction:
+; CHECK-NEON: Cost Model: Found an estimated cost of 2 for instruction:
+; CHECK-SVE-128: Cost Model: Found an estimated cost of 2 for instruction:
 ; CHECK-SVE-256: Cost Model: Found an estimated cost of 1 for instruction:
 ; CHECK-SVE-512: Cost Model: Found an estimated cost of 1 for instruction:
   %out = load <4 x i8>, <4 x i8>* %ptr
   ret <4 x i8> %out
 }

 define void @store4(<4 x i8>* %ptr, <4 x i8> %val) {
 ; CHECK: 'Cost Model Analysis' for function 'store4':
-; CHECK-NEON: Cost Model: Found an estimated cost of 1 for instruction:
-; CHECK-SVE-128: Cost Model: Found an estimated cost of 1 for instruction:
+; CHECK-NEON: Cost Model: Found an estimated cost of 2 for instruction:
+; CHECK-SVE-128: Cost Model: Found an estimated cost of 2 for instruction:
 ; CHECK-SVE-256: Cost Model: Found an estimated cost of 1 for instruction:
 ; CHECK-SVE-512: Cost Model: Found an estimated cost of 1 for instruction:
   store <4 x i8> %val, <4 x i8>* %ptr
   ret void
 }

 define <16 x i16> @load_256(<16 x i16>* %ptr) {
 ; CHECK: 'Cost Model Analysis' for function 'load_256':
@@ 109 lines not shown @@

llvm/test/Analysis/CostModel/AArch64/store.ll

@@ 17 lines not shown @@
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <16 x half> undef, <16 x half>* undef, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <2 x i64> undef, <2 x i64>* undef, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <4 x i32> undef, <4 x i32>* undef, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <8 x i16> undef, <8 x i16>* undef, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <16 x i8> undef, <16 x i8>* undef, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <2 x double> undef, <2 x double>* undef, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <4 x float> undef, <4 x float>* undef, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <8 x half> undef, <8 x half>* undef, align 4
-; CHECK-NEXT: Cost Model: Found an estimated cost of 16 for instruction: store <2 x i8> undef, <2 x i8>* undef, align 2
-; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <4 x i8> undef, <4 x i8>* undef, align 4
-; CHECK-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %1 = load <2 x i8>, <2 x i8>* undef, align 2
-; CHECK-NEXT: Cost Model: Found an estimated cost of 64 for instruction: %2 = load <4 x i8>, <4 x i8>* undef, align 4
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: store <2 x i8> undef, <2 x i8>* undef, align 2
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <4 x i8> undef, <4 x i8>* undef, align 4
+; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %1 = load <2 x i8>, <2 x i8>* undef, align 2
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %2 = load <4 x i8>, <4 x i8>* undef, align 4
 ; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; SIZE-LABEL: 'getMemoryOpCost'
 ; SIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <4 x i64> undef, <4 x i64>* undef, align 4
 ; SIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <8 x i32> undef, <8 x i32>* undef, align 4
	; SIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <16 x i16> undef, <16 x i16>* undef, align 4			; SIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <16 x i16> undef, <16 x i16>* undef, align 4
	; SIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <32 x i8> undef, <32 x i8>* undef, align 4			; SIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <32 x i8> undef, <32 x i8>* undef, align 4
	; SIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <4 x double> undef, <4 x double>* undef, align 4			; SIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <4 x double> undef, <4 x double>* undef, align 4
	Show All 22 Lines
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 24 for instruction: store <16 x half> undef, <16 x half>* undef, align 4			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 24 for instruction: store <16 x half> undef, <16 x half>* undef, align 4
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <2 x i64> undef, <2 x i64>* undef, align 4			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <2 x i64> undef, <2 x i64>* undef, align 4
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <4 x i32> undef, <4 x i32>* undef, align 4			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <4 x i32> undef, <4 x i32>* undef, align 4
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <8 x i16> undef, <8 x i16>* undef, align 4			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <8 x i16> undef, <8 x i16>* undef, align 4
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <16 x i8> undef, <16 x i8>* undef, align 4			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <16 x i8> undef, <16 x i8>* undef, align 4
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <2 x double> undef, <2 x double>* undef, align 4			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <2 x double> undef, <2 x double>* undef, align 4
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <4 x float> undef, <4 x float>* undef, align 4			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <4 x float> undef, <4 x float>* undef, align 4
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <8 x half> undef, <8 x half>* undef, align 4			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 12 for instruction: store <8 x half> undef, <8 x half>* undef, align 4
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 16 for instruction: store <2 x i8> undef, <2 x i8>* undef, align 2			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: store <2 x i8> undef, <2 x i8>* undef, align 2
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: store <4 x i8> undef, <4 x i8>* undef, align 4			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: store <4 x i8> undef, <4 x i8>* undef, align 4
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %1 = load <2 x i8>, <2 x i8>* undef, align 2			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %1 = load <2 x i8>, <2 x i8>* undef, align 2
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 64 for instruction: %2 = load <4 x i8>, <4 x i8>* undef, align 4			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %2 = load <4 x i8>, <4 x i8>* undef, align 4
	; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; SLOW_MISALIGNED_128_STORE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	store <4 x i64> undef, <4 x i64> * undef			store <4 x i64> undef, <4 x i64> * undef
	store <8 x i32> undef, <8 x i32> * undef			store <8 x i32> undef, <8 x i32> * undef
	store <16 x i16> undef, <16 x i16> * undef			store <16 x i16> undef, <16 x i16> * undef
	store <32 x i8> undef, <32 x i8> * undef			store <32 x i8> undef, <32 x i8> * undef

	store <4 x double> undef, <4 x double> * undef			store <4 x double> undef, <4 x double> * undef
	Show All 21 Lines
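The old costs of 16 and 64 for the `<2 x i8>` and `<4 x i8>` loads in the checks above are consistent with the `NumVecElts * 2 * NumVecElts * 2` expression quoted in the inline comment on AArch64TargetTransformInfo.cpp. A minimal standalone sketch of that arithmetic only, assuming nothing else about the surrounding cost-model code; the helper name `oldNarrowI8Cost` is hypothetical and is not an LLVM API:

```cpp
// Hypothetical helper, not part of LLVM: reproduces only the arithmetic
// quoted in the review, NumVecElts * 2 * NumVecElts * 2.
constexpr unsigned oldNarrowI8Cost(unsigned NumVecElts) {
  return NumVecElts * 2 * NumVecElts * 2;
}

// The old CHECK lines report 16 for <2 x i8> and 64 for <4 x i8> loads.
static_assert(oldNarrowI8Cost(2) == 16, "<2 x i8>: 4 * 4");
static_assert(oldNarrowI8Cost(4) == 64, "<4 x i8>: 8 * 8");
```

The quadratic growth in the element count is what made small i8 vectors look so expensive before this change; the updated checks report 4 and 2 for the same loads.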

llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll

	; RUN: opt -loop-vectorize -mtriple=arm64-apple-darwin -S %s | FileCheck %s			; RUN: opt -loop-vectorize -mtriple=arm64-apple-darwin -S %s | FileCheck %s

	; Test cases for extending the vectorization factor, if small memory operations			; Test cases for extending the vectorization factor, if small memory operations
	; are not profitable.			; are not profitable.

	; Test with a loop that contains memory accesses of i8 and i32 types. The			; Test with a loop that contains memory accesses of i8 and i32 types. The
	; default maximum VF for NEON is 4, but vectorizing 4 x i8 is not			; default maximum VF for NEON is 4. And while we don't have an instruction to
	; profitable. But we can extend to VF to 8 or 16, at which point the			; load 4 x i8, vectorization might still be profitable.
	; i8 memory accesses become profitable.
	define void @test_load_i8_store_i32(i8* noalias %src, i32* noalias %dst, i32 %off, i64 %N) {			define void @test_load_i8_store_i32(i8* noalias %src, i32* noalias %dst, i32 %off, i64 %N) {
	; CHECK-LABEL: @test_load_i8_store_i32(			; CHECK-LABEL: @test_load_i8_store_i32(
	; CHECK-NOT: x i8>			; CHECK: <4 x i8>
				dmgreen: Can check for <4 x i8> specifically.
	;			;
	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]			%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
	%gep.src = getelementptr inbounds i8, i8* %src, i64 %iv			%gep.src = getelementptr inbounds i8, i8* %src, i64 %iv
	%lv = load i8, i8* %gep.src, align 1			%lv = load i8, i8* %gep.src, align 1
	▲ Show 20 Lines • Show All 104 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll

	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: opt < %s -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -S --debug-only=loop-vectorize 2>&1 | FileCheck %s			; RUN: opt < %s -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -S --debug-only=loop-vectorize 2>&1 | FileCheck %s

	; This test shows extremely high interleaving cost that, probably, should be fixed.
	; Due to the high cost, interleaving is not beneficial and the cost model chooses to scalarize
	; the load instructions.

	target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				dmgreen: This comment looks old now.
	target triple = "aarch64--linux-gnu"			target triple = "aarch64--linux-gnu"

	%pair = type { i8, i8 }			%pair = type { i8, i8 }

	; CHECK-LABEL: test			; CHECK-LABEL: test
	; CHECK: Found an estimated cost of 20 for VF 2 For instruction: {{.*}} load i8			; CHECK: Found an estimated cost of 17 for VF 2 For instruction: {{.*}} load i8
	; CHECK: Found an estimated cost of 0 for VF 2 For instruction: {{.*}} load i8			; CHECK: Found an estimated cost of 0 for VF 2 For instruction: {{.*}} load i8
	; CHECK: vector.body			; CHECK: vector.body
	; CHECK: load i8			; CHECK: load <4 x i8>
	; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body			; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

	define void @test(%pair* %p, i64 %n) {			define void @test(%pair* %p, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]			%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
	Show All 12 Lines

llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll

	Show First 20 Lines • Show All 165 Lines • ▼ Show 20 Lines
	; GATHER-NEXT: [[TMP17:%.*]] = extractelement <8 x i32> [[TMP3]], i32 6			; GATHER-NEXT: [[TMP17:%.*]] = extractelement <8 x i32> [[TMP3]], i32 6
	; GATHER-NEXT: [[TMP18:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP3]])			; GATHER-NEXT: [[TMP18:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP3]])
	; GATHER-NEXT: [[OP_EXTRA]] = add i32 [[TMP18]], -5			; GATHER-NEXT: [[OP_EXTRA]] = add i32 [[TMP18]], -5
	; GATHER-NEXT: [[TMP19:%.*]] = extractelement <8 x i32> [[TMP3]], i32 7			; GATHER-NEXT: [[TMP19:%.*]] = extractelement <8 x i32> [[TMP3]], i32 7
	; GATHER-NEXT: br label [[FOR_BODY]]			; GATHER-NEXT: br label [[FOR_BODY]]
	;			;
	; MAX-COST-LABEL: @PR32038(			; MAX-COST-LABEL: @PR32038(
	; MAX-COST-NEXT: entry:			; MAX-COST-NEXT: entry:
	; MAX-COST-NEXT: [[TMP0:%.*]] = load <2 x i8>, <2 x i8>* bitcast (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1) to <2 x i8>*), align 1			; MAX-COST-NEXT: [[TMP0:%.*]] = load <4 x i8>, <4 x i8>* bitcast (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1) to <4 x i8>*), align 1
	; MAX-COST-NEXT: [[TMP1:%.*]] = icmp eq <2 x i8> [[TMP0]], zeroinitializer			; MAX-COST-NEXT: [[TMP1:%.*]] = icmp eq <4 x i8> [[TMP0]], zeroinitializer
	; MAX-COST-NEXT: [[TMP2:%.*]] = load <2 x i8>, <2 x i8>* bitcast (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 3) to <2 x i8>*), align 1
	; MAX-COST-NEXT: [[TMP3:%.*]] = icmp eq <2 x i8> [[TMP2]], zeroinitializer
	; MAX-COST-NEXT: [[P8:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1			; MAX-COST-NEXT: [[P8:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
	; MAX-COST-NEXT: [[P9:%.*]] = icmp eq i8 [[P8]], 0			; MAX-COST-NEXT: [[P9:%.*]] = icmp eq i8 [[P8]], 0
	; MAX-COST-NEXT: [[P10:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2			; MAX-COST-NEXT: [[P10:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
	; MAX-COST-NEXT: [[P11:%.*]] = icmp eq i8 [[P10]], 0			; MAX-COST-NEXT: [[P11:%.*]] = icmp eq i8 [[P10]], 0
	; MAX-COST-NEXT: [[P12:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1			; MAX-COST-NEXT: [[P12:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
	; MAX-COST-NEXT: [[P13:%.*]] = icmp eq i8 [[P12]], 0			; MAX-COST-NEXT: [[P13:%.*]] = icmp eq i8 [[P12]], 0
	; MAX-COST-NEXT: [[P14:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8			; MAX-COST-NEXT: [[P14:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
	; MAX-COST-NEXT: [[P15:%.*]] = icmp eq i8 [[P14]], 0			; MAX-COST-NEXT: [[P15:%.*]] = icmp eq i8 [[P14]], 0
	; MAX-COST-NEXT: br label [[FOR_BODY:%.*]]			; MAX-COST-NEXT: br label [[FOR_BODY:%.*]]
	; MAX-COST: for.body:			; MAX-COST: for.body:
	; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[P34:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]			; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[P34:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
	; MAX-COST-NEXT: [[TMP4:%.*]] = extractelement <2 x i1> [[TMP3]], i32 1			; MAX-COST-NEXT: [[TMP2:%.*]] = extractelement <4 x i1> [[TMP1]], i32 3
	; MAX-COST-NEXT: [[TMP5:%.*]] = shufflevector <2 x i1> [[TMP1]], <2 x i1> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>			; MAX-COST-NEXT: [[TMP3:%.*]] = select <4 x i1> [[TMP1]], <4 x i32> <i32 -720, i32 -720, i32 -720, i32 -720>, <4 x i32> <i32 -80, i32 -80, i32 -80, i32 -80>
	; MAX-COST-NEXT: [[TMP6:%.*]] = shufflevector <4 x i1> poison, <4 x i1> [[TMP5]], <4 x i32> <i32 4, i32 5, i32 2, i32 3>			; MAX-COST-NEXT: [[TMP4:%.*]] = extractelement <4 x i1> [[TMP1]], i32 2
	; MAX-COST-NEXT: [[TMP7:%.*]] = shufflevector <2 x i1> [[TMP3]], <2 x i1> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>			; MAX-COST-NEXT: [[TMP5:%.*]] = extractelement <4 x i1> [[TMP1]], i32 1
	; MAX-COST-NEXT: [[TMP8:%.*]] = shufflevector <4 x i1> [[TMP6]], <4 x i1> [[TMP7]], <4 x i32> <i32 0, i32 1, i32 4, i32 5>			; MAX-COST-NEXT: [[TMP6:%.*]] = extractelement <4 x i1> [[TMP1]], i32 0
	; MAX-COST-NEXT: [[TMP9:%.*]] = select <4 x i1> [[TMP8]], <4 x i32> <i32 -720, i32 -720, i32 -720, i32 -720>, <4 x i32> <i32 -80, i32 -80, i32 -80, i32 -80>
	; MAX-COST-NEXT: [[TMP10:%.*]] = extractelement <2 x i1> [[TMP3]], i32 0
	; MAX-COST-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[TMP1]], i32 1
	; MAX-COST-NEXT: [[TMP12:%.*]] = extractelement <2 x i1> [[TMP1]], i32 0
	; MAX-COST-NEXT: [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80			; MAX-COST-NEXT: [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80
	; MAX-COST-NEXT: [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80			; MAX-COST-NEXT: [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80
	; MAX-COST-NEXT: [[TMP13:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP9]])			; MAX-COST-NEXT: [[TMP7:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP3]])
	; MAX-COST-NEXT: [[TMP14:%.*]] = add i32 [[TMP13]], [[P27]]			; MAX-COST-NEXT: [[TMP8:%.*]] = add i32 [[TMP7]], [[P27]]
	; MAX-COST-NEXT: [[TMP15:%.*]] = add i32 [[TMP14]], [[P29]]			; MAX-COST-NEXT: [[TMP9:%.*]] = add i32 [[TMP8]], [[P29]]
	; MAX-COST-NEXT: [[OP_EXTRA:%.*]] = add i32 [[TMP15]], -5			; MAX-COST-NEXT: [[OP_EXTRA:%.*]] = add i32 [[TMP9]], -5
	; MAX-COST-NEXT: [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80			; MAX-COST-NEXT: [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80
	; MAX-COST-NEXT: [[P32:%.*]] = add i32 [[OP_EXTRA]], [[P31]]			; MAX-COST-NEXT: [[P32:%.*]] = add i32 [[OP_EXTRA]], [[P31]]
	; MAX-COST-NEXT: [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80			; MAX-COST-NEXT: [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80
	; MAX-COST-NEXT: [[P34]] = add i32 [[P32]], [[P33]]			; MAX-COST-NEXT: [[P34]] = add i32 [[P32]], [[P33]]
	; MAX-COST-NEXT: br label [[FOR_BODY]]			; MAX-COST-NEXT: br label [[FOR_BODY]]
	;			;
	entry:			entry:
	%p0 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1), align 1			%p0 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1), align 1
	Show All 37 Lines

llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=generic < %s | FileCheck %s			; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=generic < %s | FileCheck %s

	target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
	target triple = "aarch64"			target triple = "aarch64"

	%struct.weight_t = type { i32, i32 }			%struct.weight_t = type { i32, i32 }

	define void @f_noalias(i8* noalias nocapture %dst, i8* noalias nocapture readonly %src, %struct.weight_t* noalias nocapture readonly %w) {			define void @f_noalias(i8* noalias nocapture %dst, i8* noalias nocapture readonly %src, %struct.weight_t* noalias nocapture readonly %w) {
	; CHECK-LABEL: @f_noalias(			; CHECK-LABEL: @f_noalias(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[SCALE:%.*]] = getelementptr inbounds [[STRUCT_WEIGHT_T:%.*]], %struct.weight_t* [[W:%.*]], i64 0, i32 0			; CHECK-NEXT: [[SCALE:%.*]] = getelementptr inbounds [[STRUCT_WEIGHT_T:%.*]], %struct.weight_t* [[W:%.*]], i64 0, i32 0
	; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[SCALE]], align 16			; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[SCALE]], align 16
	; CHECK-NEXT: [[OFFSET:%.*]] = getelementptr inbounds [[STRUCT_WEIGHT_T]], %struct.weight_t* [[W]], i64 0, i32 1			; CHECK-NEXT: [[OFFSET:%.*]] = getelementptr inbounds [[STRUCT_WEIGHT_T]], %struct.weight_t* [[W]], i64 0, i32 1
	; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[OFFSET]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[OFFSET]], align 4
	; CHECK-NEXT: [[TMP2:%.*]] = load i8, i8* [[SRC:%.*]], align 1			; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i8, i8* [[SRC:%.*]], i64 1
	; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP2]] to i32			; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i8, i8* [[DST:%.*]], i64 1
	; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP0]], [[CONV]]
	; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[MUL]], [[TMP1]]
	; CHECK-NEXT: [[TOBOOL_NOT_I:%.*]] = icmp ult i32 [[ADD]], 256
	; CHECK-NEXT: [[TMP3:%.*]] = icmp sgt i32 [[ADD]], 0
	; CHECK-NEXT: [[SHR_I:%.*]] = sext i1 [[TMP3]] to i32
	; CHECK-NEXT: [[COND_I:%.*]] = select i1 [[TOBOOL_NOT_I]], i32 [[ADD]], i32 [[SHR_I]]
	; CHECK-NEXT: [[CONV_I:%.*]] = trunc i32 [[COND_I]] to i8
	; CHECK-NEXT: store i8 [[CONV_I]], i8* [[DST:%.*]], align 1
	; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 1
	; CHECK-NEXT: [[TMP4:%.*]] = load i8, i8* [[ARRAYIDX_1]], align 1
	; CHECK-NEXT: [[CONV_1:%.*]] = zext i8 [[TMP4]] to i32
	; CHECK-NEXT: [[MUL_1:%.*]] = mul nsw i32 [[TMP0]], [[CONV_1]]
	; CHECK-NEXT: [[ADD_1:%.*]] = add nsw i32 [[MUL_1]], [[TMP1]]
	; CHECK-NEXT: [[TOBOOL_NOT_I_1:%.*]] = icmp ult i32 [[ADD_1]], 256
	; CHECK-NEXT: [[TMP5:%.*]] = icmp sgt i32 [[ADD_1]], 0
	; CHECK-NEXT: [[SHR_I_1:%.*]] = sext i1 [[TMP5]] to i32
	; CHECK-NEXT: [[COND_I_1:%.*]] = select i1 [[TOBOOL_NOT_I_1]], i32 [[ADD_1]], i32 [[SHR_I_1]]
	; CHECK-NEXT: [[CONV_I_1:%.*]] = trunc i32 [[COND_I_1]] to i8
	; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 1
	; CHECK-NEXT: store i8 [[CONV_I_1]], i8* [[ARRAYIDX2_1]], align 1
	; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 2			; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 2
	; CHECK-NEXT: [[TMP6:%.*]] = load i8, i8* [[ARRAYIDX_2]], align 1
	; CHECK-NEXT: [[CONV_2:%.*]] = zext i8 [[TMP6]] to i32
	; CHECK-NEXT: [[MUL_2:%.*]] = mul nsw i32 [[TMP0]], [[CONV_2]]
	; CHECK-NEXT: [[ADD_2:%.*]] = add nsw i32 [[MUL_2]], [[TMP1]]
	; CHECK-NEXT: [[TOBOOL_NOT_I_2:%.*]] = icmp ult i32 [[ADD_2]], 256
	; CHECK-NEXT: [[TMP7:%.*]] = icmp sgt i32 [[ADD_2]], 0
	; CHECK-NEXT: [[SHR_I_2:%.*]] = sext i1 [[TMP7]] to i32
	; CHECK-NEXT: [[COND_I_2:%.*]] = select i1 [[TOBOOL_NOT_I_2]], i32 [[ADD_2]], i32 [[SHR_I_2]]
	; CHECK-NEXT: [[CONV_I_2:%.*]] = trunc i32 [[COND_I_2]] to i8
	; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 2			; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 2
	; CHECK-NEXT: store i8 [[CONV_I_2]], i8* [[ARRAYIDX2_2]], align 1
	; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 3			; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 3
	; CHECK-NEXT: [[TMP8:%.*]] = load i8, i8* [[ARRAYIDX_3]], align 1			; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8* [[SRC]] to <4 x i8>*
	; CHECK-NEXT: [[CONV_3:%.*]] = zext i8 [[TMP8]] to i32			; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i8>, <4 x i8>* [[TMP2]], align 1
	; CHECK-NEXT: [[MUL_3:%.*]] = mul nsw i32 [[TMP0]], [[CONV_3]]			; CHECK-NEXT: [[TMP4:%.*]] = zext <4 x i8> [[TMP3]] to <4 x i32>
	; CHECK-NEXT: [[ADD_3:%.*]] = add nsw i32 [[MUL_3]], [[TMP1]]			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> poison, i32 [[TMP0]], i32 0
	; CHECK-NEXT: [[TOBOOL_NOT_I_3:%.*]] = icmp ult i32 [[ADD_3]], 256			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[TMP0]], i32 1
	; CHECK-NEXT: [[TMP9:%.*]] = icmp sgt i32 [[ADD_3]], 0			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP0]], i32 2
	; CHECK-NEXT: [[SHR_I_3:%.*]] = sext i1 [[TMP9]] to i32			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP0]], i32 3
	; CHECK-NEXT: [[COND_I_3:%.*]] = select i1 [[TOBOOL_NOT_I_3]], i32 [[ADD_3]], i32 [[SHR_I_3]]			; CHECK-NEXT: [[TMP9:%.*]] = mul nsw <4 x i32> [[TMP8]], [[TMP4]]
	; CHECK-NEXT: [[CONV_I_3:%.*]] = trunc i32 [[COND_I_3]] to i8			; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> poison, i32 [[TMP1]], i32 0
				; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP1]], i32 1
				; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP1]], i32 2
				; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP12]], i32 [[TMP1]], i32 3
				; CHECK-NEXT: [[TMP14:%.*]] = add nsw <4 x i32> [[TMP9]], [[TMP13]]
				; CHECK-NEXT: [[TMP15:%.*]] = icmp ult <4 x i32> [[TMP14]], <i32 256, i32 256, i32 256, i32 256>
				; CHECK-NEXT: [[TMP16:%.*]] = icmp sgt <4 x i32> [[TMP14]], zeroinitializer
				; CHECK-NEXT: [[TMP17:%.*]] = sext <4 x i1> [[TMP16]] to <4 x i32>
				; CHECK-NEXT: [[TMP18:%.*]] = select <4 x i1> [[TMP15]], <4 x i32> [[TMP14]], <4 x i32> [[TMP17]]
				; CHECK-NEXT: [[TMP19:%.*]] = trunc <4 x i32> [[TMP18]] to <4 x i8>
	; CHECK-NEXT: [[ARRAYIDX2_3:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 3			; CHECK-NEXT: [[ARRAYIDX2_3:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 3
	; CHECK-NEXT: store i8 [[CONV_I_3]], i8* [[ARRAYIDX2_3]], align 1			; CHECK-NEXT: [[TMP20:%.*]] = bitcast i8* [[DST]] to <4 x i8>*
				; CHECK-NEXT: store <4 x i8> [[TMP19]], <4 x i8>* [[TMP20]], align 1
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%scale = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 0			%scale = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 0
	%0 = load i32, i32* %scale, align 16			%0 = load i32, i32* %scale, align 16
	%offset = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 1			%offset = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 1
	%1 = load i32, i32* %offset, align 4			%1 = load i32, i32* %offset, align 4
	%2 = load i8, i8* %src, align 1			%2 = load i8, i8* %src, align 1
	▲ Show 20 Lines • Show All 160 Lines • Show Last 20 Lines

Revision Contents

Diff 356463