This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Don't lower expensive allocas with minsize
AbandonedPublic

Authored by SjoerdMeijer on Mar 8 2019, 4:02 AM.

Details

Summary

This is work in progress to show how different pieces should fit together
(with e.g. D59014). I would like to teach SROA that lowering allocas is not
always beneficial.

The root cause of the inefficiency I am looking at is that some slices of the
alloca are covered by a memcpy that loads data from a character pointer.
Thus, the source operand of the memcpy gets an alignment of 1, like this:

call void @llvm.memcpy.p0i8.p0i8.i32(i8* nonnull align 1 %tmp.sroa.0.0..sroa_idx7, i8* align 1 %src, i32 32, i1 false)

This is fine, nothing wrong with that, but after a few SROA invocations and
alloca rewrites, we end up with loads/stores with an alignment of 1, which is
what these memcpys are expanded to. Depending on how much data we're
loading/storing, and whether unaligned data access is supported, this can lead
to an explosion in the number of byte load/store instructions. To illustrate
the problem, here is a not uncommon initialisation code example:

    
#include <string.h>

typedef struct {
  char data[32];
} data_t;
typedef struct {
  long long m1;
  long long m2;
  long long m3;
} data_2_t;
data_2_t *getData2();
void init(char *src) {
  data_t tmp;
  memcpy(&tmp, src, sizeof(tmp));
  data_2_t *dst = getData2();
  memcpy(&dst->m1, &tmp.data[8], 8);
  memcpy(&dst->m2, &tmp.data[16], 8);
}

Here, stack object 'tmp' gets filled in by the memcpy that reads from the
character pointer, and the subsequent memcpys read from that stack object.
When optimising for minsize, I would like to generate this:

    
init:
  push  {r7, lr}
  mov r7, sp
  sub sp, #32
  mov r1, r0
  mov r0, sp
  movs  r2, #32
  bl  __aeabi_memcpy
  bl  getData2
  add.w r12, sp, #8
  ldm.w r12, {r1, r2, r3, r12}
  stm.w r0, {r1, r2, r3, r12}
  add sp, #32
  pop {r7, pc}

The solution to achieve this relies on looking at the cost of the instructions
covered by the alloca: if one of them is an expensive instruction (for example
a memcpy of 36 bytes with 1-byte alignment), then don't rewrite the alloca.
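
To make that concrete, here is a minimal sketch of the kind of check this is
aiming at. It assumes it lives in lib/Transforms/Scalar/SROA.cpp (where the
AllocaSlices and Slice helpers are defined) and assumes a TTI-style memcpy
cost query; the helper name and the threshold are made up for illustration:

// Sketch only, not the actual diff: decide whether to leave an alloca alone
// because an expensive memcpy covers (part of) it and we optimise for size.
static bool hasExpensiveMemTransfer(AllocaInst &AI, AllocaSlices &AS,
                                    const TargetTransformInfo &TTI) {
  if (!AI.getFunction()->hasMinSize())
    return false; // Only back off when optimising for minimum size.
  for (Slice &S : AS) {
    const Use *U = S.getUse();
    if (!U) // Skip dead slices.
      continue;
    if (auto *MTI = dyn_cast<MemTransferInst>(U->getUser()))
      // If the target says this memcpy/memmove is expensive (e.g. a poorly
      // aligned copy that would be expanded into many byte loads/stores),
      // keep the alloca and the library call as they are.
      if (TTI.getMemcpyCost(MTI) > TargetTransformInfo::TCC_Basic)
        return true;
  }
  return false;
}

The intended use would be an early bail-out in SROA::runOnAlloca, before the
alloca is split and rewritten.
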
At the moment however, we generate this monster:

init:
  push  {r4, r5, r6, r7, lr}
  add r7, sp, #12
  push.w  {r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11}
  ldrb  r1, [r0, #8]!
  str r1, [sp, #24]
  ldrb  r1, [r0, #8]!
  str r1, [sp, #16]
  ldrb  r1, [r0, #4]!
  ldrb  r2, [r0, #-8]!
  str r2, [sp, #12]
  ldrb  r2, [r0, #11]
  ldrb  r3, [r0, #10]
  ldrb  r6, [r0, #-1]
  ldrb.w  r8, [r0, #1]
  ldrb.w  r9, [r0, #3]
  ldrb.w  r10, [r0, #2]
  ldrb.w  r11, [r0, #5]
  ldrb  r4, [r0, #7]
  ldrb  r5, [r0, #6]
  orr.w r2, r3, r2, lsl #8
  ldrb  r3, [r0, #9]
  orr.w r1, r1, r3, lsl #8
  orr.w r1, r1, r2, lsl #16
  str r1, [sp, #20]
  ldrb  r1, [r0, #-3]
  str r1, [sp, #8]
  ldrb  r1, [r0, #-2]
  str r1, [sp, #4]
  bl  getData2
  ldr r2, [sp, #16]
  ldr r3, [sp, #12]
  orr.w r1, r5, r4, lsl #8
  ldr r5, [sp, #8]
  orr.w r2, r2, r11, lsl #8
  orr.w r3, r3, r8, lsl #8
  orr.w r1, r2, r1, lsl #16
  orr.w r2, r10, r9, lsl #8
  orr.w r2, r3, r2, lsl #16
  ldr r3, [sp, #24]
  orr.w r3, r3, r5, lsl #8
  ldr r5, [sp, #4]
  orr.w r6, r5, r6, lsl #8
  orr.w r3, r3, r6, lsl #16
  strd  r3, r2, [r0]
  str r1, [r0, #8]
  ldr r1, [sp, #20]
  str r1, [r0, #12]
  add sp, #28
  pop.w {r8, r9, r10, r11}
  pop {r4, r5, r6, r7, pc}

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Mar 8 2019, 4:02 AM

How hard would it be to try to handle your testcase late instead? Given a series of adjacent unaligned loads, transform that into a memcpy followed by aligned loads. Maybe we don't have appropriate infrastructure...
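
A source-level caricature of that suggestion (all names here are made up for
illustration; an "unaligned load" is written as a small memcpy into a local,
which is the portable C/C++ way to spell it):

#include <cstring>

struct data_2_t { long long m1, m2, m3; };

// Roughly what SROA leaves behind today: individual 8-byte loads from a
// pointer that is only known to be 1-byte aligned.
void before(data_2_t *dst, const char *src) {
  long long a, b;
  std::memcpy(&a, src + 8, 8);   // align 1 load
  std::memcpy(&b, src + 16, 8);  // align 1 load
  dst->m1 = a;
  dst->m2 = b;
}

// The suggested late rewrite: merge the adjacent unaligned loads into one
// memcpy to an aligned temporary, then load from it with natural alignment.
void after(data_2_t *dst, const char *src) {
  alignas(8) char tmp[32];
  std::memcpy(tmp, src, sizeof tmp);
  long long a, b;
  std::memcpy(&a, tmp + 8, 8);   // align 8 load
  std::memcpy(&b, tmp + 16, 8);  // align 8 load
  dst->m1 = a;
  dst->m2 = b;
}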

If we are going to do this, we probably don't want to disable SROA completely for allocas that contain a memcpy source/destination, just avoid promoting the specific slices where it's relevant.

Minor style issues

include/llvm/Transforms/Scalar/SROA.h, line 23:

(style) include order?

lib/Transforms/Scalar/SROA.cpp, line 1100:

remove empty line

Hi @RKSimon, @efriedma, thank you very much for looking at this!

I will address the minor issues, but first about:

How hard would it be to try to handle your testcase late instead?

I think I've explored most places where intrinsics/alignment are handled, from Clang and InstCombine to the LoadStoreOptimizer, to see where I could e.g. patch up the alignment problem. Either this analysis and rewrite was difficult to fit in there, or, perhaps more importantly, I came to the conclusion that SROA is where the rewrites happen that cause issues later on. Thus, I think it makes sense to make a different decision here. But perhaps I've missed something, so that was one reason to post this work-in-progress patch. And you're absolutely right about this:

If we are going to do this, we probably don't want to disable SROA completely for allocas that contain a memcpy source/destination, just avoid promoting the specific slices where it's relevant.

We definitely don't want to disable SROA completely. If you look at shouldExpand, you see that a decision is made per alloca slice. So it's not a blunt switch that disables things completely; at least, that's definitely not the idea.

To make an informed decision that a memcpy is expensive, we need 3 things:

  1. get the cost of a C library intrinsic, which is what D59014 is enabling,
  2. model the cost of a memcpy, which is what's currently missing and still needs to be added (a rough sketch of what such a hook could look like follows after this list), and
  3. this SROA patch, which just uses 1) and 2). Thus, I don't think at this point that this patch needs much changing or more code, except of course some tests.
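
For point 2, this is roughly the shape such a target hook could take. The
interface is modelled on a TTI-style getMemcpyCost query; the placement (a
target's TTI implementation), the heuristic and the thresholds are assumptions
for illustration only:

// In a target's TTI implementation this could back a getMemcpyCost-style
// hook; written here as a free function for brevity. The numbers and the
// alignment/size heuristic are made up: the idea is only that a small,
// well-aligned copy is cheap, while a poorly aligned one that would be
// expanded into byte accesses is reported as expensive.
static int getMemcpyCostSketch(const Instruction *I) {
  const auto *MC = dyn_cast<MemTransferInst>(I);
  if (!MC)
    return TargetTransformInfo::TCC_Basic;

  const auto *Len = dyn_cast<ConstantInt>(MC->getLength());
  if (!Len) // Unknown length: this stays a library call anyway.
    return TargetTransformInfo::TCC_Basic;

  const uint64_t Size = Len->getZExtValue();
  const uint64_t SrcAlign = MC->getSourceAlign().valueOrOne().value();
  const uint64_t DstAlign = MC->getDestAlign().valueOrOne().value();

  // Cheap: lowers to a handful of (aligned) loads/stores.
  if (Size <= 16 && SrcAlign >= 4 && DstAlign >= 4)
    return TargetTransformInfo::TCC_Basic;
  // Expensive: e.g. a 32-byte copy with 1-byte alignment on a target
  // without unaligned access support.
  return TargetTransformInfo::TCC_Expensive;
}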

To illustrate that "patching up the alignment" later on is quite difficult, this is the IR that comes out of SROA:

define hidden void @foo(i8* nocapture readonly %src) local_unnamed_addr #0 {
entry:
  %tmp.sroa.0.sroa.0 = alloca [8 x i8]
  %tmp.sroa.0.sroa.5 = alloca [8 x i8]
  %tmp.sroa.0.sroa.0.0..sroa_idx17 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.0, i32 0, i32 0
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %tmp.sroa.0.sroa.0.0..sroa_idx17)
  %tmp.sroa.0.sroa.5.0..sroa_idx15 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.5, i32 0, i32 0
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %tmp.sroa.0.sroa.5.0..sroa_idx15)
  %tmp.sroa.0.sroa.0.0..sroa_idx9 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.0, i32 0, i32 0
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 1 %tmp.sroa.0.sroa.0.0..sroa_idx9, i8* align 1 %src, i32 8, i1 false)
  %tmp.sroa.0.sroa.3.0.src.sroa_idx = getelementptr inbounds i8, i8* %src, i32 8
  %tmp.sroa.0.sroa.3.0.src.sroa_cast = bitcast i8* %tmp.sroa.0.sroa.3.0.src.sroa_idx to i64*
  %tmp.sroa.0.sroa.3.0.copyload = load i64, i64* %tmp.sroa.0.sroa.3.0.src.sroa_cast, align 1
  %tmp.sroa.0.sroa.4.0.src.sroa_idx = getelementptr inbounds i8, i8* %src, i32 16
  %tmp.sroa.0.sroa.4.0.src.sroa_cast = bitcast i8* %tmp.sroa.0.sroa.4.0.src.sroa_idx to i64*
  %tmp.sroa.0.sroa.4.0.copyload = load i64, i64* %tmp.sroa.0.sroa.4.0.src.sroa_cast, align 1
  %tmp.sroa.0.sroa.5.0.src.sroa_raw_idx = getelementptr inbounds i8, i8* %src, i32 24
  %tmp.sroa.0.sroa.5.0..sroa_idx13 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.5, i32 0, i32 0
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 1 %tmp.sroa.0.sroa.5.0..sroa_idx13, i8* align 1 %tmp.sroa.0.sroa.5.0.src.sroa_raw_idx, i32 8, i1 false)
  %call = call %struct.data_2_t* bitcast (%struct.data_2_t* (...)* @getData2 to %struct.data_2_t* ()*)() #3
  %0 = getelementptr inbounds %struct.data_2_t, %struct.data_2_t* %call, i32 0, i32 0
  store i64 %tmp.sroa.0.sroa.3.0.copyload, i64* %0, align 8
  %tmp.sroa.0.16..sroa_idx = getelementptr inbounds %struct.data_2_t, %struct.data_2_t* %call, i32 0, i32 1
  store i64 %tmp.sroa.0.sroa.4.0.copyload, i64* %tmp.sroa.0.16..sroa_idx, align 8
  %tmp.sroa.0.sroa.0.0..sroa_idx18 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.0, i32 0, i32 0
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %tmp.sroa.0.sroa.0.0..sroa_idx18)
  %tmp.sroa.0.sroa.5.0..sroa_idx16 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.5, i32 0, i32 0
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %tmp.sroa.0.sroa.5.0..sroa_idx16)
  ret void
}

The main problem here is the align 1 on the i64 loads; again, that is created here in SROA, so preventing it from happening here seems to make the most sense to me.

There's actually another problem with this IR: all these copyloads are emitted first by SROA, consecutively, creating huge live ranges between them and the store instructions. That creates another problem (when the example is a bit bigger): spilling. A more sensible output would be to follow the original program order more closely. I think that's what SROA tries to do, but it fails quite badly in these examples. Anyway, this is not the biggest problem now, and it is hidden when we don't always lower allocas.

I was sort of thinking you could write the reverse transform in MemCpyOpt, since it does a similar sort of analysis, but maybe we'd have to do it later, or patch SROA anyway, to avoid two transforms fighting each other.

If you look at shouldExpand, you see that a decision is made per alloca slice

You're returning early from SROA::runOnAlloca. That's a decision on the whole alloca: if there's an expensive memcpy into any part of the alloca, don't do any splitting on the whole alloca. The alternative is to make the decision on a per-slice level: given a subset of slices that are the target of a memcpy, "merge" those slices so splitAlloca() doesn't split them away from each other.
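
For reference, a rough sketch of what that per-slice alternative could look
like, again assuming it lives in SROA.cpp next to AllocaSlices/Slice; whether
makeUnsplittable() is enough to stop splitAlloca() from separating these
slices is an assumption of the sketch, not something the current patch does:

// Sketch of the per-slice alternative: instead of skipping the whole alloca,
// only the byte range covered by an expensive memcpy is kept in one piece.
static void keepExpensiveMemcpySlicesTogether(AllocaSlices &AS,
                                              const TargetTransformInfo &TTI) {
  for (Slice &S : AS) {
    const Use *U = S.getUse();
    if (!U) // Skip dead slices.
      continue;
    auto *MTI = dyn_cast<MemTransferInst>(U->getUser());
    if (!MTI || TTI.getMemcpyCost(MTI) <= TargetTransformInfo::TCC_Basic)
      continue;
    // Keep the range covered by the expensive memcpy intact; the rest of
    // the alloca can still be split and promoted as usual.
    S.makeUnsplittable();
  }
}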

I was sort of thinking you could write the reverse transform in MemCpyOpt, since it does a similar sort of analysis, but maybe we'd have to do it later, or patch SROA anyway, to avoid two transforms fighting each other.

Ah, yes, MemCpyOpt. I haven't looked in great detail at that because it hasn't bothered me yet. That is, in experiments where I completely disabled SROA, I didn't find MemCpyOpt reversing things. I am guessing that's because the analysis and transformation are far from straightforward. To me, it sounds easier to restrict the rewrite/transformation here in SROA, rather than trying to recreate in MemCpyOpt the IR that we once had. But again, I haven't looked much into MemCpyOpt, so if you think it would be better there, I will start looking into that.

If you look at shouldExpand, you see that a decision is made per alloca slice

You're returning early from SROA::runOnAlloca. That's a decision on the whole alloca: if there's an expensive memcpy into any part of the alloca, don't do any splitting on the whole alloca. The alternative is to make the decision on a per-slice level: given a subset of slices that are the target of a memcpy, "merge" those slices so splitAlloca() doesn't split them away from each other.

Ah yes, you're right; I wanted to do it per alloca slice, but that's currently not the case. I can't remember if I had good reasons for that, and will double check what the codegen differences are.

But again, I haven't looked much into MemCpyOpt, so if you think it would be better there, I will start looking into that.

The transform is more powerful working the other way: it could catch cases that weren't originally written as "memcpy" in the source code. Not sure that means we shouldn't do something in SROA, though. Ideally, I think we want the "canonical" form to be the promoted version, and transform it to the compact version as part of lowering. But I'm not sure how much work it would be to make the reverse transform reliable, and I don't want to block an incremental improvement.

MemCpyOpt doesn't actually merge memcpys at the moment, I think. It does merge memsets, and it has a lot of memcpy-related transforms, but that one in particular is missing. But it's the right sort of logic, at least...

The transform is more powerful working the other way: it could catch cases that weren't originally written as "memcpy" in the source code. Not sure that means we shouldn't do something in SROA, though. Ideally, I think we want the "canonical" form to be the promoted version, and transform it to the compact version as part of lowering. But I'm not sure how much work it would be to make the reverse transform reliable, and I don't want to block an incremental improvement.

MemCpyOpt doesn't actually merge memcpys at the moment, I think. It does merge memsets, and it has a lot of memcpy-related transforms, but that one in particular is missing. But it's the right sort of logic, at least...

Many thanks for sharing your ideas. I will go for the incremental approach first then. One of the advantages is that the code changes will be minimal: I only need a target hook for the memcpy cost, so we don't create (much) technical debt if we do end up tackling this in MemCpyOpt.

RKSimon resigned from this revision. Mar 21 2019, 11:46 AM
SjoerdMeijer retitled this revision from [SROA] WIP: Lowering alloca is not always beneficial to [SROA] Don't lower expensive allocas with minsize.

With all the groundwork now committed, I would like to progress this. I've added:

  • some minor cleanups as commented earlier,
  • my motivating example as a test case.

I've also run code size benchmarks with this change. They didn't show any changes in MBed and another code size corpus (which is okay). For the CSiBe benchmark, there were a few changes:

App                        Before       After        Diff %
csibe/OpenTCP-1.0.4        21835.0      21835.0      0.0
csibe/bzip2-1.0.2          47344.0      47344.0      0.0
csibe/cg_compiler_opensrc  96142.0      96198.0      0.0582471760521
csibe/compiler             19278.0      19278.0      0.0
csibe/jikespg-1.3          170498.0     170152.0     -0.202934931788
csibe/jpeg-6b              96337.0      96337.0      0.0
csibe/libpng-1.2.5         77620.0      77636.0      0.0206132440093
csibe/lwip-0.5.3.preproc   68569.0      68569.0      0.0
csibe/mpeg2dec-0.3.1       38904.0      38904.0      0.0
csibe/mpgcut-1.1           7192.0       7192.0       0.0
csibe/teem-1.6.0-src       1028555.0    1028555.0    0.0
csibe/ttt-0.10.1.preproc   12691.0      12691.0      0.0
csibe/unrarlib-0.4.0       10382.0      10382.0      0.0
csibe/zlib-1.1.4           29221.0      29297.0      0.260086923788
total                      1724569.0    1724371.0    -0.011481129488

This shows a real code size reduction of 346 bytes in jikespg-1.3, with a few minor regressions, but a fairly decent improvement overall. Thus, I am happy with the results, as it does what I wanted and hoped for.

The only thing I haven't tried yet is @efriedma's suggestion to do the check on a per-slice basis; currently I indeed do it per alloca.

I'm not sure those benchmark results are good enough to justify this: there are only 4 changes, and the ups and downs almost balance each other out. Have you investigated the regressions to see what this makes worse, and if there's a way we could avoid them?

I'm not sure those benchmark results are good enough to justify this: there are only 4 changes, and the ups and downs almost balance each other out.

I see what you mean. But overall we save 198 bytes, and I would be happy to take a 198-byte size reduction. Probably what I wanted to say in my previous message is that it's not a one-trick pony: it not only helps with my motivating example, but with other benchmarks too. But I will indeed now look into fine-tuning it and see if we can avoid the regressions. As I said, I will look into doing it on a per-slice basis.

Have you investigated the regressions to see what this makes worse, and if there's a way we could avoid them?

Yep, only 1, but that was some time ago. I thought we were "unlucky", but I can't remember what it was. Will look into this, and the other cases.

My concern isn't really the size of the change, it's with the variance. With only four benchmarks which change, and only by a small amount each, then if you happened to run one more or one fewer benchmark you could come to a completely different conclusion.

My concern isn't really the size of the change, it's with the variance. With only four benchmarks which change, and only by a small amount each, then if you happened to run one more or one fewer benchmark you could come to a completely different conclusion.

Okay, understood. But I have tested 3, I would say, quite large code bases. 2 of them don't show any changes, which I interpret as good news (because there are definitely memcpys in these code bases). The other, CSiBe, shows this little up-and-down behaviour, but improvements overall. Before I continue fine-tuning, how can I make this more convincing for you? Do I need more code bases, or just to get rid of the regressions?

It's hard to put exact numbers on this, especially when we are trading off vague concepts like "the complexity of the LLVM codebase", but I'd like to see enough benchmark results that we can be confident that this is making an overall improvement, and that we're not being misled by a few outliers. That might come from running a greater variety of benchmarks until we collect enough data, or it might come from improvements to the patch itself.

Okey dokey, agreed. I could throw a few more benchmarks into the mix; I am not sure it adds or changes anything, but I will do it anyway because it's easy. But I will mainly focus on this patch, and see if I can improve things.

SjoerdMeijer abandoned this revision.Mar 17 2023, 1:36 AM
Herald added a project: Restricted Project. · View Herald Transcript · Mar 17 2023, 1:36 AM