This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Don't lower expensive allocas with minsize
AbandonedPublic

Authored by SjoerdMeijer on Mar 8 2019, 4:02 AM.

Details

Summary

This is work in progress to show how different pieces should fit together
(with e.g. D59014). I would like to teach SROA that lowering allocas is not
always beneficial.

The root cause of the inefficiency I am looking at is that some slices of the
alloca are covered by a memcpy that loads data from a character pointer.
Thus, the source operand of the memcpy gets an alignment of 1, like this:

call void @llvm.memcpy.p0i8.p0i8.i32(i8* nonnull align 1 %tmp.sroa.0.0..sroa_idx7, i8* align 1 %src, i32 32, i1 false)

This is fine, nothing wrong with that, but after a few SROA invocations and
alloca rewrites, we end up with loads/stores with an alignment of 1, which is
what these memcpys are expanded to. Depending on how much data we're
loading/storing, and whether unaligned data access is supported, this can lead
to an explosion in the number of byte load/store instructions. To illustrate
the problem, here is a not uncommon initialisation code example:

    
#include <string.h>

typedef struct {
  char data[32];
} data_t;
typedef struct {
  long long m1;
  long long m2;
  long long m3;
} data_2_t;
data_2_t *getData2();
void init(char *src) {
  data_t tmp;
  memcpy(&tmp, src, sizeof(tmp));
  data_2_t *dst = getData2();
  memcpy(&dst->m1, &tmp.data[8], 8);
  memcpy(&dst->m2, &tmp.data[16], 8);
}

Here, stack object 'tmp' gets filled in by the memcpy that reads from the
character pointer, and the subsequent memcpys read from that stack object.
When optimising for minsize, I would like to generate this:

    
init:
  push  {r7, lr}
  mov r7, sp
  sub sp, #32
  mov r1, r0
  mov r0, sp
  movs  r2, #32
  bl  __aeabi_memcpy
  bl  getData2
  add.w r12, sp, #8
  ldm.w r12, {r1, r2, r3, r12}
  stm.w r0, {r1, r2, r3, r12}
  add sp, #32
  pop {r7, pc}

The solution to achieve this relies on looking at the cost of the instructions
covered by the alloca: if one of them is an expensive instruction (for example
a memcpy of 36 bytes with 1-byte alignment), then don't rewrite the alloca.
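
To make that concrete, here is a minimal sketch of the kind of check this is
aiming at. It assumes it lives in lib/Transforms/Scalar/SROA.cpp (where the
AllocaSlices and Slice helpers are defined) and assumes a TTI-style memcpy
cost query; the helper name and the threshold are made up for illustration:

// Sketch only, not the actual diff: decide whether to leave an alloca alone
// because an expensive memcpy covers (part of) it and we optimise for size.
static bool hasExpensiveMemTransfer(AllocaInst &AI, AllocaSlices &AS,
                                    const TargetTransformInfo &TTI) {
  if (!AI.getFunction()->hasMinSize())
    return false; // Only back off when optimising for minimum size.
  for (Slice &S : AS) {
    const Use *U = S.getUse();
    if (!U) // Skip dead slices.
      continue;
    if (auto *MTI = dyn_cast<MemTransferInst>(U->getUser()))
      // If the target says this memcpy/memmove is expensive (e.g. a poorly
      // aligned copy that would be expanded into many byte loads/stores),
      // keep the alloca and the library call as they are.
      if (TTI.getMemcpyCost(MTI) > TargetTransformInfo::TCC_Basic)
        return true;
  }
  return false;
}

The intended use would be an early bail-out in SROA::runOnAlloca, before the
alloca is split and rewritten.
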
At the moment however, we generate this monster:

init:
  push  {r4, r5, r6, r7, lr}
  add r7, sp, #12
  push.w  {r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11}
  ldrb  r1, [r0, #8]!
  str r1, [sp, #24]
  ldrb  r1, [r0, #8]!
  str r1, [sp, #16]
  ldrb  r1, [r0, #4]!
  ldrb  r2, [r0, #-8]!
  str r2, [sp, #12]
  ldrb  r2, [r0, #11]
  ldrb  r3, [r0, #10]
  ldrb  r6, [r0, #-1]
  ldrb.w  r8, [r0, #1]
  ldrb.w  r9, [r0, #3]
  ldrb.w  r10, [r0, #2]
  ldrb.w  r11, [r0, #5]
  ldrb  r4, [r0, #7]
  ldrb  r5, [r0, #6]
  orr.w r2, r3, r2, lsl #8
  ldrb  r3, [r0, #9]
  orr.w r1, r1, r3, lsl #8
  orr.w r1, r1, r2, lsl #16
  str r1, [sp, #20]
  ldrb  r1, [r0, #-3]
  str r1, [sp, #8]
  ldrb  r1, [r0, #-2]
  str r1, [sp, #4]
  bl  getData2
  ldr r2, [sp, #16]
  ldr r3, [sp, #12]
  orr.w r1, r5, r4, lsl #8
  ldr r5, [sp, #8]
  orr.w r2, r2, r11, lsl #8
  orr.w r3, r3, r8, lsl #8
  orr.w r1, r2, r1, lsl #16
  orr.w r2, r10, r9, lsl #8
  orr.w r2, r3, r2, lsl #16
  ldr r3, [sp, #24]
  orr.w r3, r3, r5, lsl #8
  ldr r5, [sp, #4]
  orr.w r6, r5, r6, lsl #8
  orr.w r3, r3, r6, lsl #16
  strd  r3, r2, [r0]
  str r1, [r0, #8]
  ldr r1, [sp, #20]
  str r1, [r0, #12]
  add sp, #28
  pop.w {r8, r9, r10, r11}
  pop {r4, r5, r6, r7, pc}

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Mar 8 2019, 4:02 AM

How hard would it be to try to handle your testcase late instead? Given a series of adjacent unaligned loads, transform that into a memcpy followed by aligned loads. Maybe we don't have appropriate infrastructure...
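
A source-level caricature of that suggestion (all names here are made up for
illustration; an "unaligned load" is written as a small memcpy into a local,
which is the portable C/C++ way to spell it):

#include <cstring>

struct data_2_t { long long m1, m2, m3; };

// Roughly what SROA leaves behind today: individual 8-byte loads from a
// pointer that is only known to be 1-byte aligned.
void before(data_2_t *dst, const char *src) {
  long long a, b;
  std::memcpy(&a, src + 8, 8);   // align 1 load
  std::memcpy(&b, src + 16, 8);  // align 1 load
  dst->m1 = a;
  dst->m2 = b;
}

// The suggested late rewrite: merge the adjacent unaligned loads into one
// memcpy to an aligned temporary, then load from it with natural alignment.
void after(data_2_t *dst, const char *src) {
  alignas(8) char tmp[32];
  std::memcpy(tmp, src, sizeof tmp);
  long long a, b;
  std::memcpy(&a, tmp + 8, 8);   // align 8 load
  std::memcpy(&b, tmp + 16, 8);  // align 8 load
  dst->m1 = a;
  dst->m2 = b;
}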

If we are going to do this, we probably don't want to disable SROA completely for allocas that contain a memcpy source/destination, just avoid promoting the specific slices where it's relevant.

Minor style issues

include/llvm/Transforms/Scalar/SROA.h, line 23:

(style) include order?

lib/Transforms/Scalar/SROA.cpp, line 1100:

remove empty line

Hi @RKSimon, @efriedma, thank you very much for looking at this!

I will address the minor issues, but first about:

How hard would it be to try to handle your testcase late instead?

I think I've explored most places where intrinsics/alignment are handled, from Clang and InstCombine to the LoadStoreOptimizer, to see where I could e.g. patch up the alignment problem. Either this analysis and rewrite was difficult to fit in there, or, perhaps more importantly, I came to the conclusion that SROA is where the rewrites happen that cause issues later on. Thus, I think it makes sense to make a different decision here. But perhaps I've missed something, so that was one reason to post this work-in-progress patch. And you're absolutely right about this:

If we are going to do this, we probably don't want to disable SROA completely for allocas that contain a memcpy source/destination, just avoid promoting the specific slices where it's relevant.

We definitely don't want to disable SROA completely. If you look at shouldExpand, you see that a decision is made per alloca slice. So it's not a blunt switch that disables things completely; at least, that's definitely not the idea.

To make an informed decision that a memcpy is expensive, we need 3 things:

  1. get the cost of a C library intrinsic, which is what D59014 is enabling,
  2. model the cost of a memcpy, which is what's currently missing and still needs to be added (a rough sketch of what such a hook could look like follows after this list), and
  3. this SROA patch, which just uses 1) and 2). Thus, I don't think at this point that this patch needs much changing or more code, except of course some tests.
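
For point 2, this is roughly the shape such a target hook could take. The
interface is modelled on a TTI-style getMemcpyCost query; the placement (a
target's TTI implementation), the heuristic and the thresholds are assumptions
for illustration only:

// In a target's TTI implementation this could back a getMemcpyCost-style
// hook; written here as a free function for brevity. The numbers and the
// alignment/size heuristic are made up: the idea is only that a small,
// well-aligned copy is cheap, while a poorly aligned one that would be
// expanded into byte accesses is reported as expensive.
static int getMemcpyCostSketch(const Instruction *I) {
  const auto *MC = dyn_cast<MemTransferInst>(I);
  if (!MC)
    return TargetTransformInfo::TCC_Basic;

  const auto *Len = dyn_cast<ConstantInt>(MC->getLength());
  if (!Len) // Unknown length: this stays a library call anyway.
    return TargetTransformInfo::TCC_Basic;

  const uint64_t Size = Len->getZExtValue();
  const uint64_t SrcAlign = MC->getSourceAlign().valueOrOne().value();
  const uint64_t DstAlign = MC->getDestAlign().valueOrOne().value();

  // Cheap: lowers to a handful of (aligned) loads/stores.
  if (Size <= 16 && SrcAlign >= 4 && DstAlign >= 4)
    return TargetTransformInfo::TCC_Basic;
  // Expensive: e.g. a 32-byte copy with 1-byte alignment on a target
  // without unaligned access support.
  return TargetTransformInfo::TCC_Expensive;
}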

To illustrate that "patching up the alignment" later on is quite difficult, this is the IR that comes out of SROA:

define hidden void @foo(i8* nocapture readonly %src) local_unnamed_addr #0 {
entry:
  %tmp.sroa.0.sroa.0 = alloca [8 x i8]
  %tmp.sroa.0.sroa.5 = alloca [8 x i8]
  %tmp.sroa.0.sroa.0.0..sroa_idx17 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.0, i32 0, i32 0
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %tmp.sroa.0.sroa.0.0..sroa_idx17)
  %tmp.sroa.0.sroa.5.0..sroa_idx15 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.5, i32 0, i32 0
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %tmp.sroa.0.sroa.5.0..sroa_idx15)
  %tmp.sroa.0.sroa.0.0..sroa_idx9 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.0, i32 0, i32 0
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 1 %tmp.sroa.0.sroa.0.0..sroa_idx9, i8* align 1 %src, i32 8, i1 false)
  %tmp.sroa.0.sroa.3.0.src.sroa_idx = getelementptr inbounds i8, i8* %src, i32 8
  %tmp.sroa.0.sroa.3.0.src.sroa_cast = bitcast i8* %tmp.sroa.0.sroa.3.0.src.sroa_idx to i64*
  %tmp.sroa.0.sroa.3.0.copyload = load i64, i64* %tmp.sroa.0.sroa.3.0.src.sroa_cast, align 1
  %tmp.sroa.0.sroa.4.0.src.sroa_idx = getelementptr inbounds i8, i8* %src, i32 16
  %tmp.sroa.0.sroa.4.0.src.sroa_cast = bitcast i8* %tmp.sroa.0.sroa.4.0.src.sroa_idx to i64*
  %tmp.sroa.0.sroa.4.0.copyload = load i64, i64* %tmp.sroa.0.sroa.4.0.src.sroa_cast, align 1
  %tmp.sroa.0.sroa.5.0.src.sroa_raw_idx = getelementptr inbounds i8, i8* %src, i32 24
  %tmp.sroa.0.sroa.5.0..sroa_idx13 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.5, i32 0, i32 0
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 1 %tmp.sroa.0.sroa.5.0..sroa_idx13, i8* align 1 %tmp.sroa.0.sroa.5.0.src.sroa_raw_idx, i32 8, i1 false)
  %call = call %struct.data_2_t* bitcast (%struct.data_2_t* (...)* @getData2 to %struct.data_2_t* ()*)() #3
  %0 = getelementptr inbounds %struct.data_2_t, %struct.data_2_t* %call, i32 0, i32 0
  store i64 %tmp.sroa.0.sroa.3.0.copyload, i64* %0, align 8
  %tmp.sroa.0.16..sroa_idx = getelementptr inbounds %struct.data_2_t, %struct.data_2_t* %call, i32 0, i32 1
  store i64 %tmp.sroa.0.sroa.4.0.copyload, i64* %tmp.sroa.0.16..sroa_idx, align 8
  %tmp.sroa.0.sroa.0.0..sroa_idx18 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.0, i32 0, i32 0
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %tmp.sroa.0.sroa.0.0..sroa_idx18)
  %tmp.sroa.0.sroa.5.0..sroa_idx16 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.5, i32 0, i32 0
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %tmp.sroa.0.sroa.5.0..sroa_idx16)
  ret void
}

The main problem here is the align 1 on the i64 loads; again, that is created here in SROA, so preventing it from happening here seems to make the most sense to me.

There's actually another problem with this IR: all these copyloads are emitted first by SROA, consecutively, creating huge live ranges between them and the store instructions. That creates another problem (when the example is a bit bigger): spilling. A more sensible output would be to follow the original program order more closely. I think that's what SROA tries to do, but it fails quite badly in these examples. Anyway, this is not the biggest problem now, and it is hidden when we don't always lower allocas.

I was sort of thinking you could write the reverse transform in MemCpyOpt, since it does a similar sort of analysis, but maybe we'd have to do it later, or patch SROA anyway, to avoid two transforms fighting each other.

If you look at shouldExpand, you see that a decision is made per alloca slice

You're returning early from SROA::runOnAlloca. That's a decision on the whole alloca: if there's an expensive memcpy into any part of the alloca, don't do any splitting on the whole alloca. The alternative is to make the decision on a per-slice level: given a subset of slices that are the target of a memcpy, "merge" those slices so splitAlloca() doesn't split them away from each other.
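
For reference, a rough sketch of what that per-slice alternative could look
like, again assuming it lives in SROA.cpp next to AllocaSlices/Slice; whether
makeUnsplittable() is enough to stop splitAlloca() from separating these
slices is an assumption of the sketch, not something the current patch does:

// Sketch of the per-slice alternative: instead of skipping the whole alloca,
// only the byte range covered by an expensive memcpy is kept in one piece.
static void keepExpensiveMemcpySlicesTogether(AllocaSlices &AS,
                                              const TargetTransformInfo &TTI) {
  for (Slice &S : AS) {
    const Use *U = S.getUse();
    if (!U) // Skip dead slices.
      continue;
    auto *MTI = dyn_cast<MemTransferInst>(U->getUser());
    if (!MTI || TTI.getMemcpyCost(MTI) <= TargetTransformInfo::TCC_Basic)
      continue;
    // Keep the range covered by the expensive memcpy intact; the rest of
    // the alloca can still be split and promoted as usual.
    S.makeUnsplittable();
  }
}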

I was sort of thinking you could write the reverse transform in MemCpyOpt, since it does a similar sort of analysis, but maybe we'd have to do it later, or patch SROA anyway, to avoid two transforms fighting each other.

Ah, yes, MemCpyOpt. I haven't looked in great detail at that because it hasn't bothered me yet. That is, in experiments where I completely disabled SROA, I didn't find MemCpyOpt reversing things. I am guessing that's because the analysis and transformation are far from straightforward. To me, it sounds easier to restrict the rewrite/transformation here in SROA, rather than trying to recreate in MemCpyOpt the IR that we once had. But again, I haven't looked much into MemCpyOpt, so if you think it would be better there, I will start looking into that.

If you look at shouldExpand, you see that a decision is made per alloca slice

You're returning early from SROA::runOnAlloca. That's a decision on the whole alloca: if there's an expensive memcpy into any part of the alloca, don't do any splitting on the whole alloca. The alternative is to make the decision on a per-slice level: given a subset of slices that are the target of a memcpy, "merge" those slices so splitAlloca() doesn't split them away from each other.

Ah yes, you're right; I wanted to do it per alloca slice, but that's currently not the case. I can't remember if I had good reasons for that, and will double check what the codegen differences are.

But again, I haven't looked much into MemCpyOpt, so if you think it would be better there, I will start looking into that.

The transform is more powerful working the other way: it could catch cases that weren't originally written as "memcpy" in the source code. Not sure that means we shouldn't do something in SROA, though. Ideally, I think we want the "canonical" form to be the promoted version, and transform it to the compact version as part of lowering. But I'm not sure how much work it would be to make the reverse transform reliable, and I don't want to block an incremental improvement.

MemCpyOpt doesn't actually merge memcpys at the moment, I think. It does merge memsets, and it has a lot of memcpy-related transforms, but that one in particular is missing. But it's the right sort of logic, at least...

The transform is more powerful working the other way: it could catch cases that weren't originally written as "memcpy" in the source code. Not sure that means we shouldn't do something in SROA, though. Ideally, I think we want the "canonical" form to be the promoted version, and transform it to the compact version as part of lowering. But I'm not sure how much work it would be to make the reverse transform reliable, and I don't want to block an incremental improvement.

MemCpyOpt doesn't actually merge memcpys at the moment, I think. It does merge memsets, and it has a lot of memcpy-related transforms, but that one in particular is missing. But it's the right sort of logic, at least...

Many thanks for sharing your ideas. I will go for the incremental approach first then. One of the advantages is that the code changes will be minimal: I only need a target hook for the memcpy cost, so we don't create (much) technical debt if we do end up tackling this in MemCpyOpt.

RKSimon resigned from this revision. Mar 21 2019, 11:46 AM
SjoerdMeijer retitled this revision from [SROA] WIP: Lowering alloca is not always beneficial to [SROA] Don't lower expensive allocas with minsize.

With all the groundwork now committed, I would like to progress this. I've added:

  • some minor cleanups as commented earlier,
  • my motivating example as a test case.

I've also run code size benchmarks with this change. They didn't show any changes in MBed and another code size corpus (which is okay). For the CSiBe benchmark, there were a few changes:

App                        Before       After        Diff %
csibe/OpenTCP-1.0.4        21835.0      21835.0      0.0
csibe/bzip2-1.0.2          47344.0      47344.0      0.0
csibe/cg_compiler_opensrc  96142.0      96198.0      0.0582471760521
csibe/compiler             19278.0      19278.0      0.0
csibe/jikespg-1.3          170498.0     170152.0     -0.202934931788
csibe/jpeg-6b              96337.0      96337.0      0.0
csibe/libpng-1.2.5         77620.0      77636.0      0.0206132440093
csibe/lwip-0.5.3.preproc   68569.0      68569.0      0.0
csibe/mpeg2dec-0.3.1       38904.0      38904.0      0.0
csibe/mpgcut-1.1           7192.0       7192.0       0.0
csibe/teem-1.6.0-src       1028555.0    1028555.0    0.0
csibe/ttt-0.10.1.preproc   12691.0      12691.0      0.0
csibe/unrarlib-0.4.0       10382.0      10382.0      0.0
csibe/zlib-1.1.4           29221.0      29297.0      0.260086923788
total                      1724569.0    1724371.0    -0.011481129488

This shows a real code size reduction of 346 bytes in jikespg-1.3, with a few minor regressions, but a fairly decent improvement overall. Thus, I am happy with the results, as it does what I wanted and hoped for.

The only thing I haven't tried yet is @efriedma's suggestion to do the check on a per-slice basis; currently I indeed do it per alloca.

I'm not sure those benchmark results are good enough to justify this: there are only 4 changes, and the ups and downs almost balance each other out. Have you investigated the regressions to see what this makes worse, and if there's a way we could avoid them?

I'm not sure those benchmark results are good enough to justify this: there are only 4 changes, and the ups and downs almost balance each other out.

I see what you mean. But overall we save 198 bytes, and I would be happy to take a 198-byte size reduction. Probably what I wanted to say in my previous message is that it's not a one-trick pony: it not only helps with my motivating example, but with other benchmarks too. But I will indeed now look into fine-tuning it and see if we can avoid the regressions. As I said, I will look into doing it on a per-slice basis.

Have you investigated the regressions to see what this makes worse, and if there's a way we could avoid them?

Yep, only 1, but that was some time ago. I thought we were "unlucky", but I can't remember what it was. Will look into this, and the other cases.

My concern isn't really the size of the change, it's with the variance. With only four benchmarks which change, and only by a small amount each, then if you happened to run one more or one fewer benchmark you could come to a completely different conclusion.

My concern isn't really the size of the change, it's with the variance. With only four benchmarks which change, and only by a small amount each, then if you happened to run one more or one fewer benchmark you could come to a completely different conclusion.

Okay, understood. But I have tested 3, I would say, quite large code bases. 2 of them don't show any changes, which I interpret as good news (because there are definitely memcpys in these code bases). The other, CSiBe, shows this little up-and-down behaviour, but improvements overall. Before I continue fine-tuning, how can I make this more convincing for you? Do I need more code bases, or just to get rid of the regressions?

It's hard to put exact numbers on this, especially when we are trading off vague concepts like "the complexity of the LLVM codebase", but I'd like to see enough benchmark results that we can be confident that this is making an overall improvement, and that we're not being misled by a few outliers. That might come from running a greater variety of benchmarks until we collect enough data, or it might come from improvements to the patch itself.

Okey dokey, agreed. I could throw a few more benchmarks into the mix; I am not sure it adds or changes anything, but I will do it anyway because it's easy. But I will mainly focus on this patch, and see if I can improve things.

SjoerdMeijer abandoned this revision.Mar 17 2023, 1:36 AM
Herald added a project: Restricted Project. · View Herald Transcript · Mar 17 2023, 1:36 AM