This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Enhance AggLoadStoreRewriter to rewrite an integer load/store if it covers multiple fields in the original aggregate
Needs Review · Public

Authored by Carrot on Oct 3 2019, 12:07 PM.

Details

Summary

The motivating example is:

#include <tuple>

enum ResultType {
  a, b, c, d, 
  Error,
};

struct Result {
  Result(ResultType type = Error, unsigned hash = 0)
    : type(type), hash(hash) {}
  ResultType type; 
  unsigned hash; 
};

template<typename Function>
inline Result foo(Function function) {
  bool done; 
  Result result;
  std::tie(done, result) = function();
  if (done) return result;
  return Result(Error);
}

int main(int argc, char** argv) {
  auto function = [] { return std::make_tuple(false, Result()); };
  Result result = foo(function);
  return int(result.type);
}

When compiled with libc++, LLVM generates:

movb    $0, -16(%rsp)
movq    $4, -12(%rsp)
movq    -16(%rsp), %rcx
movq    %rcx, -16(%rsp)
movl    $4, %eax
testb   %cl, %cl 
je      .LBB0_2

All of the memory accesses are redundant.

The problem is that the underlying tuple structure looks like:

{i8, {i32, i32}}

Its total size is 96 bits, small enough to be returned in registers, but as a function return value its type is changed to:

{i64, i32}

So the temporary alloca object that receives the result of the lambda function is written and read as different types. When alloca slices are built from the memory accesses, these slices overlap with each other:

Slices of alloca:   %6 = alloca %"struct.std::__u::__tuple_impl", align 8
  [0,8) slice #0
    used by:   store i64 %20, i64* %22
  [0,1) slice #1
    used by:   %31 = load i8, i8* %30, align 8
  [0,12) slice #2 (splittable)
    used by:   call void @llvm.lifetime.end.p0i8(i64 12, i8* %40)
  [0,12) slice #3 (splittable)
    used by:   call void @llvm.lifetime.start.p0i8(i64 12, i8* %12)
  [4,12) slice #4
    used by:   %37 = load i64, i64* %36, align 4
  [8,12) slice #5
    used by:   store i32 %21, i32* %23, align 8

All of the slices are then grouped into a single partition, so no SROA occurs.
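For reference, here is a hand-reduced IR sketch of the access pattern that produces these slices (illustrative only; the value names and the exact callee signature are invented, not taken from the real compiler output):

%tuple = type { i8, { i32, i32 } }

declare { i64, i32 } @callee()

define i8 @caller() {
entry:
  %tmp = alloca %tuple, align 8
  %ret = call { i64, i32 } @callee()
  ; The ABI-coerced return value is stored with its own shape:
  ; an i64 covering bytes [0,8) and an i32 covering bytes [8,12).
  %lo = extractvalue { i64, i32 } %ret, 0
  %hi = extractvalue { i64, i32 } %ret, 1
  %lo.addr = bitcast %tuple* %tmp to i64*
  store i64 %lo, i64* %lo.addr, align 8
  %hi.addr = getelementptr inbounds %tuple, %tuple* %tmp, i64 0, i32 1, i32 1
  store i32 %hi, i32* %hi.addr, align 8
  ; ...but it is read back with the tuple's own shape: an i8 flag at [0,1)
  ; and an i64 covering the inner { i32, i32 } at [4,12).
  %done.addr = getelementptr inbounds %tuple, %tuple* %tmp, i64 0, i32 0
  %done = load i8, i8* %done.addr, align 8
  %res.addr = getelementptr inbounds %tuple, %tuple* %tmp, i64 0, i32 1
  %res.cast = bitcast { i32, i32 }* %res.addr to i64*
  %res = load i64, i64* %res.cast, align 4
  ret i8 %done
}

The [0,8) store overlaps both the [0,1) and the [4,12) loads, so no clean partition boundary exists.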

This patch solves the problem by splitting an integer load/store that covers multiple fields of the alloca aggregate when those fields have different parent structures. In the following example:

{i32, {i32, i32}}

%ptrval = ptrtoint %struct.ptr* %ptr to i64
%ptrval2 = add i64 %ptrval, 4
%ptr1 = inttoptr i64 %ptrval to i64*
%ptr2 = inttoptr i64 %ptrval2 to i64*
%val1 = load i64, i64* %ptr1, align 4
%val2 = load i64, i64* %ptr2, align 4

The first 64-bit load will be rewritten into two 32-bit loads because it actually accesses two fields of the original aggregate, and the two fields don't belong to the same inner structure.

The second load won't be rewritten because all fields accessed by the load belong to the same inner structure; this is a common pattern in LLVM IR.
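For the example above, the rewritten form of the first load would look roughly like this (a sketch assuming a little-endian target and the usual zext/shl/or recombination; the value names are hypothetical):

%lo.ptr = inttoptr i64 %ptrval to i32*
%hi.ptr = inttoptr i64 %ptrval2 to i32*
%lo = load i32, i32* %lo.ptr, align 4
%hi = load i32, i32* %hi.ptr, align 4
%lo.ext = zext i32 %lo to i64
%hi.ext = zext i32 %hi to i64
%hi.shl = shl i64 %hi.ext, 32
%val1 = or i64 %lo.ext, %hi.shl

Each small load now falls entirely within a single field, so the loads form clean, promotable slices.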

Diff Detail

Event Timeline

Carrot created this revision. Oct 3 2019, 12:07 PM
Herald added a project: Restricted Project. Oct 3 2019, 12:07 PM
lebedev.ri edited the summary of this revision. Oct 24 2019, 2:49 PM
lkail added a subscriber: lkail. Nov 15 2019, 3:12 PM

Could anybody help review this patch from last decade?
Thanks a lot!

arsenm added inline comments. Feb 13 2020, 4:49 PM
lib/Transforms/Scalar/SROA.cpp
3470 (On Diff #223063)

Unchecked dyn_cast; just use cast

3550 (On Diff #223063)

Unchecked dyn_cast

test/Transforms/SROA/split-integer.ll
15 (On Diff #223063)

Test should be smaller, check more, and use named values

Carrot updated this revision to Diff 246850. Feb 26 2020, 3:42 PM
Carrot marked 4 inline comments as done.
Carrot added inline comments. Feb 26 2020, 3:43 PM
test/Transforms/SROA/split-integer.ll
15 (On Diff #223063)

Test is simplified.
The IR after SROA is quite redundant; it will soon be cleaned up by the following passes. Future enhancements to SROA can also change it, so the exact IR after SROA is not very reliable, and other IR changes are not a regression as long as there is no alloca/store/load.
So I prefer to check alloca/store/load only.
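For illustration, for a fully promoted alloca the checks would be along these lines (a hypothetical FileCheck pattern, not the exact contents of the test):

; CHECK-LABEL: @test(
; CHECK-NOT: alloca
; CHECK-NOT: store
; CHECK-NOT: load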

reames resigned from this revision. Mar 25 2020, 11:10 AM

My biggest concern with this patch is that you're blindly rewriting based on the type of the alloca. That type isn't always reliable, particularly when you're dealing with unions. And even in cases where it is reliable, it might not reflect the actual operations the code ends up using in practice. I'm concerned you'll end up preemptively splitting operations that didn't need to be split, and hurt performance by doing that.

If we're going to do a transform like this, I'd like to see the following:

  1. Base the transform on the SROA slices rather than the IR type. IR types of memory are not trustworthy; the true shape should be based on actual usage.
  2. Specifically target memory operations where splitting them would unblock mem2reg. Otherwise it isn't obvious the transform is profitable. Partitions which can't be transformed to registers aren't that common, but various operations can trigger this: volatile operations, operations behind a select, etc.
llvm/lib/Transforms/Scalar/SROA.cpp
3471

Rearranging functions like this makes it harder to read the patch.

Carrot marked an inline comment as done. Apr 8 2020, 4:34 PM

The most complex part of this patch is avoiding "blindly rewriting", and most of the new data structures are used for this purpose.

Suppose we have the following type definitions and pointers:

%inner = type { i32, i32 }
%outer = type { %inner, %inner }

%local = alloca %outer, align 8
%inner_ptr1 = getelementptr inbounds %outer, %outer* %local, i64 0, i32 0, i32 0
%inner_ptr2 = getelementptr inbounds %outer, %outer* %local, i64 0, i32 0, i32 1
%inner_ptr3 = getelementptr inbounds %outer, %outer* %local, i64 0, i32 1, i32 0
%ptr1 = bitcast i32* %inner_ptr1 to i64*
%ptr2 = bitcast i32* %inner_ptr2 to i64*
%ptr3 = bitcast i32* %inner_ptr3 to i64*

And then we use these pointers to load 64-bit values:

%load1 = load i64, i64* %ptr1
%load2 = load i64, i64* %ptr2
%load3 = load i64, i64* %ptr3

load1 covers the whole first inner struct and load3 covers the whole second inner struct. These are very common in LLVM (there are many such cases in the LLVM test cases), so they should not be split.
load2 covers the second half of the first inner struct and the first half of the second inner struct; this is very uncommon in LLVM. Later, when the SROA slices are constructed, load1, load2 and load3 together form a single partition, which causes SROA to fail to replace any field of the %local structure. Only by splitting load2 earlier, in AggLoadStoreRewriter, can the usual SROA slices later be formed for the two inner structs separately.
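Concretely (illustrative byte offsets within %outer):

before presplitting:          after presplitting %load2:
  %load1  [0,8)                 %load1       [0,8)   \  partition [0,8)
  %load2  [4,12)                %load2 low   [4,8)   /
  %load3  [8,16)                %load2 high  [8,12)  \  partition [8,16)
                                %load3       [8,16)  /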

llvm/lib/Transforms/Scalar/SROA.cpp
3471

This is because the base class is changed from InstVisitor to PtrUseVisitor. These two classes have different return types for the various visit functions; fortunately, they have similar semantics.

Your reply doesn't really address the most important point. The entire premise of your extractTypeFields is that the type of an alloca is a reliable guide to the way code will access that alloca. That simply is not true: clang cannot generate appropriate LLVM types in some cases. So you'll end up with weird behaviors in many cases. I understand that the patch tries to recognize certain cases, but it still doesn't address the fundamental problem.

Instead of extracting fields from the type, I expect the code to construct "fields" based on actual memory accesses.

llvm/lib/Transforms/Scalar/SROA.cpp
3471

I don't see the relationship between changing the base class and reordering the visit* functions. It should be possible to write the functions in any order for either base class.

Carrot marked an inline comment as done. Apr 14 2020, 11:19 AM

Your reply doesn't really address the most important point. The entire premise of your extractTypeFields is that the type of an alloca is a reliable guide to the way code will access that alloca. That simply is not true: clang cannot generate appropriate LLVM types in some cases. So you'll end up with weird behaviors in many cases.

Could you show me an example?

I understand that the patch tries to recognize certain cases, but it still doesn't address the fundamental problem.

Instead of extracting fields from the type, I expect the code to construct "fields" based on actual memory accesses.

For the following instruction in the attached test case, the actual memory access is a 64-bit integer load. This is exactly the problematic instruction that I want to split.

%split = load i64, i64* %altptr, align 8

llvm/lib/Transforms/Scalar/SROA.cpp
3471

I didn't reorder any visit* functions. I deleted some visit* functions because PtrUseVisitor has the same default implementations.

The Phabricator diff algorithm matches the meaningless "}" and blank lines, which makes the result look strange.

Consider something like the following C testcase:

union U {
  struct A {
    long m1;
    int m2;
  } a;
  struct B {
    int m;
    struct C {
      int z[2];
    } c;
  } b;
};

void f(struct C);
void a(struct C c) {
  union U x = { .b = { 1, c } };
  f(x.b.c);
}

The alloca is generated with a type like "{i64, i32}", but all the accesses happen as if it were "{i32, i64}". If you assume "{i64, i32}" actually describes the memory accesses, you're paying the price for extra splitting with no profit. (In this particular case, we might eventually recover in instcombine, but more generally the optimization is going to be unpredictable.)

llvm/lib/Transforms/Scalar/SROA.cpp
3471

Oh, sorry, you're right, not your fault.

Thank you for the example.

The alloca type is declared as you described:

%union.U = type { %"struct.U::A" }
%"struct.U::A" = type { i64, i32 }

Some related instructions

%x = alloca %union.U, align 8
...
%b2 = bitcast %union.U* %x to %"struct.U::B"*
%agg.tmp.sroa.0.0..sroa_idx = getelementptr inbounds %"struct.U::B", %"struct.U::B"* %b2, i64 0, i32 1
%agg.tmp.sroa.0.0..sroa_cast = bitcast %struct.C* %agg.tmp.sroa.0.0..sroa_idx to i64*
%agg.tmp.sroa.0.0.copyload = load i64, i64* %agg.tmp.sroa.0.0..sroa_cast, align 4, !tbaa.struct !8

My patch doesn't split the last load instruction: because the 64-bit load's offset doesn't match a field in %union.U, the function isSplittableType conservatively returns false.

But if I change the definition of struct A to:

struct D {
  int d1;
  int d2;
};

struct A { 
  struct D m1; 
  int m2; 
} a;

Then it works as you described.

So for union types the alloca type is totally useless.

I will pursue the SROA-slices approach you suggested in https://reviews.llvm.org/D68414#1965055.

Thanks a lot!

Carrot updated this revision to Diff 261369. Apr 30 2020, 3:08 PM

This version splits overlapping AllocaSlices like

S1   ------
S2      ------

into

S11  ---
S12     ---
S2      ------

So later these slices can be rewritten with scalar operations.
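Mapped onto the motivating tuple example (illustrative), S1 is the [0,8) store of the ABI-coerced i64 and S2 is the [4,12) load of the Result:

S1  store i64  [0,8)    ------
S2  load  i64  [4,12)      ------

which presplits into

S11 store i32  [0,4)    ---
S12 store i32  [4,8)       ---
S2  load  i64  [4,12)      ------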

lebedev.ri added inline comments.
llvm/lib/Transforms/Scalar/SROA.cpp
4142–4143

Ignorant question: is it guaranteed that there isn't a situation like this:

///     S1   ------
///     S2      ------
///     S3     ------

I.e., there is no other slice S3 overlapping S2, with a different overlap characteristic?

Carrot marked an inline comment as done. May 3 2020, 3:49 PM
Carrot added inline comments.
llvm/lib/Transforms/Scalar/SROA.cpp
4142–4143

Although I think it is extremely unlikely in real-world applications, one can easily write such code in a synthetic test case. Function @test3 in basictest.ll demonstrates this situation; actually, it is even more heavily overlapped. With this patch, more alloca space and the related load/store instructions are eliminated.

I'm happy with the general approach here; just some questions about the details.

Also, can you gather numbers about how frequently this triggers over some testsuite like the LLVM testsuite or SPEC? I'd like some idea about how large the impact of this is going to be.

llvm/lib/Transforms/Scalar/SROA.cpp
4162

I think this is worst-case O(n^3) over the number of operations in a partition? There's one loop outside the function, and two loops inside the function. I guess it's unlikely to bite in most cases, but it's the sort of thing that can easily blow up compile time.

4164

Is it possible that we don't end up actually splitting a partition after we pre-split? Could we try harder to avoid that?

4306

Given this is endian-sensitive, I'd like to see a big-endian testcase.

Carrot marked 3 inline comments as done. May 14 2020, 7:03 PM

I'm happy with the general approach here; just some questions about the details.

Also, can you gather numbers about how frequently this triggers over some testsuite like the LLVM testsuite or SPEC? I'd like some idea about how large the impact of this is going to be.

For the LLVM test suite, the function presplitOverlappedSlices returns true 114 times; I think most of these are from basictest.ll.

llvm/lib/Transforms/Scalar/SROA.cpp
4162

I limit the number of calls to this function to some constant.

4164

This function only presplits slices; it doesn't split partitions. Partitions are only constructed on demand when we request an iterator from AllocaSlices, like:

for (auto &P : AS.partitions())
Carrot updated this revision to Diff 264140. May 14 2020, 7:07 PM

For the LLVM test suite, the function presplitOverlappedSlices returns true 114 times; I think most of these are from basictest.ll.

I mean the LLVM test suite, not the regression tests (http://llvm.org/docs/TestSuiteGuide.html). I'm trying to get an idea of how wide the impact is on general C/C++ code. I'm not expecting this to trigger very frequently, but I want to make sure my intuition is correct.

llvm/lib/Transforms/Scalar/SROA.cpp
4141

static const int MaxPresplitIterations = 128;

4164

I wasn't trying to ask about the partitions data structure.

I meant: are there cases where we do the transform, but then SROA can't do any further transforms? So instead of a single load, we have two loads and some bit manipulation that doesn't simplify? I'd like to avoid splitting the load/store instruction if it's not actually going to help.

4356

"been"

compile-time results: https://llvm-compile-time-tracker.com/compare.php?from=e616a4259889b55ed1bf5bf095f0e59658c6e311&to=0a2a92f815130c9ed2f0fed11850079bbd55038e&stat=instructions

As of vanilla llvm test-suite + RawSpeed, presplitOverlappedSlices() fires 629240 times, and succeeds 3 (three) times.
I would say that it should either succeed more, or cost less :)

|             statistic name             |  baseline |  proposed |    Δ   |    %   | abs(%) |
|:--------------------------------------:|:---------:|:---------:|:------:|:------:|:------:|
| sroa.NumPresplitAttempted              |         0 |    629240 | 629240 |  0.00% | 0.00% |
| sroa.NumPresplitSuccess                |         0 |         3 |      3 |  0.00% | 0.00% |
| correlated-value-propagation.NumShlNUW |      4210 |      4209 |     -1 | -0.02% | 0.02% |
| correlated-value-propagation.NumShlNW  |      6274 |      6273 |     -1 | -0.02% | 0.02% |
| mem2reg.NumLocalPromoted               |      6245 |      6246 |      1 |  0.02% | 0.02% |
| correlated-value-propagation.NumNUW    |     15086 |     15085 |     -1 | -0.01% | 0.01% |
| sroa.MaxPartitionsPerAlloca            |     11933 |     11934 |      1 |  0.01% | 0.01% |
| stack-coloring.StackSlotMerged         |     12160 |     12159 |     -1 | -0.01% | 0.01% |
| SLP.NumVectorInstructions              |     34625 |     34626 |      1 |  0.00% | 0.00% |
| asm-printer.EmittedInsts               |   7936899 |   7936895 |     -4 |  0.00% | 0.00% |
| assembler.EmittedDataFragments         |   2500398 |   2500399 |      1 |  0.00% | 0.00% |
| assembler.EmittedFillFragments         |    423471 |    423472 |      1 |  0.00% | 0.00% |
| assembler.EmittedFragments             |   5157859 |   5157861 |      2 |  0.00% | 0.00% |
| assembler.FragmentLayouts              |  12143832 |  12143834 |      2 |  0.00% | 0.00% |
| assembler.ObjectBytes                  | 254675544 | 254675584 |     40 |  0.00% | 0.00% |
| assembler.evaluateFixup                |   7937674 |   7937675 |      1 |  0.00% | 0.00% |
| assume-queries.NumAssumeQueries        |   8436268 |   8436587 |    319 |  0.00% | 0.00% |
| basicaa.SearchTimes                    |  66366214 |  66366216 |      2 |  0.00% | 0.00% |
| bdce.NumRemoved                        |     43590 |     43589 |     -1 |  0.00% | 0.00% |
| codegenprepare.NumCastUses             |    375363 |    375361 |     -2 |  0.00% | 0.00% |
| codegenprepare.NumGEPsElim             |    106610 |    106609 |     -1 |  0.00% | 0.00% |
| correlated-value-propagation.NumNW     |     25516 |     25515 |     -1 |  0.00% | 0.00% |
| dagcombine.NodesCombined               |   3881288 |   3881285 |     -3 |  0.00% | 0.00% |
| dse.NumDomMemDefChecks                 |   3131956 |   3131955 |     -1 |  0.00% | 0.00% |
| dse.NumGetDomMemoryDefPassed           |   1084707 |   1084706 |     -1 |  0.00% | 0.00% |
| dse.NumRemainingStores                 |    846110 |    846108 |     -2 |  0.00% | 0.00% |
| early-cse.NumCSE                       |   2188895 |   2188891 |     -4 |  0.00% | 0.00% |
| early-cse.NumSimplify                  |    542909 |    542924 |     15 |  0.00% | 0.00% |
| gvn.NumGVNInstr                        |    325697 |    325693 |     -4 |  0.00% | 0.00% |
| gvn.NumGVNLoad                         |     76337 |     76336 |     -1 |  0.00% | 0.00% |
| gvn.NumGVNSimpl                        |     96093 |     96090 |     -3 |  0.00% | 0.00% |
| instcombine.NumCombined                |   3674269 |   3674267 |     -2 |  0.00% | 0.00% |
| instcombine.NumSunkInst                |     63820 |     63817 |     -3 |  0.00% | 0.00% |
| instcombine.NumWorklistIterations      |   2024510 |   2024511 |      1 |  0.00% | 0.00% |
| instcount.NumAllocaInst                |     45896 |     45895 |     -1 |  0.00% | 0.00% |
| instcount.NumBitCastInst               |    607901 |    607898 |     -3 |  0.00% | 0.00% |
| instcount.NumCallInst                  |   1760607 |   1760604 |     -3 |  0.00% | 0.00% |
| instcount.NumGetElementPtrInst         |   1177736 |   1177732 |     -4 |  0.00% | 0.00% |
| instcount.NumLoadInst                  |   1006106 |   1006105 |     -1 |  0.00% | 0.00% |
| instcount.NumStoreInst                 |    706082 |    706079 |     -3 |  0.00% | 0.00% |
| instcount.TotalInsts                   |   8826738 |   8826723 |    -15 |  0.00% | 0.00% |
| instcount.TotalIntegerInsts            |   2263812 |   2263811 |     -1 |  0.00% | 0.00% |
| instcount.TotalIntegerScalarInsts      |   2154781 |   2154780 |     -1 |  0.00% | 0.00% |
| instcount.TotalScalarInsts             |   8163514 |   8163499 |    -15 |  0.00% | 0.00% |
| isel.NumDAGIselRetries                 |  56963972 |  56963901 |    -71 |  0.00% | 0.00% |
| mcexpr.MCExprEvaluate                  |  39299357 |  39299360 |      3 |  0.00% | 0.00% |
| mem2reg.NumSingleStore                 |    556895 |    556897 |      2 |  0.00% | 0.00% |
| memory-builtins.ObjectVisitorArgument  |   1652515 |   1652519 |      4 |  0.00% | 0.00% |
| memory-builtins.ObjectVisitorLoad      |    560185 |    560201 |     16 |  0.00% | 0.00% |
| post-RA-sched.NumFixedAnti             |     52447 |     52446 |     -1 |  0.00% | 0.00% |
| post-RA-sched.NumStalls                |   3686107 |   3686108 |      1 |  0.00% | 0.00% |
| regalloc.NumAssigned                   |   4117619 |   4117618 |     -1 |  0.00% | 0.00% |
| simplifycfg.NumSimpl                   |    985893 |    985892 |     -1 |  0.00% | 0.00% |
| sroa.NumAllocaPartitionUses            |   3095830 |   3095827 |     -3 |  0.00% | 0.00% |
| sroa.NumAllocaPartitions               |    698974 |    698973 |     -1 |  0.00% | 0.00% |
| sroa.NumAllocasAnalyzed                |    796794 |    796791 |     -3 |  0.00% | 0.00% |
| sroa.NumDeleted                        |   3687318 |   3687311 |     -7 |  0.00% | 0.00% |
| sroa.NumPromoted                       |    689146 |    689149 |      3 |  0.00% | 0.00% |
| stack-coloring.NumMarkerSeen           |    105177 |    105174 |     -3 |  0.00% | 0.00% |
| stack-coloring.StackSpaceSaved         |    383217 |    383205 |    -12 |  0.00% | 0.00% |
llvm/lib/Transforms/Scalar/SROA.cpp
4535

As of the vanilla llvm test-suite, PresplitTimes is at most ever 2,
so I think MAX_PRESPLIT_ITERATIONS could be much lower than 256.

3? Really? I mean, I guess it would be odd for this pattern to show up outside of unions or ABI lowering, but I thought it would be a little more common than that.

Is this still relevant?

llvm/test/Transforms/SROA/split-integer.ll
17

Use opaque pointers

Herald added a project: Restricted Project. Aug 10 2023, 1:45 PM