This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Improve SROA to prevent generating redundant coalescing operations.
AbandonedPublic

Authored by mingmingl on Nov 30 2021, 4:49 PM.

Details

Summary

Before this change, coalescing instructions are difficult to clean up and prevent tail call generation (see test case test_struct_of_two_int in llvm/test/Transforms/SROA/alloca-struct.ll).

Change the SROA pass so that a structure type won't be scalarized into smaller allocations (i.e., it remains unchanged) under either of the following conditions:

  1. It doesn't have any scalar accesses, or
  2. The scalar stores essentially make the struct a constant (i.e., all individual fields are stored constant values).

This change adds additional analysis information about usages, and marks all slices of an alloca as unsplittable if all of the following conditions are true:

  1. The alloca doesn't have a) scalar loads, b) non-constant scalar stores, or c) access types not covered by the analysis (e.g., memcpy, memset, etc.)
  2. In each basic block, the constant stores cover all bytes (i.e., the allocated type can be regarded as a constant in that basic block)

The change preserves the original slice-splitting logic if there are scalar accesses, or if the analysis information isn't sufficient to rule out scalarization.
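As an illustration of the second condition, here is a minimal C++ sketch of the pattern (modeled on the test cases in llvm/test/Transforms/SROA/alloca-struct.ll; the function and variable names are illustrative only): every field receives a constant store on one path, and the aggregate is only ever read as a whole.

```cpp
#include <cassert>

// Sketch of the pattern: on the taken path every field receives a
// constant store, and the struct is read back only as a whole (via
// return), never through scalar loads.
struct RetVal {
  int x;
  int y;
};

RetVal MakeZeroOrCall(bool flag, RetVal (*fallback)()) {
  if (flag) {
    RetVal r;
    r.x = 0;  // constant store covering bytes [0, 4)
    r.y = 0;  // constant store covering bytes [4, 8)
    return r; // whole-aggregate use
  }
  return fallback();
}
```

The intent of the patch is that SROA would leave such an alloca unsplit, avoiding the later shift/or reassembly of the two fields.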

Diff Detail

Event Timeline

mingmingl created this revision.Nov 30 2021, 4:49 PM
mingmingl retitled this revision from [SROA] For a specialized case, skip scalarization in SROA pass; the goal is to save useless operations to coalesce values. Before this change, coalesce instructions are difficult to clean up, and prevent the tail call generation (see test case... to [SROA] For a type of alloca access, skip scalarization in SROA pass; the goal is to save useless operations to coalesce values. .Nov 30 2021, 4:51 PM
mingmingl edited the summary of this revision. (Show Details)
mingmingl added reviewers: Carrot, efriedma, aeubanks.
mingmingl added a subscriber: davidxl.
mingmingl published this revision for review.Nov 30 2021, 4:55 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 30 2021, 4:55 PM

I suggest simplifying the title to: "Improve SROA to prevent generating redundant coalescing operations".

mingmingl retitled this revision from [SROA] For a type of alloca access, skip scalarization in SROA pass; the goal is to save useless operations to coalesce values. to [SROA] Improve SROA to prevent generating redundant coalescing operations..Nov 30 2021, 4:59 PM

I suggest simplifying the title to: "Improve SROA to prevent generating redundant coalescing operations".

Done, thanks for the suggestion!

nikic added a subscriber: nikic.
nikic added inline comments.
llvm/test/Transforms/SROA/alloca-struct.ll
82

Please use update_test_checks.py.

That's a lot of complexity. It sounds like we are missing some cleanup transforms in some other passes?
Can you please post a standalone example of the "bad" codegen (via a godbolt link) that you are trying to avoid?

That's a lot of complexity. It sounds like we are missing some cleanup transforms in some other passes?
Can you please post a standalone example of the "bad" codegen (via a godbolt link) that you are trying to avoid?

Comparing the LLVM and GCC codegen, a tail call is generated by GCC but not by LLVM.

test_struct_of_two_int (from llvm/test/Transforms/SROA/alloca-struct.ll) shows the IR-level diff before and after this change, and running clang with this patch (clang -S -O3 file.cpp) generates a tail call.

In D114832#3164711, @luna wrote:

That's a lot of complexity. It sounds like we are missing some cleanup transforms in some other passes?
Can you please post a standalone example of the "bad" codegen (via a godbolt link) that you are trying to avoid?

Comparing the LLVM and GCC codegen, a tail call is generated by GCC but not by LLVM.

test_struct_of_two_int (from llvm/test/Transforms/SROA/alloca-struct.ll) shows the IR-level diff before and after this change, and running clang with this patch (clang -S -O3 file.cpp) generates a tail call.

I dug a little into how GCC SROA works, and the findings are:

  1. This writeup [1] also mentions two PRs which avoid creating unnecessary or harmfully excessive statements when aggregates are used as a whole. The relevant PRs are PR 40744 and PR 44423, mentioned in item xi) of the 2nd section.
  2. The implementation (https://github.com/gcc-mirror/gcc/blob/16e2427f50c208dfe07d07f18009969502c25dc8/gcc/tree-sra.c#L2435-L2472) has an analysis of scalar/whole usage.

[1] https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=jambor.pdf

mingmingl added inline comments.Dec 1 2021, 10:36 AM
llvm/test/Transforms/SROA/alloca-struct.ll
82

Thanks for the suggestions!

Using update_test_checks.py might make the test case a little fragile. Here is a preview of the result (https://gist.github.com/minglotus-6/a05ac0300d8ea8204ec9eed409951b25).

To improve the current test style, maybe I can enrich it a little while still scrubbing updater output that looks too strict?

The key is that an aggregate is written field by field (piecemeal) but not read that way (i.e., it is read as a whole), so splitting messes up the IR. For this reason, the pattern should probably be relaxed further (i.e., not requiring constant values in the stores). Example:

struct RetVal {
  int x;
  int y;
};
typedef RetVal (*Parser)();
RetVal Parse(bool test, Parser p, int v) {
  if (test) return {v, v};
  return p();
}

I'm not familiar with the term "coalesce instructions", and it doesn't appear anywhere in SROA.cpp, could you explain that in the description?

llvm/test/Transforms/SROA/alloca-struct.ll
82

it's very hard to tell what changed without seeing every instruction in the function, so I'd also recommend update_test_checks.py. Yes, it's fragile, but that's kind of the point: being able to see changes in the IR

mingmingl updated this revision to Diff 391461.Dec 2 2021, 2:20 PM

Use update_test_checks.py to update test/Transforms/SROA/alloca-struct.ll.

  1. In this revision, the opt used is built with this patch.
  2. Test case test_struct_of_int_char and test_one_field_has_runtime_value should remain unchanged.
  3. Diff of test case test_struct_of_two_int is shown in https://gist.github.com/minglotus-6/a05ac0300d8ea8204ec9eed409951b25/revisions
mingmingl updated this revision to Diff 391462.EditedDec 2 2021, 2:22 PM

Clean up CHECK lines that were manually added previously but are not needed when using update_test_checks.py.

I'm not familiar with the term "coalesce instructions", and it doesn't appear anywhere in SROA.cpp, could you explain that in the description?

In this example (https://godbolt.org/z/fhWT48db7), the coalescing instruction is %10 = or i64 %9, %8, !dbg !27 from the following basic block:

7:                                                ; preds = %2, %3
  %8 = phi i64 [ %6, %3 ], [ 0, %2 ]
  %9 = phi i64 [ %5, %3 ], [ 0, %2 ]
  %10 = or i64 %9, %8, !dbg !27

Another similar example is https://godbolt.org/z/sna4TPPTY, where the coalescing instructions are the following three:

%17 = or i64 %13, %12, !dbg !30
%18 = or i64 %17, %14, !dbg !30
%19 = or i64 %18, %15, !dbg !30

in the last basic block, which are used to or the scalar values together.
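In scalar terms, such a coalescing or reassembles fragments that occupy disjoint bit ranges of the original value. A small C++ sketch of the equivalent arithmetic (the function name is mine, not from the patch):

```cpp
#include <cassert>
#include <cstdint>

// The two SROA fragments occupy disjoint halves of the original i64,
// so or-ing them back together reconstitutes the whole value; this is
// the "coalescing operation" the patch tries to avoid generating.
uint64_t coalesce(uint32_t lo, uint32_t hi) {
  uint64_t loPart = static_cast<uint64_t>(lo);       // bits [0, 32)
  uint64_t hiPart = static_cast<uint64_t>(hi) << 32; // bits [32, 64)
  return loPart | hiPart;
}
```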

mingmingl marked 2 inline comments as done.Dec 2 2021, 2:41 PM
mingmingl added inline comments.
llvm/test/Transforms/SROA/alloca-struct.ll
82

Done.

aeubanks added inline comments.Dec 2 2021, 4:13 PM
llvm/test/Transforms/SROA/alloca-struct.ll
82

can you first run update_test_checks.py on this test, submit that, and then rebase so we can see the effects of this change on the IR?

mingmingl marked 2 inline comments as done.Dec 2 2021, 4:23 PM
mingmingl added inline comments.
llvm/test/Transforms/SROA/alloca-struct.ll
82
mingmingl updated this revision to Diff 391498.Dec 2 2021, 4:54 PM
mingmingl marked an inline comment as done.

Update revision after git rebase

A gentle ping. Please let me know if there is any feedback, thanks!

Can the issues instead be viewed as missed transformations that should be implemented instead of trying to prevent creating those "bad" patterns?

Can the issues instead be viewed as missed transformations that should be implemented instead of trying to prevent creating those "bad" patterns?

For my understanding of 'missed transformations': would this transformation take the IR after the SROA pass (pasted in https://gist.github.com/minglotus-6/cae98ad58f7ae6cac8dd848e75c2bde9) as input, doing something like instruction combining or cleanups, with the desired output being the same as what this patch produces in the SROA pass?

I had also thought of the approach of a separate transformation in the beginning, but decided to implement this first (it's easier to prevent the split in the first place than to analyze the results after the split and combine them back together).

Now, upon this suggestion, I have thought again about how to do it in another pass (maybe right after SROA); some thoughts:

  1. The basic essence of the current patch is to analyze the use pattern of an allocation, and the use-pattern information (offsets of fields) is still present in the IR after the SROA pass. From this perspective it might be possible to do this in a separate pass.
  2. However, a pattern matcher (to analyze offsets after the split) might be fragile, and won't continue to work if the specific instructions in the IR change (even when the semantics of the constant stores remain).

I'm wondering if there happens to be an existing pass that might be inspiring as a use case for joining instructions like this. I would also like to hear concerns about the current approach so I can iterate from there.

The bad code pattern generated is probably not general enough for it to be useful to introduce a cleanup pass or to enhance an existing pass to handle it; that would probably just shift the complexity from one pass to another. Fixing this at the source (SROA) is reasonable (unlike other canonicalization passes, this code pattern from SROA actually makes the IR much worse).

The bad code pattern generated is probably not general enough for it to be useful to introduce a cleanup pass or to enhance an existing pass to handle it; that would probably just shift the complexity from one pass to another. Fixing this at the source (SROA) is reasonable (unlike other canonicalization passes, this code pattern from SROA actually makes the IR much worse).

Who defines what is/isn't reasonable? :)

As I see it, the input IR may or may not already have bad patterns similar to the ones you are trying to avoid producing here.
Trying to avoid producing it has compile time cost: https://llvm-compile-time-tracker.com/compare.php?from=0850655da69a700b7def4fe8d9a44d1c8d55877c&to=8a4304c8f4253fb1944e5e5988a24285a14181c4&stat=instructions
So, you've paid for avoiding producing more bad patterns, but you are still left with the ones that were already there.

Alternatively, instead of paying for trying not to introduce bad patterns, you could pay for trying to improve all of the patterns afterwards, regardless of their source.
Then, you still end up paying, but end up with not just the SROA patterns improved.

This is very standard logic behind IR changes: if something doesn't understand the pattern, generally don't try to work around it elsewhere; just fix the missing piece.

The bad code pattern generated is probably not general enough for it to be useful to introduce a cleanup pass or to enhance an existing pass to handle it; that would probably just shift the complexity from one pass to another. Fixing this at the source (SROA) is reasonable (unlike other canonicalization passes, this code pattern from SROA actually makes the IR much worse).

Who defines what is/isn't reasonable? :)

As I see it, the input IR may or may not already have bad patterns similar to the ones you are trying to avoid producing here.
Trying to avoid producing it has compile time cost: https://llvm-compile-time-tracker.com/compare.php?from=0850655da69a700b7def4fe8d9a44d1c8d55877c&to=8a4304c8f4253fb1944e5e5988a24285a14181c4&stat=instructions
So, you've paid for avoiding producing more bad patterns, but you are still left with the ones that were already there.

Alternatively, instead of paying for trying not to introduce bad patterns, you could pay for trying to improve all of the patterns afterwards, regardless of their source.
Then, you still end up paying, but end up with not just the SROA patterns improved.

This is very standard logic behind IR changes: if something doesn't understand the pattern, generally don't try to work around it elsewhere; just fix the missing piece.

I agree with most of what you said as a general principle, though the analysis should be done case by case :)

The assumption is that the producer (SROA) is pretty much the only creator of the pattern (excluding manually written IR), thus cleaning it up downstream does not really help with sharing. The compile-time cost is also just shifted. Imagine another hypothetical scenario: if there are N different unique bad patterns that can be handled by one single analysis in the source pass, we may need N different downstream pass changes to handle them, resulting in more waste.

Having said that, suggestions on which downstream pass should handle this would be useful. IIRC Mingming considered CodeGenPrepare, where some tail duplication can be done to handle tail calls ...

thanks for the discussions and insight, everyone!

The compile-time link is helpful. Is there a pointer or rule of thumb for evaluating cost vs. benefit?

  • Asking since the current analysis is inlined into an existing instruction walker. Adding a separate transformation (as opposed to choosing a good existing one to piggyback on) would likely increase the cost (with the same effect).

It seems the other option to consider is to have a separate pass to transform the bad patterns, or to reuse an existing pass like CodeGenPrepare.

It'd be helpful to know which pass I should look into more, to achieve the original optimization goal here.

Driven by the compile-time feedback, another thing for me to consider (probably after a preliminary convergence on which pass should host the analysis or the transform) is to prune the existing solution so it's faster on the benchmark.

I did a quick experiment in https://godbolt.org/z/EWWaMedsx, which basically transforms {a series of PHI, one OR} into {a series of OR, one PHI}. This approach was not implemented in the first place because the result after the transformation (i.e., {a series of OR, one PHI}) might contain more machine instructions and be more expensive.

Some more context before starting to implement:

  1. For llvm/test/Transforms/SROA/alloca-struct.ll, the original cpp and assembly code are shown in https://godbolt.org/z/qTY6MfMGa (mentioned above)
  2. The IR before the first SROA pass is https://gist.github.com/minglotus-6/24bccb53d5e101cb20a7118f5e9f389f; the IR after SROA is like https://gist.github.com/minglotus-6/1e6b93d52687e7fbbb43942563134117
    • In particular, the alloca instruction allocates a struct before the SROA pass and is scalarized by the SROA pass, leading to the verbose instructions (lshr, trunc, etc.).

Based on this, the basic essence is to not create the scalarization in the first place if the usage is as a whole. The idea of skipping scalarization also exists in GCC (the relevant PRs are PR 40744 and PR 44423, mentioned in item xi) of the 2nd section of https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=jambor.pdf).

The IR of test12 of llvm/test/Transforms/SROA/basictest-opaque-ptrs.ll is also more succinct (I didn't look into the codegen result, though).

after running it through -O2:

define i64 @test_struct_of_two_int(i1 zeroext %test, i64 ()* nocapture %p) local_unnamed_addr {
entry:
  br i1 %test, label %return, label %if.end

if.end:                                           ; preds = %entry
  %call = tail call i64 %p()
  %retval.sroa.3.0.extract.shift = and i64 %call, -4294967296
  %phi.cast = and i64 %call, 4294967295
  br label %return

return:                                           ; preds = %entry, %if.end
  %retval.sroa.3.0 = phi i64 [ %retval.sroa.3.0.extract.shift, %if.end ], [ 0, %entry ]
  %retval.sroa.0.0 = phi i64 [ %phi.cast, %if.end ], [ 0, %entry ]
  %retval.sroa.0.0.insert.insert = or i64 %retval.sroa.0.0, %retval.sroa.3.0
  ret i64 %retval.sroa.0.0.insert.insert
}

yeah I'm not sure if this is something that only happens with SROA, but it does look very weird

perhaps trying to implement something like

b:
 phi1 = phi [b1, v1] [b2, v2]
 phi2 = phi [b1, v3] [b2, v4]
 r = op phi1, phi2

into

b1:
 n1 = op v1, v3
b2:
 n2 = op v2, v4
b:
 r = phi [b1, n1] [b2, n2]

and maybe only if at least one of the ops simplifies away to prevent increasing the number of op instructions
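The rewrite above can be sanity-checked in plain C++ by modeling each phi as a select on the incoming block (variable names follow the pseudocode; this is a semantic sketch, not InstCombine code):

```cpp
#include <cassert>
#include <cstdint>

// Before: two phis merge in block b, then one op combines them.
uint64_t before(bool fromB1, uint64_t v1, uint64_t v2,
                uint64_t v3, uint64_t v4) {
  uint64_t phi1 = fromB1 ? v1 : v2;
  uint64_t phi2 = fromB1 ? v3 : v4;
  return phi1 | phi2; // r = op phi1, phi2
}

// After: the op is hoisted into each predecessor and one phi merges
// the results. Note the pairing: b1 contributes v1 and v3, while b2
// contributes v2 and v4.
uint64_t after(bool fromB1, uint64_t v1, uint64_t v2,
               uint64_t v3, uint64_t v4) {
  uint64_t n1 = v1 | v3;   // computed in b1
  uint64_t n2 = v2 | v4;   // computed in b2
  return fromB1 ? n1 : n2; // r = phi [b1, n1] [b2, n2]
}
```

Both forms compute the same value on either incoming edge, which is the invariant the transformation must preserve.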

when test_struct_of_two_int is properly fixed, we should add a phase-ordering test for it to make sure it fully optimizes correctly, if we do go down this route

Carrot added a comment.Dec 8 2021, 3:56 PM

Driven by the compile-time feedback, another thing for me to consider, (probably after a preliminary convergence on the pass to add analysis info or do the transform), is to prune the existing solution so it's faster on the benchmark.

Your current method is:

1. Collect all accesses to an alloca object and mark the access type.
2. Check the aggregated access types; if there is only the <store constant> type, mark the object as unsplittable.

I think in step 1, if an access type is not <store constant>, you can bail out early; then for most splittable objects the check can be finished quickly.
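The suggested early exit could look roughly like this (a hedged sketch; the enum and helper are hypothetical, not the actual SROA data structures):

```cpp
#include <cassert>
#include <vector>

// Hypothetical access classification, not the real SROA data model.
enum class AccessKind { StoreConstant, ScalarLoad, OtherStore, Intrinsic };

// Step 1 with the suggested bailout: stop walking as soon as any
// access other than <store constant> is seen, so the common splittable
// case is classified without visiting every access.
bool onlyConstantStores(const std::vector<AccessKind> &Accesses) {
  for (AccessKind K : Accesses)
    if (K != AccessKind::StoreConstant)
      return false; // bail out early; keep the alloca splittable
  return true;      // candidate for marking unsplittable
}
```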

thanks for the feedback!

after running it through -O2:

define i64 @test_struct_of_two_int(i1 zeroext %test, i64 ()* nocapture %p) local_unnamed_addr {
entry:
  br i1 %test, label %return, label %if.end

if.end:                                           ; preds = %entry
  %call = tail call i64 %p()
  %retval.sroa.3.0.extract.shift = and i64 %call, -4294967296
  %phi.cast = and i64 %call, 4294967295
  br label %return

return:                                           ; preds = %entry, %if.end
  %retval.sroa.3.0 = phi i64 [ %retval.sroa.3.0.extract.shift, %if.end ], [ 0, %entry ]
  %retval.sroa.0.0 = phi i64 [ %phi.cast, %if.end ], [ 0, %entry ]
  %retval.sroa.0.0.insert.insert = or i64 %retval.sroa.0.0, %retval.sroa.3.0
  ret i64 %retval.sroa.0.0.insert.insert
}

yeah I'm not sure if this is something that only happens with SROA, but it does look very weird

perhaps trying to implement something like

b:
 phi1 = phi [b1, v1] [b2, v2]
 phi2 = phi [b1, v3] [b2, v4]
 r = op phi1, phi2

into

b1:
 n1 = op v1, v3
b2:
 n2 = op v2, v4
b:
 r = phi [b1, n1] [b2, n2]

and maybe only if at least one of the ops simplifies away to prevent increasing the number of op instructions

To confirm my understanding: we want the transformation above when one or more ops can be simplified away because the result of the op already exists as a variable, so the op (along with the instructions that generate the op's operands) can be eliminated. Is that roughly the idea?

In particular, for the test_struct_of_two_int example, op is or, and

%retval.sroa.3.0.extract.shift = and i64 %call, -4294967296
%phi.cast = and i64 %call, 4294967295

are used to generate the or operands. As a result, both could be optimized away.
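That folding can be checked directly: the two and masks are bitwise complements covering disjoint halves of %call, so or-ing the masked values yields %call itself (constants copied from the IR above; the function name is mine):

```cpp
#include <cassert>
#include <cstdint>

// -4294967296 is 0xFFFFFFFF00000000 and 4294967295 is 0x00000000FFFFFFFF;
// together the masks cover all 64 bits with no overlap, so the `or` of
// the two masked values is just `call` again.
uint64_t maskedOr(uint64_t call) {
  uint64_t hi = call & 0xFFFFFFFF00000000ULL; // %retval.sroa.3.0.extract.shift
  uint64_t lo = call & 0x00000000FFFFFFFFULL; // %phi.cast
  return hi | lo; // folds to `call`
}
```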

when test_struct_of_two_int is properly fixed we should add a phase ordering test for it to make sure it fully optimizes correctly if we do go down this route

Also, it sounds like the transformation could happen in a pass orthogonal to SROA (possibly after SROA, because the PHIs are inserted by SROA in this case).

So shall I prototype this transformation in a pass (wondering if instcombine is a good one), and run the modified pass between SROA and the pass that follows it?

I guess I'm trying to get more guidance on where this transformation should happen, with respect to phase ordering.

Driven by the compile-time feedback, another thing for me to consider, (probably after a preliminary convergence on the pass to add analysis info or do the transform), is to prune the existing solution so it's faster on the benchmark.

Your current method is

1. Collect all accesses to an alloca object and mark the access type.
2. Check the aggregated access types; if there is only the <store constant> type, mark the object as unsplittable.

I think in step 1, if an access type is not <store constant>, you can bail out early; then for most splittable objects the check can be finished quickly.

whoops; I was typing so I probably missed this suggestion just now. Bailing out early for most splittable cases sounds reasonable. I will do a quick experiment and report back the compile time (if possible). thanks!

In D114832#3181393, @luna wrote:

thanks for the feedback!

after running it through -O2:

define i64 @test_struct_of_two_int(i1 zeroext %test, i64 ()* nocapture %p) local_unnamed_addr {
entry:
  br i1 %test, label %return, label %if.end

if.end:                                           ; preds = %entry
  %call = tail call i64 %p()
  %retval.sroa.3.0.extract.shift = and i64 %call, -4294967296
  %phi.cast = and i64 %call, 4294967295
  br label %return

return:                                           ; preds = %entry, %if.end
  %retval.sroa.3.0 = phi i64 [ %retval.sroa.3.0.extract.shift, %if.end ], [ 0, %entry ]
  %retval.sroa.0.0 = phi i64 [ %phi.cast, %if.end ], [ 0, %entry ]
  %retval.sroa.0.0.insert.insert = or i64 %retval.sroa.0.0, %retval.sroa.3.0
  ret i64 %retval.sroa.0.0.insert.insert
}

yeah I'm not sure if this is something that only happens with SROA, but it does look very weird

perhaps trying to implement something like

b:
 phi1 = phi [b1, v1] [b2, v2]
 phi2 = phi [b1, v3] [b2, v4]
 r = op phi1, phi2

into

b1:
 n1 = op v1, v3
b2:
 n2 = op v2, v4
b:
 r = phi [b1, n1] [b2, n2]

and maybe only if at least one of the ops simplifies away to prevent increasing the number of op instructions

To confirm my understanding: we want the transformation above when one or more ops can be simplified away because the result of the op already exists as a variable, so the op (along with the instructions that generate the op's operands) can be eliminated. Is that roughly the idea?

my thoughts were that we don't want to increase the number of op instructions in the code for code-size reasons, so hopefully at least one of them can fold away. We might have to run instsimplify (llvm::SimplifyInstruction) on the new op instruction and check that it actually gets simplified before proceeding.
also, moving the op into one of the incoming blocks is strictly better, because then we won't perform the op when coming from the other branch

In particular for the test_struct_of_two_int example, op is or, and

%retval.sroa.3.0.extract.shift = and i64 %call, -4294967296
%phi.cast = and i64 %call, 4294967295

are used to generate or parameters. As a result, both could be optimized away.

when test_struct_of_two_int is properly fixed we should add a phase ordering test for it to make sure it fully optimizes correctly if we do go down this route

Also, it sounds the transformation could happen in a pass orthogonal with SROA (possibly after SROA, because PHI are inserted after SROA in this case).

So shall I prototype this transformation in a pass (wondering if inst combine is a good one), and insert the modified pass between SROA and the pass after SROA?

I guess I'm trying to get more instructions where this transformation should happen, around phase ordering.

could first start with adding this to instcombine; instcombine runs a lot in the opt pipelines, and this sort of transform isn't large/important enough to warrant its own pass

thanks everyone for the feedback!

I took the liberty of creating another revision (https://reviews.llvm.org/D115914) and added all the reviewers there.

Let me know if it's more idiomatic to just revise this revision instead, and I can make that happen.