This is an archive of the discontinued LLVM Phabricator instance.

In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
ClosedPublic

Authored by niravd on Nov 19 2015, 10:59 AM.

Details

Summary

In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.

* Simplify Consecutive Merge Store Candidate Search

Now that address aliasing is much less conservative, push through
simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This is cleaner
as the separation of non-interfering loads/stores from the
store-merging logic.

When merging stores search up the chain through a single load, and
finds all possible stores by looking down from through a load and a
TokenFactor to all stores visited.

This improves the quality of the output SelectionDAG and the output
Codegen (save perhaps for some ARM cases where we correctly constructs
wider loads, but then promotes them to float operations which appear
but requires more expensive constant generation).
 
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)

Additional Minor Changes:

  1. Finishes removing unused AliasLoad code

  2. Unifies the chain aggregation in the merged stores across code
      paths

  3. Re-add the Store node to the worklist after calling
      SimplifyDemandedBits.
 
  4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
      arbitrary, but seems sufficient to not cause regressions in
      tests.

 5. Remove Chain dependencies of Memory operations on CopyfromReg
     nodes as these are captured by data dependence
 
 6. Forward loads-store values through tokenfactors containing
     {CopyToReg,CopyFromReg} Values.

 7. Peephole to convert buildvector of extract_vector_elt to
     extract_subvector if possible (see CodeGen/AArch64/store-merge.ll)

 8. Store merging for the ARM target is restricted to 32-bit as
     some in some contexts invalid 64-bit operations are being
     generated. This can be removed once appropriate checks are
     added.

This finishes the change Matt Arsenault started in r246307 and jyknight's original patch.

Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. This test should be
noted:

CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.

Diff Detail

Event Timeline

jyknight updated this revision to Diff 40679.Nov 19 2015, 10:59 AM
jyknight retitled this revision from to In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled..
jyknight updated this object.
jyknight added reviewers: arsenm, hfinkel.

Hi,

I think this patch: http://reviews.llvm.org/D14268 may correct the issue with: CodeGen/AMDGPU/merge-stores.ll

RKSimon added inline comments.
test/CodeGen/X86/vector-idiv.ll
4 ↗(On Diff #40679)

Please can you regenerate this with utils\update_llc_test_checks.py ? It should clean up some of the asm comments.

test/CodeGen/X86/vector-lzcnt-128.ll
13

Please can you regenerate this with utils\update_llc_test_checks.py ? It should clean up some of the asm comments.

jyknight added inline comments.Nov 23 2015, 10:42 AM
test/CodeGen/X86/vector-idiv.ll
4 ↗(On Diff #40679)

Aha, so *that's* why these tests look like this.

I had no idea that script existed. I'll do that. And also update the generator script to write a note in the output that it was generated by the script, so the next person doesn't have that problem. :)

I think this patch: http://reviews.llvm.org/D14268 may correct the issue with: CodeGen/AMDGPU/merge-stores.ll

Yep, it fixes that test case.

It also seems to cause vgpr-spill-emergency-stack-slot.ll to revert to its previous pessimal state: unable to recognize that the loads generated from the extracts come from the stores generated by the inserts. (Seems a mismatch between store/load sizes for the insertelement/extractelement memops, as it's getting LD16 and ST4, instead of LD4/ST4 as it was before).

Anyways, I think this change makes sense regardless of all the noted test changes; I intended those more just as an "FYI" to the various arch maintainers about potential enhancements that could be made.

hfinkel edited edge metadata.Nov 25 2015, 2:09 AM

Thanks for working on this!

One situation exists now where the new code is not able to detect
merge candidates and it was before: when some stores overlap a load,
and others do not. This causes
test/CodeGen/X86/merge-store-partially-alias-loads.ll to no longer
work.

What's a good plan for dealing with this? Is there a better way of doing this without (essentially) re-introducing the old algorithm? Permanently losing this functionality is likely not good either. Should we do both kinds of searches?

test/CodeGen/PowerPC/ppc64-align-long-double.ll
22

TODO -> FIXME

arsenm added inline comments.Nov 25 2015, 6:23 AM
lib/CodeGen/TargetLoweringBase.cpp
823

I increased this to 16 for AMDGPU. The custom setting for it there can be removed now

This causes
test/CodeGen/X86/merge-store-partially-alias-loads.ll to no longer
work.

What's a good plan for dealing with this? Is there a better way of doing this without (essentially) re-introducing the old algorithm? Permanently losing this functionality is likely not good either. Should we do both kinds of searches?

No, I don't think the old algorithm is at all the right thing to do. I also think it's bad for LLVM to use (basically arbitrarily) different modes for different targets, so I'd like to get rid of the old method even without having a fully baked plan on how to solve this particular case the new way.

Now, I do have a start on a thought on how this could be made to work, but haven't really fully thought through it. I also wonder whether it'd affect compilation-speed too much? Don't really know about that...

Anyways, here goes:

Firstly, we should look at all stores in the basic block, not only those attached to the same chain node. Then, once we've found those stores that look promising (same base pointer, not volatile, neighboring offsets, etc), see if it would be possible to:

  • create a TokenFactor node merging the incoming chains of ALL the candidate stores-to-merge.
  • make a new merged store with incoming chain being that TokenFactor.
  • replace all the uses of the outgoing chain values to use the single merged-store outgoing chain (as is done today).

What does "possible" mean? That doing the above won't create a loop in the DAG. That is, I think, checked simply by ensuring that no candidate store can be a predecessor of any of the other stores.

I'm also not sure if some other checks might be needed beyond the simple "possible" to avoid pessimizing the code. It seems like in some cases it might be a bad idea to merge two stores that are "far away" from each-other. E.g., let's say you have a dependency graph that looks like:

exit -> store1 -> ...lots of stuff... -> entry
exit -> ...other stuff... -> store2 -> entry

Originally, there's two completely independent streams of instructions which can be interleaved by the instruction scheduler. But then if a new fancy store-merger gets ahold of it, it might become:

exit -> ...other stuff... -> store1+2 -> ...lots of stuff...

And thus that optimization could be a substantial performance hit. I dunno if that situation is likely to come up, though.

This causes
test/CodeGen/X86/merge-store-partially-alias-loads.ll to no longer
work.

What's a good plan for dealing with this? Is there a better way of doing this without (essentially) re-introducing the old algorithm? Permanently losing this functionality is likely not good either. Should we do both kinds of searches?

No, I don't think the old algorithm is at all the right thing to do.

Okay, but...

I also think it's bad for LLVM to use (basically arbitrarily) different modes for different targets,

I completely agree.

so I'd like to get rid of the old method even without having a fully baked plan on how to solve this particular case the new way.

I agree that the problem can get quite general, but the particular test case in question looks pretty simple, and that's partially what worries me about it.

As you explain in the comment added to the test case, what the test case really exposes is, in a sense, a phase-ordering problem between findBetterNeighborChains() and MergeConsecutiveStores(). It seems like what should really happen is that findBetterNeighborChains() should check whether it can merge the store being moved up the chain with a store it find on the chain as it searches up the chain for a better (less constraining) operand. Would that work?

...

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12046

This comment is now out of date.

As you explain in the comment added to the test case, what the test case really exposes is, in a sense, a phase-ordering problem between findBetterNeighborChains() and MergeConsecutiveStores(). It seems like what should really happen is that findBetterNeighborChains() should check whether it can merge the store being moved up the chain with a store it find on the chain as it searches up the chain for a better (less constraining) operand. Would that work?

That might work okay for this case, but I'm afraid it will fall down at any minor difference in the graph.

For example, if you have instead:
store X+1
load X
store X

It should be able to merge the stores, but won't be able to find it without being able to move the "store X+1" down.

Okay, maybe we just ignore that problem. But, can it reliably handle, say:
#1: load X
#2: store X
#3: store X+1
#4: store X+2
#5: store X+3

Those should certainly be mergeable, too. But, even with the proposed modification, moving each node incrementally up the chain might or might not work, depending on visit order, which is basically arbitrary.

(And, of course, there's also the known problem with the code now, where it can not merge stores of non-equal sizes, which is why it's important to move as many nodes first, before doing any merging. Otherwise, it can happen that it'd merge the X+1 and X+2 stores, before the others are available to merge. And then, it's not possible to merge in the X and X+3 stores. That should be fixed, of course.)

Anyhow, basically, I think to get the right answer, you'd need to know not to move ANY of the nodes past node #1 -- until you've reached the state with all of them having their chain as #1 so they can be merged together. But, since the desirable final state for each of the nodes (other than #2) is to be attached to entry node, having to somehow somehow enforce that that desirable modification DOESN'T happen until other nodes get moved to the proper intermediate place, I'm not sure, doesn't seem easy to make non-fragile.

I'm sure it's implementable somehow, but it doesn't seem like it'd be easier than solving the general problem, at least to me.

As you explain in the comment added to the test case, what the test case really exposes is, in a sense, a phase-ordering problem between findBetterNeighborChains() and MergeConsecutiveStores(). It seems like what should really happen is that findBetterNeighborChains() should check whether it can merge the store being moved up the chain with a store it find on the chain as it searches up the chain for a better (less constraining) operand. Would that work?

That might work okay for this case, but I'm afraid it will fall down at any minor difference in the graph.

For example, if you have instead:
store X+1
load X
store X

It should be able to merge the stores, but won't be able to find it without being able to move the "store X+1" down.

Okay, maybe we just ignore that problem. But, can it reliably handle, say:
#1: load X
#2: store X
#3: store X+1
#4: store X+2
#5: store X+3

Those should certainly be mergeable, too. But, even with the proposed modification, moving each node incrementally up the chain might or might not work, depending on visit order, which is basically arbitrary.

(And, of course, there's also the known problem with the code now, where it can not merge stores of non-equal sizes, which is why it's important to move as many nodes first, before doing any merging. Otherwise, it can happen that it'd merge the X+1 and X+2 stores, before the others are available to merge. And then, it's not possible to merge in the X and X+3 stores. That should be fixed, of course.)

Anyhow, basically, I think to get the right answer, you'd need to know not to move ANY of the nodes past node #1 -- until you've reached the state with all of them having their chain as #1 so they can be merged together. But, since the desirable final state for each of the nodes (other than #2) is to be attached to entry node, having to somehow somehow enforce that that desirable modification DOESN'T happen until other nodes get moved to the proper intermediate place, I'm not sure, doesn't seem easy to make non-fragile.

I'm sure it's implementable somehow, but it doesn't seem like it'd be easier than solving the general problem, at least to me.

The "general" problem is global, and we'll never be able to get it with a local analysis. However, while I like the direction of this patch, I think we need to try really hard not to introduce "basic" regressions (i.e. regressions that can be demonstrated with really small test cases).

You've convinced me that trying to fix this by merging the chain-finding process and the merge-candidate-determination process is too limited to be really useful (even if it would fix the one existing regression test, writing other similarly-simple ones that would also be broken would be easy).

Thus far we have a few examples:

Case 1:

load X
store X
store X+1

This will give us:

store X -> load X -> E
store X+1 -> E

Case 2:

store X+1
load X
store X

Which should also yield:

store X -> load X -> E
store X+1 -> E

and some generalizations with more stores, etc. All of these can be found by a limited search. Given a situation like this:

store X -> load X -> E
store X+1 -> load X+1 -> E

it seems that instead of just searching other users of the chain of the initial store, we need to also walk up and down a bit. We might walk up through (optionally) a token factor and a load to find a new chain to search. We can then search down the users of that chain, directly, through a load, or through a load and a token factor, for consecutive stores. If found, then we can do this:

store (X, X+1) -> TF -> load X -> E
                                 -> load X+1 -> E

we just need to make sure that, in making load X+1 a predecessor of store X (and making load X a predecessor of store X+1) we won't create any cycles (load X+1 cannot already be a successor of store X, etc.).

Given that findBetterChain should collapse otherwise-sequential loads to be parallel ones (joined by token factors), this should be fairly general, and will handle the simple cases outlined here. What do you think?

What do you think?

That was actually the same idea I had been cooking up, but I hadn't actually gotten around to trying it out yet. :)

I've still not had time to actually explore it with code yet, but I'll try to do that at some point. I think it ought to work.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12046

And, actually, so is the code. Removed the loop and the comment.

lib/CodeGen/TargetLoweringBase.cpp
823

Done.

jyknight added inline comments.Dec 14 2015, 8:47 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12046

Oops, I didn't mean to send that reply yet, as the loop sadly DOES still accomplish /something/, but I think it's probably doing so in a way that won't actually complete all the work that could be done in all cases.

niravd edited edge metadata.Feb 26 2016, 1:10 PM
niravd added a subscriber: niravd.
niravd commandeered this revision.Mar 7 2016, 9:18 AM
niravd added a reviewer: jyknight.
niravd updated this revision to Diff 49974.EditedMar 7 2016, 9:18 AM

A new patch which inprinciple capture missing functionality. Needs somechanges to AliasAnalysis to resolve regressions. Also exposes some bugs in AMDGPU target which need to be resolved.

niravd updated this object.Mar 7 2016, 10:24 AM

The new patch is currently exposes a bug with AMDGPU target and the CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot-compute.ll test now fails to compile. Can someone familiar with the target take a look at this?

niravd updated this revision to Diff 50969.Mar 17 2016, 1:10 PM
niravd updated this object.

New Patch containing test case changes and minor code cleanup. Everything is nominally working

jyknight added inline comments.Mar 25 2016, 7:04 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
54–55

Please remove, since it's now unused.

11369–11370

This is actually fixing incorrect behavior in UseAA mode; let's commit it separately (along with the exact same change which is made down below). When done separately, it'll need to be conditioned on UseAA, though, since the previous code was right for !UseAA.

11403

Commented code should be removed.

11449–11453

Why isn't this:

} else if (I.getOperandNo() == 0)
  addStoreNodeIfMergableStore(*I, St, StoreNodes, Seq);

?

11802–11803

The other half of the bugfix I mentioned above.

11994–12001

This can also be pulled out as a simple fix.

lib/CodeGen/TargetLoweringBase.cpp
823

The removal of that line (in AMDGPUISelLoweing.cpp) doesn't actually seem to be done, despite that I said done before (oops).

test/CodeGen/AArch64/argument-blocks.ll
64–68

Delete added comment; upon re-reading, it's the size/align that it was checking.

test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
16

What's the status of this one now?

test/CodeGen/X86/copy-eflags.ll
21–22

Is it really the best thing to add volatile to a bunch of random tests? Can't the CHECK lines be fixed instead?

niravd updated this object.Mar 29 2016, 1:56 PM
niravd updated this object.
niravd updated this revision to Diff 51983.Mar 29 2016, 2:06 PM
niravd marked 15 inline comments as done.
niravd updated this object.

Address comments. Simplify and cleanup code

niravd marked 2 inline comments as not done.Apr 8 2016, 12:52 PM
niravd added inline comments.
test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
16–26

No change here. Test completes but no longer tests desired property.

niravd updated this revision to Diff 53906.Apr 15 2016, 9:44 AM

Rebase and simplify. AMDGPU VGPR test again crashing.

niravd updated this revision to Diff 56602.May 9 2016, 11:13 AM

Potential fix for crash in SIFrameLowering and modify VGPR test to pass though it no longer tests what it was supposed to.

Can someone who knows AMDGPU take a look at SIFrameLowering change and the associated vgpr test that was crashing before it?

niravd updated this revision to Diff 58099.May 23 2016, 8:26 AM

Rebasing again. Still waiting on final okay

niravd updated this revision to Diff 58380.May 24 2016, 7:30 PM

Update given r270646.

Now that the AMDGPU crash case has been resolved, this this patch is all set and ready for an LGTM and finally landing.

niravd updated this object.May 24 2016, 7:57 PM
jyknight edited edge metadata.May 25 2016, 11:30 AM

This looks pretty good. All I have are some trivial requests for changes, and some questions on tests.

jyknight added inline comments.May 25 2016, 11:30 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11376–11377

Remove "AndAlias" from the name; no longer collects AliasLoadNodes.

11394

I think this could do with an ascii-art diagram to explain the shape of the DAG we're looking for.

E.g. something like this:

Given a structure like this:

 Root
  |-------|-------|
 Load    Load   Store
  |       |
Store   Store

We might start off looking at any of the Store nodes, and need to find all of the others to see if they can be merged.

test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
38

Why add volatile, if the test still doesn't work?

test/CodeGen/X86/copy-eflags.ll
12–16

Deleted a comment line by accident here.

21–23

Same Q re addition of volatile here; why's it needed?

test/CodeGen/X86/i256-add.ll
7

why volatile here.

test/CodeGen/X86/i386-shrink-wrapping.ll
75

why volatile here

test/DebugInfo/X86/dbg-value-dag-combine.ll
16 ↗(On Diff #58380)

why volatile?

niravd updated this revision to Diff 62923.Jul 6 2016, 11:44 AM
niravd marked 3 inline comments as done.
niravd edited edge metadata.

Resolved comments and rebased

niravd added inline comments.Jul 11 2016, 6:46 AM
test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
38

Reverted.

test/CodeGen/X86/i256-add.ll
7

Removing dependencies in the DAG puts the two references to %p together and enables an optimization that to converts the sbbls to adcls. I've changed this to separate the srcs and dests to prevent this.

test/CodeGen/X86/i386-shrink-wrapping.ll
75

Part of this test checks that we're not clobbering the flags when we shrink wrap which was clobbered because we can now move the load of @f to avoid the problem entirely. I've changed it to load @e which does the same thing.

test/DebugInfo/X86/dbg-value-dag-combine.ll
16 ↗(On Diff #58380)

Reverted.

niravd updated this revision to Diff 66667.Aug 3 2016, 8:11 AM

Rebase and update for new tests

nhaehnle edited edge metadata.Aug 8 2016, 3:44 AM

I went through the AMDGPU test changes. Disabling the spill tests is unfortunate but I think acceptable. I do have two comments below, the rest looks good.

test/CodeGen/AMDGPU/amdgpu.private-memory.ll
230–231 ↗(On Diff #66667)

This looks like a regression to me. There are two stores to different parts of the [2 x i16] array, and the second gets eliminated even though the load may need it (since it uses a dynamic offset in the getelementptr).

FWIW, in the function no_overlap below, I see the same regression but only for the [3 x i8] alloca. The stores and loads based on the [2 x i8] alloca look correct to me.

test/CodeGen/AMDGPU/debugger-insert-nops.ll
6–8

I'm not sure about this test change. There seems to be some re-ordering going on that may not be desirable for debugging.

niravd updated this revision to Diff 67455.Aug 9 2016, 9:14 PM
niravd edited edge metadata.

Fix FrameIndex logic in DAGCombiner's alias analysis

test/CodeGen/AMDGPU/amdgpu.private-memory.ll
230–231 ↗(On Diff #66667)

Hmm. This was not a case of accessing past the alloc. Rather this is a bug in alias analysis in the DAGCombiner. isAlias as it was written is imprecise identifying frame indexes but we assume we're precise which means incorrectly determine that the second store is non-aliasing and as it's a frame store we can optimize it away as we think nothing can read it.

I have a fix that I'm folding into this patch because I haven't found an example that exercises this issue at current head.

test/CodeGen/AMDGPU/debugger-insert-nops.ll
6–8

This patch opens the possibility of the reordering, but I agree that this is non-ideal in that the scheduler doesn't seem to have an advantage from selecting that order and should therefore bias towards the source order.

In general, though this sort of reordering should happen so I'm inclined to punt this to a subsequent patch.

FTR: discussed the most recent change in person, and concluded that it's incorrect. New version on the way to fix this issue.

niravd updated this revision to Diff 70185.Sep 2 2016, 10:22 AM

Update with additional fix from D23356

niravd updated this revision to Diff 72352.Sep 23 2016, 1:54 PM

Rebase and update.

dsanders removed a subscriber: dsanders.Sep 26 2016, 1:46 AM

With D23356 landed, this diff should finally be ready to land.

Okay, I think this should be the last round of little nits for this change. :)

Please verify/update the commit message for consistency with the final state of this change, too.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11294–11295

This var should be moved way down above its (now) first use.

11454

Clearer without removing the braces around another compound statement.

lib/CodeGen/TargetLoweringBase.cpp
823

I don't see it still...

test/CodeGen/AMDGPU/debugger-insert-nops.ll
4

What does this comment mean?

11–18

Also this one?

test/CodeGen/X86/copy-eflags.ll
13

s/volitile/volatile/

test/CodeGen/X86/vector-compare-results.ll
2

Unfortunate duplicate line

test/CodeGen/X86/vector-lzcnt-128.ll
-11

And here. Probably should fix the script to not cause this. :)

test/CodeGen/X86/vector-shuffle-variable-128.ll
2

Here too.

test/CodeGen/X86/win32-eh.ll
71–72

What's up with this test change?

niravd updated this revision to Diff 72531.Sep 26 2016, 11:52 AM
niravd marked 9 inline comments as done.

Fix minor nits

test/CodeGen/X86/win32-eh.ll
71–72

Now that we do alias analysis by universally, we can see that there is no dependence between the reference to the __security_cookie address and the stack and we share the two loads now identical memory loads of the security cookie.

niravd updated this revision to Diff 72559.Sep 26 2016, 1:54 PM

Update tests for improved commit message

niravd updated this object.Sep 26 2016, 1:56 PM
jyknight accepted this revision.Sep 27 2016, 12:33 PM
jyknight edited edge metadata.

OK, let's try it. :)

This revision is now accepted and ready to land.Sep 27 2016, 12:33 PM
This revision was automatically updated to reflect the committed changes.
niravd reopened this revision.Oct 21 2016, 6:27 AM

Landing this appears to cause various bugs in various setups in bootstrapping that are proving hard to reproduce in a debug-friendly configuration. I'm going to start peeling off a smaller portions are functionally separable if not so testwise and see if that helps shake out the underlying problem.

This revision is now accepted and ready to land.Oct 21 2016, 6:27 AM
This revision was automatically updated to reflect the committed changes.
niravd reopened this revision.Dec 28 2016, 8:58 AM
niravd edited edge metadata.
niravd added a subscriber: rengolin.

Reopening for additional inspection

This revision is now accepted and ready to land.Dec 28 2016, 8:58 AM
niravd updated this revision to Diff 82602.Dec 28 2016, 8:59 AM

update with peepholes to fix degraded tests

niravd updated this object.Dec 28 2016, 9:03 AM

Hi Nirav,

Sorry for the delay. I looked again at the ARM/AArch64 tests and they seem pretty much standard. No problems there.

I only have two comments, inline.

cheers,
--renato

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1473 ↗(On Diff #82602)

What if OffsetStride > Offset?

test/CodeGen/ARM/memset-inline.ll
9

It's interesting because the store was merged before. This worries me because memset is an important function to get right.

IIRC, the hazards are between VFP/NEON and GPR. In this case, it's a VMOV-imm + VST1 d, which means no GPRs were involved and there is no hazard. Better still, this is independent from the movs and only has address update for the str afterwards, which could mean they'll execute pretty much in parallel.

There could be some sub-arch issues with other cores, but this could actually be an improvement. Have you measured it?

niravd updated this revision to Diff 85624.Jan 24 2017, 1:00 PM
niravd marked an inline comment as done.
niravd edited the summary of this revision. (Show Details)

Restore worsened AArch64 and ARM test cases.

niravd added inline comments.Jan 24 2017, 1:00 PM
lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1473 ↗(On Diff #82602)

This is just to make sure that our early exit check doesn't exit on MAXOFFSET+OffsetStride beacuse the stores may not come in order. The positive condition is overly aggressive but cases near zero are fine.

test/CodeGen/ARM/memset-inline.ll
9

For the moment I've restricted ARM memory merging to 32-bit or smaller. This makes all of the ARM tests clearly as good or better.

I can imagine leveraging the VFP/NEON and GPR together could be an improvement, but it looks like there's definitely causing issues on at least one subarch.

niravd edited the summary of this revision. (Show Details)Jan 24 2017, 1:03 PM
niravd updated this revision to Diff 86863.Feb 2 2017, 11:36 AM

Update testcases to upstream. Minor cleanup to TF pruning and load forwarding (NFC)

niravd updated this revision to Diff 89783.Feb 25 2017, 3:49 AM

Fix 32-bit anti-aliasing offset bug

niravd closed this revision.Mar 20 2017, 6:06 AM
test/CodeGen/X86/merge-store-partially-alias-loads.ll