This is an archive of the discontinued LLVM Phabricator instance.

[DAG] Do MergeConsecutiveStores again before Instruction Selection
ClosedPublic

Authored by niravd on May 30 2017, 6:38 AM.

Details

Summary

Enable by default post-legalization store merging to non-X86 machines for all targets to allow merging stores from lowered intrinsics / calls.

Pre-legalization store merging cannot yet be disabled as nodes with custom lowering may be lowered during legalization obscuring some merge candidates.

Event Timeline

niravd created this revision.May 30 2017, 6:38 AM
efriedma edited edge metadata.May 30 2017, 4:15 PM

5% of time in DAGCombine, or 5% total? 5% total is a lot for an optimization which triggers relatively rarely.

It's about a 5% time increase for the sum of DAG Combine phases on the my bad test cases for store merge (large basic blocks with a large number of stores which are offset from the same base, but non-mergeable stores). It looks closer to 2% on the total. In the majority of cases, the effect is negligible.

A simple caching check on the PersistantID could remove most of the redundant work. I had plans on looking at a more efficient version of this pass of this once the pre-legal type merge was removed and the nodes are more stable.

Okay, in that case the compile-time sounds fine.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13170

Can you point to an example testcase where the run before legalization matters?

lib/Target/AArch64/AArch64ISelLowering.cpp
9425

How is this related to the other change?

test/CodeGen/X86/bigstructret.ll
24

We should be able to do something more clever here... but I guess it's not important.

niravd added inline comments.Jun 1 2017, 7:08 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13170

The most obvious is:

merge_vec_element_store and merge_vec_extract_stores in CodeGen/X86/MergeConsecutiveStores have issues related to bitcasting simplification. From what I've seen this is because we do a bticasting op for a vector with index 0 that we don't with others.

Related issues that also need addressing before killing the pre-legalization merge. These seem simple enough:

  • redundant_stores_merging and overlapping_stores_merging in CodeGen/X86/stores-merging.ll fails to merge because BaseIndexOffset cannot look through the X86ISD:Wrappers nor extract offset from TargetGlobals preventing merge otherwise obvious merges. The later is also true for CodeGen/AMDGPU/merge-stores.ll
  • truncated stores are neither generated or used as valid inputs for store merge.
  • CodeGen/AArch64/merge-store.ll adds extra nodes due to bitcasting
lib/Target/AArch64/AArch64ISelLowering.cpp
9425

Amongst other things split storess replaces vector stores of zero/scalars into appropriate sizes so that as Machine instructions we can create paired memory operations which may be cheaper. This should be run whenever we create a new store larger store.

Concrete examples of the effects can be seen in the CodeGen/AArch64/ldst-opt.ll test. This patch nominally depends on D33518 which fixes untested cases caught by the recent store merge optimizations.

efriedma added inline comments.Jun 14 2017, 4:11 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13170

truncated stores are neither generated or used as valid inputs for store merge.

Oh, so it completely breaks everything that isn't x86. :)

Could you make before/after/both an option, so it's easier to check the impact if we do run into issues?

If we really just want to run this after legalization, it's probably a good idea to flip the switch as soon as possible, even if it causes minor regressions.

RKSimon resigned from this revision.Oct 29 2017, 5:14 AM
niravd updated this revision to Diff 121773.Nov 6 2017, 1:22 PM
niravd edited the summary of this revision. (Show Details)

Resurrecting after long hiatus.

The AArch64 test changes look fine.

If the Aarch64 tests look good, I think this should be landable. The only remaining test that is more than a simple merge is the Mips/cconv/vector.ll test where the store merging allows the value stored on the stack to be forwarded to a now matching load and the stores and loads excised.

The AArch64 test changes look fine.

Ping. I think we're all set here. Can I get an LGTM?

rnk accepted this revision.Nov 17 2017, 1:38 PM

lgtm

This revision is now accepted and ready to land.Nov 17 2017, 1:38 PM
This revision was automatically updated to reflect the committed changes.
eastig added a subscriber: eastig.Nov 29 2017, 9:23 AM

Hi Nirav,

Could you please revert the changes? They affected Arm targets (Thumb2 code).
The following sequence of stores:

MOVS     r0,#0xe5
STRB     r0,[r6,#0x1e5]
MOVS     r0,#0xe4
STRB     r0,[r6,#0x1e4]
MOVS     r0,#0xe6
STRB     r0,[r6,#0x1e6]
MOVS     r0,#0xe7
STRB     r0,[r6,#0x1e7]

is optimised into

MVN      r0,#0x1b
STR      r0,[r6,#0x1e4]

causing incorrect data to be written.

We are working on a reproducer.

Thanks,
Evgeny Astigeevich
The Arm Compiler Optimisation team

A reproducer:

$ cat test.ll
...
  %v304 = getelementptr inbounds i8, i8* %v50, i32 508
  store i8 -4, i8* %v304, align 1
  %v305 = getelementptr inbounds i8, i8* %v50, i32 509
  store i8 -3, i8* %v305, align 1
  %v306 = getelementptr inbounds i8, i8* %v50, i32 510
  store i8 -2, i8* %v306, align 1
  %v307 = getelementptr inbounds i8, i8* %v50, i32 511
  store i8 -1, i8* %v307, align 1
...
$ llc -O3 -filetype=asm -o test.s test.ll
$ cat test.s
...
        movs    r1, #251
        strb.w  r1, [r0, #507]
        mvn     r1, #3 <========= HERE the problem: -4, -1, -1, -1 is written instead of -4, -3, -2, -1
        str.w   r1, [r0, #508]
        bx      lr
.Lfunc_end0:
        .size   test, .Lfunc_end0-test
        .cantunwind
        .fnend
...
eastig added a comment.EditedDec 1 2017, 12:35 PM

These changes caused Clang to crash when it compiled spec2006 403.gcc for AArch64. I am working on a reproducer.

eastig added a comment.Dec 1 2017, 1:26 PM

These changes caused failures of AArch64 NEON Emperor tests.