This is an archive of the discontinued LLVM Phabricator instance.

[DAG] Do MergeConsecutiveStores again before Instruction Selection
ClosedPublic

Authored by niravd on May 30 2017, 6:38 AM.

Details

Summary

Enable by default post-legalization store merging to non-X86 machines for all targets to allow merging stores from lowered intrinsics / calls.

Pre-legalization store merging cannot yet be disabled as nodes with custom lowering may be lowered during legalization obscuring some merge candidates.

Diff Detail

Repository
rL LLVM

Event Timeline

niravd created this revision.May 30 2017, 6:38 AM
efriedma edited edge metadata.May 30 2017, 4:15 PM

5% of time in DAGCombine, or 5% total? 5% total is a lot for an optimization which triggers relatively rarely.

It's about a 5% time increase for the sum of DAG Combine phases on the my bad test cases for store merge (large basic blocks with a large number of stores which are offset from the same base, but non-mergeable stores). It looks closer to 2% on the total. In the majority of cases, the effect is negligible.

A simple caching check on the PersistantID could remove most of the redundant work. I had plans on looking at a more efficient version of this pass of this once the pre-legal type merge was removed and the nodes are more stable.

Okay, in that case the compile-time sounds fine.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13170 ↗(On Diff #100701)

Can you point to an example testcase where the run before legalization matters?

lib/Target/AArch64/AArch64ISelLowering.cpp
9425 ↗(On Diff #100701)

How is this related to the other change?

test/CodeGen/X86/bigstructret.ll
24 ↗(On Diff #100701)

We should be able to do something more clever here... but I guess it's not important.

niravd added inline comments.Jun 1 2017, 7:08 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13170 ↗(On Diff #100701)

The most obvious is:

merge_vec_element_store and merge_vec_extract_stores in CodeGen/X86/MergeConsecutiveStores have issues related to bitcasting simplification. From what I've seen this is because we do a bticasting op for a vector with index 0 that we don't with others.

Related issues that also need addressing before killing the pre-legalization merge. These seem simple enough:

  • redundant_stores_merging and overlapping_stores_merging in CodeGen/X86/stores-merging.ll fails to merge because BaseIndexOffset cannot look through the X86ISD:Wrappers nor extract offset from TargetGlobals preventing merge otherwise obvious merges. The later is also true for CodeGen/AMDGPU/merge-stores.ll
  • truncated stores are neither generated or used as valid inputs for store merge.
  • CodeGen/AArch64/merge-store.ll adds extra nodes due to bitcasting
lib/Target/AArch64/AArch64ISelLowering.cpp
9425 ↗(On Diff #100701)

Amongst other things split storess replaces vector stores of zero/scalars into appropriate sizes so that as Machine instructions we can create paired memory operations which may be cheaper. This should be run whenever we create a new store larger store.

Concrete examples of the effects can be seen in the CodeGen/AArch64/ldst-opt.ll test. This patch nominally depends on D33518 which fixes untested cases caught by the recent store merge optimizations.

efriedma added inline comments.Jun 14 2017, 4:11 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13170 ↗(On Diff #100701)

truncated stores are neither generated or used as valid inputs for store merge.

Oh, so it completely breaks everything that isn't x86. :)

Could you make before/after/both an option, so it's easier to check the impact if we do run into issues?

If we really just want to run this after legalization, it's probably a good idea to flip the switch as soon as possible, even if it causes minor regressions.

RKSimon resigned from this revision.Oct 29 2017, 5:14 AM
niravd updated this revision to Diff 121773.Nov 6 2017, 1:22 PM
niravd edited the summary of this revision. (Show Details)

Resurrecting after long hiatus.

The AArch64 test changes look fine.

If the Aarch64 tests look good, I think this should be landable. The only remaining test that is more than a simple merge is the Mips/cconv/vector.ll test where the store merging allows the value stored on the stack to be forwarded to a now matching load and the stores and loads excised.

The AArch64 test changes look fine.

Ping. I think we're all set here. Can I get an LGTM?

rnk accepted this revision.Nov 17 2017, 1:38 PM

lgtm

This revision is now accepted and ready to land.Nov 17 2017, 1:38 PM
This revision was automatically updated to reflect the committed changes.
eastig added a subscriber: eastig.Nov 29 2017, 9:23 AM

Hi Nirav,

Could you please revert the changes? They affected Arm targets (Thumb2 code).
The following sequence of stores:

MOVS     r0,#0xe5
STRB     r0,[r6,#0x1e5]
MOVS     r0,#0xe4
STRB     r0,[r6,#0x1e4]
MOVS     r0,#0xe6
STRB     r0,[r6,#0x1e6]
MOVS     r0,#0xe7
STRB     r0,[r6,#0x1e7]

is optimised into

MVN      r0,#0x1b
STR      r0,[r6,#0x1e4]

causing incorrect data to be written.

We are working on a reproducer.

Thanks,
Evgeny Astigeevich
The Arm Compiler Optimisation team

A reproducer:

$ cat test.ll
...
  %v304 = getelementptr inbounds i8, i8* %v50, i32 508
  store i8 -4, i8* %v304, align 1
  %v305 = getelementptr inbounds i8, i8* %v50, i32 509
  store i8 -3, i8* %v305, align 1
  %v306 = getelementptr inbounds i8, i8* %v50, i32 510
  store i8 -2, i8* %v306, align 1
  %v307 = getelementptr inbounds i8, i8* %v50, i32 511
  store i8 -1, i8* %v307, align 1
...
$ llc -O3 -filetype=asm -o test.s test.ll
$ cat test.s
...
        movs    r1, #251
        strb.w  r1, [r0, #507]
        mvn     r1, #3 <========= HERE the problem: -4, -1, -1, -1 is written instead of -4, -3, -2, -1
        str.w   r1, [r0, #508]
        bx      lr
.Lfunc_end0:
        .size   test, .Lfunc_end0-test
        .cantunwind
        .fnend
...
eastig added a comment.EditedDec 1 2017, 12:35 PM

These changes caused Clang to crash when it compiled spec2006 403.gcc for AArch64. I am working on a reproducer.

eastig added a comment.Dec 1 2017, 1:26 PM

These changes caused failures of AArch64 NEON Emperor tests.