This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Enable more load clustering in the MI Scheduler.
ClosedPublic

Authored by mcrosier on Mar 10 2016, 8:47 AM.

Details

Summary

This patch adds unscaled loads and sign-extend loads to the TII getMemOpBaseRegImmOfs API, which is used to control clustering in the MI scheduler. This is done to create more opportunities for load pairing. I've also added the scaled LDRSWui instruction, which was missing from the scaled instructions. Finally, I've added support in shouldClusterLoads for clustering adjacent sext and zext loads that too can be paired by the load/store optimizer.

Overall, this patch increases the number of unscaled pairs by about 3% for Spec2006. I saw similar results for Spec2000. Also, I didn't see any serious changes in the register allocator statistics (see below).

Below is a summary of the llvm stats when comparing without and with this patch. For example, the first stat indicates 148 (or ~3.18%) more ldps are generated from unscaled loads and and the total number of paired instructions increase by 317 (or 0.64%).

Summary:

 148 (3.18) aarch64-ldst-opt - Number of load/store from unscaled generated     
 317 (0.64) aarch64-ldst-opt - Number of load/store pair instructions generated 
  10 (0.42) aarch64-ldst-opt - Number of post-index updates folded              
-272 (-0.01) asm-printer - Number of machine instrs printed                      
-936 (-0.00) assembler - Number of emitted object file bytes                     
   6 (0.00) assembler - Number of evaluated fixups                              
   6 (0.00) mccodeemitter - Number of MC fixups created.                        
-272 (-0.01) mccodeemitter - Number of MC instructions emitted.                  
  11 (0.00) mcexpr - Number of MCExpr evaluations                               
  96 (0.00) pei - Number of bytes used for stack in all functions               
   4 (0.00) regalloc - Number of copies inserted for splitting                  
  -1 (-0.00) regalloc - Number of identity moves eliminated after rewriting      
   3 (0.01) regalloc - Number of interferences evicted                          
   2 (0.22) regalloc - Number of live ranges fractured by DCE                   
  15 (0.01) regalloc - Number of new live ranges queued                         
   8 (0.00) regalloc - Number of registers assigned                             
   4 (0.01) regalloc - Number of registers unassigned                           
   1 (0.00) regalloc - Number of rematerialized defs for spilling               
  -1 (-0.03) regalloc - Number of rematerialized defs for splitting              
   2 (0.01) regalloc - Number of spill slots allocated                          
   1 (0.00) regalloc - Number of spilled live ranges                            
   5 (0.04) regalloc - Number of split global live ranges                       
  -2 (-0.10) regalloc - Number of split local live ranges                        
   2 (0.01) regalloc - Number of splits finished                                
   2 (0.01) regalloc - Number of splits that were simple                        
  29 (0.07) slotindexes - Number of local renumberings                          
   1 (0.03) stackslotcoloring - Number of stack slots eliminated due to coloring
   1 (0.00) tailduplication - Additional instructions due to tail duplication   
   1 (0.04) tailduplication - Number of dead blocks removed

Passed all correctness for EEMBC, Spec200X, llvm test-suite. Performance results look to be mostly noise with minor improvements here and there.

Chad

Diff Detail

Repository
rL LLVM

Event Timeline

mcrosier updated this revision to Diff 50279.Mar 10 2016, 8:47 AM
mcrosier retitled this revision from to [AArch64] Enable load clustering of unscaled loads in the MI Scheduler..
mcrosier updated this object.
mcrosier added reviewers: jmolloy, t.p.northover.
mcrosier added subscribers: llvm-commits, gberry, mssimpso and 3 others.
gberry added inline comments.Mar 10 2016, 9:03 AM
lib/Target/AArch64/AArch64InstrInfo.cpp
1501 ↗(On Diff #50279)

Don't you need to check this for Offset2 as well?

Also, I'm not sure what the style guide says, but should this be?

if (Offset1 % OffsetStride != 0)
mcrosier updated this revision to Diff 50291.Mar 10 2016, 9:40 AM

Update per Geoff's comments.

I collected the stats for Spec2006 again and they're verbatim the same. The code make fewer changes to the schedule, however, so it looks like adding the check for offset2 removed a small amount of skewing in the schedule that would never result in a pair being generated.

mcrosier marked an inline comment as done.Mar 10 2016, 9:40 AM
mcrosier added inline comments.
lib/Target/AArch64/AArch64InstrInfo.cpp
1501 ↗(On Diff #50279)

Yes, I believe you're correct. We do need to check Offset2 as well.

I'll make the style change as well.

mcrosier updated this revision to Diff 50332.Mar 10 2016, 12:39 PM
mcrosier marked an inline comment as done.

Fix an off by 1 error when checking the offset range.
Add a test case.

mcrosier updated this revision to Diff 50475.Mar 11 2016, 2:09 PM
mcrosier retitled this revision from [AArch64] Enable load clustering of unscaled loads in the MI Scheduler. to [AArch64] Enable more load clustering in the MI Scheduler..
mcrosier updated this object.

Add support for clustering sext and zext loads and associated test case.

gberry added inline comments.Mar 16 2016, 10:10 AM
test/CodeGen/AArch64/arm64-ldp-cluster.ll
3 ↗(On Diff #50475)

Perhaps you should add some CHECK statements to verify that the loads you are interested in correspond to the SU nodes you are checking?

mcrosier updated this revision to Diff 50929.Mar 17 2016, 6:55 AM

Update test case, per Geoff's request.

mcrosier marked an inline comment as done.Mar 17 2016, 6:55 AM

We do not handle the unscaled + scaled case in this patch, right ?

test/CodeGen/AArch64/arm64-ldp-cluster.ll
20 ↗(On Diff #50929)

I think we can also add a test for zext + sext loads as well.

We do not handle the unscaled + scaled case in this patch, right ?

Correct.

mcrosier updated this revision to Diff 50935.Mar 17 2016, 8:11 AM

Address Jun's comments. Add a test case for zext + sext clustering. As a bonus I also added a test for ldr (without sext).

mcrosier marked an inline comment as done.Mar 17 2016, 8:12 AM
gberry added inline comments.Mar 17 2016, 9:31 AM
lib/Target/AArch64/AArch64InstrInfo.cpp
1502 ↗(On Diff #50935)

Do we always merge zext with sext loads? I would think this would be controlled by the subtarget check we do for combining narrow loads since this combination has the same trade off with load vs arith and increased depency chain?

1519 ↗(On Diff #50935)

Shouldn't we be checking isCandidateToMergeOrPair here to avoid clustering loads that we're not going to pair?

mcrosier added inline comments.Mar 17 2016, 9:39 AM
lib/Target/AArch64/AArch64InstrInfo.cpp
1502 ↗(On Diff #50935)

Yes; these are unconditionally merged in the load/store optimization pass.

1519 ↗(On Diff #50935)

This is the purpose of the canPairLdStOpc() check. It's effectively the pairing logic from the load/store optimizer isCandidateToMergeOrPair() function, but without the merging logic.

gberry added inline comments.Mar 17 2016, 10:33 AM
lib/Target/AArch64/AArch64InstrInfo.cpp
1502 ↗(On Diff #50935)

Okay. It seems like we may want to reconsider this.

1519 ↗(On Diff #50935)

But canPairLdStOpc() is only considering the opcode. It doesn't take into account the other things that may inhibit pairing later such as isLdStSuppressed being true. You may be asking the scheduler to cluster loads that you're not going to end up pariing later.

mcrosier added inline comments.Mar 17 2016, 10:57 AM
lib/Target/AArch64/AArch64InstrInfo.cpp
1519 ↗(On Diff #50935)

Ah, I see what you're saying. Let me see if I can move that function into InstrInfo, so it can be shared.

mcrosier updated this revision to Diff 50970.Mar 17 2016, 1:17 PM

Use the isCandidateToMergeOrPair() directly in shouldClusterLoads. This prevents us from clustering loads that can't possibly be paired.

Interestingly, this fix didn't change code generation for SPEC200X. My assumption is the scheduler doesn't really touch volatiles or MI's without MMOs and the load/store suppression only applies to stores (which we don't currently cluster). The remaining check is to make sure the load/store doesn't modify the base register and I'm guessing that rarely/never happens. Regardless, I agree this is the correct thing to do.

mcrosier marked 4 inline comments as done.Mar 17 2016, 1:18 PM
mcrosier added inline comments.
lib/Target/AArch64/AArch64InstrInfo.cpp
1528 ↗(On Diff #50970)

Shall I file a bug or something?

This looks good to me now

lib/Target/AArch64/AArch64InstrInfo.cpp
1528 ↗(On Diff #50970)

Sure. I think we should track this along with the narrow load merging issue.

mcrosier accepted this revision.Mar 18 2016, 9:40 AM
mcrosier added a reviewer: mcrosier.

Per Geoff's comment.

This revision is now accepted and ready to land.Mar 18 2016, 9:40 AM
This revision was automatically updated to reflect the committed changes.