This is an archive of the discontinued LLVM Phabricator instance.

[MachineScheduler] Add support for store clustering
ClosedPublic

Authored by junbuml on Mar 22 2016, 2:11 PM.

Details

Summary

Perform store clustering just like load clustering. This change add
StoreClusterMutation in machine-scheduler. To control StoreClusterMutation,
added enableClusterStores() in TargetInstrInfo.h. This is enabled only on AArch64
for now.

This change also add support for unscaled stores which were not handled in getMemOpBaseRegImmOfs().

Diff Detail

Event Timeline

junbuml updated this revision to Diff 51337.Mar 22 2016, 2:11 PM
junbuml retitled this revision from to [AArch64]Add support for store clustering in machine-scheduler.
junbuml updated this object.
junbuml added a subscriber: llvm-commits.

Looks like an obvious change to me. It is so obvious that I wonder why it wasn't added in the first place, maybe there is a reason to it?

include/llvm/Target/TargetInstrInfo.h
981–989

Would it make sense to simply rename "xxxClusterLoads" to "xxxClusterMemOps" instead of adding an extra interface? The target can still control whether loads or stores are clustered in getMemOpBaseRegImmOfs().

The same is true for a lot of the following code which would need the differentiation between loads and stores anymore.

junbuml retitled this revision from [AArch64]Add support for store clustering in machine-scheduler to [MachineScheduler] Add support for store clustering.Mar 24 2016, 10:15 AM
junbuml added a reviewer: arsenm.
junbuml added inline comments.
include/llvm/Target/TargetInstrInfo.h
981–989

I agree of using MemOps, instead of using load/store mixed.

Removing enableClusterStores() means that we enable this StoreMutation by default in other targets if getMemOpBaseRegImmOfs already support stores. We should extend reviewers and tests in other targets. As of now, I can see enableClusterLoads() is also enabled in :

AMDGPUInstrInfo.cpp.

I compared llvm stats with/without this change for Spec2006. Overall, this patch increases the number of ld/st pairs by about 3% for Spec2006. I didn't see any serious regression in other statistics (see below).

Please see the full summary of llvm stats diff collected for spec2006 without and with this patch. For example, the third stat indicates 1495 (or ~3%) more ld/st pairs are generated with this patch :

    2 (0.06%) aarch64-copyelim - Number of copies removed.                        
   38 (0.81%) aarch64-ldst-opt - Number of load/store from unscaled generated     
 1495 (3.00%) aarch64-ldst-opt - Number of load/store pair instructions generated 
   28 (1.14%) aarch64-ldst-opt - Number of post-index updates folded              
-1661 (-0.06%) asm-printer - Number of machine instrs printed                      
-6056 (-0.01%) assembler - Number of emitted object file bytes                     
   67 (0.01%) assembler - Number of evaluated fixups                              
  -42 (-0.13%) branchfolding - Number of block tails merged                        
    3 (0.00%) branchfolding - Number of branches optimized                        
  -32 (-0.055%) branchfolding - Number of dead blocks removed                       
   -7 (-0.69%) branchfolding - Number of times common instructions are hoisted     
    2 (0.02%) codegen-cp - Number of dead copies deleted                          
    3 (1.69%) machine-licm - Number of machine instructions hoisted out of loops post regalloc
   67 (0.01%) mccodeemitter - Number of MC fixups created.                        
-1661 (-0.06%) mccodeemitter - Number of MC instructions emitted.                  
   93 (0.00%) mcexpr - Number of MCExpr evaluations                               
  112 (0.00%) pei - Number of bytes used for stack in all functions               
 -491 (-0.54%) regalloc - Number of copies inserted for splitting                  
   15 (0.19%) regalloc - Number of hoisted spills                                 
 -103 (-0.03%) regalloc - Number of identity moves eliminated after rewriting      
    5 (0.02%) regalloc - Number of interferences evicted                          
   -2 (-0.23%) regalloc - Number of live ranges fractured by DCE                   
 -127 (-0.07%) regalloc - Number of new live ranges queued                         
   -7 (-0.58%) regalloc - Number of omitted spills of reloads                      
 -128 (-0.01%) regalloc - Number of registers assigned                             
  -21 (-0.05%) regalloc - Number of registers unassigned                           
   46 (0.12%) regalloc - Number of reloads inserted                               
   -3 (-0.41%) regalloc - Number of reloads removed                                
   -2 (-0.01%) regalloc - Number of rematerialized defs for spilling               
   -6 (-0.02%) regalloc - Number of spilled live ranges                            
   -7 (-1.23%) regalloc - Number of spilled snippets                               
  -31 (-0.13%) regalloc - Number of spills inserted                                
   -3 (-0.07%) regalloc - Number of spills removed                                 
   -5 (-0.04%) regalloc - Number of split global live ranges                       
   -5 (-0.03%) regalloc - Number of splits finished                                
   -3 (-0.02%) regalloc - Number of splits that were simple                        
  264 (0.61%) slotindexes - Number of local renumberings                          
   -2 (-0.07%) stackslotcoloring - Number of stack slots eliminated due to coloring
  -13 (-0.00%) tailduplication - Additional instructions due to tail duplication   
   -7 (-0.29%) tailduplication - Number of dead blocks removed                     
    3 (0.05%) tailduplication - Number of tail duplicated blocks                  
   -5 (-0.13%) tailduplication - Number of tails duplicated

I also think this is reasonable change.

lib/CodeGen/MachineScheduler.cpp
1380

IsLoad?

junbuml updated this revision to Diff 51661.Mar 25 2016, 11:00 AM

Addressed comments from comment Mattias and Junmo.

junbuml updated this object.Mar 25 2016, 11:02 AM
junbuml edited edge metadata.
mcrosier accepted this revision.Mar 31 2016, 6:29 AM
mcrosier edited edge metadata.

LGTM under the assumption this does not change codegen for the AMDGPU targets.

This revision is now accepted and ready to land.Mar 31 2016, 6:29 AM
junbuml updated this revision to Diff 52257.Mar 31 2016, 12:10 PM
junbuml updated this object.
junbuml edited edge metadata.

Just minor renaming and clean up.

As enableClusterMemOps() returns true in AMDGPUInstrInfo.cpp, I still want to wait until this patch is reviewed by reviewers from AMDGPU.

Kindly ping.
We need to confirm if this change doesn't have any negative impact on AMDGPU. Or, I can enable this only on AArch64 by adding enableClusterStores() in TargetInstrInfo.h.

junbuml updated this revision to Diff 53278.Apr 11 2016, 10:53 AM

Enabled this only on AArch64 for now by adding enableClusterStores().

Tom,
I believe you can easily enable this for AMDGPU later with proper tests as a separate patch.

junbuml updated this object.Apr 11 2016, 10:55 AM

Ping..
Please let me know if this is still LGTM.

atrick edited edge metadata.Apr 13 2016, 9:42 AM

This is fine with me.

If there was some reason I didn't do this it was specific to the microarchitecture and downstream optimizations at that time. As long as we get good results from this change I have no objection.

This is fine with me.

If there was some reason I didn't do this it was specific to the microarchitecture and downstream optimizations at that time. As long as we get good results from this change I have no objection.

Thanks, Andy.

I think the last bit that is needed is for @arsenm or @tstellarAMD to approve the patch as this may impact the AMD target.

Please note that in my last change, I turned it on only in AArch64 by separating enableClusterStores() and enableClusterLoads(). enableClusterStores() is off in AMD by default.

Please note that in my last change, I turned it on only in AArch64 by separating enableClusterStores() and enableClusterLoads(). enableClusterStores() is off in AMD by default.

In that case I think this is fine to commit. Thanks, Jun.

In D18376#399964, @atrick wrote:
This is fine with me.

If there was some reason I didn't do this it was specific to the microarchitecture and downstream optimizations at that time. As >long as we get good results from this change I have no objection.

I didn't see clear performance gains in score with this change in my spec2006 test. However, with this change I observed overall better llvm stats; please see my previous comments showing differences of llvm stats for spec2006.

I'm going to commit this if there is any objection by tomorrow.

junbuml closed this revision.Apr 15 2016, 8:06 AM

Landed in r266437.
Thanks for the reviews !