The buildSchedGraph() was in need of reworking as the AA features had been added on top of earlier code. It was very difficult to understand, and buggy. There had been found cases where scheduling dependencies had actually been missed (see r228686). AliasChain, RejectMemNodes, adjustChainDeps() and iterateChainSucc() have been removed. There are instead now just the four maps from Value to SUs, which have been renamed to Stores, Loads, NonAliasStores and NonAliasLoads. An unknown store used to become the AliasChain, but now becomes a store mapped to 'unknownValue' (in Stores). What used to be PendingLoads is instead the list of SUs mapped to 'unknownValue' in Loads. RejectMemNodes and adjustChainDeps() used to be a safety-net for everything. The SU maps were sometimes cleared and SUs were put in RejectMemNodes, where adjustChainDeps() would look. Instead of this, a more straight forward approach is used in maintaining the SU maps without clearing them and simply letting them grow over time. Instead of the cutt-off in adjustChainDeps() search, a reduction of maps will be done if needed (see below). Each SUnit either becomes the BarrierChain, or is put into one of the maps. For each SUnit encountered, all the information about previous ones are still available until a new BarrierChain is set, at which point the maps are cleared. For huge regions, the algorithm becomes slow, therefore the maps will get reduced at a threshold (default 1000 nodes), by a fraction (default 1/2). These values can be tuned by use of CL options in case some test case shows that they need to be changed for some target (-dag-maps-huge-region and -dag-maps-reduction-size). There has not been any considerable change observed in output quality or compile time. There may now be more DAG edges inserted than before (i.e. if A->B->C, then A->C is not needed). However, in a comparison run there were fewer total calls to AA, and a somewhat improved compile time, which means this seems to be not a problem.
Details
- Reviewers
• tstellarAMD atrick hfinkel
Diff Detail
Event Timeline
This looks like a great improvement except for all the data structure boilerplate in the header. I'll take it on your word and Hal's review that the functionality is preserved and that compile time is positively impacted.
include/llvm/CodeGen/ScheduleDAGInstrs.h | ||
---|---|---|
167 | I don't really understand the value of wrapping the std::list API. Can't you just use a typedef or expose the underlying list? | |
171 | Uppercase variable name. |
You're right, the addChainDependencies() method that called sus.getTrueMemOrderLatency() was no longer used, so I could remove the TrueMemOrderLatency member, and then also change SUList to a typedef, as you suggested.
This will hopefully change in the future, if there is a better implementation of SUList - I added a TODO in the comment on this.
This implementation of buildSchedGraph() was faster last time I checked on *my* machine and *my* test-suite. Of course, the main improvement was meant to be on readability and simplicity.
Thanks again for working on this; I few minor comments below.
I ran through the test suite and self-hosting with this patch applied; all tests passed and I saw no significant performance changes, so I'm happy on that front!
include/llvm/CodeGen/ScheduleDAGInstrs.h | ||
---|---|---|
163 | This comment is confusing. "this SU" is that barrier chain? | |
190 | This needs to be: llvm_unreachable("Don't use"); (or, preferably, some more informative message) | |
lib/CodeGen/ScheduleDAGInstrs.cpp | ||
914 | Please make this a cl::opt command-line parameter. | |
918 | This too (a command-line parameter) -- having it default to HugeRegion/2 likely makes sense. (You can use the getNumOccurrences() function on a cl::opt to see if it has actually been set or not). | |
1110 | Should this be >=? We can add multiple dependencies at a time, so we might miss an exact threshold crossing? |
Fixes according to Hal Finkels suggestions.
Bugfixing in reduceHugeMemNodeMaps() (Some details had gotten lost in the revision process):
After a reduction of maps: Don't forget to recompute NumNodes. Don't forget to add dep to old BarrierChain. Check for potential loop with newBarrierChain.
Andy, have your concerns been addressed at this point?
include/llvm/CodeGen/PseudoSourceValue.h | ||
---|---|---|
24 | Line is too long? |
Unless it can be shown by measuring compile time that the implementation needs to be defined in the header, I would:
(1) define insertBarrierChain in the .cpp file
(2) Forward declare and define Value2SUsMap and methods that call it in the .cpp file.
Patch rebased.
Some PowerPC regressions:
ppc64-fastcc.ll (A copy of an argument now became coalesced)
vsx-fma-sp.ll and vsx-fma-m.ll (not sure about)
I didn't mean to hold this up! I really have to say this is so much easier for me to understand than the previous code. And it's not just a good cleanup but an important bug fix (in the !AADep areMemAccessTriviallyDisjoint case where the Stores queue is cleared) and a performance improvement (FIFO draining of dependencies).
There's one thing that's a little scary to me. We assume that every NonAliasing load or store must be analyzable by getUnderlyingObjects:
+ if (Objs.empty()) {
+ An unknown store depends on all stores and loads, except
+ NonAliasStores and NonAliasLoads.
+ addChainDependencies(SU, Stores);
+ addChainDependencies(SU, Loads);
and
+ if (Objs.empty()) {
+ // An unknown load depends on all stores, except NonAliasStores.
+ addChainDependencies(SU, Stores);
If we fail to return an underlying object--for example:
+ if (MFI->hasTailCall())
+ return;
Then how can we be sure that load or store doesn't alias with the "NonAliasing" set?
Can you or Hal comment on why that is safe. I can't prove to myself that it's correct, so I think at least a comment would be nice.
Other than that, if this still passes all target tests, please go ahead and check it in.
lib/CodeGen/ScheduleDAGInstrs.cpp | ||
---|---|---|
1022–1029 | 80-column |
Andy,
I think that when looking for underlying objects, objects will only be considered to not alias only if PSV->mayAlias(MFI) returns false:
if (!PSV->isAliased(MFI)) {
bool MayAlias = PSV->mayAlias(MFI); Objects.push_back(UnderlyingObjectsVector::value_type(PSV, MayAlias));
}
Any other cases will either return either empty Objects, or aliasing objects. So, if MFI->hasTailCall(), an empty objects list is returned instead, which will make the instruction be treated as unknown store / load.
So I believe this is correct as long as the memory operands are correct, of course.
What do you think?
OK. My concern is this:
Two stores both access the same nonaliased location.
One store has it's memory operand stripped for some reason--we don't have any guarantee that memoperands are preserved.
The scheduler now assumes the two stores are independent.
Aha - I guess you are right that if we can't assume that the register allocator is correctly adding memory operands for spill instructions, then we can't safely separate the Alias / NonAlias accesses.
To me, it makes sense to have this assumption, as it is something that AFAIK should work for all targets since the register allocator is already correctly adding those memory operands everywhere. But of course, there is the case of e.g. expanding a spill instruction and forgetting to add a memory operand...
Having memory operands is important for scheduling, so it is natural for targets to want to add them. Perhaps there is a way to enforce it? On a target I worked on, there was seperate opcodes for spill instrucions, which made this easy to verify.
I guess it should work to have just one set of Stores / Loads, and insert all SUs there. The underlying object of stack accesses should still make them "no-alias" to LLVM Values. I kept this design of splitting Alias / NonAlias because I assumed it was important for performance. Merging the sets would give a bigger map, but also easier to read code. It might be worth a try...
/Jonas
I think we need to be conservative in the event that mem_operands are missing, since we haven't made that guarantee--and that's a separate discussion. For example, non-aliasing operations might exist prior to register allocation to handle calling convention (I'm not sure if this is purely theoretical without running tests on all targets).
You can probably be just as aggressive when the instruction has mem_operands. You may just need to distinguish between a load/store without a single mem_operand vs. one without a reachable underlying object. i.e. if it the Store has a mem_operand and is known not to be a PseudoSourceValue, but has a null Value, then your implementation should still be fine.
OK - I have now changed the patch so it does not assume that unanalyzable stores / loads are never NonAlias.
I guess what is needed is for the scheduler to print out a line each time there is a missing mem-operand, as it would then be simple to check that they are not forgotten somewhere. When that is done, this change should not matter.
diff --git a/lib/CodeGen/ScheduleDAGInstrs.cpp b/lib/CodeGen/ScheduleDAGInstrs.cpp
index d6a72b5..c719c78 100644
- a/lib/CodeGen/ScheduleDAGInstrs.cpp
+++ b/lib/CodeGen/ScheduleDAGInstrs.cpp
@@ -1030,10 +1030,11 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,
if (MI->mayStore()) { if (Objs.empty()) {
- // An unknown store depends on all stores and loads, except
- // NonAliasStores and NonAliasLoads.
+ // An unknown store depends on all stores and loads.
addChainDependencies(SU, Stores);
+ addChainDependencies(SU, NonAliasStores);
addChainDependencies(SU, Loads);
+ addChainDependencies(SU, NonAliasLoads);
@@ -1067,17 +1068,16 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,
- if (MayAlias) {
- // The store is not 'NonAlias', and may therefore have
- // dependencies to unanalyzable loads and stores.
- addChainDependencies(SU, Loads, UnknownValue);
- addChainDependencies(SU, Stores, UnknownValue);
- }
+ The store may have dependencies to unanalyzable loads and
+ stores.
+ addChainDependencies(SU, Loads, UnknownValue);
+ addChainDependencies(SU, Stores, UnknownValue);
} else { // SU is a load. if (Objs.empty()) {
- // An unknown load depends on all stores, except NonAliasStores.
+ // An unknown load depends on all stores.
addChainDependencies(SU, Stores);
+ addChainDependencies(SU, NonAliasStores);
Loads.insert(SU, UnknownValue); continue;
@@ -1097,10 +1097,8 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,
// Map this load to V. (ThisMayAlias ? Loads : NonAliasLoads).insert(SU, V); }
- if (MayAlias)
- // The load is not 'NonAlias', and may therefore have
- // dependencies to unanalyzable stores.
- addChainDependencies(SU, Stores, UnknownValue);
+ // The load may have dependencies to unanalyzable stores.
+ addChainDependencies(SU, Stores, UnknownValue);
}
Removed unused MayAlias variables.
Note: https://llvm.org/bugs/show_bug.cgi?id=25794 is an example of a huge region not handled well by the current scheduler. This reminds me that we might want to find good values for -dag-maps-huge-region. Default is now 1000, which is very high. I think 50-100 might actually work well generally, but this I have only tested on an out-of-tree target. Regardless, this patch was much faster already due to the reduction of maps, at least in this case.
.
Jonas,
I just tried this patch on a known-good LLVM+Clang, and it breaks generated code in a number of benchmark suite for the Cortex-A53 and Cortex-A57 targets:
- llvm-test-suite/MultiSource/Benchmarks
- spec2006: mcf
- cpu2000: mcf
I won't have time to investigate further in the next 2 weeks.
Thanks for giving it a try!
This patch has been tested on quite an extensive test-suite out-of-tree, and I would therefore say that it isn't impossible that a scheduler difference might expose bugs elsewhere than in the buildSchedGraph() method. Therefore I do think we have to testing on all targets before this is committed, as things are going to be scrambled around quite a bit. While doing so, we could tune the parameters for huge region reduction as well as check the overall compile time, preferably.
Thanks for following through with this and getting it tested. I think this fix should go in even if there are some regressions or fragile scheduling tests that need to be temporarily disabled. There aren't any miscompiles right?
OK, submitted as r259201.
Did not see any regressions that called for disabling any tests at least on my machine.
As mentioned before:
- The default value of 1000 for the -dag-maps-huge-region option is very high ("unlimited"). It may very well be that 50 is more reasonable if compile time is precious.
- The other thing that could be done to speed things up is to use a a memory pool for the SUList class - see comment in ScheduleDAGInstrs.h.
Sorry for the buildbot regressions. I could find one bug in the patch, but for the other four I am not sure if it is the test cases that needs to be fixed, or if there is an error somewhere. Could please people with knowledge of these backends have a look.
- test/CodeGen/AArch64/arm64-misched-memdep-bug.ll:
bug in patch for the case of !AADeps: Stores get cleared, and must always be chained, also against a load.
Test case updated to allow edges SU(2)<-SU(3)<-SU(4) instead of SU(2)<-SU(4)
The fix in MISNeedChainEdge() is:
if (!AA && MIa->mayStore() && MIb->mayStore())
>
if (!AA && MIb->mayStore())
- /home/jonas/llvm/llvm-dev/test/CodeGen/AMDGPU/split-vector-memoperand-offsets.ll
Don't know if this is a bug or test case needs fixing. Two read / write instructions have been reordered.
- /home/jonas/llvm/llvm-dev/test/CodeGen/PowerPC/ppc64-fastcc.ll
Looks like a change in register allocation. Test case updated with different physical registers
Looks like a change in register allocation:
/home/jonas/llvm/llvm-dev/test/CodeGen/PowerPC/vsx-fma-m.ll:
/home/jonas/llvm/llvm-dev/test/CodeGen/PowerPC/vsx-fma-sp.ll:95:14:
test/CodeGen/AArch64/arm64-misched-memdep-bug.ll | ||
---|---|---|
11–14 | I'm not sure I follow the reasoning behind this change. It seems like this is a regression. The 3->4 edge (load %ptr1_plus1 -> store %ptr1) should be unnecessary since areMemAccessTriviallyDisjoint can tell you these two accesses don't overlap. |
test/CodeGen/AArch64/arm64-misched-memdep-bug.ll | ||
---|---|---|
11–14 | I believe this would be the case *with* AliasAnalysis. However, without AA, all stores to the same Value get chained, and only the last seen store is kept for future reference. Therefore, it could never be legal to skip a dependency to a store to same value, since there may be other stores chained below it that will not be checked. I believe it is default for this target/cpu to not use AA, since when I debugged this test case, AAForDep was nullptr. |
lib/CodeGen/ScheduleDAGInstrs.cpp | ||
---|---|---|
1062 | This is indented weird (tabs?) | |
test/CodeGen/AArch64/arm64-misched-memdep-bug.ll | ||
11–14 | I'm not sure what you mean by stores to the same Value (since there aren't any in this example). I kind of get what you're saying about the the non-AA case, but isn't the point of areMemAccessTriviallyDisjoint() to have a cheap approximation of AA? It seems like with this change it only has an effect if AA is being used, in which case is it really adding any extra disambiguation? |
Considering the nasty bugs we've had in tree for months (as a result of some attempts to fake AA without enabling it), and based on the discussion today, I can see that it's time to converge the AA and non-AA scheduling. The --enable-aa-sched-mi flag and ST.useAA hook can simply control whether we actually call the AA analysis. That way targets can still isolate themselves from potential bugs in MI-level AliasAnalysis, but the DAG builder logic will be common across targets and free of special cases.
As a result, non-AA targets may have to deal with bloated DAGs. But AA targets have to deal with the same problem. We can deal with that through bug reports, all work toward a common solution, and encourage more targets to adopt AA scheduling.
This may contradict my earlier advice, but now I can see that it's the only way to get this code in a maintainable state. I also realize that targets that care most about the scheduler's compile time (GPU) really need the same level of DAG support for alias analysis as targets that have been using AA already. We might as well have a common solution.
So my suggestion for this patch is:
- a/lib/CodeGen/ScheduleDAGInstrs.cpp
+++ b/lib/CodeGen/ScheduleDAGInstrs.cpp
@@ -588,11 +588,6 @@ static bool MIsNeedChainEdge(AliasAnalysis *AA, const MachineFrameInfo *MFI,
assert ((MIa->mayStore() || MIb->mayStore()) && "Dependency checked between two loads");
- // buildSchedGraph() will clear list of stores if not using AA,
- // which means all stores have to be chained without AA.
- if (!AA && MIb->mayStore())
- return true; - // Let the target decide if memory accesses cannot possibly overlap. if (TII->areMemAccessesTriviallyDisjoint(MIa, MIb, AA)) return false;
@@ -1059,11 +1054,6 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,
addChainDependencies(SU, Loads); addChainDependencies(SU, NonAliasLoads);
- // If we're not using AA, clear Stores map since all stores
- // will be chained.
- if (!AAForDep)
- Stores.clear(); - // Map this store to 'UnknownValue'. Stores.insert(SU, UnknownValue); continue;
@@ -1081,10 +1071,6 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,
addChainDependencies(SU, stores_, V); addChainDependencies(SU, (ThisMayAlias ? Loads : NonAliasLoads), V);
- // If we're not using AA, then we only need one store per object.
- if (!AAForDep)
- stores_.clearList(V); - // Map this store to V. stores_.insert(SU, V); }
diff --git a/test/CodeGen/AArch64/arm64-misched-memdep-bug.ll b/test/CodeGen/AArch64/arm64-misched-memdep-bug.ll
index a9ddb73..292fbb7 100644
- a/test/CodeGen/AArch64/arm64-misched-memdep-bug.ll
+++ b/test/CodeGen/AArch64/arm64-misched-memdep-bug.ll
@@ -8,7 +8,7 @@
; CHECK: SU(2): %vreg2<def> = LDRWui %vreg0, 1; mem:LD4[%ptr1_plus1] GPR32:%vreg2 GPR64common:%vreg0
; CHECK: Successors:
; CHECK-NEXT: val SU(5): Latency=4 Reg=%vreg2
-; CHECK-NEXT: ch SU(3): Latency=0
+; CHECK-NEXT: ch SU(4): Latency=0
; CHECK: SU(3): STRWui %WZR, %vreg0, 0; mem:ST4[%ptr1] GPR64common:%vreg0
; CHECK: Successors:
; CHECK: ch SU(4): Latency=0
diff --git a/test/CodeGen/AArch64/tailcall_misched_graph.ll b/test/CodeGen/AArch64/tailcall_misched_graph.ll
index 343ffab..59a3be9 100644
- a/test/CodeGen/AArch64/tailcall_misched_graph.ll
+++ b/test/CodeGen/AArch64/tailcall_misched_graph.ll
@@ -37,6 +37,8 @@ declare void @callee2(i8*, i8*, i8*, i8*, i8*,
; CHECK: SU({{.*}}): [[VRB]]<def> = LDRXui <fi#-2>
; CHECK-NOT: SU
; CHECK: Successors:
-; CHECK: ch SU([[DEPSTORE:.*]]): Latency=0
+; CHECK: ch SU([[DEPSTOREB:.*]]): Latency=0
+; CHECK: ch SU([[DEPSTOREA:.*]]): Latency=0
-; CHECK: SU([[DEPSTORE]]): STRXui %vreg0, <fi#-4>
+; CHECK: SU([[DEPSTOREA]]): STRXui %vreg{{.*}}, <fi#-4>
+; CHECK: SU([[DEPSTOREB]]): STRXui %vreg{{.*}}, <fi#-3>
Patch updated by removing the handling for !AA, as suggested. To
me this is a welcome change - it was awkward to chain stores and
then handle them especially for the !AA case.
The following tests have been disabled temporarily. Someone
working with these backends, please take a look before I commit
the patch.
Failing Tests (4):
LLVM :: CodeGen/AMDGPU/split-vector-memoperand-offsets.ll LLVM :: CodeGen/PowerPC/ppc64-fastcc.ll LLVM :: CodeGen/PowerPC/vsx-fma-m.ll LLVM :: CodeGen/PowerPC/vsx-fma-sp.ll
I tried the test-suite on my laptop and it seems to pass. I have
no idea what would be a preferred value for -dag-maps-huge-region
for various targets, but regarding compile time value is making a
difference, as seen below (X86).
Results (-dag-maps-huge-region = 1000)
IMPROVED : 218
REGRESSED : 294
PASS : 1480
Results (-dag-maps-huge-region = 50)
IMPROVED : 290
REGRESSED : 208
PASS : 1494
Results (-dag-maps-huge-region = 20)
IMPROVED : 282
REGRESSED : 209
PASS : 1501
I guess one would have to both measure compile time and run
benchmarks to find a good compromise value. Since (sub)targets
are divided in interest regarding performance / compile time, it
perhaps would make sense to let the subtarget define this value,
with the CodeModel in mind (consider JIT). But I guess this might
lie a bit in the future, after getting the patch in...
test/CodeGen/PowerPC/vsx-fma-m.ll | ||
---|---|---|
3 | You can disable tests by just adding: ; XFAIL: * you don't need to change the run lines. It looks like we'll just need to fixup the register numbers in some of these. |
lib/CodeGen/ScheduleDAGInstrs.cpp | ||
---|---|---|
851 | SUItr might point SUEE here. Fixed in r268257. |
Line is too long?