Page MenuHomePhabricator

[regalloc] Ensure Query::collectInterferringVregs is called before interval iteration
ClosedPublic

Authored by mtrofin on Mar 8 2021, 8:56 PM.

Details

Summary

The main part of the patch is the change in RegAllocGreedy.cpp: Q.collectInterferringVregs() needs to be called before iterating the interfering live ranges.

The rest of the patch offers support that is the case: instead of clearing the query's InterferingVRegs field, we invalidate it. The clearing happens when the live reg matrix is invalidated (existing triggering mechanism).

Without the change in RegAllocGreedy.cpp, the compiler ices.

This patch should make it more easily discoverable by developers that collectInterferringVregs needs to be called before iterating.

I will follow up with a subsequent patch to improve the usability and maintainability of Query.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dmgreen added inline comments.Mar 10 2021, 11:36 AM
llvm/test/CodeGen/AArch64/ragreedy-local-interval-cost.ll
66

My understanding is that this test comes from a fairly standard matrix multiply, by the way:
https://godbolt.org/z/Ej5sb8

It is a shame to break such an obvious case.

I have seen some other cases under Arm where we hit cascading eviction chains like this too - something that consider-local-interval-cost didn't help with. It would be great to see a more reliable fix for that, I look forward to seeing it.

qcolombet added inline comments.Mar 10 2021, 11:54 AM
llvm/test/CodeGen/AArch64/ragreedy-local-interval-cost.ll
66

I agree the eviction chains detection is not fool proof.

What Aditya (and I to some extend) worked on showed that we should be able to eliminate the need to detect the eviction chain detection all together.

The approach we took is just some sort of copy rewriting as described in https://dl.acm.org/doi/10.1145/1811212.1811214, though our implementation is less thorough than this paper :).

mtrofin marked an inline comment as done.Mar 10 2021, 12:03 PM
mtrofin added inline comments.
llvm/include/llvm/CodeGen/LiveIntervalUnion.h
117

That's correct. This patch could land without the Optional, and/or I anyway intend to follow up with a change that avoids this level of API statefulness to avoid others falling into the pitfall.

llvm/test/CodeGen/AArch64/ragreedy-local-interval-cost.ll
66

@qcolombet: by "the compiler ices" I meant that, with the current patch (use of Optional), but without the Q.collectInterferingVRegs() call in the patch, the compiler crashes, supporting the claim that "the API should have been called" (i.e. we were in the past operating on stale information... at least in some cases, because removing the loop right under it leads to a set of tests, disjoint from the ones patched here, that fail)

@dmgreen, all - my hope was that someone could help with fixing the regression (I think that was the goal of https://reviews.llvm.org/D35816). I have little context into that. Otherwise it'll take me a bit of time to ramp up on that issue and propose a solution.

Question is, are we OK with the bug (==the code sometimes works off stale info) meanwhile?

mtrofin updated this revision to Diff 330556.Sun, Mar 14, 10:11 PM
mtrofin marked an inline comment as done.

avoid regression

mtrofin added inline comments.Sun, Mar 14, 11:37 PM
llvm/test/CodeGen/AArch64/ragreedy-local-interval-cost.ll
66

@dmgreen , all - I dug a bit deeper into the regression. PTAL the diff in RegAllocGreedy.cpp, deleted lines at line 1559. The call to checkInterference on line 1555 causes the reset-ing of the earlier-calculated interference collections, rendering the deleted code *almost* dead.

Almost, because the effect of calling query in canEvictInterferenceInRange seems to have an effect - see the i128-mul.ll diff. I'll track that down.

lkail added a subscriber: lkail.Sun, Mar 14, 11:58 PM
dmgreen added inline comments.Mon, Mar 15, 5:54 AM
llvm/test/CodeGen/AArch64/ragreedy-local-interval-cost.ll
66

I've never looked very deeply into the register allocator I'm afraid. It's still a bit of a mystery to me.

I ran some benchmarks on the new code though - none of them changed very much which is a good sign.

mtrofin updated this revision to Diff 330850.Mon, Mar 15, 5:06 PM

no regressions

mtrofin marked an inline comment as done.Mon, Mar 15, 5:09 PM
mtrofin added inline comments.
llvm/test/CodeGen/AArch64/ragreedy-local-interval-cost.ll
66

Tracked down the source of the last difference. @dmgreen @qcolombet - PTAL.

I will work on simplifying the Query API next, to help avoid the types of pitfalls identified in this patch.

mtrofin updated this revision to Diff 330852.Mon, Mar 15, 5:29 PM
mtrofin marked an inline comment as done.

reworded the comment in LiveRangeMatrix.cpp for more clarity (hopefully)

Harbormaster completed remote builds in B93951: Diff 330850.

If there's no pushback, given the previous lgtm (and the subsequent removal of regressions), I'll go ahead with submitting this.

I don't feel like I know this code well enough to be confident LGTM'ing it, but the testing I ran still look good. Almost no changes. Thanks for working on the regression!

This revision was landed with ongoing or failed builds.Tue, Mar 16, 12:19 PM
This revision was automatically updated to reflect the committed changes.
nikic reopened this revision.Tue, Mar 16, 12:45 PM
nikic added a subscriber: nikic.

I've reverted this change because it causes significant compile-time regressions, e.g. >5% on sqlite: https://llvm-compile-time-tracker.com/compare.php?from=0aa637b2037d882ddf7861284169abf63f524677&to=d40b4911bd9aca0573752e065f29ddd9aff280e1&stat=instructions I'm assuming a regression of that size wasn't intentional here.

This revision is now accepted and ready to land.Tue, Mar 16, 12:45 PM

I've reverted this change because it causes significant compile-time regressions, e.g. >5% on sqlite: https://llvm-compile-time-tracker.com/compare.php?from=0aa637b2037d882ddf7861284169abf63f524677&to=d40b4911bd9aca0573752e065f29ddd9aff280e1&stat=instructions I'm assuming a regression of that size wasn't intentional here.

It may be an effect of the code actually performing correctly (see the CL description). I'll confirm that is the case.

I've reverted this change because it causes significant compile-time regressions, e.g. >5% on sqlite: https://llvm-compile-time-tracker.com/compare.php?from=0aa637b2037d882ddf7861284169abf63f524677&to=d40b4911bd9aca0573752e065f29ddd9aff280e1&stat=instructions I'm assuming a regression of that size wasn't intentional here.

It may be an effect of the code actually performing correctly (see the CL description). I'll confirm that is the case.

Yup - the increased time in the regalloc is due to the added call to collectInterferringVRegs. A quick verification is to just add the call in the 'old' code, which leads to a similar timing effect.

This is intentional - the fix is a correctness issue (without the fix, the query is actually stale).

@nikic - any pushback to reapplying the patch?

nikic requested changes to this revision.Tue, Mar 16, 2:16 PM

Unfortunately I'm not familiar with RegAlloc, but I somehow doubt that this kind of compile-time hit is intrinsically necessary to address this issue.

This is also missing a test case.

This revision now requires changes to proceed.Tue, Mar 16, 2:16 PM

Unfortunately I'm not familiar with RegAlloc, but I somehow doubt that this kind of compile-time hit is intrinsically necessary to address this issue.

The problem is that the previous use of the APIs was incorrect - the code before was sometimes using stale data, because it was incorrectly not calling collectInterferingVRegs. The compile time regression is reflective of the effort required to refresh the data.

I think we have a situation like this:

  • (status quo) no compile time regression, but incorrect functionality
  • (patch) compile time regression, correct functionality
  • remove patch that originally introduced this functionality (https://reviews.llvm.org/D35816), but then there are cases where we hit regressions in code quality
  • <options that I am not aware of - actionable suggestions very welcome!>

What are the guidelines regarding compile time regressions (e.g. what constitutes acceptable; what is the tradeoff hierarchy, e.g. in this case, buggy code)?

This is also missing a test case.

That's reasonable. I'll craft one up while we explore the options here.

I agree with @mtrofin's assessment, the previous code was incorrect and if we have a compile time hit, at least short term, I think this is the right thing to do.

nikic added a comment.Tue, Mar 16, 3:03 PM

Unfortunately I'm not familiar with RegAlloc, but I somehow doubt that this kind of compile-time hit is intrinsically necessary to address this issue.

The problem is that the previous use of the APIs was incorrect - the code before was sometimes using stale data, because it was incorrectly not calling collectInterferingVRegs. The compile time regression is reflective of the effort required to refresh the data.

I think we have a situation like this:

  • (status quo) no compile time regression, but incorrect functionality
  • (patch) compile time regression, correct functionality
  • remove patch that originally introduced this functionality (https://reviews.llvm.org/D35816), but then there are cases where we hit regressions in code quality
  • <options that I am not aware of - actionable suggestions very welcome!>

The best option would be to implement this without the compile-time impact or with low impact -- it's probably possible, but may require a larger time investment. If you don't have the resources for that, then please evaluate the third option. On the assumption that disabling this code only makes for a minor code quality regression, that would be preferred over a large compile-time regression. Of course, if disabling this makes for a large code quality regression, this becomes harder to answer.

Unfortunately I'm not familiar with RegAlloc, but I somehow doubt that this kind of compile-time hit is intrinsically necessary to address this issue.

The problem is that the previous use of the APIs was incorrect - the code before was sometimes using stale data, because it was incorrectly not calling collectInterferingVRegs. The compile time regression is reflective of the effort required to refresh the data.

I think we have a situation like this:

  • (status quo) no compile time regression, but incorrect functionality
  • (patch) compile time regression, correct functionality
  • remove patch that originally introduced this functionality (https://reviews.llvm.org/D35816), but then there are cases where we hit regressions in code quality
  • <options that I am not aware of - actionable suggestions very welcome!>

The best option would be to implement this without the compile-time impact or with low impact -- it's probably possible, but may require a larger time investment. If you don't have the resources for that, then please evaluate the third option. On the assumption that disabling this code only makes for a minor code quality regression, that would be preferred over a large compile-time regression. Of course, if disabling this makes for a large code quality regression, this becomes harder to answer.

Looking at the original patch (https://reviews.llvm.org/D35816) it mentions https://bugs.llvm.org/show_bug.cgi?id=26810 which quotes a 25% regression.

nikic added a comment.Tue, Mar 16, 4:15 PM

@mtrofin I'm not really concerned about regressions on individual test cases, but the average impact. I found https://github.com/llvm/llvm-project/commit/f649f24d388c745d20fab5573d27b822b92818ed with some data from when the option was enabled to AArch64, and it looks like this option has essentially zero average impact and only improves a few rare outliers. This was totally fine at the time, because enabling the option was considered to be essentially free compile-time wise. But now it turns out that this is actually not the case, and the small compile-time impact is actually the result of an implementation bug. The change would have never landed at the time if this fact had been known. A zero geomean improvement on run-time is a terrible trade-off for a 2-3% geomean regression on compile-time.

Unless the situation on X86 is significantly different than for AArch64, it seems pretty clear to me that we should just disable the option until it can be implemented in a way that is both correct and sufficiently cheap.

mtrofin added a comment.EditedTue, Mar 16, 6:45 PM

An update on this: I spent a bit of time looking for ways to elide the compile time penalty, and the short of it is, I couldn't find a way.

Unless someone has a idea how to do that (i.e. elide compile time... etc), I propose we go ahead with a "staged option 3":

step 1: fix the code as per patch, but add a flag that disables, by default, the optimization. This means: no compile time regression; possible corner-case regressions in code quality, but (what should be an) easy way to avoid it for those impacted.

step 2: after <not sure what's reasonable here? 1-2 weeks? 1-2 months? 1 release?) assuming the regressions aren't hurting anyone, we remove the optimization code and flag.

What do folks think?

Thanks!

@mtrofin I'm not really concerned about regressions on individual test cases, but the average impact. I found https://github.com/llvm/llvm-project/commit/f649f24d388c745d20fab5573d27b822b92818ed with some data from when the option was enabled to AArch64, and it looks like this option has essentially zero average impact and only improves a few rare outliers. This was totally fine at the time, because enabling the option was considered to be essentially free compile-time wise. But now it turns out that this is actually not the case, and the small compile-time impact is actually the result of an implementation bug. The change would have never landed at the time if this fact had been known. A zero geomean improvement on run-time is a terrible trade-off for a 2-3% geomean regression on compile-time.

Unless the situation on X86 is significantly different than for AArch64, it seems pretty clear to me that we should just disable the option until it can be implemented in a way that is both correct and sufficiently cheap.

I'd let the others weigh in over whether the corner cases are important to them. My goal here is to improve the evolvability and maintainability of the code; the first patch would canonicalize the uses of query, the next one would avoid needing the multi-steps that were the cause of the bug discussed here. That's a long way of saying "option 3 is actually *very* attractive to me" - but I have no skin in the game.

A potential other option may be to hide the current behavior (as fixed here) behind a flag. Then, if anyone actually has an impactful regression, we can dig deeper, while also giving them a way to unblock (by flipping the flag in their build). Otherwise, after some reasonable time (tbd what that means) we can take silence as indication the regressions don't matter anymore, and do option 3.

I'll also spend a bit more time today to look into whether there's anything that can be salvaged easily (i.e. avoid option 3, and avoid compile time regression).

mtrofin updated this revision to Diff 332517.Mon, Mar 22, 8:57 PM

Disabling 'advancedRASplitCost' on x86.

Gentle reminder - I disabled the feature by default on x86 (assuming we're OK with the staged approach).

@sanwou01 , given the compile time overhead on x86 may also affect aarch64, should we disable on aarch64 by default, too?

(sorry if the context requires some digging into this patch's log, happy to provide a summary if that's too daunting)

nikic accepted this revision.Wed, Mar 24, 1:21 PM

The proposed approach is fine from my side at least. Flipping the switch for AArch64 as well would be good.

llvm/lib/CodeGen/RegAllocGreedy.cpp
1063

nit: A 5 snuck in.

llvm/test/CodeGen/X86/i128-mul.ll
2

Don't think regalloc details are important for this test and the two below. Might want to regenerate the output rather than adding the flag for these. (The two tests above specifically test for regalloc behavior.)

This revision is now accepted and ready to land.Wed, Mar 24, 1:21 PM

I'm worried that this comes up in a lot of places. Perhaps rare still, but important cases. The aarch64 example we have is just a matrix multiply, and is 25% slower with all the cascading spills, https://bugs.llvm.org/show_bug.cgi?id=26810 quotes the same. Like I said before though, the option didn't fix some examples of the same thing that we were seeing in ARM, so I'm not sure how reliably better it is.

@aditya_nandakumar @qcolombet any idea when a better fix for that issue might be available?

Could we just make consider-local-interval-cost -O3 only in the meantime? That should alleviate some of the compile time worries, as we have genuine examples of where it is hurting performance.

llvm/test/CodeGen/X86/bug26810.ll
1

If we are enabling this flag by default, we should probably update the tests, not hide them behind a flag.

nikic added a comment.Thu, Mar 25, 2:39 PM

I'm worried that this comes up in a lot of places. Perhaps rare still, but important cases. The aarch64 example we have is just a matrix multiply, and is 25% slower with all the cascading spills, https://bugs.llvm.org/show_bug.cgi?id=26810 quotes the same. Like I said before though, the option didn't fix some examples of the same thing that we were seeing in ARM, so I'm not sure how reliably better it is.

@aditya_nandakumar @qcolombet any idea when a better fix for that issue might be available?

Could we just make consider-local-interval-cost -O3 only in the meantime? That should alleviate some of the compile time worries, as we have genuine examples of where it is hurting performance.

This would be really bad for us, because rust effectively always uses O3, and we expect reasonable compile-time tradeoffs to be made for it as well.

Worth noting that D35816 discussed a number of alternatives to this, one being to handle this in machine copy propagation instead, which should be both much less complex and not have compile-time concerns. The cited disadvantage is that it would only work inside a block. From the examples I've seen long eviction chains inside a single BB seem to be the main problem, so maybe it would be worthwhile to go back to that option. I don't really have a good view on this topic though.

This would be really bad for us, because rust effectively always uses O3, and we expect reasonable compile-time tradeoffs to be made for it as well.

Worth noting that D35816 discussed a number of alternatives to this, one being to handle this in machine copy propagation instead, which should be both much less complex and not have compile-time concerns. The cited disadvantage is that it would only work inside a block. From the examples I've seen long eviction chains inside a single BB seem to be the main problem, so maybe it would be worthwhile to go back to that option. I don't really have a good view on this topic though.

It sounds like -O2 might be a better default for you - as a balance between optimizations and compile time. We have users that care deeply about performance and would be happy to spend extra compile time for it.

I'm not particularly interested in the exact test case here, and the compile time tradeoff does seem high. I would love to see an alternative. But it is just a pretty boring matrix multiply kernel that's going very wrong. If that happens in the inner loop of a ML kernel then that can have a large effect on a lot of people.

Extra compile time is fine for -O3 if it improves a runtime performance of various benchmarks.

3-5% extra compile time for almost no visible perf gain is bad trade off and not only Rust folks would be disapointed.

xbolva00 requested changes to this revision.Sun, Mar 28, 12:39 PM
This revision now requires changes to proceed.Sun, Mar 28, 12:39 PM

Extra compile time is fine for -O3 if it improves a runtime performance of various benchmarks.

3-5% extra compile time for almost no visible perf gain is bad trade off and not only Rust folks would be disapointed.

@xbolva00 - sorry if I'm missing it, but what changes would you like made?

All - note that this patch just changes defaults (well, and fixes an underlying problem).

It sounds like we don't have data on the frequency of the potential code quality regression. Would looking at a suite of benchmarks, including compression and eigen, on x86 (in a thinlto + FDO build, in an isolated environment) help advance the discussion - or what's an alternative?

Thanks!

While i'm personally not very neurotic regarding compile time,
and agree with @dmgreen that -O3 is by design allowed to consume more time,
here i have to agree with @nikic.

5% compile-time regression is a rather significant time investment.
What does it bring? If it's less than 1% in some obscure proprietary benchmark
then it's one thing, if it's a few percent here and there in SPEC it's another.

In fact, i would strongly insist to follow llvm best practices,
revert the original patch that this patch is trying to fix,
fold this fix into the original change, and post that as a new review.

Yeah, I agree. 5% is too much to pay. https://reviews.llvm.org/D69437 measured this option as a 25% speed increase in a something that was important enough to fix, with a 0.1-0.2% compile time effect. That's a very different question.

(What really would like to see - is it would be great if there was someone in the llvm community who cared about O2. Trying to get the best performance for a low compile time, and optimizing that tradeoff. That can do a lot of benefit for a lot of people. This particular issue I have less interest in, It's just a shame to see the compile get things so embarrassingly wrong!)

But this patch has been in tree a long time. I believe best practice is to fix things as-is, not revert and attempt to re-apply after so long.

Yeah, I agree. 5% is too much to pay. https://reviews.llvm.org/D69437 measured this option as a 25% speed increase in a something that was important enough to fix, with a 0.1-0.2% compile time effect. That's a very different question.

(What really would like to see - is it would be great if there was someone in the llvm community who cared about O2. Trying to get the best performance for a low compile time, and optimizing that tradeoff. That can do a lot of benefit for a lot of people. This particular issue I have less interest in, It's just a shame to see the compile get things so embarrassingly wrong!)

But this patch has been in tree a long time. I believe best practice is to fix things as-is, not revert and attempt to re-apply after so long.

Quick update on the code quality side, I ran spec2006, the llvm benchmarks, the eigen benchmarks, and a few others, on x86 (FDO, thinlto) with this patch, and with/without enabling consider-local-interval-cost. No significant real effect.

About how to proceed, to add to the "revert original vs not" aspect, the additional trouble is that the original patch has some changes to weight calculation that aren't controlled by neither consider-local-interval-cost, nor by enableAdvancedRASplitCost().

Any pushback, then, to proceeding as planned (fixing the code, but disabling it by default everywhere)? If anything "breaks", there's a quick unblocker for those affected, and meanwhile there's time to investigate the alternatives others mentioned here.

Yeah, I agree. 5% is too much to pay. https://reviews.llvm.org/D69437 measured this option as a 25% speed increase in a something that was important enough to fix, with a 0.1-0.2% compile time effect. That's a very different question.

(What really would like to see - is it would be great if there was someone in the llvm community who cared about O2. Trying to get the best performance for a low compile time, and optimizing that tradeoff. That can do a lot of benefit for a lot of people. This particular issue I have less interest in, It's just a shame to see the compile get things so embarrassingly wrong!)

But this patch has been in tree a long time. I believe best practice is to fix things as-is, not revert and attempt to re-apply after so long.

Quick update on the code quality side, I ran spec2006, the llvm benchmarks, the eigen benchmarks, and a few others, on x86 (FDO, thinlto) with this patch, and with/without enabling consider-local-interval-cost. No significant real effect.

About how to proceed, to add to the "revert original vs not" aspect, the additional trouble is that the original patch has some changes to weight calculation that aren't controlled by neither consider-local-interval-cost, nor by enableAdvancedRASplitCost().

Any pushback, then, to proceeding as planned (fixing the code, but disabling it by default everywhere)? If anything "breaks", there's a quick unblocker for those affected, and meanwhile there's time to investigate the alternatives others mentioned here.

If it's not going to be enabled *anywhere*, even on the AArch64, it sounds like dead code to me, which shouldn't be there.

Yeah, I agree. 5% is too much to pay. https://reviews.llvm.org/D69437 measured this option as a 25% speed increase in a something that was important enough to fix, with a 0.1-0.2% compile time effect. That's a very different question.

(What really would like to see - is it would be great if there was someone in the llvm community who cared about O2. Trying to get the best performance for a low compile time, and optimizing that tradeoff. That can do a lot of benefit for a lot of people. This particular issue I have less interest in, It's just a shame to see the compile get things so embarrassingly wrong!)

But this patch has been in tree a long time. I believe best practice is to fix things as-is, not revert and attempt to re-apply after so long.

Quick update on the code quality side, I ran spec2006, the llvm benchmarks, the eigen benchmarks, and a few others, on x86 (FDO, thinlto) with this patch, and with/without enabling consider-local-interval-cost. No significant real effect.

About how to proceed, to add to the "revert original vs not" aspect, the additional trouble is that the original patch has some changes to weight calculation that aren't controlled by neither consider-local-interval-cost, nor by enableAdvancedRASplitCost().

Any pushback, then, to proceeding as planned (fixing the code, but disabling it by default everywhere)? If anything "breaks", there's a quick unblocker for those affected, and meanwhile there's time to investigate the alternatives others mentioned here.

If it's not going to be enabled *anywhere*, even on the AArch64, it sounds like dead code to me, which shouldn't be there.

Not enabled by default. This would still give folks a chance to unblock if they hit regressions by using the flag. We can then remove the portion that's flag controlled when we have the alternative others were mention, or when we feel that, since no one reported regressions, it's safe to remove.

Yeah, I agree. 5% is too much to pay. https://reviews.llvm.org/D69437 measured this option as a 25% speed increase in a something that was important enough to fix, with a 0.1-0.2% compile time effect. That's a very different question.

(What really would like to see - is it would be great if there was someone in the llvm community who cared about O2. Trying to get the best performance for a low compile time, and optimizing that tradeoff. That can do a lot of benefit for a lot of people. This particular issue I have less interest in, It's just a shame to see the compile get things so embarrassingly wrong!)

But this patch has been in tree a long time. I believe best practice is to fix things as-is, not revert and attempt to re-apply after so long.

Quick update on the code quality side, I ran spec2006, the llvm benchmarks, the eigen benchmarks, and a few others, on x86 (FDO, thinlto) with this patch, and with/without enabling consider-local-interval-cost. No significant real effect.

About how to proceed, to add to the "revert original vs not" aspect, the additional trouble is that the original patch has some changes to weight calculation that aren't controlled by neither consider-local-interval-cost, nor by enableAdvancedRASplitCost().

Any pushback, then, to proceeding as planned (fixing the code, but disabling it by default everywhere)? If anything "breaks", there's a quick unblocker for those affected, and meanwhile there's time to investigate the alternatives others mentioned here.

If it's not going to be enabled *anywhere*, even on the AArch64, it sounds like dead code to me, which shouldn't be there.

Not enabled by default. This would still give folks a chance to unblock if they hit regressions by using the flag. We can then remove the portion that's flag controlled when we have the alternative others were mention, or when we feel that, since no one reported regressions, it's safe to remove.

I understand that. The question i'm asking is whether it's useful to have that flag in the first place?
Do we know that people will actually use it?

Yeah, I agree. 5% is too much to pay. https://reviews.llvm.org/D69437 measured this option as a 25% speed increase in a something that was important enough to fix, with a 0.1-0.2% compile time effect. That's a very different question.

(What really would like to see - is it would be great if there was someone in the llvm community who cared about O2. Trying to get the best performance for a low compile time, and optimizing that tradeoff. That can do a lot of benefit for a lot of people. This particular issue I have less interest in, It's just a shame to see the compile get things so embarrassingly wrong!)

But this patch has been in tree a long time. I believe best practice is to fix things as-is, not revert and attempt to re-apply after so long.

Quick update on the code quality side, I ran spec2006, the llvm benchmarks, the eigen benchmarks, and a few others, on x86 (FDO, thinlto) with this patch, and with/without enabling consider-local-interval-cost. No significant real effect.

About how to proceed, to add to the "revert original vs not" aspect, the additional trouble is that the original patch has some changes to weight calculation that aren't controlled by neither consider-local-interval-cost, nor by enableAdvancedRASplitCost().

Any pushback, then, to proceeding as planned (fixing the code, but disabling it by default everywhere)? If anything "breaks", there's a quick unblocker for those affected, and meanwhile there's time to investigate the alternatives others mentioned here.

If it's not going to be enabled *anywhere*, even on the AArch64, it sounds like dead code to me, which shouldn't be there.

Not enabled by default. This would still give folks a chance to unblock if they hit regressions by using the flag. We can then remove the portion that's flag controlled when we have the alternative others were mention, or when we feel that, since no one reported regressions, it's safe to remove.

I understand that. The question i'm asking is whether it's useful to have that flag in the first place?
Do we know that people will actually use it?

Ah, I see - we don't know, since right now people don't know if they'd have a regression if we flipped the default behavior. If they do, they'd easily unblock themselves, and hopefully report back. The alternative - removing the code outright - is harder to revert.

Quick update on the code quality side, I ran spec2006, the llvm benchmarks, the eigen benchmarks, and a few others, on x86 (FDO, thinlto) with this patch, and with/without enabling consider-local-interval-cost. No significant real effect.

I would expect any large out of order cpu to chew through register mov's like they aren't even there. They just end up as register renames. It will be in-order "little" cores where this hurts the most.

Can you remove -consider-local-interval-cost from the tests?

Yeah, I agree. 5% is too much to pay. https://reviews.llvm.org/D69437 measured this option as a 25% speed increase in a something that was important enough to fix, with a 0.1-0.2% compile time effect. That's a very different question.

(What really would like to see - is it would be great if there was someone in the llvm community who cared about O2. Trying to get the best performance for a low compile time, and optimizing that tradeoff. That can do a lot of benefit for a lot of people. This particular issue I have less interest in, It's just a shame to see the compile get things so embarrassingly wrong!)

But this patch has been in tree a long time. I believe best practice is to fix things as-is, not revert and attempt to re-apply after so long.

Quick update on the code quality side, I ran spec2006, the llvm benchmarks, the eigen benchmarks, and a few others, on x86 (FDO, thinlto) with this patch, and with/without enabling consider-local-interval-cost. No significant real effect.

About how to proceed, to add to the "revert original vs not" aspect, the additional trouble is that the original patch has some changes to weight calculation that aren't controlled by neither consider-local-interval-cost, nor by enableAdvancedRASplitCost().

Any pushback, then, to proceeding as planned (fixing the code, but disabling it by default everywhere)? If anything "breaks", there's a quick unblocker for those affected, and meanwhile there's time to investigate the alternatives others mentioned here.

If it's not going to be enabled *anywhere*, even on the AArch64, it sounds like dead code to me, which shouldn't be there.

Not enabled by default. This would still give folks a chance to unblock if they hit regressions by using the flag. We can then remove the portion that's flag controlled when we have the alternative others were mention, or when we feel that, since no one reported regressions, it's safe to remove.

I understand that. The question i'm asking is whether it's useful to have that flag in the first place?
Do we know that people will actually use it?

Ah, I see - we don't know, since right now people don't know if they'd have a regression if we flipped the default behavior. If they do, they'd easily unblock themselves, and hopefully report back. The alternative - removing the code outright - is harder to revert.

Ah, i see.

Quick update on the code quality side, I ran spec2006, the llvm benchmarks, the eigen benchmarks, and a few others, on x86 (FDO, thinlto) with this patch, and with/without enabling consider-local-interval-cost. No significant real effect.

I would expect any large out of order cpu to chew through register mov's like they aren't even there. They just end up as register renames. It will be in-order "little" cores where this hurts the most.

Right, unfortunately I don't have access to those.

Can you remove -consider-local-interval-cost from the tests?

It's disabled by default (I'm assuming it's about your comment on bug26810.ll - I'll reply there)

mtrofin updated this revision to Diff 334452.Wed, Mar 31, 8:42 AM
mtrofin marked 3 inline comments as done.

feedback

llvm/test/CodeGen/X86/bug26810.ll
1

We're not enabling it by default though, that'd regress compile time.

llvm/test/CodeGen/X86/i128-mul.ll
2

sgtm, I'll do that.

@aditya_nandakumar @qcolombet any idea when a better fix for that issue might be available?

Given Aditya did the prototype I let him comment on that, but I don't expect we land something anytime soon (i.e., at least a couple of months).

@aditya_nandakumar @qcolombet any idea when a better fix for that issue might be available?

Given Aditya did the prototype I let him comment on that, but I don't expect we land something anytime soon (i.e., at least a couple of months).

That sounds like a reasonable estimate of when I can get it upstreamed.

dmgreen accepted this revision.Thu, Apr 1, 6:44 AM

Thanks for the test update. Reluctantly, this LGTM.

llvm/test/CodeGen/X86/bug26810.ll
1

Ah, yes. I meant "If we are disabling this flag.."

mtrofin marked an inline comment as done.Thu, Apr 1, 7:45 AM

@aditya_nandakumar @qcolombet any idea when a better fix for that issue might be available?

Given Aditya did the prototype I let him comment on that, but I don't expect we land something anytime soon (i.e., at least a couple of months).

That sounds like a reasonable estimate of when I can get it upstreamed.

llvm/test/CodeGen/X86/bug26810.ll
1

Ah, I wanted to keep some regression testing for the behavior behind the flag.

mtrofin marked an inline comment as done.Thu, Apr 1, 7:54 AM

@xbolva00 - is there anything you'd like to see done to this patch?

xbolva00 resigned from this revision.Thu, Apr 1, 7:58 AM

No, nothing from me.

This revision is now accepted and ready to land.Thu, Apr 1, 7:58 AM