Page MenuHomePhabricator

[UpdateLLCTestChecks] Allow replacing register names with variables
Needs ReviewPublic

Authored by greened on Jan 7 2022, 11:05 AM.

Details

Summary

Add a --scrub-reg option to convert target-specific register names to regular
expressions or FileCheck variables. --scrub-reg=names will replace register
names with FileCheck variables and track register dependencies via the
variables. --scrub-reg=deps will replace register names with a regular
expression that doesn't capture into a FileCheck variable. Dependency scrubbing
makes a test more robust in the face of incidental register name changes at the
cost of losing the ability to catch changes in the register dependency graph.

This patch only implements the regular expressions to match register names for
X86 and Lanai. Other targets will not query the option, though adding support
for it is simply a matter of checking the option and passing a regular
expression that matches register names on the target.

X86 was chosen to demonstrate how to scrub in the face of a particularly complex
set of register super-/sub-register relationships along with non-uniform naming
even within a register class. It uses a fully custom match and replace
algorithm. The Lanai implementation demonstrates the other end of the spectrum,
where super-/sub-registers all have the same names anyway so effectively they do
not matter for matching purposes. The Lanai implmentation exercises generic
routine that other targets may also use to implement scrubbing.

The x86 regular expression intentionally does not substitute for [re]?sp and
[re]?ip because it's likely that tests will want to match based on addressing
mode. Lanai does not match anything other than r[0-9]+ for similar reasons.

Diff Detail

Event Timeline

greened created this revision.Jan 7 2022, 11:05 AM
greened requested review of this revision.Jan 7 2022, 11:05 AM
sebastian-ne added a subscriber: sebastian-ne.

The implementation looks fine to me, but I’m not sure what the goal is.

The only advantage I see is that if the register allocation changes, the test still passes without needing changes.
But, I find the checks with regexes hard to read. The length of the regex doesn’t improve the readability either.
The names for the matches are generated from the register names, so if the register allocation changed in some commit (though the test keeps passing) and someone re-generates the test later, all the match names change – possibly on a commit that is unrelated to the one that changed the register allocation.

Using <instruction name>+<counter> like the MIR-check script does could alleviate the second issue.

The implementation looks fine to me, but I’m not sure what the goal is.

Same as the goal for other checks: make them more robust in the face of non-relevant changes.

The only advantage I see is that if the register allocation changes, the test still passes without needing changes.

In the context of this change, yes. I have further features to add that will allow filtering the checks so we don't *have to* check all of the output. In that case, if code outside the checks changes (and we've said we don't care about it), then register allocation will almost certainly change and at that point you want patterns, not hard-coded names.

But, I find the checks with regexes hard to read. The length of the regex doesn’t improve the readability either.

Granted that is an issue.

The names for the matches are generated from the register names, so if the register allocation changed in some commit (though the test keeps passing) and someone re-generates the test later, all the match names change – possibly on a commit that is unrelated to the one that changed the register allocation.

Using <instruction name>+<counter> like the MIR-check script does could alleviate the second issue.

That'll change too if instruction counts change. We have this problem with update_test_checks.py as well.

RKSimon added a subscriber: RKSimon.

Have you examples of actual test checks that you've had problems with that this is a answer to?

TBH This looks like a lot of over complication for little reward - so often when checks getting hidden by things like this we discover later on that it was also hiding flaws that we should be dealing with.

Agreed with Simon, this is a disaster for doing code review. I believe almost lit tests won't be affected by RA changes given they are concise. For test cases that already complicated, the regexes make the readablity even worse.
According to my experience, jumping between xmm0 and xmm1 is much more tolerable that regexes. And yes, it's easy to hide unnoticeable flaws.

Have you examples of actual test checks that you've had problems with that this is a answer to?

I do, but my previous employer has them. As I said, this really becomes necessary when you start paring down the checks. Code may change in other places that you don't care about and then that changes register allocation.

TBH This looks like a lot of over complication for little reward - so often when checks getting hidden by things like this we discover later on that it was also hiding flaws that we should be dealing with.

Can you explain this a bit more? What sorts of issues are you worried about?

Agreed with Simon, this is a disaster for doing code review.

That seems a bit extreme.

I believe almost lit tests won't be affected by RA changes given they are concise. For test cases that already complicated, the regexes make the readablity even worse.

I disagree that lit test cases are concise. We have a huge issue with codegen churn causing pages of testcase changes in proposed merges and very little review is done on them. I'm trying to get us to a place where we have less of that. Filtering out trivial changes to tests in reviews can be a big win.

According to my experience, jumping between xmm0 and xmm1 is much more tolerable that regexes. And yes, it's easy to hide unnoticeable flaws.

How so? I'm struggling to think of a time a regexp hid a real problem. Can you explain a bit more?

We have a huge issue with codegen churn causing pages of testcase changes in proposed merges and very little review is done on them.

Can you show an example?

How so? I'm struggling to think of a time a regexp hid a real problem. Can you explain a bit more?

Here is one example: https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/distancemap.mir#L66
The several [[COPY6:%[0-9]+]] are totally different things. I was fooled during the review. That's why I hate it.

We have a huge issue with codegen churn causing pages of testcase changes in proposed merges and very little review is done on them.

Can you show an example?

Sure.

Not so huge but all of the ext instructions just changed register names. That's not what the patch is about so it's irrelevant to the test.
https://reviews.llvm.org/D115646

A little bigger, shows a bigger diff than is really warranted just because of register name changes.
https://reviews.llvm.org/D64174

I don't even know how to make sense of these test diffs.
https://reviews.llvm.org/D57367

That last one has more than register name changes going on, but the register name changes make it hard to pick out the changes that matter. This is also an example of codegen tests that are just too large IMO. Look at test/CodeGen/X86/avg.ll for example. What is this actually testing? The name isn't informative, there's no comment explaining what the test is and it's a big pile of asm. Is all of that asm really necessary to check? Maybe we only care about a very specific instruction sequence. I don't know. This is why I have follow-on patches to allow update_llc_test_checks.py to filter output and create a more focused asm test by only checking for specific patterns of asm.

This isn't even *that* bad of an example. I've seen much worse, with hundreds of instructions in a test and diffs over basically the whole asm with the PR affecting many such tests. I have no idea during review if the test changes are reasonable or if the submitter simply ran update_llc_test_checks.py and submitted the result.

I have had discussions with the people that originally created the update_* scripts and they were clear the intent was *not* to just blindly run it and submit the result without careful hand-editing. Unfortunately, it has not turned out that way. It makes sense to me to provide some tools to allow some semi-automatic narrowing of tests that will hopefully be used more than the amount of hand-editing that is done now (basically zero).

How so? I'm struggling to think of a time a regexp hid a real problem. Can you explain a bit more?

Here is one example: https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/distancemap.mir#L66
The several [[COPY6:%[0-9]+]] are totally different things. I was fooled during the review. That's why I hate it.

Ah, yes, I can see how that would be an issue. I'm understanding your concerns better, thanks. So maybe replacing the register name with a FileCheck variable named by the original register isn't the best. I could certainly alter this to uniquify names if it would help.

We have a huge issue with codegen churn causing pages of testcase changes in proposed merges and very little review is done on them.

Can you show an example?

Sure.

All of the examples you have provided show non-register-changes in the same tests, so the tests would still need re-generating. After re-generating, would all the register name updates still not be part of the diff on exactly the same lines?

I have had discussions with the people that originally created the update_* scripts and they were clear the intent was *not* to just blindly run it and submit the result without careful hand-editing.

The tests get a header that specifically tells you exactly which script to run with which options to fully automatically update them; regardless of original intent, that is how these update_* scripts are used now, for better or worse.

I could certainly alter this to uniquify names if it would help.

If you can find a good way to do that, so that register name updates do actually get kept out of diffs, then yes I can see how this may help.

All of the examples you have provided show non-register-changes in the same tests, so the tests would still need re-generating. After re-generating, would all the register name updates still not be part of the diff on exactly the same lines?

You're right, they would and I should fix that. Thanks for the feedback!

I have had discussions with the people that originally created the update_* scripts and they were clear the intent was *not* to just blindly run it and submit the result without careful hand-editing.

The tests get a header that specifically tells you exactly which script to run with which options to fully automatically update them; regardless of original intent, that is how these update_* scripts are used now, for better or worse.

That sidesteps the issue, which is that we have tests with no documentation of what they're testing. The fact that update_* scripts check *all* of the output exacerbates that. Sure, for better or worse but I'd like to make it better. :)

I could certainly alter this to uniquify names if it would help.

If you can find a good way to do that, so that register name updates do actually get kept out of diffs, then yes I can see how this may help.

Ok, I'll see what I can do!

I think I can also shorten up the regexps by being smarter about what kind of register is used in the asm. I'll work on that as well. It should make things much more readable and will flag changes in the register class, which is probably desirable.

It might be better if you do this for a target other than x86 - if generating arm/aarch64 codegen tests with the update scripts are a concern for those teams, using it there makes more sense, plus they have simpler register sets.

greened updated this revision to Diff 400018.Jan 14 2022, 8:21 AM
greened edited the summary of this revision. (Show Details)

Made the FileCheck variable names more generic, to limit diffs between test updates. Names are now based on register class with a uniquifying count that tracks redefinitions of registers so that changes in register def-use changes can be seen.

The X86 regexps for GPRs are complicated because only parts of register names are common across all super-/sub-classes and the prefix and suffix (non-common) parts of the names varies based on the "base" GPR name.

It might be better if you do this for a target other than x86 - if generating arm/aarch64 codegen tests with the update scripts are a concern for those teams, using it there makes more sense, plus they have simpler register sets.

Not sure where ARM/AArch64 came into this. I don't currently work on those targets. I did X86 because it was what I was working on at the time. With the recent update I added Lanai as an example of a simpler implementation. I think the X86 implementation is valuable because it demonstrates the flexibility targets can have.

greened updated this revision to Diff 400027.Jan 14 2022, 8:28 AM

Clarified comments.

Although the new regexp looks much concise, you still haven't solved the second question @hvdijk raised.
I cannot imagine there will be a simple method that can keep the register corresponding relationship between instructions while keep them unchanged with minor instruction changes.
Counting the register number won't solve the problem either. Imagine the first instruction is removed or scheduled somewhere, will be all register names changed too? Isn't it even wores? And I evaluated it on D64174 and think the problem is still there.

greened updated this revision to Diff 400862.Jan 18 2022, 8:49 AM

Fixed non-clobber register def regexp for x86. It was using lookbehind instead of negative lookahead for vector mnemonics.

Although the new regexp looks much concise, you still haven't solved the second question @hvdijk raised.

I cannot imagine there will be a simple method that can keep the register corresponding relationship between instructions while keep them unchanged with minor instruction changes.

Define "minor." If the instructions basically keep the same number of register operands and the number of instructions remains the same, I think the FileCheck variable names should remain stable. I would classify those as "minor" changes. Changing whole swaths of asm is not "minor" in my mind.

Counting the register number won't solve the problem either. Imagine the first instruction is removed or scheduled somewhere, will be all register names changed too? Isn't it even wores? And I evaluated it on D64174 and think the problem is still there.

If a test is going to check all of the asm generated for it, then yes, if the instruction counts change, scheduling changes, etc. then the FileCheck variable names will change, because there are different uses and defs. Remember that this enhancement is in the context of other enhancements. Namely, I have a patch (which I plan to submit soon) to filter the output such that smaller bits of asm can be checked. Then the register generalizing is applied after that. It's much less likely for small bits of focused asm to change the number of instructions, for example.

You're right that for D64174 there will be some name changes, because the number of instructions changed. I agree that there isn't a good solution for that outside of triming down what the test is looking at. Do those tests really have to look at *all* of the generated insrtuctions? I suspect not. If you narrow the scope of the check, you'll naturally get fewer register dependency changes in that scope.

Sorry if this has been asked before, but i think the main question here is: what's the target audience here?
Is this intended to be used solely downstream, or for certain upstream backend? Is it planned to be ever enabled by default?

If the instructions basically keep the same number of register operands and the number of instructions remains the same, I think the FileCheck variable names should remain stable.

No, assuming we have a test case:

define i32 @foo(i32 %0, i32 %0, i32 %0)
; CHECK:       # %bb.0:
; CHECK-NEXT:    add %R0, %R1
; CHECK-NEXT:    add %R0, %R2
; CHECK-NEXT:    other_use_def %R1
; CHECK-NEXT:    other_use_def %R2

A patch that only exchanges the order of the first add to

; CHECK-NEXT:    add %R0, %R2
; CHECK-NEXT:    add %R0, %R1

Then, the other use and def of %R1 and %R2 will remain unchanged.
But if you are numbering the register by order, all the other %R1 and %R2 will be changed while the expected change won't be shown. This is rather confusing.

greened added a comment.EditedJan 19 2022, 7:04 AM

Sorry if this has been asked before, but i think the main question here is: what's the target audience here?

Not asked before, and these are fair questions!

The target audience is anyone maintaining a code generator. The ultimate goal (along with other patches) is to enhance the tool to create much more focused tests. Let me give an example. In my previous job I worked a lot on X86 non-temporal load/store support. We found that we'd get a lot of spurious test changes as we changed other parts of the backend. Since the tests were only concerned with ensuring we continued to emit MOVNT instructions, I added the ability for update_llc_test_checks.py to filter and scrub output to minimize test changes caused by unrelated work. Essentially, I filtered the asm to only check MOVNT and a few auxilliary instructions and scrubbed the register names so that changes in other instructions that caused different register allocations wouldn't alter the test checks. They would continue to work as written.

Is this intended to be used solely downstream, or for certain upstream backend? Is it planned to be ever enabled by default?

I don't intend for this ever to be default. It should be strictly opt-in. As I mentioned above, I originally developed this for an in-tree backend, though it happens that the current downstream backend I'm working on can also use it.

A patch that only exchanges the order of the first add to

; CHECK-NEXT:    add %R0, %R2
; CHECK-NEXT:    add %R0, %R1

Then, the other use and def of %R1 and %R2 will remain unchanged.
But if you are numbering the register by order, all the other %R1 and %R2 will be changed while the expected change won't be shown. This is rather confusing.

Fair enough. Again, I would ask, what is the goal of the test. If it's testing scheduling, then don't scrub register names. If it's not testing scheduling, then why do we care if the scheduling change isn't caught by the test checks?

A simple uniquing counter will result in more differences than ideal. I could imagine any number of schemes to increase precision but my sense is that it's not worth it. If the user is also filtering output to create smaller test checks, then the uniquing tends to work better IME.

This is one tool in the toolbox. It's not meant for every situation.

D117694 is the revision adding the filtering capability. We could improve this register scrubbing for more filtered cases but I'd prefer to do that as a later enhancement.

A patch that only exchanges the order of the first add to

; CHECK-NEXT:    add %R0, %R2
; CHECK-NEXT:    add %R0, %R1

Then, the other use and def of %R1 and %R2 will remain unchanged.
But if you are numbering the register by order, all the other %R1 and %R2 will be changed while the expected change won't be shown. This is rather confusing.

Fair enough. Again, I would ask, what is the goal of the test. If it's testing scheduling, then don't scrub register names. If it's not testing scheduling, then why do we care if the scheduling change isn't caught by the test checks?

What I mainly wanted to demonstrate is there exists such a test generated for other purpose. For a scheduling patch, the unexpected changes of other use and def of %R1 and %R2 on an unrelated test are confusing.

; CHECK-NEXT:    add %[[GPR_1:(%r[0-9]+)]], %[[GPR_2:(%r[0-9]+)]] <== These two lines won't be changed, that's fine.
; CHECK-NEXT:    add %[[GPR_1:(%r[0-9]+)]], %[[GPR_3:(%r[0-9]+)]]
; ... ...
; CHECK-NEXT:    other_use_def %[[GPR_2]] <== These lines will be changed, that's bad.
; CHECK-NEXT:    other_use_def %[[GPR_3]] <== Same here.

A simple uniquing counter will result in more differences than ideal. I could imagine any number of schemes to increase precision but my sense is that it's not worth it. If the user is also filtering output to create smaller test checks, then the uniquing tends to work better IME.

This is one tool in the toolbox. It's not meant for every situation.

So I'm not arguing don't use this tool for number sensitive patches. I'm arguing any test cases generated in this way may confuse future number sensitive patches.
I agree filtering out unrelated code is a good idea. Then why we still need this patch if we have already filtered them?

greened added a comment.EditedJan 21 2022, 1:44 PM

What I mainly wanted to demonstrate is there exists such a test generated for other purpose. For a scheduling patch, the unexpected changes of other use and def of %R1 and %R2 on an unrelated test are confusing.

; CHECK-NEXT:    add %[[GPR_1:(%r[0-9]+)]], %[[GPR_2:(%r[0-9]+)]] <== These two lines won't be changed, that's fine.
; CHECK-NEXT:    add %[[GPR_1:(%r[0-9]+)]], %[[GPR_3:(%r[0-9]+)]]
; ... ...
; CHECK-NEXT:    other_use_def %[[GPR_2]] <== These lines will be changed, that's bad.
; CHECK-NEXT:    other_use_def %[[GPR_3]] <== Same here.

Ok, I gotcha. I think maybe it would be useful to add (in a follow-on patch) the ability to not care about register dependencies at all, so not generate any FileCheck variables and use just pure regular expressions to scrub the register names. That would give flexibility to allow testing different scenarios.

I agree filtering out unrelated code is a good idea. Then why we still need this patch if we have already filtered them?

Because we might still want to scrub names after filtering.

jdoerfert resigned from this revision.Jan 28 2022, 9:33 AM

Not an llc person.

greened updated this revision to Diff 404699.Jan 31 2022, 1:27 PM
greened edited the summary of this revision. (Show Details)

Updated to change to a --scrub-reg option with a value of "names" or "deps." --scrub-reg=names will create FileCheck variables as before while --scrub-reg=deps will replace with non-capturing regular expressions.

The latter makes tests more robust in the face of irrelevant register assignment changes, at the cost of losing the ability to flag changes in the register dependence graph.

Ping. I think the latest changes should address some of the concerns about applicability. The scrub-regs=deps option creates a very generic kind of check.

Looks good to me, with all the comments and fixes the generated tests look a load better than in the first version.

llvm/utils/UpdateTestChecks/asm.py
281

Typo: when -> that?

I'm still struggling to understand real use cases here - you spoke about just wanting to match specific instructions (nontemporal load/store) - but wasn't that handled by D117694 --filter and --filter-out?

I guess I'm getting worried that these scripts are experiencing feature creep for use cases that are barely a real issue.

greened added a comment.EditedFeb 7 2022, 11:50 AM

I'm still struggling to understand real use cases here - you spoke about just wanting to match specific instructions (nontemporal load/store) - but wasn't that handled by D117694 --filter and --filter-out?

I guess I'm getting worried that these scripts are experiencing feature creep for use cases that are barely a real issue.

The nontemporal case is just one example, but even with that use-case, consider a check like this:

; CHECK: vmovntpd %zmm0, 20(%rax, 10, 8)

If *anything* in the test (even in code that's been filtered out), causes register allocation changes then it's likely that %zmm0 and/or %rax no longer match. However, we still want to match the offset, index and scale components, so we can't just check for vmovntpd and ignore the rest of the line. That's when register scrubbing comes in handy. We don't really care what the actual registers are, just that we emittied a vmovntpd store with the correct addressing mode.

I guess to me this just brings feature parity with update_test_checks.py to update_llc_test_checks.py. Both can now scrub variable/register names.

I think my concerns still exist, even though you provide one more option. The author selects between cannot foresee what in their tests will be affected by future commit. And the --scrub-reg deps loses too much information.
Besides, comparing the introdued complexity in asm.py and the (probably) limited benefit, it's not a good deal to me.
So back to @lebedev.ri 's question, if this is mainly for downstream use, it's better to keep it downstream. Otherwise, please give a scope which in tree tests you want to apply it to, so that we can judge the value based on that.

I think my concerns still exist, even though you provide one more option. The author selects between cannot foresee what in their tests will be affected by future commit. And the --scrub-reg deps loses too much information.

There are use-cases for scrubbing names and deps. The test author should know what they are testing. If the author can't understand whether or not register names/deps are important, perhaps it's not a good test. IME this kind of scrubbing really helps for creating focused tests. This isn't an option to use blindly, the author should understand what they're doing.

Besides, comparing the introdued complexity in asm.py and the (probably) limited benefit, it's not a good deal to me.

If people want to keep generating test checks from all of the asm output, perhaps there's limited benefit. But that means we just don't care about creating focused tests. That to me is not a great outcome.

So back to @lebedev.ri 's question, if this is mainly for downstream use, it's better to keep it downstream. Otherwise, please give a scope which in tree tests you want to apply it to, so that we can judge the value based on that.

Now that D119368 is posted for review, it provides another motivating case. Most likely folks will want to scrub the temporary SDNode names from the isel output.