This is an archive of the discontinued LLVM Phabricator instance.

Update update_test_checks to work properly with phi nodes and other fun things.
ClosedPublic

Authored by dberlin on Jan 5 2017, 3:31 PM.

Details

Summary

Prior to this change, phi nodes were never considered defs, and so we ended up with undefined variables for any loop. Now, instead of trying to find just defs, we iterate over each actual IR value in the line and replace them one by one with either a definition or a use. (This is O(n^2) due to Python's immutable strings, but it hasn't been a problem in practice so far.)

We also don't try to match anything in the comment portions of the line.

I've tested it even on things like function pointer calls, etc., and against existing test cases that use update_test_checks.
With this change, we are able to use update_test_checks on the cyclic cases in NewGVN.

The only case I'm aware of that will misfire is if you have a string which contains a valid token.
However, this is the same as it is now, with a slightly larger set of strings that may misfire.
Prior to this change, a test with the string " %a =" would be replaced.

Diff Detail

Repository
rL LLVM

Event Timeline

dberlin updated this revision to Diff 83316.Jan 5 2017, 3:31 PM
dberlin retitled this revision from to Update update_test_checks to work properly with phi nodes and other fun things..
dberlin updated this object.
dberlin added reviewers: spatel, chandlerc.
dberlin added a subscriber: llvm-commits.
dberlin added inline comments.Jan 5 2017, 3:32 PM
utils/update_test_checks.py
73 ↗(On Diff #83316)

Note that the ugly | is necessary because \Z (or $) does nothing in the character class :(
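For anyone unfamiliar with the quirk being worked around, a standalone illustration (not code from the patch): inside a character class, $ is just a literal dollar sign, so "end of line" has to be expressed with an alternation instead.

import re

print(re.search(r'[;$]', 'no semicolon here'))     # None: '$' inside [...] is a literal character
print(re.search(r'(?:;|$)', 'no semicolon here'))  # matches the empty string at the end of the line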

Thanks for working on this!

I made a note to myself that the script didn't work with loops, but I wasn't sure how to fix it. I'm highly unqualified to review because hacking together update_llc_test_checks.py with Tim's 'FileCheck for clang' script is my only experience with python...ever.

One thing I would like to note is that I was hoping to unify this script and update_llc_test_checks.py since they're probably still 95% the same thing, but maybe it's best to just specialize this script for 'opt' and let the 'llc' script evolve separately?

RKSimon edited edge metadata.Jan 6 2017, 7:49 AM

One thing I would like to note is that I was hoping to unify this script and update_llc_test_checks.py since they're probably still 95% the same thing, but maybe it's best to just specialize this script for 'opt' and let the 'llc' script evolve separately?

I think we're better off enforcing the split between these now - if we're going to get better asm support in for different architectures, update_llc_test_checks.py is going to diverge pretty quickly. It's not out of the question to share common code between them if we want to.

bryant edited edge metadata.Jan 6 2017, 8:47 AM

drive-by comment.

utils/update_test_checks.py
176 ↗(On Diff #83316)

You're limiting the number of replaced uses to 1. Can a line have more than one use?

bryant added inline comments.Jan 6 2017, 9:26 AM
utils/update_test_checks.py
176 ↗(On Diff #83316)

Just to clarify, what I meant is that you can do a lot better than manually replacing one at a time:

def genericize_check_lines(lines):
    vars_seen = set()

    def ssaify(match):
        v = match.group(1)
        if v in vars_seen:
            rv = get_value_use(v)
        else:
            vars_seen.add(v)
            rv = get_value_definition(v)
        return rv

    for line in lines:
        scrubbed_line = ...
        new_line = IR_VALUE_RE.sub(ssaify, scrubbed_line)
    ...

I suspect that underneath, _sre is building the replacement string incrementally, so it won't be quadratic.

bryant added inline comments.Jan 6 2017, 10:38 AM
utils/update_test_checks.py
176 ↗(On Diff #83316)

Yes, but you want to replace one at a time anyway.

Because it could be a use of "bb", and the line could be phi [a, bb], [b, bb17], [c, bb].

If you replace all uses of bb, you'll mistakenly replace bb17.
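A standalone illustration of that hazard, with made-up values (not the script's actual output format):

line = "phi [a, bb], [b, bb17], [c, bb]"

# Replacing every occurrence of "bb" in one pass also corrupts the distinct
# value "bb17", which is why the replacement has to be token-aware:
print(line.replace("bb", "%1"))   # prints: phi [a, %1], [b, %117], [c, %1]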

We can do better, but not in this way (see my other email; this will generate wrong answers), and it's IMHO not worth it.

You can avoid doing the replacement directly by using the match start and end, but if you do, we have to mess with the loop quite a bit.
Even if we stop using finditer (because we are replacing the string it is iterating over as it goes, and it does not keep up to date) and use match directly, we have to offset where it starts looking for matches by the amount of text we've added by converting to var defs/uses, etc., or risk false matches.

None of this is worth it compared to keeping it simple like I did, imho.
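For reference, a rough sketch of the offset-based alternative described above, with a made-up token regex and placeholder def/use markers standing in for the script's real get_value_definition/get_value_use output (not the patch's code):

import re

IR_VALUE_RE = re.compile(r"(%[-a-zA-Z0-9_.]+)")   # assumed token pattern, for illustration only

def rewrite_line(line, vars_seen):
    # Build the output from the match offsets of the original string, so the
    # text we splice in can never shift the positions finditer reported.
    pieces, last = [], 0
    for m in IR_VALUE_RE.finditer(line):
        pieces.append(line[last:m.start(1)])
        v = m.group(1)
        if v in vars_seen:
            pieces.append("<use:" + v + ">")    # placeholder for get_value_use(v)
        else:
            vars_seen.add(v)
            pieces.append("<def:" + v + ">")    # placeholder for get_value_definition(v)
        last = m.end(1)
    pieces.append(line[last:])
    return "".join(pieces)

print(rewrite_line("  %r = phi i32 [ %a, %bb ], [ %a, %bb17 ]", set()))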

13:34:06 ~> cat faster.py 

import re

xs = "PHI [a, bb], [b, bb17], [c, bb] " * 901
varz = {"bb": "nop", "bb17": "movzx"}
myre = re.compile(r"[a-z0-9]+")

vrz = {str(i) : str(i) + "a" for i in xrange(4000)}
vrz["sd"] = "g"

def pythonic():
    rv = xs
    def f(match):
        return varz.get(match.group(0), match.group(0))
    return myre.sub(f, rv)

def manual():
    rv = xs
    for match in myre.finditer(xs):
        rv = rv.replace(match.group(0), varz.get(match.group(0),
                        match.group(0)), 1)
    return rv

print pythonic()[:50]
print pythonic() == manual()
from timeit import timeit
print timeit("pythonic()", "from __main__ import *",  number=100)
print timeit("manual()", "from __main__ import *",  number=100)

13:35:55 ~> python faster.py 

PHI [a, nop], [b, movzx], [c, nop] PHI [a, nop], [
True
<timeit-src>:2: SyntaxWarning: import * only allowed at module level
1.07926893234
<timeit-src>:2: SyntaxWarning: import * only allowed at module level
12.0660321712

https://docs.python.org/2/library/re.html#re.sub

dberlin edited edge metadata.
  • Make this faster

The benchmark is invalid since it can't possibly work doing it that way:

There is no way to do it entirely the way you want without rewriting a lot more of this code (because we work on the scrubbed line but can't output it), so we'd have to rewrite the function calling it.

Also, this requires that we capture more than just the value. You are just taking group(1), which is not right: it will delete all the spacing, the parens, the commas, etc., because sub expects a replacement for the entire match.
We then have to reconstitute a correct replacement by capturing all of the line.

So I think that boils down to "I can give you wrong answers much faster" :)

A closer test would be to replace only starting at the match point.
The main reason replace is slower is that it searches from the start of the entire string.

If you look, something like

line[:match.start(1)] + line[match.start(1):].replace(match.group(1), whatever, 1)

is 6-10 times faster.

But since you seem to really want the .sub approach, I've taken that.

bryant added inline comments.Jan 6 2017, 12:21 PM
utils/update_test_checks.py
73 ↗(On Diff #83397)

How come you added more capture groups? Isn't this the syntax that you're expecting?

FILE ::== LINE*
LINE ::== LINE_ENT+
LINE_ENT ::== IR_VALUE | INTERVENING_STUFF

And the purpose of all this is to capture and rename IR_VALUEs, right?

175 ↗(On Diff #83397)

Is there a reason for moving vars_seen to the top level scope? I don't think that global is needed.

def generic_check(lines):
    vars_seen = set()

    def ssaify(match):
        v = match.group(1)
        if v in vars_seen:
            rv = get_value_use(v)
        else:
            vars_seen.add(v) # this works just fine.
            rv = get_value_definition(v)
        return rv  # the new text for match.group(1) in the previous IR_VALUE_RE

Global would only be needed if you're reassigning the variable vars_seen, whereas the update here is done with a method call.

When a callable is passed to sub, sub will pass the match to the callable and expect the replacement string to be returned. Then sub will replace the match with the return value.

Also, this requires that we capture more than just the value. You are just taking group(1), which is not right: it will delete all the spacing, the parens, the commas, etc., because sub expects a replacement for the entire match.
We then have to reconstitute a correct replacement by capturing all of the line.

Capture only that which needs to be replaced. Why do you need to capture the spacing, parens, commas, INTERVENING_STUFF? I think I'm missing something critically obvious.

dberlin marked 2 inline comments as done.Jan 6 2017, 1:26 PM

My strong suggestion would be for you to try your idea out :)
If you tried your original ssaify code, you'd immediately notice it deletes spaces and parens :)

This is one of the reasons I said the original code was simpler.

In any case, let me try to expand on "sub expects to replace the entire match":

The entire match is group 0. Group 0 includes the match of the entire regexp, not just the capture groups (which start at 1).

Python 2.7.6 (default, Oct 26 2016, 20:30:19) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> def foo(match):
...     return "12345"
... 
>>> matcher = re.compile("\s+(aaaaaa)\s+")
>>> matcher.sub(foo,"                    aaaaaa             ")
'12345'
>>>

If you want to use sub, you have to capture the other groups so you can put the entire match back together (or call replace on group 0, which is equivalent to what I was doing before, just more verbose).
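A minimal sketch of that, reusing the toy pattern from the transcript above (illustrative only, not the script's actual regex):

import re

matcher = re.compile(r"(\s+)(aaaaaa)(\s+)")

def foo(match):
    # Rebuild the entire match, substituting only the middle group.
    return match.group(1) + "12345" + match.group(3)

print(matcher.sub(foo, "    aaaaaa    "))   # prints: '    12345    ' -- the surrounding spaces survive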

I'm happy to go a little further, but at some point, this code is just not worth more time than the script itself saves.
If you want to spend time to fix all of these issues, and have the energy to take this over and make it what you want, that would be awesome.
I'm just trying to make it work better than it did.

utils/update_test_checks.py
73 ↗(On Diff #83397)

Because they are needed to make sub work; see below.

175 ↗(On Diff #83397)

I'm aware you can do this, I just tend to find nested functions much uglier to follow in simple scripts like this one.
Again, I don't care about this code enough, so if you want a nested function, I'll make one.

dberlin marked 2 inline comments as done.Jan 6 2017, 1:30 PM

If you want to use sub, you have to capture the other groups so you can put the entire match back together (or call replace on group 0, which is equivalent to what I was doing before, just more verbose).

Sorry, you can also use non-capturing groups, I believe (using ?:), but this makes the regexp even uglier.

(I could also use backreferences in the replacement string, but I like being explicit.)
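For reference, the backreference variant looks like this (standalone illustration, same toy pattern as above):

import re

# \g<1> and \g<3> put the captured whitespace back without needing a callable.
print(re.sub(r"(\s+)(aaaaaa)(\s+)", r"\g<1>12345\g<3>", "    aaaaaa    "))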

dberlin added a comment.EditedJan 6 2017, 3:29 PM

If you want to use sub, you have to capture the other groups so you can put the entire match back together (or call replace on group 0, which is equivalent to what I was doing before, just more verbose).

Sorry, you can also use non-capturing groups, I believe (using ?:), but this makes the regexp even uglier.

(I could also use backreferences in the replacement string, but I like being explicit.)

Unfortunately, it looks like sub has non-intuitive behavior here.
The non-capturing groups definitely change what the match groups contain, but sub still replaces the entire match:

In [10]: def foo(match):
    ...:     print(match.groups())
    ...:     return "12345"
    ...:

In [11]: matcher = re.compile('(?:\s+)(aaaaaa)(?:\s+)')
In [12]: matcher.sub(foo, "                aaaaaa            ")
('aaaaaa',)
Out[12]: '12345'
dberlin updated this revision to Diff 83455.Jan 6 2017, 4:12 PM

Use a nested function

chandlerc edited edge metadata.Jan 6 2017, 6:58 PM

OK, I'm still catching up here, but it seems like you are all converging on an implementation strategy you like?

I'm happy to make the REs and script faster ... *as long as it adds no complexity*. The nested function seems fine. But if we have to write more code in order to speed things up, that seems really bad, because this script should be run *very* infrequently. (I think this is what Danny was saying at one point, I'm mostly trying to echo it.)

I'd like to get some helpful comments to a future person (perhaps myself) about what all of this machinery does and how it works. The 'ssaify' nested function seems to definitely warrant some comments (and the non-nested-function variant would have too).

dberlin updated this revision to Diff 83525.Jan 7 2017, 3:51 AM
dberlin edited edge metadata.

I believe this resolves all outstanding comments.

bryant accepted this revision.Jan 7 2017, 10:41 AM
bryant edited edge metadata.

LGTM, with one nit.

Thank you for working on this.

utils/update_test_checks.py
183 ↗(On Diff #83525)

for i, line in enumerate(lines):
    line = line.replace(...)
    scrubbed_line = ...
    lines[i] = IR_VALUE_RE.sub(..., scrubbed_line)
This revision is now accepted and ready to land.Jan 7 2017, 10:41 AM
This revision was automatically updated to reflect the committed changes.