This is an archive of the discontinued LLVM Phabricator instance.

[lit] Extend DEFINE/REDEFINE with function substitutions
Needs ReviewPublic

Authored by jdenny on Sep 1 2022, 5:33 PM.

Details

Summary

For example, in a test file, the following directives define a
%{check} substitution with two formal parameters and then use it
with varying actual arguments:

// DEFINE: %{check}( CFLAGS %, FC_PREFIX %) =                   \
// DEFINE:   %clang_cc1 %{CFLAGS} -emit-llvm -o - %s |          \
// DEFINE:              FileCheck -check-prefix=%{FC_PREFIX} %s

// RUN: %{check}( -foo %, FOO %)
// RUN: %{check}( -bar %, BAR %)

Diff Detail

Event Timeline

jdenny created this revision.Sep 1 2022, 5:33 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 1 2022, 5:33 PM
jdenny requested review of this revision.Sep 1 2022, 5:33 PM
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a subscriber: sstefan1. · View Herald Transcript
jdenny updated this revision to Diff 458026.Sep 5 2022, 9:08 AM

Update for changes to D132513.

jdenny updated this revision to Diff 458066.Sep 5 2022, 4:26 PM

Update for changes to D132513.

jdenny updated this revision to Diff 458279.Sep 6 2022, 2:10 PM
  • Fixed examples in the doc: consistently use FCFLAGS not FC_PREFIX as a parameter.
  • Updated the examples in the test suite to match the ones in the documentation.
jdenny updated this revision to Diff 458529.Sep 7 2022, 12:28 PM

Update for changes to D132513.

jdenny updated this revision to Diff 459131.Sep 9 2022, 11:01 AM

Update for changes to D132513.

jdenny updated this revision to Diff 459228.Sep 9 2022, 4:30 PM

Update for changes to D132513.

jdenny updated this revision to Diff 459399.Sep 11 2022, 9:14 PM

Update for changes to D132513.

jdenny updated this revision to Diff 459487.Sep 12 2022, 9:18 AM

Updated for changes to D132513.

jdenny updated this revision to Diff 460811.Sep 16 2022, 10:02 AM

Updated for changes to D132513.

I'll be honest, I don't like the syntax of this, so much so that I think I'd prefer people use the existing method of providing function arguments for these substitutions (i.e. DEFINE then REDEFINE them as needed) versus this. The complex regex required in the lit scripts to make this format work also doesn't help this solution's case.

Is the following format at all viable?

%{func(arg1, arg2)}

This would require lit to treat these kinds of substitutions specially, but the format including ( and ) in the name means this would at least be fairly easy to detect.

Thanks for taking a look.

I'll be honest, I don't like the syntax of this,

I get it. It's not pretty. However, it was the simplest syntax I came up with to address the use cases I had in mind. I figured there would be a syntax debate, and I am open to suggestions. (For reference, @MaskRay and I also discussed the syntax some starting at https://discourse.llvm.org/t/lit-run-a-run-line-multiple-times-with-different-replacements/64932/14.)

so much so that I think I'd prefer people use the existing method of providing function arguments for these substitutions (i.e. DEFINE then REDEFINE them as needed) versus this.

For some uses cases, the existing DEFINE/REDEFINE functionality is sufficient and preferable. However, as the new TestingGuide.rst entry in this patch discusses, that syntax is not sufficient for some other uses cases.

The complex regex required in the lit scripts to make this format work also doesn't help this solution's case.

I'm not sure what you mean by "the lit scripts". Are you referring to TestRunner.py? Or are you thinking it will show up again somewhere else?

Is the following format at all viable?

%{func(arg1, arg2)}

I tried something similar originally:

%{func}(arg1, arg2)

However, I quickly ran into many cases where I needed , and ) in my actual arguments. For example, an argument might be a comma-delimited list of FileCheck/-verify prefixes. An argument might contain expected output strings that contained parenthesized elements, such as (null). The above suggestion has the same problems.

Ultimately, I settled on %, and %) as delimiters. That way, there’s no limit on what can go in an actual argument. Moreover, they stick with % for introducing new special sequences. The %if syntax ultimately chose to stick with % as well.

This would require lit to treat these kinds of substitutions specially,

I'm not sure I know what you mean.

Do you mean that, where we currently use python's re.sub and a regex to apply substitutions, we would, for the case of function substitutions, extend python with a hand-written parser for function argument lists? Given the previous negative reaction to making lit more complex for %if, I figured that expressing the argument list syntax in a regex that can be used within lit's existing implementation for applying substitutions would be more acceptable. I'm not sure that regex is actually more complicated than a hand-written parser would be. I'm pretty sure the regex is at least more concise. In any case, I'm open to rewriting if that's the direction the majority want. Eventually, we'll reinvent m4.

Also, I don't think the possibility of implementing with a hand-written parser vs. a regex is specific to your proposed syntax. Either is also possible for the syntax I proposed in this review.

But maybe I misunderstood your point.

but the format including ( and ) in the name means this would at least be fairly easy to detect.

I don't know what you mean. Can you explain more?

Thanks for taking a look.

I'll be honest, I don't like the syntax of this,

I get it. It's not pretty. However, it was the simplest syntax I came up with to address the use cases I had in mind. I figured there would be a syntax debate, and I am open to suggestions. (For reference, @MaskRay and I also discussed the syntax some starting at https://discourse.llvm.org/t/lit-run-a-run-line-multiple-times-with-different-replacements/64932/14.)

so much so that I think I'd prefer people use the existing method of providing function arguments for these substitutions (i.e. DEFINE then REDEFINE them as needed) versus this.

For some uses cases, the existing DEFINE/REDEFINE functionality is sufficient and preferable. However, as the new TestingGuide.rst entry in this patch discusses, that syntax is not sufficient for some other uses cases.

I see it. I didn't register the need on my first skim through this, but I think I get it on second look.

The complex regex required in the lit scripts to make this format work also doesn't help this solution's case.

I'm not sure what you mean by "the lit scripts". Are you referring to TestRunner.py? Or are you thinking it will show up again somewhere else?

No, TestRunner.py was what I was referring to.

Is the following format at all viable?

%{func(arg1, arg2)}

I tried something similar originally:

%{func}(arg1, arg2)

However, I quickly ran into many cases where I needed , and ) in my actual arguments. For example, an argument might be a comma-delimited list of FileCheck/-verify prefixes. An argument might contain expected output strings that contained parenthesized elements, such as (null). The above suggestion has the same problems.

Ultimately, I settled on %, and %) as delimiters. That way, there’s no limit on what can go in an actual argument. Moreover, they stick with % for introducing new special sequences. The %if syntax ultimately chose to stick with % as well.

The %if syntax is slightly different, because there's no way of "grouping" the whole thing into one "name", I think? (I haven't looked back at the syntax for it, so might be misremembering it). For the DEFINE/REDEFINE case, a format with %{foo(arg1, arg2)} has the closing } as a convenient terminator for the "pattern", so anything (e.g. , and )) in between could be considered special (just don't try to use a } in the arg list!).

One of my main concerns is that the argument portion of the proposed syntax doesn't necessarily look like it really is a set of arguments to the untrained eye. The %) maybe just about binds itself to the (, but none of it really binds itself to the %{name} part, so it just doesn't feel like the two halves of the whole expression have anything to do with each other.

This would require lit to treat these kinds of substitutions specially,

I'm not sure I know what you mean.

Do you mean that, where we currently use python's re.sub and a regex to apply substitutions, we would, for the case of function substitutions, extend python with a hand-written parser for function argument lists? Given the previous negative reaction to making lit more complex for %if, I figured that expressing the argument list syntax in a regex that can be used within lit's existing implementation for applying substitutions would be more acceptable. I'm not sure that regex is actually more complicated than a hand-written parser would be. I'm pretty sure the regex is at least more concise. In any case, I'm open to rewriting if that's the direction the majority want. Eventually, we'll reinvent m4.

I think I might have misremembered how these substitutions are handled in practice, but in essence, I imagined doing (effectively) re.search to find a substitution in the input string, and swapping it in, repeating the process for each substitution. As the %{name(arg1, arg2)} syntax wouldn't work for that, you'd have to do something different or extra for these cases. Perhaps it would be re.search for the %{name part and then find the next } manually to form the "matching" block. It may not be less complex in terms of overall code, but it seems like it would require a much simpler regex, therefore making the code perhaps easier to understand?

Also, I don't think the possibility of implementing with a hand-written parser vs. a regex is specific to your proposed syntax. Either is also possible for the syntax I proposed in this review.

But maybe I misunderstood your point.

but the format including ( and ) in the name means this would at least be fairly easy to detect.

I don't know what you mean. Can you explain more?

I was thinking that you'd be able to distinguish the arg-based substitutions from regular ones, by identifying the presence of ( in the pattern, and then could decide to do a different thing for that substitution.

The %if syntax is slightly different, because there's no way of "grouping" the whole thing into one "name", I think? (I haven't looked back at the syntax for it, so might be misremembering it).

At one point, %if used a single regex. There were two issues, as I recall:

  1. The match had to be processed to evaluate the condition and select the then or else. As a result, re.sub wasn't used. Instead, re.search was used followed by post-processing. (Actually, I believe the post-processing could be have been handled by providing a python function for the replacement in re.sub.)
  1. People wanted to nest %if in the then and else. That requires a parser stack, so a simple regex wasn't sufficient, and a recursive-descent parser was implemented. My hunch is that this part of the implementation led to the negative comments about complexity.

Function substitutions as proposed in this patch also have problem 2 (requires parser stack): you might want a function call to appear in an argument of another function call. Because I never found a reason to do that in my test suites, I suspected it wouldn't be common. I also complicated the regex a bit to be sure recursiveExpansionLimit could make such cases work. So, for now, I went with a regex applied via re.sub instead of a recursive-descent parser.

I've read that recursive regexes will be possible in python one day. I don't know the current status for sure, but I got the impression it's a bleeding-edge feature requiring a different regex library than re. That feature will probably enable addressing 2 elegantly for both %if and function substitutions. One day.

For the DEFINE/REDEFINE case, a format with %{foo(arg1, arg2)} has the closing } as a convenient terminator for the "pattern", so anything (e.g. , and )) in between could be considered special (just don't try to use a } in the arg list!).

OK, I see the idea now. However, that adds a third character that cannot be used in arguments.

One of my main concerns is that the argument portion of the proposed syntax doesn't necessarily look like it really is a set of arguments to the untrained eye. The %) maybe just about binds itself to the (, but none of it really binds itself to the %{name} part, so it just doesn't feel like the two halves of the whole expression have anything to do with each other.

I understand. I suffer from the lack of objectivity that comes from a "trained eye". That is, I stared at my syntax too long as it evolved into its current form.

To help ( and %) seem to bind better, we could change them to %( and %). To rationalize how the argument list then binds to %{name}, compare name(args) in a C-like programming language to %{name}%(args%). That is, name -> %{name}, ( -> %(, and ) -> %). Notice that % is used to introduce every special sequence.

I considered %( and %) early on, but it seemed more cluttered than technically necessary, so I decided to think of %{name}( as a single token instead. I don't care that much though, so I'm fine to switch to %( if people like it better.

I think I might have misremembered how these substitutions are handled in practice, but in essence, I imagined doing (effectively) re.search to find a substitution in the input string, and swapping it in, repeating the process for each substitution. As the %{name(arg1, arg2)} syntax wouldn't work for that, you'd have to do something different or extra for these cases. Perhaps it would be re.search for the %{name part and then find the next } manually to form the "matching" block.

It sound like you're trying to handle the call-in-args part of this, which requires a parser stack or recursiveExpansionLimit, as discussed above. In that regard, I see no difference between the syntax proposed in the current patch and the %{name(arg1, arg2)} syntax.

It may not be less complex in terms of overall code, but it seems like it would require a much simpler regex, therefore making the code perhaps easier to understand?

I guess that's in the eye of the beholder. Where possible, I tend to prefer formal definitions of syntax (like regexes) over ad-hoc implementations (like recursive-descent parsing), and I'm looking forward to a more powerful regex library in python. However, I'm open to reimplementing this as a recursive-descent parser now if that's more acceptable to everyone. Or maybe I should spend more time researching that new regex library to see if it's usable now somehow.

but the format including ( and ) in the name means this would at least be fairly easy to detect.

I don't know what you mean. Can you explain more?

I was thinking that you'd be able to distinguish the arg-based substitutions from regular ones, by identifying the presence of ( in the pattern, and then could decide to do a different thing for that substitution.

I might be misunderstanding your point, but I don't think that's an actual problem inherent in the syntax proposed in this patch.

The general substitution algorithm now is to iterate the list of substitutions and apply one at a time to the current RUN line. When arriving at a function substitution %{name} in the list, if we were to use recursive-descent parsing, I imagine it would then search for %{name}( in the RUN line. Upon finding it, it would then look for the closing %) while recursively balancing anything like %{looks-like-another-func}(args%) in the argument list. Regardless of whether %{looks-like-another-func}( is actually a function substitution, the matching %) binds to it for the sake of processing the arguments of %{name}. With your proposed syntax, I think the behavior would be analogous: regardless of whether %{looks-like-another-func( is actually a function substitution, the matching )} binds to it.

jdenny added a comment.EditedSep 23 2022, 11:24 AM

I guess that's in the eye of the beholder. Where possible, I tend to prefer formal definitions of syntax (like regexes) over ad-hoc implementations (like recursive-descent parsing), and I'm looking forward to a more powerful regex library in python. However, I'm open to reimplementing this as a recursive-descent parser now if that's more acceptable to everyone. Or maybe I should spend more time researching that new regex library to see if it's usable now somehow.

Arguing the other side, recursive-descent parsing can help to improve the diagnostics. That's relevant here: where a function substitution is used with unbalanced parentheses or the wrong number of arguments, the current patch just doesn't perform the substitution. The lit user experience would be better if lit instead terminated with a diagnostic.

The general substitution algorithm now is to iterate the list of substitutions and apply one at a time to the current RUN line. When arriving at a function substitution %{name} in the list, if we were to use recursive-descent parsing, I imagine it would then search for %{name}( in the RUN line. Upon finding it, it would then look for the closing %) while recursively balancing anything like %{looks-like-another-func}(args%) in the argument list. Regardless of whether %{looks-like-another-func}( is actually a function substitution, the matching %) binds to it for the sake of processing the arguments of %{name}. With your proposed syntax, I think the behavior would be analogous: regardless of whether %{looks-like-another-func( is actually a function substitution, the matching )} binds to it.

I just thought of a potentially problematic case, which is perhaps what you were after:

%{real-func}( %{not-a-func}(expr) %)

%{not-a-func} might be a substitution but not a function substitution, and there happens to be a following parenthesized expression that would remain after expansion. %{not-a-func}( can prevent %{real-func}(...%) from expanding because the one %) matches %{not-a-func}(.

That's not a big deal with the current patch because you just increase recursiveExpansionLimit, and %{real-func)(...%) will be sure to expand after %{not-a-func}. In general, the current patch can require increasing recursiveExpansionLimit when something that looks like a call appears in an argument.

However, if we move to recursive-descent parsing and diagnose unbalanced parentheses, then this case can fail on the first attempt at expanding %{real-func}(...%). I noticed your proposed syntax suffers from the same problem, but I think it's only for cases that are probably less apt to occur in practice:

%{real-func( %{not-a-func(foo)-bar} )}

With either syntax, one way to solve this problem is to, upon each potential function use like %{not-a-func}(, do a substitution lookup to see if it's a function substitution.

A simpler way is to move to the other syntax I mentioned in my last comment:

%{real-func}%( %{not-a-func}(expr) %)

That is, %(...%) would always indicate a function substitution use. If you don't want it to, escape it using %%. Lit could even diagnose leftover %(...%) that never matched a function substitution.

OK, I see the idea now. However, that adds a third character that cannot be used in arguments.

This is true. However, it seems like any solution has a restriction along these lines, unless you provide some form of escaping. In the current solution, you've effectively done this escaping via the use of %). I don't know whether this is a good idea, but in my suggestion, you could potentially use % to escape the special characters to actually appear as literals in the input arguments. However, I'm also inclined to say we don't need to try to solve this problem up-front necessarily.

To help ( and %) seem to bind better, we could change them to %( and %). To rationalize how the argument list then binds to %{name}, compare name(args) in a C-like programming language to %{name}%(args%). That is, name -> %{name}, ( -> %(, and ) -> %). Notice that % is used to introduce every special sequence.

I considered %( and %) early on, but it seemed more cluttered than technically necessary, so I decided to think of %{name}( as a single token instead. I don't care that much though, so I'm fine to switch to %( if people like it better.

I'm not sure that would help the situation. The full syntax would look like this: %{name}%(arg1, arg2%). By squinting, I can get the %( and %) to marry up, but I don't see the name half pairing up with the arg part intuitively. I don't know if this is necessarily a good idea overall, but I'd almost find it easier if the syntax were %{name}(arg1, arg2)% with the two % characters providing a "paired grouping" much like parentheses already do in typical computer programming. I think my problem is that the % characters in the current syntax aren't symmetrical about the midpoint of the syntax.

It sound like you're trying to handle the call-in-args part of this, which requires a parser stack or recursiveExpansionLimit, as discussed above. In that regard, I see no difference between the syntax proposed in the current patch and the %{name(arg1, arg2)} syntax.

Actually, I wasn't looking for nested functions: I think they are likely to be far less useful than function arguments, which I think are already likely to be relatively rare. I also think there's an argument for not adding too much complexity to the lit command language, because if you make the test script too complex, then it becomes impossible to figure out what the test is testing. I don't think regular function-like subsitutions fall into this situation, but if we start doing function calls in arg lists, then I think it might start becoming rather unintuitive.

It may not be less complex in terms of overall code, but it seems like it would require a much simpler regex, therefore making the code perhaps easier to understand?

I guess that's in the eye of the beholder. Where possible, I tend to prefer formal definitions of syntax (like regexes) over ad-hoc implementations (like recursive-descent parsing), and I'm looking forward to a more powerful regex library in python. However, I'm open to reimplementing this as a recursive-descent parser now if that's more acceptable to everyone. Or maybe I should spend more time researching that new regex library to see if it's usable now somehow.

I'm just one opinion, but I believe that one of the problems of regexes is that they require a decent knowledge of that particular language's regex language, since every variant is somewhat different, especially when getting into the more complicated functionality. Contrast that to a (hopefully simple) parser, if someone is reading this code, it's likely they already know python, so it should be fairly simple to follow.

OK, I see the idea now. However, that adds a third character that cannot be used in arguments.

This is true. However, it seems like any solution has a restriction along these lines, unless you provide some form of escaping. In the current solution, you've effectively done this escaping via the use of %).

Right, and that can be escaped as %%), reusing the existing %% substitution.

I don't know whether this is a good idea, but in my suggestion, you could potentially use % to escape the special characters to actually appear as literals in the input arguments. However, I'm also inclined to say we don't need to try to solve this problem up-front necessarily.

I disagree about the last sentence. In my downstream work, I often needed commas because they often appear in command-line arguments, and I sometimes need parens there too.

I think this is an example of what you're proposing:

DEFINE: %{check( CFLAGS, FCFLAGS )} = %clang %{CFLAGS} -o %t %s && %t | FileCheck %{FCFLAGS} %s
DEFINE: %{fcflags-common} = -DPTR='%(null%)'
RUN: %{check( -Wl%,-q, -check-prefixes=CHECK%,QUIET %{fcflags-common} )}

which would expand to:

RUN: %clang -Wl,-q -o %t %s && %t | FileCheck -check-prefixes=CHECK,QUIET -DPTR='(null)' %s

Here's why I find that proposal complicated:

  • Parens, comma, and }, which are normally inert text, would be special only within uses of function substitution argument lists. (I understand that, depending on how it would be parsed, } might not need to be special before the closing ), but the main point still stands.)
  • We need to introduce several additional lit substitutions to allow those characters to appear literally in arguments : %,, %(, %), and %}.
  • Where do we define those new substitutions? I argue they should be special substitutions that, similar to %%, expand after all other substitutions have been expanded recursiveExpansionLimit times. That way, substitutions like %{fcflags-common} can be reused outside argument lists in other RUN lines. Moreover, the test author doesn't have to worry the new substitutions will expand too early (before emerging in an argument list that's ready to be parsed) because he didn't think through substitution order, function uses hidden behind other substitutions, function uses within other functions' args, and how all that interacts with recursiveExpansionLimit.
  • Test authors have to adjust all existing substitution values that contain commas or parens if they start to also use those substitutions in the args of functions. For example, maybe %{fcflags-common} was actually defined already in a lit configuration file.

It seems to me that it's simpler for the test author and the lit developer if we just say % is meaningful to lit, and other text is inert, as in the following example:

DEFINE: %{check}%( CFLAGS %, FCFLAGS %) = %clang %{CFLAGS} -o %t %s && %t | FileCheck %{FCFLAGS} %s
DEFINE: %{fcflags-common} = -DPTR='(null)'
RUN: %{check}%( -Wl,-q %, -check-prefixes=CHECK,QUIET %{fcflags-common} %)

I'm not sure that would help the situation. The full syntax would look like this: %{name}%(arg1, arg2%). By squinting, I can get the %( and %) to marry up, but I don't see the name half pairing up with the arg part intuitively. I don't know if this is necessarily a good idea overall, but I'd almost find it easier if the syntax were %{name}(arg1, arg2)% with the two % characters providing a "paired grouping" much like parentheses already do in typical computer programming. I think my problem is that the % characters in the current syntax aren't symmetrical about the midpoint of the syntax.

For the record, the %if syntax already uses %{ and %}. That is, % consistently starts a special sequence.

I'm just one opinion, but I believe that one of the problems of regexes is that they require a decent knowledge of that particular language's regex language, since every variant is somewhat different, especially when getting into the more complicated functionality. Contrast that to a (hopefully simple) parser, if someone is reading this code, it's likely they already know python, so it should be fairly simple to follow.

OK, if no one else objects, let's agree to go with a recursive-descent parser instead of a single regex. That should address several issues with the current patch:

  • Some lit developers might find it easier to understand. If that's true for the majority of lit developers, then my original motivation for the single regex is defeated.
  • For invalid uses of function substitutions (e.g., wrong number of arguments, unclosed argument list), it can provide helpful diagnostics instead of just quietly failing to substitute.
  • It can handle function calls within function arguments without requiring recursiveExpansionLimit.
  • Function substitution entries in the substitution list (that is, config.substitutions) will not have the form (pattern, replacement) but instead will list parameter names so a recursive-descent parser can parse function substitution uses. For example, we might decide that any entry with more than two items would be interpreted as (func_name, param1, param2, ..., replacement). On the one hand, the substitution list will be more complicated. On the other hand, it means lit configuration files would immediately have a way to define function substitutions. The current patch doesn't provide one.

But let's converge on the previous syntax questions before I pursue the rewrite.

Sorry for the delayed response, and even now I'm not going to respond directly to your latest comments, as I have to get on with some high priority work that I need to finish ahead of a week off next week. I'll try to look back at this when I'm back.

It might be worth raising some of these points on the discourse thread and see if there's any further thoughts on them, as I don't feel confident the two of us represent a real consensus given the interest shown elsewhere!

Sorry for the delayed response, and even now I'm not going to respond directly to your latest comments, as I have to get on with some high priority work that I need to finish ahead of a week off next week. I'll try to look back at this when I'm back.

No worries. Thanks for the conversation so far. I should also focus on other things for the moment.

It might be worth raising some of these points on the discourse thread and see if there's any further thoughts on them, as I don't feel confident the two of us represent a real consensus given the interest shown elsewhere!

That sounds like a good plan. I'll try to pursue it soon.