This is an archive of the discontinued LLVM Phabricator instance.

[ELF] - Use std::regex instead of hand written logic in elf::globMatch()
ClosedPublic

Authored by grimar on Aug 24 2016, 2:40 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

grimar updated this revision to Diff 69090.Aug 24 2016, 2:40 AM
grimar retitled this revision from to [ELF] - Use std::regex instead of hand written logic in elf::globMatch().
grimar updated this object.
grimar added reviewers: ruiu, rafael.
grimar added subscribers: llvm-commits, grimar, davide, evgeny777.
ruiu added inline comments.Aug 24 2016, 2:58 AM
ELF/Config.h
47–49 ↗(On Diff #69090)

I wouldn't think that this class is designed well. Name and Wildcard are mutually exclusive, and if hasWildcard(Sym.Name) is false, you need to use Sym.Wildcard. That's not easy to understand.

49 ↗(On Diff #69090)

This should be named Re.

ELF/LinkerScript.cpp
130–132 ↗(On Diff #69090)

Isn't this slow?

ELF/Strings.cpp
33 ↗(On Diff #69090)

This converts "[*]" to "[.*]" which is wrong.

46 ↗(On Diff #69090)

You need to remove this function. This is not even a "glob" match anymore.

grimar added inline comments.Aug 24 2016, 3:09 AM
ELF/Config.h
47–49 ↗(On Diff #69090)

That is because I am trying not to use regex matching when it is unnessecary, like in next case:
SymbolTable.cpp(linke 646)

SymbolBody *B = Sym.IsExternCpp ? Demangled[Sym.Name] : find(Sym.Name);

I can probably introduce some helper method hiding this logic.

ELF/Strings.cpp
33 ↗(On Diff #69090)

Before this patch we did not have support of "[" or "]". This patch does not add any testcases as well as does not add support for them.
I am assuming that testcases and support for new wildcard characters should be added separately.

grimar added inline comments.Aug 24 2016, 3:33 AM
ELF/Strings.cpp
33 ↗(On Diff #69090)

The real problem here is that is converts ".foo" to ".foo" (unchanged), what is incorrect. I`ll add testcase and fix that.

grimar added inline comments.Aug 24 2016, 4:05 AM
ELF/LinkerScript.cpp
130–132 ↗(On Diff #69090)

I think this should not be slow. Even FreeBSD script
which I think is from largest group there are ~150 input sections pattern in total.
That is not huge amount itself.
I think 99% of them starts from "." (a dot, not a any character, like ".data" will have pattern "\.data")
And matching should end on first symbol comparsion for most of them ('C' vs '.').
Am I missing something here ?

grimar updated this revision to Diff 69095.Aug 24 2016, 4:26 AM
  • Fixed bug (".data" expression was generated from ".data" instead of "\.data"), added testcase.
  • Addressed review comments.
ELF/Config.h
47–49 ↗(On Diff #69090)

I think we do not need that new field at all. It can be created on demand since only used in scanVersionScript(). I removed it.

ELF/Strings.cpp
46 ↗(On Diff #69090)

Done.

ruiu added inline comments.Aug 29 2016, 4:40 PM
ELF/LinkerScript.cpp
836–841 ↗(On Diff #69095)

This function can return a std::regex instead of a vector of std::regexs by concatenating regexs using |.

ELF/Strings.cpp
29–30 ↗(On Diff #69095)

This is not a state machine.

ELF/SymbolTable.cpp
486 ↗(On Diff #69095)

clang-format

ELF/SymbolTable.h
94 ↗(On Diff #69095)

clang-format

Our software requirements include that GCC should be >= 4.7.0: http://llvm.org/docs/GettingStarted.html#software

However, std::regex wasn't implemented until GCC 4.9.0

Could you use llvm::Regex instead?

grimar updated this revision to Diff 69675.Aug 30 2016, 6:52 AM
  • Addressed review comments.
  • Switched to llvm::Regex
grimar added inline comments.Aug 30 2016, 6:53 AM
ELF/LinkerScript.cpp
836–841 ↗(On Diff #69095)

"|" does not work with std::regex::basic, because not a part of basic regular expressions syntax (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05).

I guess removing std::regex::basic should be slower while there is no real profit.

105 ↗(On Diff #69675)

I dont like this and other const casts I had to use,
D24027 should help to fix that.

ELF/Strings.cpp
29–30 ↗(On Diff #69095)

Hmm technically it can be I think.

Like from initial state basing on input it do a transition to state N,
and each state N returns the machine to initial state after doing its job.

I changed comment anyways :)

ruiu added inline comments.Aug 30 2016, 12:39 PM
ELF/LinkerScript.cpp
95 ↗(On Diff #69675)

Nit: we probably should use Re as a local variable name throughout this patch instead of Pat, Pattern nor Patterns because it is no longer a glob pattern (which Pat was named after) but a regular expression.

847–852 ↗(On Diff #69675)

It shouldn't be slow. I did not actually take a look at the implementation, but at least in theory, any regular expression that we use can be compiled to a DFA because we do not use anything that requires NFA such as capturing. If you can't use | because you are using the basic re, you shouldn't use the basic re but instead extended one.

ELF/LinkerScript.h
97 ↗(On Diff #69675)

Why unique_ptr?

132 ↗(On Diff #69675)

Why pointers?

ELF/Strings.h
19 ↗(On Diff #69675)

Let's name this compileGlobPattern because it complies a glob pattern to a re.

grimar updated this revision to Diff 69826.Aug 31 2016, 4:47 AM
  • Addressed review comments.
grimar added inline comments.Aug 31 2016, 4:48 AM
ELF/LinkerScript.cpp
95 ↗(On Diff #69675)

Done.

847–852 ↗(On Diff #69675)

Ok, I see.
Found one more problem with that unfortunatelly:

I can't just switch current:

std::vector<llvm::Regex> ExcludedFiles;
std::vector<llvm::Regex> SectionPatterns;

to

llvm::Regex ExcludedFiles;
llvm::Regex SectionPatterns;

Because as I mantioned above, llvm::Regex does not have constructor with no arguments.
Not sure we what we should do (I guess anyways we will switch to use of std::regex one day probably), so possible solutions are
either not to use "|" or to make them unique_ptr.
What do you think ?

ELF/LinkerScript.h
97 ↗(On Diff #69675)

In compare with std::regex, llvm::Regex does not have constructor with no arguments.
Removed unique_ptr, added argument for InputSectionDescription instead.

132 ↗(On Diff #69675)

Restriction of API. llvm::Regex is non-copyable.
That looks to be intended:
r198334 "Make llvm::Regex non-copyable but movable."

So I cant just copy kept patterns into KeptSections.

106 ↗(On Diff #69826)

Not sure, do we want to rename SectionPatterns and FilePattern to something then ?

ruiu added inline comments.Aug 31 2016, 7:27 AM
ELF/LinkerScript.cpp
907–912 ↗(On Diff #69826)

Why don't you add a constructor with no arguments to llvm::Regex?

grimar added inline comments.Aug 31 2016, 8:26 AM
ELF/LinkerScript.cpp
907–912 ↗(On Diff #69826)

OK, I can do that. What do you think about const_casts this patch introduces ?
We can either avoid using const llvm::Regex references to be able to use non-const match or reimplement llvm::Regex to be able to make match() const.

grimar added inline comments.Aug 31 2016, 8:32 AM
ELF/LinkerScript.cpp
907–912 ↗(On Diff #69826)

One more variant is to leave const_casts as is for now.

ruiu edited edge metadata.Aug 31 2016, 9:19 AM

llvm::Regex is not used in that many places, so it seems you can make Match const and update all places where that member function is called.

In D23829#530459, @ruiu wrote:

llvm::Regex is not used in that many places, so it seems you can make Match const and update all places where that member function is called.

Actually it is. Just search "Regex" in llvm. I wanted to do it const here: D24027. It requires making error field to be mutable.

David Blaikie was against this change:

"Sounds like the error handling is part of the identity of the object - I'm not sure it'd be a good idea to make that mutable & make the object logically const when it's actually changing that error field. (it seems it'd be easy to introduce misuse there - where one user does something to the Regex, then passes a const ref somewhere, then examines the error field and finds it changed - let alone multithreading)"

His suggestion was:

"(or remove the side-channel error handling, and have the APIs return errors explicitly instead - the side channel's likely to be accidentally unchecked by many callers, which is probably a mistake)"

Reworking llvm::Regex API is not a little change and I am not sure it is worth that as std::regex is a part of c++ 11 and I hope soon all compilers will have a support.

grimar edited edge metadata.Aug 31 2016, 10:04 AM
grimar added a subscriber: dblaikie.
ruiu added a comment.Aug 31 2016, 10:59 AM

How about defining a copy constructor to llvm::Regex and create mutable copies from const Regexs? Regex seems to be just two-pointer-sized object so copying it should be cheap.

In D23829#530626, @ruiu wrote:

How about defining a copy constructor to llvm::Regex and create mutable copies from const Regexs? Regex seems to be just two-pointer-sized object so copying it should be cheap.

And wrap llvm_regex *preg into something like std::shared_ptr with custom deleter ?

grimar added a comment.EditedAug 31 2016, 12:10 PM
In D23829#530626, @ruiu wrote:

How about defining a copy constructor to llvm::Regex and create mutable copies from const Regexs? Regex seems to be just two-pointer-sized object so copying it should be cheap.

And wrap llvm_regex *preg into something like std::shared_ptr with custom deleter ?

I did not mean we should really do that, only that it is not that simple just to copy it properly.
Probably we can remove the set of error field from match() at all.

It has next code:

if (rc != 0) {
  // regexec can fail due to invalid pattern or running out of memory.
  error = rc;
  return false;
}

I do not think that set a srror in case of invalid pattern is fine. As we might want to call it second time with valid one.
Out of memory probably is the error we can ignore.
Method anyways return bool result, that should be enough.

grimar abandoned this revision.Aug 31 2016, 1:21 PM

Closing this as r280284 was committed instead.

grimar reclaimed this revision.Aug 31 2016, 1:21 PM

Closed by mistake, sorry.

grimar updated this revision to Diff 69983.Sep 1 2016, 4:20 AM
  • Updated to use "|" for regex concatenations.
ruiu added inline comments.Sep 1 2016, 2:17 PM
ELF/LinkerScript.cpp
96–97 ↗(On Diff #69983)
if (Re->match(S->getSectionName()))
ELF/LinkerScript.h
101 ↗(On Diff #69983)

Please do not use the implicit conversion from an object of type T to ArrayRef<T>. That confused me. Add {} around FilePattern. You also want to rename compileGlobPatterns since now it takes not a single glob pattern but many patterns.

107 ↗(On Diff #69983)

I'd rename SectionRe and FileRe.

ELF/Strings.cpp
31 ↗(On Diff #69983)

There's a room to simplify this function. First , define a function that compiles single glob pattern.

static std::string toRegex(StringRef S);

and then use it as

Regex elf::compileGlobPatterns(ArrayRef<StringRef> V) {
  std::string T = "^(" + toRegex(V[0]);
  for (StringRef S : V.slice(1))
    T += "|" + toRegex(S);
  return Regex(T + ")$");
}
43 ↗(On Diff #69983)

. is not the only one metacharacter we want to quote. For example, if it contains \, it needs to be converted to \\. Did you do that?

grimar added inline comments.Sep 2 2016, 2:02 AM
ELF/LinkerScript.h
101 ↗(On Diff #69983)

Ok, I just was unsure what is the best way to avoid that. Was thinking about separate method for single pattern compilation, unfortunately it needs to be exported so rejected that way.
Approach with {} should work and looks evident enough, will use it.

grimar updated this revision to Diff 70167.EditedSep 2 2016, 8:34 AM
  • Addressed review comments.
  • Returned const_casts for Regex.
ELF/LinkerScript.cpp
96–97 ↗(On Diff #69983)

Done.

ELF/LinkerScript.h
107 ↗(On Diff #69983)

Done.

ELF/Strings.cpp
31 ↗(On Diff #69983)

Done.

43 ↗(On Diff #69983)

Nope. Looking on the list of metacharacters I dont think I ever saw them in globs. We also do not have testcases with them yet. Do you think this can be done in separate patch ?

ruiu added inline comments.Sep 2 2016, 8:43 AM
ELF/Strings.cpp
42 ↗(On Diff #70167)

Are you sure? What will happen if you have this?

SECTIONS { .text : { "foo(bar*).o"( .text ); } }

where "foo(bar*).o" is an actual file name.

grimar added inline comments.Sep 2 2016, 9:22 AM
ELF/Strings.cpp
42 ↗(On Diff #70167)

I did not say it is not possible. I meant it is so unusual that we even do not have testcases covering that. And sutuation is unlikely to happen, that is why I suggested to prepare testcases and separate patch for that.
From your comments it seems you would like to see it in this patch. Ok, I'll do that.

grimar updated this revision to Diff 70210.Sep 2 2016, 12:49 PM
  • Addressed review comments.
grimar added inline comments.Sep 2 2016, 12:51 PM
ELF/Strings.cpp
32 ↗(On Diff #70210)

Not supported because there is D23803 for that.

grimar added inline comments.Sep 2 2016, 12:54 PM
test/ELF/wildcards2.s
18 ↗(On Diff #70210)

It is not possible to create file with "|" "/" or "\" in name, so this is not tested here.

ruiu added inline comments.Sep 2 2016, 1:08 PM
ELF/Strings.cpp
35–36 ↗(On Diff #70210)
for (char F : S)

By the way, why F?

41 ↗(On Diff #70210)

StringRef(".+^${}()|/\\[]").find_first_of(F) is probably more conventional.

42 ↗(On Diff #70210)

What is this std::string() for?

test/ELF/wildcards2.s
18 ↗(On Diff #70210)

You don't need to test this this vigorously. It is too easy to break on minor filesystems. Add a test just for ..

grimar added inline comments.Sep 2 2016, 1:20 PM
ELF/Strings.cpp
35–36 ↗(On Diff #70210)

Front. Do you prefer other name ?

42 ↗(On Diff #70210)

"\\" is a pointer to char.
+ F just moves that pointer, but I need concatenation of string.

grimar added inline comments.Sep 2 2016, 1:28 PM
ELF/Strings.cpp
42 ↗(On Diff #70210)

I mean if I would just write

T += "\\" + F;

test/ELF/wildcards2.s
18 ↗(On Diff #70210)

Done.

grimar added inline comments.Sep 2 2016, 1:29 PM
test/ELF/wildcards2.s
18 ↗(On Diff #70210)

Posted this comment too early. Will be done in updated diff I mean.

grimar updated this revision to Diff 70223.Sep 2 2016, 1:33 PM
  • Addressed review comments.
test/ELF/wildcards2.s
18 ↗(On Diff #70210)

So, removed the last added testcase because first one already do check for dot.

ruiu added a comment.Sep 2 2016, 1:35 PM

Please ping me after all depending patches are submitted.

grimar added a comment.Sep 2 2016, 1:37 PM
In D23829#533376, @ruiu wrote:

Please ping me after all depending patches are submitted.

They are already. D24102 looks will not be accepted and the latest diffs of current patch does not require it (that is why it has const_casts for Regex).

ruiu added inline comments.Sep 2 2016, 1:48 PM
ELF/LinkerScript.cpp
103–105 ↗(On Diff #70223)

This is the same as

return const_cast<Regex &>(Desc->FileRe).match(Filename) &&
       !const_cast<Regex &>(Desc->ExcludedFiles).match(Filename);
ELF/LinkerScript.h
101 ↗(On Diff #70223)

compileGlobPatterns (notice the last s.)

106 ↗(On Diff #70223)

Maybe this should be ExcludedFileRe.

ELF/Strings.cpp
35–36 ↗(On Diff #70223)

I think we usually use C for a character.

50 ↗(On Diff #70223)

This is not a method but just a function. It doesn't convert into a regex form, but converts into a regex object. It always takes multiple glob patterns (it's just that the array could contain just one element).

ELF/Strings.h
19 ↗(On Diff #70223)

S doesn't match with the actual parameter name.

grimar updated this revision to Diff 70230.Sep 2 2016, 1:59 PM
  • Addressed review comments.
ELF/LinkerScript.cpp
103–105 ↗(On Diff #70223)

Done.

ELF/LinkerScript.h
101 ↗(On Diff #70223)

Forget about "s", sorry. Fixed.

106 ↗(On Diff #70223)

Done.

ELF/Strings.cpp
35–36 ↗(On Diff #70223)

Done.

50 ↗(On Diff #70223)

Done.

ELF/Strings.h
19 ↗(On Diff #70223)

Done.

ruiu accepted this revision.Sep 2 2016, 2:06 PM
ruiu edited edge metadata.

LGTM with all these changes.

ELF/Strings.cpp
51 ↗(On Diff #70230)

Why P? I believe we usually use V or Arr for an ArrayRef.

ELF/SymbolTable.cpp
486 ↗(On Diff #70230)

given regex.

ELF/SymbolTable.h
95 ↗(On Diff #70230)

Pattern -> Re (Please be consistent with naming.)

This revision is now accepted and ready to land.Sep 2 2016, 2:06 PM
grimar added a comment.Sep 2 2016, 2:08 PM

Thanks for review !

ELF/Strings.cpp
51 ↗(On Diff #70230)

Because Patterns. Actually choosed between V and P. Will switch to P then.

ruiu added inline comments.Sep 2 2016, 2:11 PM
ELF/Strings.cpp
51 ↗(On Diff #70230)

Actually I found that it is easy to read if you name variables of a small scope according to it's type rather than what the variable has, so please stick with V or Arr.

ruiu added inline comments.Sep 2 2016, 2:13 PM
ELF/Strings.cpp
51 ↗(On Diff #70230)

We know that the argument is an array of glob patterns. Because the function name and the comment says so. But if you name it P, it'd make everybody wonder what P is. At least, it doesn't make much sense to me.

This revision was automatically updated to reflect the committed changes.