This is an archive of the discontinued LLVM Phabricator instance.

Add regex support to file (-f) and module (-s) breakpoint options.
AbandonedPublic

Authored by hintonda on Oct 30 2017, 2:56 PM.

Details

Summary

Add regex support to file and module names when setting breakpoints.
This solution is enabled via a "regex:" prefix, and doesn't require any other
api changes.

In this example, setting breakpoints for a particular function was twice as
fast using "regex:.*/clang/.*":

br s -n "DiagnosticsEngine::Report" -f "regex:.*/clang/.*"

Event Timeline

hintonda created this revision.Oct 30 2017, 2:56 PM
hintonda retitled this revision from Add regex support to file (-f) and module (-s) breakpoints. to Add regex support to file (-f) and module (-s) breakpoint options..Oct 30 2017, 3:52 PM
zturner edited edge metadata.Oct 30 2017, 3:58 PM

Would you mind deleting and re-creating this revision with lldb-commits added as a subscriber? I don't think it's sufficient to just "add" it as a subscriber after the fact, I think it has to be done as part of the initial creation, or for some reason it won't show up on the mailing list.

Hmm, weird. Maybe it already is? Even though I didn't see it on the original email. Anyway, ignore my last suggestion.

On to actual comments: I'm not really crazy about the regex: prefix. Seems ugly and non-intuitive. Couldn't we just treat every filename as an argument to fnmatch (3)? Then you don't need any weird prefixes. Besides, regex:foo is a valid filename so if we're going to be making it so that some arguments to -f all of a sudden have to be escaped because they're no longer interpreted literally, then we might as well do it in a way that's intuitive and people are already familiar with.

Would you mind deleting and re-creating this revision with lldb-commits added as a subscriber? I don't think it's sufficient to just "add" it as a subscriber after the fact, I think it has to be done as part of the initial creation, or for some reason it won't show up on the mailing list.

Added lldb-commits to subscribers via arc, so I'm not sure why the initial email didn't go out -- always worked with (cfe|llvm)-commits.
But at least it's working now...

Hmm, weird. Maybe it already is? Even though I didn't see it on the original email. Anyway, ignore my last suggestion.

On to actual comments: I'm not really crazy about the regex: prefix. Seems ugly and non-intuitive. Couldn't we just treat every filename as an argument to fnmatch (3)? Then you don't need any weird prefixes. Besides, regex:foo is a valid filename so if we're going to be making it so that some arguments to -f all of a sudden have to be escaped because they're no longer interpreted literally, then we might as well do it in a way that's intuitive and people are already familiar with.

The problem with -f and -s is that they create FileSpec's, and FileSpec has some special handling for paths, so I'm not sure how to handle this without a prefix. Is there a delimiter that's invalid on all platforms? I thought ':' would work since we use it for path delimiters, but I've only got a Mac to test on right now. I suppose we could get really ugly and use a uri style prefix, e.g., "regex://".

In any case, I think it's worth exploring a solution that doesn't require a lot of api changes.

jingham requested changes to this revision.Oct 30 2017, 4:20 PM

I don't like this change.

First off, the whole point of having options in the commands is so that we don't have to have magic encodings in the values.

We also don't have FileSpec's that resolve to multiple files. What do "Exists", "IsExecutable", "GetPath" etc. mean in this context? There would have to be a really strong reason for making this kind of change. Just avoiding an extra option does not warrant this change so far as I can see.

This revision now requires changes to proceed.Oct 30 2017, 4:20 PM

Hmm, weird. Maybe it already is? Even though I didn't see it on the original email. Anyway, ignore my last suggestion.

On to actual comments: I'm not really crazy about the regex: prefix. Seems ugly and non-intuitive. Couldn't we just treat every filename as an argument to fnmatch (3)? Then you don't need any weird prefixes. Besides, regex:foo is a valid filename so if we're going to be making it so that some arguments to -f all of a sudden have to be escaped because they're no longer interpreted literally, then we might as well do it in a way that's intuitive and people are already familiar with.

The problem with -f and -s is that they create FileSpec's, and FileSpec has some special handling for paths, so I'm not sure how to handle this without a prefix. Is there a delimiter that's invalid on all platforms? I thought ':' would work since we use it for path delimiters, but I've only got a Mac to test on right now. I suppose we could get really ugly and use a uri style prefix, e.g., "regex://".

Couldn't they create a *list* of FileSpecs? Just glancing at the code for -f, it does this:

case 'f':
  m_filenames.AppendIfUnique(FileSpec(option_arg, false));
  break;

So, just resolve the expression at this point and append more than one FileSpec?

Zachary's suggestion is better than adding regex patterns to FileSpec, but I still don't like the idea of encoding option types in the option values.

Moreover, this doesn't really apply to the main use of -f - file & line breakpoints. The only plausible use for a source file filter for file & line breakpoints is when you want to restrict breaking on inlined code to a subset of the files that inline it. But in that case you would want a separate option, since you need to specify the inlined file = with the -f option - and the inlining files - with some other option.

Wouldn't it be better to add a --source-file-regex option, pick some free single letter, and use that if you want to provide a pattern for the "-p" breakpoint option. Then we could also use it when specifying some other file filter.

BTW, to Z's comment: you can't really resolve the regex pattern when you make the breakpoint. What if another library gets loaded later on, which has source files that match the source file pattern the user entered, and have source code that matches the -p pattern. The breakpoint should be updated to encompass those new locations, but you can't do that if you've already matched the pattern against the original list of files, and then thrown away the pattern.

This needs to be maintained as a pattern in the breakpoint, if you want it to be an actual pattern. But we shouldn't overload FileSpec for that purpose, IMO.

Zachary's suggestion is better than adding regex patterns to FileSpec, but I still don't like the idea of encoding option types in the option values.

Moreover, this doesn't really apply to the main use of -f - file & line breakpoints. The only plausible use for a source file filter for file & line breakpoints is when you want to restrict breaking on inlined code to a subset of the files that inline it. But in that case you would want a separate option, since you need to specify the inlined file = with the -f option - and the inlining files - with some other option.

Wouldn't it be better to add a --source-file-regex option, pick some free single letter, and use that if you want to provide a pattern for the "-p" breakpoint option. Then we could also use it when specifying some other file filter.

Actually now that you mention it, yea this option seems to only make sense when paired with -p. I actually explicitly don't like the idea of being able to set a breakpoint in a bunch of random files that share nothing in common other than some characters in the filename. I can't think of a reasonable use case for that.

Even combining it with -p, maybe the real solution here is to rename all-files to file-pattern. The only usage of --all-files is in conjunction with -p, so changing the name of the flag to file-pattern would be low impact. And not changing the name would mean you have one option which essentially has no real value, since --file-pattern=.* would be identical to --all-files, and there would be no value to ever specifying --all-files over --file-pattern

I don't like this change.

First off, the whole point of having options in the commands is so that we don't have to have magic encodings in the values.

We also don't have FileSpec's that resolve to multiple files. What do "Exists", "IsExecutable", "GetPath" etc. mean in this context? There would have to be a really strong reason for making this kind of change. Just avoiding an extra option does not warrant this change so far as I can see.

It seems that FileSpec already has a dual role, and can be either a search criteria (possibly representing multiple files) or a resolved file.

The former is used with the '-f' and '-s' parameters, and can easily match multiple files, e.g.: if I pass '-f "foo.cpp"', with no directory, it'll match every foo.cpp file included in the target or loaded module. I'd need to check, but I think 'Exists', 'IsExecutable', and 'GetPath' would still work as expected.

Hmm, weird. Maybe it already is? Even though I didn't see it on the original email. Anyway, ignore my last suggestion.

On to actual comments: I'm not really crazy about the regex: prefix. Seems ugly and non-intuitive. Couldn't we just treat every filename as an argument to fnmatch (3)? Then you don't need any weird prefixes. Besides, regex:foo is a valid filename so if we're going to be making it so that some arguments to -f all of a sudden have to be escaped because they're no longer interpreted literally, then we might as well do it in a way that's intuitive and people are already familiar with.

The problem with -f and -s is that they create FileSpec's, and FileSpec has some special handling for paths, so I'm not sure how to handle this without a prefix. Is there a delimiter that's invalid on all platforms? I thought ':' would work since we use it for path delimiters, but I've only got a Mac to test on right now. I suppose we could get really ugly and use a uri style prefix, e.g., "regex://".

Couldn't they create a *list* of FileSpecs? Just glancing at the code for -f, it does this:

case 'f':
  m_filenames.AppendIfUnique(FileSpec(option_arg, false));
  break;

So, just resolve the expression at this point and append more than one FileSpec?

Yes, passing multiple '-f' and/or '-s' options is supports, but tedious.

I don't like this change.

First off, the whole point of having options in the commands is so that we don't have to have magic encodings in the values.

We also don't have FileSpec's that resolve to multiple files. What do "Exists", "IsExecutable", "GetPath" etc. mean in this context? There would have to be a really strong reason for making this kind of change. Just avoiding an extra option does not warrant this change so far as I can see.

It seems that FileSpec already has a dual role, and can be either a search criteria (possibly representing multiple files) or a resolved file.

The former is used with the '-f' and '-s' parameters, and can easily match multiple files, e.g.: if I pass '-f "foo.cpp"', with no directory, it'll match every foo.cpp file included in the target or loaded module. I'd need to check, but I think 'Exists', 'IsExecutable', and 'GetPath' would still work as expected.

There's a big difference between "dir not specified" and a full on glob or regex matching.

I do think "all inlines of line 7 of my_cool_templates.h in some_file.cpp" is potentially useful. So adding a "source-pattern" and using it as a limiter for a search in much the same way --shlib is used seems generally useful to me. So it seems worthwhile adding this as its own option.

Whether we remove --all-files at the same time is a separate issue. I don't see much need to remove it. The most common cases for -p are the current file & all-files so having an accelerator for it (and one that will avoid running a needless regex compare against every filename) seems reasonable.

Zachary's suggestion is better than adding regex patterns to FileSpec, but I still don't like the idea of encoding option types in the option values.

Are you talking about fnmatch? Is that portable? If not, i can rewrite it,.

Moreover, this doesn't really apply to the main use of -f - file & line breakpoints. The only plausible use for a source file filter for file & line breakpoints is when you want to restrict breaking on inlined code to a subset of the files that inline it. But in that case you would want a separate option, since you need to specify the inlined file = with the -f option - and the inlining files - with some other option.

Wouldn't it be better to add a --source-file-regex option, pick some free single letter, and use that if you want to provide a pattern for the "-p" breakpoint option. Then we could also use it when specifying some other file filter.

This is valid for all breakpoints, not just those created via the '-p' option. I can gen up patch that add --source-file-regex to all breakpoints, but I also want to use it from python, so that api change might be problematic -- can we add additional default parameters without breaking ABI?

Btw, here's an example of how I want to use it -- cuts time to create breakpoint by over half, from 15 seconds down to 7 -- which makes my clangdiag module load much faster:

br s -n "DiagnosticsEngine::Report" -f "regex:.*/clang/.*"

BTW, to Z's comment: you can't really resolve the regex pattern when you make the breakpoint. What if another library gets loaded later on, which has source files that match the source file pattern the user entered, and have source code that matches the -p pattern. The breakpoint should be updated to encompass those new locations, but you can't do that if you've already matched the pattern against the original list of files, and then thrown away the pattern.

Isn't this how it already works with all breakpoints? If you load more modules, do more locations get automatically loaded? Sorry if that's a dumb question.

This needs to be maintained as a pattern in the breakpoint, if you want it to be an actual pattern. But we shouldn't overload FileSpec for that purpose, IMO.

It already is a pattern if used without a path.

Zachary's suggestion is better than adding regex patterns to FileSpec, but I still don't like the idea of encoding option types in the option values.

Are you talking about fnmatch? Is that portable? If not, i can rewrite it,.

Moreover, this doesn't really apply to the main use of -f - file & line breakpoints. The only plausible use for a source file filter for file & line breakpoints is when you want to restrict breaking on inlined code to a subset of the files that inline it. But in that case you would want a separate option, since you need to specify the inlined file = with the -f option - and the inlining files - with some other option.

Wouldn't it be better to add a --source-file-regex option, pick some free single letter, and use that if you want to provide a pattern for the "-p" breakpoint option. Then we could also use it when specifying some other file filter.

This is valid for all breakpoints, not just those created via the '-p' option. I can gen up patch that add --source-file-regex to all breakpoints, but I also want to use it from python, so that api change might be problematic -- can we add additional default parameters without breaking ABI?

Btw, here's an example of how I want to use it -- cuts time to create breakpoint by over half, from 15 seconds down to 7 -- which makes my clangdiag module load much faster:

br s -n "DiagnosticsEngine::Report" -f "regex:.*/clang/.*"

Since you already have a python module and are creating the breakpoints from inside of python, couldn't you just walk the directory tree yourself creating breakpoints manually on each file that matches the regex? It seems like one of those things that would get used once in a blue moon, and in my experience it's not worth adding the maintenance burden to the larger project for an option that someone is going to use once every couple of years, especially when you could just bake the logic into your application.

I don't like this change.

First off, the whole point of having options in the commands is so that we don't have to have magic encodings in the values.

We also don't have FileSpec's that resolve to multiple files. What do "Exists", "IsExecutable", "GetPath" etc. mean in this context? There would have to be a really strong reason for making this kind of change. Just avoiding an extra option does not warrant this change so far as I can see.

It seems that FileSpec already has a dual role, and can be either a search criteria (possibly representing multiple files) or a resolved file.

The former is used with the '-f' and '-s' parameters, and can easily match multiple files, e.g.: if I pass '-f "foo.cpp"', with no directory, it'll match every foo.cpp file included in the target or loaded module. I'd need to check, but I think 'Exists', 'IsExecutable', and 'GetPath' would still work as expected.

There's a big difference between "dir not specified" and a full on glob or regex matching.

Yes, which is what I'm trying to add. Granted, this might not be acceptable, but the searchfilters already work this way. Not sure how to enhance them to support this if we don't use FileSpec -- it's copy constructed in vector, so even thought it's passed around as a ref, we can't customize it -- at least without changing the vectors to take shared pointers.

BTW, to Z's comment: you can't really resolve the regex pattern when you make the breakpoint. What if another library gets loaded later on, which has source files that match the source file pattern the user entered, and have source code that matches the -p pattern. The breakpoint should be updated to encompass those new locations, but you can't do that if you've already matched the pattern against the original list of files, and then thrown away the pattern.

Isn't this how it already works with all breakpoints? If you load more modules, do more locations get automatically loaded? Sorry if that's a dumb question.

That's how it should work. Zachary was suggesting (though I don't think he is anymore) taking the pattern in as an argument, but then matching it against the files that matched the pattern at the time the breakpoint was set. That is NOT - as you correctly state - the way breakpoints are supposed to work. So if we want to make this a pattern, we have to store the pattern.

This needs to be maintained as a pattern in the breakpoint, if you want it to be an actual pattern. But we shouldn't overload FileSpec for that purpose, IMO.

It already is a pattern if used without a path.

Not really. If you don't supply a directory, I'm free to treat this as "the first one I find" if I want. A regex is explicitly multiple files.

Zachary's suggestion is better than adding regex patterns to FileSpec, but I still don't like the idea of encoding option types in the option values.

Are you talking about fnmatch? Is that portable? If not, i can rewrite it,.

Moreover, this doesn't really apply to the main use of -f - file & line breakpoints. The only plausible use for a source file filter for file & line breakpoints is when you want to restrict breaking on inlined code to a subset of the files that inline it. But in that case you would want a separate option, since you need to specify the inlined file = with the -f option - and the inlining files - with some other option.

Wouldn't it be better to add a --source-file-regex option, pick some free single letter, and use that if you want to provide a pattern for the "-p" breakpoint option. Then we could also use it when specifying some other file filter.

This is valid for all breakpoints, not just those created via the '-p' option. I can gen up patch that add --source-file-regex to all breakpoints, but I also want to use it from python, so that api change might be problematic -- can we add additional default parameters without breaking ABI?

Btw, here's an example of how I want to use it -- cuts time to create breakpoint by over half, from 15 seconds down to 7 -- which makes my clangdiag module load much faster:

br s -n "DiagnosticsEngine::Report" -f "regex:.*/clang/.*"

Since you already have a python module and are creating the breakpoints from inside of python, couldn't you just walk the directory tree yourself creating breakpoints manually on each file that matches the regex? It seems like one of those things that would get used once in a blue moon, and in my experience it's not worth adding the maintenance burden to the larger project for an option that someone is going to use once every couple of years, especially when you could just bake the logic into your application.

You wouldn't have to do quite that much work, you'd just have to scan the directory tree and make up an SBFileSpecList of the files you find, then use:

lldb::SBBreakpoint
BreakpointCreateBySourceRegex(const char *source_regex,
                              const SBFileSpecList &module_list,
                              const SBFileSpecList &source_file);

You still get one breakpoint, but a pretty straightforward way of doing it. That's why I originally suggested adding

SBFieSpecList SBTarget::FindFilesMatchingFilenameRegex(const char *regex);

then you could call this and pass the file list to your regex breakpoint.

As we said earlier, this won't work like a pattern if you get subsequent shared library loads. If we really need that then we do need a source expression option.

I'm not opposed to adding useful options for breakpoint creation, but at this point, I'd really like to make up an SBBreakpointSpecificationOptions, make an:

SBBreakpoint SBTarget::BreakpointCreateWithOptions(const SBBreakpointSpecificationOptions &spec_options, SBBreakpointOptions &options);

and not have to keep overloading the individual creator functions.

hintonda updated this revision to Diff 120933.Oct 30 2017, 9:00 PM
hintonda edited edge metadata.
  • Remove prefix and add options.
clayborg requested changes to this revision.Oct 31 2017, 9:49 AM

A few general things: don't modify FileSpec, we have many of these objects and we can't increase their size without directly affecting memory usage. FileSpec objects represent one file on disk, not multiple. We should make a new class that contains a FileSpec and a regex, but not put it into FileSpec.

include/lldb/Utility/FileSpec.h
96

This constructor seem dangerous as it might be confused with a path.

587

We shouldn't add something to FileSpec here. Now every FileSpec in LLDB will carry around an extra RegularExpression object as part of the object? Please remove. We should make a new class instead of adding something to FileSpec. FileSpec represents a single file on disk, not N files on disk.

source/Commands/CommandObjectBreakpoint.cpp
566 ↗(On Diff #120933)

We should probably have m_file_regexes (arrary) or just m_file_regex (one).

570 ↗(On Diff #120933)

We should probably have m_module_regexes (arrary) or just m_module_regex (one).

source/Utility/FileSpec.cpp
183–188

Revert. We can't add stuff to FileSpec because some piece of code wants to use it in a non-standard way. FileSpec represents one file.

188–190

Revert. We can't add stuff to FileSpec because some piece of code wants to use it in a non-standard way. FileSpec represents one file.

213

Revert

312–314

Revert

425–428

revert

This revision now requires changes to proceed.Oct 31 2017, 9:49 AM

A few general things: don't modify FileSpec, we have many of these objects and we can't increase their size without directly affecting memory usage. FileSpec objects represent one file on disk, not multiple. We should make a new class that contains a FileSpec and a regex, but not put it into FileSpec.

Will do...

Btw, if size is an issue, would it make sense to add a type to the enum declarations? e.g.:

diff --git a/include/lldb/Utility/FileSpec.h b/include/lldb/Utility/FileSpec.h
index 67926d01e..55d44d840 100644
--- a/include/lldb/Utility/FileSpec.h
+++ b/include/lldb/Utility/FileSpec.h
@@ -61,7 +61,7 @@ namespace lldb_private {
 //----------------------------------------------------------------------
 class FileSpec {
 public:
-  enum PathSyntax {
+  enum PathSyntax : unsigned char {
     ePathSyntaxPosix,
     ePathSyntaxWindows,
     ePathSyntaxHostNative

Actually, it wouldn't matter for this one due to alignment.

I was unhappy when we went over two pointers for a FileSpec when m_syntax was added due to the extra size. Anything we can do to make this smaller would be great, so the type on the enum would work, but as you say the alignment will nullify that. The two ConstString members contain a pointer which isn't aligned so we can't use any bits from the low end of the pointer. Are there any classes that take advantage of high bits in pointers? Most if not all OS's don't use the entire 64 bit address space... It would be great to get lldb_private::FileSpec down to just 2 pointers again.

I was unhappy when we went over two pointers for a FileSpec when m_syntax was added due to the extra size. Anything we can do to make this smaller would be great, so the type on the enum would work, but as you say the alignment will nullify that. The two ConstString members contain a pointer which isn't aligned so we can't use any bits from the low end of the pointer. Are there any classes that take advantage of high bits in pointers? Most if not all OS's don't use the entire 64 bit address space... It would be great to get lldb_private::FileSpec down to just 2 pointers again.

ConstString doesn't *currently* contain aligned pointers, but there's no reason we couldn't make it contain aligned pointers. Then we could use llvm::PointerUnion.

That said, I want to state again that I think this change is the wrong direction. I don't think we need this functionality in FileSpec, or even in another class. I think it is better served in the script.

I was unhappy when we went over two pointers for a FileSpec when m_syntax was added due to the extra size. Anything we can do to make this smaller would be great, so the type on the enum would work, but as you say the alignment will nullify that. The two ConstString members contain a pointer which isn't aligned so we can't use any bits from the low end of the pointer. Are there any classes that take advantage of high bits in pointers? Most if not all OS's don't use the entire 64 bit address space... It would be great to get lldb_private::FileSpec down to just 2 pointers again.

ConstString doesn't *currently* contain aligned pointers, but there's no reason we couldn't make it contain aligned pointers. Then we could use llvm::PointerUnion.

I would be fine with that.

That said, I want to state again that I think this change is the wrong direction. I don't think we need this functionality in FileSpec, or even in another class. I think it is better served in the script.

Agreed as well. I do like the idea of source path regexes, but only if we have a real need to add it to the API.

I was unhappy when we went over two pointers for a FileSpec when m_syntax was added due to the extra size. Anything we can do to make this smaller would be great, so the type on the enum would work, but as you say the alignment will nullify that. The two ConstString members contain a pointer which isn't aligned so we can't use any bits from the low end of the pointer. Are there any classes that take advantage of high bits in pointers? Most if not all OS's don't use the entire 64 bit address space... It would be great to get lldb_private::FileSpec down to just 2 pointers again.

ConstString doesn't *currently* contain aligned pointers, but there's no reason we couldn't make it contain aligned pointers. Then we could use llvm::PointerUnion.

I would be fine with that.

That said, I want to state again that I think this change is the wrong direction. I don't think we need this functionality in FileSpec, or even in another class. I think it is better served in the script.

Agreed as well. I do like the idea of source path regexes, but only if we have a real need to add it to the API.

I haven't had time to really look into this, but it seems that maintaining two independent strings, one for directory and one for basename, is just for convenience. We could easily keep it in a single string with an index to basename. When the directory portion is updated, e.g., resolved, etc., we just overwrite the string and adjust the index. And adjust accessors as needed.

I was unhappy when we went over two pointers for a FileSpec when m_syntax was added due to the extra size. Anything we can do to make this smaller would be great, so the type on the enum would work, but as you say the alignment will nullify that. The two ConstString members contain a pointer which isn't aligned so we can't use any bits from the low end of the pointer. Are there any classes that take advantage of high bits in pointers? Most if not all OS's don't use the entire 64 bit address space... It would be great to get lldb_private::FileSpec down to just 2 pointers again.

ConstString doesn't *currently* contain aligned pointers, but there's no reason we couldn't make it contain aligned pointers. Then we could use llvm::PointerUnion.

I would be fine with that.

That said, I want to state again that I think this change is the wrong direction. I don't think we need this functionality in FileSpec, or even in another class. I think it is better served in the script.

Agreed as well. I do like the idea of source path regexes, but only if we have a real need to add it to the API.

I haven't had time to really look into this, but it seems that maintaining two independent strings, one for directory and one for basename, is just for convenience. We could easily keep it in a single string with an index to basename. When the directory portion is updated, e.g., resolved, etc., we just overwrite the string and adjust the index. And adjust accessors as needed.

The reason it's two strings is for memory efficiency and de-duplication. Suppose you make FileSpec instances from foo/bar/baz/ and foo/bar/buzz. This gets separated into 4 instances of ConstString. foo/bar, baz, foo/bar, and buzz. Because ConstStrings are pooled, there's actually only 3 strings here. foo/bar, baz, and buzz.

It probably doesn't seem like a lot, but over the course of thousands and thousands of files (which debuggers often examine and which often share a common parent directory) this is a large memory savings.

The main reason for two strings is for searching efficiency. Most people don't set breakpoints using full paths, they give the basename:

(lldb) b main.c:12

When setting breakpoints is it very easy to search for matches by basename since this is what users usually type in. Easy to do full paths as well if needed. So we should not try to use a single string as it will adversely affect the speed of file and line breakpoints.

And yes, the memory savings are quite large as well when sharing directories.

I haven't had time to really look into this, but it seems that maintaining two independent strings, one for directory and one for basename, is just for convenience. We could easily keep it in a single string with an index to basename. When the directory portion is updated, e.g., resolved, etc., we just overwrite the string and adjust the index. And adjust accessors as needed.

The reason it's two strings is for memory efficiency and de-duplication. Suppose you make FileSpec instances from foo/bar/baz/ and foo/bar/buzz. This gets separated into 4 instances of ConstString. foo/bar, baz, foo/bar, and buzz. Because ConstStrings are pooled, there's actually only 3 strings here. foo/bar, baz, and buzz.

It probably doesn't seem like a lot, but over the course of thousands and thousands of files (which debuggers often examine and which often share a common parent directory) this is a large memory savings.

Ah, thanks for the explanation.

hintonda updated this revision to Diff 121214.Nov 1 2017, 5:21 PM
hintonda edited edge metadata.

Addressed @clayborg's comments.

  • Added FileSpec ctor that takes an explicit RegularExpression object when using FileSpec as a regex pattern.
  • Removed the RegularExpression member variable, added a flag, and create the regex on demand in operator==().
  • Added type to PathSyntax enum, and reordered member variables from, largest to smallest (evens thought the last three are now the same size) to make sure sizeof(FileSpec) doesn't grow with this change.
  • Removed implementation of special methods that had default behavior, and declared the default ctor '= default', which should reduce the maintenance burden.

Next steps:

  • Create SBRegularExpression and add ctor to SBFileSpec that takes it as parameter. That will allow this use case without any other changes:

    regex = lldb.SBRegularExpression(".*/some/path/.*") target.BreakpointCreateBySourceRegex(name, lldb.SBFileSpec(regex))
  • Then wire up SBFileSpec to do the right thing. SBRegularExpression can validate the expression on construction to give immediate feedback.
hintonda marked 6 inline comments as done.Nov 1 2017, 5:23 PM
zturner added inline comments.Nov 1 2017, 5:30 PM
include/lldb/Utility/FileSpec.h
65–69

This is actually a very nice change, as it reduces the size of FileSpec by a couple of bytes. I think you can submit this change as a one-liner by itself, independent of this patch.

586–587

I thought in previous comments we had decided that this wasn't really the right direction, and that FileSpec should represent one file. If we want this functionality in LLDB (and again, I'm not convinced), it should be done in such a way that the FileSpec class remains unmodified. We should not have to touch this class for any of this.

labath edited edge metadata.Nov 1 2017, 5:32 PM

In case there isn't enough people disagreeing, I'd like to say that I also think this does not belong in FileSpec. If for nothing else, then for the FileSpec::Equal implementation: the semantics of the IsEqual method are complicated enough to follow even without having regexes. Now it seems that in case of regexes the meaning of a == b and b == a is completely different. Also, what are you going to do if both FileSpecs are regexes? What if one of them is a *case-insensitive* regex, while the other is a regular one? ...

jingham requested changes to this revision.Nov 1 2017, 5:34 PM

IIUC, Zachary, Greg and I have all said we don't think adding regular expressions patterns to FileSpec is a good idea. I'm not sure why you are pursuing this avenue.

Of course, if this were likely to get accepted, you'd have to add tests for it since this is totally new behavior. But I almost don't want to bring that up since I don't want you to spend your time going off writing tests for a direction that we are unlikely to accept...

This revision now requires changes to proceed.Nov 1 2017, 5:34 PM
hintonda added inline comments.Nov 1 2017, 5:38 PM
include/lldb/Utility/FileSpec.h
65–69

I suppose it depends on you compiler/OS, but this by it self doesn't change the size of FileSpec at all -- just changes the padding from 3 to 6. It's still the size of 3 pointers due to alignment -- at least that's my understanding.

However, if you did have a way to encode this stuff into the two existing pointers, it might help -- you still need at least 4 bits if I'm not mistaken.

586–587

I actually spent quite a bit of time trying to do it the other way, i.e., not touching FileSpec, but the diff got so big, I decided it was probably a mistake. I still have it in a local branch, but it's not ready to commit.

hintonda abandoned this revision.Nov 1 2017, 6:00 PM

Okay, got the message.

Sorry for the noise.

zturner added inline comments.Nov 1 2017, 6:15 PM
include/lldb/Utility/FileSpec.h
65–69

It's possible for sizeof(int) to be equal to the size of a pointer. This happens always when building x86, but it can happen on x64 too. For example. Note that it returns 12. If you remove the specification of the underlying type, it returns 16 instead.

hintonda added inline comments.Nov 1 2017, 8:54 PM
include/lldb/Utility/FileSpec.h
65–69

You're absolutely correct. I don't do much on 32 bit, but this would definitely help when compiling with -m32.

Good catch...

hintonda updated this revision to Diff 121350.Nov 2 2017, 1:15 PM
hintonda edited edge metadata.

I realize this probably isn't an acceptable change, but it was a good
learing experience for me, as I'd never looked at the lldb code before
last Sunday, so I figured I'd go ahead and complete it.

While the basic technique remains the same, this version just adds a
single enum value, ePathSyntaxRegex, and modifies PathSyntaxIsPosix()
to treat it as a posix path so it doesn't get modified.

The logic for testing the regex was moved to
FileSpecList::FindFileIndex() which does the filtering and already
deals with equality issues, e.g., whether or not to include the
directory in the match.

Finally, the SB API was modified to use accept a new regex flag, and a
test was added.

Thanks again for all your feedback and suggestions...

I was unhappy when we went over two pointers for a FileSpec when m_syntax was added due to the extra size. Anything we can do to make this smaller would be great, so the type on the enum would work, but as you say the alignment will nullify that. The two ConstString members contain a pointer which isn't aligned so we can't use any bits from the low end of the pointer. Are there any classes that take advantage of high bits in pointers? Most if not all OS's don't use the entire 64 bit address space... It would be great to get lldb_private::FileSpec down to just 2 pointers again.

Although completely unrelated to this patch, getting FileSpec back down to 2 pointers is easily done.

There are several ways to do this -- either internal to FileSpec itself or made available more generally. Depends on what you want -- all changes involve compromises.

labath resigned from this revision.May 8 2018, 8:55 AM
hintonda abandoned this revision.Mar 29 2019, 2:07 PM