This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Support Python imports relative the to the current file being sourced
ClosedPublic

Authored by JDevlieghere on Oct 13 2020, 11:04 AM.

Details

Summary

Make it possible to use a relative path in command script import to the location of the file being sourced. This allows the user to put Python scripts next to LLDB command files and importing them without having to specify an absolute path.

rdar://68310384

Diff Detail

Event Timeline

JDevlieghere requested review of this revision.Oct 13 2020, 11:04 AM
JDevlieghere created this revision.

Extend test case to show that this works with files sourcing other files.

Would you mind adding a couple tests for imports via a path to a python file, ex command script import command.py, maybe even a test that checks nested directories, ex: command script import path/to/command.py?

kastiglione added inline comments.Oct 13 2020, 12:24 PM
lldb/include/lldb/Interpreter/CommandInterpreter.h
644

if not too large, seems like it could be done in this change.

Fix edge cases and add more tests

JDevlieghere added inline comments.Oct 13 2020, 10:28 PM
lldb/include/lldb/Interpreter/CommandInterpreter.h
644

Fixing this requires adding a new variable which is orthogonal to this change. I prefer to keep it separate.

Would you mind adding a couple tests for imports via a path to a python file, ex command script import command.py, maybe even a test that checks nested directories, ex: command script import path/to/command.py?

This was a great suggestion and uncovered some unhandled edge cases. Thanks!

Remove bogus files

The main question on my mind is should we be adding the directory of the python file to the path (which is what I believe is happening now), or if we should add the directory of the command file that is being sourced (and adjust the way we import the file). The main impact of this is how will the imported module "see" itself (what will it's name be), and how it will be able to import other modules.

Imagine the user has the following (reasonable, I think) file structure.

$ROOT/utils/consult_oracle.py
$ROOT/automatic_bug_finder/main.py # uses consult_oracle.py
$ROOT/awesome_backtrace_analyzer/main.py # uses consult_oracle.py
$ROOT/install_super_scripts.lldbinit # calls command script import awesome_backtrace_analyzer/main.py

If "command script import awesome_backtrace_analyzer/main.py" ends up adding $ROOT/awesome_backtrace_analyzer to the path, then this module will have a hard time importing the modules it depends on (it would either have to use weird relative imports, or mess with sys.path itself. If we add just $ROOT then it could simply import utils.consult_oracle.

I think setting the import path to $ROOT would actually make the sys.path manipulation serve some useful purpose (and also reduce the number of sys.path entries we add). If, on the other hand, we are not interested making cross-module imports work "out of the box" (like, we could say that it's the responsibility of individual modules to ensure that), we could also try to import the file without messing with sys.path at all (https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path gives one way to do that).

lldb/include/lldb/Interpreter/CommandInterpreter.h
554

A comment would be very useful here.

The main question on my mind is should we be adding the directory of the python file to the path (which is what I believe is happening now), or if we should add the directory of the command file that is being sourced (and adjust the way we import the file). The main impact of this is how will the imported module "see" itself (what will it's name be), and how it will be able to import other modules.

Imagine the user has the following (reasonable, I think) file structure.

$ROOT/utils/consult_oracle.py
$ROOT/automatic_bug_finder/main.py # uses consult_oracle.py
$ROOT/awesome_backtrace_analyzer/main.py # uses consult_oracle.py
$ROOT/install_super_scripts.lldbinit # calls command script import awesome_backtrace_analyzer/main.py

If "command script import awesome_backtrace_analyzer/main.py" ends up adding $ROOT/awesome_backtrace_analyzer to the path, then this module will have a hard time importing the modules it depends on (it would either have to use weird relative imports, or mess with sys.path itself. If we add just $ROOT then it could simply import utils.consult_oracle.

I guess then the user should have called command script import awesome_backtrace_analyzer to import the package rather than the main.py inside it. But I get your point. FWIW just adding the $ROOT is how I did the original implementation before adding the tests for the nested directories and .py files that Dave suggested. It would solve this issues but then doesn't support those scenarios. I don't know if it would be less confusing that you can't pass a relative path to a .py file or that you can't import another module as you described.

I think setting the import path to $ROOT would actually make the sys.path manipulation serve some useful purpose (and also reduce the number of sys.path entries we add). If, on the other hand, we are not interested making cross-module imports work "out of the box" (like, we could say that it's the responsibility of individual modules to ensure that), we could also try to import the file without messing with sys.path at all (https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path gives one way to do that).

I would prefer this approach if it didn't require to name the module ourself. Any heuristic will have the risk of being ambitious as well (which is probably why the API makes you specify the module name).

The main question on my mind is should we be adding the directory of the python file to the path (which is what I believe is happening now), or if we should add the directory of the command file that is being sourced (and adjust the way we import the file). The main impact of this is how will the imported module "see" itself (what will it's name be), and how it will be able to import other modules.

Imagine the user has the following (reasonable, I think) file structure.

$ROOT/utils/consult_oracle.py
$ROOT/automatic_bug_finder/main.py # uses consult_oracle.py
$ROOT/awesome_backtrace_analyzer/main.py # uses consult_oracle.py
$ROOT/install_super_scripts.lldbinit # calls command script import awesome_backtrace_analyzer/main.py

If "command script import awesome_backtrace_analyzer/main.py" ends up adding $ROOT/awesome_backtrace_analyzer to the path, then this module will have a hard time importing the modules it depends on (it would either have to use weird relative imports, or mess with sys.path itself. If we add just $ROOT then it could simply import utils.consult_oracle.

I guess then the user should have called command script import awesome_backtrace_analyzer to import the package rather than the main.py inside it. But I get your point. FWIW just adding the $ROOT is how I did the original implementation before adding the tests for the nested directories and .py files that Dave suggested. It would solve this issues but then doesn't support those scenarios. I don't know if it would be less confusing that you can't pass a relative path to a .py file or that you can't import another module as you described.

I don't think the two options are mutually exclusive. I'm pretty sure this is just a limitation of our current importing code, which could be fixed to import awesome_backtrace_analyzer/main.py as awesome_backtrace_analyzer.main like it would be from python.

I think setting the import path to $ROOT would actually make the sys.path manipulation serve some useful purpose (and also reduce the number of sys.path entries we add). If, on the other hand, we are not interested making cross-module imports work "out of the box" (like, we could say that it's the responsibility of individual modules to ensure that), we could also try to import the file without messing with sys.path at all (https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path gives one way to do that).

I would prefer this approach if it didn't require to name the module ourself. Any heuristic will have the risk of being ambitious as well (which is probably why the API makes you specify the module name).

(I assume you meant ambiguous, not ambitious :P)

Well... yes, if we do the simplest thing of naming the "module" according to the file basename then it will be ambiguous. But I would say that even _that_ is better than what we do now, because it avoids funky interactions between all the sys.path entries that we're adding -- e.g. a random file in the same directory as one of the files user imported becoming visible to python import machinery, and shadowing some other real module. It also gives us the option to do something about that ambiguity -- we could add numerical suffixes to the imported names (and have the import command print the name it used) or whatever...

The main question on my mind is should we be adding the directory of the python file to the path (which is what I believe is happening now), or if we should add the directory of the command file that is being sourced (and adjust the way we import the file). The main impact of this is how will the imported module "see" itself (what will it's name be), and how it will be able to import other modules.

Imagine the user has the following (reasonable, I think) file structure.

$ROOT/utils/consult_oracle.py
$ROOT/automatic_bug_finder/main.py # uses consult_oracle.py
$ROOT/awesome_backtrace_analyzer/main.py # uses consult_oracle.py
$ROOT/install_super_scripts.lldbinit # calls command script import awesome_backtrace_analyzer/main.py

If "command script import awesome_backtrace_analyzer/main.py" ends up adding $ROOT/awesome_backtrace_analyzer to the path, then this module will have a hard time importing the modules it depends on (it would either have to use weird relative imports, or mess with sys.path itself. If we add just $ROOT then it could simply import utils.consult_oracle.

I guess then the user should have called command script import awesome_backtrace_analyzer to import the package rather than the main.py inside it. But I get your point. FWIW just adding the $ROOT is how I did the original implementation before adding the tests for the nested directories and .py files that Dave suggested. It would solve this issues but then doesn't support those scenarios. I don't know if it would be less confusing that you can't pass a relative path to a .py file or that you can't import another module as you described.

I don't think the two options are mutually exclusive. I'm pretty sure this is just a limitation of our current importing code, which could be fixed to import awesome_backtrace_analyzer/main.py as awesome_backtrace_analyzer.main like it would be from python.

I don't think we can do that in the general case without breaking users (see below). I guess we could do it for imports not relative to the current working directory, as this has never worked before. It would then replace the logic described by the comment on line 2793.

I think setting the import path to $ROOT would actually make the sys.path manipulation serve some useful purpose (and also reduce the number of sys.path entries we add). If, on the other hand, we are not interested making cross-module imports work "out of the box" (like, we could say that it's the responsibility of individual modules to ensure that), we could also try to import the file without messing with sys.path at all (https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path gives one way to do that).

I would prefer this approach if it didn't require to name the module ourself. Any heuristic will have the risk of being ambitious as well (which is probably why the API makes you specify the module name).

(I assume you meant ambiguous, not ambitious :P)

😅

Well... yes, if we do the simplest thing of naming the "module" according to the file basename then it will be ambiguous. But I would say that even _that_ is better than what we do now, because it avoids funky interactions between all the sys.path entries that we're adding -- e.g. a random file in the same directory as one of the files user imported becoming visible to python import machinery, and shadowing some other real module. It also gives us the option to do something about that ambiguity -- we could add numerical suffixes to the imported names (and have the import command print the name it used) or whatever...

From a technical point I agree a 100%, but I just don't see how we can do it without breaking tons of users that are using the module name in command script add:

def __lldb_init_module(debugger, internal_dict):
    debugger.HandleCommand('command script add -f module.function my_function')

I'll hold off on updating the patch until we reached consensus.

I guess then the user should have called command script import awesome_backtrace_analyzer to import the package rather than the main.py inside it. But I get your point. FWIW just adding the $ROOT is how I did the original implementation before adding the tests for the nested directories and .py files that Dave suggested. It would solve this issues but then doesn't support those scenarios. I don't know if it would be less confusing that you can't pass a relative path to a .py file or that you can't import another module as you described.

I don't think the two options are mutually exclusive. I'm pretty sure this is just a limitation of our current importing code, which could be fixed to import awesome_backtrace_analyzer/main.py as awesome_backtrace_analyzer.main like it would be from python.

I don't think we can do that in the general case without breaking users (see below). I guess we could do it for imports not relative to the current working directory, as this has never worked before. It would then replace the logic described by the comment on line 2793.

In general, we cannot impose any kind of natural structure on the path the user gives us -- in an absolute path, the python root could be anywhere. For cwd-relative imports, we could pretend that the cwd is the root, but that seems somewhat unintuitive (and breaks users) (*). Even the usage of the directory of the sourced file as root seems moderately unintuitive to me, though I think it might be a good fit for the motivational use case for this feature.

Speaking of users, if you know any, it might be interesting to ask them about this and see whether they'd be interested in such a thing. I don't write python scripts, so I'm just hypothesizing. (and trying to ensure we don't dig a bigger hole for ourselves than we already have.)

(*) On the flip side, we are already adding . to the path, so this would kind of make sense.

I guess then the user should have called command script import awesome_backtrace_analyzer to import the package rather than the main.py inside it. But I get your point. FWIW just adding the $ROOT is how I did the original implementation before adding the tests for the nested directories and .py files that Dave suggested. It would solve this issues but then doesn't support those scenarios. I don't know if it would be less confusing that you can't pass a relative path to a .py file or that you can't import another module as you described.

I don't think the two options are mutually exclusive. I'm pretty sure this is just a limitation of our current importing code, which could be fixed to import awesome_backtrace_analyzer/main.py as awesome_backtrace_analyzer.main like it would be from python.

I don't think we can do that in the general case without breaking users (see below). I guess we could do it for imports not relative to the current working directory, as this has never worked before. It would then replace the logic described by the comment on line 2793.

In general, we cannot impose any kind of natural structure on the path the user gives us -- in an absolute path, the python root could be anywhere. For cwd-relative imports, we could pretend that the cwd is the root, but that seems somewhat unintuitive (and breaks users) (*). Even the usage of the directory of the sourced file as root seems moderately unintuitive to me, though I think it might be a good fit for the motivational use case for this feature.

The issue is that we "resolve" cwd-relative paths to absolute paths and treat the result as an absolute path and the "." is only in the system path to make relative imports from inside the module work. We could do the same for the "source root" but that would mean adding yet another path to the system path.

The more I think about it the more I feel like the current approach in this patch is fits best with what we're already doing. If it doesn't work for the user they can always adjust the system path in the top level (imported) module.

I don't think we can do that in the general case without breaking users (see below). I guess we could do it for imports not relative to the current working directory, as this has never worked before. It would then replace the logic described by the comment on line 2793.

In general, we cannot impose any kind of natural structure on the path the user gives us -- in an absolute path, the python root could be anywhere. For cwd-relative imports, we could pretend that the cwd is the root, but that seems somewhat unintuitive (and breaks users) (*). Even the usage of the directory of the sourced file as root seems moderately unintuitive to me, though I think it might be a good fit for the motivational use case for this feature.

The issue is that we "resolve" cwd-relative paths to absolute paths and treat the result as an absolute path and the "." is only in the system path to make relative imports from inside the module work. We could do the same for the "source root" but that would mean adding yet another path to the system path.

I'm afraid you've lost me there. Yes, we turn relative paths into absolute ones, but I don't see how that translates into adding less entries into sys.path. For each module that we import, we add dirname(module_path) to sys.path. If we import two modules from different directories, we will get two sys.path entries, even if the modules were specified as paths relative to the same directory. Canonicalizing the path to that directory will mean less entries.

JDevlieghere added a comment.EditedOct 20 2020, 8:26 AM

I don't think we can do that in the general case without breaking users (see below). I guess we could do it for imports not relative to the current working directory, as this has never worked before. It would then replace the logic described by the comment on line 2793.

In general, we cannot impose any kind of natural structure on the path the user gives us -- in an absolute path, the python root could be anywhere. For cwd-relative imports, we could pretend that the cwd is the root, but that seems somewhat unintuitive (and breaks users) (*). Even the usage of the directory of the sourced file as root seems moderately unintuitive to me, though I think it might be a good fit for the motivational use case for this feature.

The issue is that we "resolve" cwd-relative paths to absolute paths and treat the result as an absolute path and the "." is only in the system path to make relative imports from inside the module work. We could do the same for the "source root" but that would mean adding yet another path to the system path.

I'm afraid you've lost me there. Yes, we turn relative paths into absolute ones, but I don't see how that translates into adding less entries into sys.path. For each module that we import, we add dirname(module_path) to sys.path. If we import two modules from different directories, we will get two sys.path entries, even if the modules were specified as paths relative to the same directory. Canonicalizing the path to that directory will mean less entries.

It doesn't, it translates to more: one for the . and one for dirname(module_path). Which basically translates to n+1 entries because the working directory doesn't change. I was saying we could do the exact same thing for modules relative to the source-dir, but that means 2n entries because for every module we add source_dir and dirname(module_path) and generally both will be different. My point was that it would solve the issue of a relative import that you described last week, but at the cost at adding twice as much entries, which means double the chance of an ambiguous (not ambitious ;-) import. I think that was your main objection to the patch as it is right now and I'm arguing that I think it's the best of the different trade-offs.

I'm afraid you've lost me there. Yes, we turn relative paths into absolute ones, but I don't see how that translates into adding less entries into sys.path. For each module that we import, we add dirname(module_path) to sys.path. If we import two modules from different directories, we will get two sys.path entries, even if the modules were specified as paths relative to the same directory. Canonicalizing the path to that directory will mean less entries.

It doesn't, it translates to more: one for the . and one for dirname(module_path). Which basically translates to n+1 entries because the working directory doesn't change. I was saying we could do the exact same thing for modules relative to the source-dir, but that means 2n entries because for every module we add source_dir and dirname(module_path) and generally both will be different. My point was that it would solve the issue of a relative import that you described last week, but at the cost at adding twice as much entries, which means double the chance of an ambiguous (not ambitious ;-) import. I think that was your main objection to the patch as it is right now and I'm arguing that I think it's the best of the different trade-offs.

Why would we be adding both? We've already checked the path and know if the file exists in the cwd or not. What's the point in adding that? My impression was that this patch already avoids adding two paths in this case...

Actually, this guessing of what they user meant to say is one of the things I don't like about this approach. I think it would be better if there was some way (command option or something) to specify where the module should be imported from. The docstring for that option could also explain the rule for how the module is being imported.

I'm afraid you've lost me there. Yes, we turn relative paths into absolute ones, but I don't see how that translates into adding less entries into sys.path. For each module that we import, we add dirname(module_path) to sys.path. If we import two modules from different directories, we will get two sys.path entries, even if the modules were specified as paths relative to the same directory. Canonicalizing the path to that directory will mean less entries.

It doesn't, it translates to more: one for the . and one for dirname(module_path). Which basically translates to n+1 entries because the working directory doesn't change. I was saying we could do the exact same thing for modules relative to the source-dir, but that means 2n entries because for every module we add source_dir and dirname(module_path) and generally both will be different. My point was that it would solve the issue of a relative import that you described last week, but at the cost at adding twice as much entries, which means double the chance of an ambiguous (not ambitious ;-) import. I think that was your main objection to the patch as it is right now and I'm arguing that I think it's the best of the different trade-offs.

Why would we be adding both? We've already checked the path and know if the file exists in the cwd or not. What's the point in adding that? My impression was that this patch already avoids adding two paths in this case...

Actually, this guessing of what they user meant to say is one of the things I don't like about this approach. I think it would be better if there was some way (command option or something) to specify where the module should be imported from. The docstring for that option could also explain the rule for how the module is being imported.

I'm afraid you've lost me there. Yes, we turn relative paths into absolute ones, but I don't see how that translates into adding less entries into sys.path. For each module that we import, we add dirname(module_path) to sys.path. If we import two modules from different directories, we will get two sys.path entries, even if the modules were specified as paths relative to the same directory. Canonicalizing the path to that directory will mean less entries.

It doesn't, it translates to more: one for the . and one for dirname(module_path). Which basically translates to n+1 entries because the working directory doesn't change. I was saying we could do the exact same thing for modules relative to the source-dir, but that means 2n entries because for every module we add source_dir and dirname(module_path) and generally both will be different. My point was that it would solve the issue of a relative import that you described last week, but at the cost at adding twice as much entries, which means double the chance of an ambiguous (not ambitious ;-) import. I think that was your main objection to the patch as it is right now and I'm arguing that I think it's the best of the different trade-offs.

Why would we be adding both? We've already checked the path and know if the file exists in the cwd or not. What's the point in adding that? My impression was that this patch already avoids adding two paths in this case...

For paths relative to the CWD we add both already:

  1. We add "." to the sys path once: https://github.com/llvm/llvm-project/blob/5d796645d6c8cadeb003715c33e231a8ba05b6de/lldb/source/Plugins/ScriptInterpreter/Python/ScriptInterpreterPython.cpp#L3234
  2. We "resolve" (make absolute) the relative path (https://github.com/llvm/llvm-project/blob/5d796645d6c8cadeb003715c33e231a8ba05b6de/lldb/source/Plugins/ScriptInterpreter/Python/ScriptInterpreterPython.cpp#L2747) and if it exists add its dir to the sys path (https://github.com/llvm/llvm-project/blob/5d796645d6c8cadeb003715c33e231a8ba05b6de/lldb/source/Plugins/ScriptInterpreter/Python/ScriptInterpreterPython.cpp#L2785).

For source-relative imports all this was a hypothetical solution to the relative import from Python issue you described.

Actually, this guessing of what they user meant to say is one of the things I don't like about this approach. I think it would be better if there was some way (command option or something) to specify where the module should be imported from. The docstring for that option could also explain the rule for how the module is being imported.

How are we guessing more than before? We're doing the exact same thing as for relative paths, i.e. we resolve them and add the dir's path to the system path (= 2 from above).

I'm afraid we've completely desynchronized by this point. Let's try to reset. The algorithm I'm proposing is:

if (is_relative_to_command_file(path))  {// how to implement that?
  ExtendPathIfNotExists(dirname(command_file));
  import(path);
} else {
  // Same as before
  path = make_absolute(path, cwd);
  ExtendPathIfNotExists(dirname(path));
  import(basename(path));
}

The algorithm that I think this patch implements is:

if (is_relative_to_command_file(path)) // implemented by checking cwd for this file ?
  path = make_absolute(path, dirname(command_file);
else
  path = make_absolute(path, cwd);
ExtendPathIfNotExists(dirname(path));
import(basename(path));

Is that an accurate depiction?

Both algorithms add at most one path entry for each import command. (I'm ignoring the '.' entry which gets added unconditionally.) For cwd-relative imports they behave the same way. The difference is in the command-relative imports. The first algorithm adds at most one path for each command file which executes import commands. The second one can add more -- if the imported scripts are in different directories, then all of those directories will be added to the path.

Actually, this guessing of what they user meant to say is one of the things I don't like about this approach. I think it would be better if there was some way (command option or something) to specify where the module should be imported from. The docstring for that option could also explain the rule for how the module is being imported.

How are we guessing more than before? We're doing the exact same thing as for relative paths, i.e. we resolve them and add the dir's path to the system path (= 2 from above).

But how do we know if the user meant to do a CWD-relative import or a command-relative one? That's an extra level of guessing (uncertainty). What if the file is present at both locations. I think it would be better if there was some way to explicitly say that you're importing a module using this command-relative scheme. And that this might give us an excuse to implement a more pythonic module import scheme (?)

Briefly discussed this with Pavel on IRC. The latest revision implements what I think you suggested:

  • Make the new logic conditional on a new flag (-c).
  • Add the "command search dir" to the path, not dirname(path).
  • Change the test to import a relative module as baz.hello instead of baz/hello.
labath accepted this revision.Oct 27 2020, 3:20 AM

Let's see how this goes.

lldb/source/Commands/Options.td
709 ↗(On Diff #300803)

It might be better to make this an error

lldb/test/Shell/ScriptInterpreter/Python/command_relative_import.test
4–9 ↗(On Diff #300803)

consider using the (new) split-file utility -- It can split single file into multiple chunks and place them in the appropriate folders.

This revision is now accepted and ready to land.Oct 27 2020, 3:20 AM

Let's see how this goes.

Thanks for bearing with me :-)

Herald added a project: Restricted Project. · View Herald TranscriptOct 27 2020, 9:21 AM