When building a tool in a non-standard environment (e.g. custom
compiler path -> LD_LIBRARY_PATH set) then
use_default_shell_env = True is required to run that tool in the same
environment or otherwise the build will fail due to missing symbols.
See https://github.com/google/jax/issues/7842 for this issue and
https://github.com/tensorflow/tensorflow/pull/44549 for related fix in
TF.
Details
Diff Detail
Event Timeline
I've done quite a bit of hunting around and reading about default_shell_env (why wasn't this in the documentation at https://docs.bazel.build/versions/main/skylark/lib/actions.html#run.use_default_shell_env?). From my reading, it appears that use_default_shell_env will include the entire local shell environment as part of each Bazel invocation. This has implications for build hermeticness and especially for shared caching, so I'm a bit reluctant to do it. AFAICT if you set --action_env=LD_LIBRARY_PATH and --host_action_env=LD_LIBRARY_PATH then actions (in target and host configuration, respectively) *should* have the LD_LIBRARY_PATH in their env. That's a decision that can be made in a more localized way so that individual users/projects can make their own tradeoff between hermeticness and actually getting their build to work :-P Following the thread on https://github.com/bazelbuild/bazel/issues/12579 and https://github.com/tensorflow/tensorflow/pull/44549 it seems that --host_action_env is broken and was fixed by https://github.com/bazelbuild/bazel/commit/e6670825b1 and then, if I'm reading GitHub UI's correctly, backported to the 4.x LTS release with https://github.com/bazelbuild/bazel/commit/80cf90e51a. Further, I believe that --incompatible_strict_action_env means that the default shell env *doesn't* inherit PATH, LD_LIBRARY_PATH, etc, so there is still a way to recover the more cache-friendly and hermetic behavior. I'm actually not sure in that case what the default_shell_env *does* contain and what shell env the action gets if not the default one. None, I guess? So still some confusion there.
So to evaluate the best path forward, could you try out whether building with Bazel 4.2 and --host_action_env=LD_LIBRARY_PATH fixes the issue for you? I'll also do a bit more investigation of what's going on with default_shell_env
AFAICT if you set --action_env=LD_LIBRARY_PATH and --host_action_env=LD_LIBRARY_PATH then actions (in target and host configuration, respectively) *should* have the LD_LIBRARY_PATH in their env.
jax builds with --distinct_host_configuration=false which means all action_env should also be passed to host_action_env. We also pass e.g. --action_env=PYTHONPATH for our builds.
Yet as you can see in https://github.com/google/jax/issues/7842 the env is literally empty. So I'm pretty sure the --host_action_env=LD_LIBRARY_PATH won't work.
Furthermore we can't really use Bazel 4.2 as the underlying TensorFlow isn't compatible with Bazel 4.x yet: https://github.com/tensorflow/tensorflow/blob/v2.6.0/configure.py#L53
I'm actually not sure in that case what the default_shell_env *does* contain and what shell env the action gets if not the default one. None, I guess? So still some confusion there.
Yeah, very much looks like None while as far as I can tell stuff like LD_LIBRARY_PATH and action_envs are passed if use_default_shell_env is set. The TF guys accepted my PR which adds this to the TF repo, so I guess even they don't know a better way.
Re --incompatible_strict_action_env I found https://github.com/bazelbuild/bazel/issues/6648 that mentions that --action_env *should* be passed, but clearly is not -.-
However if you search the Bazel repo that whole thing is a mess:
See https://github.com/bazelbuild/bazel/issues/3320, https://github.com/bazelbuild/bazel/issues/12059, https://github.com/bazelbuild/bazel/issues/11163 to name a few general ones
https://github.com/bazelbuild/bazel/issues/3057:
Bazel does not forward --action_env=NAME, but it does forward --action_env=NAME=VALUE.
What??? As this is a repository rule I tried --action_env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH but still nothing passed
https://github.com/bazelbuild/bazel/issues/6392:
Are you sure that params passed to --action_env are passed to repository execute calls? That sounds completely wrong.
I can't follow that reasoning -.-
So in conclusion: I don't see any other option but this.
Edit: Same results with Bazel 4.2.1
(╯°□°)╯︵ ┻━┻
Ugh yeah let's just add this for now at least. I'll follow up to see if we can get some clarity on wtf is going on with action environments. I assume you don't have commit access and would like me to land this for you?
And if so, please lmk what you'd like me to put for the git author field (name and email)
utils/bazel/llvm-project-overlay/mlir/tblgen.bzl | ||
---|---|---|
172 | Can you capture the rational presented above as a comment before this option? |
@GMNGeoffrey Updated the diff to include a comment. And yes please land this. Author tag is "Alexander Grund <alexander.grund@tu-dresden.de>", thanks.
Oh and I asked for more explanation for all this on https://groups.google.com/g/bazel-discuss/c/jHaj11mm6sU
Can you capture the rational presented above as a comment before this option?