This is an archive of the discontinued LLVM Phabricator instance.

[support][llvm-objcopy] Add support for shell wildcards
ClosedPublic

Authored by rupprecht on Aug 22 2019, 12:20 PM.

Details

Summary

GNU objcopy accepts the --wildcard flag to allow wildcard matching on symbol-related flags. (Note: it's implicitly true for section flags).

The basic syntax is to allow *, ?, \, and [] which work similarly to how they work in a shell. Additionally, starting a wildcard with ! causes that wildcard to prevent it from matching a flag.

Update GlobPattern in libSupport to handle a few more cases. It does not fully match the fnmatch used by GNU objcopy since named character classes (e.g. [[:digit:]]) are not supported, but this should support most existing use cases (mostly just * is what's used anyway).

Diff Detail

Event Timeline

rupprecht created this revision.Aug 22 2019, 12:20 PM

I've not fully reviewed this yet, and won't get around to it until next week, but I wanted to highlight your own comment from D57517 where you said the following:

Also a question, for others as well: is this a feature we actually want to carry forward? Are there any users of this feature/is it useful? This isn't something I've come across as actually being used, though I'm happy to review it if is.

We added --regex instead of --wildcard precisely to address the specific use-case @evgeny777 had. I'm not opposed to adding --wildcard, if there are actual users of it who don't want to try to migrate to --regex, but I do wonder about the additional complexity this is introducing, and how likely it is to get the rules wrong, e.g. to do with escaping etc.

llvm/lib/Support/Regex.cpp
214 ↗(On Diff #216686)

Why is BracketDepth signed? It can't ever be negative.

218 ↗(On Diff #216686)

Looks to me like the comments aren't lined up correctly with their case statements.

255 ↗(On Diff #216686)

"<=" -> "="

You assert BracketDepth is not negative elsewhere.

Also a question, for others as well: is this a feature we actually want to carry forward? Are there any users of this feature/is it useful? This isn't something I've come across as actually being used, though I'm happy to review it if is.

I have the same question.

llvm/lib/Support/Regex.cpp
214 ↗(On Diff #216686)

BracketDepth should just be a boolean. There is no nested depth. With the current implementation, the pattern [[]* does not match the string [a, while in fnmatch(3) (used by GNU objcopy), it matches.

Windows does not have fnmatch(3). If there were a similar function, Unicode might be its weakness...

To correctly parse brackets used in glob, I think the bracket parsing code should detect [:, [. or [= (
collation-related bracket symbols), and skip all characters until the ending symbol is found.

(Will get to the code comments later, just wanted to justify the patch)

I've not fully reviewed this yet, and won't get around to it until next week, but I wanted to highlight your own comment from D57517 where you said the following:

Also a question, for others as well: is this a feature we actually want to carry forward? Are there any users of this feature/is it useful? This isn't something I've come across as actually being used, though I'm happy to review it if is.

We added --regex instead of --wildcard precisely to address the specific use-case @evgeny777 had. I'm not opposed to adding --wildcard, if there are actual users of it who don't want to try to migrate to --regex

IMO --regex is nicer for interactive uses, but when using objcopy in a build script that may be used by different toolchains, with either GNU objcopy or llvm-objcopy, a common denominator is needed. Apparently clang+kernel people need it for some versions of the kernel, and we've had some internal users need this too. (For now, they use llvm-objcopy w/ --regex, but it would be a problem if someone using GNU objcopy in their toolchain needs to depend on their package). I also think it will generally help with the long tail of users that want to try it out.

but I do wonder about the additional complexity this is introducing, and how likely it is to get the rules wrong, e.g. to do with escaping etc.

My concern as well. Please review diligently, I won't take comments about regex bugs personally :)

! In D66613#1642709, @MaskRay wrote:

Also a question, for others as well: is this a feature we actually want to carry forward? Are there any users of this feature/is it useful? This isn't something I've come across as actually being used, though I'm happy to review it if is.

I have the same question.

Android/ChromeOS/Google is looking into using llvm substitutes for binutils for their Linux kernels. This feature is needed for us to substitute GNU strip for llvm-strip and GNU objcopy for llvm-objcopy. The upstream Linux kernel has removed the use of these for architectures we ship, but the Long Term Support (LTS) kernels that we will be shipping for the foreseeable future still do. Backporting the removal of these flags in a way that doesn't also backport features (which is against LTS policy) is likely to be tricky, as it would also cause the LTS branches to start to diverge, making future backports to those source files less likely to apply cleanly. I'm not confident we could get those patches accepted upstream. We could always carry them out of tree in Android/ChromeOS/Google, but I'd rather not.

All that not supporting this feature does is delay the use of llvm-objcopy/llvm-strip in a few major Linux distributions for their kernels for a few years.

rupprecht updated this revision to Diff 216983.Aug 23 2019, 5:10 PM
rupprecht marked 6 inline comments as done.
  • Split out separate methods for parsing brackets instead of allowing arbitrary nesting.
llvm/lib/Support/Regex.cpp
214 ↗(On Diff #216686)

Thanks, I missed that edge case.

I've removed the BracketDepth var entirely. I originally did have it as a boolean but it didn't handle the "nested" case of "[[:alpha:]]" (it's not arbitrarily nested, but it is still kind of nested...). It isn't necessary by changing this test case to handle eating all characters until a matching ].

218 ↗(On Diff #216686)

Fixed -- this indentation level is forced by clang-format, but it looks better after the case statement rather than before it.

MaskRay added a comment.EditedAug 23 2019, 6:30 PM

I just realized that you can remove the glob (I'd call it glob, not wildcard) change from this patch, and just use lib/Support/Regex.cpp:GlobPattern.

GlobPattern is currently used by lld to do version script/dynamic list matching. In version scripts/dynamic lists, [: and [= are syntax error (ld.bfd), and I don't think anyone using [.. But to make it fully fnmatch(pat, str, 0) capable (in case someone uses character classes like [[:digit:]]), you can add these enhancement to a separate change.

llvm/lib/Support/Regex.cpp
247 ↗(On Diff #216983)

character '^' may be slightly better/faster.

268 ↗(On Diff #216983)

character ']'

MaskRay added inline comments.Aug 29 2019, 12:28 AM
llvm/docs/CommandGuide/llvm-objcopy.rst
134 ↗(On Diff #216983)

On by default for section-related flags.

This is another thing I'm not sure if we want to do.

I'd like users to specify -w to get wildcard semantics for section options. I checked the ruby/nacl/netbsd links and they all appear to use -w

khm, I don't insist, but personally I would split out the changes in lib/Support and the corresponding unit tests into a separate patch

I wonder if you can use GlobPattern (llvm/Support/GlobPattern.h) or extend this class. One of the reasons we use it in lld is because of much better performance (see D26241),
This may not be the case for llvm-objcopy, but still using original pattern matcher looks nicer than translating to regexps.

rupprecht updated this revision to Diff 224709.Oct 11 2019, 4:42 PM
rupprecht marked 3 inline comments as done.
  • Use GlobPattern instead of Regex
  • Log a warning if the glob expression is invalid
In D66613#1659632, @alexshap wrote:

khm, I don't insist, but personally I would split out the changes in lib/Support and the corresponding unit tests into a separate patch

Ack; I'll submit them separately, but I'm leaving them together in the patch for now, e.g. to show how the added glob support is needed by llvm-objcopy tests.

I just realized that you can remove the glob (I'd call it glob, not wildcard) change from this patch, and just use lib/Support/Regex.cpp:GlobPattern.

GlobPattern is currently used by lld to do version script/dynamic list matching. In version scripts/dynamic lists, [: and [= are syntax error (ld.bfd), and I don't think anyone using [.. But to make it fully fnmatch(pat, str, 0) capable (in case someone uses character classes like [[:digit:]]), you can add these enhancement to a separate change.

I wonder if you can use GlobPattern (llvm/Support/GlobPattern.h) or extend this class. One of the reasons we use it in lld is because of much better performance (see D26241),
This may not be the case for llvm-objcopy, but still using original pattern matcher looks nicer than translating to regexps.

OK, removed the regex transformation and using glob instead.

I decided to leave out character classes for now, so the changes to glob are actually pretty minimal. lld tests are still passing w/ these glob changes. I hope the test coverage is sufficient there.

llvm/docs/CommandGuide/llvm-objcopy.rst
134 ↗(On Diff #216983)

Those were just easy-to-find examples of wildcard usage. There are others using wildcards w/o -w, e.g.

https://chromium.googlesource.com/chromiumos/platform/ec/+/refs/heads/master/Makefile.rules#66

cmd_ec_elf_to_flat ?= $(OBJCOPY) --set-section-flags .roshared=share -R .dram* \
				-O binary $< $@
cmd_ec_elf_to_flat_dram ?= $(OBJCOPY) -j .dram* -O binary $< $@

Older kernel versions: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/firmware/efi/libstub/Makefile?h=linux-4.4.y#n57

STUBCOPY_FLAGS-y		:= -R .debug* -R *ksymtab* -R *kcrctab*
MaskRay added inline comments.Oct 11 2019, 6:57 PM
llvm/test/tools/llvm-objcopy/ELF/wildcard-syntax.test
1 ↗(On Diff #224709)

Move the file-level comment before RUN:

3 ↗(On Diff #224709)

The name is wildcard-syntax.test (good). Both wildcard and glob are referred to in this test. Personally I think wildcard is a subset of glob and glob is a better name here - glob is a POSIX specified concept and the function fnmatch that objcopy uses accepts glob, not just wildcard.

Unify the naming here.

21 ↗(On Diff #224709)

What does --remove-section='.???' --remove-section='!.f*' --remove-section='.???' do?

25 ↗(On Diff #224709)

Full stop.

26 ↗(On Diff #224709)

If llvm-objcopy called setlocale, RE Bracket Expression (also used by glob) would be non-portable (behaviors of locales other than the POSIX locale are unspecified).

Just for fun, glibc>=2.27 make w and v have the same collating order and a w test may fail in the Swedish locale (if regex matching is used): https://sourceware.org/bugzilla/show_bug.cgi?id=23393 (Comment #41 said this is basically a wontfix)

In any case, llvm-objcopy does not call setlocale, w not used in the test, nor do we use regex, so we are good.

73 ↗(On Diff #224709)

Probably add a section named z to enhance the test.

llvm/tools/llvm-objcopy/CopyConfig.h
115 ↗(On Diff #224709)

If ErrorCallback is not stored, consider llvm::function_ref.

119 ↗(On Diff #224709)

Remove some return

rupprecht updated this revision to Diff 224903.Oct 14 2019, 2:04 PM
rupprecht marked 13 inline comments as done.
  • Remove returns that don't compile
  • std::function -> llvm::function_ref
  • Update docs
  • Fix punctuation
  • Clarify wildcard vs glob
  • Use better check prefixes
  • Change letters to avoid any locale issues
rupprecht added inline comments.Oct 14 2019, 2:04 PM
llvm/test/tools/llvm-objcopy/ELF/wildcard-syntax.test
3 ↗(On Diff #224709)

Done; I've kept it as "wildcard" when talking about the flag name, and also once to mention "shell wildcard" which it's commonly referred to as, but generally I've updated to "glob syntax"

21 ↗(On Diff #224709)

Flag ordering doesn't matter, the algorithm (which GNU objcopy seems to do as well) is to apply the flag if any positive matcher matches and no negative matcher matches. Your example is equivalent to what is written here.

I added some test cases that show this, and I updated the command guide docs.

26 ↗(On Diff #224709)

Not going to pretend I totally understand locales, but I'll just use [q-s] (to match the r in bar) to avoid v or w anyway :)

73 ↗(On Diff #224709)

I'm not sure it would add that much, but I added it anyway.

The point of this test is that []xyz] is interpreted as "a single character which is one of something in ]xyz" (which matches the ] section), as opposed to "an empty character matcher followed by the literal xyz]".

I added this as a test comment. Also added a section named xyz] to show that it is not removed.

llvm/tools/llvm-objcopy/CopyConfig.h
119 ↗(On Diff #224709)

I was wondering how this even compiled, then realized it doesn't :(

rupprecht edited the summary of this revision. (Show Details)Oct 14 2019, 2:06 PM
MaskRay accepted this revision.Oct 16 2019, 2:53 AM

If you have extra bandwidth, I hope you can raise the issues for projects that use glob in objcopy --*-section but does not pass -w. This patch LG.

This revision is now accepted and ready to land.Oct 16 2019, 2:53 AM
rupprecht reopened this revision.Oct 16 2019, 3:58 PM

Turns out treating \ breaks linker scripts on Windows due to path names :(

glob support added in r375051 and immediately reverted in r375052 after a fast buildbot.

This revision is now accepted and ready to land.Oct 16 2019, 3:58 PM
rupprecht planned changes to this revision.Oct 16 2019, 3:58 PM

Planning to reland as-is after D69074

rupprecht accepted this revision.Oct 17 2019, 11:05 AM
This revision was not accepted when it landed; it landed in state Changes Planned.Oct 17 2019, 11:11 AM
This revision was automatically updated to reflect the committed changes.

Post-commit comments since I've been away and spotted a few relatively minor things. I'll post directly on the commit for the llvm-objcopy side, if I see anything there.

llvm/docs/CommandGuide/llvm-objcopy.rst
135 ↗(On Diff #216983)

--regex -> :option:'--regex' (with back-ticks instead of apostrophes).

149 ↗(On Diff #216983)

'!' -> `!` for consistency? Not sure.

llvm/docs/CommandGuide/llvm-strip.rst
109 ↗(On Diff #216983)

Same comments as llvm-objcopy document.

llvm/lib/Support/GlobPattern.cpp
118

Nit: "* -> '*' ?

llvm/unittests/Support/GlobPatternTest.cpp
26

This and similar cases should at least be ASSERT_TRUE instead of EXPECT_TRUE. If for some reason a failure is hit, the following lines will attempt to dereference an Expected in a failing state and crash the test rather than simply fail it.

Better however would be to use ASSERT_THAT_EXPECTED(Pat1, Succeeded()); (double-check exact usage, but it's something like that).

124

Similar to above, you can use ASSERT_THAT_EXPECTED(Pat2, Failed()); to avoid weird behaviour.