This is an archive of the discontinued LLVM Phabricator instance.

export unique symbol list with llvm-nm new option "--export-symbols"
ClosedPublic

Authored by DiggerLin on Oct 28 2021, 10:12 AM.

Details

Summary

the patch implement of following functionality.

  1. export the symbols from archive or object files.
  2. sort the export symbols. (based on same symbol name and visibility)
  3. delete the duplicate export symbols (based on same symbol name and visibility)
  4. print out the unique and sorted export symbols (print the symbol name and visibility).

there are two new options are add in the patch

  1. --export-symbols (enable the functionality of export unique symbol)
  2. --no-rsrc (exclude the symbol name begin with "__rsrc" from be exporting from xcoff object file)

Export symbol list for xcoff object file has the same functionality as
The patch has the same functionality as
IBM CreateExportList

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
MaskRay added inline comments.Jan 27 2022, 4:43 PM
llvm/tools/llvm-nm/llvm-nm.cpp
723 ↗(On Diff #401385)

char is more efficient and makes the linked output smaller

MaskRay added inline comments.Jan 27 2022, 4:44 PM
llvm/docs/CommandGuide/llvm-nm.rst
278 ↗(On Diff #401385)

Is this option from AIX nm? Or you just think it useful?

DiggerLin edited the summary of this revision. (Show Details)

support --export-symbol for llvm bitcode objectfile etc.

DiggerLin marked 10 inline comments as done.Jan 28 2022, 12:23 PM
DiggerLin added inline comments.
llvm/docs/CommandGuide/llvm-nm.rst
278 ↗(On Diff #401385)

yes, the option is for aix only.

llvm/tools/llvm-nm/llvm-nm.cpp
270 ↗(On Diff #401385)

yes, in the aix, it is possible that same name have different visibilities in different object files of archive file.

1694 ↗(On Diff #401385)

yes, in aix , we skip export symbol list from shared objects.

1755 ↗(On Diff #401385)

this is not std::Regex, it is llvm specific Regex, and the code is easy to understand, if I write my own code, the reviewer maybe take some time to understand.
but I changed as suggestion anyway.

DiggerLin marked 4 inline comments as done.
DiggerLin updated this revision to Diff 404983.Feb 1 2022, 10:05 AM
DiggerLin retitled this revision from export unique symbol list for xcoff with llvm-nm new option "--export-symbols" to export unique symbol list with llvm-nm new option "--export-symbols".Feb 1 2022, 10:08 AM
DiggerLin edited the summary of this revision. (Show Details)
DiggerLin updated this revision to Diff 405078.Feb 1 2022, 1:27 PM

I've done a partial review of the latest update, but I'm currently quite exhausted and unable to focus enough to give you a further review.

llvm/docs/CommandGuide/llvm-nm.rst
147 ↗(On Diff #405078)
llvm/test/tools/llvm-nm/XCOFF/export-sym-list-with-invalid-section-index.test
1 ↗(On Diff #405078)

The test should also be named export-symbols-with-invalid-section-index.test.

llvm/test/tools/llvm-nm/XCOFF/export-sym-list.test
1 ↗(On Diff #405078)

Test should be named export-symbols.test.

2 ↗(On Diff #405078)
16 ↗(On Diff #405078)

Sorry, missed this. The bit in parentheses doesn't make sense, as we're talking about unique symbols, so in those cases, "same" isn't the right word.

43 ↗(On Diff #405078)

I'd delete this blank line, since the two lines either side are closely linked.

44 ↗(On Diff #405078)

Use count 0 rather than --allow-empty etc.

llvm/test/tools/llvm-nm/bitcode-export-sym.test
1 ↗(On Diff #405078)

I'm not sure if bitcode files are "object" files.

3 ↗(On Diff #405078)

As @MaskRay has suggested in another patch, use split-file rather than echo to generate the input.

llvm/tools/llvm-nm/Opts.td
20 ↗(On Diff #405078)

Could "for object files and archives" just be "for all inputs"?

55 ↗(On Diff #405078)
llvm/tools/llvm-nm/llvm-nm.cpp
688 ↗(On Diff #401385)

On that note, could we add operator< to the NMSymbol class? This would save the need for explicit predicates.

1751 ↗(On Diff #405078)
1836–1837 ↗(On Diff #405078)

As @MaskRay said before, you can use cast to get the assertion automatically.

DiggerLin updated this revision to Diff 405324.Feb 2 2022, 9:52 AM
DiggerLin marked 13 inline comments as done.

I've done a partial review of the latest update, but I'm currently quite exhausted and unable to focus enough to give you a further review.

Thanks a lot for your time and your professional review. I can update the patch based on the partial review.

llvm/test/tools/llvm-nm/bitcode-export-sym.test
1 ↗(On Diff #405078)

for the source code, llvm bitcode is IRObjectFile.

llvm/tools/llvm-nm/Opts.td
20 ↗(On Diff #405078)

there are MachOUniversalBinary and TapiUniversal, using "for all inputs" is more reasonable. thanks

55 ↗(On Diff #405078)

it only exclude the "__rsrc" symbol from the export symbol list.

As mentioned a bit inline, I think you should do some of the refactoring in separate patches, to simplify the number of changes being added here.

  1. Pull out the isSymbolDefined code as you've already done.
  2. Split out/update the symbol sorting code as you've done/as I've suggested in the inline comments in this patch.

I think you should also do a refactor whereby SymbolList is no longer a global variable - having it as a global variable is confusing, at best, and at worst risks cases where it is still populated when it shouldn't be.

Finally, I think having some printing happen before the symbols have been gathered may be a bad idea. At the moment, things like the input filename are printed quite early on, which isn't what we want in some cases. There's no reason this couldn't be deferred until the printing stage, so I'd look at a refactor that does this.

At the moment, the regular behaviour is essentially:

for each input file (and archive member):
    get all symbols
    sort them
    print them according to output format, filtering according to options

A possible additional refactor to the existing code would be to move filtering to the "getting" point, so that they are never added to the symbol list in the first place.

For this patch, I think you essentially want to reuse the getting, sorting and filtering stages. The difference is that rather than print then throw away the symbol list, you want to keep it and do additional stuff to it. In essence, the structure for export symbol list is:

for each input file (and archive member):
  add the symbols to a single global list
sort the full list
uniquify the full list
filter the full list
print the full list

Does this match your understanding? You'll notice that most of these steps, and the general structure are pretty similar. Consequently, I think we can restructure your new code a bit more logically. At the end of main, instead of where we currently have the llvm::for_each to dump symbol names, and then a separate if, you do something like the following:

if (ExportSymbols)
  printExportSymbolList();
else
  llvm::for_each(InputFilenames, dumpSymbolNamesFromFile);

dumpSymbolNamesFromFile (or more precisely dumpSymbolNamesFromObject) will then call a function (created by moving code out of dumpSymbolNamesFromObject) which gets the symbols and adds them to the symbol list. Let's call this getSymbolsFromObject for our purposes here. Beyond that, the process remains essentially the same. printExportSymbolList loops over all inputs and calls getSymbolsFromObject on each, storing the resultant symbols in a single combined list. It then sorts and merges the list before printing. The merging and printing is unique to this code path. The gathering, filtering and sorting is shared code with the regular code path.

Does that make sense? This isn't too far from the current implementation, but there are some subtle differences that should stop the need for lots of if(ExportSymbols) type checks (the only one necessary should be the one in main).

llvm/docs/CommandGuide/llvm-nm.rst
280–281 ↗(On Diff #405324)

Another suggested rewording - you're now in "XCOFF-specific", so don't need to mention "for XCOFF...". Also, this option will apply to multiple inputs as much as one single one, so use the plural.

That being said, I don't think this option should be restricted to the export symbol list, and instead should always apply. This will make it similar to options like --no-weak, which makes implementation more natural.

llvm/test/tools/llvm-nm/Inputs/bitcode-sym64.ll
13 ↗(On Diff #405324)

Nit: delete extra trailing blank line.

llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
1 ↗(On Diff #405324)

Sorry, another suggested rewording.

10 ↗(On Diff #405324)

Is it supposed to explicitly be "\_\_tf1" and "\_\_tf9" only, or should this include e.g. \_\_tf2, \_\_tf3 etc?

If it should include, I'd reword this comment. If it shouldn't, I'd add an additional symbol called __tf2... to show that the prefix isn't removed in that case.

16 ↗(On Diff #405324)
llvm/tools/llvm-nm/Opts.td
55–56 ↗(On Diff #405324)

(and reflow as appropriate)

Explanation is the same as the suggested change in the CommandGuide.

llvm/tools/llvm-nm/llvm-nm.cpp
235 ↗(On Diff #405324)

The addition of this new function is a useful improvement, but can be its own prerequisite patch. Please split it out and rebase this patch on top of that one.

243 ↗(On Diff #405324)

Sorry, didn't think through my comment from yesterday enough. Make this an out-of-class operator, i.e.

bool operator<(const NMSymbol &A, const NMSymbol &B) { ... }

Then implement the equality operator simply as follows:

bool operator==(const NMSymbol &A, const NMSymbol &B) {
  return !(A < B) && !(B < A);
}
680 ↗(On Diff #405324)

Consider implementing operator> as follows:

bool operator!=(const NMSymbol &A, const NMSymbol &B) {
  return !(A == B);
}

bool operator>(const NMSymbol &A, const NMSymbol &B) {
  return A != B && !(A < B);
}

and then using that directly as the predicate without the need for the lamdba:

llvm::stable_sort(SymbolList, operator>);
686 ↗(On Diff #405324)

Let's make this a member function of NMSymbol.

697 ↗(On Diff #405324)

We don't usually bother const-ifying local variables that aren't pointers/references.

It's not clear to me what this function name is trying to communicate. I'm guessing it's "should skip printing this symbol" so I'd rename it shouldSkipPrintingSymbol. I'd also consider, if it makes sense, making this a member function of NMSymbol, rather than passing in the flags (this only applies if all references use the NMSymbol flags member).

710 ↗(On Diff #405324)

Spell out the type, rather than using auto here.

713 ↗(On Diff #405324)

Should this be demangled?

1733–1737 ↗(On Diff #405324)

As per MaskRay's comment above.

1792 ↗(On Diff #405324)

Please don't start blocks with a blank line.

DiggerLin updated this revision to Diff 405710.Feb 3 2022, 10:51 AM
DiggerLin marked 12 inline comments as done.

address comment

DiggerLin added a comment.EditedFeb 3 2022, 10:57 AM

As mentioned a bit inline, I think you should do some of the refactoring in separate patches, to simplify the number of changes being added here.

  1. Pull out the isSymbolDefined code as you've already done.
  2. Split out/update the symbol sorting code as you've done/as I've suggested in the inline comments in this patch.
I think it is too late to create a separate refactor patch now ,we already have done several review on it.  the code is already in the patch, I do not think it is a good idea  to spend time to create a separate refactor patch and rebase current patch now.

I think you should also do a refactor whereby SymbolList is no longer a global variable - having it as a global variable is confusing, at best, and at worst risks cases where it is still populated when it shouldn't be.

for the export list, I think we need a global variable , otherwise we need a variable in main() , and then pass into each function.  The llvm-nm is c style code, I do not think we can do all the refactor in this patch.   if you have have good idea, please let me know, I will create  a separate refactor patch after current patch land.

Finally, I think having some printing happen before the symbols have been gathered may be a bad idea. At the moment, things like the input filename are printed quite early on, which isn't what we want in some cases. There's no reason this couldn't be deferred until the printing stage, so I'd look at a refactor that does this.

At the moment, the regular behavior is essentially:

for each input file (and archive member):
    get all symbols
    sort them
    print them according to output format, filtering according to options

A possible additional refactor to the existing code would be to move filtering to the "getting" point, so that they are never added to the symbol list in the first place.

Agree with you, we should do filter when get the all the symbols a separate refactor patch after the patch landed. there are several functions need to change include function dumpSymbolsFromDLInfoMachO()

For this patch, I think you essentially want to reuse the getting, sorting and filtering stages. The difference is that rather than print then throw away the symbol list, you want to keep it and do additional stuff to it. In essence, the structure for export symbol list is:

for each input file (and archive member):
  add the symbols to a single global list
sort the full list
uniquify the full list
filter the full list
print the full list

Does this match your understanding?

Yes.

You'll notice that most of these steps, and the general structure are pretty similar. Consequently, I think we can restructure your new code a bit more logically. At the end of main, instead of where we currently have the llvm::for_each to dump symbol names, and then a separate if, you do something like the following:

if (ExportSymbols)
  printExportSymbolList();
else
  llvm::for_each(InputFilenames, dumpSymbolNamesFromFile);

dumpSymbolNamesFromFile (or more precisely dumpSymbolNamesFromObject) will then call a function (created by moving code out of dumpSymbolNamesFromObject) which gets the symbols and adds them to the symbol list. Let's call this getSymbolsFromObject for our purposes here. Beyond that, the process remains essentially the same. printExportSymbolList loops over all inputs and calls getSymbolsFromObject on each, storing the resultant symbols in a single combined list. It then sorts and merges the list before printing. The merging and printing is unique to this code path. The gathering, filtering and sorting is shared code with the regular code path.

Does that make sense? This isn't too far from the current implementation, but there are some subtle differences that should stop the need for lots of if(ExportSymbols) type checks (the only one necessary should be the one in main).

  1. Actually , there is only two additional if(ExportSymbols) in the dumpSymbolNamesFromObject() , I do not think it is big problem.
  2. printExportSymbolList loops over all inputs and calls getSymbolsFromObject , I think it need to share the code the dumpSymbolNamesFromFile. and we need to added a additional if(ExportSymbols) in the dumpSymbolNamesFromFile to go to different function .

So I do think there is benefit.

llvm/docs/CommandGuide/llvm-nm.rst
280–281 ↗(On Diff #405324)

thanks, There maybe have multi "__rsrc" symbols but has one export symbol list.

in the aix nm tools, it do not have the similar option "--no-rsrc" .
only in IBM CreateExportList has the option.
name

llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
10 ↗(On Diff #405324)
  1. it is only "tf1....." and "tf9...."
  2. for "__tf1...." remove the first 6 char prefix
    1. for "__tf9..." remove the first 15 char prefix.

the shell script of IBM original CreateExportList , $NF is symbol name.

....

if (substr ($NF, 1, 7) != "__sinit" &&

    substr ($NF, 1, 7) != "__sterm" &&
    match($NF,/^__[0-9]+__/)==0 ) {
   if (substr ($NF, 1, 5) == "__tf1")
                            print (substr ($NF, 7)) VISIBILITY
   else if (substr ($NF, 1, 5) == "__tf9")
                            print (substr ($NF, 15)) VISIBILITY
   else if (!('$incl_resource' && $NF == "__rsrc"))
                            print $NF VISIBILITY
}

....

llvm/tools/llvm-nm/llvm-nm.cpp
243 ↗(On Diff #405324)
  1. I just wonder what is benefit to change to out-of-class operator ? the benefit is we can use llvm::stable_sort(SymbolList, operator>); ?

since I do not think it is effecient to implement

bool operator>(const NMSymbol &A, const NMSymbol &B) {
  return A != B && !(A < B);
}

I explain below.

2 .

bool operator==(const NMSymbol &A, const NMSymbol &B) {
  return !(A < B) && !(B < A);
}

the code is not efficient, for example , when we compare to string equal,

str1==str2 is more efficient than !(str1 < str2) && !(str2 <str1)

using str1 == str2 only go through each byte of the string once,
but
!(str1 < str2) && !(str2 <str1)
need go through each byte of the string twice.

680 ↗(On Diff #405324)

same reason
bool operator>(const NMSymbol &A, const NMSymbol &B) {

return A != B && !(A < B);

}

this need to go through each bytes twice , not efficient.

686 ↗(On Diff #405324)

thanks

697 ↗(On Diff #405324)

yes, the purpose of the function is "should skip printing this symbol" , thanks I will change the name , and I will it as member function.

713 ↗(On Diff #405324)

As I know , it do not need demangle for create export list now, if it need later, we can create a new patch for it later.

jhenderson requested changes to this revision.Feb 4 2022, 1:37 AM

As mentioned a bit inline, I think you should do some of the refactoring in separate patches, to simplify the number of changes being added here.

  1. Pull out the isSymbolDefined code as you've already done.
  2. Split out/update the symbol sorting code as you've done/as I've suggested in the inline comments in this patch.
I think it is too late to create a separate refactor patch now ,we already have done several review on it.  the code is already in the patch, I do not think it is a good idea  to spend time to create a separate refactor patch and rebase current patch now.

It is never too late to split up a patch. I am not happy with this patch in its current state, plus commits should be kept as small as practical. That way, if there is a problem, it's easier to identify where the problem might be, plus it makes reviewing of individual bits of it significantly simpler. Remember, it .

I think you should also do a refactor whereby SymbolList is no longer a global variable - having it as a global variable is confusing, at best, and at worst risks cases where it is still populated when it shouldn't be.

for the export list, I think we need a global variable , otherwise we need a variable in main() , and then pass into each function.  The llvm-nm is c style code, I do not think we can do all the refactor in this patch.   if you have have good idea, please let me know, I will create  a separate refactor patch after current patch land.

My suggested design I already provided in my previous comment actively needs to avoid the global variable - the proposed getSymbolsFromObject function would return this list for use by the respective calling functions. Global variables make code re-use harder, and lead to temporal dependencies, both of which this patch is inflicted with, in part due to the global SymbolList. Doing the refactor up-front would significantly improve this patch.

Finally, I think having some printing happen before the symbols have been gathered may be a bad idea. At the moment, things like the input filename are printed quite early on, which isn't what we want in some cases. There's no reason this couldn't be deferred until the printing stage, so I'd look at a refactor that does this.

At the moment, the regular behavior is essentially:

for each input file (and archive member):
    get all symbols
    sort them
    print them according to output format, filtering according to options

A possible additional refactor to the existing code would be to move filtering to the "getting" point, so that they are never added to the symbol list in the first place.

Agree with you, we should do filter when get the all the symbols a separate refactor patch after the patch landed. there are several functions need to change include function dumpSymbolsFromDLInfoMachO()

Doing this one later is fine - I don't think it makes much difference whether filtering is done up-front or on-printing in this case.

You'll notice that most of these steps, and the general structure are pretty similar. Consequently, I think we can restructure your new code a bit more logically. At the end of main, instead of where we currently have the llvm::for_each to dump symbol names, and then a separate if, you do something like the following:

if (ExportSymbols)
  printExportSymbolList();
else
  llvm::for_each(InputFilenames, dumpSymbolNamesFromFile);

dumpSymbolNamesFromFile (or more precisely dumpSymbolNamesFromObject) will then call a function (created by moving code out of dumpSymbolNamesFromObject) which gets the symbols and adds them to the symbol list. Let's call this getSymbolsFromObject for our purposes here. Beyond that, the process remains essentially the same. printExportSymbolList loops over all inputs and calls getSymbolsFromObject on each, storing the resultant symbols in a single combined list. It then sorts and merges the list before printing. The merging and printing is unique to this code path. The gathering, filtering and sorting is shared code with the regular code path.

Does that make sense? This isn't too far from the current implementation, but there are some subtle differences that should stop the need for lots of if(ExportSymbols) type checks (the only one necessary should be the one in main).

  1. Actually , there is only two additional if(ExportSymbols) in the dumpSymbolNamesFromObject() , I do not think it is big problem.

The specific if checks are just a symptom of a wider structural problem with this patch, which if not improved will lead to brittleness and the potential for bugs. I don't think we should allow this to happen.

  1. printExportSymbolList loops over all inputs and calls getSymbolsFromObject , I think it need to share the code the dumpSymbolNamesFromFile. and we need to added a additional if(ExportSymbols) in the dumpSymbolNamesFromFile to go to different function .

I'm not sure I follow. In my proposed design, dumpSymbolNamesFromFile should never be touched by the export symbols code route, only getSymbolsFromObject.

So I do think there is benefit.

Did you mean "do not"?

llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
10 ↗(On Diff #405324)

I'd add an additional symbol called __tf2... to show that the prefix isn't removed in that case.

Please address this comment then.

llvm/tools/llvm-nm/llvm-nm.cpp
243 ↗(On Diff #405324)

Yes, an out-of-class operator is simpler to use as a predicate for sorting and other algorithm functions.

I think you are being unnecessarily concerned about performance. Whilst it is important to be aware of performance, I think you'll find that the optimizer will ensure that there is no additional runtime cost to implementing the operators as suggested. If you still are concerned, try implementing it both ways and running some tests to compare the total time it takes to run llvm-nm in each case, when sorting a large list of symbols.

680 ↗(On Diff #405324)

As noted above: the optimizer will likely take care of this for you.

This revision now requires changes to proceed.Feb 4 2022, 1:37 AM
DiggerLin marked 2 inline comments as done.EditedFeb 4 2022, 10:45 AM

As mentioned a bit inline, I think you should do some of the refactoring in separate patches, to simplify the number of changes being added here.

  1. Pull out the isSymbolDefined code as you've already done.
  2. Split out/update the symbol sorting code as you've done/as I've suggested in the inline comments in this patch.
I think it is too late to create a separate refactor patch now ,we already have done several review on it.  the code is already in the patch, I do not think it is a good idea  to spend time to create a separate refactor patch and rebase current patch now.

It is never too late to split up a patch. I am not happy with this patch in its current state, plus commits should be kept as small as practical. That way, if there is a problem, it's easier to identify where the problem might be, plus it makes reviewing of individual bits of it significantly simpler. Remember, it .

I created a refactor patch for above and rebase current patch based on it.

I think you should also do a refactor whereby SymbolList is no longer a global variable - having it as a global variable is confusing, at best, and at worst risks cases where it is still populated when it shouldn't be.

for the export list, I think we need a global variable , otherwise we need a variable in main() , and then pass into each function.  The llvm-nm is c style code, I do not think we can do all the refactor in this patch.   if you have have good idea, please let me know, I will create  a separate refactor patch after current patch land.

My suggested design I already provided in my previous comment actively needs to avoid the global variable - the proposed getSymbolsFromObject function would return this list for use by the respective calling functions. Global variables make code re-use harder, and lead to temporal dependencies, both of which this patch is inflicted with, in part due to the global SymbolList. Doing the refactor up-front would significantly improve this patch.

Since we have not reach an agreement on the getSymbolsFromObject() , I am prefer to  do "removing the global variable SymbolList" in a later separate refactor patch. Our team is been blocking on the "export symbol list", I understand your concern about the quality of the code,  but I do not think "removing a global variable SymbolList" is in the scope of the patch.

Finally, I think having some printing happen before the symbols have been gathered may be a bad idea. At the moment, things like the input filename are printed quite early on, which isn't what we want in some cases. There's no reason this couldn't be deferred until the printing stage, so I'd look at a refactor that does this.

At the moment, the regular behavior is essentially:

for each input file (and archive member):
    get all symbols
    sort them
    print them according to output format, filtering according to options

A possible additional refactor to the existing code would be to move filtering to the "getting" point, so that they are never added to the symbol list in the first place.

Agree with you, we should do filter when get the all the symbols a separate refactor patch after the patch landed. there are several functions need to change include function dumpSymbolsFromDLInfoMachO()

Doing this one later is fine - I don't think it makes much difference whether filtering is done up-front or on-printing in this case.

You'll notice that most of these steps, and the general structure are pretty similar. Consequently, I think we can restructure your new code a bit more logically. At the end of main, instead of where we currently have the llvm::for_each to dump symbol names, and then a separate if, you do something like the following:

if (ExportSymbols)
  printExportSymbolList();
else
  llvm::for_each(InputFilenames, dumpSymbolNamesFromFile);

dumpSymbolNamesFromFile (or more precisely dumpSymbolNamesFromObject) will then call a function (created by moving code out of dumpSymbolNamesFromObject) which gets the symbols and adds them to the symbol list. Let's call this getSymbolsFromObject for our purposes here. Beyond that, the process remains essentially the same. printExportSymbolList loops over all inputs and calls getSymbolsFromObject on each, storing the resultant symbols in a single combined list. It then sorts and merges the list before printing. The merging and printing is unique to this code path. The gathering, filtering and sorting is shared code with the regular code path.

Does that make sense? This isn't too far from the current implementation, but there are some subtle differences that should stop the need for lots of if(ExportSymbols) type checks (the only one necessary should be the one in main).

  1. Actually , there is only two additional if(ExportSymbols) in the dumpSymbolNamesFromObject() , I do not think it is big problem.

The specific if checks are just a symptom of a wider structural problem with this patch, which if not improved will lead to brittleness and the potential for bugs. I don't think we should allow this to happen.

  1. printExportSymbolList loops over all inputs and calls getSymbolsFromObject , I think it need to share the code the dumpSymbolNamesFromFile. and we need to added a additional if(ExportSymbols) in the dumpSymbolNamesFromFile to go to different function .

I'm not sure I follow. In my proposed design, dumpSymbolNamesFromFile should never be touched by the export symbols code route, only getSymbolsFromObject.

when printExportSymbolList loops over all inputs ,  we still need to get object files from archive or from  MachOUniversalBinary, TapiUniversal, there are 9 different place call dumpSymbolNamesFromObject() in function dumpSymbolNamesFromFile() , that means the object file can be extracted in 9 different way from the binary file based on the file type of the binary.  we need the functionality  to extract object file out from "dumpSymbolNamesFromFile" ,  so it is better to share the logic of extracting object file out binary file.  So still need the dumpSymbolNamesFromFile() for your suggestion printExportSymbolList().  and the function dumpSymbolNamesFromFile() is too large.  it should be better to create a refactor patch to split the function into several small functions after the patch. as dumpInput() in llvm-readobj.cpp and llvm-objdump.cpp

So I do think there is benefit.

Did you mean "do not"?

Sorry for the typo, yes, I mean "do not"
llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
10 ↗(On Diff #405324)

I do not think we need to add additional symbol called tf2... to show that the prefix isn't removed in that case. there is no source code related the logic which can remove the prefix of tf2... But I add the test anyway.

llvm/tools/llvm-nm/llvm-nm.cpp
680 ↗(On Diff #405324)

-bash-4.2$ /xl_le/gsa/rtpgsa/projects/w/wyvern-environment/compilers/ppc64le/linux_leppc/clang.11.0.0/bin/clang++ -I/scratch/zhijian/llvm/src/llvm/include -I/scratch/zhijian/llvm/build/include -c llvm/tools/llvm-nm/llvm-nm.cpp -I/scratch/zhijian/llvm/build/tools/llvm-nm
llvm/tools/llvm-nm/llvm-nm.cpp:703:5: error: no matching function for call to 'stable_sort'

llvm::stable_sort(SymbolList, operator>);
^~~~~~~~~~~~~~~~~

/scratch/zhijian/llvm/src/llvm/include/llvm/ADT/STLExtras.h:1789:6: note: candidate template ignored: couldn't infer template argument 'Compare'
void stable_sort(R &&Range, Compare C) {

DiggerLin updated this revision to Diff 406068.Feb 4 2022, 12:47 PM
DiggerLin marked 2 inline comments as done.

created a separated refactor patch and rebase the patch on it, and address comments

I created a refactor patch for above and rebase current patch based on it.

Thanks.

I think you should also do a refactor whereby SymbolList is no longer a global variable - having it as a global variable is confusing, at best, and at worst risks cases where it is still populated when it shouldn't be.

for the export list, I think we need a global variable , otherwise we need a variable in main() , and then pass into each function.  The llvm-nm is c style code, I do not think we can do all the refactor in this patch.   if you have have good idea, please let me know, I will create  a separate refactor patch after current patch land.

My suggested design I already provided in my previous comment actively needs to avoid the global variable - the proposed getSymbolsFromObject function would return this list for use by the respective calling functions. Global variables make code re-use harder, and lead to temporal dependencies, both of which this patch is inflicted with, in part due to the global SymbolList. Doing the refactor up-front would significantly improve this patch.

Since we have not reach an agreement on the getSymbolsFromObject() , I am prefer to  do "removing the global variable SymbolList" in a later separate refactor patch. Our team is been blocking on the "export symbol list", I understand your concern about the quality of the code,  but I do not think "removing a global variable SymbolList" is in the scope of the patch.
  1. printExportSymbolList loops over all inputs and calls getSymbolsFromObject , I think it need to share the code the dumpSymbolNamesFromFile. and we need to added a additional if(ExportSymbols) in the dumpSymbolNamesFromFile to go to different function .

I'm not sure I follow. In my proposed design, dumpSymbolNamesFromFile should never be touched by the export symbols code route, only getSymbolsFromObject.

when printExportSymbolList loops over all inputs ,  we still need to get object files from archive or from  MachOUniversalBinary, TapiUniversal,  we need the functionality  in  "dumpSymbolNamesFromFile" ,  so we still need the dumpSymbolNamesFromFile() for printExportSymbolList().

If we need the object fetching code from dumpSymbolNamesFromFile then just pull that code into a separate function too... Here's what I'm thinking:

// Class for storing the binary and passing around associated properties, in case it's an object.
struct NMObject {
  std::unique_ptr<Binary> Bin; // Possibly any other members that might be necessary to store archive members.
  StringRef ArchiveName;
  StringRef ArchitectureName;
};

static std::vector<NMSymbol> getSymbolsFromDLInfoMachO(MachOObjectFile &MachO) {
  std::vector<NMSymbol> Symbols;
  // As original implementation of dumpSymbolsFromDLInfoMachO, but replace all references to SymbolList with Symbols.
  return Symbols;
}

static std::vector<NMSymbol> getSymbolsFromObject(NMObject &Obj) {
  std::vector<NMSymbol> Symbols;
  // Parts of original dumpSymbolNamesFromObject that get the symbols that are to be printed.
  // Filtering will be done later, rather than here, i.e. don't add the exportSymbolsForXCOFF call here.
  return Symbols;
}

static void dumpSymbolNamesFromObject(NMObject &Obj) {
  std::vector<NMSymbol> Symbols = getSymbolsFromObject(Obj);
  Symbols = sortSymbolList(std::move(Symbols)); // Or sort in place
  printSymbolList(Symbols, Obj, printName);
}

static std::vector<NMObject> getObjectsFromFile(StringRef InputFile) {
  std::vector<NMObject> Objects;
  /*
    Code from dumpSymbolNamesFromFile which retrieves the objects (and archive properties, if appropriate) inside a binary (may be a single object, or many).
    Emplace an NMObject in the vector for each such constructed object, instead of calling dumpSymbolNamesFromObject.
  */
  return Objects;
}

static void dumpSymbolNamesFromFile(StringRef InputFile) {
  std::vector<NMObject> Objects = getObjectsFromFile(InputFile);
  llvm::for_each(Objects, dumpSymbolNamesFromObject);
}

std::vector<NMSymbol> getExportSymbols(ArrayRef<NMSymbol> Symbols, NMObject &Obj) {
  std::vector<NMSymbol> ExportSymbols;
  if (auto *XCOFF = dyn_cast<XCOFFObjectFile>(Obj.Bin)) {
    // Do what's necessary to find the symbols appropriate for XCOFF export symbols.
  }
  // TODO: Add implementations for other objects here. Possibly warn otherwise.
  return ExportSymbols;
}

static void dumpExportSymbolList(ArrayRef<StringRef> InputFilenames) {
  std::vector<NMSymbol> Symbols;
  for(StringRef InputFile : InputFilenames) {
    std::vector<NMObject> FileObjects = getObjectsFromFile(InputFile);
    for(NMObject &Object : FileObjects) {
      std::vector<NMSymbol> ObjSymbols = getSymbolsFromObject(InputFile);
      ObjSymbols = getExportSymbols(ObjSymbols, Object);
      // Add ObjSymbols to the end of Symbols.
    }
  }
  Symbols = sortSymbolList(Symbols); // Or sort in place
  SymbolList.erase(std::unique(Symbols.begin(), Symbols.end()), Symbols.end());
  printExportSymbols(Symbols);
}

int main() {
  /* ... */
  if(ExportSymbols)
    dumpExportSymbolList(InputFilenames);
  else
    llvm::for_each(InputFilenames, dumpSymbolNamesFromFile);
}

I've not tested this structure out, but I'm pretty confident that it, likely with some small tweaks is both a) not a massive departure from the current structure (meaning we don't need to rewrite large amounts of code), and b) will resolve the structural issues with this patch.

llvm/tools/llvm-nm/llvm-nm.cpp
243 ↗(On Diff #406068)

I'd rename this function, to clarify the intent - setSymFlag implies it's changing from one value to another, whereas initializeFlags shows that we're putting them into their proper initial state, and that it's bad if things go wrong.

254 ↗(On Diff #406068)

It's easier to work with positives, so I'd suggest renaming this shouldPrint, and invert the logic.

285 ↗(On Diff #406068)
702 ↗(On Diff #406068)

Maybe to clarify intent, I'd call this printExportSymbolList.

1716–1723 ↗(On Diff #406068)

This can be simplified, I believe as per the inline edit.

1718 ↗(On Diff #406068)

Use llvm::isDigit

1743 ↗(On Diff #406068)

Consider making this else if, for clarity.

1784–1788 ↗(On Diff #406068)

This code doesn't really belong here, as it's just filtering out the export symbols. Later code will already retrieve XCOFf symbols without needing to do anything extra - we should filter that set of symbols after that point.

This block here is also a good example of why the current implementation in this patch is not a good structure. It looks like none of the code before this block is relevant. The if (DynamicSyms) block isn't relevant, becuase that's only for ELF. The if (!SegSect.empty() && MachO) block is also irrelevant, since an object cannot be both MachO and XCOFF. That also implies that if (!(MachO && DyldInfoOnly)) is always true, if we are XCOFF. Nothing after this block is relevant either, since it returns unconditionally. This shows that this function is not the place for this piece of code.

1849 ↗(On Diff #406068)

I don't understand the need for this addition.

DiggerLin updated this revision to Diff 406840.Feb 8 2022, 8:24 AM

rebased the code base on the https://reviews.llvm.org/D119028 [NFC] Refactor llvm-nm symbol comparing and split sorting

DiggerLin updated this revision to Diff 406935.Feb 8 2022, 12:33 PM
DiggerLin marked 9 inline comments as done.
DiggerLin added a comment.EditedFeb 8 2022, 12:40 PM
I created a refactor patch for above and rebase current patch based on it.

Thanks.

I think you should also do a refactor whereby SymbolList is no longer a global variable - having it as a global variable is confusing, at best, and at worst risks cases where it is still populated when it shouldn't be.

for the export list, I think we need a global variable , otherwise we need a variable in main() , and then pass into each function.  The llvm-nm is c style code, I do not think we can do all the refactor in this patch.   if you have have good idea, please let me know, I will create  a separate refactor patch after current patch land.

My suggested design I already provided in my previous comment actively needs to avoid the global variable - the proposed getSymbolsFromObject function would return this list for use by the respective calling functions. Global variables make code re-use harder, and lead to temporal dependencies, both of which this patch is inflicted with, in part due to the global SymbolList. Doing the refactor up-front would significantly improve this patch.

Since we have not reach an agreement on the getSymbolsFromObject() , I am prefer to  do "removing the global variable SymbolList" in a later separate refactor patch. Our team is been blocking on the "export symbol list", I understand your concern about the quality of the code,  but I do not think "removing a global variable SymbolList" is in the scope of the patch.
  1. printExportSymbolList loops over all inputs and calls getSymbolsFromObject , I think it need to share the code the dumpSymbolNamesFromFile. and we need to added a additional if(ExportSymbols) in the dumpSymbolNamesFromFile to go to different function .

I'm not sure I follow. In my proposed design, dumpSymbolNamesFromFile should never be touched by the export symbols code route, only getSymbolsFromObject.

when printExportSymbolList loops over all inputs ,  we still need to get object files from archive or from  MachOUniversalBinary, TapiUniversal,  we need the functionality  in  "dumpSymbolNamesFromFile" ,  so we still need the dumpSymbolNamesFromFile() for printExportSymbolList().

If we need the object fetching code from dumpSymbolNamesFromFile then just pull that code into a separate function too... Here's what I'm thinking:

// Class for storing the binary and passing around associated properties, in case it's an object.
struct NMObject {
  std::unique_ptr<Binary> Bin; // Possibly any other members that might be necessary to store archive members.
  StringRef ArchiveName;
  StringRef ArchitectureName;
};

static std::vector<NMSymbol> getSymbolsFromDLInfoMachO(MachOObjectFile &MachO) {
  std::vector<NMSymbol> Symbols;
  // As original implementation of dumpSymbolsFromDLInfoMachO, but replace all references to SymbolList with Symbols.
  return Symbols;
}

static std::vector<NMSymbol> getSymbolsFromObject(NMObject &Obj) {
  std::vector<NMSymbol> Symbols;
  // Parts of original dumpSymbolNamesFromObject that get the symbols that are to be printed.
  // Filtering will be done later, rather than here, i.e. don't add the exportSymbolsForXCOFF call here.
  return Symbols;
}

static void dumpSymbolNamesFromObject(NMObject &Obj) {
  std::vector<NMSymbol> Symbols = getSymbolsFromObject(Obj);
  Symbols = sortSymbolList(std::move(Symbols)); // Or sort in place
  printSymbolList(Symbols, Obj, printName);
}

static std::vector<NMObject> getObjectsFromFile(StringRef InputFile) {
  std::vector<NMObject> Objects;
  /*
    Code from dumpSymbolNamesFromFile which retrieves the objects (and archive properties, if appropriate) inside a binary (may be a single object, or many).
    Emplace an NMObject in the vector for each such constructed object, instead of calling dumpSymbolNamesFromObject.
  */
  return Objects;
}

static void dumpSymbolNamesFromFile(StringRef InputFile) {
  std::vector<NMObject> Objects = getObjectsFromFile(InputFile);
  llvm::for_each(Objects, dumpSymbolNamesFromObject);
}

std::vector<NMSymbol> getExportSymbols(ArrayRef<NMSymbol> Symbols, NMObject &Obj) {
  std::vector<NMSymbol> ExportSymbols;
  if (auto *XCOFF = dyn_cast<XCOFFObjectFile>(Obj.Bin)) {
    // Do what's necessary to find the symbols appropriate for XCOFF export symbols.
  }
  // TODO: Add implementations for other objects here. Possibly warn otherwise.
  return ExportSymbols;
}

static void dumpExportSymbolList(ArrayRef<StringRef> InputFilenames) {
  std::vector<NMSymbol> Symbols;
  for(StringRef InputFile : InputFilenames) {
    std::vector<NMObject> FileObjects = getObjectsFromFile(InputFile);
    for(NMObject &Object : FileObjects) {
      std::vector<NMSymbol> ObjSymbols = getSymbolsFromObject(InputFile);
      ObjSymbols = getExportSymbols(ObjSymbols, Object);
      // Add ObjSymbols to the end of Symbols.
    }
  }
  Symbols = sortSymbolList(Symbols); // Or sort in place
  SymbolList.erase(std::unique(Symbols.begin(), Symbols.end()), Symbols.end());
  printExportSymbols(Symbols);
}

int main() {
  /* ... */
  if(ExportSymbols)
    dumpExportSymbolList(InputFilenames);
  else
    llvm::for_each(InputFilenames, dumpSymbolNamesFromFile);
}

I've not tested this structure out, but I'm pretty confident that it, likely with some small tweaks is both a) not a massive departure from the current structure (meaning we don't need to rewrite large amounts of code), and b) will resolve the structural issues with this patch.

thanks for your sample code, I agree with you, the patch is quite large now. if I implement as your suggestion in the patch , I believe that we will need another refactor patch which split the dumpSymbolNamesFromFile function into several small function before the patch. Our patch https://reviews.llvm.org/D119147 is depend on the patch. My suggestion is that I will do refactor as your suggestion after the current patch land and at least let the change code in XCOFFObjectFile.cpp , test case, llvm-nm.rst etc landed, and we can focus on refactoring the llvm-nm.cpp.

if you strong insisted that we should do as your suggestion in the patch first, I can do it and postpones our schedule.

FYI @hubert.reinterpretcast .

llvm/tools/llvm-nm/llvm-nm.cpp
243 ↗(On Diff #406068)

thanks

254 ↗(On Diff #406068)

thanks

285 ↗(On Diff #406068)

thanks

1716–1723 ↗(On Diff #406068)

This can be simplified, I believe as per the inline edit.

1716–1723 ↗(On Diff #406068)

thanks for simplify

1784–1788 ↗(On Diff #406068)

agree with you , but I do not think there is better choice in current structure.

1849 ↗(On Diff #406068)

without setSymFlag(Obj). the S do not have a value SymFlags, it will crash at function
bool shouldSkipPrinting()

DiggerLin added a comment.EditedFeb 8 2022, 12:42 PM
I created a refactor patch for above and rebase current patch based on it.

Thanks.

I think you should also do a refactor whereby SymbolList is no longer a global variable - having it as a global variable is confusing, at best, and at worst risks cases where it is still populated when it shouldn't be.

for the export list, I think we need a global variable , otherwise we need a variable in main() , and then pass into each function.  The llvm-nm is c style code, I do not think we can do all the refactor in this patch.   if you have have good idea, please let me know, I will create  a separate refactor patch after current patch land.

My suggested design I already provided in my previous comment actively needs to avoid the global variable - the proposed getSymbolsFromObject function would return this list for use by the respective calling functions. Global variables make code re-use harder, and lead to temporal dependencies, both of which this patch is inflicted with, in part due to the global SymbolList. Doing the refactor up-front would significantly improve this patch.

Since we have not reach an agreement on the getSymbolsFromObject() , I am prefer to  do "removing the global variable SymbolList" in a later separate refactor patch. Our team is been blocking on the "export symbol list", I understand your concern about the quality of the code,  but I do not think "removing a global variable SymbolList" is in the scope of the patch.
  1. printExportSymbolList loops over all inputs and calls getSymbolsFromObject , I think it need to share the code the dumpSymbolNamesFromFile. and we need to added a additional if(ExportSymbols) in the dumpSymbolNamesFromFile to go to different function .

I'm not sure I follow. In my proposed design, dumpSymbolNamesFromFile should never be touched by the export symbols code route, only getSymbolsFromObject.

when printExportSymbolList loops over all inputs ,  we still need to get object files from archive or from  MachOUniversalBinary, TapiUniversal,  we need the functionality  in  "dumpSymbolNamesFromFile" ,  so we still need the dumpSymbolNamesFromFile() for printExportSymbolList().

If we need the object fetching code from dumpSymbolNamesFromFile then just pull that code into a separate function too... Here's what I'm thinking:

// Class for storing the binary and passing around associated properties, in case it's an object.
struct NMObject {
  std::unique_ptr<Binary> Bin; // Possibly any other members that might be necessary to store archive members.
  StringRef ArchiveName;
  StringRef ArchitectureName;
};

static std::vector<NMSymbol> getSymbolsFromDLInfoMachO(MachOObjectFile &MachO) {
  std::vector<NMSymbol> Symbols;
  // As original implementation of dumpSymbolsFromDLInfoMachO, but replace all references to SymbolList with Symbols.
  return Symbols;
}

static std::vector<NMSymbol> getSymbolsFromObject(NMObject &Obj) {
  std::vector<NMSymbol> Symbols;
  // Parts of original dumpSymbolNamesFromObject that get the symbols that are to be printed.
  // Filtering will be done later, rather than here, i.e. don't add the exportSymbolsForXCOFF call here.
  return Symbols;
}

static void dumpSymbolNamesFromObject(NMObject &Obj) {
  std::vector<NMSymbol> Symbols = getSymbolsFromObject(Obj);
  Symbols = sortSymbolList(std::move(Symbols)); // Or sort in place
  printSymbolList(Symbols, Obj, printName);
}

static std::vector<NMObject> getObjectsFromFile(StringRef InputFile) {
  std::vector<NMObject> Objects;
  /*
    Code from dumpSymbolNamesFromFile which retrieves the objects (and archive properties, if appropriate) inside a binary (may be a single object, or many).
    Emplace an NMObject in the vector for each such constructed object, instead of calling dumpSymbolNamesFromObject.
  */
  return Objects;
}

static void dumpSymbolNamesFromFile(StringRef InputFile) {
  std::vector<NMObject> Objects = getObjectsFromFile(InputFile);
  llvm::for_each(Objects, dumpSymbolNamesFromObject);
}

std::vector<NMSymbol> getExportSymbols(ArrayRef<NMSymbol> Symbols, NMObject &Obj) {
  std::vector<NMSymbol> ExportSymbols;
  if (auto *XCOFF = dyn_cast<XCOFFObjectFile>(Obj.Bin)) {
    // Do what's necessary to find the symbols appropriate for XCOFF export symbols.
  }
  // TODO: Add implementations for other objects here. Possibly warn otherwise.
  return ExportSymbols;
}

static void dumpExportSymbolList(ArrayRef<StringRef> InputFilenames) {
  std::vector<NMSymbol> Symbols;
  for(StringRef InputFile : InputFilenames) {
    std::vector<NMObject> FileObjects = getObjectsFromFile(InputFile);
    for(NMObject &Object : FileObjects) {
      std::vector<NMSymbol> ObjSymbols = getSymbolsFromObject(InputFile);
      ObjSymbols = getExportSymbols(ObjSymbols, Object);
      // Add ObjSymbols to the end of Symbols.
    }
  }
  Symbols = sortSymbolList(Symbols); // Or sort in place
  SymbolList.erase(std::unique(Symbols.begin(), Symbols.end()), Symbols.end());
  printExportSymbols(Symbols);
}

int main() {
  /* ... */
  if(ExportSymbols)
    dumpExportSymbolList(InputFilenames);
  else
    llvm::for_each(InputFilenames, dumpSymbolNamesFromFile);
}

I've not tested this structure out, but I'm pretty confident that it, likely with some small tweaks is both a) not a massive departure from the current structure (meaning we don't need to rewrite large amounts of code), and b) will resolve the structural issues with this patch.

thanks for your sample code, I agree with you, but the patch is quite large now. if I implement as your suggestion in the patch , I believe that we will need another refactor patch which split the dumpSymbolNamesFromFile function into several small function before the patch. Our patch https://reviews.llvm.org/D119147 is depend on the patch. My suggestion is that I will do refactor as your suggestion after the current patch land and at least let the change code in XCOFFObjectFile.cpp , test case, llvm-nm.rst etc which not related to the refactor landed first, and we can focus on refactoring the llvm-nm.cpp.

if you strong suggest that we should do as your suggestion in the patch first, I can do it and postpones our schedule. thanks.

FYI @hubert.reinterpretcast .

Okay, I'm willing to accept delaying further refactoring, as long as it is definitely going to happen sooner rather than later. Just a few nits to tidy up before I think you can land this patch.

llvm/test/tools/llvm-nm/XCOFF/export-symbols-with-invalid-section-index.test
1 ↗(On Diff #406935)

Could you fold this test case into the positive test case file? Would that allow you to reuse the YAML object, using -D to provide the invalid section index?

llvm/tools/llvm-nm/llvm-nm.cpp
254 ↗(On Diff #406068)

Just noting this hasn't been done, although you can defer to another patch if you prefer.

1784–1788 ↗(On Diff #406068)

Let's at least in this patch pull the block outside of the if (!(MachO && DyldInfoOnly)) clause, since it doesn't need to be inside it.

1849 ↗(On Diff #406068)

But the code didn't have this before, so why does it need it now? Is this a result of some of the refactoring work you've already done? I take it there's a concrete reproducible you could craft that would cause the crash without this change?

1658 ↗(On Diff #406935)

Let's rename this function now: getXCOFFExports seems like a nice concise name.

DiggerLin marked 5 inline comments as done.EditedFeb 9 2022, 10:00 AM

Okay, I'm willing to accept delaying further refactoring, as long as it is definitely going to happen sooner rather than later. Just a few nits to tidy up before I think you can land this patch.

Thanks, I can start the refactor from next week.

Okay, I'm willing to accept delaying further refactoring, as long as it is definitely going to happen sooner rather than later. Just a few nits to tidy up before I think you can land this patch.

Thanks a lot , I will begin to refactor as soon as possible, I think I can begin to refactor  from late of next week.
llvm/test/tools/llvm-nm/XCOFF/export-symbols-with-invalid-section-index.test
1 ↗(On Diff #406935)

thanks.

llvm/tools/llvm-nm/llvm-nm.cpp
254 ↗(On Diff #406068)

done ,thanks

1784–1788 ↗(On Diff #406068)

thanks

1849 ↗(On Diff #406068)

the function shouldSkipPrinting() is called
in the printExportSymbolList()
but the printExportSymbolList are called after dumpSymbolNamesFromFile , The object binary file maybe be freed from memory when there are several files inputs from command line. in the function dumpSymbolNamesFromFile() ,

Expected<std::unique_ptr<Binary>> BinaryOrErr =
      createBinary(BufferOrErr.get()->getMemBufferRef(), ContextPtr);

it generate a binary and freed it at the end of the function . so if we still try to get the SymFlags of symbol from bit code object file , it will be crashed.

so I set the SymFlag for getSymbolNamesFromObject()
as

if (S.initializeFlags(Obj))
  SymbolList.push_back(S);

dumpSymbolsFromDLInfoMachO() initiate the SymFlag for SymFlags . but getSymbolNamesFromObject do not set it, for consistency, I set the SymFlag for the NMSymbol in getSymbolNamesFromObject

1658 ↗(On Diff #406935)

thanks

DiggerLin marked 4 inline comments as done.Feb 9 2022, 10:01 AM
DiggerLin updated this revision to Diff 407285.EditedFeb 9 2022, 1:53 PM
  1. address comment.
  2. move the functionality of "removing the symbols which should not be printed" to before the functionality of "removing duplication symbols". and add a new test for it. Name: __tf2value Section: .data Type: 0x0 StorageClass: C_HIDEXT AuxEntries:
    • Type: AUX_CSECT SymbolAlignmentAndType: 0x21 StorageMappingClass: XMC_TC

Sometime we have two symbols, which both has the same name and no visibility, one's storageClass is C_HIDEXT (the should not be export), another is C_EXT(the one should be exported).
When compare these two symbols for "--export-symbols", these two symbol are same, if we remove duplication symbol before removing the symbols which should not print. the Symbol with C_EXT maybe removed. it will cause none of the two symbol be printed out.

jhenderson added inline comments.Feb 10 2022, 1:40 AM
llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
5 ↗(On Diff #407285)

Tip: you can avoid the need for -DSECT=2 in the "regular" case, by using a default value in the YAML. I can't remember if the syntax is [[SECT=2]] or [[SECT:2]], but one of them should work (look for examples in other tests).

llvm/tools/llvm-nm/llvm-nm.cpp
1849 ↗(On Diff #406068)

Thanks. It would be nice we can avoid this issue with the proposed refactoring (and therefore delete this new code), but if it's needed for now, so be it.

701 ↗(On Diff #407285)

Nit: clang-format is complaining.

1668 ↗(On Diff #407285)

Should this be a continue rather than a return? Also, should it be closer to the rest of the name checks further down in this loop?

1677–1713 ↗(On Diff #407285)

I haven't got the time right now to check this myself, but you should ensure that you have symbols that trigger every possible code path that lead to them being skipped/not skipped.

A quick skim suggests at least the following:

  1. INTERNAL visibility
  2. HIDDEN visibility
  3. No visibility attribute
  4. Visibility attribute that isn't internal or hidden.
  5. Symbol with error when looking up section
  6. Symbol that is not in a section (SecIter == XCOFFObj->section_end())
  7. Text symbol
  8. Data symbol
  9. BSS symbol
  10. Symbol that is not one of the three previous types
  11. Symbol with empty name
  12. Symbol with name look-up failure
  13. __sinit prefixed symbol
  14. __sterm prefixed symbol
  15. . prefixed symbol
  16. ( prefixed symbol
  17. ____ named symbol (i.e. four underscores, nothing between prefix and suffix)
  18. __<digits>__ symbol
  19. __<digits> (no suffix)
  20. <digits>__ (no prefix)
  21. __<digits and something non digit>__ symbol
1720–1721 ↗(On Diff #407285)

This block appears after the name adjustments for __tf1 and __tf9. If the intent is that __tf1__rsrc and __tf9_rsrc are skipped with --no-rsrc, you need to check them. In fact, it probably doens't hurt to have that test case even if they should be kept.

1773–1777 ↗(On Diff #407285)

I'd be tempted to move this block right to the start of the function, since the other bits are completely irrelevant if we trigger this block.

DiggerLin updated this revision to Diff 407656.Feb 10 2022, 1:11 PM
DiggerLin marked 5 inline comments as done.
DiggerLin added inline comments.Feb 10 2022, 1:15 PM
llvm/tools/llvm-nm/llvm-nm.cpp
1668 ↗(On Diff #407285)

thanks

1677–1713 ↗(On Diff #407285)

I added all test cases you mention above except the "12 Symbol with name look-up failure"

from the source code in the XCOFFObjectFile.cpp

Expected<StringRef> XCOFFSymbolRef::getName() const {
  // A storage class value with the high-order bit on indicates that the name is
  // a symbolic debugger stabstring.
  if (getStorageClass() & 0x80)
    return StringRef("Unimplemented Debug Name");

  if (Entry32) {
    if (Entry32->NameInStrTbl.Magic != XCOFFSymbolRef::NAME_IN_STR_TBL_MAGIC)
      return generateXCOFFFixedNameStringRef(Entry32->SymbolName);

    return OwningObjectPtr->getStringTableEntry(Entry32->NameInStrTbl.Offset);
  }

  return OwningObjectPtr->getStringTableEntry(Entry64->Offset);
}

It never return an Error.

The code in the function exportSymbolsForXCOFF

Expected<StringRef> NameOrErr = Sym.getName();
   if (!NameOrErr) {
     warn(NameOrErr.takeError(), XCOFFObj->getFileName(),
          "for symbol with index " +
              Twine(XCOFFObj->getSymbolIndex(Sym.getRawDataRefImpl().p)),
          ArchiveName);
     continue;
   }
   StringRef SymName = *NameOrErr;

can be modified to

StringRef SymName = cantFail(Sym.getName());

but I prefer to keep as it is. The reason as:

if Expected<StringRef> XCOFFSymbolRef::getName() const is changed to return with Error in some day.

using StringRef SymName = cantFail(Sym.getName()); will cause an llvm-unreable.

1720–1721 ↗(On Diff #407285)

good catch, thanks , the tf1_rsrc should be export out as "__rsrc" , I changed the code and add test case for it.

jhenderson added inline comments.Feb 11 2022, 12:13 AM
llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
72 ↗(On Diff #407656)

It's probably worth adding a comment to the start of each symbol block in this YAML explaining what case(s) that symbol covers.

llvm/tools/llvm-nm/llvm-nm.cpp
1677–1713 ↗(On Diff #407285)

I think you should switch to cantFail, unless you know that you are going to change getName soon. The reasons are:

  1. it simplifies the code dramatically, making it more readable.
  2. you can't test the current code, so there's no guarantee that even if getName is changed that it is handled here appropriately.
  3. if getName is changed to return an Error at some point, call sites should be audited for additional checks/tests that are needed. As such, this case should be picked up anyway.
DiggerLin marked 3 inline comments as done.Feb 14 2022, 12:23 PM
DiggerLin added inline comments.
llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
72 ↗(On Diff #407656)

I think most the symbol name can express what we want to test. I only added some visibility comment on the YAML . and symbol name "var_extern" which test the

if (SecIter == XCOFFObj->section_end())
  continue;
5 ↗(On Diff #407285)

thanks a lot

llvm/tools/llvm-nm/llvm-nm.cpp
1677–1713 ↗(On Diff #407285)

thanks

DiggerLin marked 3 inline comments as done.

We're basically there (for now). Just some test tidy-ups/clarifications remaining.

llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
55 ↗(On Diff #408561)

I couldn't find a symbol that fits this test case: "Symbol that is not one of the three previous types", i.e. that isn't text, data, or bss. Please add a comment/rename the symbol to highlight it.

119 ↗(On Diff #408561)

I might have missed it, but I think this is the only .text symbol? I believe you want a .text symbol that is actually exported (to show that such symbols can be exported). It's okay for this to be combined with another property that is being tested (e.g. protected visibility).

195–196 ↗(On Diff #408561)

I noticed that the "StorageClass" is different between this and the "Exported" case. If you changed it to C_EXT, would the symbol be exported? I'm assuming that the "Type" field is the one that defines visibility here...

More generally, for the cases where the symbol isn't printed, if possible there should be exactly one thing that would need changing to cause it to be printed. This goes for every case (hidden, internal, empty name etc).

220–221 ↗(On Diff #408561)

I think this is the "symbol not in a section case?" If so, perhaps name it more explicitly, and/or add a comment.

226 ↗(On Diff #408561)

Probably should have a comment to highlight what is significant about this case.

235 ↗(On Diff #408561)

Probably should have a comment to highlight what is significant about this case.

244 ↗(On Diff #408561)

Probably should have a comment to highlight what is significant about this case.

253 ↗(On Diff #408561)

Probably should have a comment to highlight what is significant about this case.

llvm/tools/llvm-nm/llvm-nm.cpp
1661 ↗(On Diff #408561)

SymbolRef is designed for lightweight copying, like StringRef, so no need for const &.

DiggerLin marked 7 inline comments as done.Feb 15 2022, 12:40 PM
DiggerLin added inline comments.
llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
55 ↗(On Diff #408561)

the .debug in the section .debug which is not the text, data, or bss.

The

Name:            debug
 Section:         .debug

I think symbol name 'debug" and Section Name ".debug" can explain itself.

119 ↗(On Diff #408561)
Name:            .func
Section:         .text

means the symbol ".func" in the .text section.

 Name:  .text
Section: .text

means the symbol ".text" in the .text section.

 Name:            .func
 Section:         .text 
 
test

Do not export global symbols beginning with "."

195–196 ↗(On Diff #408561)

I noticed that the "StorageClass" is different between this and the "Exported" case. If you changed it to C_EXT, would the symbol be exported? I'm assuming that the "Type" field is the one that defines visibility here.

If you changed it to C_EXT, and visibility is Hidden, it not exported. "Type" field is the one that defines visibility.

for empty Name. it is generated by the IBM compiler xlclang, which storage name always is C_HIDEXT, always not exported. and it is no sense to export empty symbol name.

I changed StorageClass of empty symbol, "internal_var"
, "hidden_var" to C_EXT.

and add a new test

  - Name:            hidext_var
    Section:         .data
## Protected visibility.
    Type:            0x3000
    StorageClass:    C_HIDEXT
    AuxEntries:
     - Type:                   AUX_CSECT
       SymbolAlignmentAndType: 0x09
       StorageMappingClass:    XMC_RW
220–221 ↗(On Diff #408561)

I change to undef_var, which is not in any section

llvm/tools/llvm-nm/llvm-nm.cpp
1661 ↗(On Diff #408561)

no copy is not better than any lightweight copying ? I change as your suggestion anyway.

DiggerLin updated this revision to Diff 409023.Feb 15 2022, 1:03 PM
DiggerLin marked 5 inline comments as done.
jhenderson added inline comments.Feb 16 2022, 1:53 AM
llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
55 ↗(On Diff #408561)

I'd like a comment that says "A symbol that is neither text, nor data, nor bss." specifically, to show that it's not that the symbol is a debug symbol, but rather that it isn't one of those three categories. That way, if behaviour were to change in the future, and debug symbols became special, there wouldn't be a risk of loss of coverage (in theory).

119 ↗(On Diff #408561)

When I say ".text" symbol, I mean a symbol in ".text". The current logic has SecIter->isText() as a possible condition for exporting. The .func test case hits this, but is then not exported for other reasons. You need a test case that also hits this, and is exported. Otherwise, you could delete that part of the conditional and you'd see no test failures.

llvm/tools/llvm-nm/llvm-nm.cpp
1661 ↗(On Diff #408561)

You left the const in, please remove it ;)

I wouldn't be surprised if the compiler optimizes away the copy anyway, since there's no modification of Sym. The copy is then less code, so makes it a little easier to read.

DiggerLin updated this revision to Diff 409252.Feb 16 2022, 7:48 AM
DiggerLin marked an inline comment as done.
DiggerLin added inline comments.
llvm/test/tools/llvm-nm/XCOFF/export-symbols.test
119 ↗(On Diff #408561)

I changed the

Name:            __tf1_tf1value
Section:         .data

-->

Name:            __tf1_tf1value
Section:         .text
jhenderson accepted this revision.Feb 16 2022, 7:51 AM

Okay, looks good from me. Best wait for @MaskRay too.

MaskRay accepted this revision.Feb 16 2022, 8:38 PM

LGTM.

llvm/docs/CommandGuide/llvm-nm.rst
280 ↗(On Diff #409252)
llvm/test/tools/llvm-nm/bitcode-export-sym.test
1 ↗(On Diff #409252)

Add # REQUIRES: powerpc-registered-target

This revision is now accepted and ready to land.Feb 16 2022, 8:38 PM
This revision was landed with ongoing or failed builds.Feb 17 2022, 8:37 AM
This revision was automatically updated to reflect the committed changes.
DiggerLin marked 2 inline comments as done.Feb 17 2022, 8:39 AM
DiggerLin added inline comments.
llvm/docs/CommandGuide/llvm-nm.rst
280 ↗(On Diff #409252)

thanks

llvm/test/tools/llvm-nm/bitcode-export-sym.test
1 ↗(On Diff #409252)

I wonder why we need # REQUIRES: powerpc-registered-target here . I think the llvm is cross compile, it can generate the bit code for powerpc on other platform, I added the line anyway.