...and then remove all the duplicates after this process.
Many test-cases here were taken from the actual symbols defined in a
build of llvm/clang. But, having this file contain almost every
llvm/clang symbol name is somewhat irritating when using 'git grep',
and makes this file much larger than is necessary.
So, for the set of llvm/clang-extracted symbols, we rewrite the names to
single-letter names using this rather hacky Python script:
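import sys, re
from collections import OrderedDict
# Temporary placeholder characters (outside ASCII) and the letters they map to.
highletters=''.join([chr(0x80+n) for n in range(26)])
letters=''.join([chr(ord('a')+n) for n in range(26)])
# str.maketrans is the Python 3 spelling of Python 2's string.maketrans.
translation = str.maketrans(highletters, letters)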
word_re = re.compile('[a-zA-Z_][a-zA-Z0-9_]*')
def rewrite_demangle(mangled, demangled):
    allwords = [word for word in word_re.findall(demangled)
                if word != 'const' and
                mangled.find('%d%s' % (len(word), word)) != -1]
    allwords = list(enumerate(OrderedDict.fromkeys(allwords)))
    allwords.sort(key=lambda x: len(x[1]), reverse=True)
    # Replace names with a unique character first, so that subsequent
    # replacements don't accidentally replace it a second time.
    for namenum, word in allwords:
        mangled = mangled.replace('%d%s' % (len(word), word),
                                  '1' + chr(namenum + 0x80))
        demangled = re.sub(r'\b'+word+r'\b', chr(namenum + 0x80), demangled)
    # Then return, with actual alphabetic characters.
    return mangled.translate(translation), demangled.translate(translation)
for l in sys.stdin:
    mangled, unmangled = l.rstrip('\n').split(' ', 1)
    sys.stdout.write(" {\"%s\", \"%s\"},\n" % rewrite_demangle(mangled, unmangled))
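
To make the effect of the rewrite concrete, here is a minimal Python 3 sketch of the same idea applied to a single pair. The function name shorten_names and the sample symbol are illustrative assumptions, not taken from the actual corpus. It assumes at most 26 distinct names per symbol, as the script above does.

```python
import re

WORD_RE = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*')

def shorten_names(mangled, demangled):
    """Hypothetical helper: rename each identifier to a single letter."""
    # Keep identifiers that occur in the mangled name as <length><name>.
    words = [w for w in WORD_RE.findall(demangled)
             if w != 'const' and '%d%s' % (len(w), w) in mangled]
    # Number the unique names in first-seen order, then process the
    # longest names first.
    numbered = list(enumerate(dict.fromkeys(words)))
    numbered.sort(key=lambda item: len(item[1]), reverse=True)
    for num, word in numbered:
        # Use a non-ASCII placeholder so a later replacement cannot
        # match text produced by an earlier one.
        placeholder = chr(0x80 + num)
        mangled = mangled.replace('%d%s' % (len(word), word),
                                  '1' + placeholder)
        demangled = re.sub(r'\b' + re.escape(word) + r'\b',
                           placeholder, demangled)
    # Finally map the placeholders to 'a'..'z'.
    table = {0x80 + n: ord('a') + n for n in range(26)}
    return mangled.translate(table), demangled.translate(table)

# Illustrative input only; not a line from the real test corpus.
print(shorten_names('_ZN4llvm5Value7getNameEv', 'llvm::Value::getName()'))
# ('_ZN1a1b1cEv', 'a::b::c()')
```

Letters are assigned in the order the names first appear in the demangled form, so the shortened pair keeps the structure of the original mangling while dropping the long names.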