This patch implements a variant of the DJB hash function which folds the
input according to the algorithm in the DWARF 5 specification (Section
6.1.1.4.5), which in turn references the Unicode Standard (Section 5.18,
"Case Mappings").
To achieve this, I have added an llvm::sys::unicode::foldCharSimple
function, which performs this mapping. The implementation of this
function was generated from the CaseFolding.txt file from the Unicode
spec using a Python script (which is also included in this patch). The
script tries to optimize the function by coalescing adjacent mappings
with the same shift and stride (terms I made up). Theoretically, it
could be made a bit smarter and merge adjacent blocks that are
interrupted by only one or two characters with exceptional mappings,
but this would save only a couple of branches while greatly
complicating the implementation, so I deemed it not worth it.
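To give a rough idea of the kind of code the script emits (this is a
made-up excerpt, not the generated file): ranges that share the same
stride (distance between consecutive mapped code points) and shift
(offset to the folded code point) collapse into a single range check.

```cpp
#include <cstdint>

// Hypothetical excerpt, not the generated file: two coalesced ranges
// plus the default "folds to itself" case.
uint32_t foldCharSimpleSketch(uint32_t C) {
  // 'A'..'Z' -> 'a'..'z': stride 1, shift +32.
  if (C >= 0x41 && C <= 0x5a)
    return C + 0x20;
  // U+0100..U+012E, even code points only -> next code point:
  // stride 2, shift +1 (Latin Extended-A upper/lower pairs).
  if (C >= 0x100 && C <= 0x12e && C % 2 == 0)
    return C + 1;
  // ... many further ranges generated from the Unicode data ...
  return C; // no simple case folding: the character folds to itself
}
```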
Since we assume that the vast majority of the input characters will be
US-ASCII, the folding hash function has a fast path for handling these,
and only whips out the full decode+fold+encode logic if we encounter a
character outside of this range. It might be possible to implement the
folding directly on UTF-8 sequences, but this would also bring a lot of
complexity for the few cases where we actually need to process
non-ASCII characters.
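To make that structure concrete, here is a self-contained sketch of the
fast path (illustrative only: a trivial ASCII-only fold stands in for
the real table, and malformed UTF-8 is not handled):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch, not the patch's code. The fold function only
// knows the ASCII rule; the real foldCharSimple covers the whole
// simple-folding table.
static uint32_t foldSketch(uint32_t C) {
  return (C >= 'A' && C <= 'Z') ? C + 0x20 : C;
}

// Decode one UTF-8 sequence starting at Buffer[I], advancing I.
// Assumes well-formed input to keep the sketch short.
static uint32_t decodeUtf8(const std::string &Buffer, size_t &I) {
  unsigned char B0 = Buffer[I++];
  uint32_t CP;
  int Continuations;
  if (B0 < 0xc0) {          // stray continuation byte: pass it through
    return B0;
  } else if (B0 < 0xe0) {   // 2-byte sequence
    CP = B0 & 0x1f;
    Continuations = 1;
  } else if (B0 < 0xf0) {   // 3-byte sequence
    CP = B0 & 0x0f;
    Continuations = 2;
  } else {                  // 4-byte sequence
    CP = B0 & 0x07;
    Continuations = 3;
  }
  for (int K = 0; K < Continuations && I < Buffer.size(); ++K)
    CP = (CP << 6) | (Buffer[I++] & 0x3f);
  return CP;
}

// Re-encode one code point as UTF-8.
static void encodeUtf8(uint32_t CP, std::vector<uint8_t> &Out) {
  if (CP < 0x80) {
    Out.push_back(CP);
  } else if (CP < 0x800) {
    Out.push_back(0xc0 | (CP >> 6));
    Out.push_back(0x80 | (CP & 0x3f));
  } else if (CP < 0x10000) {
    Out.push_back(0xe0 | (CP >> 12));
    Out.push_back(0x80 | ((CP >> 6) & 0x3f));
    Out.push_back(0x80 | (CP & 0x3f));
  } else {
    Out.push_back(0xf0 | (CP >> 18));
    Out.push_back(0x80 | ((CP >> 12) & 0x3f));
    Out.push_back(0x80 | ((CP >> 6) & 0x3f));
    Out.push_back(0x80 | (CP & 0x3f));
  }
}

// Folding hash with the US-ASCII fast path: ASCII bytes are folded and
// hashed directly; only a byte outside that range triggers the
// decode + fold + re-encode detour.
uint32_t caseFoldingDjbHashSketch(const std::string &Buffer) {
  uint32_t H = 5381;
  for (size_t I = 0; I < Buffer.size();) {
    unsigned char C = Buffer[I];
    if (C <= 0x7f) {                          // fast path
      H = (H << 5) + H + foldSketch(C);
      ++I;
      continue;
    }
    std::vector<uint8_t> Encoded;             // slow path
    encodeUtf8(foldSketch(decodeUtf8(Buffer, I)), Encoded);
    for (uint8_t B : Encoded)
      H = (H << 5) + H + B;
  }
  return H;
}
```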
The known-values tests have already paid off in that they detected that the result of the hash function depends on the signedness of char (which did seem a bit dodgy to me when I looked at it, but I assumed we knew what we were doing).
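For illustration, a minimal standalone reproduction of that dependence (hypothetical example, not code from the patch):

```cpp
#include <cstdint>
#include <cstdio>

// The same bytes pushed through the "times 33" recurrence as plain char
// vs. unsigned char. On a target where char is signed, any byte >= 0x80
// sign-extends before being added, changing the result.
template <typename CharT> uint32_t djb(const CharT *S) {
  uint32_t H = 5381;
  for (; *S; ++S)
    H = (H << 5) + H + static_cast<uint32_t>(*S);
  return H;
}

int main() {
  const char Str[] = "\xc3\xa9"; // UTF-8 encoding of 'é'
  std::printf("as char:          %u\n", static_cast<unsigned>(djb(Str)));
  std::printf("as unsigned char: %u\n",
              static_cast<unsigned>(
                  djb(reinterpret_cast<const unsigned char *>(Str))));
}
```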
For the new function the fix is obvious (treat the input as unsigned char, as DWARF 5 specifies), but it's not clear to me how to handle the existing hash function, whose broken hashes are already in the wild.
@aprantl, @JDevlieghere: How should we handle this? This is only used on Apple platforms, and I expect most Apple code to be compiled on x86 hosts (right?), where char is signed, so explicitly setting this to signed char for the Apple hash would probably have the least impact while maintaining consistency for the future. OTOH, LLDB already implements this hash function via unsigned char, and the discrepancy has gone unnoticed, so maybe it does not matter? (The effect is only visible for character values >= 128, so US-ASCII is fine, but any non-ASCII UTF-8 character would trigger it.)