This is an archive of the discontinued LLVM Phabricator instance.

Implement a case-folding version of DJB hash
ClosedPublic

Authored by labath on Jan 31 2018, 8:06 AM.

Details

Summary

This patch implements a variant of the DJB hash function which folds the
input according to the algorithm in the Dwarf 5 specification (Section
6.1.1.4.5), which in turn references the Unicode Standard (Section 5.18,
"Case Mappings").

To achieve this, I have added a llvm::sys::unicode::foldCharSimple
function, which performs this mapping. The implementation of this
function was generated from the CaseMatching.txt file from the Unicode
spec using a python script (which is also included in this patch). The
script tries to optimize the function by coalescing adjecant mappings
with the same shift and stride (terms I made up). Theoretically, it
could be made a bit smarter and merge adjecant blocks that were
interrupted by only one or two characters with exceptional mapping, but
this would save only a couple of branches, while it would greatly
complicate the implementation, so I deemed it was not worth it.

Since we assume that the vast majority of the input characters will be
US-ASCII, the folding hash function has a fast-path for handling these,
and only whips out the full decode+fold+encode logic if we encounter a
character outside of this range. It might be possible to implement the
folding directly on utf8 sequences, but this would also bring a lot of
complexity for the few cases where we will actually need to process
non-ascii characters.

Diff Detail

Repository
rL LLVM

Event Timeline

labath created this revision.Jan 31 2018, 8:06 AM
labath updated this revision to Diff 132186.Jan 31 2018, 8:41 AM

Run the generation script over unicode v9 case folding database (previous one was v8). This adds a couple of additional mappings.

JDevlieghere accepted this revision.Jan 31 2018, 8:54 AM

LGTM. I'm particularly happy with the python script to auto-generate this. Thanks Pavel!

This revision is now accepted and ready to land.Jan 31 2018, 8:54 AM

Wow. Thanks!

Is this function something that needs to be updated with every new version of Unicode? Should we add a "How to update this for newer versions" comment?

aprantl added inline comments.
lib/Support/DJB.cpp
63 ↗(On Diff #132186)

Nit: We try to spell this DWARF v5 consistently.

lib/Support/UnicodeCaseFold.cpp
5 ↗(On Diff #132186)

by utils/dwarf-case-fold.py

JDevlieghere added inline comments.Jan 31 2018, 9:12 AM
lib/Support/DJB.cpp
63 ↗(On Diff #132186)

To add to the confusion, I usually write DWARFv5...

Nvm, a quick grep shows that I'm the one doing it wrong :-(

Nvm, a quick grep shows that I'm the one doing it wrong :-(

It might actually make sense to fix this to facilitate git-grepping...

Nvm, a quick grep shows that I'm the one doing it wrong :-(

It might actually make sense to fix this to facilitate git-grepping...

Agree, let's go with DWARF v5. I'll update my spelling when I touch the files :-)

probinson requested changes to this revision.Jan 31 2018, 10:29 AM
probinson added inline comments.
lib/Support/UnicodeCaseFold.cpp
1 ↗(On Diff #132186)

This file should have a header. If we expect it will always be a generated file, it should have a comment saying so and where to look for how to regenerate it. If not, it should have a normal LLVM module header.

This revision now requires changes to proceed.Jan 31 2018, 10:29 AM
dblaikie added inline comments.Jan 31 2018, 10:47 AM
unittests/Support/DJBTest.cpp
16 ↗(On Diff #132186)

Given that this is a specific hash with a guaranteed result (for portability - debuggers are expected to have their own implementation of this hash that would produce the same values, etc) - should this have some tests for the actual hash values of some non-trivial text?

(also this test doesn't test for any hash values differing - the whole test would pass if the hash function returned a single constant for all inputs, right?)

labath marked 3 inline comments as done.Feb 1 2018, 6:18 AM

Is this function something that needs to be updated with every new version of Unicode? Should we add a "How to update this for newer versions" comment?

That's a good question. The DWARF spec specifically cites the version 9 of the Unicode Standard. This was the latest version at the time dwarf v5 was published, so it's not fully clear whether they meant to hard-code the version or they just meant "the latest".

Currently, the latest version is v10 that came out in June 2017. However, it has no changes in the case folding database, so we don't have to figure this out yet.

For the future, it's not really clear to me what to do here. The Unicode spec does offer some guarantees about the backwards compatibility in regards to the case folding database (*), so updating the mappings should be kinda safe, because any new characters they add are the ones that would be invalid according to the previous specifications. OTOH, the dwarf spec is pretty explicit about the version we should use.

(*) Unicode v10, Section 5.18, "Case Mappings", page 243 http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf

Stability. The definition of case folding is guaranteed to be stable, in that any string of
characters case folded according to these rules will remain case folded in Version 5.0 or later
of the Unicode Standard. To achieve this stability, there are constraints on additions of case
pairs for existing encoded characters. Typically, no new lowercase character will be added
to the Unicode Standard as a casing pair of an existing upper- or titlecase character that
does not already have a lowercase pair. In exceptional circumstances, where lowercase characters
must be added to the standard in a later version than the version in which the corresponding
uppercase characters were encoded, such lowercase characters can only be
defined as new case pairs with a corresponding change to case folding to ensure that they
case fold to the old uppercase letters. See the subsection “Policies” in Section B.3, Other Unicode
Online Resources.

lib/Support/UnicodeCaseFold.cpp
1 ↗(On Diff #132186)

I think this will always remain an autogenerated file. All tweaks to the implementation would probably be easier made by modifying the script and regenerating.

I've added a header which explains this and how to regenerate it. Let me know what you think.

unittests/Support/DJBTest.cpp
16 ↗(On Diff #132186)

Good point, I've added more tests, which compare hashes against fixed constants.

labath updated this revision to Diff 132380.Feb 1 2018, 6:21 AM
  • renamed the generating script to unicode-case-fold (as it does nothing dwarf-specific).
  • made the script take a url as an argument (mainly so it can then dump it out in the generated code)
  • tweaked the script to generate clang-format-compatible files
  • added a header to the generater file and
  • added more tests

Header looks fine.

utils/unicode-case-fold.py
5 ↗(On Diff #132380)

"Parses the database..."

42 ↗(On Diff #132380)

Missing full-stop.

103 ↗(On Diff #132380)

Capitalize, full-stop.

Is this function something that needs to be updated with every new version of Unicode? Should we add a "How to update this for newer versions" comment?

That's a good question. The DWARF spec specifically cites the version 9 of the Unicode Standard. This was the latest version at the time dwarf v5 was published, so it's not fully clear whether they meant to hard-code the version or they just meant "the latest".

Currently, the latest version is v10 that came out in June 2017. However, it has no changes in the case folding database, so we don't have to figure this out yet.

I don't remember the discussion that well, but I'd guess the idea was to fix it permanently (or at least permanently for DWARF v5). Changes to the folding would change the hash for strings that used those characters, which is a Bad Thing.
If there's a change to the case-folding database before DWARF v6 comes out, and the committee notices, the hash might become version-dependent. But that's years away.

JDevlieghere added inline comments.Feb 1 2018, 8:34 AM
lib/Support/UnicodeCaseFold.cpp
1 ↗(On Diff #132380)

And here.

unittests/Support/DJBTest.cpp
1 ↗(On Diff #132380)

Same here.

utils/unicode-case-fold.py
110 ↗(On Diff #132380)

Nit: missing ---*- C++ -*-===// at the end.

As @dblaikie pointed out via e-mail, the C/C++ stuff is only required for headers files, so feel free to ignore my previous comments.

labath updated this revision to Diff 132558.Feb 2 2018, 3:01 AM

Fixup comments in the python file.

labath marked 10 inline comments as done.Feb 8 2018, 3:51 AM

I believe I have resolved all outstanding issues. If there's anything else you'd like me to do, let me know (this is mainly addressed at @probinson, as his initial NACK is keeping this review red).

This revision is now accepted and ready to land.Feb 8 2018, 8:54 AM
joerg added a subscriber: joerg.Feb 8 2018, 9:52 AM
joerg added inline comments.
lib/Support/UnicodeCaseFold.cpp
26 ↗(On Diff #132558)

Given that this should be applied to symbol names a lot, I would explicitly make the ASCII range fully covered to avoid all the other branches from getting triggered.

labath added inline comments.Feb 8 2018, 9:59 AM
lib/Support/UnicodeCaseFold.cpp
26 ↗(On Diff #132558)

There's a fast path in caseFoldingDjbHash for 7-bit ascii which should cover all reasonable symbol names. That will short circuit the case folding before even this function gets invoked (skipping all utf8 decode-recode logic, etc). Is that what you were looking for?

joerg added inline comments.Feb 8 2018, 10:47 AM
lib/Support/UnicodeCaseFold.cpp
26 ↗(On Diff #132558)

Partially. I wonder if it wouldn't be better in general to restructure the output to have full distinct ranges. I don't think the performance will be very nice specifically for non-upper case letters otherwise. It would also be nice if the code generator didn't do obviously stupid output like the C % 1 == 0 above. Just in terms of branches when doing simple codegen, the following is no worse:

if (C < 0x0041) ...
if (C <= 0x005a) ...
if (C <= 0xc0) ...

and folding the parts of the switch into the appropiate ranges.

labath updated this revision to Diff 133576.Feb 9 2018, 2:49 AM

Modify the code generator script to produce patterns suggested by joerg. This
will make the function return quickly for lower-valued characters, which are
probably more likely to appear in input.

This revision was automatically updated to reflect the committed changes.
labath reopened this revision.Feb 14 2018, 3:49 AM
labath added inline comments.
llvm/trunk/lib/Support/DJB.cpp
26–27

The know values tests have already payed off in that they detected that the result of the hash function depends on the signedness of char (which did seem a bit dodgy to me when I looked at it, but I assumed we knew what we were doing).

For the new function the fix is obvious (as unsigned char, as dwarf5 specifies), but it's not clear to me how to handle the existing hash function, whose broken hashes are already in the wild.

@aprantl, @JDevlieghere: How should we handle this? This is only used on apple platforms, and I expect most of apple code to be compiled on x86 hosts (right?), which has char signed, so explicitly setting this to signed for the apple hash would probably have the least impact while maintaining consistency for the future. OTOH, lldb already implements this hash functions via unsigned char, and this discrepancy has gone unnoticed, so maybe it does not matter? (This effect is only visible on char >=128, so US-ASCII is fine, but any utf8 character would trigger this).

This revision is now accepted and ready to land.Feb 14 2018, 3:49 AM

Jonas, Adrian: Do you have any thoughts on how to handle the sign extension issue on the existing djb hash function? (I could theoretically punt that and just fix the case-folding one, but it would be simpler to handle both of them together).

ARM and x86 implement different chars, don't they?

aprantl added inline comments.Feb 21 2018, 12:35 PM
llvm/trunk/lib/Support/DJB.cpp
26–27

This is only used on apple platforms, and I expect most of apple code to be compiled on x86 hosts (right?)

I think that's the key, yeah. Ever since the Intel transition it is safe to assume that pretty much all code that runs on Apple platforms was compiled on some form of an x86 machine, so we should fix the signedness to hardcode the x86 behavior and call it signedDjbHash and say that it's used for the apple accelerator tables.

But now it gets weird. I just looked at the LLDB source code and it seems to implement the DJB function with an unsigned char!?

aprantl added a subscriber: friss.Feb 21 2018, 12:36 PM

As an unrelated side-note: I recently discovered that there is yet another DJB implementation in StringExtras (http://llvm.org/doxygen/StringExtras_8h_source.html#l00051) that we probably should get rid of.

aprantl added inline comments.Feb 21 2018, 12:42 PM
llvm/trunk/lib/Support/DJB.cpp
26–27

Well, since it only affects characters outside of the lower 7 bit ASCII, I think it would be fine to just fix the LLVM implementation to match the one in LLDB and DWARF. It looks like this would have never worked for non-ASCII strings in the past and fixing the implementation will make it work from here on forth.

labath added inline comments.Feb 21 2018, 2:03 PM
llvm/trunk/lib/Support/DJB.cpp
26–27

Thanks for the feedback. I'm going to resubmit the patch with the unsigned behavior hard-coded.

This revision was automatically updated to reflect the committed changes.