This is an archive of the discontinued LLVM Phabricator instance.

[libc.search] [PATCH 1/6] add wyhash as a header library
AcceptedPublic

Authored by SchrodingerZhu on Oct 21 2022, 1:27 PM.

Details

Summary

According to SMhasher, wyhash is one of fastest 64-bit hash functions that
does not have primary quality issues. This hash function is added
inorder to support string hashing used in hsearch related functions.
The code is derived from the original wyhash code released under
the public domain.

Diff Detail

Event Timeline

SchrodingerZhu created this revision.Oct 21 2022, 1:27 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptOct 21 2022, 1:27 PM
SchrodingerZhu requested review of this revision.Oct 21 2022, 1:27 PM

Phabricator supports stacks. At the top right, you will find Edit Related Revisions ... .

To whom this may concern,

I am still hoping to work with you all to hear your opinions and to see if we can finally get these changes landed or not.
In the following week, I might be occupied by some work at my school department, which means I could be slow in response.
Please do leave comments if you ever get some time to review some piece of the code --- I will be back to this ASAP.

Best,
Yifan

To whom this may concern,

I am still hoping to work with you all to hear your opinions and to see if we can finally get these changes landed or not.
In the following week, I might be occupied by some work at my school department, which means I could be slow in response.
Please do leave comments if you ever get some time to review some piece of the code --- I will be back to this ASAP.

Best,
Yifan

Sorry for the delay. I have not forgotten about this and will definitely get back to you at the end of next week. Same with your post about the tsearch related work.

I have started reviewing this patch set. It will take time but I will get back with feedback as soon as I can. Keep in mind that we are heading into the US holiday season.

I've done a general style review

libc/src/__support/wyhash.h
16

nit: "released"

24

the cpp namespace is for the CPP subdirectory. This should be internal instead.

26–27

add a comment explaining how you calculated these constants.

46–47

this should be a loop with shifts instead of memcpy, which should also allow you to avoid the endianness call below.

@michaelrj well received. Will work on it ASAP.

SchrodingerZhu updated this revision to Diff 481317.EditedDec 8 2022, 9:04 AM

This commit updates the wyhash header library.
Notice that subsequent changes in other stacked patches are yet to be uploaded.

Changes:

  • address code review
    • Use looped read with shift to avoid memcpy and endian conversion
    • Address the source of secret parameters
    • Move wyhash to internal namespace (TODO: Similar change need to be applied to other patches)
  • Sync wyhash Implementation with version 4.1
    • hash function behavor changes at the initializing and finalizing stage
    • generate new sanity test hash values with ea3b25e1aef55d90f707c3a292eeb9162e2615d8
c++
        for (size_t i = 0; i < 7; ++i) {
            std::cout << std::hex << "0x" << wyhash(data[i], strlen(data[i]), i, _wyp) << std::endl;
        }

The patch overall looks good to me, but I'd like to see some more comprehensive tests. Specifically it would be good to demonstrate the avalanche effect by having test inputs that differ by one bit but have outputs that differ by much more. Once that is done I will approve.

  • address problems mentioned reviews:
    • add tests to examine hash function has low bias in byte range
    • add tests to examine hash function has enough avalanche effect
michaelrj accepted this revision.Dec 16 2022, 10:07 AM

Looks good to me.

This revision is now accepted and ready to land.Dec 16 2022, 10:07 AM

Hi,
Before you land this, can you explain what the long term plan would be to keep this libc version in sync with the original wyhash implementation?
Thanks,
Siva Chandra

Hi,
I would like to share some ideas about the maintenance:

  • Since the property of this Wyhash implementation is proved by both SMHasher and the provided tests, we can keep the it unchanged, unless:
  • "security issue" is confirmed for this implementation such as one can prove that particular sequence of hashing leads to heavy collisions
  • or, new hash implementation is highly promising in performance and robustness

Although wyhash is also stable across micro-architectures as it does not use platform-specific instructions and it take care of the endianness, currently this hash function is only intended to be used in in-memory hashmap that only relies on the stability of hash value (given a fixed seed) at runtime.

If runtime stability is all we required, I think there should be no bit concern for updating the hash implementation or even changing the algorithm if ever needed in the future.

Best,
Yifan

I am referring to updates of the kind you have done to this patch itself, "Sync wyhash Implementation with version 4.1". What would be the strategy to take updates on a continuous basis?

Not active really here, but wanted to point out --

wyhash is not under the public domain AFAIK. The repository is here:
https://github.com/wangyi-fudan/wyhash

The license is the so-called "unlicense" which you can read about here:
https://softwareengineering.stackexchange.com/questions/147111/what-is-wrong-with-the-unlicense

In general, contributions to LLVM projects need to follow the basic rules in the developer policy, including that it is licensed under the LLVM license:
https://llvm.org/docs/DeveloperPolicy.html#copyright-license-and-patents:~:text=When%20you%20contribute%20code%20to%20the%20LLVM%20project%2C%20you%20license%20it%20under%20these%20terms.

That policy includes specific instructions for suggesting the inclusion of code under other licenses:
https://llvm.org/docs/DeveloperPolicy.html#copyright-license-and-patents:~:text=In%20certain%20circumstances,requesting%20a%20review.

Snippet for ease:

In certain circumstances, code licensed under other licenses can be added to the codebase. However, this may only be done with approval of the LLVM Foundation Board of Directors, and contributors should plan for the approval process to take at least 4-6 weeks. If you would like to contribute code under a different license, please create a Phabricator review with the code you want to contribute and email board@llvm.org requesting a review.

Hi, @chandlerc,

I am so sorry I did not notice that Unlicense is actually complicating the situation. I regard it as public domain because projects like Go includes it as a derived work without mentioning the original license. That is being said, I totally agree that one should be careful about the Unlicense after reading through the stackoverflow discussion.

I will try to reach out the board. Meanwhile, I saw previous issue discussion at WyHash repo that the author intended to set their code completely free -- assuming they are still holding this intention, I will also try to reach out them to see if they are willingly to do multi licensing.

I ping you here just because I saw that the hash function in LLVM was maintained by you. If the worst case ever happens, is it possible to port the code in llvm library into libc library and use that instead?

@sivachandra, @michaelrj, I apologize again if I have added any complication to your work. Thanks for the reviews.

Yifan