Page MenuHomePhabricator

Random Number Generator (clang)
AbandonedPublic

Authored by yln on Apr 15 2014, 10:25 PM.

Details

Reviewers
ahomescu
jfb
rinon
Summary

Provides an abstraction for a random number generator (RNG) that produces a stream of pseudo-random numbers.
The current implementation uses C++11 facilities and is therefore not cryptographically secure.

The RNG is salted with the text of the current command line invocation.
In addition, a user may specify a seed (reproducible builds).

In clang, the seed can be set via

-frandom-seed=X

In the back end, the seed can be set via

-rng-seed=X

This is the clang part of the patch.
LLVM part: D3390

Diff Detail

Event Timeline

yln updated this revision to Unknown Object (????).Apr 16 2014, 8:24 PM

Remove unused variable, minimize diff.

yln updated this revision to Unknown Object (????).Apr 16 2014, 8:34 PM

Remove unused variable, minimize diff. (fixed context)

jfb added inline comments.Apr 17 2014, 2:29 PM
lib/Frontend/CompilerInvocation.cpp
1682 ↗(On Diff #8594)

std::accumulate instead of the loop.

I think you should also add a comment that explains why this is a desirable salt value compared to, say, timestamp. It gives a predictable outcome per clang invocation from one local build to another, but different developers will get different results if they had a different path for the code, and different clang invocations get different salt (e.g. each TU will have different salt).

Is it worth throwing in other things that differ between developers, but are the same from one build to the next for a particular developer? Say the moral equivalent of /proc/cpuinfo, some OS information, and/or the git hash of the LLVM master that LLVM was built from?

ahomescu added inline comments.Apr 17 2014, 2:59 PM
lib/Frontend/CompilerInvocation.cpp
1682 ↗(On Diff #8594)

We want the build to be repeatable over time and across machines. If two developers use the same seed, same source tree and same build options, they should get exactly the same binary. On a per-project basis, you can use the timestamp to build the seed, or any of the other things you mentioned.

jfb added inline comments.Apr 17 2014, 3:22 PM
lib/Frontend/CompilerInvocation.cpp
1682 ↗(On Diff #8594)

Using getAllArgValues doesn't meet what you want then: it'll contain pathes, and those will be different between machines.

The LLVM git source hash would meet this requirement.

yln updated this revision to Unknown Object (????).Apr 17 2014, 5:57 PM

Refine construction of salt string.

yln added inline comments.Apr 17 2014, 6:08 PM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

@ahomescu
Are we sure that we want repeatable builds across machines?

@jfb
"The LLVM git source hash would meet this requirement."
I am not sure I understand, can you elaborate?

jfb added inline comments.Apr 17 2014, 6:14 PM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

I do think you want repeatable builds across machines, otherwise one developer can't ask another developer to reproduce a bug.

For git hash I mean something like baking the result of `git rev-parse HEAD` into the build, maybe by passing it in from the command line through -DFOO=${HASH) or something.
I'm not sure git is LLVM's source of truth though, SVN still is, so the rev number would be the salt then, until that changes.

rinon added inline comments.Apr 18 2014, 11:21 AM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

Producing different random streams per- compilation unit (file) is the entire point of the salt. We originally determined that if a single file in a build leaks to the attacker, the attacker can reproduce the random stream and thus reproduce any other part of a larger build process (make, etc.). Thus, we want a unique random stream for each file.

I agree that getAllArgValues is not ideal for this, since it could contain arbitrary paths. Is there some way we can filter this out? Even simply using the input filename (without the full path) could work here. A previous version I saw filtered out the -o filepath flag, but I'm not sure this is necessarily sufficient.

I'm also not sure that adding a llvm version-specific salt is helpful here. I can see how changing other optimization decisions would affect the compiled results, but current best-practices dictate matching compiler versions already if you want to reproduce a build.

jfb added inline comments.Apr 18 2014, 11:42 AM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

git rev-parse HEAD would give you a salt that identifies the exact LLVM source base that the compiler was based on. I think it's a good thing to throw in, but it indeed doesn't give a different stream for each compilation unit. Throwing it in means that you will only get a deterministic build if both developers have the exact same LLVM source.

On the per-compilation unit side, maybe you can iterate through the command line options but only pick out non-string values? Even then I'm not sure that's good enough since the build architecture could be an enum. Thinking about it, I don't think the command line is good...

Here's another line of thought: you're trying to have unique salt per compilation unit, which is equivalent to an LLVM Module. Could you salt at Module creation time? There are things like triple, datalayout that aren't Module-specific but useful, and the Module does have a ModuleIdentifier which should be unique. You could also iterate through function names and global names and accumulate those, but you'd have to do so in a stable way.

The one issue with this idea is that it prevents one Module from kicking off multiple threads (since the RNG would be shared if that were the case) but that's what LLVM currently imposes, you'd merely be following this.

Do you think setting the salt from Module would work?

rinon added inline comments.Apr 18 2014, 11:56 AM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

@ahomescu and I were just discussing this yesterday, in fact. Yes I think this seems like a good idea. It looks to me like the ModuleID should work great as a seed. @ahomescu, @yln what do you think? I think we should go ahead and rip out the current salting and just use getModuleIdentifier().

yln added inline comments.Apr 18 2014, 2:57 PM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

ModuleID seems to be the simplest thing that fits our needs.
Maybe other reviewers can comment on jfb's/rinon's one RNG per module (or context) suggestion.

ahomescu added inline comments.Apr 18 2014, 3:02 PM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

I just checked to see what ModuleID is when compiling with clang.
$ clang -O2 -emit-llvm -S -o test2.ll /home/andrei/temp/cpp/test2.c
gives ModuleID = "/home/andrei/temp/cpp/test2.c", AKA the full path.
Using the ModuleID is still not good enough.
We should strip the path from the ID, and keep only the base file name.

rinon added inline comments.Apr 18 2014, 3:10 PM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

After more offline discussions, I think we've come up with a workable proposal. Hopefully this will fit well into the rest of the llvm infrastructure.

Since an LLVMContext is thread-specific, we could just create an RNG instance for the context when it is constructed. Then add a bit of code to LLVMContext::addModule to resalt the RNG with each added ModuleID as follows in pseudo-code:

LLVMContext() {
  rng = new RandomNumberGenerator(seed);
}

LLVMContext::addModule(M) {
  rng->seed(rng->Random() + M.getModuleIdentifier());
}

There is a small difficulty with ModuleID though. This appears to sometimes be the full path the source file, so we will need to grab just the filename out of it, instead of the full path.

Will adding an RNG instance to the context be acceptable to the llvm community at large? I've tried to stay away from adding anything to core classes, but I think this really makes the most sense.

jfb added inline comments.Apr 18 2014, 3:12 PM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

Is it also useful to mix other things into the salt, like datalayout? It's not file specific, but does provide more than just the file name.

1678 ↗(On Diff #8620)

+1 on just using basename. Note that filename isn't unique, but I don't think we can reliably determine the "full path of a file from the project's root directory", so I think it's the best we have to work with.

rinon added inline comments.Apr 22 2014, 1:01 PM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

I'm not sure it is necessary to add any other salts. We are relying on the seed to have sufficient entropy to be unpredictable. If we want to make any other aspects of the build result in differing binaries, then more salting would be required. However, as is I don't see a use case for this offhand. Maybe you've got one that we haven't considered?

jfb added inline comments.Apr 22 2014, 1:46 PM
lib/Frontend/CompilerInvocation.cpp
1678 ↗(On Diff #8620)

Sounds good to me then. I can't think of one, but figured it would be better to ask.

yln updated this revision to Diff 8788.Apr 24 2014, 1:20 AM
yln edited edge metadata.

Refactor RNG to be non-static and make it available via LLVMContext.

jfb added a comment.Apr 30 2014, 3:53 PM

Trying to wrap things up, this change LGTM if the other changes aren't contentious with other reviewers.

jfb accepted this revision.Apr 30 2014, 3:53 PM
jfb added a reviewer: jfb.
This revision is now accepted and ready to land.Apr 30 2014, 3:53 PM
yln abandoned this revision.Aug 15 2017, 4:08 PM