This is an archive of the discontinued LLVM Phabricator instance.

[scudo] PRNG makeover
ClosedPublic

Authored by cryptoad on Jul 10 2017, 1:24 PM.

Details

Summary

This follows the addition of GetRandom with D34412. We remove our
/dev/urandom code and use the new function. Additionally, change the PRNG for
a slightly faster version. One of the issues with the old code is that we have
64 full bits of randomness per "next", using only 8 of those for the Salt and
discarding the rest. So we add a cached u64 in the PRNG that can serve up to
8 u8 before having to call the "next" function again.

During some integration work, I also realized that some very early processes
(like init) do not benefit from /dev/urandom yet. So if there is no
getrandom syscall as well, we have to fallback to some sort of initialization
of the PRNG.

Now a few words on why XoRoShiRo and not something else. I have played a while
with various PRNGs on 32 & 64 bit platforms. Some results are below. LCG 32 & 64
are usually faster but produce respectively 15 & 31 bits of entropy, meaning
that to get a full 64-bit, you would need to call them several times. The simple
XorShift is fast, produces 32 bits but is mediocre with regard to PRNG test
suites, PCG is slower overall, and XoRoShiRo is faster than XorShift128+ and
produces full 64 bits.


root@tulip-chiphd:/data # ./randtest.arm
[+] starting xs32...
[?] xs32 duration: 22431833053ns
[+] starting lcg32...
[?] lcg32 duration: 14941402090ns
[+] starting pcg32...
[?] pcg32 duration: 44941973771ns
[+] starting xs128p...
[?] xs128p duration: 48889786981ns
[+] starting lcg64...
[?] lcg64 duration: 33831042391ns
[+] starting xos128p...
[?] xos128p duration: 44850878605ns

root@tulip-chiphd:/data # ./randtest.aarch64
[+] starting xs32...
[?] xs32 duration: 22425151678ns
[+] starting lcg32...
[?] lcg32 duration: 14954255257ns
[+] starting pcg32...
[?] pcg32 duration: 37346265726ns
[+] starting xs128p...
[?] xs128p duration: 22523807219ns
[+] starting lcg64...
[?] lcg64 duration: 26141304679ns
[+] starting xos128p...
[?] xos128p duration: 14937033215ns

Event Timeline

cryptoad created this revision.Jul 10 2017, 1:24 PM
alekseyshl accepted this revision.Jul 10 2017, 5:28 PM
alekseyshl added inline comments.
lib/scudo/scudo_allocator.cpp
583

This change is unrelated, please revert.

lib/scudo/scudo_utils.h
86

Does this trick really help with performance?

This revision is now accepted and ready to land.Jul 10 2017, 5:28 PM
cryptoad added inline comments.Jul 10 2017, 6:51 PM
lib/scudo/scudo_allocator.cpp
583

Ack. Slipped in from another one sorry.

lib/scudo/scudo_utils.h
86

The numbers do not differ enough to go one way or the other, but the u8 was taking up to extra 8bytes depending on architecture. It feels it should be faster, but it definitely save space.

cryptoad updated this revision to Diff 106032.Jul 11 2017, 8:27 AM
cryptoad marked 2 inline comments as done.

Addressing comment: removing an UNLIKELY that slipped through.

cryptoad added inline comments.Jul 11 2017, 9:09 AM
lib/scudo/scudo_utils.h
86

Actually scratch that. My machine was too loaded to give correct results.
The initial version appears to be faster (with the u8):

kostyak@kostyak-linux:~$ clang++ -O3 rand.cc -o rand -DWITH_CACHEDBYTESAVAILABLE
kostyak@kostyak-linux:~$ ./rand                                                 
[?] duration: 4009332558ns
kostyak@kostyak-linux:~$ clang++ -O3 rand.cc -o rand
kostyak@kostyak-linux:~$ ./rand
[?] duration: 4788913046ns

For 1<<32 iterations of getU8, and quite stable over the course of multiple runs.

I am going to reintroduce it.

cryptoad requested review of this revision.Jul 11 2017, 10:08 AM
cryptoad edited edge metadata.
cryptoad added inline comments.
lib/scudo/scudo_utils.h
86

It's actually a lot more nuanced and tricky than I was expecting.
Seeding through /dev/urandom 1<< 12 times, and iterating 1<<20 times per seeding, we get the numbers below (stable over multiple runs).
With clang, 32-bit seems equivalent, 64-bit favors the CachedBytesAvailable version.
With gcc, 32-bit favors CachedBytesAvailable, 64-bit favors the other (and overall slower for either).

kostyak@kostyak-linux:~$ clang++ -m32 -O3 rand.cc -o rand -DWITH_CACHEDBYTESAVAILABLE                                                  
kostyak@kostyak-linux:~$ ./rand                                                                                                        
[?] duration: 4814670294ns
kostyak@kostyak-linux:~$ clang++ -m32 -O3 rand.cc -o rand
kostyak@kostyak-linux:~$ ./rand
[?] duration: 4830693788ns
kostyak@kostyak-linux:~$ clang++ -O3 rand.cc -o rand -DWITH_CACHEDBYTESAVAILABLE                                                       
kostyak@kostyak-linux:~$ ./rand
[?] duration: 3115400364ns
kostyak@kostyak-linux:~$ clang++ -O3 rand.cc -o rand
kostyak@kostyak-linux:~$ ./rand
[?] duration: 4394574294ns
kostyak@kostyak-linux:~$ g++ -m32 -O3 rand.cc -o rand -DWITH_CACHEDBYTESAVAILABLE                                                                                                                                                                                               
kostyak@kostyak-linux:~$ ./rand
[?] duration: 8782558601ns
kostyak@kostyak-linux:~$ g++ -m32 -O3 rand.cc -o rand
kostyak@kostyak-linux:~$ ./rand
[?] duration: 9332069877ns
kostyak@kostyak-linux:~$ g++ -O3 rand.cc -o rand -DWITH_CACHEDBYTESAVAILABLE                                                                                                                                                                                                    
kostyak@kostyak-linux:~$ ./rand
[?] duration: 5651244009ns
kostyak@kostyak-linux:~$ g++ -O3 rand.cc -o rand
kostyak@kostyak-linux:~$ ./rand
[?] duration: 4407575998ns

Doing some ARM & Aarch64 tests additionally.
At this point I still feel reintroducing CachedBytesAvailable might provide the most benefits. LMKWYT.

alekseyshl added inline comments.Jul 11 2017, 11:05 AM
lib/scudo/scudo_utils.h
86

Given the lack of the clear performance advantage, we should go with simpler and straightforward version (CachedBytesAvailable).

cryptoad added inline comments.Jul 11 2017, 11:07 AM
lib/scudo/scudo_utils.h
86

And to make things a bit more complicated, on ARM & Aarch64 (at least with the Android toolchain), the version without CachedBytesAvailable appears faster for 32-bit, way faster for 64-bit.

cryptoad updated this revision to Diff 106073.Jul 11 2017, 11:11 AM

Reintroducing CachedBytesAvailable to assess the amount of bytes left in the
cached u64.

alekseyshl accepted this revision.Jul 11 2017, 2:41 PM
alekseyshl added inline comments.
lib/scudo/scudo_utils.h
85

sizeof(u64) -> sizeof(CachedBytes)

This revision is now accepted and ready to land.Jul 11 2017, 2:41 PM
cryptoad updated this revision to Diff 106203.Jul 12 2017, 6:59 AM

Addressing Aleksey's comment.

cryptoad closed this revision.Jul 12 2017, 8:29 AM