This is an archive of the discontinued LLVM Phabricator instance.

Added optional validation of svn sources to Dockerfiles.
ClosedPublic

Authored by ilya-biryukov on Aug 24 2017, 1:40 AM.

Details

Event Timeline

ilya-biryukov created this revision. Aug 24 2017, 1:40 AM
mehdi_amini edited edge metadata. Aug 24 2017, 10:01 PM

Would this all be obsolete with git repos?

Also, all the Python code seems complicated to me for what is basically find | grep -v '/\.git/' | grep -v '/\.svn/' | LC_ALL=C sort | tar -cf - -T - --no-recursion | sha1sum. Can you elaborate on why this is needed?

Would this all be obsolete with git repos?

I would think so.

Also, all the Python code seems complicated to me for what is basically find | grep -v '/\.git/' | grep -v '/\.svn/' | LC_ALL=C sort | tar -cf - -T - --no-recursion | sha1sum. Can you elaborate on why this is needed?

That's roughly what I started with, but it's more complicated if you want to compute separate checksums for projects inside a single-tree source checkout (i.e. llvm, llvm/tools/clang, llvm/projects/lldb, etc.).
The use case is rather bizarre:

  1. You continuously import and review commits from the SVN repo for a set of LLVM projects (e.g., llvm, clang, lldb). You use a single-tree source checkout for that.
  2. You want to build Docker images for a subset of those projects (e.g., only for llvm and clang). However, you want to check out the source code from the official SVN repo, not your mirror.
  3. You want to make sure the code you check out matches the code you reviewed in step 1.

There is also the problem that some files inside the LLVM repo use SVN keyword substitutions. The substitutions are mostly fine, but $Date$ and $LastChangedDate$ are locale-specific, and we have to account for that.
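
For illustration, a minimal Python sketch of collapsing such keyword expansions before hashing (a hypothetical helper, not the reviewed script):

import re

# Expanded SVN keywords look like "$Date: 2017-08-24 10:40:00 +0200 (...) $".
# Collapsing them back to "$Date$" keeps locale-dependent text out of the checksum.
SVN_KEYWORDS = ("Date", "LastChangedDate", "Revision", "LastChangedRevision",
                "Author", "LastChangedBy", "Id", "URL", "HeadURL")
KEYWORD_RE = re.compile(r"\$(" + "|".join(SVN_KEYWORDS) + r"):[^$\n]*\$")

def collapse_svn_substitutions(text):
    """Replace every expanded '$Keyword: ... $' with the bare '$Keyword$'."""
    return KEYWORD_RE.sub(r"$\1$", text)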

I totally agree with you, though: the code is quite complicated. I wish it were simpler.

Would this all be obsolete with git repos?

I would think so.

Also, all the Python code seems complicated to me for what is basically find | grep -v '/\.git/' | grep -v '/\.svn/' | LC_ALL=C sort | tar -cf - -T - --no-recursion | sha1sum. Can you elaborate on why this is needed?

That's roughly what I started with, but it's more complicated if you want to compute separate checksums for projects inside a single-tree source checkout (i.e. llvm, llvm/tools/clang, llvm/projects/lldb, etc.).

Well, first, using a flat layout makes it easier: check out clang next to llvm instead of inside llvm/tools (and use -DLLVM_ENABLE_PROJECTS).
Then it is a matter of wrapping the command I gave in a loop over the directories and producing one entry per project.
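
For illustration, a minimal Python sketch of that per-project loop (assumed project names, not part of the patch; written in Python rather than shell to match the reviewed scripts):

import hashlib
import os

def checksum_project(root):
    """Hash every file under `root` in a stable order, skipping VCS metadata."""
    hasher = hashlib.sha1()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = sorted(d for d in dirnames if d not in (".git", ".svn"))
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and special files for brevity
            hasher.update(os.path.relpath(path, root).encode("utf-8"))
            with open(path, "rb") as f:
                hasher.update(f.read())
    return hasher.hexdigest()

# Assumed flat checkout with the projects side by side.
for project in ("llvm", "clang"):
    print("%s  %s" % (checksum_project(project), project))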

There is also the problem that some files inside the LLVM repo use SVN keyword substitutions. The substitutions are mostly fine, but $Date$ and $LastChangedDate$ are locale-specific, and we have to account for that.

OK that's terrible :(

I totally agree with you, though: the code is quite complicated. I wish it were simpler.

Yeah, like just using git ;)

Anyway, I'm fine with adding this, but I won't have time to review the Python code deeply right now.

klimek added inline comments. Aug 30 2017, 7:03 AM
utils/docker/scripts/llvm_checksum/llvm_checksum_utils.py
1

Why 2 files? (generally, I dislike files named "util" :)

27–28

Why's the algo not part of the content_hasher?
Can't we just use a hasher that has the exact interface we need?

48

Call it process_file? (proc_file sounds like a file in /proc :P)

82–83

Can't we put that in the same line?

122–123

Given that we feed in paths, this seems redundant?

131–133

Why not just feed the content (file content or symlink target)?

ilya-biryukov marked 3 inline comments as done.

Addressed review comments.

  • Renamed proc_file to process_file.
  • Don't feed the number of files to the hasher; it's redundant.
  • Removed next_dirs local var.
utils/docker/scripts/llvm_checksum/llvm_checksum_utils.py
1

Totally agree, util is a bad choice.
One file (the util) is the source code of a library; the other is the source code of an executable.
The idea is to split the command-line argument parsing logic from the actual library code.
How about _lib instead of _util? Or do you think merging the files is better?

27–28
  • content_hasher is responsible for reading files and replacing their contents, while hash_algo is responsible for providing hash functions.
  • we don't call content_hasher for broken symlinks; we use hash_algo on the symlink target instead. It does not make sense to do any replacements there, as it's not a file.
  • it's nice that this function controls the lifetime of hash_algo() objects. Otherwise, there's a chance the client will create the hasher only once and use it for all subsequent calls.

It also seems nice to have the checksumming algorithm name as part of the function interface.
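
For illustration, a minimal Python sketch of that division of responsibilities (illustrative names and signatures, not the reviewed code):

import hashlib
import os

def hash_entry(path, content_hasher, hash_algo=hashlib.sha1):
    hasher = hash_algo()  # a fresh hasher object per call, so clients can't reuse one
    if os.path.islink(path):
        # A symlink is not a file: hash its raw target, no content rewriting.
        hasher.update(os.readlink(path).encode("utf-8"))
    else:
        hasher.update(content_hasher(path))  # content_hasher reads and rewrites the file
    return hasher.hexdigest()

def read_normalizing_line_endings(path):
    """Example content_hasher: read a file and normalize CRLF line endings."""
    with open(path, "rb") as f:
        return f.read().replace(b"\r\n", b"\n")

# digest = hash_entry("README.txt", read_normalizing_line_endings)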

48

All short names are already taken by Linux :-(
Renamed to process_file.

122–123

Totally agree, removed it.

131–133

To distinguish between a broken symlink and a file with content equivalent to the broken symlink's target.

klimek added inline comments. Aug 31 2017, 2:15 AM
utils/docker/scripts/llvm_checksum/llvm_checksum_utils.py
27–28

I think my main concern is that we're mixing levels of abstraction; generally, we have a strategy for how to do the hashing, and we have an algorithm that does the visitation (given the project structure, whitelist, etc.).

I'd probably just do a visit_project function that takes a class with perhaps 2 functions:
visit_file(path, content)
visit_symlink(path, link)

Then we'd have a Checksum class that implements these 2 functions, has a hasher member, and does the hashing during the visitation, cutting along the responsibilities.

That way, we could also split this up into multiple files more easily:
project_tree.py (perhaps with a better name :) allows visiting all files in the project.
llvm_checksum.py - can now contain the main method and the checksumming parts of the algorithm.
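
For illustration, a minimal Python sketch of that split (hypothetical, not the committed code):

import hashlib
import os

def visit_project(root, visitor):
    """Walk `root` in a stable order and hand every entry to the visitor."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                visitor.visit_symlink(path, os.readlink(path))
            else:
                with open(path, "rb") as f:
                    visitor.visit_file(path, f.read())

class Checksum(object):
    """Implements the two visit_* hooks; owns the hashing strategy."""
    def __init__(self, hash_algo=hashlib.sha1):
        self.file_hashes = {}
        self.hash_algo = hash_algo

    def visit_file(self, path, content):
        self.file_hashes[path] = self.hash_algo(content).hexdigest()

    def visit_symlink(self, path, link):
        self.file_hashes[path] = self.hash_algo(link.encode("utf-8")).hexdigest()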

131–133

Why do we care?

ilya-biryukov added inline comments. Aug 31 2017, 3:45 AM
utils/docker/scripts/llvm_checksum/llvm_checksum_utils.py
27–28

Good point, will do.

131–133

Frankly, we don't. Broken symlinks and files are different beasts, but for our use case this is probably irrelevant.
We could remove this logic altogether and make the script simpler.

  • Separated project tree walking and checksumming implementations.
ilya-biryukov added a comment. Edited Sep 12 2017, 7:53 AM

Followed @klimek's suggestions and split the checksumming and project walking logic.
Also simplified the handling of broken symlinks - now we simply hash the target of the symlink, so it's indistinguishable from a file with the same content. This makes the code a little simpler.

klimek added inline comments. Sep 14 2017, 5:34 AM
utils/docker/scripts/llvm_checksum/llvm_checksum.py
87

Substitute subsitutions for substitutions.

115

So the main reason we hash each file is for debugging purposes?

  • Rename read_replacing_substituions to read_and_collapse_svn_substitutions.
ilya-biryukov marked an inline comment as done. Sep 14 2017, 7:03 AM
ilya-biryukov added inline comments.
utils/docker/scripts/llvm_checksum/llvm_checksum.py
87

Substituted with a name mentioning fewer substitutions.

115

Yes, it makes debugging easier when checksums don't match.

An alternative I was thinking about is to just feed all files to a single hasher, but that would probably make it easy to craft two different directory trees with the same hash.
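
For illustration, a minimal Python sketch of combining per-file digests without that ambiguity (a hypothetical helper, not part of the patch):

import hashlib

def combine_file_hashes(file_hashes):
    """file_hashes: dict mapping a relative path to that file's hex digest."""
    combined = hashlib.sha1()
    for path in sorted(file_hashes):
        # One unambiguous "<digest> <path>" record per file, so file boundaries
        # cannot be shifted between trees without changing the combined digest.
        combined.update(("%s %s\n" % (file_hashes[path], path)).encode("utf-8"))
    return combined.hexdigest()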

This revision is now accepted and ready to land. Sep 15 2017, 4:59 AM
This revision was automatically updated to reflect the committed changes.

Are the checksums intended to be valid across multiple platforms? I recall that @zturner noticed that we use SVN in weird ways to munge newlines for some files on Windows.

ilya-biryukov marked an inline comment as done. Sep 18 2017, 9:27 AM

Are the checksums intended to be valid across multiple platforms? I recall that @zturner noticed that we use SVN in weird ways to munge newlines for some files on Windows.

Currently, the checksums are only intended to match on Linux. I haven't checked, but some files will definitely have different line endings on Windows.