This commit also adds a script to compute sha256 hashes of llvm checkouts.
Details
Diff Detail
- Build Status
Buildable 10130 Build 10130: arc lint + arc unit
Event Timeline
Would this be all obsolete with git repos?
Also all the python code seems complicated to me for what is basically find | grep -v '/\.git/' | grep -v '/\.svn/' | LC_ALL=C sort | tar -cf - -T - --no-recursion | sha1sum, can you elaborate why this is needed?
I would think so.
Also all the python code seems complicated to me for what is basically find | grep -v '/\.git/' | grep -v '/\.svn/' | LC_ALL=C sort | tar -cf - -T - --no-recursion | sha1sum, can you elaborate why this is needed?
That's roughly what I started with, but's it's more complicated if you want to compute separate checksums for projects inside single tree source checkout (i.e. llvm, llvm/tool/clang, llvm/projects/lldb, etc.).
The use-case is rather bizzare:
- You continuously import and review commits from svn repo for a set a llvm projects (e.g., llvm, clang, lldb). You use single tree source checkout for that.
- You want to build docker images for a subset of those projects (e.g., only for llvm and clang). However, you want to checkout source code from official svn repo, not your mirror.
- You want to make sure the code you checkout matches the code you reviewed in step 1.
There is also a problem that some files inside LLVM repo use svn substitutions. The substitutions are mostly fine, but $Date$ and $LastChangedDate$ are locale-specific and we have to account for that.
I totally agree with you, though, the code is quite complicated. Wish it was simpler.
Well first using a flat layout makes it easier: checkout clang next to llvm instead of inside llvm/tools (and use -DLLVM_ENABLE_PROJECTS).
Then it is a matter of wrapping the command I gave in a loop around the directories and producing one entry per project.
There is also a problem that some files inside LLVM repo use svn substitutions. The substitutions are mostly fine, but $Date$ and $LastChangedDate$ are locale-specific and we have to account for that.
OK that's terrible :(
I totally agree with you, though, the code is quite complicated. Wish it was simpler.
Yeah, like just using git ;)
Anyway I'm fine with adding this, but I won't have time to review deeply the python code right now.
utils/docker/scripts/llvm_checksum/llvm_checksum_utils.py | ||
---|---|---|
1 ↗ | (On Diff #112515) | Why 2 files? (generally, I dislike files named "util" :) |
27–28 ↗ | (On Diff #112515) | Why's the algo not part of the content_hasher? |
48 ↗ | (On Diff #112515) | Call it process_file? (proc_file sounds like a file in /proc :P) |
82–83 ↗ | (On Diff #112515) | Can't we put that in the same line? |
122–123 ↗ | (On Diff #112515) | Given that we feed in paths, this seems redundant? |
131–133 ↗ | (On Diff #112515) | Why not just feed the content (file content or symlink target)? |
Addressed review comments.
- Renamed proc_file to process_file.
- Don't feed number of files to the hasher, it's redundant.
- Removed next_dirs local var.
utils/docker/scripts/llvm_checksum/llvm_checksum_utils.py | ||
---|---|---|
1 ↗ | (On Diff #112515) | Totally agree, util is a bad choice. |
27–28 ↗ | (On Diff #112515) |
It also seems nice to have the checksumming algorithm name as part of the function interface. |
48 ↗ | (On Diff #112515) | All short names are already taken by Linux :-( |
122–123 ↗ | (On Diff #112515) | Totally agree, removed it. |
131–133 ↗ | (On Diff #112515) | To distinguish between a broken symlink and a file with content equivalent to the broken symlink's target. |
utils/docker/scripts/llvm_checksum/llvm_checksum_utils.py | ||
---|---|---|
27–28 ↗ | (On Diff #112515) | I think my main concern is that we're mixing levels of abstraction; generally, we have a strategy how to do the hashing, and we have an algorithm that does the visitation (giving project structure, whitelist, etc). I'd probably just do a visit_project function that takes a class with perhaps 2 functions: Then we'd have a Checksum class that implements these 2 functions, has a hasher member and does the hashing in the visitation, cutting at the responsibilities. That way, we also could split this up into multiple files more easily: |
131–133 ↗ | (On Diff #112515) | Why do we care? |
utils/docker/scripts/llvm_checksum/llvm_checksum_utils.py | ||
---|---|---|
27–28 ↗ | (On Diff #112515) | Good point, will do. |
131–133 ↗ | (On Diff #112515) | Frankly, we don't. Broken symlinks and files are different beasts, but for our use-case this is probably irrelevant. |
Followed @klimek's suggestions and split the checksumming and project walking logic.
Also simplified the logic of dealing with broken symlinks - now we simply hash the target of the symlink, therefore it's indistinguishable from file with the same content. This makes the code a little simpler.
utils/docker/scripts/llvm_checksum/llvm_checksum.py | ||
---|---|---|
86 | Substituted with a name mentioning less substitutions. | |
114 | Yes, makes debugging when checksums don't match easier. An alternative I was thinking about is to just feed all files to a single hasher, but that would probably allow to easily craft two different directory trees with the same hashes. |
Are the checksums intended to be valid across multiple platforms? I recall that @zturner noticed that we use SVN in weird ways to munge newlines for some files on Windows.
Currently the checksums are only intended to match on Linux. I haven't checked, but some files will definitely have different line-endings on Windows.
Substitute subsitutions for substitutions.