Moving to GitHub - Unified Proposal
ClosedPublic

Authored by mehdi_amini on Sep 1 2016, 4:19 PM.

Details

Summary

This document described the proposal to move to GitHub, and includes the two proposals side-by-side with a comparison between the two.
It also goes through various workflow examples, presenting the current set of commands following by the ones involved in each of the two proposals.

It is intended to supersede the previous "submodule proposal" document entirely, and drive the design of a survey addressed to the community.

Diff Detail

Repository
rL LLVM
There are a very large number of changes, so older changes are hidden. Show Older Changes
beanz added inline comments.Sep 30 2016, 2:30 PM
docs/Proposals/GitHubMove.rst
207 ↗(On Diff #73090)

How does the mono-repo do this? It might make it easier, but since it is likely that even with a mono-repo most people won't build all projects I don't think it actually encourages updates across all sub-projects.

214 ↗(On Diff #73090)

You still haven't addressed the feedback here. Saying the multi-repo would lose history is still inaccurate.

For starters, you're not actually deleting the history from the repository you're moving code from. Also with a multi-repo you can easily preserve the file history by using git filter-branch. Using filter-branch will not follow history across renames that are outside the filter, but will follow them within the filter.

For example if you were to use filter branch on lib/Support to break it out into its own repository, filter branch would preserve history of files under lib/Support that are renamed as long as they remain under libSupport. It would not preserve the history of a file being renamed and moved under libSupport. Even with that the history before that point *is* traceable because the history would still exist in the old repository, so you are not losing history, you just aren't moving it with the file.

mehdi_amini marked 27 inline comments as done.

Address review.

docs/Proposals/GitHubMove.rst
207 ↗(On Diff #73090)

I was thinking about the fact that if I change the API createTargetMachineFromTriple(), and git grep to find the uses, then all the uses in sub-projects will show up.

214 ↗(On Diff #73090)

Fair enough: replaced "losing history" with "the history of the refactored code won't be available from the new place".

222 ↗(On Diff #73090)

Can you clarify what you're referring to exactly? (No regression compared to now I believe)

334 ↗(On Diff #73090)

I was referring to:

svn co http://llvm.org/svn/llvm-project/ --depth=immediates
cd llvm-project/
svn up llvm/trunk clang/trunk libcxx/trunk

You can then have a build with only LLVM configured like:

mkdir ../build-llvm && cd ../build-llvm
cmake ../llvm-project/llvm/trunk

And a build dir with llvm+clang:

mkdir ../build-clang && cd ../build-clang
cmake ../llvm-project/llvm/trunk -DLLVM_EXTERNAL_CLANG_DIR=../llvm-project/clang/trunk/

So that a single svn up $projects in the source directory update all the sources and you can still build a subset of the projects from these sources.

This is also how I'd synchronize if I was integrating downstream from SVN.

445 ↗(On Diff #73090)

This applies to both proposals right? Where do you want me to add this?

461 ↗(On Diff #73090)

(Copy/pasted commands above)

461 ↗(On Diff #73090)

Copy/pasted above (I'm not sure I really want to document it on llvm.org now).

471 ↗(On Diff #73090)

I don't believe so, but if you insist...

584 ↗(On Diff #73090)

(I'm waiting for the story to support this above)

589 ↗(On Diff #73090)

(I'm waiting for the story to support this above)

601 ↗(On Diff #73090)

"The same way" implies "a single SVN revision number to me". One could even say "a single SVN checkout" (cf the command I copy/pasted above).
I don't see how it'd work with the multi-repo?
How would someone downstream integrating from SVN be able to correlate revision across repositories?

622 ↗(On Diff #73090)

The paragraph *starts* with " Using a script that rewrites history" and end with "changes the fork's commit hashes", it seems to me that this makes explicit that the downside of rewriting history is that the hashes change.
(I'm not sure how "rewriting history" is a downside by itself otherwise)

570 ↗(On Diff #70272)

(I'm waiting for the story to support this above)

249 ↗(On Diff #70099)

It seems to me that at the beginning the idea was that the submodules would be updated every few minutes, *so that* we'd be able to have rev-locked commits pushed to multiple projects at the same time and have them appear a single umbrella update (with somehow a heuristic like "update the submodules when there hasn't been a push for 2 min").

Apparently your idea is rather than we should update it with single commits, but what's the story for rev-locked?
How would the tooling not have a race condition? Example:

  1. I commit to LLVM
  2. I commit to Clang
  3. the script runs, pull LLVM, no change
  4. I push to LLVM
  5. I push to Clang
  6. the script pulls Clang, see my commit
  7. the script is done with pulling and update the submodule with the clang change, *before* the LLVM change, even though the commit date would be reversed.

I don't see a principled solution to implement the umbrella without server-side (i.e. native git hook) support. Sure you can craft it, and it'll work fine most of the time, but that does not make it bulletproof.

beanz added inline comments.Sep 30 2016, 4:57 PM
docs/Proposals/GitHubMove.rst
207 ↗(On Diff #73090)

That is 'making easier' not 'encouraging'. Personally I fall to 'grep' way before I fall to 'git grep' for things like this, and I don't think the monorepo has any enforcement of this.

214 ↗(On Diff #73090)

In your example of moving clang-tools-extra there would be *no* need for loss of history at all. There is no need for filter-branch. You can literally reformat clang-tools-extra to be under tools/extra/ and merge the whole tree into the clang master branch.

The only point where you would lose any history at all is if you were trimming one part of a repository into another repository, and even in that situation you can minimize the losses pretty well using filter-branch and index scripts. It is complicated but possible.

222 ↗(On Diff #73090)

Ah. I misread. I see what you are saying. This is fine.

334 ↗(On Diff #73090)

I can't imagine that is a common workflow. It certainly isn't the documented recommended workflow on llvm.org, so I'm not sure there is value in bringing it into the discussion.

350 ↗(On Diff #73090)

Can you add per-project sizes?

445 ↗(On Diff #73090)

I think it is worth noting under the multi-repo proposal something along the lines of:

Because we will be maintaining a linear history you can perform a timestamp based checkout of each project repository with the following command:

git checkout 'master@{...}'

Additionally you can use the umbrella repository...

If you want to also add the timestamp checkout to the mono-repo proposal, that makes sense too. I just think it is worth noting under the multi-repo proposal that timestamp based checkouts are expected to work due to the linear history requirement, which means you don't need the submodule repo.

461 ↗(On Diff #73090)

Fine if you don't want to document it, but I certainly would not describe that as "easy". Especially because if you ever mix up and type "svn up" in the root it starts updating *everything*. I think this is an incredibly fragile workflow, which is probably why it is also incredibly uncommon.

584 ↗(On Diff #73090)

Again, above.

601 ↗(On Diff #73090)

Maybe rather than "the same way" "with similar workflows to today"?

622 ↗(On Diff #73090)

Fine.

570 ↗(On Diff #70272)

See above.

249 ↗(On Diff #70099)

The automation will run. It will collect a list of commits that have been pushed to each repository since the last time the script ran. It will then sort them by committer timestamp order, and commit one at a time to the umbrella repo as submodule updates.

We can setup the automation to run based on GitHub WebHooks, and periodically in case a WebHook gets dropped.

There is no race condition that I see.

If we need to support revlocked changes, (and I'm not convinced this is the case since they are by far a minority of commits) we can support them via annotations on the commit messages. We can teach the automation to look for markers in the commit message denoting that it is revlocked to other changes, and we can have it group revlocked changes together.

There is no need for server-side hooks, and this solution would work as well as any mirroring system. I don't believe there is any need for this solution to be bulletproof, but I see no reason why it cannot be as robust as the single-project mirrors that the mono-repo proposal includes.

mehdi_amini marked 2 inline comments as done.Sep 30 2016, 5:36 PM
mehdi_amini added inline comments.
docs/Proposals/GitHubMove.rst
207 ↗(On Diff #73090)

That is 'making easier' not 'encouraging'.

"All the source is there by default" *+* "making it easier" => why I wrote "encouraging".

Personally I fall to 'grep' way before I fall to 'git grep' for things like this, and I don't think the monorepo has any enforcement of this.

Not sure why "enforcement" comes into play here?

214 ↗(On Diff #73090)

So do you have anything *concrete* that could be added here, be practical (something we'd be willing to encourage in the future), be understandable by any dev, *and* not take > 20 lines to describe?

350 ↗(On Diff #73090)

That'd make a long list, how should it be presented?

445 ↗(On Diff #73090)

Are you sure that this command does what you think it does?
If I read correctly the doc, it is looking at your *reflog*, not the history.

The right one should be something like git checkout `git rev-list -n 1 --before="2009-07-27 13:37" master`

I just think it is worth noting under the multi-repo proposal that timestamp based checkouts are expected to work due to the linear history requirement, which means you don't need the submodule repo.

OK that wasn't clear to me the first time.

601 ↗(On Diff #73090)

I'm still missing what would be similar for someone integrating multiple projects from SVN today (assuming such downstream integrator exists) with the multi-repo?

249 ↗(On Diff #70099)

The automation will run. It will collect a list of commits that have been pushed to each repository since the last time the script ran.

Atomically?

There is no race condition that I see.

Did you read my sequence 1-7 that describes an example of race?

but I see no reason why it cannot be as robust as the single-project mirrors that the mono-repo proposal includes.

Define "robust". The single-project mirrors have a very well *deterministic* algorithm to construct, and reconstruct them at will, you don't have one for the multi-repo. That's not "robust" to me.

Add mention of the ability to check out the individual repos according to a timestamp

mehdi_amini marked an inline comment as done.Oct 2 2016, 11:17 AM

Ping?

beanz added inline comments.Oct 3 2016, 10:05 AM
docs/Proposals/GitHubMove.rst
207 ↗(On Diff #73090)

"All the source is there by default"

This is what makes it easier. Your math is double counting it. I disagree with your wording here. I've told you I disagree. You can continue to disregard my feedback or you can fix it. The choice is yours.

214 ↗(On Diff #73090)

You gave an example that is factually incorrect. I'm asking you to fix it. That is concrete. In my earlier comment I told you why your example was incorrect. You can remove the example, or come up with an alternative. That is your choice. What you cannot do, is use this factually inaccurate example.

350 ↗(On Diff #73090)

However you think it is best presented. A table would seem fitting. You could put it below and have a link down to it. I think that if you're bringing size into the discussion you need to provide sufficient data.

445 ↗(On Diff #73090)

You are correct, you need to use rev-list to get the commit hash.

601 ↗(On Diff #73090)

I strongly suspect that very few users are using a single SVN checkout that contains more than one sub-project. If you discount that workflow, the workflow for interfacing using the GitHub SVN bridge is very similar whether you are using one repo or many.

Additionally, with the mono repo the combined SVN workflow is actually a lot better than with SVN today. It is way less fragile since you aren't doing sub-directory checkouts. This means you don't run the risk of inadvertently running svn up and pulling down way more than you wanted.

249 ↗(On Diff #70099)

I've updated my automation (https://github.com/llvm-beanz/llvm-submodules) to make one umbrella commit per commit to sub-project repository. This has a single commit granularity. That was the original point I was arguing. It works. It is done.

Is it perfect? No. There are a number of situations where the order of the commits to the submodule can be impacted by the order and proximity of commits to the project repositories. That is irrelevant to the point I was making. I'm more than happy to debate with you about whether or not that matters, but that is a separate issue from what I was pointing out.

Do we need to belabor this further, or will you update the document based on my feedback?

mehdi_amini added inline comments.Oct 3 2016, 11:43 AM
docs/Proposals/GitHubMove.rst
207 ↗(On Diff #73090)

"All the source is there by default"

This is what makes it easier.

Sorry, but I mentioned earlier git grep and you answered That is 'making easier'.
All the source presents by default is more than making it easier.

I disagree with your wording here. I've told you I disagree.

I strongly disagree with your disagreement here.

214 ↗(On Diff #73090)

The current spelling (Friday, 3:51pm) is: "With the multirepo, moving clang-tools-extra into clang would be more complicated than a simple git mv command, and the history of the refactored code won't be available from the new place."
I can change the example to: "Refactoring some functions from clang to make it a utility in one of the llvm/lib/Support file to share it across sub-projects wouldn't carry the history of the code in the llvm repo."

That said, I asked you on 9/9 (over 3 weeks ago) "Can you provide an example where the history of a single file *contents* can be preserved without pulling all the source repository entirely? I'd like to try it and see how git log/git blame deals with that."
You haven't been able to provide me with this. So you can claim whatever you want about "factual innacuracy", you still failed to provide counter facts to support your claim.

601 ↗(On Diff #73090)

If you discount that workflow, the workflow for interfacing using the GitHub SVN bridge is very similar whether you are using one repo or many.

"Very similar" is subjective, to me it can't be similar as long as there is no longer a single revision number.

Additionally, with the mono repo the combined SVN workflow is actually a lot better than with SVN today. It is way less fragile since you aren't doing sub-directory checkouts. This means you don't run the risk of inadvertently running svn up and pulling down way more than you wanted.

I don't understand what you mean here.

249 ↗(On Diff #70099)

You're moving goal posts. Your previous message said that there is no race, while now you're eluding it with "There are a number of situations where...".

Also you're changing the definition of the multi-repo as I was foreseeing it. I think it is worse, and if we were to adopt the multi-repo proposal, I would be totally against this.

Now, *just to please you*, because again I don't think it does any good to this proposal, I'll re-formulate making clear that:

  1. update in the multi-repo are single commits based.
  2. commits can be in different orders.
  3. it does not handle cross-project commits.
kparzysz added inline comments.Oct 3 2016, 12:41 PM
docs/Proposals/GitHubMove.rst
355 ↗(On Diff #73174)

Even with sparse checkout? Am I going to see new files in projects that were not originally included in the sparse checkout?

366 ↗(On Diff #73174)

A conflicting change would have to affect the same file. This is regardless of whether it's monorepo or multirepo. Am I missing something here?

Rebasing is always a good practice, but it's not strictly required. If there are no conflicts, the system will just add the change on top of the current ToT, even if they have not been fetched to the local repo.

jlebar added inline comments.Oct 3 2016, 1:41 PM
docs/Proposals/GitHubMove.rst
355 ↗(On Diff #73174)

What do you mean by "see"?

In order to push a commit without -f, the commit's parent commit must be the current remote head. The commits in git are unaffected by sparse checkout. So, if you have a commit you want to push, you will need to rebase it atop current remote HEAD -- you'll have to do this rebase even if you're using sparse checkouts and all of the changes between your current base revision and current remote HEAD are to subprojects that you don't have checked out.

If you don't like this, you can continue to use the single-subproject mirrors exactly as you currently do (with git-svn and everything), by changing the configs as explained elsewhere in this document. But I've been using a monorepo (http://github.com/llvm-project/llvm-project) for months now. I've pushed maybe 30 commits using my custom script (https://github.com/jlebar/llvm-repo-tools) and this necessity to rebase hasn't once been an annoyance for me.

366 ↗(On Diff #73174)

Rebasing is always a good practice, but it's not strictly required. If there are no conflicts, the system
will just add the change on top of the current ToT, even if they have not been fetched to the local
repo.

That is what git-svn will do, yes. But that's not pure git's behavior.

mehdi_amini added inline comments.Oct 3 2016, 1:52 PM
docs/Proposals/GitHubMove.rst
355 ↗(On Diff #73174)

Even with sparse checkout? Am I going to see new files in projects that were not originally included in the sparse checkout?

If you mean are you seeing them when typing ls in your terminal, then no you don't. I can add "unless you're using a sparse checkout" to make it more clear.

366 ↗(On Diff #73174)

A conflicting change would have to affect the same file. This is regardless of whether it's monorepo or multirepo. Am I missing something here?

The point was that when you run git pull --rebase, you have new changes, and even without an explicit "diff conflict" your changes that you're about to push may use an API that have changed upstream. Note today this is not addressed: SVN will blindly accept the push and break the build.

Rebasing is always a good practice, but it's not strictly required. If there are no conflicts, the system will just add the change on top of the current ToT, even if they have not been fetched to the local repo.

As Justing mentions, this is not true with git push AFAIK. You have to pull (merge or rebase) before being able to push.

beanz added a comment.Oct 3 2016, 2:02 PM

After this round of feedback I'm removing myself from this discussion.

docs/Proposals/GitHubMove.rst
207 ↗(On Diff #73090)

You asked for feedback. If you want to disregard it that is your decision.

214 ↗(On Diff #73090)

"Can you provide an example where the history of a single file *contents* can be preserved without pulling all the source repository entirely? I'd like to try it and see how git log/git blame deals with that."

git-filter-branch can preserve the history of a single file. It does not follow renames, however if you know a file was renamed, you can use git-filter-branch's --tree-filter or --index-filter flags to perform more complicated slicing of the repository to preserve that history. If you're unfamiliar with the types of things you can do with filter branch, this article gives a good overview (https://devsector.wordpress.com/2014/10/05/advanced-git-branch-filtering/).

601 ↗(On Diff #73090)

Saying the workflows is "similar" is not a subjective wording.

Today someone who writes:
svn co svn co http://llvm.org/svn/llvm-project/llvm/trunk
Under the mono-repo could write something like:
svn co http://github.com/llvm/llvm-project/master/llvm
Under the multi-repo could write something like:
svn co http://github.com/llvm/llvm/master/

The *workflow* of svn co -> svn add -> svn commit is *similar* in all cases.

249 ↗(On Diff #70099)

From the beginning I said:

It won't be perfect, but it should be good enough for sorting commits in close proximity...

If you want to debate that statement we can do so, but I would prefer not to in this thread.

Also you're changing the definition of the multi-repo as I was foreseeing it. I think it is worse, and if we were to adopt the multi-repo proposal, I would be totally against this.

You don't get to dictate how the proposal in opposition to your preferred approach is written. I think you've been pretty clear about being against the multi-repo proposal, so I don't see how your opinion factors in to the final document, which shouldn't be opinion based.

beanz removed a subscriber: beanz.Oct 3 2016, 2:02 PM
kparzysz added inline comments.Oct 3 2016, 2:02 PM
docs/Proposals/GitHubMove.rst
355 ↗(On Diff #73174)

What do you mean by "see"?

I'm referring to this (and the rest of this paragraph):
"However when you fetch you'll likely pull in changes to sub-projects you don't care about."

The intent wasn't clear---I wasn't aware of the requirement about parent commit (I use SVN for upstreaming changes). But that brings another question: what is the anticipated frequency of commits to the monorepo? My concern is that the "rebuild and retest" approach may take long enough to require another rebase...

mehdi_amini updated this revision to Diff 73339.Oct 3 2016, 2:12 PM
  • remove reference to size.
  • change the model for multi-repos: single commit based, losing cross-project commits.
  • clarify that sparse-checkout don't see changes from other projects
mehdi_amini added inline comments.Oct 3 2016, 2:16 PM
docs/Proposals/GitHubMove.rst
355 ↗(On Diff #73174)

When you commit to SVN, you add a "patch" on top of the existing codebase. Unless there is a conflict your patch will be committed.
It does not mean it will build, since someone else may just have changed an API you're using in your patch.

The new monorepo won't be different from SVN on this aspect: you have the same frequency of commits, and you can run git pull && git push which is roughly equivalent to git svn dcommit today.
The thing is that between the git pull and the git push, you can also inspect what changed since your last build/check, and decide if you need to rebuild or not.

mehdi_amini added inline comments.Oct 3 2016, 2:18 PM
docs/Proposals/GitHubMove.rst
355 ↗(On Diff #73174)

Maybe we should just remove all this paragraph, it is confusing...

mehdi_amini updated this revision to Diff 73342.Oct 3 2016, 2:20 PM

Remove confusing paragraph.

jlebar added inline comments.Oct 3 2016, 2:21 PM
docs/Proposals/GitHubMove.rst
355 ↗(On Diff #73174)

My concern is that the "rebuild and retest" approach may take long enough to require another rebase...

This isn't a function of the monorepo. You choose when to rebuild/retest, and that's orthogonal to the repository structure. If "rebuild/retest only when there were changes to files I changed" is what you want to do, you can still do that. You can ask that question of git before pushing. Or you could ask "have any of the projects I care about changed?" Or you could ask a different question. And you could ask those questions of the monorepo, or the multirepo (although it might be a bit more work in the multirepo -- I say "might" so beanz doesn't jump on me).

In this sense it's safer than SVN, which assumes that you only care about retesting if there were modifications to files you also changed.

jlebar added inline comments.Oct 3 2016, 2:23 PM
docs/Proposals/GitHubMove.rst
355 ↗(On Diff #73174)

I really think you want this paragraph, btw. This is a very common question -- it's been asked many times before. "I don't want the monorepo because it will mean I have to rebuild/retest a *lot* more than I do today." False, but we need to explain why.

mehdi_amini marked 45 inline comments as done.Oct 3 2016, 2:37 PM
mehdi_amini added inline comments.
docs/Proposals/GitHubMove.rst
249 ↗(On Diff #70099)

You don't get to dictate how ...

Sorry, you mischaracterizing my position and what I wrote, I don't appreciate this.

355 ↗(On Diff #73174)

OK, I'll try to rephrase it then.
The main point is that git pull && git push is not different from today SVN.

mehdi_amini marked an inline comment as done.Oct 3 2016, 2:38 PM
mehdi_amini updated this revision to Diff 73349.Oct 3 2016, 2:55 PM
mehdi_amini marked 13 inline comments as done.

Restore the paragraph.

dtzWill added a subscriber: dtzWill.Oct 6 2016, 7:43 AM
mehdi_amini updated this revision to Diff 73858.Oct 6 2016, 3:13 PM

Address Duncan's inline comments.

New layout attempt

I believe what Duncan is asking for is basically the same thing I (and others) have also been asking for: An explicit "not dryly-factual" section where the experts explain their various positions.

I am saddened that we won't have these sections in the document -- I think not having them does a disservice to the readers who, like you, want this material. (We've had at least one other person comment in this thread, and we've had nobody say they don't want it.) But the disagreement seems to be based on Mehdi having fundamentally different conceptions of concourse than certainly I have, and I'm not prepared to litigate the philosophy of argument just to get a section added to this document.

On the other hand, in light of the amount of abuse it seems that whoever drives this process will receive no matter what they say, I'm certainly not willing to switch places with Mehdi, and I think he deserves a heaping ton of credit for frankly superhuman positivity here (in addition to credit for doing the work itself). So if he continues to oppose this idea, I respect his decision. In that case, I think we're just going to have to write it up separately, and hope that it gets the visibility it deserves. Maybe we'll be able to get a link in this document, although if that turns into a fight, like so much else here has, I hope I'll have the self-control to turn and run in the other direction.

Try another layout: add first a description of the multirepo, then one for the monorepo, then the interleaved comparison.

ioeric added a subscriber: ioeric.Oct 12 2016, 2:30 AM
ioeric added inline comments.Oct 12 2016, 4:19 AM
docs/Proposals/GitHubMove.rst
180 ↗(On Diff #74258)

I am wondering where we are in the process now. Specifically, when would we get to this step (2.5)?

Phabricator is seeing frequent connection errors from svn server (might due to the increased number of svn connections after the recent phabricator upgrade):

svn: OPTIONS of 'http://llvm.org/svn-robots/llvm-project': could not connect to server (http://llvm.org)

This blocks syncing svn commits from time to time. I'd expect Github to be more stable.

Address Duncan's feedback

(Remove duplicated section)

Split the bullet about the overhead of the monorepo for users that care only about a single subproject.

This revision was automatically updated to reflect the committed changes.