Git is a technology I have taken a keen interest in over the years. This feed is for people primarily interested in git. Note: articles I wrote about the Perl git conversion are in the comp.perl section.
The GitTorrent Commit Reel
The commit reel is defined in section 5 of the GitTorrent RFC.
It is defined as an uncompressed stream of objects, sorted in a particular way. In practice, it is only the commit objects that are sorted, and all of the dependent objects for those commits are placed with the commit which first introduces them.
So, you start with a repository:
You sort the objects so that they are in reverse date order (tie-breaking is still required on top of git rev-list --date-order), and fetch their types and sizes, to produce the commit reel index.
| SHA1 hash | type | size | info |
|---|---|---|---|
| e951c3b45579 | blob | 971 | lib/VCS/Git/Torrent/Tracker.pm |
| 4a39b387218e | tree | 38 | lib/VCS/Git/Torrent |
| 46a6dd40761e | blob | 1797 | lib/VCS/Git/Torrent.pm |
| cb169dea8427 | tree | 72 | lib/VCS/Git |
| 6856da5de8a8 | tree | 30 | lib/VCS |
| e028c2ec652f | tree | 30 | lib |
| a8c6175cb855 | tree | 30 | |
| 6d669a0d7649 | commit | 177 | |
| d7934d77db6d | blob | 508 | lib/VCS/Git/Torrent/PWP/Message.pm |
| 831a2dce3123 | tree | 38 | lib/VCS/Git/Torrent/PWP |
| b67f62af3325 | blob | 2062 | lib/VCS/Git/Torrent/PWP.pm |
| 8e49bb567004 | tree | 102 | lib/VCS/Git/Torrent |
| d9cfbd2965e1 | tree | 72 | lib/VCS/Git |
| 760c03b92584 | tree | 30 | lib/VCS |
| 58e8231290fa | tree | 30 | lib |
| 08d6743bc1cd | tree | 30 | |
| 6e85df39b2e9 | commit | 233 | |
| ae59d4c6cdad | blob | 239 | t/91-pod-coverage.t |
| ... | | | |
| 9f21fdc6b232 | commit | 504 | |
| 7ed81b753c34 | blob | 528 | lib/VCS/Git/Torrent/Reference.pm |
| 111a3c708d42 | tree | 321 | lib/VCS/Git/Torrent |
| 32f0b74a2902 | blob | 6311 | lib/VCS/Git/Torrent.pm |
| da591fe54883 | tree | 72 | lib/VCS/Git |
| 7b702d0cf7de | tree | 30 | lib/VCS |
| 39ec1765b517 | tree | 30 | lib |
| 6e5bb34706f6 | tree | 245 | |
| 5e8f6a7807a3 | commit | 277 | |
a commit reel
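To make the index concrete, here is a minimal Python sketch (not the VCS::Git::Torrent code) that walks the first two commits of the listing above and records the running byte offset at which each commit ends. The names `REEL` and `commit_end_offsets` are made up for this illustration; the hashes and sizes are copied from the table.

```python
from itertools import accumulate

# (abbreviated SHA-1, type, size in bytes) for the first two commits of the
# listing above, in reel order: each commit follows the objects it introduces.
REEL = [
    ("e951c3b45579", "blob", 971),
    ("4a39b387218e", "tree", 38),
    ("46a6dd40761e", "blob", 1797),
    ("cb169dea8427", "tree", 72),
    ("6856da5de8a8", "tree", 30),
    ("e028c2ec652f", "tree", 30),
    ("a8c6175cb855", "tree", 30),
    ("6d669a0d7649", "commit", 177),
    ("d7934d77db6d", "blob", 508),
    ("831a2dce3123", "tree", 38),
    ("b67f62af3325", "blob", 2062),
    ("8e49bb567004", "tree", 102),
    ("d9cfbd2965e1", "tree", 72),
    ("760c03b92584", "tree", 30),
    ("58e8231290fa", "tree", 30),
    ("08d6743bc1cd", "tree", 30),
    ("6e85df39b2e9", "commit", 233),
]


def commit_end_offsets(reel):
    """Return [(sha, end_offset)] for every commit in the reel.

    end_offset is the number of uncompressed bytes on the "tape" up to and
    including that commit object.
    """
    running_totals = accumulate(size for _, _, size in reel)
    return [
        (sha, end)
        for (sha, kind, _), end in zip(reel, running_totals)
        if kind == "commit"
    ]


for sha, end in commit_end_offsets(REEL):
    print(sha, "commit", end)
```

The two offsets this prints, 3145 and 6250, are the same figures that appear next to those commits in the chunk listing further down.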
Then, you take the total size of the "tape" and divide by the number of blocks you require. Let's go with 4 for this example.
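The listing from the test commit in VCS::Git::Torrent has a total of 233141 bytes of uncompressed object data. Let's divide that into 4 segments; 233141 / 4 = 58285.25, so the boundaries fall roughly every 58285 bytes (the script output further down rounds the block size up to 58286):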
| Chunk 1 | Chunk 2 | Chunk 3 | Chunk 4 |
|---|---|---|---|
| 6d669a0d7649 commit 3145 | 53b2a50ab357 commit 64934 | c24dcdcd46de commit 158557 | 5b7e980dce4b commit 178961 |
| 6e85df39b2e9 commit 6250 | f8a02453062d commit 76844 | bffe789b4a13 commit 162339 | 6c1fd6467f49 commit 183229 |
| d16fe9b37f1c commit 7269 | 60f7c92ec68f commit 78718 | cc77ed21cf03 commit 164454 | ae4aee0f484e commit 187522 |
| b9b5df08c542 commit 10216 | 8e4c833bc0ed commit 90027 | 1dfd53badd66 commit 170494 | 69ff2248cf7f commit 191852 |
| 9f5380b003fc commit 13715 | 9595e4d0ed4a commit 99113 | 497da251f9dc commit 174642 | 40149c3f6e62 commit 199468 |
| 3d954bf97808 commit 15211 | 2499769d4e5b commit 113780 | | 93083bfcc5ee commit 202889 |
| | 2b67a6d1898a commit 116380 | | 4ff65c62c570 commit 209765 |
| | | | 76ed2bbc552c commit 214713 |
| | | | 9f21fdc6b232 commit 225327 |
| | | | 5e8f6a7807a3 commit 233141 |
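The chunk assignment can be reproduced from the commit end offsets alone. The sketch below is my reading of the rule implied by the listing, namely that a commit goes into the first block whose upper boundary is at or beyond the commit's end offset; it is an assumption, not a transcription of testpacking.pl. `COMMIT_ENDS` and `split_reel` are names invented for this example.

```python
# Commit end offsets (bytes into the uncompressed reel) from the listing above.
COMMIT_ENDS = [
    ("6d669a0d7649", 3145),
    ("6e85df39b2e9", 6250),
    ("d16fe9b37f1c", 7269),
    ("b9b5df08c542", 10216),
    ("9f5380b003fc", 13715),
    ("3d954bf97808", 15211),
    ("53b2a50ab357", 64934),
    ("f8a02453062d", 76844),
    ("60f7c92ec68f", 78718),
    ("8e4c833bc0ed", 90027),
    ("9595e4d0ed4a", 99113),
    ("2499769d4e5b", 113780),
    ("2b67a6d1898a", 116380),
    ("c24dcdcd46de", 158557),
    ("bffe789b4a13", 162339),
    ("cc77ed21cf03", 164454),
    ("1dfd53badd66", 170494),
    ("497da251f9dc", 174642),
    ("5b7e980dce4b", 178961),
    ("6c1fd6467f49", 183229),
    ("ae4aee0f484e", 187522),
    ("69ff2248cf7f", 191852),
    ("40149c3f6e62", 199468),
    ("93083bfcc5ee", 202889),
    ("4ff65c62c570", 209765),
    ("76ed2bbc552c", 214713),
    ("9f21fdc6b232", 225327),
    ("5e8f6a7807a3", 233141),
]


def split_reel(commit_ends, n_chunks):
    """Group commits into n_chunks blocks of equal nominal size.

    Assumed rule: a commit that straddles a block boundary belongs to the
    following block, because its end offset lies past the boundary.
    """
    total = commit_ends[-1][1]                    # the reel ends with a commit
    block = (total + n_chunks - 1) // n_chunks    # block size, rounded up
    chunks = [[] for _ in range(n_chunks)]
    for sha, end in commit_ends:
        index = min((end + block - 1) // block - 1, n_chunks - 1)
        chunks[index].append((sha, end))
    return block, chunks


block, chunks = split_reel(COMMIT_ENDS, 4)
print("block size:", block)
start = 0
for number, chunk in enumerate(chunks, 1):
    if not chunk:
        continue  # a block can be empty if one commit spans it entirely
    last_sha, last_end = chunk[-1]
    print(f"chunk {number}: ends at {last_sha}, {last_end - start} bytes")
    start = last_end
```

Note how uneven the real chunk sizes are (15211, 101169, 58262 and 58499 bytes): a commit is never split, so one large commit just after a boundary pushes everything into the next chunk.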
The testpacking.pl script in the VCS::Git::Torrent distribution can generate these lists and show how much bandwidth is wasted by using 4 separate packs:
$ git update-ref refs/heads/oldeg 5e8f6a7807a3
$ perl bin/testpacking.pl -n4 oldeg
Generating index...
Length is 233141, 4 blocks of 58286 each
do_pack(3d954bf97808)
Slice #0 (up to 58286): 15211 => 6554 (43%)
do_pack(2b67a6d1898a 9595e4d0ed4a --not 3d954bf97808)
Slice #1 (up to 116572): 101169 => 30035 (29%)
do_pack(497da251f9dc --not 2b67a6d1898a 9595e4d0ed4a)
Slice #2 (up to 174858): 58262 => 16951 (29%)
do_pack(5e8f6a7807a3 --not 497da251f9dc 9595e4d0ed4a)
Slice #3: 58499 => 10224 (17%)
Overall: 233141 => 63764 (27%)
vs Bundle: 233141 => 58297 (25%)
Overall inefficiency: 9%
$
So what this is saying is that our repository, originally a 58k bundle, can be split into 4 chunks, defined by the listed boundary commits. At the end, you get 4 bundles of varying sizes, with an extra 5k, or 9% of overhead (yes, these packs are thin).
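For anyone wanting to check the arithmetic, this small Python sketch recomputes the percentages from the raw and packed sizes in the output above. The `percent` helper is my own; it truncates rather than rounds, which is an assumption that happens to match the printed figures (slice #1 is 29.7%, shown as 29%).

```python
def percent(part, whole):
    """Integer percentage, truncated (not rounded) to match the output above."""
    return part * 100 // whole


raw = [15211, 101169, 58262, 58499]     # uncompressed bytes per slice
packed = [6554, 30035, 16951, 10224]    # size of each slice's pack
bundle = 58297                          # one pack covering the whole reel

per_slice = [percent(p, r) for p, r in zip(packed, raw)]
overall = percent(sum(packed), sum(raw))
overhead = percent(sum(packed) - bundle, bundle)

print("per slice:", per_slice)
print("overall:", overall, "%")
print("extra bytes:", sum(packed) - bundle)
print("inefficiency:", overhead, "%")
```

The extra 5467 bytes are the price of not being able to delta against objects that live in a later chunk.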
So that's the idea anyway. To run the above example, you can clone the github repository, and install the requisite modules via CPAN:
$ git clone git://github.com/samv/vcs-git-torrent.git \
VCS-Git-Torrent
...
$ cd VCS-Git-Torrent
$ perl Makefile.PL
...
$ make
...
$
If it complains about missing modules, install via CPAN:
$ cpan Test::Depends Bencode IO::Plumbing
...
Update: Figures for my git.git clone:
arcturus:~/src/git$ time perl ../VCS-Git-Torrent/bin/testpacking.pl -n32 master maint pu
missing fields on Reference at /usr/lib/perl5/Class/MOP/Mixin/AttributeCore.pm line 53
Generating index...
Length is 1104821033, 32 blocks of 34525658 each
do_pack(7e011c40bc6c 466fede1bdfd 76a8323ac7f5)
Slice #0 (up to 34525658): 34518888 => 1909503 (5%)
do_pack(cf1fe88ce1fb b3f041fb0f7d a9572072f0ab fdeb2fb61669 --not 7e011c40bc6c 466fede1bdfd 76a8323ac7f5)
Slice #1 (up to 69051316): 34529558 => 1417850 (4%)
do_pack(38035cf4a51c 1b83ace35e78 50b44eceed21 2326acfa95ac --not cf1fe88ce1fb b3f041fb0f7d a9572072f0ab fdeb2fb61669)
Slice #2 (up to 103576974): 34528221 => 1243468 (3%)
do_pack(c7162c1db6fe b642d9ef6433 ada5853c98c5 --not 38035cf4a51c 1b83ace35e78 f2f880f53707 50b44eceed21 2326acfa95ac cf1fe88ce1fb)
Slice #3 (up to 138102632): 34483917 => 1109044 (3%)
do_pack(f16db173a468 f25b79397c97 61ffbcb98804 8c6ab35efe63 3d234d0afacd efffea033457 53cda8d97e6e da7bad50ed08 c27d205aaefb 96bc4de85cf8 8e27364128b0 a0764cb838c2 b1e9fff7e76c 5faf64cd28bf --not c7162c1db6fe b642d9ef6433 2e1ded44f709 ada5853c98c5 cf1fe88ce1fb)
Slice #4 (up to 172628290): 34429886 => 898299 (2%)
do_pack(a2540023dcf8 3159c8dc2da4 5a03e7f25334 ab41dfbfd4f3 e4fe4b8ef7cd 9c7b0b3fc46e a06f678eb998 d0b353b1a7a2 d0c25035df48 18b0fc1ce1ef 1729fa9878ed 1f24c58724a6 f2b579256475 937a515a15f7 --not f16db173a468 f25b79397c97 61ffbcb98804 8c6ab35efe63 3d234d0afacd efffea033457 53cda8d97e6e da7bad50ed08 c27d205aaefb 96bc4de85cf8 8e27364128b0 a0764cb838c2 b1e9fff7e76c 5faf64cd28bf cf1fe88ce1fb)
...
do_pack(607a9e8aaa9b e39e0d375d1d 106a36509dc7 0e098b6d79fb 14c674e9dc52 43485d3d16e4 7a4ee28f4127 118d938812f3 cc580af88507 86386829d425 3b5ef0e216d2 36e4986f26d1 41fe87fa49cb 9e4b7ab65256 3deffc52d88d b53bb301f578 ad17f01399a9 17635fc90067 375881fa6a43 --not f8b5a8e13cb4 50ff23667020 345a38039414 3f721d1d6d6e 977e289e0d73 2ff4d1ab9ef6 69932bc6117d 1d7b1af42028 fcdd0e92d9d4 754ae192a439 3eb969973335 0cd29a037183 6e0800ef2575 df533f34a318 32d86ca53195 f0cea83f6316 4e65b538acc9 3cb1f9c98203 0eaadfe625fd cf1fe88ce1fb)
Slice #29 (up to 1035769740): 35405510 => 677049 (1%)
do_pack(609621a4ad81 eab58f1e8e5e e7e55483439b 46e09f310567 134748353b2a 500348aa6859 a4ca1465ec8a d23749fe36f1 c8998b4823cb 4d23660e79db ad3f9a71a820 b1a01e1c0762 24ab81ae4d12 c591d5f311e0 9f67d2e8279e 2aae905f23f7 a75d7b54097e 86140d56c150 9bccfcdbff3b 02edd56b84f0 204d363f5a05 7c85d2742978 a5ca8367c223 46148dd7ea41 b7b10385a84c a099469bbcf2 fe0a3cb23c79 6b87ce231d14 1ba447b8dc2e 9fa708dab1cc 1414e5788b85 aa43561ac0c1 63267de2acc1 --not 607a9e8aaa9b e39e0d375d1d 106a36509dc7 30ae47b4cc19 e9c5dcd1313d 51ea55190b6e d5f6a96fa479 0e098b6d79fb 14c674e9dc52 43485d3d16e4 7a4ee28f4127 118d938812f3 cc580af88507 86386829d425 3b5ef0e216d2 36e4986f26d1 41fe87fa49cb 9e4b7ab65256 3deffc52d88d b53bb301f578 ad17f01399a9 17635fc90067 375881fa6a43 3cb1f9c98203 cf1fe88ce1fb)
Slice #30 (up to 1070295398): 34563903 => 641336 (1%)
do_pack(8644f69753e0 --not 609621a4ad81 eab58f1e8e5e e7e55483439b d52dc4b10b2f ebc9d420566d f740cc25298e 492cf3f72f9d 46e09f310567 134748353b2a 500348aa6859 a4ca1465ec8a d23749fe36f1 c8998b4823cb 4d23660e79db ad3f9a71a820 b1a01e1c0762 24ab81ae4d12 c591d5f311e0 9f67d2e8279e 2aae905f23f7 a75d7b54097e 86140d56c150 9bccfcdbff3b 02edd56b84f0 204d363f5a05 7c85d2742978 a5ca8367c223 46148dd7ea41 b7b10385a84c a099469bbcf2 fe0a3cb23c79 6b87ce231d14 1ba447b8dc2e 9fa708dab1cc 1414e5788b85 aa43561ac0c1 63267de2acc1 17635fc90067 cf1fe88ce1fb)
Slice #31: 34528236 => 603576 (1%)
Overall: 1104821033 => 26021211 (2%)
vs Bundle: 1104821033 => 23888867 (2%)
Overall inefficiency: 8%
real 16m54.074s
user 4m30.961s
sys 11m9.302s
That's dividing the pack defined by three branches into 32 generally evenly-sized chunks. Actually the chunks at the beginning are larger than the later ones, which are all between 500kB and 950kB. While they are not perfectly sized, at least they can be generated by any node with the underlying objects, without transferring a binary pack.
However, what will matter is that execution time; the Perl prototype is needlessly inefficient. With a revision cache, we should be able to reduce that time drastically, and hopefully retrieve the boundary commits for a given range of commits and number of chunks in milliseconds. The remaining work is mostly in git pack-objects, but given that we've drastically reduced the work it has to do, the overall load on the network should not be drastically higher. And because peers can potentially trade these blocks, the workload can be spread out.
GitTorrent: a synthesis of past efforts
If you read this list post (gmane archive), then you will probably not see much new here. I include it as a backdrop for the subsequent articles.
GitTorrent concept: torrent the pack files
The idea of applying the straight BitTorrent protocol to the pack files was the starting point for GitTorrent. However, this turns out not to be useful, as the pack files are not deterministic. It is only under a very strict set of precarious circumstances that any two nodes computing a pack for a given set of git objects will produce the same binary content. Fluke, if you will.
Therefore, it seemed to add little to the idea of using unmodified BitTorrent, perhaps distributing a pack file or a git bundle; for instance, no peer could participate in the swarm - even with a complete clone of the repository - without downloading the exact pack file that the repository was serving.
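A small Python illustration of the underlying problem, using nothing but the standard library: the identity of a git object depends only on its content, but the compressed bytes that end up on disk depend on how the compressor was configured. Compression level is just one of several knobs that can differ between nodes (delta selection and object order are others), so treat this as a sketch of the principle rather than a model of git pack-objects. `git_blob_id` is a helper written for this example.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 that git assigns to a blob: a 'blob <size>' header, a NUL byte,
    then the content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Stand-in file content; any bytes would do.
content = b"package VCS::Git::Torrent;\n" * 50

# Two "nodes" compress the same object with different settings.
fast = zlib.compress(content, 1)
best = zlib.compress(content, 9)

print("object id:       ", git_blob_id(content))
print("same bytes?      ", fast == best)
print("same content?    ", zlib.decompress(fast) == zlib.decompress(best))
```

Both nodes agree on the object id and on the decompressed content, yet a BitTorrent-style hash over the compressed stream would not match, which is why the manifest has to be expressed in terms of objects rather than pack bytes.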
So, over the period of several months, Jonas and I revised the RFC, principally to express it in terms of stable object manifests, with the goal that nodes holding a complete clone could participate without downloading one exact pack file. You can get a flavour for the exchange by glancing at the RFC source history.
The resultant RFC invents terms such as "Commit Reel", defined by a sorting algorithm for objects, similar to the order returned by:
git rev-list --date-order --objects
The above ordering is for all intents and purposes stable, with only a very minor edge case where no strict order exists.
GitTorrent Summer of Code project
There is prototype code from a 2008 Google Summer of Code project. While this project was not considered successful, some key concepts can be demonstrated with it and so I will make that the starting point of the next post in this series, and use it to illustrate the design of the protocol.
One of the practical discoveries was that the code base could not quickly generate the object indexes required for efficiently answering GitTorrent messages.
Related project: git rev-cache
This project was aimed at being a generic cache for git revision tree walking. The idea is that while git's graph colouring algorithm is fast enough for most operations that are important to a user, such as good interactive performance, it is not sufficient for a gittorrent server, or even for the 'initial git clone' case:
Computing the results involves a huge amount of pointer chasing that requires that the cache be hot. If the cache is not hot, such as on a busy server, it can take minutes just to calculate the amount of work to do.
If you want to take a large number of objects and retrieve a particular sub-section of them, then you have to do all the above work.
So, the revision cache helps by keeping just the important data in a binary, sequential file: all of the important information necessary for graph traversal can be retrieved quickly and computed quickly, too. I will dedicate at least one post to this project, where I will try to merge it with the latest git and show it in action.
GitTorrent distilled: mirror-sync
One of the challenges with GitTorrent was the amount of infrastructure that was required just to get to the point where the core algorithms could be designed. By using Perl, there were already off-the-shelf packages available for things like Bencoding, etc - but it was still quite a drag.
After some reflection on this, and from having read the BitTorrent protocol, I decided that the BitTorrent protocol itself is all cruft and that trying to cut it down to be useful was a waste of time.
The idea of "automatic mirroring" came from this. With Automatic Mirroring, the two main functions of P2P operation - peer discovery and partial transfer - are broken into discrete features.
I presented this idea at GitTogether 2009, and produced a patch series called "client-side mirroring" that was intended as a step towards this goal.
The design of Mirror-Sync is simple enough to be expressed on a single page, making it a vast improvement over GitTorrent already. Additionally, it would fit within the existing git protocol, allowing existing git servers to smoothly get the benefits from peer to peer technology.
If you want to follow this series, you can subscribe to the gittorrent tag, my git section, my comp section or even my entire blog.
Subversion review
The design roots of Subversion can be traced back to the first very simplistic attempts at version control, such as SCCS and RCS. The design of it has steamrolled on from the 70's with little consideration of stable internet development methods practiced since at least the mid-eighties.
The claim is made that Subversion "just fixes CVS". And while Subversion is generally more robust and versatile than CVS, some still see it as a step backwards. Unlike CVS, SVN is hard to fix when it goes wrong - there are no user-serviceable parts inside. Branches and even tags are denied first class recognition by the system, no doubt borrowing some design from Perforce but missing the important bit that made it work (p4's integration - only now being added with "merge tracking"). CVS fixed? Hardly - CVS re-engineered as a cripple. (For a true "drop-in" replacement for CVS that fixes the most important bugs of CVS and doesn't remove features, try git-cvsserver)
Don't buy the "svn 1.5 will fix merging" snake-oil; the new design is still vastly deficient compared with the real A-class tools out there today, such as Git, Bazaar-NG and Mercurial. It might be almost as good as Perforce, hooray you've caught up to 10 years ago. That's if we ever see a release - since last November, the Subversion team have managed 6 minor releases, compared to git's 1 major release, 3 minor releases, 30 stable releases and 26 stable release candidates. There really is no comparison.
As for the speed, after using Git or Mercurial for a while, you go back to SVN and you seriously start to think it's broken or hung - then you realise no, it's just slow. Especially if you are trying to treat your code as a revision data warehouse, for techniques such as code annotation or bisection.
As far as "using HTTP infrastructure" - this is an oversold benefit - note that Subversion is actually using HTTP+WebDAV as a horrific delivery mechanism for its XML-RPC messages. There's nothing standard about it at all - and that's ignoring the fact that WebDAV required us all to upgrade our webservers. Some users were forced into an upgrade treadmill to install the specific, alpha version of Apache that was required.
By my own observation, virtually every proponent of Subversion left either has a significant stake in it, or has simply never tried any other system. They are in another world - a world where removing the ability to do sane branching, merging and tagging was construed as a feature. The net effect is that the open source community is now left with a legacy of useless history for the 5 years or so that the SVN fad has taken the world by storm. This legacy is not caused by the difficulty in conversion - not at all - but more from the dreadful development practices its idiotic design promotes. The buzz word of "commit bit" disguises a widespread practice of skimping on code review. Sure, it might be possible to figure out what the individual changes are in that repository, but who can dig them out from the mess of commits? And with sufficiently few eyeballs to review changes, all code bases are buggy.
And buggy it certainly is. Virtually every project I encountered that tried to use its API - assuming they could figure out its crazy system of batons and allocation pools and callbacks - was mired in random segfaults and difficult-to-track-down core bugs.
Subversion has already become a modern relic; it's a zombie project unable to make stable releases or effectively manage their spaghetti codebase. Abandon ship now.
(NOTE: not that I don't have some good things to say about it, see for instance use perl article on subversion, and also this section in an article I wrote )
Public Access Git Repositories
It seems that a lot of sites have cropped up that offer free Git hosting;
- First there was repo by Petr Baudis of cogito fame. A service running from Prague, based on a few simple CGIs, themselves published.
- Then I think gitorious came along, and also GitHub - both Ruby implementations and some adding services
I have a hunch that people are writing these things as they cotton on to the benefits of distributed version control, and none of the centralised sites out there (eg, SourceForge, etc) were really coming to the table quickly enough.
These sites use CTAN / CPAN rules - ie, first come, first served when it comes to project names. However, unlike CPAN, these systems will allow forks without requiring a package name change. This is an idea which was specified for Perl 6, and a problem space that I debated extensively with Mark Overmeer, the result being the early design documents for CPAN6.
It's funny how Canonical's Launchpad never really achieved this ball of motion, despite bzr being technically just as capable as git - though missing some Ferrari Features. My hunch is that it is because no version control system really got as many people excited as git did when it hit the scene.
I have managed to mirror repo to a Catalyst-hosted machine - currently browsable under http at git.utsl.gen.nz and updating every 8 hours. I hope to also talk to the other major providers of git hosting, and see if I can pull together some kind of co-ordination of effort, so that mirroring and searching these git hosting sites can be easy.
It's funny, in lots of ways I keep feeling that I'm going around in circles. The CPAN6 thing first came up ... was it two years ago? I feel so disappointed in the results of that process; however I think there were interesting lessons to be learned about avoiding the Big Design Up Front thing.
This time, it's the pragmatist's approach - get a huge amount of mirroring going, get the minimum indexing going that will allow for a peer-to-peer cloud to push to each other, and leave it at that. Based on the earlier findings, I have no reason to believe that the key goals of the CPAN6 design would not cleanly fit on top of these open Git repository mirrors, with a very thin veneer - the veneer itself not even requiring any support from the hosting providers.
The upshot of this should be vastly reduced barriers to entry of people's code into free software projects. No longer will you have to convince software authors that your feature is worthwhile - you can just make a feature branch of your own, and upload to the git hosting cloud. So, "hit and run" patches are more likely to be created, and works in progress shared more easily. Especially in light of the very interesting cross-distro effort spearheaded by madduck - vcs-pkg. Just imagine - if a mainstream distro such as FreeBSD's Ports, or Debian's source archive were to "piggy back" on one of the other sites, or more likely start their own, then it should be possible to distribute all of these different FLOSS systems using the same bandwidth and mirror space. These sites could easily let you create a fork, with a lightweight fork for just submitting a patch a very simple case of that.
This might seem like "git madness" - but to those who understand that git is a different class of software to the other VCS systems out there, this sort of thing is what the excitement was about all along.
Scathing review of Subversion on OHLOH being ++'d
Ok, so I wrote a review (see my other post Why are you still using Subversion) of Subversion which bordered on ranting. I even advertised it on #git, but relatively few people marked it as "useful". While it was sitting on "3 of 8 found this useful", I was going to delete it, but revisiting it, I found that it's been consistently getting marked as "useful" by people. It's now sitting on "7 of 13". With a few more upvotes, it could hit the front page of Subversion's user reviews! :->
Update: LOL!
More: This post on the Subversion blog makes me think they're at least catching the drift - I wrote about something like this in my svn departer's guide.
History is not linear - the case for micro-branching
During the Perl history conversion, I have found very few patches from the p5p archives that I couldn't apply. However, sometimes a pumpking will have integrated a patch which was posted relative to an older version of Perl. How am I representing that?
The patch "Forbid ++ and -- on readonly values" was relative to 5.003_08, and would not cleanly apply to any closer version than that. So, I made a new microbranch, applied the patch there, and included it as a parent to the perl-5.003_21 release. There is also a series of patches, "Fix for anon-lists with tied entries coredump" .. "Full documentation generation patch". These were all successfully applied based on messages from the p5p archives. The result is a whole bunch of changes, followed by a mega-commit which is a "merge" of the loose ends. It's also a very good representation of what really happened. You can also see little "diamonds" in the history (eg, "Re: MakeMaker and 'make uninstall'"), where one micro-branch has the version posted to p5p, and the other the version included in the final release (if they resulted in the same thing, no diamond was formed).
With SVN, how would I do all that? Well, I'd make a new branch, pulling a name out of thin air. Apply the change on that branch, then merge back to trunk using the experimental svnmerge tool, and then delete the branch. What a PITA. Well, unless git-svnserver were to do it automatically from a git master...
An introduction to git-svn for Subversion/SVK users and deserters
[note Feb 2011: this is the first version of the successful introduction to git-svn for Subversion/SVK users and deserters which saw substantial revision from this relatively unstructured rant. I found it in my blogpile and thought it might be worth preserving.]
This article is aimed at people who want to contribute to projects that are using Subversion as their code-wiki. It is particularly targeted at SVK users, who are already used to the disconnected operation work-flow.
People who are responsible for Subversion servers and are converting them to git in order to lay them down to die are advised to consider the one-off git-svnimport, which is useful for bespoke conversions where you don't necessarily want to leave SVN/CVS/etc breadcrumbs behind.
Step 1. track the upstream repository
There are lots of options here. We can just check out the head, we can do a full import from the master server, or we can convert our existing SVK mirror paths.
Quickest, Easiest - find a git-svn conversion
You can check out the entire history of the project with:
$ git-clone git://utsl.gen.nz/parrot
Initialized empty Git repository in /home/samv/tmp/parrot/.git/
remote: Generating pack...
remote: Done counting 152636 objects.
remote: Deltifying 152636 objects.
remote: 100% (152636/152636) done
Indexing 152636 objects.
remote: Total 152636, written 152636 (delta 102789), reused 152478 (delta 102789)
100% (152636/152636) done
Resolving 102789 deltas.
100% (102789/102789) done
Checking files out...
100% (2990/2990) done
Great! Didn't take long, and that was the same as the whole svk co sequence - add mirror, sync revisions, and checkout. If you are close enough on the network to utsl.gen.nz, that may have taken less than a minute. You can proceed to using your git-svn git repository, below, if you just want to play with it and not worry about the painful migration part.
Of course if svn.perl.org is down and you don't have an SVK mirror then this is your only option.
Building git
Though you don't need it to follow most of this tutorial, a version of git-core more recent than the one provided by your distribution is almost certainly going to give you fewer issues in the long run.
Get yourself a tarball release of git - repo.or.cz is probably a good place to go. git itself is fairly simple to build (apart from the docs, which have a dependency called asciidoc - but just read the .txt files under Documentation/ in lieu of getting that if you have difficulty). You'll also need to install the Subversion SWIG bindings to get git-svn to work, which of course is a world of pain that I won't go into here. You'll also want Tk installed for one of the most important GUIs.
Checking out trunk from SVN
It is probably second fastest to just check out the SVN head using git-svn; this is a bit like setting up a mirror path with svk mirror, then syncing only to the head revision using svk sync -s NNN (where NNN is the head revision, found below using svn log):
$ svn log https://svn.perl.org/parrot/trunk|head
------------------------------------------------------------------------
r17048 | bernhard | 2007-02-19 07:32:13 +1300 (Mon, 19 Feb 2007) | 3 lines
Remove the PIR.pg and bc.pg examples as they are
now covered by languages/abc and languages/PIR.
------------------------------------------------------------------------
r17047 | bernhard | 2007-02-19 07:09:00 +1300 (Mon, 19 Feb 2007) | 5 lines
[languages/PIR]
$ mkdir parrot
$ cd parrot
$ git-svn init https://svn.perl.org/parrot/trunk
Initialized empty Git repository in .git/
git-svn Using higher level of URL: https://svn.perl.org/parrot/trunk => https://svn.perl.org/parrot
$ git-svn fetch -r17048
A DEPRECATED.pod
A debian/libparrot-dev.install
A debian/parrot-doc.install
...
A examples/streams/ParrotIO.pir
A examples/streams/Include.pir
A examples/streams/Filter.pir
r17048 = a57c09abef48d73f3c74c6a307793301b5956bfd (git-svn)
Checking files out...
100% (2959/2959) done
Checked out HEAD:
https://svn.perl.org/parrot/trunk r17048
$
Well, that was almost as quick - under 2 minutes for a head checkout; it had to download about as much as a release tarball. If you like, from here you can proceed to using your git-svn git repository.
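If you would rather not eyeball the svn log output for the head revision, a few lines of Python can pull it out. This is only a convenience sketch: `head_revision` is a made-up helper, and it assumes the log header format shown in the transcript above, with the newest revision listed first.

```python
import re

# A header line of `svn log` output looks like:
#   r17048 | bernhard | 2007-02-19 07:32:13 +1300 (Mon, 19 Feb 2007) | 3 lines
HEADER = re.compile(r"^r(\d+) \| ")


def head_revision(svn_log_output: str) -> int:
    """Return the revision number from the first header line of `svn log`."""
    for line in svn_log_output.splitlines():
        match = HEADER.match(line)
        if match:
            return int(match.group(1))
    raise ValueError("no revision header found in svn log output")


sample = """\
------------------------------------------------------------------------
r17048 | bernhard | 2007-02-19 07:32:13 +1300 (Mon, 19 Feb 2007) | 3 lines
Remove the PIR.pg and bc.pg examples as they are
now covered by languages/abc and languages/PIR.
------------------------------------------------------------------------
r17047 | bernhard | 2007-02-19 07:09:00 +1300 (Mon, 19 Feb 2007) | 5 lines
"""

print("git-svn fetch -r%d" % head_revision(sample))
```

The number it finds is what goes after -r in the git-svn fetch command.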
But people who use git are used to treating their repositories as a revision data warehouse which they use to mine useful information when they are trying to understand a codebase.
We can't do that, but once your git-fu is strong, you can see it is easy to graft on the earlier history if you want to, using history rewriting. I'll briefly mention grafting (and its drawbacks) later on.
Convert your SVK depot's mirror path
So, it is better to have the complete project history converted, but you probably won't want to wait the day or two it can take to replay a moderately sized Subversion repository using SVK (can anyone mirror the 48GB KDE Subversion repository?).
The support for this isn't yet in a released git, so until it gets merged you'll need to clone git://git.bogomips.org/git-svn.git and build and install that. Look for --useSvmProps in git-svn init -h to see if your git-svn is new enough.
First, svk mi -l will tell us where the mirror paths are.
$ svk mi -l | grep parrot
/parrot/master https://svn.perl.org/parrot
$
That's everything we need to get started. Now we just need to convert /parrot/master to an SVN url; the depot is everything up to the second "/", and most SVK users will just be using a single depot with an empty name, //
$ svk depotmap -l | grep '/parrot/'
/parrot/ /home/samv/.svk/parrot
$
So, if I take the depot path and add on the rest of the mirror path, I should be able to look at the path using plain svn:
$ svn pl file:///home/samv/.svk/parrot/master
Properties on 'file:///home/samv/.svk/parrot/master':
  svm:source
  svm:uuid
  svk:merge
$ svn ls file:///home/samv/.svk/parrot/master
branches/
tags/
trunk/
$
Great! The pl (proplist) command was important - the properties there, particularly svm:source and svm:uuid, must be there for git-svn to convert this repository correctly. We use the --useSvmProps option to set up the repository rewriting:
Set up the fetch using git-svn init:
$ git-svn init -t tags -b branches -T trunk \
--useSvmProps file:///home/samv/.svk/parrot/master
Initialized empty Git repository in .git/
Using higher level of URL: file:///home/samv/.svk/parrot/master => file:///home/samv/.svk/parrot
$
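The depot-path-to-URL conversion done by hand above is mechanical enough to sketch in Python. `svk_mirror_url` is a name invented for this example, and the depot map is passed in as a plain dictionary of the kind you could build from svk depotmap -l output; paths containing spaces or other characters that need URL escaping are not handled.

```python
def svk_mirror_url(mirror_path: str, depotmap: dict) -> str:
    """Turn an SVK mirror path into a file:// URL that plain svn understands.

    The depot name is everything up to and including the second "/":
      "/parrot/master" -> depot "/parrot/", remainder "master"
      "//mirror/foo"   -> depot "//",       remainder "mirror/foo"
    """
    second_slash = mirror_path.index("/", 1)
    depot = mirror_path[: second_slash + 1]
    remainder = mirror_path[second_slash + 1 :]
    root = depotmap[depot].rstrip("/")
    return "file://" + root + ("/" + remainder if remainder else "")


# Depot map as shown by `svk depotmap -l` above.
depotmap = {"/parrot/": "/home/samv/.svk/parrot"}
print(svk_mirror_url("/parrot/master", depotmap))
```

For the mirror path above this gives the same file:///home/samv/.svk/parrot/master URL that was passed to git-svn init.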
git-svn is quite capable of tracking multiple Subversion repositories that hold mirrors of the same project, though of course probably most people actually doing that are SVK users, and the "other repository" is your local depot. The above command set up a git-svn remote with the default name of "svn". Take a look at what was configured by running cat .git/config.
$ cat .git/config
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
[svn-remote "svn"]
url = file:///home/samv/.svk/parrot
fetch = trunk:refs/remotes/trunk
branches = branches/*:refs/remotes/*
tags = tags/*:refs/remotes/tags/*
$
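To read the fetch, branches and tags lines: the left side of each colon is a path under the SVN URL and the right side is the git ref it lands on, with * standing for a single directory name. The Python below is a rough sketch of that mapping as I understand it, not git-svn's own code; `svn_path_to_ref` is a hypothetical helper, the branch and tag names in the usage lines are invented, and only a trailing * is supported.

```python
def svn_path_to_ref(svn_path, remote):
    """Map a path under the SVN URL to a git ref, or None if nothing matches."""
    for key in ("fetch", "branches", "tags"):
        spec = remote.get(key)
        if spec is None:
            continue
        left, right = spec.split(":", 1)
        if "*" not in left:
            # Exact mapping, e.g. trunk:refs/remotes/trunk
            if svn_path == left:
                return right
            continue
        # Glob mapping, e.g. branches/*:refs/remotes/*
        prefix = left[: left.index("*")]
        name = svn_path[len(prefix):]
        if svn_path.startswith(prefix) and name and "/" not in name:
            return right.replace("*", name, 1)
    return None


# The [svn-remote "svn"] section from .git/config above.
REMOTE = {
    "url": "file:///home/samv/.svk/parrot",
    "fetch": "trunk:refs/remotes/trunk",
    "branches": "branches/*:refs/remotes/*",
    "tags": "tags/*:refs/remotes/tags/*",
}

# "example-branch" and "example-tag" are made-up names for illustration.
for path in ("trunk", "branches/example-branch", "tags/example-tag"):
    print(path, "->", svn_path_to_ref(path, REMOTE))
```

So trunk ends up as refs/remotes/trunk, each branch as refs/remotes/NAME, and each tag as refs/remotes/tags/NAME.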
All look good? So,
$ git-svn fetch --repack 1000 --useSvmProps
A README
r2 = 5c2dbc76df3fc7569d0b779841427d5ddf406e9d (trunk)
M README
r3 = 9aa2f03a26ed9617cf7002bbe4acae5d3d24dadf (trunk)
...
$
So once that's all complete what did we win so far?
$ du -sk //home/samv/.svk/parrot .git
353576 //home/samv/.svk/parrot
155245 .git
$
Well, that's a bit of savings. git saved half the space compared to Subversion fsfs. But it turns out that a lot of it is just git-svn metadata. And we can compress it more; I've got CPU to burn so I ran this command:
$ git-repack -a -d -f --window 100
Generating pack...
Done counting 131402 objects.
Deltifying 131402 objects.
100% (131402/131402) done
Writing 131402 objects.
100% (131402/131402) done
Total 131402 (delta 99440), reused 31385 (delta 0)
Pack pack-079a95f55810fc1eea600bc89c911a2bf85c1add created.
$ ls -l .git/objects/pack/
total 33745
-r--r--r-- 1 samv samv 3154712 2007-02-20 16:00 pack-079a95f55810fc1eea600bc89c911a2bf85c1add.idx
-r--r--r-- 1 samv samv 31360284 2007-02-20 16:00 pack-079a95f55810fc1eea600bc89c911a2bf85c1add.pack
$
You may be wondering, "353MB of Subversion repository squeezed into 31MB of git pack? That's smaller than an SVN head checkout! Have not all the revisions been copied? Did something get missed?"
It turns out that git is just being incredibly space-efficient. More incredible stories about shrunken repositories can be found all over the internet. Talk to the GCC, Mozilla and KDE folk for the most impressive ones.
Now, in theory, we could keep using SVK to mirror revisions, and keep using git-svn fetch to copy them into the git repository. But we want some more space on our laptop to hold more MP3s, so we'll eventually delete it. Ideally we also want to convert our local branches - getting this working is still on my TODO list, but the intention is that git-svn will be extended to perform this functionality.
Converting the upstream repository from SVN, with branches
This procedure is the same as the SVK one above, but we can just use the published repository URL.
$ mkdir parrot
$ cd parrot
$ git-svn init -t tags -b branches -T trunk https://svn.perl.org/parrot
Initialized empty Git repository in .git/
$ git-svn fetch
...
I didn't test this one - I have already waited the many hours it took to sync the first time. Doing this for FAI took days. And the repository had the sheer indecency to end up tiny.
Using your git-svn checkout
What I'll do now is go through the SVK tutorials and convert them to git-svn, then introduce some examples of stuff that you can do with git that is difficult to get right using SVK. Actually why not get onto some cool stuff first.
Visualisation
This is your all-seeing eye. You can crank open gitk on it and click on commits, see their patches and the state of the tree at that point in time.
$ gitk --all
gitk does some really cool things but is most useful when looking at projects that have cottoned onto feature branches (see feature branches, below). If you're looking at a project where everyone commits largely unrelated changes to one branch it just ends up a straight line, and not very interesting.
The "depot map" is gone
So far we've got as far as the equivalent of svk mirror and svk sync. Didn't we miss svk depotmap?
Git normally stores its repository information under .git at the top level of your checkout. But everything's compressed and the filenames don't resemble the files in your checkout, so grep -r, find etc don't hate you. You can set GIT_DIR to get all the tools to look somewhere else if you really care, but for most people this system works very well. GIT_DIR doesn't work the same as SVKROOT in SVK; it's a per-checkout path, not a pointer to a central place.
I don't know about you but I was always running into situations where my ~/.svk/config didn't match reality, and there were no breadcrumbs left in the checkout to do anything with it. I much prefer these floating repositories and I hear that they have been recently added to SVK.
Making a 'local' branch
One of the nice things about git (and darcs and bzr and ...) is that making branches is simple. Say you want to take a directory and work on it somewhere else in a different direction: you can just make a copy. Contrast this with Subversion, where you have to do some crap with the branches/ paths and svn cp, svn switch, etc, and worry about whether you branch on the mirror path or the local path and what effect that would have, etc. And put up with Subversion followers saying that was natural and easy. Whatever.
$ cp -a parrot parrot.my-branch
$
Each of those copies is fully independent, as if you gave them to someone else. You can easily push and pull changes between them without tearing your hair out. But that was too slow and heavy. We want to create new branches at the drop of a hat (trust me on that for now, yer just do, OK?). Maybe you don't want to copy the actual repository, just make another checkout. We can use git-clone again;
$ git-clone -l parrot parrot.my-branch
Initialized empty Git repository in /home/samv/.svk/parrot.clone/.git/
0 blocks
Checking files out...
100% (2815/2815) done
$
The -l option to git-clone told git to hardlink the objects together, so not only are these two sharing the same repository but they can still be moved around independently. Cool.
But all that's a lot of work and most of the time I don't care to create lots of different directories for all my branches. I can just make a new branch and switch to it immediately with git-checkout:
$ git-checkout -b localbranch remotes/trunk
$
But wait, you say, don't I have to enter a commit message for this new branch?
Well, a branch in git is just a pointer to a commit. If you look at "gitk" now, you'll see a new green label on the same commit as "remotes/trunk" called "localbranch". They're like little "post-it" notes - with a new enough gitk you can pepper your history with them wherever you like with a click and then typing the name in. They generally don't form a part of the permanent history - it's the actual commits, the changes to the code, that are the history.
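If you'd like to convince yourself of the "post-it note" claim, here is a minimal sketch. It assumes a modern git (where commands are spelled "git foo" rather than "git-foo") and works in a throwaway directory; the file name, branch names and demo identity are all made up for the example.

```shell
set -e
# Throwaway identity so commits work without any global git config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk    # call the first branch "trunk"
echo hello > README
git add README
git commit -q -m "first commit"

commits_before=$(git rev-list --count --all)
git checkout -q -b localbranch            # new branch; no commit message asked for
commits_after=$(git rev-list --count --all)

trunk_id=$(git rev-parse trunk)
local_id=$(git rev-parse localbranch)

# Both names are labels stuck on the same commit.
git for-each-ref --format='%(refname:short) -> %(objectname)' refs/heads
```

Creating the branch adds a reference and nothing else: the number of commits in the repository is the same before and after.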
Making changes to your local branch
Once you have some edits you want to commit, you can use git-commit to commit them. Nothing (not even file changes) gets committed by default; you'll probably find yourself using git-commit -a to get similar semantics to svn commit.
This is because git has a powerful concept called the staging area (old name: "index", hence commands like git-update-index), which is where you can prepare your changes before you actually save the commit.
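As a rough illustration of the difference between a plain commit and commit -a, here is a sketch under the same assumptions as any scratch experiment: a modern git, a temporary directory, and invented file names (a.txt, b.txt).

```shell
set -e
# Throwaway identity so commits work without any global git config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d) && cd "$repo"
git init -q
printf 'one\n' > a.txt
printf 'one\n' > b.txt
git add a.txt b.txt
git commit -q -m "base"

# Edit both files, but stage only one of them.
printf 'two\n' >> a.txt
printf 'two\n' >> b.txt
git add a.txt
git commit -q -m "change a only"          # plain commit: only what is staged

committed=$(git diff-tree --no-commit-id --name-only -r HEAD)
pending=$(git diff --name-only)           # edits still waiting in the checkout
echo "committed: $committed / still pending: $pending"

git commit -q -a -m "change b too"        # -a sweeps up all tracked modifications
remaining=$(git status --porcelain)
```

The first commit contains only a.txt even though b.txt was also edited; the second one, with -a, behaves more like svn commit.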
$ vi CREDITS
$ git-commit -a
committed tree 6b513546099f01826c5cc7bc25042d00bc2560b0
$
Interactive commit is not really there, unless you use Cogito's cg-commit -p with this patch (which fixes the way that editing a patch for a change before it gets committed works). Normally people just use the staging area functionality, though. This is certainly one area where SVK's UI is better than git-core. But UI sophistication is usually only a temporary problem, especially for a project with as much energy pouring into it as git. Update: oh, dear, I'm just so out of touch.
Correcting changes in your local branch
Did you mess up a change? Commit something poorly? Well, no worries, there are lots of ways to fix it.
Again, we're diverging from things that SVK supports well, but I think they're important to get a taste for how things are different. According to one source, lack of support in SVK for this is a "philosophical" stance. I really don't understand this - I make mistakes all the time and it's better that I correct the ones I catch early so other people don't waste their time on them.
If it's the top commit, you can just add --amend to your regular git-commit command to, well, amend the last commit.
You can also uncommit. It's such a crude thing to do that there isn't a command for it (if you add the cogito wrappers, you can use cg-admin-uncommit).
$ git-update-ref refs/heads/localbranch HEAD~1
$
HEAD~1 is a special syntax that means "one commit before the reference called HEAD". I could have also put a complete revision number, a partial (non-ambiguous) revision number, or something like remotes/trunk. See git-rev-parse(1) for the full list of ways in which you can specify revisions.
And just like that, your most recent commit was unlinked. If it really was garbage, that was what you wanted. Actually, it isn't completely gone;
$ git-fsck
dangling commit 2ef718cf5434eeb8fdec74e69968f64fadd28761
$
If you wanted, you could see it with, eg, gitk 2ef718. I sometimes write commands like gitk --all `git-fsck | awk '/dangling commit/ {print $3}'` to see all the commits in the repository, not just the ones with "post-it notes" (aka references) stuck to them.
But that aside, uncommitting really is a primitive mode of operation, and you'd probably end up getting confused by the fact that git-update-ref didn't change the staging area. This is because git-update-ref is a plumbing command; it does one thing, and does it quickly and well. Commands like git-commit are considered porcelain - that is, designed for user interface. So, the technical name for the above dangling commit is spillage. This analogy doesn't seem to extend far enough to make git-prune (which would delete that commit) called something like git-flush or git-pull-chain, however.
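The whole uncommit dance can be tried safely in a scratch repository. This is a sketch, not gospel: it assumes a modern git, where you need fsck's --no-reflogs option to see the dangling commit because the reflog would otherwise still count as a reference to it. File and branch names are invented.

```shell
set -e
# Throwaway identity so commits work without any global git config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/localbranch
echo one > file.txt
git add file.txt
git commit -q -m "good commit"
echo two >> file.txt
git commit -q -a -m "garbage commit"
garbage=$(git rev-parse HEAD)

# Plumbing: move the branch label back one commit, and nothing else.
git update-ref refs/heads/localbranch HEAD~1

# The commit is unlinked but still in the object store.
dangling=$(git fsck --no-reflogs 2>/dev/null | awk '/dangling commit/ {print $3}')
echo "dangling: $dangling"

# The staging area was not touched, so it now differs from the new HEAD.
if git diff --cached --quiet; then staged=clean; else staged=dirty; fi
echo "staging area vs HEAD: $staged"
```

The "dirty" result is the confusing part mentioned above: the staging area still holds the contents of the commit you just unlinked.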
Git's more of a toolkit for writing VCS than a VCS in its own right, and if you're the sort of person who doesn't like too many commands, then try Cogito - its simplified feature set is much easier for beginners. However, I'm not one of those people - git is like that little engine that one day you realised you could take apart completely yourself and understand what each part does (at least in principle). But I still prefer Cogito commands much of the time.
So, anyway, there are other tools for revising commits, and the king of patch revisioning is Stacked Git.
Say I discover a change that I actually wanted to apply three commits ago. Assuming that I haven't sent the patches out yet, then I can just go ahead and change them; no-one need know. In fact I can anyway, it's just that the further back you change things the more antisocial the behaviour becomes. In this scenario, we'll assume that what I'm currently working on isn't finished, either - and I don't want to have to finish it first.
$ stg init
branch 'localbranch' initialised
$ stg new -m "WIP." new-commit
...
$ stg uncommit -n 3
...
$
Now, stg uncommit didn't do the same thing as cg-admin-uncommit; specifically, it didn't unlink any patches. They've just moved onto the patch stack, which I can jump around with using stg commands. First I'll extract the current patch with stg diff, edit it, then apply it a few revisions up.
$ stg diff -r /bottom > this_commit.patch
$ vi this_commit.patch
$ stg pop -n 2
now at patch 'do_something_interesting'
$ patch -p1 < this_commit.patch
patching file foobar.c
$ stg refresh
$ stg push -n 2
now at patch 'do_something_else_interesting'
$ stg commit
$ stg push
now at patch 'new-commit'
$ vi foo.c
$ stg refresh -e
$ stg commit
$ stg clean
No patches applied
$
But this isn't a tutorial on stacked git. See the Stacked Git homepage for that.
"Another" way to revise commits is to make a branch from the point a few commits ago, then make a new series of commits that is revised in the way that you want. This is the same scenario as before.
$ git-commit -a -m "WIP."
committed tree 5ef9339c5b5bc6572b69ff61cdb1dd4af4603f0b
$ git-checkout -b tempbranch HEAD~4
$ git-cherry-pick --no-commit -r localbranch~3
...
$ vi foobar.c
$ git-commit -a
$ git-cherry-pick -r localbranch~2
...
$ git-cherry-pick -r localbranch~1
...
$ git-cherry-pick --no-commit -r localbranch
...
$
This technique is called rebasing commits.
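A cut-down version of the same idea, small enough to run in a scratch directory, might look like the following. It is only a sketch: it assumes a modern git (so no -r flag on cherry-pick), uses two commits rather than four, and the file contents are invented.

```shell
set -e
# Throwaway identity so commits work without any global git config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/localbranch
echo base > base.txt
git add base.txt
git commit -q -m "base"
echo "int x = 1;" > foobar.c
git add foobar.c
git commit -q -m "add foobar"
echo notes > NOTES
git add NOTES
git commit -q -m "add notes"

# Start a new line of history from just before the commit to be revised.
git checkout -q -b tempbranch localbranch~2
git cherry-pick --no-commit localbranch~1   # replay "add foobar", but hold the commit
echo "int y = 2;" >> foobar.c               # the change we wish we had made back then
git commit -q -a -m "add foobar (revised)"
git cherry-pick localbranch > /dev/null     # replay the later commit unchanged

git log --format='%h %s' tempbranch
```

tempbranch ends up with the same number of commits as localbranch, but they are new commits with new IDs; localbranch itself is untouched until you decide to move or delete it.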
There are many, many ways to skin this cat. To tell the truth a lot of them don't play well together, for example you'd better remember to use "stg clean" before committing with something else. It's the old Cathedral vs. Bazaar thing. Using Git opens the door to a bazaar of VCS tools rather than sacrificing your projects at the altar of one. That said, these situations are usually easy enough to recover from in practice, especially by asking for help in #git on freenode.
Tracking updates to the upstream Subversion server
If you pulled from my source, you can fetch the latest Subversion revisions I've put there using the native git command:
$ git-fetch
...
$
This command completes very quickly even when pulling thousands of new revisions, modulo bugs for obscure corner cases like repositories with a huge number of non-overlapping revisions.
On the other hand, if you pulled from the Subversion Server - the slowest option above - or you are continuing to use SVK to do the real fetching (and have just run svk sync), you can just use:
$ git-svn fetch
...
$
If you converted the repository from your SVK depot, and you don't want to continue using SVK, then the safest thing to do is first clean out the git-svn metadata; but look out for git-svn updates that do this in a smarter way.
$ rm -r .git/svn
$ vi .git/config
$
If you copied the repository somewhere else (eg, from me) via git-clone, then you won't have any SVN metadata - just commits. In that case, you need to rebuild your SVN metadata, using the same command as in the earlier section, but with the upstream URL:
$ git-svn init -t tags -b branches -T trunk \
https://svn.perl.org/parrot
...
$
You should see a stream of messages saying "r1234 = e79d0a84830becb10f6f6d24a9e0b7e3663c2921" (etc) as git-svn scans through the commits and makes its index. After you've rebuilt the index, the above git-svn fetch command should do the trick.
Keeping your local branch up to date with Subversion updates
The recommended way to do this for people familiar with Subversion is to use git-svn rebase. You actually don't need to use git-svn fetch separately; it will automatically fetch new revisions first.
$ git-svn rebase
...
$
This command is doing something similar to the above commands that used git-cherry-pick; it's copying the changes from one point on the revision tree to another, just like svk sm -Il.
Pushing back to Subversion
The command to use is git-svn dcommit. The d stands for delta (there used to be a git-svn commit command that has since been renamed to git-svn set-tree because its behaviour was considered a little surprising for first-time users).
git-svn won't let the server merge revisions on the fly; if there were updates since you fetched / rebased, you'll have to do that again. People are not used to this, thinking somehow that if somebody commits something to file A, then somebody else commits something to file B, both changes should survive despite none of the people committing having a local copy with both changes.
Sending patches to mailing lists or RT instances
Again there are lots of ways to do this. Let's say we've made some changes and want to make patch files for all of the ones since trunk:
$ git-format-patch remotes/trunk
...
$
A command like git-log remotes/trunk..HEAD would show you the commits that this involves. You can then take those patch files and attach them to e-mails or whatever.
If the project uses the kernel patch submission policy, which strangely enough is very similar to best practices for sending patches to usenet etc since 'patch' was invented, then you probably don't want to use --attach.
If the upstream applies your patch without changes, then if you later merge, the changes shouldn't need to be re-merged. git will notice that an identical change has been applied on the other side since the "merge base" and realise it has already been done.
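You can watch this happen in a scratch repository. The sketch below assumes a modern git and stands in for the mailing list with a temporary directory of patch files; git am plays the part of the upstream maintainer applying your mail, and git cherry is one way of asking whether upstream already has an equivalent change. All names are invented.

```shell
set -e
# Throwaway identity so commits work without any global git config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk
echo base > base.txt
git add base.txt
git commit -q -m "base"

git checkout -q -b localbranch
echo fix > fix.txt
git add fix.txt
git commit -q -m "my fix"

patchdir=$(mktemp -d)
git format-patch -q -o "$patchdir" trunk    # one patch file per commit since trunk

# Meanwhile upstream moves on, then applies the patch unchanged.
git checkout -q trunk
echo other > other.txt
git add other.txt
git commit -q -m "unrelated upstream work"
git am -q "$patchdir"/*.patch

git checkout -q localbranch
cherry=$(git cherry trunk localbranch)      # "-" means upstream has an equivalent change
echo "$cherry"
git merge -q -m "merge trunk" trunk         # merges without conflicts
```

The commit on trunk has a different ID from yours (different parent, different committer details), but the change is the same, so the merge has nothing to argue about.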
The real tangible benefits of using Git
We've shown a lot of stuff so far that shows that git-svn can do everything that we expected of SVK in order to have a much better Subversion client. What else did we win?
I've already talked a bit about the fact that git is a toolkit for writing VCS systems. As a result, one huge benefit is a flexibility and wide range of tools to choose from. Writing a tool to do something that you want is often quite a simple matter of plugging together a few core commands. The git repository model is also simple enough that there are even alternate git implementations you can draw upon.
I've also talked about patch revising using stacked git, touched on rebasing, and I'm sure you can read between the lines that dropping commits is also possible.
Then there was the repository efficiency, which affects everything - the virtual memory footprint while mining information from the repository, how much data needs to be transferred during "push" and "pull" operations, and so on.
But really, what does git win you? For a start...
Publishing your changes for others to pull
You can easily publish your changes for others who are switched on to git to pull. At a stretch, you can just throw the .git directory on an HTTP server somewhere and publish the path. You don't need any silly WebDAV extensions built into the web server just to share revisions.
There are also sites like repo.or.cz which will let anyone start a new project (or publish their fork of an existing project).
There's also the git-daemon for more efficient serving of repositories (at least, in terms of network use), and gitweb.cgi to provide a visualisation of a git repository.
This means you can...
Break free from the "star" pattern
With Subversion, everyone has to commit their changes back to the central wiki, I mean repository, to share them. SVK claims to be distributed, but this is, at best, demonstrating a misunderstanding of what being distributed means. By almost all definitions SVK merely offers disconnected operation. If I meet you in the middle of a cruise and we both have a copy of a Subversion repository, I can't easily share my local branch with you if we're both on SVK. Doing this has come up on the SVK list before to a resounding "dunno, never tried, might work..."
With Git (actually this is completely true for other distributed systems), it's trivial to push and pull changes between each other. If what you're pulling has common history then git will just pull the differences.
So I'd just copy my repository to a USB key, stick it into the target machine, then run:
$ git-pull /media/usbdisk/project.git
...
$
Sure, a USB stick isn't as gimmicky as a peer to peer wireless protocol featuring autodiscovery. But frankly I'll put up with that for sane branching support in the first place.
If the person publishes their repository as described above, using the git-daemon(1), http or anything else that you can get your kernel to map to its VFS, then you can set it up as a "remote" and pull from it;
$ cat > .git/remotes/friend <<EOF
URL: file:///net/friend/git/project
Pull: refs/heads/*:refs/remotes/friend/*
EOF
$ git-fetch friend
Here we're configuring all of the heads (aka branches) of the repository which appears at /net/friend/git/project to appear as remotes/friend/XXX in our repository.
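For comparison, here is a sketch of the same setup on a modern git, where the usual route is git remote add (which stores the URL and refspec in .git/config rather than in a .git/remotes file). The "friend" repository is faked in a temporary directory, and the branch names are invented.

```shell
set -e
# Throwaway identity so commits work without any global git config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

top=$(mktemp -d)

# A stand-in for the friend's published repository, with two heads.
git init -q "$top/friend"
cd "$top/friend"
git symbolic-ref HEAD refs/heads/trunk
echo hello > README
git add README
git commit -q -m "friend's first commit"
git branch feature

# Our own repository, pointed at the friend's.
git init -q "$top/mine"
cd "$top/mine"
git remote add friend "$top/friend"
refspec=$(git config --get remote.friend.fetch)
echo "fetch refspec: $refspec"

git fetch -q friend
git for-each-ref --format='%(refname)' refs/remotes/friend
```

The refspec git writes is the same mapping as the Pull: line above, with a leading "+" that lets the remote-tracking refs be updated even when the friend has rewound a branch.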
Merging works better
$ git-merge remotes/trunk
...
$
If git's history-sensitive merging doesn't automatically resolve things like patches applied in a different order, you end up with conflicts. The local file gets conflict markers - which might sound appalling, but the "ancestor", "left" and "right" versions of the file are nearby in the staging area. I like to use ediff-merge-files-with-ancestor to merge, so my merge script handles starting this for me to make merging easy. And I don't have to worry about breaking out of a merge aborting the whole thing and throwing away work.
No doubt some will say that SVK's UI is better because it lets you make per-file decisions as you make the merge. I see that as an easily possible addition to the git-commit interface. It's just I've been quite happy to resolve using the facilities of the staging area.
When you look at the commits you make using git-merge in gitk, you'll see something interesting - the new commit has two lines coming back from it. It has two parents. It's something equivalent to SVK's merge tickets, except that the information doesn't become worthless when pushed back to a server.
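If you don't have gitk handy, the two parents are easy to see from the command line. A minimal sketch, assuming a modern git and a scratch repository with made-up file and branch names:

```shell
set -e
# Throwaway identity so commits work without any global git config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk
echo base > base.txt
git add base.txt
git commit -q -m "base"

# Let the two branches diverge.
git checkout -q -b localbranch
echo mine > mine.txt
git add mine.txt
git commit -q -m "local work"
git checkout -q trunk
echo theirs > theirs.txt
git add theirs.txt
git commit -q -m "upstream work"

git checkout -q localbranch
git merge -q -m "merge trunk into localbranch" trunk

git cat-file -p HEAD                         # raw commit: note the two "parent" lines
nparents=$(git cat-file -p HEAD | grep -c '^parent ')
second_parent=$(git rev-parse 'HEAD^2')      # the commit that was merged in
```

The merge information lives in the commit object itself, which is why it survives being pushed and pulled around.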
Feature branches - the "stable" development model
This is an interesting one. Some repositories, for instance the Linux kernel, run a policy such as no commit may break the build.
Because you can easily separate your repositories into stable branches, temporary branches, etc, then you can easily set up programs that only let commits through if they meet criteria of your choosing.
You might use a continuous integration server to check that no commit happens that breaks your build. You might say that no merge can happen unless the branch added tests, and that tests pass. You might say that commits either have to add tests or make tests pass.
Your "trunk" becomes merely a point where branches considered stable are merged into. Each of your feature branches can merge from the trunk easily, which means that an immediate merge back in the other direction will involve no actual changes (and, in fact, no extra commit will be made in such a case - the head pointer will just be moved).
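The "no extra commit" case is what git calls a fast-forward. Here is a sketch of it in a scratch repository, assuming a modern git; --ff-only is used so that the merge would refuse to run rather than quietly create a merge commit. Branch and file names are invented.

```shell
set -e
# Throwaway identity so commits work without any global git config (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk
echo base > base.txt
git add base.txt
git commit -q -m "base"

git checkout -q -b feature
echo feature > feature.txt
git add feature.txt
git commit -q -m "feature work"

git checkout -q trunk
echo stable > stable.txt
git add stable.txt
git commit -q -m "other stable work"

# The feature branch catches up with trunk: a real merge commit.
git checkout -q feature
git merge -q -m "sync feature with trunk" trunk
count_before=$(git rev-list --count --all)

# Merging straight back: trunk's pointer just moves forward.
git checkout -q trunk
git merge -q --ff-only feature
count_after=$(git rev-list --count --all)

echo "trunk:   $(git rev-parse trunk)"
echo "feature: $(git rev-parse feature)"
```

After the second merge both branch names point at the same commit, and the repository has exactly as many commits as it did before.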
Bazaar comes with some great utilities like the Patch Queue Manager which helps show you your feature branches. With PQM, you just create a branch with a description of what you're trying to do, make it work against the version that you branched off, and then you're done. The branch can be updated to reflect changes in trunk, and eventually merged and closed.
Mirroring, resilience and distribution
Your SVN server going down doesn't kill your team's group development if people use systems like repo to mirror and track each other's repositories. They just stop pushing to the published branch and push to each other for a bit.
Git's limitations
Of course if I didn't mention these then I'd have people ranting about how I was biased and partisan etc. But there are many shortcomings in git.
Not least is that it doesn't support two popular styles of developing:
Brain melt integration development model
This is where instead of merging in patches completely, you merge bits of them in on a file-by-file basis, and expect the VCS to tell you what you did.
Ghetto development model
This is where you send new features into the ghetto so that they can 'battle it out'. The last features standing get re-integrated into another branch known as the trailer park to try to find a new life for themselves.
Note that ghetto is frequently called trunk, and the trailer park something like releng.
Summary
We have the tools we need to break away from centralisation! Now, we just need to convert the 10,000 projects...
Epilogue on history rewriting
Earlier in this article I referred to history rewriting in passing. I include this as a pointer for the keen, but bear in mind that this falls into the class of "history munging", and for various reasons is best done in the privacy of an unpublished project.
Let's say that we have a branch (the current one) that contains all the patches that we want to move to a rebased history.
We manually find a common commit (possibly using gitk). Let's say it was commit 7cbf53525bc6387495edd574ecdb248e1e4f872a, which became aa3e7febb0477e15257c89126d037f6f81a7974c. You'd re-write that using the cogito command:
$ cg-admin-rewritehist -k 7cbf53 \
--parent-filter "sed -e 's/7cbf53525bc6387495edd574ecdb248e1e4f872a/aa3e7febb0477e15257c89126d037f6f81a7974c/'" \
new-branch
That's a one-line history graft. You now need to go through all of your refs that point to the old commit IDs and point them at the new ones.
Be careful with this kind of history munging; you might just end up with somebody wondering why their "git-pull" is taking so long to negotiate which commits it has and hasn't got.