GitTorrent: a synthesis of past efforts
GitTorrent concept: torrent the pack files
The idea of applying the straight BitTorrent protocol to the pack files was the starting point for GitTorrent. However, this turns out not to be useful, because pack files are not deterministic. Only under a very strict set of precarious circumstances will any two nodes computing a pack for the same set of git objects produce the same binary content. Fluke, if you will.
Therefore, using unmodified BitTorrent to distribute a pack file or a git bundle seemed to add little; for instance, no peer could participate in the swarm - even with a complete clone of the repository - without first downloading the exact pack file that the repository was serving.
So, over a period of several months, Jonas and I revised the RFC, principally to express it in terms of stable object manifests, with the goal that nodes could participate using whatever objects they already had, however those happened to be packed locally. You can get a flavour for the exchange by glancing at the RFC source history.
The resultant RFC invents terms such as "Commit Reel", defined by a sorting algorithm for objects, similar to the order returned by:
git rev-list --date-order --objects
The above ordering is for all intents and purposes stable, with only a very minor edge case where no strict order exists.
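To give a flavour of why a stable ordering matters, here is a minimal Python sketch - not the RFC's actual wire format, and the object ids are made up - that parses the output of the rev-list command above into an ordered object list and hashes it. Any two nodes that agree on the history then agree on the manifest digest, no matter how differently their local packs are arranged.

```python
import hashlib

def commit_reel_manifest(rev_list_output):
    """Turn `git rev-list --date-order --objects` output into a stable
    manifest: the ordered object ids plus a digest identifying the reel.
    Each output line is an object id, optionally followed by a path."""
    object_ids = [line.split()[0]
                  for line in rev_list_output.splitlines() if line.strip()]
    digest = hashlib.sha1("\n".join(object_ids).encode("ascii")).hexdigest()
    return object_ids, digest

# Two nodes with the same history derive the same digest, even if
# their pack files are byte-for-byte different.
sample = (
    "a94a8fe5ccb19ba61c4c0873d391e987982fbbd3\n"
    "da39a3ee5e6b4b0d3255bfef95601890afd80709 README\n"
)
ids, digest = commit_reel_manifest(sample)
```

The digest stands in for the pack-file checksum that plain BitTorrent would have required every peer to match exactly.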
GitTorrent Summer of Code project
There is prototype code from a 2008 Google Summer of Code project. While this project was not considered successful, some key concepts can be demonstrated with it and so I will make that the starting point of the next post in this series, and use it to illustrate the design of the protocol.
One of the practical discoveries was that the code base could not quickly generate the object indexes required for efficiently answering GitTorrent messages.
Related project: git rev-cache
This project was aimed at being a generic cache for git revision tree walking. The idea is that while git's graph colouring algorithm is fast enough for the operations that matter to a user, such as good interactive performance, it is not sufficient for a GitTorrent server, or even for the 'initial git clone' case:
Computing the results involves a huge amount of pointer chasing that requires that the cache be hot. If the cache is not hot, such as on a busy server, it can take minutes just to calculate the amount of work to do.
If you want to take a large number of objects and retrieve a particular subset of them, then you have to do all of the above work.
So, the revision cache helps by keeping just the important data in a binary, sequential file: all of the important information necessary for graph traversal can be retrieved quickly and computed quickly, too. I will dedicate at least one post to this project, where I will try to merge it with the latest git and show it in action.
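To make the idea concrete, here is a minimal Python sketch - not the actual rev-cache file format - of a sequential cache: the commit graph is flattened into one binary file of fixed-layout records, so a traversal reads a single file front to back instead of chasing pointers through the object store.

```python
import os
import struct
import tempfile

def write_rev_cache(path, graph):
    """Flatten a commit graph {sha_hex: [parent_sha_hex, ...]} into one
    sequential binary file: 20-byte sha, 1-byte parent count, parents."""
    with open(path, "wb") as f:
        for sha, parents in graph.items():
            f.write(bytes.fromhex(sha))
            f.write(struct.pack("B", len(parents)))
            for parent in parents:
                f.write(bytes.fromhex(parent))

def read_rev_cache(path):
    """Rebuild the graph with a single sequential read - no random
    lookups into loose objects or packs are needed afterwards."""
    graph = {}
    with open(path, "rb") as f:
        data = f.read()
    i = 0
    while i < len(data):
        sha = data[i:i + 20].hex()
        count = data[i + 20]
        i += 21
        parents = [data[i + 20 * n:i + 20 * (n + 1)].hex()
                   for n in range(count)]
        i += 20 * count
        graph[sha] = parents
    return graph

# Round-trip demo with a tiny two-commit graph (the hex ids are made up).
graph = {"b" * 40: ["a" * 40], "a" * 40: []}
cache_path = os.path.join(tempfile.mkdtemp(), "rev-cache.bin")
write_rev_cache(cache_path, graph)
recovered = read_rev_cache(cache_path)
```

A real cache would also carry the dates and object information needed for commit-reel ordering; the point here is only that a flat, sequential layout turns graph traversal into cheap streaming reads.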
GitTorrent distilled: mirror-sync
One of the challenges with GitTorrent was the amount of infrastructure required just to get to the point where the core algorithms could be designed. Using Perl meant that off-the-shelf packages were already available for things like Bencoding, etc. - but it was still quite a drag.
After some reflection on this, and from having read the BitTorrent protocol, I decided that the BitTorrent protocol itself is all cruft and that trying to cut it down to be useful was a waste of time.
The idea of "automatic mirroring" came from this. With Automatic Mirroring, the two main functions of P2P operation - peer discovery and partial transfer - are broken into discrete features.
The design of Mirror-Sync is simple enough to be expressed on a single page, making it a vast improvement over GitTorrent already. Additionally, it would fit within the existing git protocol, allowing existing git servers to smoothly gain the benefits of peer-to-peer technology.