So starts the svn to git migration…

by Willie Wong

For five years now I have been a happy user of svn to manage my research work, and I probably would have remained so if it weren’t for my next job favoring git instead. So in the past few weeks I have been reading up on git and in the process discovering all sorts of things that I have been doing wrong, or at least sub-optimally. So here are just some notes on what I’ve just figured out (yay slow me!).

Each paper should be a repository

Previously I kept one single giant repository for all my research work. I’ve discovered that this is not the best idea, for multiple reasons:

  • Collaboration: one of the great things about version control systems is that they make collaboration easier to manage. But your collaborators are not a static set and you probably don’t want them to peek at every one of your research ideas. The easiest way to share individual projects with only those who should be allowed to see and edit them is to have one repo for each paper. (I got away with what I did mostly because I failed to convince any of my collaborators to use a VCS beyond the built-in support in Dropbox.)
  • Organisation: to keep track of papers I have them stored in subdirectories, some of which are “stuff I am working on” and some of which are “stuff that is finished from year X” and some of which are “stuff that is being refereed”. It is a bit silly that I have to do svn mv changes to “graduate” a project from one subdirectory to the next. By keeping each paper in its own (git) repository, the local directory representation of the storage is immaterial. And this makes more sense to me.
  • (In)compatibility: here’s something that I changed my mind on. Previously I thought it a great idea to keep a single up-to-date bibtex file containing all the references that I can ever need, and a single up-to-date version of my custom LaTeX class and style files. The advantage of course is that I just need to issue one svn up to get the newest versions of everything. But the disadvantage is that when upgrading my class and style files, or when updating my bibtex files, I have to maintain backward compatibility. And when I do break the compatibility, it is then required that I keep a copy of the old versions of the files along with the LaTeX source that uses them, which, when you think about it, defeats the purpose of having a single up-to-date version in one repo completely.

So my new workflow, instead of one giant repository, is to create a repo for each paper/project. My LaTeX class and style files will themselves live in a separate git repo, which I can upgrade and develop to my heart’s desire. When I start a new paper I will simply make a copy of the current version of the files (with git archive instead of git clone because I won’t need the previous versions, nor will I want to track the changes). This also allows me to set up my “development environment” (via .gitattributes and .gitignore) quickly.
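A minimal sketch of that step, for illustration only: the repository locations and the file name mypaper.cls are made up, and a throwaway style repo is created first so the commands can be run as they stand.

```shell
set -e
# Hypothetical layout, for illustration only.
work=$(mktemp -d)
styles="$work/latex-styles"
paper="$work/new-paper"

# Stand-in for the existing class/style repository.
git init -q "$styles"
printf '%% placeholder class file\n' > "$styles/mypaper.cls"
git -C "$styles" add mypaper.cls
git -C "$styles" -c user.name=Example -c user.email=example@example.org \
    commit -q -m "Add class file"

# The new paper is a fresh repository of its own...
git init -q "$paper"

# ...seeded with a snapshot of the style files. git archive writes a tar
# stream of the files at HEAD only, so none of the style repo's history
# comes along.
git -C "$styles" archive HEAD | tar -x -f - -C "$paper"

ls "$paper"
```

The copied files are then ordinary untracked files in the new repo, to be committed there or not as one prefers.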

Keyword substitution is not necessary

For the papers I keep in my svn repo, I have been using the svn and svn-multi packages to add time-stamp and versioning information to the PDF files. Both of those packages rely on the “keyword substitution” capabilities of the svn system at commit time. Naturally, when I wanted to start using git, I looked for a replacement. The obvious one is gitinfo2. One thing I don’t like is that, unlike the keyword replacements, this package does not directly modify the source LaTeX file; instead it creates (via commit and checkout hooks) a supplementary file in the .git/ directory which it searches for and inserts when building the PDF file. This makes it a bit more of a hassle when uploading stuff to the arXiv, for example.

So I started reading up on how one can actually imitate keyword expansion using commit and checkout filters (what git calls “clean” and “smudge” filters). And I went so far as to implement something for LaTeX. And then I read the discussion by the kernel devs on this issue, and Linus Torvalds’ comments left an impression on me. In short:

  • When you are working on the code in a git repository, you don’t need this tagging since you can just “ask git”.
  • Conversely, this sort of tagging is only needed when your code is ready to leave the repository (upload to arXiv or sent to non-git-using collaborators, for example).
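To make “ask git” concrete, here is a small sketch. The throwaway repository and the file name paper.tex are made up so that the commands run as they stand; in a real paper repo only the last two commands matter.

```shell
set -e
# Throwaway repository, for illustration only.
repo=$(mktemp -d)
git init -q "$repo"
printf '\\documentclass{article}\n' > "$repo/paper.tex"
git -C "$repo" add paper.tex
git -C "$repo" -c user.name=Example -c user.email=example@example.org \
    commit -q -m "First draft"

# Abbreviated hash and date of the commit currently checked out.
stamp=$(git -C "$repo" log -1 --format='%h %ad' --date=short)
echo "$stamp"

# Lists uncommitted changes; no output means the working copy is clean.
git -C "$repo" status --porcelain
```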

So philosophically it is much less useful to have something that works on the working copy compared to something that works on an exported archive. And while git, by design, cannot and will not do keyword expansion on commits, it is perfectly happy to do keyword expansion when one exports the repo. Furthermore, since the export substitution can be essentially formatted arbitrarily, this moots the need for something like the svn or svn-multi packages to parse the string generated by the RCS: we can make the string appear how we want to start with. The only hiccup is that before the substitution (i.e. when you are working in the working copy), the syntax for the export substitution is not exactly compatible with LaTeX, and requires a little mucking about with catcodes. But with that problem solved, and with the workflow now accounting for each paper as a separate repository, for arXiv uploads the easiest thing will actually be to simply issue git archive and upload the resulting tarball.
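As a rough sketch of the export substitution (not the catcode solution itself): git expands $Format:...$ placeholders during git archive in files that carry the export-subst attribute. The repository, the file name paper.tex and the placeholder text below are made up, and the placeholder sits in a LaTeX comment line, which sidesteps the dollar-sign problem rather than solving it.

```shell
set -e
# Throwaway repository and output directory, for illustration only.
repo=$(mktemp -d)
out=$(mktemp -d)
git init -q "$repo"

# Mark the file for substitution on export. The attribute has to be
# committed, since git archive reads attributes from the exported tree.
printf 'paper.tex export-subst\n' > "$repo/.gitattributes"

# %h is the abbreviated commit hash, %ci the committer date.
cat > "$repo/paper.tex" <<'EOF'
% Revision: $Format:%h %ci$
\documentclass{article}
\begin{document}
Hello.
\end{document}
EOF

git -C "$repo" add .gitattributes paper.tex
git -C "$repo" -c user.name=Example -c user.email=example@example.org \
    commit -q -m "Add paper"

# In the working copy the placeholder is left alone.
head -n 1 "$repo/paper.tex"

# In the exported tarball it has been replaced by the commit's data.
git -C "$repo" archive --format=tar.gz --prefix=paper/ \
    -o "$out/paper.tar.gz" HEAD
tar -xzf "$out/paper.tar.gz" -C "$out"
first=$(head -n 1 "$out/paper/paper.tex")
echo "$first"
```

Any of the placeholders accepted by git log --format can be used here, which is what makes the exported string freely formattable.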