Workflow

by Willie Wong

Recently I started rethinking how I organize my incomplete and under-development notes.

I know full well the inherent dangers of such an exercise in terms of my actual productivity. But now that I have completed my newest workflow I think I’ve finally found one that works well. (Fingers crossed.)

Before I describe what I do now, I’d like to document what I used to do, what I changed the last time, and why I am changing it again.

Pre-history: me as a graduate student

When I started graduate school, a lot of journals hadn’t gone full digital. I had to make frequent trips to the library and photocopy journal articles. (That changed drastically within my four years in the PhD program.) After my first two years I had a giant box of photocopied articles, and had developed a habit of keeping paper copies of things.

(I even had the unrealistic idea of keeping a spiral notebook, carefully handwritten, of all my research ideas in general relativity that would eventually lead up to my PhD thesis. I kept that up through my third year of graduate school. I had a system where I would keep draft versions in a yellow notebook and final versions in a white notebook.)

My friend Arick Shao convinced me, at one point, that that is an extremely stupid thing to do. I consider myself rather a computer geek. Arick puts me to shame in that regard. He was the one that introduced me to a tiling window manager (he says it is great because it maximizes screen efficiency on his tiny netbook; I am now a sworn convert on my dual-screen desktop), and a hundred little software tools that have either become embedded in what I do or, in just as many cases, been discarded because they are just not all that useful. In any case, Arick showed me how he keeps all his notes digitized on his computer, and how he never has to worry about misplacing anything ever. For a third year graduate student with a mountain of notes, that was an easy sell.

Except the part where I have to digitize my existing notes.

(This was before the modern Print Stations with rapid scanning were widely in use. Scanning notes involved sitting in front of a flatbed scanner, waiting about 15 to 20 seconds for each page to be scanned, another 5 seconds for GIMP to render on screen, click “Save as…”, type in file name, rinse, and repeat. I gave up about a quarter of the way into scanning my notes.)

Ever since then I have been typing up my notes as soon as is reasonable (now I often just work in TeX except for some large calculations that I either pull out a piece of scratch paper for or I turn around and use my ginormous blackboard).

Ancient History: my first postdoc

By this time I had been, for a year or so, an adherent of the “type things up and throw it in a folder” workflow. This got messy soon: especially now that I had multiple computers available to use, keeping files in sync and managing current versions became a chore. Then I discovered version control software. I started using Subversion (my vague recollection is that Git wasn’t as popular back then, and I also had the option of running my own SVN server).

I certainly didn’t pay any attention to best practices: I just tossed everything in one giant repo and used it as a way to sync files between my Desktop and my Laptop, keeping versioning information.

This got rid of the problem with multiple versions of files. But I didn’t do anything about my notes being scattered among many files in different directories.

History: my second postdoc

When I started my second postdoc the volume of notes became hard to handle. I had many different files with different partial results, and it had become sometimes difficult to locate that proof that I knew I’d written down, I just didn’t remember where.

It is the digital version of the messy-desk problem where I can’t find my notes.

Around the same time, my friend Kyriacos Leptos showed me how he uses a Wiki to do knowledge management for his research group. A wiki would be overkill for me: first, I no longer had the facilities to set up my own private web server with Wiki attached, and secondly, I didn’t have a “research group”. But I wanted to try to reproduce some aspects of his “digital lab notebook” in a form that is suitable for mathematical research. An important thing for me is that I want my notes to be maximally reusable, at least when it comes to preparing manuscripts for publication. This means I want my notes to be in LaTeX and ideally using the same formatting and macros.

So I had the idea of keeping a giant, monolithic document of all my research output (analogous to how in the mid 1900s scientists would keep paper logs of their experimental results). It is, quite literally, a digital lab notebook. I wrote a LaTeX class file which recreates the feel of “dated entries”. That together with keyword indices gives some rudimentary searchability for the document (both electronic and paper). I had it in my mind that once it gets to a certain length (say, around 500 pages) I will print out and bind the document to leave on my shelf for ease of reference. (There is something nice about paging through paper copies of reference material; spatial location does function as a memory aid.)

There are certain advantages of the monolithic format. Cross references are easy to manage using LaTeX. I no longer have to worry about “which file”; I can just open the file, flip to the index, and look for all the possibly relevant entries. The document is linearly sorted by time of writing, so if I remember roughly when I did the research that can help too.

There are, however, certain disadvantages of the format. By the time I completed my second postdoc, my notes totaled 463 pages. The PDF file is 1.8 megabytes, and takes noticeable time to compile even on my Xeon quad-core desktop (on my netbook the compile time was very noticeable). It can sometimes be hard to find the right index term to look for the correct entry (I hadn’t enough foresight to allow each entry to have multiple keywords). And whenever I have to share notes with colleagues, I have to first go through the PDF file, note down the relevant pages, then “Print to PDF” those pages, and e-mail the selection to my colleagues.

Recent history

When I started my current position, I switched to Git. A large part of the reason was institutional: at my previous position I could not run my own SVN server, but the institution had SVN support, so I could continue using Subversion. Michigan State, however, does not run an SVN server, favoring GitLab instead. So I switched.

I also did some research, reading up on the differences, and in many ways I do find the Git philosophy appealing. (I wrote more about this some time ago.) For the most part switching from Subversion to Git didn’t change my workflow too much, with the one large exception that I now put things in multiple repositories. (I will return to this later.)

Another recent change came last year, when I took some time to redevelop my “digital lab notebook” code. My main concern was to make entries more searchable (supporting an arbitrary number of keywords/tags), as well as to solve the problem of creating extracts from the entries. I originally tried to do this using a custom written “filter-by-tag” LaTeX package (so I can specify which keywords to show, and which to hide in the master document). The filtering package can be found in my LaTeX tools repo under the name www_tagging.sty. This turned out to be slow and inflexible, so I decided to abuse JabRef instead. This last solution worked okay: JabRef has good keyword filtering capabilities. But the downside is that in the entries I am not writing pure “TeX” code; I have to manually insert the required metadata in BibTeX format in a specially crafted comment environment to hide it from true TeX code.

For this and other convenience reasons I have been avoiding actually putting content into my lab notebook!

Instead, for convenience I have been just dumping files into a ScratchPad repository, and this repository is growing to look like what I had before I started doing lab notebooks. So I have successfully created a workflow so bad that I myself don’t want to use it. And this brings me to the past week.

Now

Having come to terms with the fact that my natural preference is to have just a jumble of files lying around in a directory, I decided that the correct thing to do is not to force myself to change my workflow, but to figure out how to adapt my software to accommodate it. (In some way this was influenced by my teaching. At this point I’ve mostly internalized the idea of “alignment” in pedagogy, and this has spilt over to how I approach software as well.) I have decided that my workflow should:

  1. Feel natural. Since I default to a preference of storing notes as many different documents, I should leverage that.
  2. Allow for easy cross references.
  3. Allow for easy export of “entries”.
  4. Be searchable.

One of the things I noticed immediately is that by now “having a printed copy at hand” and “linearly ordered by time of writing” are not things I feel like I need to prioritize. This frees me up to make the “file” level organizing unit correspond to subject matter, and solve the export problem. Cross references between files can be accomplished using the xr package.

How about organization and searchability? This is where some scripting becomes useful. I still organize my notes using JabRef. But now that the documents are split by subject matter, I can use JabRef more like a reference manager! To aid the generation of the BibTeX file, I coded up a class file that spits out certain metadata (title of the document, author, keywords, and the abstract) in BibTeX format into an auxiliary file. A shell script reads the creation and modification date data from Git (I am particularly proud of this part: why datestamp the files by hand when Git was designed to do it?), and compiles a BibTeX database by combining all this information. The info can be loaded into JabRef for easy viewing and searching: with both keywords and abstract, searching for information should be much easier.
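
To give a flavor of the date-stamping step, here is a rough sketch. It is written in Python rather than as the shell script described above, and it is not the actual script: the function names, the choice of the @unpublished entry type, and the "year" and "timestamp" field names are all assumptions made for illustration. The only Git feature it relies on is "git log" with the %aI format, which prints the author date of each commit touching a file.

```python
import subprocess
from pathlib import Path


def git_dates(path):
    """Return (created, modified) ISO 8601 author dates for a tracked file.

    Hypothetical helper: asks Git for the date of every commit that touched
    the file, following renames, instead of stamping dates by hand.
    """
    path = Path(path).resolve()
    result = subprocess.run(
        ["git", "log", "--follow", "--format=%aI", "--", path.name],
        cwd=path.parent, capture_output=True, text=True, check=True,
    )
    dates = result.stdout.split()  # newest commit first
    if not dates:
        raise ValueError(f"{path} has no Git history")
    return dates[-1], dates[0]


def bibtex_entry(key, meta, created, modified):
    """Format one note as a BibTeX entry.

    `meta` holds the fields emitted by the class file (title, author,
    keywords, abstract); the two date fields are illustrative choices.
    """
    fields = dict(meta)
    fields["year"] = created[:4]         # year of the first commit
    fields["timestamp"] = modified[:10]  # date of the latest commit
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@unpublished{{{key},\n{body}\n}}\n"
```

Calling git_dates on each note, feeding the result together with that note's metadata into bibtex_entry, and concatenating the entries would give a database that JabRef can open. Taking the dates from the oldest and newest commits is one reasonable reading of "creation" and "modification"; a file that was copied rather than renamed would start a fresh history.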