splitting up a "big" git repo, take two

Converting to git has been fun, educational, and annoying, at various times. Here are some notes (mostly to myself) on the fun, educational, and annoying bits that I solved today.

When I converted my personal Subversion repository (notes on my RPG, talks, letters, config files) to Git, I just turned one big Subversion repo into one big Git repo. It was about 400 megabytes, in Git.

I decided, a little later, that there was no reason that all my huge Keynote slides needed to be in the same place as my letters to my grandmother or my notes on my RPGs. A little digging made it seem like this would work just fine:

$ git filter-branch --subdirectory-filter letters HEAD

With that, my history would be rewritten to include only the changes in the letters directory, which would become the root of the repository. When I did it, I found that everything was still there – but listed as a new or modified file. That was easy to deal with, too:

$ git reset --hard
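
For anyone who wants to try this without risking a real repository, here's a rough sketch on a throwaway repo. The directory names, file names, and demo identity are all made up. FILTER_BRANCH_SQUELCH_WARNING only matters on newer versions of git, where filter-branch prints a warning and pauses before it runs.

```shell
# Build a tiny stand-in for the big repo (all names are made up for this demo).
demo=$(mktemp -d)
git init -q "$demo/big"
cd "$demo/big"
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir letters talks
echo "Dear Grandma" > letters/grandma.txt
echo "huge slides" > talks/slides.txt
git add .
git commit -q -m "import everything"

# Rewrite history so that letters/ becomes the root of the repository.
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch --subdirectory-filter letters HEAD >/dev/null 2>&1
git reset -q --hard

# List what git now tracks at the top level.
tracked=$(git ls-files)
echo "$tracked"
```

After the rewrite, only the files that used to live under letters/ should be listed, without the letters/ prefix.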

Great! Now I had only my fifteen letters with twenty total revisions. Only one small problem:

$ du -sh ../letters
381M letters

$ git gc && git prune
$ du -sh ../letters
381M letters

I couldn’t figure out what the hell was going on, mostly because I didn’t know much about how git stored things. More importantly, I didn’t realize that when cloning a git repository from one directory to another on the same filesystem, git-clone will make hardlinks. Then, when cleaning up, it will see that files are in use and not purge them. They’re not actually taking up any more space, and after I had removed the original, big repository that I’d cloned, a cleanup would have worked.
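
The hardlink part is easy to see with plain files, no git needed. A small sketch (the file names are made up): two names for one file share the same data, and deleting one name frees nothing while the other name still exists.

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'pretend this is a big pack file\n' > original
ln original linked          # a second name for the same data, not a copy

# -ef is true when both names refer to the same file on disk.
if [ original -ef linked ]; then same=yes; else same=no; fi
echo "same file: $same"

# Removing one name leaves the data reachable through the other.
rm original
contents=$(cat linked)
echo "$contents"
```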

I realized this later, when reading the docs because of another problem. I had long since removed the trimmed-down repositories, because I thought they were taking up too much space. To deal with the problem from the get-go, I did something like this:

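# Assumption: the big repository lives in ./big, and each target
# gets its own scratch clone named after the last path component.
for dir in talks/exporter rpg/deliverance letters; do
  (
    name=$(basename "$dir")
    git clone big "$name"                       # scratch clone, hardlinked
    cd "$name"
    git filter-branch --subdirectory-filter "$dir" HEAD
    git reset --hard
    cd ..
    git clone --no-hardlinks "$name" "_$name"   # real copies of the objects
    cd "_$name"
    git gc --aggressive
    git prune
  )
done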

The important thing here is the --no-hardlinks option to git clone. Obviously, it prevents hardlinking, instead making copies of everything. With that done, the gc and prune commands can work as I had expected, removing all the objects not used in this particular repo.
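
To check whether the cleanup really did anything, git count-objects -v reports how many loose and packed objects a repository holds and how much space they take. A small sketch of a helper for comparing before and after; the function name repo_report is made up.

```shell
# Hypothetical helper: print the on-disk size and object counts for a repo.
repo_report() {
  (
    cd "$1" || exit 1
    du -sh .
    git count-objects -v
  )
}

# Example: a brand-new repository has no objects yet.
scratch=$(mktemp -d)
git init -q "$scratch/empty"
report=$(repo_report "$scratch/empty")
echo "$report"
```

Run it on a repo before and after the gc and prune steps and compare the size-pack lines. If the numbers don't shrink, the filter-branch documentation has a checklist for shrinking a repository. It points out that backup refs under refs/original and reflog entries can keep old objects alive.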

More and more, I find that git is a really well-designed system. I wish I had a reason to learn much more about it!

Written on September 11, 2007
🔀 git