Finding files in git repositories using checksums

Recently I wanted to replicate an evaluation from an older paper, which also required using an older version of one of my tools. I knew exactly which code was used for the evaluation because I had a tar (jdime-src.tar.xz) of the source code. What I did not know was whether this tar corresponded to a specific release, a specific commit, or a state that was never put into git at all. So I wanted to find out and, if necessary, create a corresponding commit and tag it.

I started by extracting the tar and generating checksums for each relevant file:
tar -xJf jdime-src.tar.xz && cd jdime-src
find src/ -type f -name \*.java -exec sha1sum {} \; > /tmp/sha1sums

/tmp/sha1sums now looked like this:

015bd1948789546e977266196fbea427b2eb64d1  src/de/fosd/jdime/merge/package-info.java
c8cdccd30d215fb955a9b0011a442d6a792cf91f  src/de/fosd/jdime/merge/MergeInterface.java
28722ec3fd2e280519cc4d9bdc97496f4ae57505  src/de/fosd/jdime/merge/UnorderedMerge.java
5c1b1082417790b853c9bb971cf22667ecc25274  src/de/fosd/jdime/merge/OrderedMerge.java
bc18227a8c63100bb5d97e9d5923e76df3d9eb46  src/de/fosd/jdime/merge/Merge.java
...

Next, I needed a way to look up each file in my git repository and find the commit, if there is one, in which the file matches the checksum.

I found a gist on GitHub by mloberg, which I rewrote a little for my needs.
The result was this (also available as a gist):

#!/bin/sh
# find-by-hash.sh

usage() {
 echo "Usage: $0 [-m] [-s] hash file"
 # printf is used because echo does not expand \t in every shell
 printf '\t-m use md5 for hashing\n'
 printf '\t-s use sha1 for hashing (this is the default)\n'
 exit 1
}

HASHCMD="sha1sum"

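# neither option takes an argument, so no colon after m
while getopts ":ms" opt; do
 case $opt in
  m)
   HASHCMD="md5sum"
   ;;
  s)
   HASHCMD="sha1sum"
   ;;
  *)
   usage
   ;;
 esac
done
# drop the parsed options so that $1 and $2 are hash and file
shift $((OPTIND - 1))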

CHECKSUM=$1
FILE=$2

if [ -z "$CHECKSUM" ] || [ -z "$FILE" ]; then
 usage
fi

# Check if valid git repo
ROOT=$(git rev-parse --show-toplevel)

if [ $? -ne 0 ]; then
 echo "Not a valid git repo."
 exit 1
fi

cd "$ROOT" || exit 1

# git revisions that touched the file
REVS=$(git rev-list --all -- "$FILE")

# temp file
file_to_check=$(mktemp)

# check each revision for checksum
FOUND=""
for rev in $REVS; do
 git show "$rev:$FILE" > "$file_to_check" 2>/dev/null
 if $HASHCMD "$file_to_check" | grep -q "$CHECKSUM"; then
  FOUND="$rev"
  # intentionally no break to see if we find an older revision
  # insert a break if you want the most recent commit instead of the oldest
 fi
done

# cleanup
rm -f "$file_to_check"

# output
if [ -n "$FOUND" ]; then
 echo "$FOUND"
 exit 0
else
 echo "Not found: $CHECKSUM $FILE"
 exit 1
fi
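
To see the core loop of the script in isolation, here is a minimal sketch that runs it against a throwaway repository. It assumes git and sha1sum are installed; Demo.java, its contents, and the commit identity are placeholders I made up for the demonstration. Unlike the script above, it hashes the output of git show through a pipe instead of a temp file.

```shell
#!/bin/sh
# Sketch: exercise the revision-scanning loop in a throwaway repository.
# File name, contents, and commit identity are invented for this demo.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .

commit() {
 # Identity is passed on the command line so no global git config is needed.
 git -c user.name=demo -c user.email=demo@example.invalid \
     -c commit.gpgsign=false commit -q -m "$1"
}

echo "version 1" > Demo.java
git add Demo.java
commit "first"
first=$(git rev-parse HEAD)

echo "version 2" > Demo.java
git add Demo.java
commit "second"

# Checksum of the content we are looking for (the first version).
wanted=$(echo "version 1" | sha1sum | awk '{print $1}')

found=""
for rev in $(git rev-list --all -- Demo.java); do
 # Hash the file as it is stored in this revision.
 sum=$(git show "$rev:Demo.java" | sha1sum | awk '{print $1}')
 if [ "$sum" = "$wanted" ]; then
  found=$rev
 fi
done

echo "found=$found first=$first"
```

Only the first commit contains the wanted content, so found and first end up holding the same commit id.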

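The rest was easy: running the script for each line in /tmp/sha1sums shows whether the repository contains a matching commit. Each output line has the format commit checksum filename ($line is left unquoted on purpose, so that it splits into the checksum and file arguments):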

> while read -r line; do echo "$(./find-by-hash.sh $line) $line"; done < /tmp/sha1sums | tee /tmp/results
6b14e2bf8eeaba3176c4e310c055cd4480e06ebd 015bd1948789546e977266196fbea427b2eb64d1  src/de/fosd/jdime/merge/package-info.java
6b14e2bf8eeaba3176c4e310c055cd4480e06ebd c8cdccd30d215fb955a9b0011a442d6a792cf91f  src/de/fosd/jdime/merge/MergeInterface.java
...
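
The loop above relies on word splitting to turn each line into two arguments. A slightly more defensive variant reads the two fields explicitly, which also keeps file names containing spaces intact. This is only a sketch: find_by_hash is a hypothetical stand-in for ./find-by-hash.sh, and "abc123" is a made-up commit id.

```shell
#!/bin/sh
# Sketch: split each checksum line into its two fields explicitly.
# find_by_hash is a hypothetical stand-in for ./find-by-hash.sh.
find_by_hash() {
 # $1 = checksum, $2 = file; pretend every file is found in commit "abc123".
 echo "abc123"
}

sums=$(mktemp)
# Two sample lines in sha1sum's output format (hash, two spaces, path).
printf '%s  %s\n' \
 015bd1948789546e977266196fbea427b2eb64d1 src/de/fosd/jdime/merge/package-info.java \
 c8cdccd30d215fb955a9b0011a442d6a792cf91f src/de/fosd/jdime/merge/MergeInterface.java \
 > "$sums"

results=$(mktemp)
# read puts the first word into sum and the rest of the line into file.
while read -r sum file; do
 echo "$(find_by_hash "$sum" "$file") $sum  $file"
done < "$sums" > "$results"

cat "$results"
rm -f "$sums"
```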

Let’s see how many different commits we’re dealing with:

> awk '{print $1}' /tmp/results | sort | uniq -c
6b14e2bf8eeaba3176c4e310c055cd4480e06ebd
a0ba24e5c0237d6cb1c394cbfadb9d9df895a410
de655106866164efccd060b36064c8c6db87cb7a
f2f5265f1dd666948962cdeb63e90d08c1cf322c

Great: there is no “Not found” entry in the list, which means each file was found in exactly the version that is in the tar! They’re just spread over 4 different commits, but that’s no big problem. What I wanted to do next was to create a commit that reproduces the state of the tar, so I can put a tag on it and find it easily in the future.
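
Both checks can also be made explicit: count the failed lookups, and pick the commit that covers the most files as a candidate base. The following is a minimal sketch on invented sample data (the commit ids, checksums, and file names are placeholders, not my real results).

```shell
#!/bin/sh
# Sketch: summarize a results file of "commit checksum  filename" lines.
# The sample data below is invented for illustration.
results=$(mktemp)
cat > "$results" <<'EOF'
aaaa111 1111  src/A.java
aaaa111 2222  src/B.java
bbbb222 3333  src/C.java
aaaa111 4444  src/D.java
EOF

# Lines from failed lookups start with "Not found:", so count those.
missing=$(grep -c '^Not found:' "$results")

# Count files per commit and pick the commit that covers the most files.
base=$(awk '{print $1}' "$results" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

echo "missing=$missing base=$base"
rm -f "$results"
```

On the sample data this reports no missing files and selects aaaa111, the commit that accounts for three of the four files.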

The majority of files (50) are already in a single commit (6b14e2bf), so I used that as the base for the branch:
git checkout -b old-evaluation 6b14e2bf8eeaba3176c4e310c055cd4480e06ebd

So what about the 4 remaining files that are from different commits? Let’s just put them on top of our new branch:

base=6b14e2bf8eeaba3176c4e310c055cd4480e06ebd
# skip the base commit explicitly instead of relying on it sorting first
awk -v base="$base" '$1 != base {print $1, $3}' /tmp/results | while read -r rev file; do
 git checkout "$rev" -- "$file" && git add "$file"
done

All that was left to do was to commit these changes and tag the revision:
git commit -m ... && git tag -a ...
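
As a final sanity check, the checksum list from the beginning can be fed to sha1sum -c, which recomputes each checksum and compares it with the recorded one. In my case that would mean running sha1sum -c /tmp/sha1sums from the repository root on the new branch. The sketch below shows the behaviour on a throwaway directory; the directory layout and file contents are placeholders.

```shell
#!/bin/sh
# Sketch: verify a tree against a checksum list with sha1sum -c.
# The directory and file names are placeholders.
set -e
work=$(mktemp -d)
cd "$work"
mkdir -p src
echo "class A {}" > src/A.java
echo "class B {}" > src/B.java

# Record checksums the same way as for the extracted tar.
find src/ -type f -name '*.java' -exec sha1sum {} \; > sums

# Unchanged tree: every file matches, so sha1sum -c exits with status 0.
if sha1sum -c sums >/dev/null 2>&1; then before=ok; else before=mismatch; fi

# Modify one file: verification now fails with a non-zero status.
echo "// changed" >> src/A.java
if sha1sum -c sums >/dev/null 2>&1; then after=ok; else after=mismatch; fi

echo "before=$before after=$after"
```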