BOM Away, in Git Style

Some time ago I made a Mercurial hook (killbom) that removes the BOM from UTF-8 encoded files. When I switched to Git, I didn’t want to part with it, so it was time for a rewrite. Unlike Mercurial, Git has no global hook mechanism (at least not at the time of writing), so you will need to add the hook to each repository you want it in.

The start is easy enough. Just create a pre-commit file in the .git/hooks directory; seen from the base of the repository, the file name is thus .git/hooks/pre-commit. The content of that file is as follows:

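#!/bin/sh
# Pre-commit hook that strips the UTF-8 BOM from staged text files.

# Staged files (added, copied, modified, renamed), NUL-separated.
git diff --cached --diff-filter=ACMR --name-only -z | xargs -0 -n 1 sh -c '
    for FILE; do
        # Only files that file(1) reports as text are touched.
        if file --brief "$FILE" | grep -q text; then
            # Set the working copy aside; it may hold unstaged changes.
            cp "$FILE" "$TEMP/KillBom.tmp"
            # Put the staged content into the working tree.
            git checkout -- "$FILE"

            sed -b -i -e "1s/^\xEF\xBB\xBF//" "$FILE"
            # A difference from the index now means sed removed a BOM.
            # The check is limited to this file so that unstaged changes
            # in other files do not count.
            NEEDSADD=`git diff --diff-filter=ACMR --name-only -- "$FILE" | wc -l`
            if [ $NEEDSADD -ne 0 ]; then
                sed -b -i -e "1s/^\xEF\xBB\xBF//" "$TEMP/KillBom.tmp"
                echo "Removed UTF-8 BOM from $FILE"
                git add "$FILE"
            fi

            # Restore the working copy.
            cp "$TEMP/KillBom.tmp" "$FILE"
            rm "$TEMP/KillBom.tmp"
        else
            echo "BINARY $FILE"
        fi
    done
' sh

# If removing BOMs left nothing staged, hand over to a plain commit
# so that Git reports the empty commit itself.
ANYCHANGES=`git diff --cached --name-only | wc -l`
if [ $ANYCHANGES -eq 0 ]; then
    git commit --no-verify
    exit 1
fi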

The script first gets the list of all staged files (added, copied, modified or renamed), separated by the null character so that we can deal with spaces in the file names:

git diff --cached --diff-filter=ACMR --name-only -z
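Why the null separator matters can be seen without Git at all. The sketch below is only an illustration: printf stands in for the git diff output with two made-up file names, and print_each is a hypothetical helper name. The xargs and sh -c construct is the same one the hook uses.

```shell
#!/bin/sh
# print_each is a hypothetical helper; the printf call stands in for
# "git diff --cached --name-only -z" and emits two NUL-terminated names.
print_each() {
    printf 'a b.txt\0c.txt\0' | xargs -0 -n 1 sh -c '
        for FILE; do
            # Each name arrives intact as one argument, spaces included.
            echo "[$FILE]"
        done
    ' sh
}

print_each
```

The trailing sh becomes $0 of the inner shell, so each file name lands in $1, which is where the for loop picks it up.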

For each of these files we then remove the first three bytes if they are 0xEF, 0xBB, 0xBF:

sed -b -i -e "1s/^\xEF\xBB\xBF//" "$FILE"
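To see the substitution at work, it can be tried on a throwaway file. This is a minimal sketch assuming GNU sed; the file name and variable names are made up, LC_ALL=C is my addition so that sed treats the input as raw bytes, and the -b (binary mode) option is left out because it only makes a difference on systems that distinguish text and binary files.

```shell
#!/bin/sh
# Assumes GNU sed, which understands \xHH escapes and the -i option.
# LC_ALL=C is an addition for this demo so that sed works on raw bytes.
DEMO_DIR=$(mktemp -d)
SAMPLE="$DEMO_DIR/sample.txt"

# Three BOM bytes (octal 357 273 277) followed by six bytes of text.
printf '\357\273\277hello\n' > "$SAMPLE"
SIZE_BEFORE=$(($(wc -c < "$SAMPLE")))

# Same expression as in the hook: only line 1 is touched.
LC_ALL=C sed -i -e "1s/^\xEF\xBB\xBF//" "$SAMPLE"
SIZE_AFTER=$(($(wc -c < "$SAMPLE")))

echo "size before: $SIZE_BEFORE, size after: $SIZE_AFTER"
CONTENT=$(cat "$SAMPLE")
rm -r "$DEMO_DIR"
```

The file should shrink by exactly three bytes while the text itself stays untouched.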

What follows is a bit of a mess. Since it is really hard to tell whether a file has been changed without resorting to temporary files, I am abusing Git to check whether the file has changed since it was staged. If it has, the assumption is that the sed call before it was responsible. If that assumption is wrong, your commit will have one extra file. As people don’t usually have the same file changed in both the staged and unstaged areas, I believe the risk is reasonably low.
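A different way to find out whether sed would change anything is to look at the first three bytes up front. This is not what the hook does; it is a minimal sketch of an alternative, with has_bom as a made-up helper name, and it assumes that head -c, od and tr are available.

```shell
#!/bin/sh
# has_bom is a hypothetical helper, not part of the hook.
# It succeeds when the file starts with the bytes EF BB BF.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Demo on two throwaway files, one with a BOM and one without.
DEMO_DIR=$(mktemp -d)
printf '\357\273\277with bom\n' > "$DEMO_DIR/bom.txt"
printf 'no bom\n' > "$DEMO_DIR/plain.txt"

for FILE in "$DEMO_DIR/bom.txt" "$DEMO_DIR/plain.txt"; do
    if has_bom "$FILE"; then
        echo "BOM found in $FILE"
    else
        echo "no BOM in $FILE"
    fi
done
rm -r "$DEMO_DIR"
```

A check like this would answer the question before the file is modified, at the price of depending on a few more tools.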

After all files are processed, a final check is made whether anything is left to commit. If there are no files in the staging area, the current commit is terminated and a new commit is started with the --no-verify option. The only reason for this is so that Git’s standard message is shown when removal of the UTF-8 BOM leaves no actual files to commit. Replacing it with a message “No files to commit” would work equally well.
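The “No files to commit” variant could look roughly like this. It is a sketch rather than the version in the hook; abort_if_nothing_staged is a hypothetical function name, and the check itself is the same git diff call the hook already uses.

```shell
#!/bin/sh
# Hypothetical replacement for the tail of the hook: instead of starting
# a second commit, print a message and report failure.
abort_if_nothing_staged() {
    ANYCHANGES=$(git diff --cached --name-only | wc -l)
    if [ $ANYCHANGES -eq 0 ]; then
        echo "No files to commit" >&2
        return 1
    fi
    return 0
}

# At the end of the hook this would be called as:
#     abort_if_nothing_staged || exit 1
```

A non-zero exit status from a pre-commit hook aborts the commit, so the message is all the user would see.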

While my goal of getting the BOM removed via a hook has been reasonably successful, Git’s hook model is really much worse than the one Mercurial has. Not only are global (local) hooks missing, but having multiple hooks run one after another is not really possible. Yes, you can merge scripts together in one file, but that means you’ll need to handle all exit scenarios for each hook you need. And let’s not even get into how portable these hooks are between Windows and Linux.
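To give an idea of what handling the exit scenarios involves, here is a minimal sketch of a dispatcher that runs several hook scripts in turn. Everything in it is an assumption for illustration: run_hooks is a made-up name, and so is the pre-commit.d directory in the usage comment.

```shell
#!/bin/sh
# run_hooks is a hypothetical dispatcher: it runs every executable file in
# the given directory and stops at the first one that fails.
run_hooks() {
    for HOOK in "$1"/*; do
        # Skip directories, non-executable files and an unmatched glob.
        if [ ! -f "$HOOK" ] || [ ! -x "$HOOK" ]; then
            continue
        fi
        # Pass the exit status of a failing hook on to the caller.
        "$HOOK" || return $?
    done
    return 0
}

# Inside .git/hooks/pre-commit this could be called as (directory name made up):
#     run_hooks "$(git rev-parse --git-dir)/hooks/pre-commit.d" || exit $?
```

Even this small version has to decide what happens on failure; here the first failing script wins and the remaining ones never run.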

If you are wondering what all that $TEMP business is about, it is needed for interactive commits. Committing just part of a file is useful but didn’t play well with this hook. Saving a copy on the side sorted out that problem.
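The save-and-restore idea can be shown in isolation. The sketch below is not the hook’s code: with_saved_copy is a hypothetical helper, and it uses mktemp (assumed to be available) rather than $TEMP, which is typically set on Windows but not necessarily on Linux.

```shell
#!/bin/sh
# with_saved_copy is a hypothetical helper. It sets the working copy of a
# file aside, runs the given command, and then puts the copy back.
with_saved_copy() {
    SAVED_FILE=$1
    shift
    BACKUP=$(mktemp) || return 1
    cp "$SAVED_FILE" "$BACKUP" || return 1
    "$@"
    STATUS=$?
    cp "$BACKUP" "$SAVED_FILE"
    rm -f "$BACKUP"
    return $STATUS
}

# Example: whatever the command does to notes.txt is undone afterwards.
DEMO_DIR=$(mktemp -d)
echo "unstaged work" > "$DEMO_DIR/notes.txt"
with_saved_copy "$DEMO_DIR/notes.txt" \
    sh -c 'echo "staged content" > "$1"' sh "$DEMO_DIR/notes.txt"
cat "$DEMO_DIR/notes.txt"
rm -r "$DEMO_DIR"
```

The hook goes one step further and also strips the BOM from the saved copy, so the restored working file matches what was committed.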

The current version of the pre-commit hook can be downloaded from GitHub.

PS: Instead of editing the pre-commit file directly, you can also create it somewhere else and create a symbolic link at the proper location.
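Setting up such a link could look roughly like this. It is a sketch under assumptions: install_hook is a made-up helper name and the paths in the usage comment are placeholders.

```shell
#!/bin/sh
# install_hook is a hypothetical helper.
# $1: absolute path to the hook script kept outside the repository
# $2: root of the repository that should use it
install_hook() {
    # Git only runs hooks that are executable.
    chmod +x "$1" || return 1
    # An absolute target keeps the link valid regardless of where it lives.
    ln -s -f "$1" "$2/.git/hooks/pre-commit"
}

# Usage (placeholder paths):
#     install_hook "$HOME/hooks/killbom-pre-commit" "/path/to/repository"
```

The same script can then be linked into several repositories and updated in one place.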

PPS: I have developed and tested this hook under Windows. It should work under Linux too, but your mileage may vary depending on the exact distribution.

[2015-07-12: Added support for interactive commits.] [2015-11-17: Added detection for text/binary.]