Remove files from git history

When a repository contains files that should never have been committed, removing them from the history is hard, because git is built to keep a history, not to change it. The following procedure rewrites the history by removing the file or files from all past commits.

When reaching deep into the repository’s history as described here, starting from a clean state reduces the risk of something going wrong. With this in mind, make sure to start with a clean repository.
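
One way to check for a clean state before starting is sketched below. It assumes that “clean” means no staged, modified, or untracked files. The function name repo_is_clean is made up for this example.

```shell
#!/bin/sh
# Succeeds only if the given directory is a git work tree without
# staged, modified, or untracked files.
# (repo_is_clean is a hypothetical helper name, not a git command.)
repo_is_clean() {
    # "git status --porcelain" prints one line per changed or untracked
    # file, so empty output means there is nothing uncommitted.
    changes=$(git -C "$1" status --porcelain 2>/dev/null) || return 1
    [ -z "$changes" ]
}

if repo_is_clean .; then
    echo "clean: safe to start rewriting"
else
    echo "not clean (or not a repository): commit or stash changes first"
fi
```

If the check fails, committing or stashing the open changes first brings the repository into a state where the rewrite can start.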

Rewrite affected commits

Rewriting the history is done with “git filter-branch” by walking through the complete history. For each commit, filters are applied after which the changes are re-committed. The different filters allow modifying different parts of the commit. The following command can be used to remove the file “directory/and/filename.extension” from the entire repository.

$ git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch directory/and/filename.extension' \
  --prune-empty --tag-name-filter cat -- --all

The “filter-branch” command rewrites the git history by executing a filter for every commit. The filter is run for all commits selected by the specified revisions (rev-list). In the above command, “--all” is specified after the “--” option, which separates the rev-list options from the “filter-branch” options.

With “--index-filter”, the repository’s index is filtered. The filter command used in the above command is “git rm --cached --ignore-unmatch directory/and/filename.extension”.

This filter command itself is also a git command. It instructs git to remove the files specified by the last argument. The argument “--cached” causes the files to be removed from the index only, while “--ignore-unmatch” causes git to exit with return code 0 even if no files matched. This is needed for git filter-branch to continue when a commit does not contain any matching files. The last argument is the file or directory name to be removed. Wildcards can be used here to remove a number of files at once.
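
The effect of “--ignore-unmatch” can be seen in isolation. The following sketch uses a throwaway repository created with mktemp (an assumption made for this example, not part of the procedure) and compares the exit codes with and without the option.

```shell
#!/bin/sh
# Throwaway repository; nothing here touches a real project.
demo=$(mktemp -d)
git init -q "$demo"

# Without --ignore-unmatch, a path that matches nothing is an error.
if git -C "$demo" rm -q --cached no-such-file 2>/dev/null; then
    plain_status=0
else
    plain_status=$?
fi

# With --ignore-unmatch, the same call succeeds, so filter-branch
# would carry on with the next commit.
if git -C "$demo" rm -q --cached --ignore-unmatch no-such-file; then
    ignore_status=0
else
    ignore_status=$?
fi

echo "without --ignore-unmatch: exit code $plain_status"
echo "with --ignore-unmatch:    exit code $ignore_status"
rm -rf "$demo"
```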

Another filter is applied via “--tag-name-filter”. It is executed for each tag pointing to a rewritten commit. As the tag itself shall not be renamed but just updated to point to the rewritten commit, the “cat” command in this filter returns the tag name just as it received it, therefore not modifying it.
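
To see why “cat” leaves tag names alone: the filter receives the old tag name on standard input and has to print the new tag name on standard output. A minimal sketch of that hand-over, outside of filter-branch and with a made-up tag name:

```shell
#!/bin/sh
# "v1.0" is a hypothetical tag name used only for this illustration.
old_tag="v1.0"

# The same data flow filter-branch uses for a tag name filter:
# old name in on stdin, new name out on stdout.
new_tag=$(printf '%s\n' "$old_tag" | cat)

echo "$old_tag -> $new_tag"
```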

Providing “--prune-empty” instructs git filter-branch to remove empty commits completely. If the filters leave a commit without any changes, the complete commit is dropped.

Executing the above command may take a while, depending on the size of the repository and the number of commits in it. The following is example output from the above command. It has been shortened as it contains a lot of similar lines.

Rewrite f956cf74dce2f28db1955d6ad47138431255c8ad (35/1103) (2 seconds passed, remaining 61 predicted)
Rewrite dc263b8e6a4453bc0f5c11639fe2428faaa6c7cf (53/1103) (3 seconds passed, remaining 59 predicted)
Rewrite d8a44433139b553ae75e8f4133d7e0f041f390bb (197/1103) (11 seconds passed, remaining 50 predicted)    rm 'directory/and/filename.extension'
Rewrite 279c8a1a275d7cc3e4611adbbd684e0d39359789 (197/1103) (11 seconds passed, remaining 50 predicted)    rm 'directory/and/filename.extension'
...
Rewrite 8a6b1bf2b41d58164e1e2d4ee5b8edb275ac6382 (1103/1103) (90 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten
Ref 'refs/remotes/origin/master' was rewritten
WARNING: Ref 'refs/remotes/origin/master' is unchanged

At this point, the repository’s history is rewritten and the new commits do not contain any reference to the files removed with the previous command. The content of the files (more precisely, the old commits) is still in the local clone of the repository, but it is no longer reachable from the rewritten “master” branch or from “HEAD”. The old history is still referenced via “refs/original/refs/remotes/origin/master” and “refs/original/refs/heads/master”.
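
This state can be reproduced on a small scale. The sketch below builds a throwaway repository with a made-up file “secret.txt”, runs the same filter on it, and then counts the commits touching that file in the rewritten branch and in the backup under “refs/original”. The repository, the file names and the FILTER_BRANCH_SQUELCH_WARNING setting are assumptions of this example. Newer git versions print a warning and pause before filter-branch starts, and that variable skips the pause.

```shell
#!/bin/sh
# Throwaway repository with one unwanted file (all names are made up).
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
echo "keep" > keep.txt
echo "secret" > secret.txt
git add keep.txt secret.txt
git commit -q -m "Add files"
branch=$(git symbolic-ref --short HEAD)

# Same filter as in the article, applied to the made-up file.
# The output is discarded here to keep the example quiet.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch secret.txt' \
  --prune-empty --tag-name-filter cat -- --all >/dev/null 2>&1

# Commits on the rewritten branch that touch the file.
rewritten=$(git rev-list --count "$branch" -- secret.txt)
# Commits in the backup ref that touch the file: the old history.
backup=$(git rev-list --count "refs/original/refs/heads/$branch" -- secret.txt)

echo "rewritten branch: $rewritten commit(s) touch secret.txt"
echo "backup ref:       $backup commit(s) touch secret.txt"
```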

Clean up the repository

So how to clean up the local copy of the repository? As mentioned, the rewritten history is clean, but the actual content of the removed files is still there.

$ git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin

To remove the remaining references to the old history in the local repository, “git for-each-ref” prints all refs below “refs/original”, each prefixed with the “delete” command. This output is piped to “git update-ref”, which reads the commands from standard input and deletes every reference to the old history.
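
Before deleting anything, the first half of the pipeline can be run on its own as a dry run to see exactly which refs would be removed. The sketch below does this in a throwaway repository, where a manually created ref stands in for the backup that filter-branch leaves behind (an assumption made to keep the example short).

```shell
#!/bin/sh
# Throwaway repository (all names are made up).
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"

# Stand-in for a backup ref as created by filter-branch.
git update-ref refs/original/refs/heads/master HEAD

# Dry run: only print the commands, do not pass them to update-ref.
plan=$(git for-each-ref --format='delete %(refname)' refs/original)
echo "$plan"

# Real run: the same pipeline as in the article.
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin

# Nothing should be listed below refs/original anymore.
remaining=$(git for-each-ref refs/original)
echo "remaining backup refs: ${remaining:-none}"
```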

The above “git update-ref” call removed the references to the old commits. The following command expires the reflog, as it still contains references to the old commits.

$ git reflog expire --expire=now --all

Using “--expire=now” with “git reflog” ensures that all reflog entries up to now are expired. Without this parameter, only reflog entries older than 90 days would be removed, and this would leave some of our references behind.

As the local clone of the repository still contains all objects for the old commits but no references to them, the objects need to be removed. This can be done using “git gc”. Git’s garbage collector (gc) can only remove these orphaned objects once there are no references to them anymore.

$ git gc --prune=now
Counting objects: 6909, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (1970/1970), done.
Writing objects: 100% (6909/6909), done.
Total 6909 (delta 4832), reused 6407 (delta 4659)

In the same way as “--expire=now” on the “reflog” command, “gc” uses the parameter “--prune” to clean up only objects of a certain age. Usually this defaults to 2 weeks. In this case, “now” removes all unreferenced objects without any age limitation.
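
Whether an unreferenced object is really gone can be checked with “git cat-file -e”, which succeeds only if the object exists. In the following sketch, a blob written with “git hash-object” stands in for the content of a removed file. The throwaway repository and the blob are assumptions of this example.

```shell
#!/bin/sh
# Throwaway repository (all names are made up).
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"

# Write an object that no commit references, standing in for the
# content of a file removed from the history.
orphan=$(echo "unwanted content" | git hash-object -w --stdin)
git cat-file -e "$orphan" && echo "before gc: object present"

git reflog expire --expire=now --all
git gc --quiet --prune=now

if git cat-file -e "$orphan" 2>/dev/null; then
    echo "after gc: object still present"
else
    echo "after gc: object removed"
fi
```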

Push the rewritten history

Pushing with “git push” sends to the remote server only what is reachable from the pushed branches and tags. As such, the removed files are no longer part of the history in the remote copy of the repository, although the server may keep the old objects until it runs its own garbage collection. As the history was rewritten, the “--force” option needs to be given to overwrite the history of the remote repository. The first command force pushes all branches, and the second command additionally force pushes the tags to the remote server.

$ git push origin --force --all
$ git push origin --force --tags

Update other clones

After the repository has been filtered and the history has been rewritten, the changes were force pushed to the remote server. Now every clone of this repository has to be updated. This cannot be done with the usual “pull” alone.

The first step in updating a clone is to fetch the rewritten history from the remote server. Using “reset” then switches the local branch from the old state to the new state of origin/master. Note that “reset --hard” discards uncommitted local changes.

$ git fetch origin 
$ git reset --hard origin/master
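
The interplay of force push, fetch and reset can be tried out safely with a local bare repository acting as the server. The following is a sketch under several assumptions: the directory names “server.git”, “alice” and “bob” are made up, and an amended commit stands in for the filter-branch run to keep the example short.

```shell
#!/bin/sh
# Throwaway setup: a bare "server" plus two clones (names are made up).
work=$(mktemp -d)
git init -q --bare "$work/server.git"

# First clone: commits a file that should not be there and pushes it.
git clone -q "$work/server.git" "$work/alice" 2>/dev/null
git -C "$work/alice" config user.name "Alice"
git -C "$work/alice" config user.email "alice@example.com"
echo "keep" > "$work/alice/keep.txt"
echo "secret" > "$work/alice/secret.txt"
git -C "$work/alice" add keep.txt secret.txt
git -C "$work/alice" commit -q -m "Add files"
branch=$(git -C "$work/alice" symbolic-ref --short HEAD)
git -C "$work/alice" push -q origin "$branch"

# Second clone: still based on the old history.
git clone -q "$work/server.git" "$work/bob"

# Stand-in for the filter-branch run: recreate the commit without the
# file, then force push the rewritten branch.
git -C "$work/alice" rm -q --cached secret.txt
git -C "$work/alice" commit -q --amend -m "Add files"
git -C "$work/alice" push -q --force origin "$branch"

# Update the second clone: fetch, then reset to the new remote state.
git -C "$work/bob" fetch -q origin
git -C "$work/bob" reset -q --hard "origin/$branch"

alice_head=$(git -C "$work/alice" rev-parse HEAD)
bob_head=$(git -C "$work/bob" rev-parse HEAD)
if [ "$alice_head" = "$bob_head" ]; then
    echo "second clone now matches the rewritten history"
fi
```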

In the same way as above, the old commits need to be removed in order to clean up the local clone.

$ git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
$ git reflog expire --expire=now --all
$ git gc --prune=now

If a significant amount of content was removed from the repository, you might notice a decrease in the size of the local clone after the garbage collection.


Read more of my posts on my blog at https://blog.tinned-software.net/.
