Git is a blessing, but pushing repositories into the wild got me worried about two points: security and size. The following is of course a summary of my experiences, combined with snippets found on other sites, to reach a good git optimization.

Git optimization of repo size

With time, every git repository tends to get dirty. Going in one direction and changing your mind a second later comes at a cost in versioned environments: every object that was ever committed is kept and takes up a lot of space, especially if you keep track of archives.

See how bad the situation is with:

git bundle create tmp.bundle --all  
du -sh tmp.bundle  
rm tmp.bundle  
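If you prefer not to write a temporary bundle, here is a minimal sketch of another way to look at the size. It relies on `git count-objects -vH`, which reports loose and packed object sizes in human-readable units. The helper name `repo_size` is made up for this example.

```shell
# repo_size is a hypothetical helper name, not a git command.
# Prints object counts and on-disk sizes for the repository at $1
# (defaults to the current directory).
repo_size() {
    git -C "${1:-.}" count-objects -vH
}
```

Call it as `repo_size .` and look at the `size-pack` line, which should roughly follow the size of the bundle.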

We first need to detect which files are causing trouble. This takes the following three commands:

git rev-list --objects --all | sort -k 2 > allfileshas.txt to get a list of all files in the history.

git gc && git verify-pack -v .git/objects/pack/pack-*.idx | grep -E "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt to get a list of big files, sorted by decreasing size.

Now get the real file names:

for SHA in $(cut -f 1 -d ' ' < bigobjects.txt);
do
   echo $(grep "$SHA" bigobjects.txt) $(grep "$SHA" allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt
done;
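The three commands above can probably be folded into one pipeline. This is only a sketch, assuming a git recent enough to accept a custom format for `git cat-file --batch-check`; the function name `list_big_blobs` is hypothetical. It sorts by the uncompressed object size and, unlike the loop above, keeps paths that contain spaces.

```shell
# list_big_blobs is a hypothetical helper name.
# Prints "<sha> <size in bytes> <path>" for the blobs in the history,
# biggest first. $1 limits the number of lines (default 20).
list_big_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |       # keep blobs, drop commits and trees
        sort -k 3,3nr |            # third field is the object size
        head -n "${1:-20}" |
        cut -d ' ' -f 2-           # drop the leading "blob" column
}
```

For example, `list_big_blobs 10 > bigtosmall.txt` would write the ten biggest files to the same file the loop produces.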

Cloning is caring

The remedy comes with the following list. Although step #1 is probably the most important, cloning first keeps things clean, and a local clone is cheap because it uses hard links. Clone your repository locally to start from clean references, then do:

  1. git filter-branch --index-filter 'git rm --cached --ignore-unmatch <FILE NAME>' Removes the file from every revision of the current branch (append -- --all to rewrite every branch).
    • ex. git filter-branch --index-filter 'git rm --cached --ignore-unmatch **/subdirectory/*' Every directory named subdirectory that appears inside other directories.
    • ex. git filter-branch --index-filter 'git rm --cached --ignore-unmatch subdirectory/*.jpg' All jpg files in this subdirectory.
    • ex. git filter-branch --index-filter 'git rm --cached --ignore-unmatch subdirectory/*' All files in this subdirectory, and consequently the subdirectory itself, because git doesn’t track empty directories.
    • ex. git filter-branch --index-filter 'git rm --cached --ignore-unmatch subdirectory/**/subdirectory2/*' Every subdirectory2 contained within subdirectory.
  2. rm -rf .git/refs/original/ Removes the backup refs that filter-branch keeps.
  3. git reflog expire --expire=now --all Expires all reflog entries, so the old objects become unreachable.
  4. git fsck --full --unreachable Lists the objects that are now unreachable.
  5. git repack -A -d Repacks everything into a single pack and turns unreachable objects into loose ones.
  6. git gc --aggressive --prune=now Finally removes those objects.
  7. git push --force [remote] master You need a force push because the rewritten history no longer matches the remote's (the remote will sort of think you went back in time), so make sure you have pulled before you start all of this.
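The local part of the list above can be sketched as one shell function. This is a rough sketch rather than a polished tool: the name `purge_path` is hypothetical, it only runs steps 1, 2, 3, 5 and 6, and it assumes a clean working tree inside a throwaway clone. The check in step 4 and the force push in step 7 are left for you to run by hand.

```shell
# purge_path is a hypothetical helper name. Run it inside a fresh local clone.
# $1 is the path or glob to remove from the history of the current branch.
# Assumption: the pattern contains no single quote.
purge_path() {
    # Step 1: drop the path from every commit reachable from HEAD.
    # FILTER_BRANCH_SQUELCH_WARNING only silences the notice (and the pause)
    # that recent git versions add; older versions ignore it.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
        "git rm --cached --ignore-unmatch -q -- '$1'" || return 1
    # Step 2: remove the backup refs written by filter-branch.
    rm -rf .git/refs/original/
    # Step 3: expire every reflog entry so the old commits become unreachable.
    git reflog expire --expire=now --all
    # Step 5: repack everything into a single pack.
    git repack -A -d
    # Step 6: prune the objects that are no longer reachable.
    git gc --aggressive --prune=now
}
```

A call such as `purge_path 'subdirectory/*.jpg'` mirrors the second example of step 1.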

Git optimization of repo security

Yet to come!