What Are You Testing?

I’m testing how well a variety of popular archive formats compress different types of data. This test was inspired by my need to keep backups of various websites and projects that I work on (or have worked on in the past). For my purposes the amount of time that it takes to compress or uncompress an archive is irrelevant: the tools are all broadly comparable when you don’t need to create archives that often, or when you are using them as part of an automated backup script that runs overnight.

Why Don’t You Just Read An Existing Article?

I have, and I will be linking a few in this post. However, I think you always learn more by doing something yourself than by just reading up. Additionally, I will be using some real world data here that matches my own specific use case (Tests 1 & 2 below); the other tests are merely to satisfy my own curiosity.

The Test Files

| Test | Total Size (Uncompressed) | Description |
| --- | --- | --- |
| Test 1 | 214477 items, totalling 7.1 GiB | A large WordPress Website |
| Test 2 | 14377 items, totalling 583.8 MiB | A small WordPress Website |
| Test 3 | 64 items, totalling 75.3 MiB | A variety of image files |
| Test 4 | 384 items, totalling 1.6 GiB | A variety of audio files |
| Test 5 | 43 items, totalling 25.0 MiB | A variety of text files |

If you don’t know the difference between a MB and a MiB or a GB and a GiB check here.

Test File 1 (Large WordPress Website)

This is an actual WordPress website used by one of my clients. It’s a news site and has data relating to multiple posts per day going back several years. It includes all of the WordPress PHP files, theme and plugin files, the media library, and various miscellaneous files such as database dumps and node modules used for my build tools (the files were copied from my local dev version of the site).

Test File 2 (Small WordPress Website)

This is a much smaller WordPress website with a limited number of posts but as with the larger site includes all the WordPress, Theme and Plugin files, media files and some other miscellaneous files. It includes a database dump but no node modules.

Test File 3 (Image Files)

This is a directory containing images, mostly 1920×1080 jpgs and a few pngs.

Test File 4 (Music Files)

This is a directory full of music. It contains mostly a mix of mp3, mp4a and wma files but also has some album art saved as jpgs.

Test File 5 (Text Documents)

This is a directory containing various large txt files and some other text-based formats (html, c, lsp, etc).

The Setup

Preparation

Some of the compression tools below only support individual files as inputs; others are capable of handling entire directories by themselves. In order to keep things fair I used tar to create a single file for each test. The command used is below.

tar -cf test-directory.tar test-directory

This actually led me to my first discovery, which is that just using `tar` can reduce file sizes. After a little research this appears to be due to the file system’s cluster size: every file occupies a whole number of clusters on disk, so lots of small files waste space that a single tar file does not. You can read more here. In my tests this reduced the size by up to 4% when dealing with a large number of files, some of which may be quite small (see the table below).

| Test | Before Tar (KiB) | After Tar (KiB) | % Reduction |
| --- | --- | --- | --- |
| Test 1 | 7425124 | 7103404 | 4.3% |
| Test 2 | 603484 | 581076 | 3.7% |
| Test 3 | 73888 | 73572 | 0.4% |
| Test 4 | 1578440 | 1577844 | 0.0% |
| Test 5 | 24536 | 24460 | 0.3% |

If you don’t know the difference between a KB and a KiB check here.
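The figures in the table above can be reproduced with a couple of small shell helpers. This is only a sketch: the function names `pct_reduction` and `size_kib` are my own, and I’m assuming the sizes were measured as disk usage in KiB with something like `du`, which the post does not actually state.

```shell
# Percentage reduction from a "before" size to an "after" size.
# Both sizes must be in the same unit (KiB here).
pct_reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

# Disk usage of a file or directory in KiB.
# -s prints a single total, -k reports in 1024-byte units.
size_kib() {
    du -sk "$1" | cut -f1
}
```

For example, `pct_reduction 7425124 7103404` prints 4.3, matching the Test 1 row. To measure a real directory you could run `pct_reduction "$(size_kib test-directory)" "$(size_kib test-directory.tar)"`.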

Commands Used

zip -9 -r test-directory.tar.zip test-directory.tar
bzip2 -9 -c test-directory.tar > test-directory.tar.bz2
rar -m5 a test-directory.rar test-directory.tar
7z -mx=9 a test-directory.tar.7z test-directory.tar
gzip -9 -c test-directory.tar > test-directory.tar.gz
xz -9 -c test-directory.tar > test-directory.tar.xz
lzip -9 -c test-directory.tar > test-directory.tar.lz

Command Notes

One thing I learned quite quickly doing these tests is that most of the different compression tools available use a similar syntax. In the list of commands above you will see that most use -9, which tells the tool to use the maximum level of compression it supports. Most of them seem to use 6 by default but there are some exceptions:

  • bzip2 uses 9 by default
  • rar uses 3 by default but only goes as high as 5
  • 7zip uses 5 by default

If you’re wondering about the -c, that’s because some of the commands delete the original tar file by default. The -c sends the compressed output to stdout instead, so that we can redirect it to a separate file and keep the original.
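As a small illustration of the point about -c, here is a sketch using gzip (the function name `compress_keep` is just something I made up for the example):

```shell
# Compress "$1" with gzip at level 9, leaving the original in place.
# -c writes the compressed stream to stdout instead of replacing the
# input file, so the shell redirect decides where the archive goes.
compress_keep() {
    gzip -9 -c "$1" > "$1.gz"
}

# Alternative: reasonably recent versions of gzip, and also bzip2, xz
# and lzip, accept -k (--keep) to keep the input file, e.g.
#   gzip -9 -k test-directory.tar
```

After running `compress_keep test-directory.tar` you should have both test-directory.tar and test-directory.tar.gz, whereas a plain `gzip -9 test-directory.tar` would leave only the .gz file.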

It should also be noted that tar supports several different compression formats, as does 7zip, so the commands above are not necessarily the ones you would normally run. I think they provide a clearer picture of what’s happening though.
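For reference, this is roughly what the one-step versions look like when tar handles the compression itself. It is a sketch rather than what I ran for these tests: the wrapper function `tar_compressed` is hypothetical, and it assumes a tar that supports the -z, -j and -J options (GNU tar does) with gzip, bzip2 and xz installed.

```shell
# Create a compressed tarball of directory "$2" in one step.
# "$1" picks the format: gz (gzip), bz2 (bzip2) or xz.
tar_compressed() {
    case "$1" in
        gz)  tar -czf "$2.tar.gz"  "$2" ;;
        bz2) tar -cjf "$2.tar.bz2" "$2" ;;
        xz)  tar -cJf "$2.tar.xz"  "$2" ;;
        *)   echo "unknown format: $1" >&2; return 1 ;;
    esac
}
```

So `tar_compressed xz test-directory` would produce test-directory.tar.xz directly. Note that this uses each tool’s default compression level rather than -9, so the resulting sizes would not be directly comparable with the tables in this post. For xz, options can be passed through the XZ_OPT environment variable, for example `XZ_OPT=-9 tar -cJf test-directory.tar.xz test-directory`.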

Finally, some general notes for those who are not familiar: zip is a very old and well supported compression format. bzip2, gzip and rar are also quite popular and have been around for a while. 7z uses something called LZMA by default (LZMA2 in more recent versions of 7-Zip), which is supposed to provide better compression in terms of both file sizes and time taken to compress / decompress (which I’m not measuring here). lzip also uses LZMA, and xz uses LZMA2.

The Results

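All of the results below are sorted by % Reduction (best to worst). Note that the % Reduction figures in the tables work out relative to the size of the original directory before tar (the Before Tar column in the table above), not the size of the tar file. For example, in Test 1 xz gives (7425124 − 5847656) / 7425124 ≈ 21.25%. The tar file size is shown with each table for reference.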
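If you want to build a table like the ones below for your own data, something along these lines should work. It is a sketch under a few assumptions: the function name `report_reduction` is made up, sizes are taken as disk usage in KiB via `du -k`, and the baseline is whichever size you want to measure the reduction against, passed in by hand.

```shell
# Print "reduction% size_kib name" for each archive, best first.
# $1 is the baseline size in KiB; the remaining arguments are archives.
report_reduction() {
    baseline="$1"
    shift
    for f in "$@"; do
        size=$(du -k "$f" | cut -f1)
        awk -v b="$baseline" -v s="$size" -v n="$f" \
            'BEGIN { printf "%.2f%% %d %s\n", (b - s) * 100 / b, s, n }'
    done | sort -rn
}
```

Usage would look like `report_reduction 7425124 test-directory.tar.*`, which prints one line per archive with the largest reduction at the top.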

Test 1 Results

Original Size (After Tar): 7103404 KiB

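| Tool | Compressed Size (KiB) | % Reduction |
| --- | --- | --- |
| Xz | 5847656 | 21.25% |
| 7Zip | 5900968 | 20.53% |
| Rar | 5900968 | 19.41% |
| Lzip | 5993568 | 19.28% |
| Bzip2 | 6462836 | 12.96% |
| Gzip | 6462836 | 12.96% |
| Zip | 6487024 | 12.63% |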

Test 2 Results

Original Size (After Tar): 581076 KiB

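| Tool | Compressed Size (KiB) | % Reduction |
| --- | --- | --- |
| Xz | 323032 | 46.47% |
| 7Zip | 324080 | 46.30% |
| Lzip | 326136 | 45.96% |
| Rar | 327156 | 45.79% |
| Bzip2 | 437892 | 27.44% |
| Gzip | 443720 | 26.47% |
| Zip | 443724 | 26.47% |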

Test 3 Results

Original Size (After Tar): 73572 KiB

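| Tool | Compressed Size (KiB) | % Reduction |
| --- | --- | --- |
| Gzip | 73308 | 0.78% |
| Zip | 73308 | 0.78% |
| 7Zip | 73320 | 0.77% |
| Xz | 73328 | 0.76% |
| Rar | 73440 | 0.61% |
| Bzip2 | 73472 | 0.56% |
| Lzip | 73692 | 0.27% |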

Test 4 Results

Original Size (After Tar): 1577844 KiB

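| Tool | Compressed Size (KiB) | % Reduction |
| --- | --- | --- |
| Lzip | 1445780 | 8.40% |
| Rar | 1445908 | 8.40% |
| Xz | 1446616 | 8.35% |
| 7Zip | 1446856 | 8.34% |
| Gzip | 1448432 | 8.24% |
| Zip | 1448432 | 8.24% |
| Bzip2 | 1453024 | 7.95% |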

Test 5 Results

Original Size (After Tar): 24460 KiB

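| Tool | Compressed Size (KiB) | % Reduction |
| --- | --- | --- |
| Lzip | 5964 | 75.69% |
| Xz | 5972 | 75.66% |
| 7Zip | 5972 | 75.66% |
| Bzip2 | 6208 | 74.70% |
| Rar | 6468 | 73.64% |
| Gzip | 7744 | 68.44% |
| Zip | 7744 | 68.44% |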

Conclusion

The results varied a bit depending on file type. Honestly, some of the tests (Test 3 and Test 5) could probably have used more data to deliver clearer results. That said, my main test cases were Test 1 & Test 2, which used real world data of the type that I would like to compress, and I think these delivered clear results. I also learned a few useful things in the process, such as how using tar alone can reduce the file size and that most compression commands support different compression levels, the highest of which is usually 9.

Best Commands

7zip

For my purposes it looks like xz and 7zip are the clear winners. On Tests 1 and 2 they both reduce the file size by roughly 8 to 20 percentage points more than the older formats like zip and gzip. Based on the cross platform support and reasonably widespread usage of 7zip, I think that’s the one that I shall default to in future.

xz

If you’re considering using xz you should be aware that there is some debate (here is the discussion/rebuttal) about how suitable it is as an archive format. How much you care about this probably depends on who you are. If you’re working for AWS and you need to find an archive format that delivers great compression because you’re dealing with pebibytes of data, and you need to make sure it’s 100% reliable so that it never fails, then you probably need to spend more time reading up and doing your own tests. If you’re just a random developer looking to save yourself a few gibibytes of storage every month, and you have multiple different backup strategies for redundancy like myself, then you can probably afford a small amount of risk and don’t need to worry that much.

Additional Notes

Background Information

You can find some summary-level background information on zip, bzip2, tar+gzip, rar & 7zip at the link below.

Best File Formats for Archiving – Compression formats

A More Detailed Comparison

If you want a more in-depth comparison that also covers things like compression and decompression time, memory usage etc. then you might find the following article interesting. Note that in the linked article a lower percentage in the compression ratio represents a bigger reduction in file size. In other words it’s the opposite of this article, where I’m looking at a percentage reduction as opposed to a percentage of the original file size.

Quick Benchmark: Gzip vs Bzip2 vs LZMA vs XZ vs LZ4 vs LZO

Another Short Compression Comparison

Just for good measure, here is another short compression comparison from back in 2009 that I found interesting.

Linux Compression Comparison (GZIP vs BZIP2 vs LZMA vs ZIP vs Compress)

