What Are You Testing?
I’m testing how well a variety of popular archive types compress various types of data. This test was inspired by my need to keep backups of the websites and projects that I work on (or have worked on in the past). For my purposes the time it takes to compress or uncompress an archive is irrelevant: the tools are all broadly comparable when you don’t need to create archives that often, or when you are using them as part of an automated backup script that runs overnight.
Why Don’t You Just Read An Existing Article?
I have, and I will be linking to a few in this post. However, I think you always learn more by doing something yourself than by just reading up. Additionally, I will be using some real world data here that matches my own specific use case (Tests 1 & 2 below); the other tests are merely to satisfy my own curiosity.
The Test Files
Test | Total Size (Uncompressed) | Description |
---|---|---|
Test 1 | 214477 items, totalling 7.1 GiB | A large WordPress Website |
Test 2 | 14377 items, totalling 583.8 MiB | A small WordPress Website |
Test 3 | 64 items, totalling 75.3 MiB | A variety of image files |
Test 4 | 384 items, totalling 1.6 GiB | A variety of audio files |
Test 5 | 43 items, totalling 25.0 MiB | A variety of text files |
Test File 1 (Large WordPress Website)
This is an actual WordPress website used by one of my clients. It’s a news site and has data relating to multiple posts per day going back several years. It includes all of the WordPress php files, theme and plugin files, the media library, and various miscellaneous files such as database dumps and the node modules used for my build tools (the files were copied from my local dev version of the site).
Test File 2 (Small WordPress Website)
This is a much smaller WordPress website with a limited number of posts but, as with the larger site, it includes all the WordPress, theme and plugin files, media files and some other miscellaneous files. It includes a database dump but no node modules.
Test File 3 (Image Files)
This is a directory containing images, mostly 1920×1080 jpgs and a few pngs.
Test File 4 (Music Files)
This is a directory full of music. It contains mostly a mix of mp3, mp4a and wma files but also has some album art saved as jpgs.
Test File 5 (Text Documents)
This is a directory containing various large txt files and some other text based formats (html, c, lsp, etc).
The Setup
Preparation
Some of the compression tools below only support individual files as inputs; others are capable of handling entire directories by themselves. In order to keep things fair I used `tar` to create a single file for each test. The command used is below.
tar -cf test-directory.tar test-directory
This actually led me to my first discovery, which is that just using `tar` can reduce the space the files take up. After a little research this appears to be due to the file system’s cluster size: every file occupies a whole number of clusters on disk, so a directory of many small files wastes space that a single tar file does not. You can read more here. In my tests this reduced the size by up to 4.3% when dealing with a large number of files, some of which may be quite small; see the table below.
Test | Before Tar (KiB) | After Tar (KiB) | % Reduction |
---|---|---|---|
Test 1 | 7425124 | 7103404 | 4.3% |
Test 2 | 603484 | 581076 | 3.7% |
Test 3 | 73888 | 73572 | 0.4% |
Test 4 | 1578440 | 1577844 | 0.0% |
Test 5 | 24536 | 24460 | 0.3% |
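The cluster-size effect can be illustrated with a little arithmetic. The sketch below is not how the figures above were measured. It assumes a 4096-byte cluster (a common default, but check your own file system) and uses a made-up helper called `cluster_slack` that compares the real total of a list of file sizes with the total once each file is rounded up to whole clusters.

```shell
# cluster_slack CLUSTER_BYTES: read one file size (in bytes) per line on stdin
# and print "<actual total> <allocated total>", where the allocated total
# rounds each file up to a whole number of clusters.
# Hypothetical helper, for illustration only.
cluster_slack() {
    awk -v c="$1" '
        { actual += $1; alloc += int(($1 + c - 1) / c) * c }
        END { printf "%.0f %.0f\n", actual, alloc }
    '
}

# Three example files of 1, 4096 and 4097 bytes on assumed 4096-byte clusters.
printf '1\n4096\n4097\n' | cluster_slack 4096
# prints: 8194 16384
```

On a system with GNU find you could feed it real sizes with `find test-directory -type f -printf '%s\n' | cluster_slack 4096`. Bear in mind that tar adds a header for each file and pads to 512-byte blocks, so the tar file will still be somewhat larger than the actual total.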
Commands Used
zip -9 -r test-directory.tar.zip test-directory.tar
bzip2 -9 -c test-directory.tar > test-directory.tar.bz2
rar -m5 a test-directory.rar test-directory.tar
7z -mx=9 a test-directory.tar.7z test-directory.tar
gzip -9 -c test-directory.tar > test-directory.tar.gz
xz -9 -c test-directory.tar > test-directory.tar.xz
lzip -9 -c test-directory.tar > test-directory.tar.lz
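If you want to repeat this on your own data, the four tools that share the `-9 -c` syntax can be run in a loop. This is only a sketch: `compress_all` is a name I have made up, it skips any tool that is not installed, and it leaves out zip, rar and 7z because their command syntax is different. It reports exact byte counts rather than KiB.

```shell
# compress_all FILE: run each stream compressor found on PATH at level 9,
# writing FILE.<ext> next to the input and printing "tool bytes" for each.
# Hypothetical helper, for illustration only.
compress_all() {
    for spec in gzip:gz bzip2:bz2 xz:xz lzip:lz; do
        tool=${spec%%:*}   # part before the colon
        ext=${spec##*:}    # part after the colon
        if command -v "$tool" >/dev/null 2>&1; then
            "$tool" -9 -c "$1" > "$1.$ext"
            # wc -c gives the size of the compressed file in bytes
            printf '%s %s\n' "$tool" "$(wc -c < "$1.$ext" | tr -d ' ')"
        fi
    done
}

# Demo on a small throwaway file rather than a real archive.
sample=$(mktemp)
for i in 1 2 3 4 5 6 7 8 9 10; do echo "the same line of text"; done > "$sample"
compress_all "$sample"
```

For a real run you would call `compress_all test-directory.tar` instead of using the throwaway file.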
Command Notes
One thing I learned quite quickly doing these tests is that most of the different compression tools available use a similar syntax. In the list of commands above you will see that most use `-9`, which tells the tool to use the maximum level of compression it supports. Most of them seem to use 6 by default but there are some exceptions:
- bzip2 uses 9 by default
- rar uses 3 by default but only goes as high as 5
- 7zip uses 5 by default
If you’re wondering about the `-c`, that’s because some of the commands delete the original tar file by default. With `-c` the output is written to stdout instead, which leaves the original in place and lets us redirect the output to a separate file.
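A quick way to see the difference for yourself is to compress two throwaway files, one with `-c` and one without. This sketch uses gzip only, and the file names are placeholders rather than real tar archives.

```shell
workdir=$(mktemp -d)
printf 'example data\n' > "$workdir/a.tar"   # stand-in files, not real archives
printf 'example data\n' > "$workdir/b.tar"

# Without -c, gzip replaces a.tar with a.tar.gz.
gzip -9 "$workdir/a.tar"

# With -c, the compressed data goes to stdout and b.tar is left in place.
gzip -9 -c "$workdir/b.tar" > "$workdir/b.tar.gz"

ls "$workdir"
```

Reasonably recent versions of gzip, bzip2 and xz also accept `-k` to keep the input file, which is an alternative to redirecting stdout.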
It should also be noted that `tar` supports several different compression formats, as does `7zip`, so the commands above are not necessarily the ones you would normally run, but I think they provide a clearer picture of what’s happening.
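For comparison, here is roughly what the one-step version looks like when tar does the compression itself. This is a sketch on a throwaway directory, not the command used for the tests. Note that it uses the compression tool’s default level rather than `-9`.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/test-directory"
printf 'example\n' > "$workdir/test-directory/file.txt"

# -z passes the archive through gzip; -j (bzip2) and -J (xz) work the same way.
# -C changes into workdir first so the archive holds relative paths.
tar -czf "$workdir/test-directory.tar.gz" -C "$workdir" test-directory

# List the contents to confirm the archive is readable.
tar -tzf "$workdir/test-directory.tar.gz"
```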
Finally, just some general notes for those who are not familiar: zip is a very old and well supported compression format. bzip2, gzip and rar are also quite popular and have been around for a while. 7z uses something called LZMA by default, which is supposed to provide better compression in terms of both file sizes and the time taken to compress / decompress (which I’m not measuring here). lzip also uses LZMA, and xz uses LZMA2.
The Results
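All of the results below are sorted by % Reduction (best to worst). Each table lists the size of the tar file for reference, but the % Reduction figures work out relative to the size of the original directory before tar (the Before Tar column in the table above), so they include the small saving from tar itself.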
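The percentage arithmetic behind the tables can be sketched as follows. The helper name `reduction_table` is made up for illustration, and the baseline is a parameter so that either the tar size or the pre-tar size can be used. The example uses four of the Test 5 sizes with the Before Tar figure (24536 KiB) as the baseline, which is an assumption on my part but one that reproduces the published percentages.

```shell
# reduction_table BASELINE_KIB: read "name size_kib" lines on stdin and print
# "name size_kib pct_reduction", best reduction first.
# Hypothetical helper, for illustration only.
reduction_table() {
    awk -v base="$1" '{ printf "%s %d %.2f\n", $1, $2, (base - $2) / base * 100 }' |
        LC_ALL=C sort -k3,3nr
}

# Four of the Test 5 sizes, deliberately given out of order.
reduction_table 24536 <<'EOF'
Zip 7744
Rar 6468
Bzip2 6208
Lzip 5964
EOF
# prints:
# Lzip 5964 75.69
# Bzip2 6208 74.70
# Rar 6468 73.64
# Zip 7744 68.44
```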
Test 1 Results
Original Size (After Tar): 7103404 KiB
Tool | Size KiB (Compressed) | % Reduction |
---|---|---|
Xz | 5847656 | 21.25% |
7Zip | 5900968 | 20.53% |
Rar | 5900968 | 19.41% |
Lzip | 5993568 | 19.28% |
Bzip2 | 6462836 | 12.96% |
Gzip | 6462836 | 12.96% |
Zip | 6487024 | 12.63% |
Test 2 Results
Original Size (After Tar): 581076 KiB
Tool | Size KiB (Compressed) | % Reduction |
---|---|---|
Xz | 323032 | 46.47% |
7Zip | 324080 | 46.30% |
Lzip | 326136 | 45.96% |
Rar | 327156 | 45.79% |
Bzip2 | 437892 | 27.44% |
Gzip | 443720 | 26.47% |
Zip | 443724 | 26.47% |
Test 3 Results
Original Size (After Tar): 73572 KiB
Tool | Size KiB (Compressed) | % Reduction |
---|---|---|
Gzip | 73308 | 0.78% |
Zip | 73308 | 0.78% |
7Zip | 73320 | 0.77% |
Xz | 73328 | 0.76% |
Rar | 73440 | 0.61% |
Bzip2 | 73472 | 0.56% |
Lzip | 73692 | 0.27% |
Test 4 Results
Original Size (After Tar): 1577844 KiB
Tool | Size KiB (Compressed) | % Reduction |
---|---|---|
Lzip | 1445780 | 8.40% |
Rar | 1445908 | 8.40% |
Xz | 1446616 | 8.35% |
7Zip | 1446856 | 8.34% |
Gzip | 1448432 | 8.24% |
Zip | 1448432 | 8.24% |
Bzip2 | 1453024 | 7.95% |
Test 5 Results
Original Size (After Tar): 24460 KiB
Tool | Size KiB (Compressed) | % Reduction |
---|---|---|
Lzip | 5964 | 75.69% |
Xz | 5972 | 75.66% |
7Zip | 5972 | 75.66% |
Bzip2 | 6208 | 74.70% |
Rar | 6468 | 73.64% |
Gzip | 7744 | 68.44% |
Zip | 7744 | 68.44% |
Conclusion
The results varied a bit depending on file type. Honestly, some of the tests (Test 3 and Test 5) could probably have used more data to deliver clearer results. That said, my main test cases were Test 1 & Test 2, which used real world data of the type that I would like to compress, and I think these delivered clear results. I also learned a few useful things in the process (such as how using tar alone can reduce the file size, and that most compression commands support different compression levels, the highest of which is usually 9).
Best Commands
7zip
For my purposes it looks like xz and 7zip are the clear winners. In Tests 1 and 2 they both reduced the size by roughly 8 to 20 percentage points more than the older formats like zip and gzip. Based on the cross platform support and reasonably widespread usage of 7zip I think that’s the one that I shall default to in future.
xz
If you’re considering using xz you should be aware that there is some debate (here is the discussion/rebuttal) about how suitable it is as an archive format. How much you care about this probably depends on who you are. If you’re working for AWS and need an archive format that delivers great compression because you’re dealing with pebibytes of data, and it has to be 100% reliable, then you probably need to spend more time reading up and doing your own tests. If you’re just a random developer looking to save yourself a few gibibytes of storage every month, and you have multiple different backup strategies for redundancy like I do, then you can probably afford a small amount of risk and don’t need to worry that much.
Additional Notes
Background Information
You can find some high-level background information on zip, bzip2, tar+gzip, rar & 7zip at the link below.
Best File Formats for Archiving – Compression formats
A More Detailed Comparison
If you want a more in-depth comparison that also covers things like compression and decompression time, memory usage etc, then you might find the following article interesting. Note that in the linked article a lower percentage in the compression ratio represents a bigger reduction in file size (in other words it’s the opposite of this article, where I’m looking at a percentage reduction as opposed to a percentage of the original file size).
Quick Benchmark: Gzip vs Bzip2 vs LZMA vs XZ vs LZ4 vs LZO
Another Short Compression Comparison
Just for good measure, here is another short compression comparison from back in 2009 that I found interesting.
Linux Compression Comparison (GZIP vs BZIP2 vs LZMA vs ZIP vs Compress)