Data compression algorithms and tools
This list covers the most popular data compression algorithms and tools. All of them are free and open source, which matters if you want to preserve data for a long time and still be able to decompress it in the future.
General purpose compression
7z
Gzip
- http://www.gzip.org/
- Not strong but fast and very widely supported.
- Pre-installed on pretty much every Linux computer.
Bzip2
- http://www.bzip.org/
- Stronger than gzip but much slower.
- Allows recovering undamaged parts of damaged archives using the bzip2recover utility, making it possible to merge undamaged parts of copies together. Given the structure of the TAR format (often used together with bzip2), this allows recovering the surviving parts of archives whose beginning or end is missing.
xz
- Website
- An established high-end compressor in the Linux world. For example, kernel.org provides builds in the .tar.xz format as of 2025.
- Based on LZMA SDK
- The developer of lzip recommends against using it for long-term archival. For example, it lacks any means of error recovery (see section 2.12 in source).[1]
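The strength claims above for gzip, bzip2, and xz are easy to check on your own data, since Python's standard library ships a module for each. The following is only a minimal sketch: the function name compressed_sizes and the sample text are made up for illustration, each codec runs at the library's default level, and results depend heavily on the input.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data: bytes) -> dict[str, int]:
    """Return the compressed size in bytes for each stdlib codec (default levels)."""
    codecs = {
        "gzip": gzip.compress,
        "bzip2": bz2.compress,
        "xz": lzma.compress,  # lzma.compress writes the .xz container by default
    }
    return {name: len(compress(data)) for name, compress in codecs.items()}


if __name__ == "__main__":
    # Hypothetical sample input; substitute a file you actually want to archive.
    sample = b"The quick brown fox jumps over the lazy dog.\n" * 20000
    for name, size in compressed_sizes(sample).items():
        print(f"{name:6s} {size:8d} bytes ({size / len(sample):.2%} of original)")
```

Timing each call as well (for example with time.perf_counter) would show the speed side of the trade-off.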
lzip
- Website
- Claims to be stronger yet faster than bzip2.
- Not as widely supported as xz, bzip2, and gzip.
- Well-defined file format with an emphasis on file integrity.
- lziprecover can correct some bit-flip errors and merge damaged copies.
Zip
- Natively supported by all major operating systems. Included out of the box since Windows XP.
- Weaker compression than .tar.gz and .tar.bz2 due to lack of solid compression. However,
Zstandard
- https://facebook.github.io/zstd/
- Very efficient in both time and compression ratio.
- First-class support for custom dictionaries, which is particularly useful when compressing many small data units (e.g. WARC file with many HTML pages from one particular website). Using a trained dictionary for the compression massively improves the compression ratio in such scenarios.
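The effect of a shared dictionary on many small units can be sketched without Zstandard itself. The example below swaps in zlib's preset-dictionary feature (the zdict parameter) because it is in Python's standard library. This is not Zstandard's trained-dictionary mechanism, only the same idea in a simpler form. The HTML snippets, the hand-picked dictionary, and the helper names are all invented for illustration.

```python
import zlib

# Hypothetical boilerplate shared by every page of an imaginary website.
BOILERPLATE = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8">'
    b'<link rel="stylesheet" href="/static/site.css"></head>'
    b'<body><div class="nav"><a href="/">Home</a><a href="/about">About</a></div>'
)
PAGES = [
    BOILERPLATE + b"<h1>Page %d</h1><p>Entry number %d.</p></body></html>" % (i, i)
    for i in range(100)
]


def compress_unit(data: bytes, zdict: bytes = b"") -> bytes:
    """Deflate one small unit, optionally primed with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def decompress_unit(blob: bytes, zdict: bytes = b"") -> bytes:
    """Inverse of compress_unit; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


def total_size(pages: list[bytes], zdict: bytes = b"") -> int:
    """Sum of compressed sizes when every page is compressed independently."""
    return sum(len(compress_unit(page, zdict)) for page in pages)


if __name__ == "__main__":
    print("without dictionary:", total_size(PAGES))
    print("with dictionary:   ", total_size(PAGES, BOILERPLATE))
```

With Zstandard, the dictionary would normally be trained from sample data rather than picked by hand, and it has to be archived alongside the compressed units, since they cannot be decompressed without it.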
RAR
The compression method is proprietary, but open-source tools for reading from RAR archives exist (unrar, 7z, unar), so there is a good chance of long-term support. 7z also supports extracting from the RAR 5 format introduced by RARlab in 2013. The main improvement of RAR 5 is its much higher dictionary size limit.
RAR has special features such as archive comments and parity data, but third-party tools (a text editor, Parchive) can serve the same purposes for other formats if necessary.
Heavy duty compression
These programs often use large amounts of memory to get the best possible compression ratio.
lrzip
"This is a compression program optimised for large files" -lrzip readme
lrzip is well suited to archiving: the compression ratio improves as the input file grows, although it is a terribly slow compressor. It really shines on large data sets containing redundant information that is far apart and otherwise unconnected. General-purpose compression algorithms never see this redundancy because of their small compression windows.
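The window-size effect can be illustrated with Python's standard library alone. This sketch does not use lrzip. It compares zlib, whose window is limited to 32 KiB, with lzma, which uses a dictionary of several megabytes at its default preset and stands in here for a long-range compressor. The input is one block of random bytes repeated once, so the only redundancy is a copy 256 KiB away. The function name and block size are arbitrary choices for the example.

```python
import lzma
import os
import zlib

BLOCK = 256 * 1024  # distance between the two copies, far beyond zlib's 32 KiB window


def long_range_demo(block_size: int = BLOCK) -> tuple[int, int]:
    """Return (zlib_size, lzma_size) for a random block followed by an exact copy."""
    block = os.urandom(block_size)  # incompressible on its own
    data = block + block            # the only redundancy is the distant repeat
    small_window = len(zlib.compress(data, 9))
    large_window = len(lzma.compress(data))
    return small_window, large_window


if __name__ == "__main__":
    z, x = long_range_demo()
    print(f"input: {2 * BLOCK} bytes, zlib: {z} bytes, lzma: {x} bytes")
```

zlib cannot refer back to the first copy and stores both almost verbatim, while lzma encodes the second copy as references to the first, so its output is roughly half the size.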
ZPAQ
- http://mattmahoney.net/dc/zpaq.html
- Uses deduplication, journaling, and several different compression algorithms (LZ77, BWT, and PAQ context mixing)
- Supported by lrzip
- EXTREMELY slow
KGB
Uses the PAQ6 compression algorithm. Excellent compression ratio (better than 7z), but a bit slow.
You can install it in Ubuntu with: sudo apt-get install kgb
How to:
- kgb -m file.kgb originalfile
- m is a number from 0 to 9, from lowest to highest compression ratio; the highest setting uses 1616 MB of RAM and a lot of CPU and time.
Not recommended
LZO
LZO is a format best avoided. Its developer, Markus Oberhumer, has excluded his site (Oberhumer.com), which hosts the source code, from the Wayback Machine. In contrast to the developers of lzip, Oberhumer clearly is not interested in having the source code for his format preserved.
While we can't prevent people from requesting exclusions from the Wayback Machine, we can distrust them and avoid using their software.
In addition, Oberhumer sells a proprietary compression format called "LZO Professional", advertised as offering an improved compression ratio and speed,[2] but without stating which format it is compared to. It is unclear how it compares to existing freely licensed open-source formats.
Because "LZO Professional" is both obscure and proprietary, it is prone to digital obsolescence, and using it is strongly discouraged.
StuffIt
The StuffIt format (.sit) is not recommended because it is proprietary and hardly supported outside Mac OS.
StuffIt provides extraction support for Windows through the freeware tool "StuffIt Expander". Software to extract StuffIt archives on Linux has been made in the past, but is unstable.[3]
References
- [1] Xz format inadequate for long-term archiving - Antonio Diaz Diaz
- [2] oberhumer.com: LZO Professional real-time data compression library
- [3] Stuffit Archives - LinuxMafia.com
External links
- http://en.wikibooks.org/wiki/Guide_to_Unix/Commands/File_Compression
- Matt Mahoney's Large Text Compression Benchmark
- Compression file formats