Suppose that I have 10,000 XML files. Now suppose that I want to send them to a friend. Before sending them, I would like to compress them.
Method 1: Don’t compress them
Resulting Size: 62 MB Percent of initial size: 100%
Method 2: Zip every file individually and send him 10,000 zip files
for x in *.xml ; do echo "$x" ; zip "$x.zip" "$x" ; done
Resulting Size: 13 MB Percent of initial size: 20%
Method 3: Create a single zip containing 10,000 xml files
zip all.zip *.xml
Resulting Size: 12 MB Percent of initial size: 19%
Method 4: Concatenate the files into a single file & zip it
cat *.xml > oneFile.txt ; zip oneFile.zip oneFile.txt
Resulting Size: 2 MB Percent of initial size: 3%
- Why do I get such dramatically better results when I am just zipping a single file?
- I was expecting to get drastically better results using method 3 than method 2, but I don’t. Why?
- Is this behaviour specific to zip? If I tried using gzip would I get different results?
$ zip --version
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.
Currently maintained by E. Gordon. Please send bug reports to
the authors using the web page at www.info-zip.org; see README for details.
Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip,
as of above date; see http://www.info-zip.org/ for other sites.
Compiled with gcc 4.4.4 20100525 (Red Hat 4.4.4-5) for Unix (Linux ELF) on Nov 11 2010.
Zip special compilation options:
    USE_EF_UT_TIME (store Universal Time)
    SYMLINK_SUPPORT (symbolic links supported)
    LARGE_FILE_SUPPORT (can read and write large files on file system)
    ZIP64_SUPPORT (use Zip64 to store large files in archives)
    UNICODE_SUPPORT (store and read UTF-8 Unicode paths)
    STORE_UNIX_UIDs_GIDs (store UID/GID sizes/values using new extra field)
    UIDGID_NOT_16BIT (old Unix 16-bit UID/GID extra field not used)
    [encryption, version 2.91 of 05 Jan 2007] (modified for Zip 3)
Edit: Metadata
One answer suggests that the difference comes from the file system metadata that is stored in the zip. I don’t think that this can be the case. To test, I did the following:
for x in $(seq 10000) ; do touch "$x" ; done
zip allZip *
The resulting zip is 1.4MB. This means that there is still ~10 MB of unexplained space.
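The arithmetic behind that estimate can be sketched as follows. The sizes are the ones quoted in the question, with "MB" read as 10^6 bytes (an assumption). The 30-byte and 46-byte figures are the fixed-size parts of a ZIP local file header and central directory record; how the remaining bytes per entry divide between the file name (stored in both places) and extra fields such as timestamps and UID/GID is not measured here.

```shell
# Figures from the question; "MB" is assumed to mean 10^6 bytes.
entries=10000
empty_archive=1400000     # zip of 10,000 empty files (metadata only)
real_archive=12000000     # Method 3: one zip holding the 10,000 XML files

# Average metadata cost per archive entry.
per_entry=$((empty_archive / entries))

# Fixed-size parts of the ZIP format per entry:
# 30-byte local file header + 46-byte central directory record.
fixed=$((30 + 46))

# Space in the real archive that metadata cannot account for.
unexplained=$((real_archive - empty_archive))

echo "metadata per entry: $per_entry bytes ($fixed of them fixed-size headers)"
echo "not explained by metadata: $unexplained bytes"
```

So metadata accounts for roughly a tenth of the Method 3 archive, and the rest has to come from how the file contents themselves are compressed.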
Zip treats the contents of each file separately when compressing. Each file will have its own compressed stream. There is support within the compression algorithm (typically DEFLATE) to identify repeated sections. However, there is no support in Zip to find redundancy between files.
That’s why the archive is so much larger when the content is spread over multiple files: patterns that the files have in common are encoded again in every file’s compressed stream, instead of just once.
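A rough way to see this effect without any real data is sketched below. It uses gzip, which applies the same DEFLATE algorithm, as a stand-in for zip's per-entry compression. The sample files are invented for illustration, and gzip adds its own small header to every stream, so the absolute numbers will not match what zip produces.

```shell
# Sketch only: the sample files below are invented for illustration.
work=$(mktemp -d)

# 200 small XML files with identical structure and slightly different values.
i=1
while [ "$i" -le 200 ]; do
    printf '<?xml version="1.0"?>\n<content><element name="id">%s</element><element name="price">%s</element></content>\n' \
        "$i" "$((i * 7919))" > "$work/file$i.xml"
    i=$((i + 1))
done

# One compressed stream per file, as zip does for each archive entry.
separate=0
for f in "$work"/file*.xml; do
    n=$(gzip -c "$f" | wc -c)
    separate=$((separate + n))
done

# One compressed stream over the concatenated files, as in Method 4.
combined=$(( $(cat "$work"/file*.xml | gzip -c | wc -c) ))

echo "separate streams: $separate bytes"
echo "single stream:    $combined bytes"
rm -rf "$work"
```

The single stream should come out far smaller, because after the first file almost every tag is a back-reference to text the compressor has already seen.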
ZIP compression is based on repetitive patterns in the data to be compressed, and the compression gets better the longer the file is, as more and longer patterns can be found and used.
Simplified: if you compress each file separately, the dictionary that maps (short) codes to (longer) patterns has to be built up again for every file in the zip; if you zip one long file, the dictionary is ‘reused’ and grows even more effective across all content.
If your files are even a bit similar (as text always is), re-use of the ‘dictionary’ becomes very efficient, and the result is a much smaller total zip.
In Zip, each file is compressed separately. The opposite is ‘solid compression’, where files are compressed together as a single stream. 7-Zip’s 7z format uses solid compression by default, and RAR offers it as an option. Gzip and Bzip2 can only compress a single stream, so Tar is used first to bundle the files, which has the same effect as solid compression.
As the XML files have a similar structure and probably similar content, compressing them together gives a much better compression ratio.
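A minimal sketch of the Tar-then-Gzip approach, compared with gzipping every file on its own. The directory name docs, the file names and the file contents are all made up for the example; only standard tar, gzip, wc and grep options are used.

```shell
# Sketch only: directory name, file names and contents are hypothetical.
work=$(mktemp -d)
mkdir "$work/docs"

# 100 XML files with the same structure and different values.
i=1
while [ "$i" -le 100 ]; do
    printf '<content><element name="a">%s</element><element name="b">%s</element></content>\n' \
        "$i" "$((i * 31))" > "$work/docs/doc$i.xml"
    i=$((i + 1))
done

# Non-solid: every file gets its own gzip (DEFLATE) stream.
per_file=0
for f in "$work"/docs/*.xml; do
    n=$(gzip -c "$f" | wc -c)
    per_file=$((per_file + n))
done

# Solid: tar joins the files (plus its headers), gzip compresses the result once.
tar -cf - -C "$work" docs | gzip -c > "$work/docs.tar.gz"
solid=$(( $(wc -c < "$work/docs.tar.gz") ))

# The archive still contains the individual files.
count=$(gzip -dc "$work/docs.tar.gz" | tar -tf - | grep -c '\.xml$')

echo "per-file gzip: $per_file bytes"
echo "tar + gzip:    $solid bytes ($count files inside)"
rm -rf "$work"
```

Unlike Method 4, the tar archive keeps file names and boundaries, so the recipient can unpack the original 10,000 files again.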
For example, suppose a file contains the string "<content><element name=". If the compressor has already found that string in another file, it will replace it with a small pointer to the previous match. If the compressor doesn’t use ‘solid compression’, the first occurrence of the string in each file has to be recorded as a literal, which is larger.
Zip doesn’t just store the contents of the file, it also stores file metadata like the owning user id, permissions, creation and modification times and so on. If you have one file you have one set of metadata; if you have 10,000 files you have 10,000 sets of metadata.