2006-01-15

Compression and archive formats: A description

recently, I went to Apple's games site and downloaded a bunch of puzzle/arcade games. I was dismayed to find that many of these games were distributed in formats other than UDIF (the .dmg format). so clearly this matter bears some explanation.

this is going to be a two-part post. part 1 follows. part 2, which I will probably post tomorrow, will cover why you should use a plain dmg (no StuffIt or *zip sprinkles) to distribute your software. this was originally one long post, but I Kill Billed it.

UPDATE 2006-01-15: changed the initials to capital letters (sentence case), so it reads better; changed to smart quotes. also added mention of Compact Pro and a link to ESR's article on The Art of UNIX Programming.

UPDATE 2006-02-17: linkified Compact Pro.

read on…


Part 1: A brief history of compression and archiving

First, a couple of definitions:

A compression format contains compressed data: data that has been fit into less space than it would occupy without compression. gzip format is a good example of this; a gzip file is little more than compressed data. gunzip doesn’t even rely on a stored filename by default (the format has an optional field for the original name, but it is normally ignored); the output name is worked out from the name of the gzip file. (For example, if you download Foo.tar.gz, and rename it Bar.tar.gz, and gunzip it, you will get a file named Bar.tar.)
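
To make the definition concrete, here is a minimal sketch using Python's standard gzip module. The sample data is made up for illustration (it is deliberately repetitive, so it compresses well); nothing here is specific to any particular download.

```python
import gzip

# Hypothetical sample data: highly repetitive, so it compresses well.
original = b"the quick brown fox jumps over the lazy dog\n" * 200

# Compress into gzip format, then expand again.
compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
```

The round trip gives back exactly the bytes that went in; only the space they occupy in between changes.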

An archive format aggregates one or more files into a single file. USTAR format (the format used by tar and pax; see below) is a good example of this. It isn’t compressed (unless you compress it with gzip, bzip2, or something else), but it is an archive format.
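
As a rough illustration of archiving without compression, the sketch below builds a USTAR archive in memory with Python's standard tarfile module. The member names and contents are hypothetical.

```python
import io
import tarfile

# Hypothetical member files: name -> contents.
files = {"a.txt": b"alpha\n" * 100, "b.txt": b"bravo\n" * 100}

buffer = io.BytesIO()
# Mode "w" writes a plain, uncompressed tar stream.
with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))

archive_bytes = buffer.getvalue()
payload_size = sum(len(data) for data in files.values())

# Read the archive back (mode "r:" means: no compression) and list its members.
with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:") as archive:
    member_names = archive.getnames()

print(member_names, payload_size, "bytes of content,", len(archive_bytes), "bytes of archive")
```

The archive comes out larger than the files it holds, because tar adds headers and padding and does nothing to shrink the contents.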


On Mac OS (that is, before Mac OS X), almost all software was distributed in StuffIt format. StuffIt was software developed by Aladdin Systems (now Allume Systems). (Before StuffIt caught on, the Mac market was largely ruled by Compact Pro. But StuffIt, combined with apathy from Compact Pro’s author, defeated it.)

The StuffIt format was both an archive format and a compression format: a StuffIt file was an archive of files whose contents were compressed. StuffIt achieved its greatest popularity with the advent of the internet — Mac OS applications could not be transmitted over networks, because most of their essential parts were in the resource fork, which was omitted when a file was sent to another operating system or copied to another file system. StuffIt, being data-fork only, kept everything intact. (MacBinary was another archive format, developed separately, that served that purpose, but did not offer compression.)

In the interest of being thorough, I should mention that MacBinary and BinHex were also used on top of StuffIt files (.sit.bin and .sit.hqx, respectively). MacBinarying a StuffIt file accomplished exactly nothing: you were wrapping what was already wrapped. BinHex was useful if the file was believed likely to undergo newline conversion (conversion between CR, which indicated a newline on the Apple II and Mac OS; LF, which indicated a newline on UNIX; and CRLF, which indicated a newline on DOS and Windows), but I think such situations were really the minority.
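
Here is a small sketch of why newline conversion matters for binary files and why a text encoding survives it. Two assumptions: base64 stands in for BinHex (it is not the same encoding, just another binary-to-text scheme), and convert_newlines is a hypothetical helper imitating a text-mode transfer from Mac OS to UNIX.

```python
import base64

def convert_newlines(data):
    """Imitate a text-mode transfer from Mac OS to UNIX: every CR becomes LF."""
    return data.replace(b"\r", b"\n")

# Hypothetical binary payload that happens to contain CR bytes (0x0D).
payload = bytes([0x53, 0x49, 0x54, 0x21, 0x0D, 0x00, 0xFF, 0x0D, 0x0A])

# Sent raw, the payload is silently altered in transit.
damaged = convert_newlines(payload)

# Sent as text (base64 here, standing in for BinHex), the line endings
# may change in transit, but the decoded bytes do not.
encoded = base64.encodebytes(payload).replace(b"\n", b"\r")  # CR line endings
received = convert_newlines(encoded)
recovered = base64.decodebytes(received)

print("raw transfer intact:", damaged == payload)
print("encoded transfer intact:", recovered == payload)
```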

Eventually, Allume completely replaced the old StuffIt formats (there are at least four versions) with StuffIt X. StuffIt X supported much more detail in how the compression was specified, as well as Unicode filenames, 64-bit dates, packages (folders that behave as files, e.g. .app bundles) and stronger encryption. But it never achieved popular usage, for two reasons.

The first reason was simply that StuffIt 6-8 sucked. The larger reason is that OS X has its own facilities for archiving, compressing, and expanding files.


Mac OS had for many years (I used it on a Mac Plus and an SE/30, among other models) an application called Disk Copy. You’d open Disk Copy, then insert a floppy disk (90mm [aka 3.5"] disks were the prevailing medium of the day). Disk Copy would read from the disk, then prompt you to insert a blank disk or cancel. You could do this as many times as you wanted; thus, Disk Copy was a disk-duplication (i.e. copying) program.

In Disk Copy version 4.2 (I think), Apple added the ability to create disk images. A disk image is literally a snapshot of the contents of a disk. Disk Copy was able to both create and apply these disk images.

The idea was that you could make an image of a disk, make a hundred copies, sell them all, then come back with another box of disks and make a hundred more copies. Resumable duplication, in effect, without having to read the disk in each time. It was also possible to distribute disk images (by compressing them and/or sending them over the internet), so that the image could be applied to floppy disks by separate individuals.

Disk Copy also had the ability to ‘mount’ these images. This was the equivalent of putting the disk in, as it would show up on the desktop, but you could have incinerated the original real disk and it would still work: everything was read from the disk image.

Around the time of System 7.5, Apple began distributing SMIs: Self-Mounting Images. Regular disk images were Disk Copy documents, so when you opened one, Disk Copy would launch to take care of it. An SMI had the disk-mounting code in it, so you didn’t need Disk Copy anymore. I think Disk Copy was made a custom-install option at this point.

Disk Copy 6 introduced a new format called NDIF (New Disk Image Format), which came in several variants: a read/write format, a read-only uncompressed format, and two read-only compressed formats (one using an unknown codec, and the other using a proprietary Apple codec named KenCode).


Independently of all of this, UNIX in all its flavours (and later GNU operating systems as well) had a program called tar. tar is a tape archive program, designed for recording and playing back backup tapes.

At some point, tar gained the ability to make tape-archive files (now colloquially known as tarballs). This is the -f part of tar -cf. Tarballs, in this way, are analogous to disk images. The tar format was eventually standardised as USTAR.

The UNIX philosophy of software design is ‘do one thing well’. So the tar format (and its companion application, in all its various separate implementations) never gained any sort of compression. Instead, various compression programs are used in series upon the tar data. The original way to do this was tar -cf - <files> | compress > foo.tar.Z.

Eventually, gzip replaced compress, and not long after, GNU tar introduced a -z option that made tar do the pipeline into gzip for you. Thus were compression and archiving merged in the UNIX world.
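
The two-step pipeline and the merged shortcut can be sketched in Python with the standard tarfile and gzip modules. This is only an illustration of the idea, not how any tar implementation works internally; make_tar and the file name are hypothetical.

```python
import gzip
import io
import tarfile

def make_tar(files):
    """Archive step only: build an uncompressed tar stream in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()

# Hypothetical file to archive.
files = {"notes.txt": b"do one thing well\n" * 500}

# Two separate steps, in the spirit of `tar -cf - <files> | gzip`.
tarball = make_tar(files)
tar_gz = gzip.compress(tarball)

# One step on the reading side, in the spirit of `tar -xzf`:
# mode "r:gz" decompresses and unpacks in a single call.
with tarfile.open(fileobj=io.BytesIO(tar_gz), mode="r:gz") as archive:
    extracted = archive.extractfile("notes.txt").read()

print(len(tarball), "bytes as tar,", len(tar_gz), "bytes as tar.gz")
```

The archive step and the compression step stay separate programs (here, separate modules); the combined mode only saves you from chaining them by hand.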


With Mac OS X, Disk Copy came back, and SMIs went away, and a new disk image format was introduced: UDIF. Like StuffIt X, UDIF was created to handle all the new and improved metadata supported by HFS+ — most conspicuously, Unicode filenames — as well as strong (AES-128) encryption. The old disk images had a filename extension of .img; this, unfortunately, was quite a common extension (used by at least one other imager and at least one picture format), so UDIF was given the filename extension of .dmg.

UDIF is a very flexible format. It comes in many variants. As of Mac OS X 10.4.3, these include:

  • Read-only, uncompressed
  • Read/write, uncompressed
  • Read-only, ADC compression (another Apple codec)
  • Read-only, zlib compression
  • Read-only, bzip2 compression
  • Entire device (?)
  • UDIF stub (?)
  • DVD/CD master
  • Sparse

(These were also taken from the manpage for hdiutil.)


Numerous other formats have existed for many years — specifically zip (on Windows) and gzip and bzip2 (on UNIX and Linux) — and StuffIt Expander has supported all of them since 3.0 (bzip2 since maybe 5).

But in OS X, StuffIt Expander (which came with the OS until Tiger) showed its age. Versions 6-8 were slow and ugly. This got better with version 9, but by that time, StuffIt Expander had lost to the alternative that also came with the OS: Disk Copy (which merged with Disk First Aid and Drive Setup to form Disk Utility in Jaguar).

For several years, most software for Mac OS X was distributed in the form of a zlib-compressed dmg.

But now things have deteriorated.

2 comments:

at 1/27/2006 10:00:00 AM, Blogger Sören 'chucker' Kuklau said...

"MacBinarying a StuffIt file accomplished exactly nothing: you were wrapping what was already wrapped. BinHex was useful if the file was believed likely to undergo newline conversion [..], but I think such situations were really the minority."

Actually, I'm surprised nobody corrected this bit. Encapsulating a StuffIt file in MacBinary or BinHex was a means to avoid resource fork trouble. It was therefore commonly used when transferring files over the network or internet.

 
at 1/27/2006 09:33:00 PM, Blogger Peter Hosey said...

it was not corrected because it was already correct. see the previous paragraph:

"Mac OS applications could not be transmitted over networks, because most of their essential parts were in the resource fork, which was omitted when a file was sent to another operating system or copied to another file system. StuffIt, being data-fork only, kept everything intact."

there was no resource fork to preserve in a StuffIt file.

 
