The file format is made of two sorts of structures: headers and data blocks. Each header is a dictionary whose keys are integers. That way we can add new things to headers in future versions without having to break the file format.
Creating a normal tar.gz file means compressing a whole tar archive. As a consequence, if there is a corruption in the tar.gz file, we can’t gunzip it, and therefore we can’t untar it: a single corrupt byte at the beginning of the file makes it unreadable.
For this reason fsarchiver works the other way around. It’s more like a gz.tar: we archive files that have already been compressed individually. This provides better safety for your data: if one byte is corrupt, we just ignore the current file and can still extract the next one.
There is a problem with that approach: when there are a lot of small files, as in the linux-2.6 sources, each small file is compressed into its own data block of only a few KB, which produces a bad compression ratio compared to compressing one big tar archive of all the small files.
For that reason, all the small files are processed differently in fsarchiver-0.4.4 and later: the program collects a set of small files that fit together in a single data block (256KB by default) and then compresses that block, which is quite big compared to the size of any single file.
There is a threshold which is used to decide whether a regular file is considered a small file (sharing a data block with other small files) or a normal/big file having its own data blocks. This threshold depends on the size of a data block, which itself depends on the algorithm chosen to compress the data. In general, a data block is 256KB (it can be bigger) and the threshold is then 256KB/4 = 64KB.
There are many sorts of objects in a filesystem:
Each data block which is written to the archive has its own header (FSA_MAGIC_BLKH). This header stores the block attributes: original block size, compressed size, compression algorithm used, encryption algorithm used, offset of the first byte in the file, … When both compression and encryption are used, the compression is done first. Processing in that order is more efficient because the encryption algorithm has less work to do: the compressed block is smaller. When compression makes a block bigger than the original one, fsarchiver automatically ignores the compressed version and keeps the uncompressed block.
fsarchiver should be endianness safe. All the integers are converted to little-endian before they are written to the disk. So it’s a no-op on intel platforms, and it requires a conversion on all the big-endian platforms. The only things that have to be converted are the integers which are part of the headers. These conversions don’t pollute the source code since they all happen in the functions that manage the headers: the write-header-to-disk and read-header-from-disk functions. All the higher-level functions in the code just have to add a number to the header; they don’t have to worry about endianness.
All the information which is not the file contents itself (the meta-data) is stored in a generic management system called headers. A header is not just a normal C structure written to disk: a specific object-oriented mechanism has been written for that. The headers are quite high-level data structures that are reused a lot in the sources. They provide endianness management, checksumming, and extensibility. Since the data are stored as a dictionary, where the keys are 16bit integers, it will be possible to extend the headers in the future by adding new keys carrying new information in the next versions. These headers are implemented as s_dico in the program, and the magic and checksum are added when a header is written to the archive file.
Here is how a header is stored in an archive:
The random 32bit archive id is used to make sure the program won’t be confused by nested archives in case the archive is corrupt. When an archive is corrupt, we can skip data and search for the next header using the known magic strings. With nested archives, we could find a magic string that belongs to a nested archive. That header won’t be used, because we also compare the random 32bit archive id: the id of the nested archive’s header will be different, so the program knows it has to ignore that header and continue searching.
Almost everything in the archive is checksummed to make sure the program never silently restores corrupt data. The headers that contain all the meta-data (file names, permissions, attributes, …) are checksummed using a 32bit fletcher32 checksum. The file contents are written to the archive as blocks of a few hundred kilobytes. Each block is compressed (unless compression makes the block bigger) and may also be encrypted (if the user provided a password). Each block also has its own 32bit fletcher32 checksum. Each file additionally has an individual md5 checksum that makes sure the whole file is exactly the same as the original one: the blocks are checksummed individually, but the per-file md5 also catches the case where a whole block of a file has been dropped, for instance. Thanks to the md5 checksum, we can be sure the program is aware of any corruption that happens.