Commit | Line | Data |
---|---|---|
31771f45 MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ======================= | |
4 | Squashfs 4.0 Filesystem | |
9eb425c0 PL |
5 | ======================= |
6 | ||
7 | Squashfs is a compressed read-only filesystem for Linux. | |
31771f45 | 8 | |
62421645 PL |
9 | It uses zlib, lz4, lzo, or xz compression to compress files, inodes and |
10 | directories. Inodes in the system are very small and all blocks are packed to | |
11 | minimise data overhead. Block sizes greater than 4K are supported up to a | |
12 | maximum of 1 Mbyte (default block size 128K). | |
9eb425c0 PL |
13 | |
14 | Squashfs is intended for general read-only filesystem use, for archival | |
15 | use (i.e. in cases where a .tar.gz file may be used), and in constrained | |
16 | block device/memory systems (e.g. embedded systems) where low overhead is | |
17 | needed. | |
18 | ||
19 | Mailing list: squashfs-devel@lists.sourceforge.net | |
20 | Web site: www.squashfs.org | |
21 | ||
31771f45 | 22 | 1. Filesystem Features |
9eb425c0 PL |
23 | ---------------------- |
24 | ||
25 | Squashfs filesystem features versus Cramfs: | |
26 | ||
31771f45 | 27 | ==============================  =========  ========== |
9eb425c0 | 28 |                                 Squashfs   Cramfs |
31771f45 MCC |
29 | ==============================  =========  ========== |
30 | Max filesystem size             2^64       256 MiB | |
31 | Max file size                   ~ 2 TiB    16 MiB | |
32 | Max files                       unlimited  unlimited | |
33 | Max directories                 unlimited  unlimited | |
34 | Max entries per directory       unlimited  unlimited | |
35 | Max block size                  1 MiB      4 KiB | |
36 | Metadata compression            yes        no | |
37 | Directory indexes               yes        no | |
38 | Sparse file support             yes        no | |
39 | Tail-end packing (fragments)    yes        no | |
40 | Exportable (NFS etc.)           yes        no | |
41 | Hard link support               yes        no | |
42 | "." and ".." in readdir         yes        no | |
43 | Real inode numbers              yes        no | |
44 | 32-bit uids/gids                yes        no | |
45 | File creation time              yes        no | |
46 | Xattr support                   yes        no | |
47 | ACL support                     no         no | |
48 | ==============================  =========  ========== | |
9eb425c0 PL |
49 | |
50 | Squashfs compresses data, inodes and directories. In addition, inode and | |
51 | directory data are highly compacted, and packed on byte boundaries. Each | |
52 | compressed inode is on average 8 bytes in length (the exact length varies with | |
53 | file type, i.e. regular file, directory, symbolic link, and block/char device | |
54 | inodes have different sizes). | |
55 | ||
31771f45 | 56 | 2. Using Squashfs |
9eb425c0 PL |
57 | ----------------- |
58 | ||
59 | As squashfs is a read-only filesystem, the mksquashfs program must be used to | |
60 | create populated squashfs filesystems. This and other squashfs utilities | |
61 | can be obtained from http://www.squashfs.org. Usage instructions can also be | |
62 | obtained from this site. | |
63 | ||
812753d6 PL |
64 | The squashfs-tools development tree is now located on kernel.org: |
65 | git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git | |
9eb425c0 | 66 | |
31771f45 | 67 | 3. Squashfs Filesystem Design |
9eb425c0 PL |
68 | ----------------------------- |
69 | ||
4c1d204c | 70 | A squashfs filesystem consists of a maximum of nine parts, packed together on a |
31771f45 | 71 | byte alignment:: |
9eb425c0 PL |
72 | |
73 | --------------- | |
74 | | superblock | | |
75 | |---------------| | |
4c1d204c PL |
76 | | compression | |
77 | | options | | |
78 | |---------------| | |
9eb425c0 PL |
79 | | datablocks | |
80 | | & fragments | | |
81 | |---------------| | |
82 | | inode table | | |
83 | |---------------| | |
84 | | directory | | |
85 | | table | | |
86 | |---------------| | |
87 | | fragment | | |
88 | | table | | |
89 | |---------------| | |
90 | | export | | |
91 | | table | | |
92 | |---------------| | |
93 | | uid/gid | | |
94 | | lookup table | | |
899f4530 PL |
95 | |---------------| |
96 | | xattr | | |
97 | | table | | |
9eb425c0 PL |
98 | --------------- |
99 | ||
100 | Compressed data blocks are written to the filesystem as files are read from | |
101 | the source directory, and checked for duplicates. Once all file data has been | |
89cab5b5 PL |
102 | written, the completed inode, directory, fragment, export, uid/gid lookup and |
103 | xattr tables are written. | |
9eb425c0 | 104 | |
4c1d204c PL |
105 | 3.1 Compression options |
106 | ----------------------- | |
107 | ||
108 | Compressors can optionally support compression-specific options (e.g. | |
109 | dictionary size). If non-default compression options have been used, then | |
110 | these are stored here. | |
111 | ||
112 | 3.2 Inodes | |
9eb425c0 PL |
113 | ---------- |
114 | ||
115 | Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each | |
116 | compressed block is prefixed by a two-byte length; the top bit is set if the | |
117 | block is uncompressed. A block will be uncompressed if the -noI option is set, | |
118 | or if the compressed block was larger than the uncompressed block. | |
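The two-byte prefix described above can be decoded as in the following minimal sketch. It assumes the length is stored little-endian and that the low 15 bits hold the on-disk length; the function and constant names are illustrative only and are not taken from the kernel source.

```python
import struct

# Top bit of the two-byte length: set when the block is stored uncompressed.
UNCOMPRESSED_BIT = 1 << 15


def parse_metadata_header(raw):
    """Return (on_disk_length, is_compressed) for a metadata block prefix.

    Assumption: the two-byte length is little-endian.
    """
    (header,) = struct.unpack("<H", raw[:2])
    length = header & (UNCOMPRESSED_BIT - 1)   # low 15 bits
    is_compressed = not (header & UNCOMPRESSED_BIT)
    return length, is_compressed
```

For example, a prefix of 0x8064 would describe a 100-byte block stored uncompressed.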
119 | ||
120 | Inodes are packed into the metadata blocks and are not aligned to block | |
121 | boundaries, so an inode may straddle two compressed blocks. Inodes are identified | |
122 | by a 48-bit number which encodes the location of the compressed metadata block | |
123 | containing the inode, and the byte offset into that block where the inode is | |
124 | placed (<block, offset>). | |
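As an illustration of the <block, offset> encoding, the sketch below packs and unpacks a 48-bit inode number. It assumes the low 16 bits hold the byte offset into the uncompressed metadata block and the upper 32 bits hold the location of the compressed block; the helper names are hypothetical.

```python
OFFSET_BITS = 16  # assumption: offset occupies the low 16 bits


def make_inode_ref(block, offset):
    """Pack a metadata block location and a byte offset into one number."""
    return (block << OFFSET_BITS) | offset


def split_inode_ref(ref):
    """Unpack an inode number into (block, offset)."""
    return ref >> OFFSET_BITS, ref & ((1 << OFFSET_BITS) - 1)
```

A 16-bit offset is more than enough to address any byte of an 8 Kbyte metadata block, which is why the whole reference fits in 48 bits under this assumption.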
125 | ||
126 | To maximise compression there are different inodes for each file type | |
127 | (regular file, directory, device, etc.), the inode contents and length | |
128 | varying with the type. | |
129 | ||
130 | To further maximise compression, two types of regular file inode and | |
131 | directory inode are defined: inodes optimised for frequently occurring | |
132 | regular files and directories, and extended types where extra | |
133 | information has to be stored. | |
134 | ||
4c1d204c | 135 | 3.3 Directories |
9eb425c0 PL |
136 | --------------- |
137 | ||
138 | Like inodes, directories are packed into compressed metadata blocks, stored | |
139 | in a directory table. Directories are accessed using the start address of | |
140 | the metablock containing the directory and the offset into the | |
141 | decompressed block (<block, offset>). | |
142 | ||
143 | Directories are organised in a slightly complex way, and are not simply | |
144 | a list of file names. The organisation takes advantage of the | |
145 | fact that (in most cases) the inodes of the files will be in the same | |
146 | compressed metadata block, and therefore, can share the start block. | |
147 | Directories are therefore organised in a two-level list: a directory | |
148 | header containing the shared start block value, and a sequence of directory | |
149 | entries, each of which shares that start block. A new directory header | |
150 | is written whenever the inode start block changes. The directory | |
151 | header/directory entry list is repeated as many times as necessary. | |
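The two-level organisation can be sketched as follows. This is a simplified model, not the on-disk layout: entries are plain tuples, the header is a small dictionary, and any other limits the real format places on a header's run of entries are ignored.

```python
from itertools import groupby


def group_directory(entries):
    """Group sorted (name, inode_start_block, inode_offset) tuples.

    A new header is started each time the inode start block changes, so
    every entry under a header shares that header's start block.
    """
    listing = []
    for start_block, run in groupby(entries, key=lambda e: e[1]):
        header = {"start_block": start_block}
        items = [(name, offset) for name, _, offset in run]
        listing.append((header, items))
    return listing


# Hypothetical directory: two inodes in block 0, two in block 96.
entries = [("a", 0, 0), ("b", 0, 40), ("c", 96, 8), ("d", 96, 60)]
listing = group_directory(entries)
```

Here the four entries collapse into two headers, so the start block value is stored twice rather than four times.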
152 | ||
153 | Directories are sorted, and can contain a directory index to speed up | |
154 | file lookup. Directory indexes store one entry per metablock, each entry | |
155 | storing the index/filename mapping to the first directory header | |
156 | in each metadata block. Directories are sorted in alphabetical order, | |
157 | and at lookup the index is scanned linearly looking for the first filename | |
158 | alphabetically larger than the filename being looked up. At this point the | |
159 | location of the metadata block the filename is in has been found. | |
89cab5b5 | 160 | The general idea of the index is to ensure only one metadata block needs to be |
9eb425c0 PL |
161 | decompressed to do a lookup irrespective of the length of the directory. |
162 | This scheme has the advantage that it doesn't require extra memory overhead | |
163 | and doesn't require much extra storage on disk. | |
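The linear index scan described above might look like the sketch below. The index is modelled as a list of (first_filename, block_location) pairs, one per metadata block after the first; the names and the data layout are assumptions made for illustration.

```python
def find_block(index, name, first_block):
    """Return the location of the one metadata block that may hold `name`.

    index: (first_filename, block_location) pairs in sorted order.
    The scan stops at the first filename larger than `name`; the block
    remembered from the previous entry is the one to decompress.
    """
    block = first_block
    for first_name, location in index:
        if first_name > name:
            break
        block = location
    return block
```

With an index of [("m", 100), ("t", 200)], a lookup of "p" selects the block at 100, and only that block needs to be decompressed and searched.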
164 | ||
4c1d204c | 165 | 3.4 File data |
9eb425c0 PL |
166 | ------------- |
167 | ||
168 | Regular files consist of a sequence of contiguous compressed blocks, and/or a | |
169 | compressed fragment block (tail-end packed block). The compressed size | |
170 | of each datablock is stored in a block list contained within the | |
171 | file inode. | |
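Because the block list records only compressed sizes, finding a given datablock means summing the sizes of the blocks before it. The sketch below shows this, assuming the blocks are contiguous from a known start position; the function name and arguments are hypothetical.

```python
def datablock_offset(start, block_sizes, n):
    """Byte position of datablock n.

    start: position of the file's first datablock (assumed known).
    block_sizes: compressed sizes from the inode's block list.
    """
    return start + sum(block_sizes[:n])
```

The cost of this walk grows with the block index, which is the motivation for caching the mapping when reading large files.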
172 | ||
173 | To speed up access to datablocks when reading 'large' files (256 Mbytes or | |
174 | larger), the code implements an index cache that caches the mapping from | |
175 | block index to datablock location on disk. | |
176 | ||
177 | The index cache allows Squashfs to handle large files (up to 1.75 TiB) while | |
178 | retaining a simple and space-efficient block list on disk. The cache | |
179 | is split into slots, caching up to eight 224 GiB files (128 KiB blocks). | |
180 | Larger files use multiple slots, with 1.75 TiB files using all 8 slots. | |
181 | The index cache is designed to be memory efficient, and by default uses | |
182 | 16 KiB. | |
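The figures quoted above are consistent with each other, as this short calculation shows. Only the numbers stated in the text are used; the variable and function names are illustrative.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40

BLOCK_SIZE = 128 * KIB   # block size used in the text's example
SLOTS = 8                # number of cache slots
PER_SLOT = 224 * GIB     # file data covered by one slot

blocks_per_slot = PER_SLOT // BLOCK_SIZE
max_cached_file = SLOTS * PER_SLOT


def slots_needed(file_size):
    """Number of slots a file of the given size occupies (rounded up)."""
    return -(-file_size // PER_SLOT)
```

Eight slots of 224 GiB each give 1792 GiB, which is the 1.75 TiB limit mentioned above.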
183 | ||
4c1d204c | 184 | 3.5 Fragment lookup table |
9eb425c0 PL |
185 | ------------------------- |
186 | ||
187 | Regular files can contain a fragment index which is mapped to a fragment | |
188 | location on disk and compressed size using a fragment lookup table. This | |
189 | fragment lookup table is itself stored compressed into metadata blocks. | |
190 | A second index table is used to locate these. For speed of access (and | |
191 | because it is small), this second index table is read at mount time and cached | |
192 | in memory. | |
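The two-level lookup can be sketched as below. The 16-byte entry size is an assumption (a 64-bit location, a 32-bit size and 32 bits of padding), and the in-memory index table is modelled as a plain list of metadata block locations.

```python
METADATA_SIZE = 8192       # uncompressed metadata block size
FRAGMENT_ENTRY_SIZE = 16   # assumption: size of one lookup table entry


def locate_fragment_entry(fragment_index, index_table):
    """Map a fragment index to (metadata_block_location, byte_offset).

    index_table: the second index table, cached in memory at mount time.
    """
    per_block = METADATA_SIZE // FRAGMENT_ENTRY_SIZE
    block = index_table[fragment_index // per_block]
    offset = (fragment_index % per_block) * FRAGMENT_ENTRY_SIZE
    return block, offset
```

Under these assumptions each metadata block holds 512 entries, so the in-memory index needs only one location per 512 fragments.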
193 | ||
4c1d204c | 194 | 3.6 Uid/gid lookup table |
9eb425c0 PL |
195 | ------------------------ |
196 | ||
197 | For space efficiency, regular files store uid and gid indexes, which are | |
198 | converted to 32-bit uids/gids using an id lookup table. This table is | |
199 | stored compressed into metadata blocks. A second index table is used to | |
200 | locate these. For speed of access (and because it is small), this second | |
201 | index table is read at mount time and cached in memory. | |
202 | ||
4c1d204c | 203 | 3.7 Export table |
9eb425c0 PL |
204 | ---------------- |
205 | ||
206 | To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems | |
207 | can optionally (disabled with the -no-exports Mksquashfs option) contain | |
208 | an inode number to inode disk location lookup table. This is required to | |
209 | enable Squashfs to map inode numbers passed in filehandles to the inode | |
210 | location on disk, which is necessary when the export code reinstantiates | |
211 | expired/flushed inodes. | |
212 | ||
213 | This table is stored compressed into metadata blocks. A second index table is | |
214 | used to locate these. For speed of access (and because it is small), this | |
215 | second index table is read at mount time and cached in memory. | |
216 | ||
4c1d204c | 217 | 3.8 Xattr table |
899f4530 PL |
218 | --------------- |
219 | ||
220 | The xattr table contains extended attributes for each inode. The xattrs | |
221 | for each inode are stored in a list, each list entry containing a type, | |
222 | name and value field. The type field encodes the xattr prefix | |
223 | ("user.", "trusted." etc) and it also encodes how the name/value fields | |
224 | should be interpreted. Currently the type indicates whether the value | |
225 | is stored inline (in which case the value field contains the xattr value), | |
226 | or if it is stored out of line (in which case the value field stores a | |
227 | reference to where the actual value is stored). This allows large values | |
228 | to be stored out of line improving scanning and lookup performance and it | |
229 | also allows values to be de-duplicated, the value being stored once, and | |
25985edc | 230 | all other occurrences holding an out of line reference to that value. |
899f4530 PL |
231 | |
232 | The xattr lists are packed into compressed 8K metadata blocks. | |
233 | To reduce overhead in inodes, rather than storing the on-disk | |
234 | location of the xattr list inside each inode, a 32-bit xattr id | |
235 | is stored. This xattr id is mapped into the location of the xattr | |
236 | list using a second xattr id lookup table. | |
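The way a type field can carry both the prefix and the inline/out-of-line flag is illustrated below. The numeric values are assumptions chosen for the sketch (low byte selects the prefix, one higher bit marks an out-of-line value) and should be checked against the kernel headers before being relied on.

```python
# Assumed encoding, for illustration only.
XATTR_PREFIXES = {0: "user.", 1: "trusted.", 2: "security."}
XATTR_PREFIX_MASK = 0xFF
XATTR_VALUE_OOL = 0x100   # value field holds a reference, not the value


def decode_xattr_type(type_field):
    """Return (prefix, stored_out_of_line) for an xattr type field."""
    prefix = XATTR_PREFIXES[type_field & XATTR_PREFIX_MASK]
    out_of_line = bool(type_field & XATTR_VALUE_OOL)
    return prefix, out_of_line
```

With this layout, a reader can decide how to interpret the value field before touching it, which is what lets de-duplicated values be shared by reference.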
9eb425c0 | 237 | |
31771f45 | 238 | 4. TODOs and Outstanding Issues |
9eb425c0 PL |
239 | ------------------------------- |
240 | ||
31771f45 | 241 | 4.1 TODO list |
9eb425c0 PL |
242 | ------------- |
243 | ||
899f4530 | 244 | Implement ACL support. |
9eb425c0 | 245 | |
31771f45 | 246 | 4.2 Squashfs Internal Cache |
9eb425c0 PL |
247 | --------------------------- |
248 | ||
249 | Blocks in Squashfs are compressed. To avoid repeatedly decompressing | |
250 | recently accessed data, Squashfs uses two small metadata and fragment caches. | |
251 | ||
252 | The cache is not used for file datablocks; these are decompressed and cached in | |
253 | the page-cache in the normal way. The cache is used to temporarily cache | |
254 | fragment and metadata blocks which have been read as a result of a metadata | |
255 | (i.e. inode or directory) or fragment access. Because metadata and fragments | |
256 | are packed together into blocks (to gain greater compression) the read of a | |
257 | particular piece of metadata or fragment will retrieve other metadata/fragments | |
258 | which have been packed with it; because of locality of reference, these may be | |
259 | read in the near future. Temporarily caching them ensures they are available | |
260 | for near future access without requiring an additional read and decompress. | |
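The benefit of temporarily caching decompressed blocks can be shown with a toy model. The sketch below uses a least-recently-used policy purely for illustration; it is not the replacement scheme Squashfs uses, and the class and callback names are hypothetical.

```python
from collections import OrderedDict


class BlockCache:
    """Small cache of decompressed blocks, keyed by block location."""

    def __init__(self, capacity, read_and_decompress):
        self.capacity = capacity
        self.read_and_decompress = read_and_decompress  # caller-supplied
        self.entries = OrderedDict()
        self.misses = 0

    def get(self, block):
        if block in self.entries:
            self.entries.move_to_end(block)    # mark as recently used
            return self.entries[block]
        self.misses += 1
        data = self.read_and_decompress(block)
        self.entries[block] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the oldest entry
        return data
```

Two accesses to metadata packed in the same block cost one read and decompress instead of two, which is the locality effect described above.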
261 | ||
262 | In the future this internal cache may be replaced with an implementation which | |
263 | uses the kernel page cache. Because the page cache operates on page sized | |
264 | units this may introduce additional complexity in terms of locking and | |
265 | associated race conditions. |