UNIX/Linux native filesystems allow files to have "holes".
This is done partly for speed, and partly to save disk space.
When files are being written with records in a random order, there is no need to actually allocate (and write) disk blocks that are full of zero bytes. It is only necessary to record that the space is virtually present, so that the record locations in the disk blocks that actually were written are preserved.
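As a rough sketch of what that looks like at the system call level (the filename, offset, and record size here are made up for the demonstration), a program can write one record, seek far past it, and write another; the range skipped over becomes a hole:

```c
/* Minimal sketch: create a file with a hole by seeking past a
 * region that is never written.  Filename and sizes are made up. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("sparse.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char rec[512];
    memset(rec, 'A', sizeof rec);

    /* Write one record at the start of the file. */
    if (write(fd, rec, sizeof rec) < 0) { perror("write"); return 1; }

    /* Seek roughly 1 GB forward; the skipped space is never allocated. */
    if (lseek(fd, 1024L * 1024 * 1024, SEEK_SET) < 0) { perror("lseek"); return 1; }

    /* Write a second record; only these two regions occupy disk blocks. */
    if (write(fd, rec, sizeof rec) < 0) { perror("write"); return 1; }

    close(fd);
    return 0;
}
```

After running this, ls -l reports a size of roughly 1 GB while du reports only a few KB, because the range skipped by lseek() was never allocated.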
The original use was hash access to disk files - specifically a dictionary lookup. Instead of storing every word in the dictionary, it used a master dictionary (root words only, with prefix and suffix identifications) plus a hash table containing a single bit per entry - 1 for a valid word, 0 for an error. To make lookups of root words fast, a "perfect hash" function was used for every root word entry. That hashing spread the lookup values over a 4GB range (much too large for the 5 MB disks in use in 1973), so to make them "fit", the huge number of blocks holding only 0 bits were simply never written. If you did read them, you got a block of 0 values.
This did introduce a problem - if you copied the file, the blocks got allocated. And in 1973, you ran out of disk space when you did that.
That doesn't happen as often now - but for speed of writing (and reading), not having the overhead of actually allocating and writing unused disk blocks is a big performance improvement for single record writes. It is still useful for hash functions too - now the hash can generate 64 bit values without running out of disk space.
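For example, a hash-addressed record store can simply pwrite() each record at whatever offset its hash picks and let everything in between remain holes. This is only an illustrative sketch - the FNV-1a hash, record size, and file name are my own choices, not anything from the original dictionary program:

```c
/* Sketch of a hash-addressed record store over a sparse file. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define RECORD_SIZE 64
/* Keep offsets inside a (still huge, still sparse) 1 TB address range. */
#define SLOTS ((uint64_t)1 << 34)

/* 64-bit FNV-1a hash - an arbitrary choice for the demo. */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    while (*s) { h ^= (unsigned char)*s++; h *= 0x100000001b3ULL; }
    return h;
}

int main(void)
{
    int fd = open("hashstore.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *key = "example";
    char rec[RECORD_SIZE] = {0};
    snprintf(rec, sizeof rec, "record for %s", key);

    /* The record lands wherever the hash says; everything else stays a hole. */
    off_t off = (off_t)((fnv1a(key) % SLOTS) * RECORD_SIZE);
    if (pwrite(fd, rec, sizeof rec, off) < 0) { perror("pwrite"); return 1; }

    close(fd);
    return 0;
}
```

The logical file size grows toward the 1 TB range, but du only charges you for the blocks holding records that were actually written (the filesystem does have to support files that large, which ext4 and xfs do).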
Over time, cp (now taken from the GNU project) has learned to recognize files with holes and to preserve them during copies - basically it scans the file for block-sized chunks that contain nothing but nulls, then seeks over them in the output and only writes the blocks that actually contain data.
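A crude version of that scan-and-seek idea looks like the sketch below. This is not GNU cp's actual code (modern cp can also ask the kernel where the holes are, e.g. via lseek() with SEEK_DATA/SEEK_HOLE), just the basic technique described above:

```c
/* Rough sketch of a hole-preserving copy: read block-sized chunks,
 * and when a chunk is all zeros, seek over it in the output
 * instead of writing it. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }

    int in  = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    char buf[BLK], zero[BLK] = {0};
    ssize_t n;
    while ((n = read(in, buf, BLK)) > 0) {
        if (n == BLK && memcmp(buf, zero, BLK) == 0)
            lseek(out, BLK, SEEK_CUR);              /* leave a hole */
        else if (write(out, buf, n) != n) { perror("write"); return 1; }
    }

    /* Fix the length in case the source file ends with a hole. */
    ftruncate(out, lseek(in, 0, SEEK_CUR));

    close(in);
    close(out);
    return 0;
}
```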
The difference you see between ls and du is the difference between the logical size (the expected data size) and the physical size (the data actually recorded on disk).
Even for files without holes you can see a difference - the ls command shows the size of the recorded data, while the du command shows the size of the data plus any overhead blocks needed to manage its storage. This overhead is usually called "metadata", since it is data that only tells the kernel where the actual data blocks are.
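Both numbers come from the same stat(2) call - st_size is the logical length that ls prints, and st_blocks counts the 512-byte units actually allocated, which is what du adds up. A minimal sketch:

```c
/* Sketch: print the two numbers ls and du are comparing.
 * st_size is the logical length; st_blocks is in 512-byte units
 * and covers data plus indirect (metadata) blocks. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    struct stat st;
    if (stat(argv[1], &st) != 0) { perror("stat"); return 1; }

    printf("logical size : %lld bytes\n", (long long)st.st_size);
    printf("allocated    : %lld bytes\n", (long long)st.st_blocks * 512);
    return 0;
}
```

Run it on a sparse file and the allocated number is far smaller than the logical size; run it on an ordinary large file and it is slightly larger, because of the indirect blocks described below.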
An example of such a layout is shown at http://dysphoria.net/OperatingSystem...tion_unix.html. Not all filesystems are like this, but most Linux native filesystems are "close".
When the file is small, only direct pointers to data blocks are used. When the file is larger, more pointers to data blocks are required, so additional blocks are allocated to hold indirect pointers. If the file is REALLY large, blocks are allocated that only point to blocks containing pointers (double indirect), and if the file is REALLY HUGE, there are pointers to blocks containing pointers to blocks of pointers (triple indirect). In the example image these are shown in various shades of blue, though the image only shows the single and double indirect blocks.
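To get a feel for the scale, here is a back-of-the-envelope calculation using classic ext2-style numbers (4 KiB blocks, 4-byte block pointers, 12 direct pointers in the inode - real filesystems vary, so treat these as illustrative assumptions):

```c
/* How far each pointer level reaches, assuming 4 KiB blocks,
 * 4-byte block pointers, and 12 direct pointers in the inode. */
#include <stdio.h>

int main(void)
{
    const unsigned long long blk    = 4096;      /* block size in bytes   */
    const unsigned long long ptrs   = blk / 4;   /* pointers per block    */
    const unsigned long long direct = 12;        /* direct pointers       */

    unsigned long long d  = direct * blk;               /* direct          */
    unsigned long long s  = ptrs * blk;                 /* single indirect */
    unsigned long long db = ptrs * ptrs * blk;          /* double indirect */
    unsigned long long t  = ptrs * ptrs * ptrs * blk;   /* triple indirect */

    printf("direct          : %llu KiB\n", d / 1024);                /* 48 KiB */
    printf("+ single indir. : %llu MiB\n", s / (1024ULL << 10));     /* 4 MiB  */
    printf("+ double indir. : %llu GiB\n", db / (1024ULL << 20));    /* 4 GiB  */
    printf("+ triple indir. : %llu TiB\n", t / (1024ULL << 30));     /* 4 TiB  */
    return 0;
}
```

With those assumptions the direct pointers cover the first 48 KiB, the single indirect block adds about 4 MiB, the double indirect about 4 GiB, and the triple indirect about 4 TiB - and every one of those indirect blocks is overhead that du counts but ls does not.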
The ls command shows only the byte length of the data. The du command shows the blocks used for the data, plus the blocks used for the overhead.
You can search for documentation on how the various filesystems handle this (ext2/3 is similar to the above, while ext4 adds extents, which work a bit differently; xfs is more drastically different, since it targets really large volumes where the overhead of indirect blocks would be huge, and it wanted more flexible volume sizing).