I didn't want to spend the time computing the SHA for large files that have a size different from all others. ... So, I was finding all files and basically just saving their size and mtime.
That's a really nice optimization, in particular because normal file systems typically have very few gigantic files (otherwise they'd run out of space), and gigantic files have a huge phase space of possible sizes, so the probability of two gigantic files having the same size is really low. I like the idea so much that I just added it to my to-do list to steal and implement.
From an implementation viewpoint, this has a nasty drawback: If you suddenly find a second gigantic file that has the same size as one previously seen, you have to go back and calculate the SHA for both. From a code flow viewpoint, this sounds annoying.
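One way around the annoyance is to keep a per-size bucket and only compute hashes lazily, going back to hash the earlier file the moment a size collision shows up. A rough sketch in Python (the names sha512_of, by_size and add_file are made up for illustration, not anyone's actual code):

    import hashlib
    import os
    from collections import defaultdict

    def sha512_of(path, bufsize=1 << 20):
        # Full-file SHA-512, read in 1 MiB chunks.
        h = hashlib.sha512()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(bufsize), b""):
                h.update(chunk)
        return h.hexdigest()

    # size -> list of (path, sha_or_None) seen so far with that size
    by_size = defaultdict(list)

    def add_file(path):
        size = os.path.getsize(path)
        bucket = by_size[size]
        if not bucket:
            # Unique size so far: remember the file, skip the expensive hash.
            bucket.append((path, None))
            return
        # Size collision: retroactively hash any earlier file that was skipped,
        # then hash the new arrival so the two can be compared.
        for i, (p, sha) in enumerate(bucket):
            if sha is None:
                bucket[i] = (p, sha512_of(p))
        bucket.append((path, sha512_of(path)))

The "go back and hash the earlier file" step is confined to one place, so the rest of the scan can stay a simple forward pass over the directory tree.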
A thought occurred to me while reading this. Perhaps do a SHA on (for example) the first 4K and the last 4K. That would be somewhat cheap and would be a very good first approximation of a match.
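Something like this sketch is what I mean (Python, quick_fingerprint is just an illustrative name):

    import hashlib
    import os

    def quick_fingerprint(path, chunk=4096):
        # Cheap first-pass check: SHA-512 over only the first and last 4 KiB.
        size = os.path.getsize(path)
        h = hashlib.sha512()
        with open(path, "rb") as f:
            h.update(f.read(chunk))               # first 4 KiB (or the whole file if smaller)
            if size > chunk:
                f.seek(max(chunk, size - chunk))  # avoid re-reading overlapping bytes
                h.update(f.read(chunk))           # last 4 KiB
        return h.hexdigest()

A matching fingerprint only means "possibly identical"; a full SHA or a byte-by-byte compare still has to make the final call.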
I had that idea a long time ago, except I did the SHA of the first 1 MiB (not the last). It turned out to be a complete waste of time in my case. Here's why: I'm doing backups; when I find a new file (one whose SHA doesn't match anything), I need to copy it to the backup. So I'm working through the file, and when I reach 1 MiB, I compute the SHA of that first MiB and check whether it matches anything seen so far. But how does that help? If it is not a match, then I already know this is guaranteed to be a new file, in which case I need to finish calculating the SHA anyway. And if I do have a match in the first MiB, it is reasonably likely that I have two files that are identical in the first MiB but differ later (for example, because both had something appended, which is not uncommon for logs), in which case I need to calculate the SHA to the end anyway to find out whether they're different. So the SHA of the first MiB is useless.
I like your idea of doing the beginning and end of the file; that might help performance a little bit. I'll try it.
In the meantime, I discovered a particularly annoying class of files that I need to learn to deal with, namely MP3-encoded music files. I have thousands of those. I'm now discovering that many of those are "fundamentally identical", the only difference is that the tags (composer name, play count ...) inside them have been updated. Today, I back up both versions. What I want to add is the following functionality: If the file format is MP3, then decode the audio content, and take a SHA of the audio content instead of the whole file. If the audio content is identical (to the bit), but the files are different, then always save only the later one (by mtime), not both. Implementing that is on my to-do list.
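I haven't written the decoder part yet, but a much simpler approximation would already catch the "only the tags changed" case: strip the ID3v2 block at the front and the ID3v1 block at the end and hash what's left, without decoding any audio frames. A hedged sketch (Python; audio_sha512 is a made-up name, and this is not the full decode-and-hash approach described above):

    import hashlib

    def audio_sha512(path):
        # SHA-512 of an MP3 with ID3v2 (leading) and ID3v1 (trailing) tags stripped.
        # Note: this only skips the tag containers; it does not decode audio frames.
        with open(path, "rb") as f:
            data = f.read()
        start, end = 0, len(data)
        # ID3v2 header: 'ID3', 2 version bytes, 1 flags byte, 4 syncsafe size bytes.
        if data[:3] == b"ID3" and len(data) >= 10:
            size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
            start = 10 + size
            if data[5] & 0x10:          # footer-present flag adds another 10 bytes
                start += 10
        # ID3v1 tag: fixed 128-byte block at the very end, starting with 'TAG'.
        if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
            end -= 128
        return hashlib.sha512(data[start:end]).hexdigest()

This would treat two copies that differ only in their tags as identical, but it would not catch re-encoded files or files whose tag padding grew, which is why actually decoding the audio content is the more thorough plan.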
I still plan to do a "cmp" to make sure they are identical.
I decided not to. I use a 512-bit SHA, and I have far fewer than a million files. The probability that two different files (with the same size) have the same SHA is astronomically small, even after taking the birthday paradox into account.
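A back-of-the-envelope check, assuming a generous one million files:

    from math import log2

    n = 10**6                     # generous upper bound on the number of files
    pairs = n * (n - 1) // 2      # about 5e11 pairs that could collide
    # Birthday-bound approximation: P(any collision) <= pairs / 2**512.
    # Prints roughly -473, i.e. the probability is below 2**-473 (about 1e-142).
    print(log2(pairs) - 512)

With odds like that, the byte-by-byte compare buys nothing in practice.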