Oh, I'm well aware of how rsync works (well, aside from this particular issue).

At the end of the day: because I can't rsync 100TB of data.
Furthermore, most of the data on disk are result sets - I don't need to back up the processed data, I need to back up the files that create the processed data... In the event of a fire or an earthquake (this is Southern California, after all), I can pull a minimal backup back online, set several hundred processors to work, and have everything back to the way it was in a matter of a day or three.
Why not just rsync the whole 100TB and let rsync figure out the differences? Two [main] reasons:
1. I would need another 100TB at a co-lo facility. Cost of hardware + cost of rack space + maintenance on that many spindles is prohibitive.
2. My company generates several hundred gigs to possibly 1TB or more of data per day. Transferring that much data would take entirely too much time & cost entirely too much.
It makes little sense to back up data that can easily be regenerated. It makes a lot of sense to back up the files that generate other data (which = money). So I have all these fancy scripts that find the generating data files... now I need to back them up.
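For the curious, the general shape of that last step is something like the following (a rough sketch - the script name, host, and destination path are made-up placeholders; only the rsync options are the real part). The scripts emit one path per line, and rsync's --files-from takes that list so only the generating files ever cross the wire:

    # placeholder name for whatever script emits the generating files, one path per line
    find_generating_files > /tmp/generating.list

    # --files-from reads that list (and implies --relative, so the directory layout is preserved);
    # -a carries over permissions, timestamps, etc.
    rsync -a --files-from=/tmp/generating.list / backup-host:/backup/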
That was probably a much longer explanation than you were interested in. heheh
