Bash Shell scripting : File and it's metadata

Rygir · 04-15-2014, 04:08 PM

What this question is about :
My goal is to build something that takes two directory roots A and B, then checks that everything that is in A exists in B.
If it exists, check that it is identical. To be identical :

Filename must be the same and it must be in the same relative path
Date for creation/last modification must be the same
Size must obviously be identical
Other metadata like alternate filestreams must be identical
content must be a match, bit for bit.
It may NOT be the SAME file/node!

The "other metadata" is actually a question, if someone can think of metadata that I should/could check, I would appreciate it.

If it is identical, delete the one in A (or at least mark for deletion using, say, a log file).

It must do this file by file because most likely the system will be reset before it reaches the end of the filesystems.

In addition, I would like to offer switches to make it delete from B instead of A, or not delete at all, choose a log file and possibly automatically trying to correct the situation by copying everything in A to B first.

The goal is to verify that a mirroring application has done it's job correctly so that I can conlude that I have archived the source copy and it can now be removed.

The actual question :

I'm currently stuck trying to test if a file date is identical... bash IF has options -ot and -nt for older and newer then comparison but there is no switch to check for identical dates. This is annoying because IF FILENAME -ot OTHERFILENAME is very easy to understand, however if I try to extract the dates and timestamps from a file using another command I will be converting them to strings probably... I'd like to know the cleanest way to compare the dates and avoid any regional/timezone issues.

For identicalness I'm using CMP command. For name I'm just using IF exist test. And then there's also an if test for the same node in the filesystem so you can easily see if it's the same file (at least it seems to work, if you think I'm doing it wrong, please tell me!).

Along with identical dates, Size and other metadata I don't know how to compare yet either...

JeremyBoden · 04-15-2014, 04:58 PM

Why do you need to check the modification date?
If two files are otherwise identical, does it really matter?

Date/time comparisons are very dodgy unless you know that all times are UTC.

A simple ls command gives much metadata such as permissions, user, group, creation date.

You could cover much of what you want to do by comparing md5 (or similar) file checksums.

Rygir · 04-15-2014, 05:44 PM

Because a file date is not stored inside the file and it is useful information. By default,copying sets the date to the date of the copy, which is useless. I want to be able to sort files by date, search for files from a certain period, etc... it's valuable information. About as valuable as the filename.

I don't know if they are all UTC but since source is copied to destination, I can be reasonably certain both are supposed to be in the same timezone.

ls only shows unfortunately, and I'd prefer to avoid complicating things by splitting that string by hand...there must be a better way.

Checksums are useless in this case, since I'm only checking any particular file exactly once. I would like to generate a checksum though for later use perhaps since I'm deleting the copy which deprives me of the possibility to verify afterwards.
In any case, since to generate the checksum you need to read the file once and you need to read it once to do a direct comparison, checksums will only complicate matters.

chrism01 · 04-16-2014, 07:31 AM

1.

Quote:

A simple ls command gives much metadata such as permissions, user, group, creation date.

Actually, it doesn't store the creation date: http://www.linux-faqs.info/general/d...time-and-atime

(very new ext4 based systems do have a new metadata field birthtime, but afaik, no tools actually maintain it yet ...)

2. I agree that we'd like to know what you (OP) are trying to accomplish here; checking every single possible piece of info about a file is possible but generally pointless.
The usual definition of 'the same' is identical content, for which a checksum eg md5sum is sufficient.
EG what are you going to do if the content is the same but the name is different ... or vice versa .. and similarly for any pair of factoids?
Think of all the pair combinations involved.

3. If you insist on checking everything(!), write a program in eg Perl, which will enable you to do so.

metaschima · 04-16-2014, 10:37 AM

I would tar the two directories and compare the tarballs using cmp.

Rygir · 05-07-2014, 11:25 AM

Quote:

Originally Posted by chrism01

1.

Actually, it doesn't store the creation date: http://www.linux-faqs.info/general/d...time-and-atime

I read that before and it made me wonder... what happens if you mount an NTFS system which definitly does keep track of creation times (because Windows does).

Quote:

2. I agree that we'd like to know what you (OP) are trying to accomplish here; checking every single possible piece of info about a file is possible but generally pointless.

It's not pointless : If you search by date to find something you want the dates to be something useful, not the date that everything was last synchronized or copied.

Quote:

EG what are you going to do if the content is the same but the name is different ... or vice versa .. and similarly for any pair of factoids?

Well my current approach wouldn't be aware of that since it compares only identical relative paths. In case dates don't match but content does, log it and set the date correctly (i.e. to the oldest date, or the date from the "source" side). For anything else, log file as not identical probably but aside from alternate data streams I can't think of anything... archive flags I'd just copy from source side...

Quote:

3. If you insist on checking everything(!), write a program in eg Perl, which will enable you to do so.

Which is exactly what I'm doing : I'm writing a (bash) shell script. Hence the question, because I haven't found a test (IF) that compares dates for equality, only older or newer than.

To clarify : The purpose is to do an incremental migration, I'm moving all the data of one system to another and i'd like to keep the file system "as is". I wanted to use dd but...do to system limitations this process is always interrupted before completion. In addition I would have to extract the files afterwards anyway from this volume image and go through the validation step anyway. Only I would be using desktop tools then which I don't have available on the source machine. So I was hoping of doing it all in one go and just get the files one by one but with all the meta data synchronized and validated immediately.

Rygir · 05-07-2014, 11:27 AM

Quote:

Originally Posted by metaschima

I would tar the two directories and compare the tarballs using cmp.

I would still have to validate that the eventual target file system, after extraction from the tarball, accepted the metadata. So while this would validate safe transition between systems, it adds a conversion step.

grail · 05-07-2014, 12:34 PM

So your actual requirement has nothing to do with all the testing you are referring to but rather that the data was migrated correctly. Why not then use a tool like rsync which allows you to track
what has been successfully moved, where your up to and I believe it can even start contiguously after an interruption?

And seeing as you mentioned it ... are we to assume this work will be done on Windows machines?

jpollard · 05-07-2014, 11:50 PM

You may as well know that not all the metadata CAN be the same.

In particular, the inode number will almost certainly be different.

Depending on the filesystem load, even the "size" of the file will be different (storage requires various metadata to point to the data), and some depend on the filesystems used for the source and destination - the filesystems themselves will have different metadata used.

In the case of the "creation date" (actually inode change date) which date do you want?

The creation of the original file (which is NOT the creation of the copy...) And if you store two dates for that - guess what, you still won't get the same. In the first case, the creation date and inode modification date might be the same... but the copy won't be. If it were, then the second date would be wrong... For nearly everything, "creation date" doesn't mean anything - which is why when you copy a file you get the date the file started being copied... as that is the "creation date" of the copy. In the second case (a creation date and a copy date) what you get depends on how you copy the file... use an editor to copy a file (it happens), and you don't get the original date, you get the date the file was copied... cp isn't the only way to copy a file, you can also use cat, dd, tr, tar, rsync, cpio, ... and then there are the dozens of programs out there that copy data as well..

Rygir · 05-08-2014, 03:45 AM

Quote:

Originally Posted by grail

So your actual requirement has nothing to do with all the testing you are referring to but rather that the data was migrated correctly. Why not then use a tool like rsync which allows you to track
what has been successfully moved, where your up to and I believe it can even start contiguously after an interruption?

Because the copying is taken care of, I'm only interested in validating that it did what I hoped it would do.

That said, I was interested in rsync but I haven't been able to get it to work. Supposedly my nas supports it, but I doubt my source machine supports it. I forgot what went wrong exactly but maybe I was just doing it wrong....

Quote:

And seeing as you mentioned it ... are we to assume this work will be done on Windows machines?

Nope, no windows machines involved until after the deed is done (clients using the files).

Rygir · 05-08-2014, 03:55 AM

Quote:

Originally Posted by jpollard

You may as well know that not all the metadata CAN be the same.

In particular, the inode number will almost certainly be different.

Good point, I'm using the inode to check and make sure that the file isn't the same file. If it is, then deleting it on one side risks deleting it on both sides after all. Since it should never be on the same partition, hard links are not possible and thus the same inode would indicate I accidently misused the script to compare the folder to itself. It shouldn't do anything in this case.

Quote:

Depending on the filesystem load, even the "size" of the file will be different (storage requires various metadata to point to the data), and some depend on the filesystems used for the source and destination - the filesystems themselves will have different metadata used.

While I can imagine rounding differences between types of filesystems... I've never actually seen this... could you elaborate?

Quote:

In the case of the "creation date" (actually inode change date) which date do you want?

The creation of the original file (which is NOT the creation of the copy...) And if you store two dates for that - guess what, you still won't get the same. In the first case, the creation date and inode modification date might be the same... but the copy won't be. If it were, then the second date would be wrong... For nearly everything, "creation date" doesn't mean anything - which is why when you copy a file you get the date the file started being copied... as that is the "creation date" of the copy. In the second case (a creation date and a copy date) what you get depends on how you copy the file... use an editor to copy a file (it happens), and you don't get the original date, you get the date the file was copied... cp isn't the only way to copy a file, you can also use cat, dd, tr, tar, rsync, cpio, ... and then there are the dozens of programs out there that copy data as well..

Well since my intention is to be able to track how long was work was done on a file I want to preserve that "creation date" and avoid getting the date of the copy, that is the point of this whole thread. Because yes, I'm aware you can replicate data in a number of ways. I'm trying to find the "best" way, the one that can never re-interprete data. I guess this isn't the scientifically correct way to say it since inode numbers would be interpreted at the very least but I'm talking about the contents of the file in this case, and preserving dates and as many characteristics that users or systems explicitly set on specific files.

Hence, I've been using cp with the switch to preserve dates (-r? -p? I forgot, but it seems to work under the right circumstances).

pan64 · 05-08-2014, 04:03 AM

It was already mentioned, but probably missed: rsync will do that job for you

jpollard · 05-08-2014, 07:09 AM

Quote:

Originally Posted by Rygir

While I can imagine rounding differences between types of filesystems... I've never actually seen this... could you elaborate?

Consider the differences between ext2/ext3 and ext4. ext4 has storage capability using extents that ext2/ext3 does not. So what can happen is that a file copied from an ext3 (with many pointer blocks) gets coalesced into fewer pointer blocks, with larger extents - thus fewer meta data blocks required to store a given amount of data. This difference gets larger for things like xfs and btrfs. There are also possible differences just in the size of an inode... Originally, an inode included the usual access, owner... meta data, but also had pointers to the first level data... And in the case of very small files, no pointers at all - if the data would fit where the pointer list was, then that was where the data resided, and not pointer blocks. This meant that the only metadata was the inode itself. In other filesystems, no data was in the inode, just a pointer list. And the size of that pointer list would depend on the block size of the device.

In addition to extents, xfs uses inodes based on the address of where the inode is located.

Other metadata existence depends on how the disk is mounted - if acls are enabled then that list can be carried - but if the filesystem either doesn't support (or it is mounted with ACLs disabled) such lists may get dropped. Even then, not all filesystems support the same set of ACLs... those not supported get dropped.

Quote:

Well since my intention is to be able to track how long was work was done on a file I want to preserve that "creation date" and avoid getting the date of the copy, that is the point of this whole thread. Because yes, I'm aware you can replicate data in a number of ways. I'm trying to find the "best" way, the one that can never re-interprete data. I guess this isn't the scientifically correct way to say it since inode numbers would be interpreted at the very least but I'm talking about the contents of the file in this case, and preserving dates and as many characteristics that users or systems explicitly set on specific files.

Well "work was done on a file" is ambiguous - copying a file IS work done on a file.

Users don't set the "creation date"... the system identifies the inode modification date so that a user cannot hide the fact that the file/inode has been modified. Without that restriction anyone would be able to alter a log file... and hid the fact that the log file was altered. Root can in some cases, do that... (using a file system debugger is one - as that allows direct access to the filesystem without going through the system). It is also one reason using dd to make a copy of a filesystem doesn't always work. dd makes a copy all right - but since the filesystem metadata is not modified, some things that NEED to be modified (filesystem labels, UUIDs) don't get changed - thus causing failures on boot as the correct filesystem for a given mount can't necessarily be made.

Quote:

Hence, I've been using cp with the switch to preserve dates (-r? -p? I forgot, but it seems to work under the right circumstances).

I believe that can only keep modification date and access date, not inode modification date.

A tar file (or cpio) is the most reliable way to preserve the file - it does get the inode modification date, even if it can't restore it. It also has ways of storing other metadata (the extended attributes). It will not preserve the metadata required by a filesystem to maintain the storage of the data (that is irrelevant anyway).

Rygir · 05-12-2014, 12:17 PM

Quote:

Originally Posted by pan64

It was already mentioned, but probably missed: rsync will do that job for you

Quote:

Originally Posted by Rygir

Because the copying is taken care of, I'm only interested in validating that it did what I hoped it would do.

That said, I was interested in rsync but I haven't been able to get it to work. Supposedly my nas supports it, but I doubt my source machine supports it. I forgot what went wrong exactly but maybe I was just doing it wrong....

---

Rygir · 05-12-2014, 12:33 PM

Quote:

Originally Posted by jpollard

Consider the differences between ext2/ext3 and ext4. ... In other filesystems, no data was in the inode, just a pointer list. And the size of that pointer list would depend on the block size of the device.
... Even then, not all filesystems support the same set of ACLs... those not supported get dropped.

Fascinating information, thanks for explaining all this to me, it's nice to learn some details like these!

The good news is that these details are not the kind of details that I was aiming to copy. Well maybe the ACL but for my current application it doesn't matter.

Quote:

Well "work was done on a file" is ambiguous - copying a file IS work done on a file.

I meant from a user's perspective, as in when I move my car it's still the same car.

Or perhaps more ambiguously : while moving a car and driving to your location are roughly the same, only driving to a location is considered to really be "making use of your car", moving it to a different parking spot because someone else couldn't get out of their garage is just overhead.

So let me specify that when I mention meta data I'm referring to information belonging to the abstract concept of a file and that has nothing to do with the file system it is contained on. I guess this is still debatable but let's say I just want the file (and not inode) modification date, a creation date if it can be found and any alternate data streams if present, filename and the size of the file should be tested to make sure the copy was a success.

Quote:

Users don't set the "creation date"... the system identifies the inode modification date so that a user cannot hide the fact that the file/inode has been modified. Without that restriction anyone would be able to alter a log file... and hid the fact that the log file was altered.

How can you see this date? It's clearly not the file's modification date since you can just set that...

Quote:

It is also one reason using dd to make a copy of a filesystem doesn't always work. ...thus causing failures on boot as the correct filesystem for a given mount can't necessarily be made.

Again, fascinating things but I've used dd so far only to make identical copies with the idea to be able to replace a hard drive instantly so this behaviour was actually exactly what I wanted

.

Quote:

I believe that can only keep modification date and access date, not inode modification date.

That would be perfect, I don't want to confuse the file system by faking things it relies on.

Quote:

A tar file (or cpio) is the most reliable way to preserve the file - it does get the inode modification date, even if it can't restore it. It also has ways of storing other metadata (the extended attributes). It will not preserve the metadata required by a filesystem to maintain the storage of the data (that is irrelevant anyway).

I didnt' know tar stored all these details...but it doesn't change the problem that when I untar them to the destination I still want to check everything has been deployed correctly and the target file system accepted the "meta data"/file dates.