
havard 09-24-2011 11:31 AM

Ordered output from find?
Is there a simple way to get the output from find in order of last modified time?

Background: I would like to list the last 30 picture directories modified in the last year. Additionally, I do not want to list "container" directories. Given this structure, only the last directory should be included:


So far, I've got this:


find pictures -name "*.jpg" -mtime -365 -printf "%h\n" | uniq
It finds all the jpg files modified in the last year, but prints only the parent directories. As this will result in lots of duplicates, the uniq is necessary, and it's not very efficient. Furthermore, it will not enable me to sort by time in the next step.

I could add the modified time to the printf string, and still run sort/uniq over directory names, however I would then pick the time of an arbitrary file. This might still be acceptable, but maybe there is a better way?


find pictures -name "*.jpg" -mtime -365 -printf "%T+ %h\n" | sort -u -k 2 | sort | tail -30 | cut -f 2 -d ' '
The double call to sort is necessary here: the first call removes duplicate directories (in field 2), but cannot sort on a different field at the same time. The second sorts by the timestamp at the beginning of the line. Finally, only the last 30 lines are returned, and the now redundant timestamp is removed. Maybe I should go with awk instead?

grail 09-24-2011 01:07 PM

Why are you searching for jpg files when you are asking about directories? Use -type d along with your other criteria and this eliminates the uniq.

I believe this may help with your sorting issue as well.
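
A minimal sketch of that suggestion, assuming GNU find (for -printf) and GNU coreutils. The function name newest_dirs is made up for this illustration. Two caveats: a directory's modification time only changes when entries are created, removed or renamed inside it, not when an existing file is edited in place; and "container" directories are still listed, since -type d matches them too.

```shell
# List the most recently modified directories below a top directory.
# $1 = top directory (e.g. pictures), $2 = number of directories to list.
# Hypothetical helper; breaks on newlines in directory names.
newest_dirs() {
    find "$1" -type d -mtime -365 -printf '%T+ %p\n' |
        sort |                 # %T+ is a sortable date+time string
        tail -n "$2" |         # the newest entries end up last
        cut -d ' ' -f 2-       # drop the timestamp, keep the path
}

# Example: newest_dirs pictures 30
```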

Nominal Animal 09-24-2011 03:07 PM

I'm assuming you want to find directories that contain JPEG images modified in the last year, then output the thirty directories with the newest JPEG images in them. This is not exactly what you stated, but looking at your code I assume this is your goal.

This command will output the date of each JPEG file, together with the name of the parent directory. For simplicity, let us assume you do not have newlines in directory names.

find pictures -iname '*.jpg' -mtime -366 -printf '%TY%Tm%Td%TH%TM%TS %h/\n'
Note how the time is in a format which is suitable for text sorting.
Pipe the above to an awk script which outputs only the latest one for each directory.

awk '{ if ($2 in dirdate) {
          if ($1 > dirdate[$2])
              dirdate[$2] = $1
      } else
          dirdate[$2] = $1
     }
 END { for (dd in dirdate)
          printf("%s %s\n", dirdate[dd], dd)
     }'

Note that the record rule checks whether the directory (second field, $2) has already been seen, and only compares the timestamp (first field, $1) if so. Because all timestamps are "positive" (they compare above nothing), the check is not strictly necessary here, as raw comparisons would work just fine, but I recommend the practice. Always doing it this way means you never get problems with e.g. negative numbers in the values. We all copy-paste code.
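
To illustrate why the "in" check matters once values can be negative, here is a small sketch in plain POSIX awk. The function names max_guarded and max_unguarded are made up for this example; the input format ("value key" per line) mirrors the timestamp-and-directory lines above.

```shell
# Keep the largest value seen for each key (input lines: "value key").
max_guarded() {
    awk '{ if ($2 in best) { if ($1 > best[$2]) best[$2] = $1 }
           else best[$2] = $1 }
         END { for (k in best) printf("%s %s\n", best[k], k) }'
}

# Same rule without the "in" test: an unset best[$2] compares like zero,
# so a key whose values are all negative never has a value stored.
max_unguarded() {
    awk '{ if ($1 > best[$2]) best[$2] = $1 }
         END { for (k in best) printf("%s %s\n", best[k], k) }'
}

printf '%s\n' '-5 a' '-3 a' | max_guarded     # prints "-3 a"
printf '%s\n' '-5 a' '-3 a' | max_unguarded   # does not print "-3 a"
```

With timestamps both variants give the same answer, which is why the check is only a habit here rather than a requirement.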

Unfortunately standard/POSIX awk has no sort function, so you'll need to sort the output. (GNU awk, gawk, does, but I'll talk about that further below.)
In the general case, you can pipe the above to

sort -rg | head -n 30 | cut -d ' ' -f 2-
which sorts the output newest first, grabs the 30 newest directories, and then outputs just their names (cutting off the initial date-and-time field).

An alternate method is to use sort and uniq instead of the awk scriptlet. Sort the lines by directory name, newest first within each directory, so that uniq (which can skip the leading timestamp field) keeps only the newest line for each directory; then sort by time again. Like everything line-based, this only works if you do not have newlines in your directory names:

find pictures -iname '*.jpg' -mtime -366 -printf '%TY%Tm%Td%TH%TM%TS %h\n' | sort -k 2 -k 1,1r | uniq -f 1 | sort -r | head -n 30 | cut -d ' ' -f 2-
I for one have all kinds of characters in my directory names, so I don't recommend this alternate method, ever.

If you are using GNU awk, you can incorporate the sort (and of course also outputting only the desired number of directories), but you can also switch to a safe delimiter, '\0'. It is the only character (octet, or byte value) guaranteed to never occur in the middle of a file or directory name or a full path. (In a single directory or file name, '/' is also guaranteed to never occur, but it obviously occurs in paths since it is the separator there.)

The variant below will work with any file names, even those containing newlines or other special characters:

find pictures -iname '*.jpg' -mtime -366 -printf '%TY%Tm%Td%TH%TM%TS@%h/\0' | gawk '
  BEGIN { RS="\0" ; FS="\0" }
        { timestamp = $0 ; sub(/@.*$/, "", timestamp)
          directory = $0 ; sub(/^[^@]*@/, "", directory)
          if (directory in last) {
              if (timestamp > last[directory])
                  last[directory] = timestamp
          } else
              last[directory] = timestamp
        }
    END { # Merge timestamps and names back to a single list.
          n = split("", list)
          for (directory in last)
              list[n++] = last[directory] "@" directory

          # Sort the list; asort() renumbers it from 1 to n.
          n = asort(list)

          # Only output max. 30 latest directories, newest first.
          max = 30

          while (n > 0 && max-- > 0) {
              directory = list[n--] ; sub(/^[^@]*@/, "", directory)
              printf("%s\n", directory)
          }
        }'

Note that the at sign (@) I used between the time and the directory could be almost any character that does not occur in a numeric timestamp. Digits and decimal points are out, since some finds print fractional seconds, and some characters need escaping when used in a regular expression. I normally use a space, but this time I wanted the separator to be easy to spot, so that you can see where and how I split the time and the name.

For gawk 4.0.0 and later, PROCINFO["sorted_in"]="@val_str_desc" would yield the directories in the desired order simply using for (directory in last) ... , but it does not seem to work in gawk-3.1.7 at least. (I hate it when documentation does not mention when a feature has been added. GNU awk and GNU bash are the worst cases I know; they keep adding new features, but neglect to mention which version each feature requires to work.)
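
For reference, a rough sketch of what the PROCINFO["sorted_in"] approach might look like. It needs gawk 4.0 or later; the function name newest_first and the simplified input format ("timestamp directory" per line, no whitespace in the directory name) are assumptions made for this example only.

```shell
# Print the directories with the newest timestamps, newest first.
# $1 = maximum number of directories to print. Requires gawk >= 4.0.
newest_first() {
    gawk -v max="$1" '
        # Remember only the latest timestamp for each directory.
        { if (!($2 in last) || $1 > last[$2]) last[$2] = $1 }
    END { # Traverse the array by value, compared as strings, descending.
          PROCINFO["sorted_in"] = "@val_str_desc"
          for (directory in last) {
              if (max-- <= 0) break
              print directory
          }
        }'
}

# Example: printf '%s\n' '20110101 a' '20110301 b' | newest_first 30
```

This replaces the merge, asort() and countdown steps in the END rule, at the cost of not working on older gawk versions.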

To avoid that issue, I construct a new array similar to the original input (but with only the latest timestamp for each directory). I sort that, and pick only the 30 last entries, and that's it.

If this was an important utility, or there were millions of directories, I would open-code a version of Quicksort that only sorts (recurses into) the desired entries. It would save both CPU time and memory -- but for normal cases, up to several tens of thousands of directories (the number of files is practically irrelevant), this is quite sufficient. In most cases it should be I/O-bound anyway; it does not make sense to try to make it faster, since it already works as fast as your filesystem can provide data. The stuff in the END rule takes only a fraction of a second, and is lost in the noise.

I hope you find this useful and informative,

grail 09-25-2011 02:48 AM


I hate it when documentation does not mention when a feature has been added.
Agree :)
