LinuxQuestions.org
Old 09-24-2011, 11:31 AM   #1
havard
Ordered output from find?


Is there a simple way to get the output from find in order of last modified time?

Background: I would like to list the last 30 picture directories modified in the last year. Additionally, I do not want to list "container" directories. Given this structure, only the last directory should be included:

pictures
pictures/2011
pictures/2011/holiday_day1

So far, I've got this:

Code:
find pictures -name "*.jpg" -mtime -365 -printf "%h\n" | uniq
It finds all the jpg files modified in the last year, but prints only the parent directories. As this will result in lots of duplicates, the uniq is necessary, and it's not very efficient. Furthermore, it will not enable me to sort by time in the next step.

I could add the modified time to the printf string, and still run sort/uniq over directory names, however I would then pick the time of an arbitrary file. This might still be acceptable, but maybe there is a better way?

Code:
find pictures -name "*.jpg" -mtime -365 -printf "%T+ %h\n" | sort -u -k 2 | sort | tail -30 | cut -f 2 -d ' '
The double call to sort is necessary here: the first removes duplicate directories (in field 2) but cannot sort on a different field at the same time, and the second sorts by the timestamp at the beginning of the line. Finally, only the last 30 lines are returned, and the now redundant timestamp is removed. Maybe I should go with awk instead?
 
Old 09-24-2011, 01:07 PM   #2
grail
Why are you searching for jpg files when you are asking about directories? Use -type d along with your other criteria and this eliminates the uniq.

I believe this may help with your sorting issue as well.
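A minimal sketch of what this suggestion might look like, assuming GNU find. The function name newest_dirs is made up for illustration. Two caveats apply: a directory's own modification time only changes when entries are added, removed, or renamed in it (not when an existing file is edited in place), and -type d on its own also lists the "container" directories.

```shell
# newest_dirs ROOT [COUNT]: list directories under ROOT that were modified
# within the last year, newest first, using each directory's own mtime.
# Assumes GNU find (-printf) and no newlines in directory names.
newest_dirs() {
    find "$1" -type d -mtime -365 -printf '%T@ %p\n' |  # epoch seconds, path
        sort -rn |                                      # newest first
        head -n "${2:-30}" |                            # default: 30 entries
        cut -d ' ' -f 2-                                # drop the timestamp
}
```

Called as newest_dirs pictures 30, it prints one directory per line.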
 
Old 09-24-2011, 03:07 PM   #3
Nominal Animal
I'm assuming you want to find directories that contain JPEG images modified in the last year, then output the thirty directories with the newest JPEG images in them. This is not exactly what you stated, but looking at your code I assume this is your goal.

This command will output the date of each JPEG file, together with the name of the parent directory. For simplicity, let us assume you do not have newlines in directory names.
Code:
find pictures -iname '*.jpg' -mtime -366 -printf '%TY%Tm%Td%TH%TM%TS %h/\n'
Note how the time is in a format which is suitable for text sorting.
Pipe the above to an awk script which outputs only the latest one for each directory.
Code:
awk '{ dir = substr($0, length($1) + 2)   # everything after the timestamp
       if (dir in dirdate) {
           if ($1 > dirdate[dir])
               dirdate[dir] = $1
       } else
           dirdate[dir] = $1
     }
 END { for (dd in dirdate)
           printf("%s %s\n", dirdate[dd], dd)
     }'
Note that the record rule takes everything after the timestamp as the directory name (so names containing spaces stay intact), checks if that directory has already been seen, and only compares the timestamp (first field, $1) if so. Because all timestamps are "positive" (will compare above nothing), the check is not strictly necessary here, as raw comparisons would work just fine, but I recommend the practice. Always doing it this way means you don't ever get problems with e.g. negative numbers in the values. We all copy-paste code.
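The point about negative values can be seen in a small sketch. Both function names are made up for this illustration; each reads "value key" lines and is meant to print the largest value per key. The naive version compares against an array element that does not exist yet, which behaves like zero, so it never stores a negative value.

```shell
# max_naive: compares against a possibly uninitialised array element.
max_naive() {
    awk '{ if ($1 > best[$2]) best[$2] = $1 }
     END { for (k in best) print best[k], k }'
}

# max_checked: stores the first value seen for a key unconditionally,
# and only compares once the key is already known.
max_checked() {
    awk '{ if (!($2 in best) || $1 > best[$2]) best[$2] = $1 }
     END { for (k in best) print best[k], k }'
}
```

Feeding the two lines "-5 a" and "-3 a" to max_checked prints "-3 a", while max_naive never records either value.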

Unfortunately standard/POSIX awk has no sort function, so you'll need to sort the output. (GNU awk, gawk, does, but I'll talk about that further below.)
In the general case, you can pipe the above to
Code:
sort -rg | head -n 30 | cut -d ' ' -f 2-
which sorts the output newest first, grabs the 30 newest directories, and then outputs just their names (cutting off the initial date-and-time field).
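Put together, the three steps might look like the sketch below. The function name latest_picture_dirs and its arguments are hypothetical, GNU find is assumed, and the awk part treats everything after the timestamp as the directory name so that names with spaces survive. Newlines in directory names are still not handled.

```shell
# latest_picture_dirs ROOT [COUNT]: print the COUNT (default 30) directories
# under ROOT holding the most recently modified JPEG files, newest first.
latest_picture_dirs() {
    find "$1" -iname '*.jpg' -mtime -366 \
         -printf '%TY%Tm%Td%TH%TM%TS %h/\n' |
    awk '{ dir = substr($0, length($1) + 2)   # text after the timestamp
           if (!(dir in newest) || $1 > newest[dir])
               newest[dir] = $1               # keep the latest per directory
         }
     END { for (dir in newest)
               printf("%s %s\n", newest[dir], dir)
         }' |
    sort -rg |              # newest first
    head -n "${2:-30}" |    # limit the number of directories
    cut -d ' ' -f 2-        # drop the timestamp field
}
```

For example, latest_picture_dirs pictures 30 would cover the case from the first post.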

An alternate method is to use sort and uniq instead of the awk scriptlet. Keep the timestamp first, sort by directory name with the newest timestamp first within each directory, and let uniq skip the first (timestamp) field, so that it compares only the directory names and keeps the first, newest, line for each directory. A final sort puts the newest directories on top. Being line-based, this still fails if you have newlines in your directory names:
Code:
find pictures -iname '*.jpg' -mtime -366 -printf '%TY%Tm%Td%TH%TM%TS %h/\n' | sort -k 2 -k 1,1r | uniq -f 1 | sort -r | head -n 30 | cut -d ' ' -f 2-
I for one have all kinds of characters in my directory names, so I don't recommend this alternate method, ever.

If you are using GNU awk, you can incorporate the sort (and of course also output only the desired number of directories), and you can also switch to a safe delimiter, '\0'. It is the only character (octet, or byte value) guaranteed to never occur in the middle of a file or directory name or a full path. (In a single directory or file name, '/' is also guaranteed to never occur, but it obviously occurs in paths since it is the separator there.)

The variant below will work with any file names, even those containing newlines or other special characters:
Code:
find pictures -iname '*.jpg' -mtime -366 -printf '%TY%Tm%Td%TH%TM%TS@%h/\0' | gawk '
  BEGIN { RS="\0" ; FS="\0" }
        { timestamp = $0 ; sub(/@.*$/, "", timestamp)
          directory = $0 ; sub(/^[^@]*@/, "", directory)
          if (directory in last) {
              if (timestamp > last[directory])
                  last[directory] = timestamp
          } else
              last[directory] = timestamp
        }
    END { # Merge timestamps and names back to a single list.
          n = split("", list)
          for (directory in last)
              list[++n] = last[directory] "@" directory

          # Sort the list; asort() renumbers it from 1 to n.
          n = asort(list)

          # Only output max. 30 latest directories.
          max = 30

          while (n > 0 && max-- > 0) {
              timestamp = list[n] ; sub(/@.*$/, "", timestamp)
              directory = list[n] ; sub(/^[^@]*@/, "", directory)
              printf("%s\n", directory)
              n--
          }
        }'
Note that the at sign (@) I used between the time and the directory could be basically anything that does not occur in a numeric timestamp. Digits and decimal points are out, since some versions of find print fractional seconds, and some characters need escaping when used in a regular expression. I normally use a space, but this time I wanted it to be easily seen so that you can see where and how I split the time and the name.

For gawk 4.0.0 and later, PROCINFO["sorted_in"]="@val_str_desc" would yield the directories in the desired order simply using for (directory in last) ... , but it does not seem to work in gawk-3.1.7 at least. (I hate it when documentation does not mention when a feature has been added. GNU awk and GNU bash are the worst cases I know; they keep adding new features, but neglect to mention which version each feature requires to work.)
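For those who do have gawk 4.0.0 or later, a sketch of the PROCINFO["sorted_in"] approach could look like this. The function name latest_dirs_gawk4 and the max variable are made up for illustration, and GNU find is assumed. Output is still one directory per line.

```shell
# latest_dirs_gawk4 ROOT [COUNT]: like the asort() variant, but lets gawk
# (4.0.0 or later) walk the array by value, newest timestamp first.
latest_dirs_gawk4() {
    find "$1" -iname '*.jpg' -mtime -366 \
         -printf '%TY%Tm%Td%TH%TM%TS@%h/\0' |
    gawk -v max="${2:-30}" '
      BEGIN { RS = "\0" }
            { timestamp = $0 ; sub(/@.*$/, "", timestamp)
              directory = $0 ; sub(/^[^@]*@/, "", directory)
              if (!(directory in last) || timestamp > last[directory])
                  last[directory] = timestamp
            }
        END { # Traverse by value, compared as strings, in descending order.
              PROCINFO["sorted_in"] = "@val_str_desc"
              for (directory in last) {
                  if (max-- <= 0)
                      break
                  printf("%s\n", directory)
              }
            }'
}
```

With an older gawk the PROCINFO setting is silently ignored, so the loop order is arbitrary there.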

To avoid that issue, I construct a new array similar to the original input (but with only the latest timestamp for each directory). I sort that, and pick only the 30 last entries, and that's it.

If this was an important utility, or there were millions of directories, I would open-code a version of Quicksort that only sorts (recurses into) the desired entries. It would save both CPU time and memory -- but for normal cases, up to several tens of thousands of directories (the number of files is practically irrelevant), this is quite sufficient. In most cases it should be I/O-bound anyway; it does not make sense to try to make it faster, since it already works as fast as your filesystem can provide data. The stuff in the END rule takes only a fraction of a second, and is lost in the noise.
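As a rough illustration of avoiding the full sort, here is a simpler technique than the partial Quicksort described above: a small buffer that only ever holds the N newest lines, kept in order by insertion. The function name top_n is hypothetical, and the input is assumed to be "timestamp name" lines with fixed-width timestamps, such as the output of the first awk scriptlet.

```shell
# top_n N: read "timestamp name" lines and print the N lines with the
# largest timestamps, newest first, keeping at most N lines in memory.
top_n() {
    awk -v max="$1" '
        { # Find this line its place in the descending buffer top[1..count].
          pos = count + 1
          while (pos > 1 && $0 > top[pos - 1])
              pos--
          if (pos > max)
              next                      # older than everything we keep
          if (count < max)
              count++
          for (i = count; i > pos; i--) # shift older entries down by one
              top[i] = top[i - 1]
          top[pos] = $0
        }
    END { for (i = 1; i <= count; i++)
              print top[i]
        }'
}
```

It would replace the sort and head steps, for example ... | top_n 30 | cut -d ' ' -f 2-, at a cost of at most N comparisons per input line.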

I hope you find this useful and informative,
 
Old 09-25-2011, 02:48 AM   #4
grail
Quote:
I hate it when documentation does not mention when a feature has been added.
Agree
 
  

