LinuxQuestions.org


simba_cubs 10-05-2006 04:54 AM

Help! Need to find certain files from 47,000
 
Hi,

I need to find all files between 08:00 - 12:00 yesterday morning.
The files are emails and the directories are named in date format, so each day a new directory is created and named, for example, 20061004.

In the 20061004 directory there are 47,000 files. I need to extract all files that contain a user's name from those 47,000.

I've tried the following...

grep "tom.thumb" *

This returns a "bash: /bin/grep: Argument list too long" error.

I then tried

find * -newer 8amoct4_06 ! 11_59amoct4_06 -print

That returned the following "bash: /usr/bin/find: Argument list too long" error.

Could someone tell me where I'm going wrong please.

Many thanks

budword 10-05-2006 06:06 AM

I might be wrong, but I think the find command just finds files by name. If you are looking for a certain string (a person's name) inside the files, find on its own won't help.

Check out the bottom of this page. http://www.computerhope.com/unix/ugrep.htm

Looks like the following command might work
grep -ir tom.thumb .

Let me know if that helps....

Best of luck

David

kstan 10-05-2006 06:20 AM

Have you tried something like this?
$ grep -rn "tom.thumb" .
to get all the file names?

Wim Sturkenboom 10-05-2006 06:40 AM

The problem with the grep pattern is that the dot is interpreted as a special character (it matches any single character), so it needs to be escaped.
Code:

grep "tom\.thumb" *
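The difference the backslash makes can be seen on a tiny sample. This is only a sketch: the temporary file and its two lines are made up for illustration, and the -F variant is an alternative to escaping that was not mentioned in the post.

```shell
# Scratch file with one line containing a literal dot and one without.
sample=$(mktemp)
printf 'tom.thumb\ntomXthumb\n' > "$sample"

# Unescaped dot matches any single character, so both lines count.
unescaped=$(grep -c 'tom.thumb' "$sample")

# Escaped dot matches only a literal dot. Single quotes keep the shell
# from removing the backslash before grep sees it.
escaped=$(grep -c 'tom\.thumb' "$sample")

# -F treats the pattern as a fixed string, so no escaping is needed.
fixed=$(grep -cF 'tom.thumb' "$sample")

echo "$unescaped $escaped $fixed"   # prints: 2 1 1
```

Note that escaping only changes what matches. It does not affect the "Argument list too long" error.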

olaola 10-05-2006 07:36 AM

The problem is the number of files you are exploring (Argument list too long).
By using the "*" you are passing a list of files to the command (grep, find or anything else), because the shell expands it into every file name in the directory. When this list is too long you get an error.

Try to restrict the list using something like "A*"...
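Another way around the limit, sketched here on a throwaway directory, is to let find walk the directory itself and hand the names to grep in batches that fit. The directory, file names and addresses below are invented for the demonstration; `getconf ARG_MAX` and the `{} +` form of -exec are standard, but check your own system's man pages.

```shell
# Throwaway directory standing in for the real mail directory.
demo=$(mktemp -d)
printf 'To: tom.thumb@example.com\n' > "$demo/msg1"
printf 'To: someone.else@example.com\n' > "$demo/msg2"

# Upper limit, in bytes, on the arguments plus environment of one command.
getconf ARG_MAX

# No "*" here, so the shell never builds a huge argument list.
# "{} +" makes find pass as many names per grep run as will fit.
matches=$(find "$demo" -type f -exec grep -H 'tom\.thumb' {} +)
echo "$matches"
```

With 47,000 files this typically runs grep only a handful of times instead of once per file.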

simba_cubs 10-05-2006 07:51 AM

Many thanks to you all for your quick response - much appreciated.

budword 10-05-2006 10:13 AM

The * at the end just tells grep to conduct the regex search at the current directory. It's not a globbing or regex wildcard in that context. I left the . in the tom.thumb regex because I thought it was supposed to be in there, to help find all the instances of tom?thumb. Please correct me if I got anything wrong.

Thanks much....

David

stress_junkie 10-05-2006 11:19 AM

The find command can be used to select the files based on time. This list can then be fed to the grep command to find the files that contain the character string. The problem with the find command as it is written in the initial post is that there is a * following the command. The first term following the find command is the directory to search. The following example expects 8amoct4_06 to be a file that exists, not just a file specification. Try this.
Code:

find 20061004 -newer 20061004/8amoct4_06 -a ! -newer 20061004/11_59amoct4_06 -exec grep -H tom.thumb {} \;
The part of the command that starts with -exec is where we feed the output of the find command to the grep command. The -H in the grep command tells grep to print the name of the file in front of each matching line. If you want this list in a file then you can redirect the output using the > operator, as in > result.txt.
Code:

find 20061004 -newer 20061004/8amoct4_06 -a ! -newer 20061004/11_59amoct4_06 -exec grep -H tom.thumb {} \; > result.txt

simba_cubs 10-06-2006 03:40 AM

The filenames appear as "1GUy9T-0006gz-Ne-H"....

Sorry I may not have been clear. I thought I would be able to search based on the time stamp on the file
ls -la of the directory...

-rw-rw---- 1 Debian-exim Debian-exim 2979 2006-10-04 05:14 1GUy9T-0006gz-Ne-H

I was under the impression I would be able to search against the 2006-10-04 05:14 ?

Sorry if I was or am unclear I am fairly new to linux

Thanks for your help so far

stress_junkie 10-06-2006 01:13 PM

Let's go back to your first post and see what we can do. Don't get discouraged. I'm not being critical. I'm just trying to summarize what has been said so far.
Quote:

Originally Posted by simba_cubs
I need to find all files between 08:00 - 12:00 yesterday morning.

I understand this to mean that you want to list all of the files that arrived between 08:00 and noon on October 4, 2006.
Quote:

Originally Posted by simba_cubs
The files are emails and the directories are named in date format, so each day a new directory is created and named, for example, 20061004.

I understand this to mean that a new directory is created every day. The name of the directory is the date of that day. The name of the directory for October 4, 2006 is 20061004.
Quote:

Originally Posted by simba_cubs
In the 20061004 directory there are 47,000 files. I need to extract all files that contain a user's name from those 47,000.

I understand this to mean that you want to LIST all of the files that contain the user's name.
Quote:

Originally Posted by simba_cubs
I've tried the following...
grep "tom.thumb" *
This returns a "bash: /bin/grep: Argument list too long" error.

Wim Sturkenboom explained in post #4 that the dot is a special character and you need to put a backslash in front of it when you want to match a literal dot in a regular expression. But that isn't the reason that you got the error message.

olaola explained in post #5 that using the wildcard character * resulted in too many file names being passed to the grep command. That is the reason that you got the error message.
Quote:

Originally Posted by simba_cubs
I then tried
find * -newer 8amoct4_06 ! 11_59amoct4_06 -print
That returned the following "bash: /usr/bin/find: Argument list too long" error.

In post #8 I explained that the first argument in the find command should be the directory to search. Putting a * there was a mistake, because the shell expands it into the names of all 47,000 files before find even starts. Then I showed how the find command would accept the name of the directory that you want to search. In this case the directory name is 20061004. So I started the find command as "find 20061004".

Quote:

Originally Posted by simba_cubs
Could someone tell me where I'm going wrong please.
Many thanks

At this point your request is satisfied. You have been told where you have gone wrong.
=====
Your last post introduced new information.
Quote:

Originally Posted by simba_cubs
The filenames appear as "1GUy9T-0006gz-Ne-H"....

Okay. You could have adapted what you have already been told to accommodate this file name format.
Quote:

Originally Posted by simba_cubs
Sorry I may not have been clear. I thought I would be able to search based on the time stamp on the file
ls -la of the directory...
-rw-rw---- 1 Debian-exim Debian-exim 2979 2006-10-04 05:14 1GUy9T-0006gz-Ne-H
I was under the impression I would be able to search against the 2006-10-04 05:14 ?

You can search based on the last access time or the last modification time of a file but not in the form that you see when you list files. The system keeps the dates and times of files in a different format. You cannot search on the date in the form of a text string. Well, not directly.
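To illustrate the "different format": the modification time is stored as a count of seconds, and ls merely formats it for display. The sketch below assumes GNU coreutils (the `stat -c` option is a GNU extension) and uses a scratch file, not one of the real emails.

```shell
# Scratch file given the same time stamp as the ls listing above.
f=$(mktemp)
touch -t 200610040514 "$f"     # CCYYMMDDhhmm, local time

# %Y is the stored value (seconds since 1970-01-01 00:00 UTC),
# %y is the same instant formatted for humans.
epoch=$(stat -c %Y "$f")
stat -c '%Y  %y' "$f"
```

Tools such as find compare these numbers, which is why a text pattern like "2006-10-04 05:14" cannot be matched against them directly.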
Quote:

Originally Posted by simba_cubs
Sorry if I was or am unclear I am fairly new to linux

I don't think that you were unclear. I think that once you got the answer to your question your concept of the question changed.
Quote:

Originally Posted by simba_cubs
Thanks for your help so far

Everybody here is very happy to help, especially new Linux users and admins. We all want your experience with Linux to be positive and enjoyable.
=====
One of the problems with the find command is that it doesn't have an argument that just says "after 08:00 and before 12:00". Nevertheless, you need the find command in order to pass file names to the grep command one at a time. If you just try to use the grep command and pass all of the file names to it in one command you will pass too many file names at one time, as you already know. So let's look at how to build a find command that will do the job.

First we know that we need to use the grep command to search the contents of the email files for the user name tom.thumb. The -H parameter of the grep command tells grep to print the name of the file in front of each line that contains the search string.
In the following examples I will use question marks to indicate something that we don't know yet. Also, I put the regular expression in single quotes so that the shell passes the backslash through to grep. Written without quotes, tom\.thumb reaches grep as tom.thumb and the dot matches any character again.
Code:

grep -H 'tom\.thumb' ?????
Second, we know that we need to use the find command to pass file names one at a time to the grep command.
Code:

find ?????????? -exec grep -H 'tom\.thumb' {} \;
The first parameter to the find command is the directory to search. In this case it is the 20061004 directory.
Code:

find 20061004 ?????????? -exec grep -H 'tom\.thumb' {} \;
We could take out the question marks and run the find command as it is.
Code:

find 20061004  -exec grep -H 'tom\.thumb' {} \;
That would do what you originally said that you wanted to do using just the grep command. However the output is a bit messy because it will include both the file name and the line that the search string is found in. We can make the output easier to read using the cut command, which keeps only the part of each line before the first colon.
Code:

find 20061004  -exec grep -H 'tom\.thumb' {} \; | cut -d ":" -f 1
Now that's a sweet looking output. If you want those file names in a text file you can redirect the output to a file as follows.
Code:

find 20061004  -exec grep -H 'tom\.thumb' {} \; | cut -d ":" -f 1 > tom-thumb-emails.txt
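One thing to watch with the cut approach is that a file with several matching lines is listed several times. A possible variation, shown here on made-up files in a temporary directory, is grep's -l option, which prints each matching file name once. This is a sketch for comparison, not a replacement for the commands above.

```shell
# Throwaway directory: msg1 mentions the user twice, msg2 not at all.
demo=$(mktemp -d)
printf 'From: tom.thumb@example.com\nCc: tom.thumb@example.com\n' > "$demo/msg1"
printf 'From: someone.else@example.com\n' > "$demo/msg2"

# -H plus cut: one output line per matching line, so msg1 shows up twice.
with_cut=$(find "$demo" -type f -exec grep -H 'tom\.thumb' {} \; | cut -d ":" -f 1)

# -l: one output line per matching file.
with_l=$(find "$demo" -type f -exec grep -l 'tom\.thumb' {} \;)

printf '%s\n' "$with_cut"
printf '%s\n' "$with_l"
```

Piping the cut output through `sort -u` would remove the duplicates as well.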
Once you want to select the files that arrived between 08:00 and 12:00 noon we start to find the deficiencies of the find command. The find command does not have very many parameters that test the date and time of files. We have to do a bit of work and find logical conditions that satisfy the time requirement while using the poor selection of parameters available in the find command. The man page of the find command tells us that none of the available parameters tests the creation time of the file. Unfortunately Linux and Unix don't keep track of the creation time of files; just the last time that they were accessed and the last time that they were modified.

If these files had the time that they arrived as a string inside the email then we could use grep to search for that. If the arrival time of the emails was included in the email file name we could use that to select the proper files. Unfortunately neither of these conditions is true. The emails will have the time that they were sent inside the email, but not the time that they arrived. The file names, as you have shown, do not include the arrival time in any format.
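If the modification time shown by ls is close enough to the arrival time (an assumption, since nothing in this thread confirms it), the -newer test from post #8 can be made to work by creating two marker files with touch -t. The sketch below builds its own small test directory. The file names early, inside, late, start and end are invented, and on the real system the first argument to find would be 20061004.

```shell
maildir=$(mktemp -d)   # stand-in for the 20061004 directory
markers=$(mktemp -d)   # markers kept outside the searched directory

# Three fake messages modified before, inside and after the window.
for name in early inside late; do
    printf 'To: tom.thumb@example.com\n' > "$maildir/$name"
done
touch -t 200610040514 "$maildir/early"
touch -t 200610040930 "$maildir/inside"
touch -t 200610041305 "$maildir/late"

# Marker files stamped 08:00 and 12:00 on 4 October 2006 (CCYYMMDDhhmm).
touch -t 200610040800 "$markers/start"
touch -t 200610041200 "$markers/end"

# -newer start: modified after 08:00. ! -newer end: not modified after 12:00.
found=$(find "$maildir" -type f -newer "$markers/start" ! -newer "$markers/end" \
        -exec grep -l 'tom\.thumb' {} \;)
echo "$found"
```

Only the file stamped 09:30 is listed, because it is the only one that falls between the two markers.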

Try the last example of the find command with the pretty output and see if that does what you need it to do. Write back and append more posts to this thread if you want more help. I will be watching this thread for a few days. The other posters might also be watching this thread.


All times are GMT -5.