How to grep the first occurrence of a date string in a file and print the file name and directory
Hello, what would be the grep command that would achieve the following task?
Suppose that I am in the /home/log directory, which has hundreds of subfolders. In one of them, say in /home/log/000020/a00001.txt, we have a text string that looks like
<WEB-DATETIME>20040316142929
And as another example, in /home/log/000020/b00009.txt
we have the text string that looks like
<WEB-DATETIME>20040317112020
I want to use grep on all txt files under /home/log to produce an output file, /home/alllog.txt, with two columns separated by ':': A. the location of the txt file, and B. the number string, always 14 digits long, that immediately follows the first occurrence of the <WEB-DATETIME> string, so that alllog.txt would look like:
I'd start with grep before escalating to awk or perl. GNU grep has substantial support for Perl-style regular expressions. The -o, -r, and -P options will be of use to you here. Maybe -m too.
See
Code:
man grep
man perlre
In the PCRE, you'll want to look at using a look-behind zero-width assertion (?<= ... ) to prevent grep from printing that part of the pattern.
So give it a try and show your code if you do get stuck and need tips.
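To illustrate what the look-behind assertion does, here is a minimal sketch on a single line of input. It assumes GNU grep built with -P (PCRE) support; the variable names are only for illustration.

```shell
# One sample line in the format described in the first post
line='<WEB-DATETIME>20040316142929'

# Without a look-behind, -o prints the tag together with the digits
with_tag=$(printf '%s\n' "$line" | grep -oP '<WEB-DATETIME>[0-9]{14}')

# With (?<= ... ), the tag must precede the match but is not part of it
digits_only=$(printf '%s\n' "$line" | grep -oP '(?<=<WEB-DATETIME>)[0-9]{14}')

printf '%s\n%s\n' "$with_tag" "$digits_only"
```

The first result still carries the tag, while the second is only the 14 digits.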
once again, i have to disagree. this looks a lot like XML to me (i bet it's followed by </WEB-DATETIME>, isn't it?).
it would be a lot easier to use a tool that parses XML.
many choices, but i have experience with xmllint, e.g. like this:
I agree with the disagreement
If it is just non-XML text, then grep is easiest.
However, if the files contain XML then it is far more appropriate to search them with a tool that properly parses the XML. Anything that handles XPath is good and xmllint is a good one for that.
Hi, sorry it's not XML.
What I got so far is this:
LANG=C grep -rl ".*<WEB-DATETIME>[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]"
But I have no idea how to proceed from here to achieve the task laid out in the first post. I was thinking of piping this into awk with the field separator '>', but the task demands that the 14-digit string and the file name (with its directory) be displayed side by side. So, to my thinking, the pipe into awk won't work.
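One possible way around this, sketched here under assumptions, is to drop -l so that recursive grep prefixes every matching line with "filename:", and then let sed trim the line down to the digits. The sample tree below is hypothetical and stands in for /home/log; the sketch also assumes no ':' in the path names and only one <WEB-DATETIME> tag per line.

```shell
# Hypothetical sample tree standing in for /home/log
logdir=$(mktemp -d)
mkdir -p "$logdir/000020"
printf 'x\n<WEB-DATETIME>20040316142929\n<WEB-DATETIME>20040401000000\n' \
  > "$logdir/000020/a00001.txt"

# -r recurses, -m 1 keeps only the first matching line of each file.
# sed then replaces everything after the first ':' with the 14 digits.
paired=$(LANG=C grep -r -m 1 --include='*.txt' '<WEB-DATETIME>[0-9]\{14\}' "$logdir" |
  sed 's/:.*<WEB-DATETIME>\([0-9]\{14\}\).*/:\1/')

printf '%s\n' "$paired"
rm -rf "$logdir"
```

Each output line then has the form path:digits, so the file name and the date string end up side by side.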
As mentioned, you will probably benefit from the -o option to print the matched pattern and the -P option on grep to give you Perl regular expressions so you can use a look-behind zero-width assertion to prevent grep from printing that part of the pattern.
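Put together, those options might look like the following sketch. It assumes GNU grep with -P support; the temporary directory and its files are hypothetical stand-ins for /home/log, and sort is added only to make the output order predictable.

```shell
# Hypothetical sample tree standing in for /home/log
logdir=$(mktemp -d)
mkdir -p "$logdir/000020"
printf 'header\n<WEB-DATETIME>20040316142929\n<WEB-DATETIME>20040399999999\n' \
  > "$logdir/000020/a00001.txt"
printf '<WEB-DATETIME>20040317112020\n' > "$logdir/000020/b00009.txt"

# -r  recurse into subdirectories (file names are printed as "name:")
# -m 1  stop reading each file after its first matching line
# -o  print only the matched text, -P  use Perl-compatible regex
result=$(LC_ALL=C grep -r -m 1 -oP --include='*.txt' \
  '(?<=<WEB-DATETIME>)[0-9]{14}' "$logdir" | sort)

printf '%s\n' "$result"
rm -rf "$logdir"
```

For the real task, the same grep would be pointed at /home/log with its output redirected to /home/alllog.txt. Note that -m counts matching lines, not matches, so a line holding two tags would still yield two output lines for that file.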
Thanks for the help.
I read man sed, but couldn't understand this part. I won't call myself an expert in vi, but I always thought sed uses vi commands, and this looks nothing like anything in vi. For example, what is /|? I know that it strips the directory part of the path and keeps only the file name, but I can't understand why it works.
sed 's|^.*/||'
This is interesting. Would you mind explaining the meaning of 's|^.*/||'?
It's easier to read that way. s#/some/path/## versus s/\/some\/path\///
Most commonly you see the s command with slashes, as in s/// However, in sed the s command only needs the same delimiter character used three times to separate the search pattern, the replacement text, and the flags. So, at least in sed, the following are functionally the same:
Code:
echo abcdabcdcdbadbccd | sed 's/b/X/g'
echo abcdabcdcdbadbccd | sed 's!b!X!g'
echo abcdabcdcdbadbccd | sed 's#b#X#g'
echo abcdabcdcdbadbccd | sed 's|b|X|g'
echo abcdabcdcdbadbccd | sed 'sabaXag'
Because a path has slashes / the substitution is easier to read if another symbol is used instead. That way the slashes in the pattern don't have to be escaped.
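Applied to the command asked about, a small sketch with one of the sample paths from the first post shows that the pipe-delimited form and the slash-delimited form with an escaped slash give the same result. The variable names are only for illustration.

```shell
path='/home/log/000020/a00001.txt'

# | as delimiter: the / inside the pattern needs no escaping
with_pipes=$(printf '%s\n' "$path" | sed 's|^.*/||')

# / as delimiter: the / inside the pattern must be written as \/
with_slashes=$(printf '%s\n' "$path" | sed 's/^.*\///')

printf '%s\n%s\n' "$with_pipes" "$with_slashes"
```

Both commands delete everything up to and including the last slash, leaving only the file name.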
Some other languages also allow some flexibility with the s command's delimiters.
There are a number of different styles of regex. That complicates things. The lowest common denominator is probably POSIX.
But in all of them, the caret ^ anchors the pattern search to the beginning of the string. You could say that it matches the invisible beginning of the string. It's probably not needed in the example above because matching starts at the leftmost possible position, and a leading .* can always match from the beginning of the line. The $ anchors the pattern to the end of the string. See "man 7 regex"
Code:
echo abcdabcdcdbadbccd | sed 's/b.*/X/g'; # replace from b onwards with a single X
echo abcdabcdcdbadbccd | sed 's/^b.*/X/g'; # same but only if the line starts with a b
echo abcdabcdcdbadbccd | sed 's/.*d$/X/g'; # replace the whole line with an X if it ends with d
echo abcdabcdcdbadbccd | sed 's/.*c$/X/g'; # no change: the line does not end with c
echo abcdabcdcdbadbccd | sed 's/^$/X/g'; # no change: the line is not empty
echo | sed 's/^$/X/g'; # replace an empty line with an X
Because * is "greedy", that is, it matches as much as possible, a ^ in front of a leading .* or a $ after a trailing .* is often not needed.
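A short sketch of that point, using one of the sample paths from the first post and a made-up string: with a leading .* the anchor changes nothing, while without .* it does change the result.

```shell
path='/home/log/000020/a00001.txt'

# A leading .* is greedy and starts at the leftmost position,
# so the result is the same with or without ^
anchored=$(printf '%s\n' "$path" | sed 's|^.*/||')
unanchored=$(printf '%s\n' "$path" | sed 's|.*/||')

# When the pattern does not begin with .*, the anchor does matter
no_anchor=$(printf 'xbcbd\n' | sed 's/b.*/X/')     # matches from the first b
with_anchor=$(printf 'xbcbd\n' | sed 's/^b.*/X/')  # line starts with x: no match

printf '%s\n' "$anchored" "$unanchored" "$no_anchor" "$with_anchor"
```

The first two results are identical, while the last two differ because ^ rejects a line that does not start with b.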