LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 10-20-2017, 12:16 AM   #1
onthetopo
LQ Newbie
 
Registered: May 2017
Posts: 11

Rep: Reputation: Disabled
How to grep the first occurrence of a date string in a file and print the file name and directory


Hello, what would be the grep command that would achieve the following task?

Suppose that I am in the /home/log directory, which has hundreds of subfolders. In one of them, say in /home/log/000020/a00001.txt, we have a text string that looks like
<WEB-DATETIME>20040316142929

And as another example, in /home/log/000020/b00009.txt
we have the text string that looks like
<WEB-DATETIME>20040317112020

I want to use grep on all txt files under /home/log to write a file, /home/alllog.txt, that has two columns separated by the field separator ':': A. the location of the txt file, and B. the number string, always 14 digits long, that immediately follows the first occurrence of the <WEB-DATETIME> string, so that alllog.txt would look like:

000020/a00001: 20040316142929
000020/b00009: 20040317112020
etc

Many thanks!
 
Old 10-20-2017, 12:21 AM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
You will need something a bit more like a (scripting) language than grep.
Try awk, perl, python, Lua, ...

Whatever you are comfortable with.
 
Old 10-20-2017, 12:30 AM   #3
onthetopo
LQ Newbie
 
Registered: May 2017
Posts: 11

Original Poster
Rep: Reputation: Disabled
Either Python or Awk works for me, please let me know. Thanks! It's been 5 years since I touched them, so an answer would really be appreciated.
 
Old 10-20-2017, 12:59 AM   #4
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,295
Blog Entries: 3

Rep: Reputation: 3719
I'd start with grep before escalating to awk or perl. GNU grep has substantial support for perl style regular expressions. The -o, -r, and -P options will be of use to you here. Maybe -m too.
See

Code:
man grep
man perlre
In the PCRE, you'll want to look at using a look-behind zero-width assertion (?<= ... ) to prevent grep from printing that part of the pattern.

So give it a try, and show your code if you get stuck and need tips.
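To illustrate what the look-behind changes, here is a small sketch on a single made-up line (the sample text is an assumption, and GNU grep built with -P support is assumed). It compares the -o output with and without (?<= ... ):

```shell
line='junk <WEB-DATETIME>20040316142929 more junk'

# Without the look-behind, -o prints the tag as part of the match.
with_tag=$(printf '%s\n' "$line" | grep -o -P '<WEB-DATETIME>[0-9]{14}')

# With (?<=...), the tag must precede the digits but is not part of the match.
digits_only=$(printf '%s\n' "$line" | grep -o -P '(?<=<WEB-DATETIME>)[0-9]{14}')

printf '%s\n%s\n' "$with_tag" "$digits_only"
```

The first line printed still carries the tag; the second is only the 14 digits.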
 
Old 10-20-2017, 01:29 AM   #5
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
once again, i have to disagree. this:
Quote:
Originally Posted by onthetopo
<WEB-DATETIME>20040316142929
looks a lot like XML to me (i bet it's followed by </WEB-DATETIME>, isn't it?).
it would be a lot easier to use a tool that parses XML.
many choices, but i have experience with xmllint, e.g. like this:
Code:
xmllint --html --nonet --xpath "//WEB-DATETIME//text()" 2>/dev/null filename
(assuming the file is called filename)
 
Old 10-20-2017, 01:36 AM   #6
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,295
Blog Entries: 3

Rep: Reputation: 3719
Quote:
Originally Posted by ondoho
once again, i have to disagree. this: looks a lot like XML to me (i bet it's followed by </WEB-DATETIME>, isn't it?).
I agree with the disagreement.

If it is just non-XML text, then grep is easiest.

However, if the files contain XML then it is far more appropriate to search them with a tool that properly parses the XML. Anything that handles XPath is good and xmllint is a good one for that.
 
Old 10-20-2017, 11:06 AM   #7
onthetopo
LQ Newbie
 
Registered: May 2017
Posts: 11

Original Poster
Rep: Reputation: Disabled
Hi, sorry it's not XML.
What I got so far is this:
LANG=C grep -rl ".*<WEB-DATETIME>[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]"

But I have no idea how to proceed from here to achieve the task laid out in the first post. I was thinking of piping this into awk with the field separator '>', but the task demands that the 14-digit string and the file name (with its directory) be displayed side by side, so I don't think the pipe into awk will work.
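On the worry that awk cannot show the file name next to the digits: awk can read the files itself, and its built-in FILENAME variable holds the name of the file being read, so no pipe from grep is needed. A rough sketch, not a definitive answer: the temporary directory and sample files are made up to stand in for /home/log, and the variable names (seen, p, stamp, name) are just illustrative.

```shell
# Made-up sample tree standing in for /home/log.
logdir=$(mktemp -d)
mkdir -p "$logdir/000020"
printf 'junk\n<WEB-DATETIME>20040316142929\n<WEB-DATETIME>20990101000000\n' > "$logdir/000020/a00001.txt"
printf 'x <WEB-DATETIME>20040317112020 y\n' > "$logdir/000020/b00009.txt"

result=$(cd "$logdir" && find . -type f -name '*.txt' -exec awk '
    FNR == 1 { seen = 0 }                       # first line of each file: reset the flag
    !seen && (p = index($0, "<WEB-DATETIME>")) {
        stamp = substr($0, p + 14, 14)          # the tag itself is 14 characters long
        if (length(stamp) == 14 && stamp ~ /^[0-9]+$/) {
            name = FILENAME
            sub(/^\.\//, "", name)              # drop the leading ./
            sub(/\.txt$/, "", name)             # drop the extension
            print name ": " stamp
            seen = 1                            # ignore later tags in this file
        }
    }
' {} + | sort)

printf '%s\n' "$result"
rm -rf "$logdir"
```

Note that index() only looks at the first tag on a line, so a line with two tags where only the second is followed by digits would be missed by this sketch.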

 
Old 10-20-2017, 11:13 AM   #8
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,295
Blog Entries: 3

Rep: Reputation: 3719
Quote:
Originally Posted by onthetopo
Hi, sorry it's not XML.
What I got so far is this:
LANG=C grep -rl ".*<WEB-DATETIME>[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]"
As mentioned, you will probably benefit from the -o option to print the matched pattern and the -P option on grep to give you Perl regular expressions so you can use a look-behind zero-width assertion to prevent grep from printing that part of the pattern.

For example,

Code:
grep -m 1 -o -P '(?<=<WEB-DATETIME>)([0-9]{14})' ./dir/subdir/*.txt

grep -m 1 -o -P '(?<=<WEB-DATETIME>)([0-9]{14})' ./dir/subdir/*.txt | sed 's|^.*/||'
See "man grep" and "man perlre" and, maybe, "man sed"
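Putting these options together for the whole tree might look roughly like the sketch below. It is only an illustration under stated assumptions: the temporary directory and its files are made up to stand in for /home/log, and GNU grep is assumed (for -r, --include and -P). Running grep from inside the directory keeps the printed paths relative, and sed then reshapes "./dir/file.txt:digits" into the "dir/file: digits" layout from the first post.

```shell
# Made-up sample tree standing in for /home/log.
logdir=$(mktemp -d)
mkdir -p "$logdir/000020"
printf 'header\n<WEB-DATETIME>20040316142929\n<WEB-DATETIME>20050101000000\n' > "$logdir/000020/a00001.txt"
printf '<WEB-DATETIME>20040317112020\n' > "$logdir/000020/b00009.txt"

# -r        recurse into subdirectories
# --include only look at *.txt files
# -m 1      stop reading each file after its first matching line
# -o -P     print only the match; the look-behind keeps the tag out of it
result=$(cd "$logdir" && grep -r --include='*.txt' -m 1 -o -P '(?<=<WEB-DATETIME>)[0-9]{14}' . \
    | sed -e 's|^\./||' -e 's|\.txt:|: |' | sort)

printf '%s\n' "$result"
rm -rf "$logdir"
```

On the real tree the same pipeline would be run after "cd /home/log", with the output redirected to /home/alllog.txt instead of captured in a variable.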
 
Old 10-20-2017, 01:59 PM   #9
onthetopo
LQ Newbie
 
Registered: May 2017
Posts: 11

Original Poster
Rep: Reputation: Disabled
Thanks for the help.
I read man sed, but couldn't understand this part. I wouldn't call myself an expert in vi, but I always thought sed takes vi commands, and this looks nothing like anything in vi. For example, what is /| ? I know that it removes the directory part and keeps only the file name, but I can't understand why it works.

sed 's|^.*/||'

This is interesting. Would you mind explaining the meaning of 's|^.*/||'?


Last edited by onthetopo; 10-20-2017 at 02:21 PM.
 
Old 10-20-2017, 02:42 PM   #10
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,295
Blog Entries: 3

Rep: Reputation: 3719
It's easier to read that way. s#/some/path/## versus s/\/some\/path\///

Most commonly you see the s command with slashes, as in s///. However, in sed the s command only needs the same character used three times as the delimiter around the search pattern and the replacement text. So, at least in sed, the following are functionally the same:

Code:
echo abcdabcdcdbadbccd | sed 's/b/X/g'
echo abcdabcdcdbadbccd | sed 's!b!X!g'
echo abcdabcdcdbadbccd | sed 's#b#X#g'
echo abcdabcdcdbadbccd | sed 's|b|X|g'
echo abcdabcdcdbadbccd | sed 'sabaXag'
Because a path has slashes / the substitution is easier to read if another symbol is used instead. That way the slashes in the pattern don't have to be escaped.
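Applied to the command in question, 's|^.*/||' reads as: s (substitute), | as the delimiter, the pattern ^.*/ (from the start of the line, any characters, up to a slash), and an empty replacement. A short sketch on a made-up line of grep output (the sample string is an assumption) shows the same substitution written with both delimiters:

```shell
sample='./dir/subdir/a00001.txt:20040316142929'

# '|' as the delimiter: the slash in the pattern needs no escaping.
pipe_form=$(printf '%s\n' "$sample" | sed 's|^.*/||')

# '/' as the delimiter: the same substitution, with the slash escaped.
slash_form=$(printf '%s\n' "$sample" | sed 's/^.*\///')

printf '%s\n%s\n' "$pipe_form" "$slash_form"
```

Both forms print a00001.txt:20040316142929, i.e. everything up to and including the last slash is gone.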

Some other languages also allow some flexibility with the s command's delimiters.

Code:
echo abcdabcdcdbadbccd | perl -p -e 's/b/X/g'
echo abcdabcdcdbadbccd | perl -p -e 's#b#X#g'
echo abcdabcdcdbadbccd | perl -p -e 's!b!X!g'
echo abcdabcdcdbadbccd | perl -p -e 's|b|X|g'
 
Old 10-20-2017, 10:11 PM   #11
onthetopo
LQ Newbie
 
Registered: May 2017
Posts: 11

Original Poster
Rep: Reputation: Disabled
I see thanks.

Why does ^.*/ correspond to /foo/bar/?
^. means anything not dot.

I tried to use https://regex101.com/ but it's not making sense.
 
Old 10-21-2017, 12:36 AM   #12
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,295
Blog Entries: 3

Rep: Reputation: 3719
There are a number of different styles of regex. That complicates things. The lowest common denominator is probably POSIX.

But in all of them, the caret ^ anchors the pattern search to the beginning of the string. You could say that it matches the invisible beginning of the string. It's probably not needed in the example above because the search proceeds from the beginning (left) to the end (right). The $ anchors the pattern to the end of the string. See "man 7 regex".

Code:
echo abcdabcdcdbadbccd | sed 's/b.*/X/g';   # replace from b onwards with a single X
echo abcdabcdcdbadbccd | sed 's/^b.*/X/g';  # same but only if the line starts with a b
echo abcdabcdcdbadbccd | sed 's/.*d$/X/g';  # replace the whole line with an X if it ends with d
echo abcdabcdcdbadbccd | sed 's/.*c$/X/g';  # replace the whole line with an X if it ends with c
echo abcdabcdcdbadbccd | sed 's/^$/X/g';  # replace an empty line with an X
echo | sed 's/^$/X/g';  # replace an empty line with an X
Because * is "greedy", that is, it matches as much as possible, the ^ in front of a leading .* (or the $ after a trailing .*) is often not needed.
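A quick way to see the greediness, sketched on the /foo/bar/ style of path asked about earlier (the sample path is made up): with .* the match runs to the last slash whether or not ^ is present, while [^/]* cannot cross a slash and so stops at the first one.

```shell
path='/foo/bar/baz.txt'

# Greedy .* runs up to the LAST slash, with or without the ^ anchor.
anchored=$(printf '%s\n' "$path" | sed 's|^.*/||')
unanchored=$(printf '%s\n' "$path" | sed 's|.*/||')

# [^/]* cannot cross a slash, so only the first slash is removed.
first_only=$(printf '%s\n' "$path" | sed 's|[^/]*/||')

printf '%s\n%s\n%s\n' "$anchored" "$unanchored" "$first_only"
```

The first two results are baz.txt; the third is foo/bar/baz.txt.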
 
Old 10-22-2017, 02:08 PM   #13
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,781

Rep: Reputation: 1199
Quote:
Originally Posted by onthetopo
I see thanks.

Why does ^.*/ correspond to /foo/bar/?
^. means anything not dot.

I tried to use https://regex101.com/ but it's not making sense.
Anything not dot is [^.]
The square brackets denote a character set!
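The difference can be checked on a short made-up string: inside brackets the caret negates the set, while outside brackets it anchors to the start of the line and the dot means any single character.

```shell
# [^.] : any character that is not a dot (the caret negates the set).
inside=$(printf 'a.b.c\n' | sed 's/[^.]/X/g')

# ^.   : start of line, followed by any one character.
outside=$(printf 'a.b.c\n' | sed 's/^./X/')

printf '%s\n%s\n' "$inside" "$outside"
```

The first command gives X.X.X and the second gives X.b.c.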
 
  

