How to grep the first occurrence of a date string in a file and print the file name and directory

onthetopo · 10-20-2017, 12:16 AM

Hello, what would be the grep command that would achieve the following task?

Suppose that I am in the /home/log directory, which has hundreds of subfolders. In one of them, say in /home/log/000020/a00001.txt, we have a text string that looks like
<WEB-DATETIME>20040316142929

And as another example, in /home/log/000020/b00009.txt
we have the text string that looks like
<WEB-DATETIME>20040317112020

I want to use grep on all txt files under /home/log to produce a file, /home/alllog.txt, with the field separator ':' and two columns: A. the location of the txt file, and B. the number string, always 14 digits, that comes immediately after the first occurrence of the <WEB-DATETIME> string, so that alllog.txt would look like:

000020/a00001: 20040316142929
000020/b00009: 20040317112020
etc

Many thanks!

syg00 · 10-20-2017, 12:21 AM

You will need something a bit more like a (scripting) language than grep.
Try awk, perl, python, LUA, ...

Whatever you are comfortable with.
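
For what it's worth, here is a minimal awk sketch of that idea, not a tested solution for the real data. It assumes the 14 digits sit on the same line as the tag, directly after it. The demo variable, the temporary tree and the sample file contents are made up for illustration so that nothing under /home/log is touched.

```shell
#!/bin/sh
# Hypothetical demo tree standing in for /home/log (paths and contents are made up).
demo=$(mktemp -d)
mkdir -p "$demo/log/000020"
printf 'some text\n<WEB-DATETIME>20040316142929\n<WEB-DATETIME>20050101000000\n' \
    > "$demo/log/000020/a00001.txt"
printf '<WEB-DATETIME>20040317112020\n' > "$demo/log/000020/b00009.txt"

cd "$demo/log" || exit 1

# For every .txt file, print "dir/name: digits" for the first tag only.
find . -type f -name '*.txt' -exec awk -v tag='<WEB-DATETIME>' '
    FNR == 1 { done = 0 }                   # first line of a new file: reset the flag
    !done && (pos = index($0, tag)) > 0 {
        stamp = substr($0, pos + length(tag), 14)
        if (length(stamp) == 14 && stamp ~ /^[0-9]+$/) {
            name = FILENAME
            sub(/^\.\//, "", name)          # drop the leading "./" that find adds
            sub(/\.txt$/, "", name)         # drop the extension
            print name ": " stamp
            done = 1
        }
    }
' {} + | sort > "$demo/alllog.txt"

cat "$demo/alllog.txt"
```

The done flag is used instead of nextfile only because not every awk has nextfile. With a large tree it just means awk keeps reading each file to the end.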

onthetopo · 10-20-2017, 12:30 AM

Either Python or Awk works with me, please let me know. Thanks! It's been 5 years since I touched them, giving an answer would really be appreciated.

Turbocapitalist · 10-20-2017, 12:59 AM

I'd start with grep before escalating to awk or perl. GNU grep has substantial support for perl style regular expressions. The -o, -r, and -P options will be of use to you here. Maybe -m too.
See

Code:

man grep
man perlre

In the PCRE, you'll want to look at using a look-behind zero-width assertion (?<= ... ) to prevent grep from printing that part of the pattern.

So give it a try and show your code if you do get stuck and need tips.
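
To make the look-behind part concrete, here is a small sketch run on a single sample line taken from the first post. It assumes GNU grep built with -P support. The variable names are only for illustration.

```shell
#!/bin/sh
# Sample line copied from the first post.
line='<WEB-DATETIME>20040316142929'

# Without the look-behind, -o prints the tag as well as the digits.
with_tag=$(printf '%s\n' "$line" | grep -o -P '<WEB-DATETIME>[0-9]{14}')

# With the look-behind, the tag must be present but is not part of the match.
digits_only=$(printf '%s\n' "$line" | grep -o -P '(?<=<WEB-DATETIME>)[0-9]{14}')

echo "$with_tag"
echo "$digits_only"
```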

ondoho · 10-20-2017, 01:29 AM

once again, i have to disagree. this:

Quote:

Originally Posted by onthetopo

<WEB-DATETIME>20040316142929

looks a lot like XML to me (i bet it's followed by </WEB-DATETIME>, isn't it?).
it would be a lot easier to use a tool that parses XML.
many choices, but i have experience with xmllint, e.g. like this:

Code:

xmllint --html --nonet --xpath "//WEB-DATETIME//text()" 2>/dev/null filename

(assuming the file is called filename)

Turbocapitalist · 10-20-2017, 01:36 AM

Quote:

Originally Posted by ondoho

once again, i have to disagree. this looks a lot like XML to me (i bet it's followed by </WEB-DATETIME>, isn't it?).

I agree with the disagreement.

If it is just non-XML text, then grep is easiest.

However, if the files contain XML then it is far more appropriate to search them with a tool that properly parses the XML. Anything that handles XPath is good and xmllint is a good one for that.

onthetopo · 10-20-2017, 11:06 AM

Hi, sorry it's not XML.
What I got so far is this:
LANG=C grep -rl ".*<WEB-DATETIME>[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]"

But I have no idea how to proceed from here to achieve the task as laid out in the first post. I was thinking of piping this into awk with the field separator '>', but the task demands that the 14-digit string and the filename (and directory) be displayed side by side. So, to my thinking, the pipe into awk won't work.

Turbocapitalist · 10-20-2017, 11:13 AM

Quote:

Originally Posted by onthetopo

Hi, sorry it's not XML.
What I got so far is this:
LANG=C grep -rl ".*<WEB-DATETIME>[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]"

As mentioned, you will probably benefit from the -o option to print the matched pattern and the -P option on grep to give you Perl regular expressions so you can use a look-behind zero-width assertion to prevent grep from printing that part of the pattern.

For example,

Code:

grep -m 1 -o -P '(?<=<WEB-DATETIME>)([0-9]{14})' ./dir/subdir/*.txt

grep -m 1 -o -P '(?<=<WEB-DATETIME>)([0-9]{14})' ./dir/subdir/*.txt | sed 's|^.*/||'

See "man grep" and "man perlre" and, maybe, "man sed"
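
Putting those pieces together for the layout asked for in the first post, one possible sketch looks like this. It assumes GNU grep (for -r, --include and -P). The temporary tree, the demo variable and the sample contents are made up so that the real /home/log is not touched.

```shell
#!/bin/sh
# Hypothetical demo tree standing in for /home/log (paths and contents are made up).
demo=$(mktemp -d)
mkdir -p "$demo/log/000020"
printf 'header\n<WEB-DATETIME>20040316142929\n<WEB-DATETIME>20050101000000\n' \
    > "$demo/log/000020/a00001.txt"
printf '<WEB-DATETIME>20040317112020\n' > "$demo/log/000020/b00009.txt"

cd "$demo/log" || exit 1

# -r recurses, --include limits the search to *.txt, -m 1 stops after the first
# matching line in each file, -o prints only the match, -H prefixes the file name.
# grep prints "./000020/a00001.txt:20040316142929"; sed reshapes that line.
grep -r -H -m 1 -o -P --include='*.txt' '(?<=<WEB-DATETIME>)[0-9]{14}' . \
    | sed -e 's|^\./||' -e 's|\.txt:|: |' \
    | sort > "$demo/alllog.txt"

cat "$demo/alllog.txt"
```

Note that -m 1 counts matching lines, not matches, so a first line holding two tags would still print both.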

onthetopo · 10-20-2017, 01:59 PM

Thanks for the help.
I read man sed, but couldn't understand this part. I won't call myself an expert in vi but I always thought sed does vi commands, and this looks nothing like something in VI. For example, what is /| I know what it is doing is to remove the slash in directory name and only keep the file name but I can't understand Why it works.

sed 's|^.*/||'

This is interesting would you mind please explain the meaning of 's|^.*/||'?

Turbocapitalist · 10-20-2017, 02:42 PM

It's easier to read that way. s#/some/path/## versus s/\/some\/path\///

Most commonly you see the s command with slashes, as in s/// However, in sed the s command only needs three of the same character as delimiters around the search pattern and the replacement text. So, at least in sed, the following are functionally the same:

Code:

echo abcdabcdcdbadbccd | sed 's/b/X/g'
echo abcdabcdcdbadbccd | sed 's!b!X!g'
echo abcdabcdcdbadbccd | sed 's#b#X#g'
echo abcdabcdcdbadbccd | sed 's|b|X|g'
echo abcdabcdcdbadbccd | sed 'sabaXag'

Because a path has slashes / the substitution is easier to read if another symbol is used instead. That way the slashes in the pattern don't have to be escaped.

Some other languages will also allow some flexibility with the s command's delimiters.

Code:

echo abcdabcdcdbadbccd | perl -p -e 's/b/X/g'
echo abcdabcdcdbadbccd | perl -p -e 's#b#X#g'
echo abcdabcdcdbadbccd | perl -p -e 's!b!X!g'
echo abcdabcdcdbadbccd | perl -p -e 's|b|X|g'

onthetopo · 10-20-2017, 10:11 PM

I see thanks.

Why does ^.*/ correspond to /foo/bar/?
^. means anything not dot.

I tried to use https://regex101.com/ but it's not making sense.

Turbocapitalist · 10-21-2017, 12:36 AM

There are a number of different styles of regex. That complicates things. The lowest common denominator is probably POSIX.

But in all of them, the caret ^ anchors the pattern search to the beginning of the string. You could say that it matches the invisible beginning of the string. It's probably not needed in the example above because the searching is done from the beginning (left) to the end (right). The $ anchors the pattern to the end of the string. See "man 7 regex"

Code:

echo abcdabcdcdbadbccd | sed 's/b.*/X/g';   # replace from the first b onwards with a single X
echo abcdabcdcdbadbccd | sed 's/^b.*/X/g';  # same but only if the line starts with a b (no change here)
echo abcdabcdcdbadbccd | sed 's/.*d$/X/g';  # replace the whole line with an X if it ends with d
echo abcdabcdcdbadbccd | sed 's/.*c$/X/g';  # replace the whole line with an X if it ends with c (no change here)
echo abcdabcdcdbadbccd | sed 's/^$/X/g';  # replace an empty line with an X (no change here, the line is not empty)
echo | sed 's/^$/X/g';  # replace an empty line with an X

Because * is "greedy", that is, it matches as much as possible, the ^ or $ are often not needed.
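
Applied to the earlier path example, a small sketch of why ^.*/ eats everything up to the last slash. The sample line is what grep would print for one of the files in the first post, and the variable names are made up for illustration.

```shell
#!/bin/sh
# One line shaped like grep's "file:match" output.
line='./dir/subdir/a00001.txt:20040316142929'

# Greedy .* runs as far right as it can while a / can still follow it,
# so everything up to and including the LAST slash is removed.
greedy=$(printf '%s\n' "$line" | sed 's|^.*/||')

# [^/]* cannot cross a slash, so only the FIRST component "./" is removed.
first_only=$(printf '%s\n' "$line" | sed 's|^[^/]*/||')

echo "$greedy"       # a00001.txt:20040316142929
echo "$first_only"   # dir/subdir/a00001.txt:20040316142929
```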

MadeInGermany · 10-22-2017, 02:08 PM

Quote:

Originally Posted by onthetopo

I see thanks.

Why does ^.*/ correspond to /foo/bar/?
^. means anything not dot.

I tried to use https://regex101.com/ but it's not making sense.

Anything not dot is [^.]
Square brackets denote a character set!
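
A quick sketch of the difference, using a made-up test string: outside brackets, ^ is the start-of-line anchor and . is any single character, while inside brackets a leading ^ negates the set.

```shell
#!/bin/sh
# Made-up test string that starts with a literal dot.
s='.a.b'

# ^. is "start of line, then any one character": only the first character changes.
caret_dot=$(printf '%s\n' "$s" | sed 's/^./X/')

# [^.] is "any one character that is not a dot": every non-dot changes (g flag).
not_dot=$(printf '%s\n' "$s" | sed 's/[^.]/X/g')

echo "$caret_dot"   # Xa.b
echo "$not_dot"     # .X.X
```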