Why won't grep work on my logs?

Rob00 · 11-04-2009, 01:16 PM

I've been trying for at least 30 minutes now to do something simple with grep, and it just doesn't seem to be working. It's most likely something I'm doing wrong, but I can't figure it out.

I have old MSN chat logs on my other hard drive. I simply want to use grep to search for a particular string/word. First of all, I am in the correct directory with the logs, which MSN stores in HTML format.

It seems simple enough, grep -i 'string' *.html

But I've tried in vain to get this to work. I've tried using the string 'hello' and other common words, and it NEVER finds anything. After a while I thought something weird was going on, so I opened up one of the logs in gedit and saved an exact duplicate (named Fakelog) in the same folder.

Strangely enough, grep could find matches in this new copy, but not in the original?

I thought this was weird, so I did "file *" and had this output
Code:

April 2008.html:      XML document text
August 2007.html:     XML document text
August 2008.html:     XML document text
August 2009.html:     XML document text
December 2007.html:   XML document text
December 2008.html:   XML document text
February 2008.html:   XML document text
Fakelog 2005.html: ASCII text

As you can see, it calls my Fakelog "ASCII text" and every other file "XML document text". I wonder if this is why grep won't work on the original log files?

Can someone help me understand the logic behind this seemingly illogical problem? I can't understand it. I can cat the .html logs and they print fine, but grep just won't work on them.

Edit: I'm still trying to make sense of it, and I tried this:

Code:

$ tail -2 October\ 2007.html               # print the last two lines of a log
</html>                                    # (it worked)
$ tail -2 October\ 2007.html | grep html   # this printed nothing
$ echo '</html>' | grep html               # this did
</html>
$

It doesn't work when I 'tail', but it does when I echo the same thing? I don't get it. I noticed that the last line of the file is seemingly blank and I can't remove the blank line. I wonder if that's causing problems, really I have no idea.

pixellany · 11-04-2009, 01:30 PM

I'm lost here. Please post a (short) sample of the contents of one of the files.

Rob00 · 11-04-2009, 02:01 PM

I attached a short edited log, and gave it a .txt extension to make it uploadable here.

I cannot get grep to work on that edited log. I did this:

$ grep -i 'slicex' Testlog\ 2009.txt

Doesn't work for me.

I noticed when I cat Testlog\ 2009.txt or any of these MSN logs, there are four suspicious "diamond" characters that appear at the beginning and end of the file, which don't show up in gedit or geany. I wonder if they are preventing grep from working.

pixellany · 11-04-2009, 02:29 PM

Got it!!

Look at the first few lines from "cat filename|hexdump -C":

Code:

00000000  ff fe 3c 00 3f 00 78 00  6d 00 6c 00 20 00 76 00  |..<.?.x.m.l. .v.|
00000010  65 00 72 00 73 00 69 00  6f 00 6e 00 3d 00 22 00  |e.r.s.i.o.n.=.".|
00000020  31 00 2e 00 30 00 22 00  20 00 65 00 6e 00 63 00  |1...0.". .e.n.c.|
00000030  6f 00 64 00 69 00 6e 00  67 00 3d 00 22 00 55 00  |o.d.i.n.g.=.".U.|
00000040  54 00 46 00 2d 00 31 00  36 00 4c 00 45 00 22 00  |T.F.-.1.6.L.E.".|
00000050  3f 00 3e 00 0d 00 0a 00  3c 00 21 00 44 00 4f 00  |?.>.....<.!.D.O.|

Note that the data is padded with a "zero" byte after every character.

Even so, it displays correctly with "cat" and if opened on a web browser.
Why? no clue.

Why is it padded like this? No clue.
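
For anyone who wants to reproduce the effect without the original logs, here is a minimal sketch. It assumes iconv and od are installed; the temporary file and the word 'hello' are made up for illustration and are not taken from the real logs. It shows that grep looks for the bytes of 'hello' back to back, and in this encoding they are never adjacent.

```shell
# Build a tiny UTF-16LE sample (temporary file, purely illustrative).
sample=$(mktemp)
printf 'hello\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# Dump the raw bytes: every ASCII letter is followed by a 00 byte.
od -An -tx1 "$sample"

# grep searches for h-e-l-l-o as adjacent bytes, which never occur here.
if grep -q 'hello' "$sample"; then
    result="match"
else
    result="no match"
fi
echo "$result"
```

The terminal simply ignores the zero bytes when cat prints the file, which is why the text looks normal on screen.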

Rob00 · 11-04-2009, 02:40 PM

Yeah it seems to be weird with a lot of no clues here as well. The good news is I found what I was looking for without grep. I feel less frustrated knowing I wasn't doing something wrong. I thought I was losing my mind! Thanks. I didn't know about the hexdump command either, neat trick!

pcunix · 11-04-2009, 02:50 PM

Something to keep in mind for the future - with arbitrary partially binary Microsoft stuff (you KNOW I wanted to use another word!), this can help:

Code:

strings file | grep whatever

Rob00 · 11-04-2009, 03:43 PM

'strings' didn't work on my MSN log files either. It doesn't output anything.

Code:

$ cat Testlog\ 2009.txt | head -5
��<?xml version="1.0" encoding="UTF-16LE"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- Note: this file is an XHTML 1.0 document constructed to be rendered properly in most web browsers
that do not support XHTML natively. Be careful to respect the XHTML 1.0 syntax if you manually edit this file -->

$ strings Testlog\ 2009.txt 
$
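
A likely reason for the empty output: by default strings looks for runs of at least four printable single-byte characters, and in this file every printable byte is followed by a zero byte, so no run is long enough. The sketch below uses a made-up sample line rather than the real log. It shows a crude workaround (deleting the zero bytes with tr, which is only safe when the text is essentially ASCII) and, where GNU binutils is installed, the -e l option that tells strings to expect 16-bit little-endian characters.

```shell
# Illustrative UTF-16LE sample; the text is invented for this example.
sample=$(mktemp)
printf 'SliceX says hello\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# Crude workaround: drop every NUL byte, then search what is left.
found=$(tr -d '\000' < "$sample" | grep -i 'slicex')
echo "$found"

# GNU strings only: -e l selects 16-bit little-endian character encoding.
if command -v strings >/dev/null 2>&1; then
    strings -e l "$sample" | grep -i 'slicex' ||
        echo "this strings build found nothing (it may not support -e l)"
fi
```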

colucix · 11-04-2009, 04:24 PM

Well... I'm going to venture into a little explanation... hoping someone more experienced will correct my inaccuracies!

The attached file is an XML document encoded in UTF-16. This means that every character is stored as one or two 16-bit code units, and characters in the ASCII range take a single unit whose high byte is zero, hence the "zero padding" noted by pixellany. hexdump (and od) simply dump the file byte by byte without interpreting the encoding, which is why alternate zeros appear in the output of these commands.

Moreover, note that the first two bytes are the so-called Byte Order Mark, a sequence often used in UTF-16 to denote the endianness of the data. In the example posted above they are "ff fe", that is the Unicode character U+FEFF stored low byte first, which denotes little-endian byte order.
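
A minimal sketch of checking for that mark from the shell, assuming head -c, od and iconv are available. The sample file is generated here for illustration; with a real log you would point head at the log instead.

```shell
# Illustrative sample: the two BOM bytes followed by UTF-16LE text.
sample=$(mktemp)
{ printf '\377\376'; printf '<html>\n' | iconv -f UTF-8 -t UTF-16LE; } > "$sample"

# Read the first two bytes as hex and strip the spacing od adds.
bom=$(head -c 2 "$sample" | od -An -tx1 | tr -d ' \n')

case "$bom" in
    fffe) echo "UTF-16 BOM, little-endian" ;;
    feff) echo "UTF-16 BOM, big-endian" ;;
    *)    echo "no UTF-16 BOM found" ;;
esac
```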

This information is also included in the header of the XML file:

Code:

$ head -1 Testlog\ 2009.txt
��<?xml version="1.0" encoding="UTF-16LE"?>

but keep in mind that it can be misleading, since the author can declare any encoding in the XML header and then actually save the document in a different one!

Finally, to solve the grep (or any other parsing) problem, you can convert the file from one encoding to another using the POSIX iconv command. Here is an example that converts to UTF-8:

Code:

$ grep xml Testlog\ 2009.txt
$ iconv -f UTF-16 -t UTF-8 Testlog\ 2009.txt > Testlog\ 2009_UTF-8.txt
$ grep xml Testlog\ 2009_UTF-8.txt
<?xml version="1.0" encoding="UTF-16LE"?>
<html xmlns="http://www.w3.org/1999/xhtml">
$

See man 1p iconv for details (iconv is a binary utility provided by glibc).
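
If you would rather not keep converted copies around, the same idea can be applied on the fly to a whole directory. This is only a sketch: search_utf16 is a hypothetical helper name, the sample files and their contents are invented, and it assumes each log starts with a byte order mark so that iconv -f UTF-16 can detect the byte order.

```shell
# search_utf16 PATTERN FILE...   (hypothetical helper, not a standard tool)
# Converts each file to UTF-8 in a pipe and prefixes matches with the file name.
search_utf16() {
    pattern=$1
    shift
    for f in "$@"; do
        iconv -f UTF-16 -t UTF-8 "$f" 2>/dev/null |
            { grep -i -- "$pattern" || true; } |   # no match is not an error
            while IFS= read -r line; do
                printf '%s: %s\n' "$f" "$line"
            done
    done
}

# Two invented sample logs, written as UTF-16 with a BOM.
dir=$(mktemp -d)
printf 'SliceX: hello\nbye\n' | iconv -f UTF-8 -t UTF-16 > "$dir/August 2008.html"
printf 'nothing here\n'       | iconv -f UTF-8 -t UTF-16 > "$dir/April 2008.html"

matches=$(search_utf16 'slicex' "$dir"/*.html)
echo "$matches"
```

In the log directory the call would look like search_utf16 'string' *.html, which also handles file names containing spaces.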

Hope this helps. Bye

PTrenholme · 11-04-2009, 04:34 PM

Did you try cat ...html | grep -i <target>? If cat works, that should work.

If you want to know the name of the file where the match occurs, look at the find command using xargs. (See info find for details and examples.)

pixellany · 11-04-2009, 04:36 PM

Wow!!! Is there **anything** that guy doesn't know?

What was wrong with good old ascii? Why do we need Unicode, UTF, and all that stuff? Where's my Apple-II?

Rob00 · 11-04-2009, 05:16 PM

PTrenholme: Yes, I tried that and it doesn't work. On a single file, grep finds no matches. On *.html, it says "Binary file standard output matches"

Impressive and interesting reply, Guru colucix, and iconv works great too. Thanks.

syg00 · 11-04-2009, 06:16 PM

Yes, awesome colucix - and not a mention of awk anywhere ...