LinuxQuestions.org
Old 11-04-2009, 02:16 PM   #1
Rob00
LQ Newbie
 
Registered: Oct 2009
Posts: 13

Rep: Reputation: 0
Why won't grep work on my logs?


I've been trying to use grep for at least 30 minutes now to do something simple, and it just doesn't seem to be working. It's most likely something I'm doing wrong, but I can't figure it out.

I have old MSN chat logs on my other hard drive. I simply want to use grep to search for a particular string/word. First of all, I am in the correct directory with the logs, which MSN stores in HTML format.

It seems simple enough: grep -i 'string' *.html

But I've tried in vain to get this to work. I've tried the string 'hello' and other common words, and it NEVER finds anything. I thought something weird was going on after a while, so I opened up one of the logs in gedit and saved an exact duplicate (named Fakelog) in the same folder.

Strangely enough, grep could find matches in this new copy, but not in the original?

I thought this was weird, so I did "file *" and had this output
Code:
April 2008.html:      XML document text
August 2007.html:     XML document text
August 2008.html:     XML document text
August 2009.html:     XML document text
December 2007.html:   XML document text
December 2008.html:   XML document text
February 2008.html:   XML document text
Fakelog 2005.html: ASCII text
As you can see, it calls my Fakelog ASCII text and every other file XML document text. I wonder if this is why grep won't work on the original log files?

Can someone help me understand the logic behind this seemingly illogical problem? I can't understand it. I can cat the .html logs and they print fine, but grep just won't work on them.

Edit: I'm still trying to make sense of it, and I tried this:

Code:
$ tail -2 October\ 2007.html  <----- printing the last two lines of a log
</html> <--- it worked
$ tail -2 October\ 2007.html | grep html <--- this didn't work
$ echo '</html>' | grep html <-- this did
</html>
$
It doesn't work when I 'tail', but it does when I echo the same thing? I don't get it. I noticed that the last line of the file is seemingly blank and I can't remove the blank line. I wonder if that's causing problems, really I have no idea.
 
Old 11-04-2009, 02:30 PM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 738
I'm lost here. Please post a (short) sample of the contents of one of the files.
 
Old 11-04-2009, 03:01 PM   #3
Rob00
LQ Newbie
 
Registered: Oct 2009
Posts: 13

Original Poster
Rep: Reputation: 0
I attached a short edited log, and gave it a .txt extension to make it uploadable here.

I cannot get grep to work on that edited log. I did this:

$ grep -i 'slicex' Testlog\ 2009.txt

Doesn't work for me.

I noticed when I cat Testlog\ 2009.txt or any of these MSN logs, there are four suspicious "diamond" characters that appear at the beginning and end of the file, which don't show up in gedit or geany. I wonder if they are preventing grep from working.
Attached Files
File Type: txt Testlog 2009.txt (8.7 KB, 11 views)
 
Old 11-04-2009, 03:29 PM   #4
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 738
Got it!!

Look at the first few lines from "cat filename|hexdump -C":
Code:
00000000  ff fe 3c 00 3f 00 78 00  6d 00 6c 00 20 00 76 00  |..<.?.x.m.l. .v.|
00000010  65 00 72 00 73 00 69 00  6f 00 6e 00 3d 00 22 00  |e.r.s.i.o.n.=.".|
00000020  31 00 2e 00 30 00 22 00  20 00 65 00 6e 00 63 00  |1...0.". .e.n.c.|
00000030  6f 00 64 00 69 00 6e 00  67 00 3d 00 22 00 55 00  |o.d.i.n.g.=.".U.|
00000040  54 00 46 00 2d 00 31 00  36 00 4c 00 45 00 22 00  |T.F.-.1.6.L.E.".|
00000050  3f 00 3e 00 0d 00 0a 00  3c 00 21 00 44 00 4f 00  |?.>.....<.!.D.O.|
Note that the data is padded with a "zero" byte after every character.

Even so, it displays correctly with "cat" and if opened on a web browser.
Why? no clue.

Why is it padded like this? No clue.
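
The zero bytes are easy to reproduce on a small scale. What follows is only a sketch: the sample file is a made-up stand-in for one of the logs (the word "hello" written as 16-bit little-endian characters behind the same ff fe prefix), and od is used instead of hexdump simply because it tends to be installed everywhere.

```shell
# Hypothetical sample, not one of the real logs: "hello" plus a newline,
# two bytes per character, low byte first, preceded by the bytes ff fe.
sample=$(mktemp)
printf '\377\376h\000e\000l\000l\000o\000\n\000' > "$sample"

# Dump the file byte by byte: every letter is followed by a zero byte.
od -A x -t x1 "$sample"

# grep looks for the bytes h-e-l-l-o directly next to each other,
# and that sequence never occurs in this file.
if grep -q 'hello' "$sample"; then
    result="match"
else
    result="no match"
fi
echo "grep on the raw file: $result"
```

The same test on a copy saved as plain ASCII would report a match, which fits what Rob00 saw with his Fakelog.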
 
Old 11-04-2009, 03:40 PM   #5
Rob00
LQ Newbie
 
Registered: Oct 2009
Posts: 13

Original Poster
Rep: Reputation: 0
Yeah it seems to be weird with a lot of no clues here as well. The good news is I found what I was looking for without grep. I feel less frustrated knowing I wasn't doing something wrong. I thought I was losing my mind! Thanks. I didn't know about the hexdump command either, neat trick!
 
Old 11-04-2009, 03:50 PM   #6
pcunix
Member
 
Registered: Dec 2004
Location: MA
Distribution: Various
Posts: 149

Rep: Reputation: 23
Something to keep in mind for the future - with arbitrary partially binary Microsoft stuff (you KNOW I wanted to use another word!), this can help:
Code:
strings file | grep whatever
 
Old 11-04-2009, 04:43 PM   #7
Rob00
LQ Newbie
 
Registered: Oct 2009
Posts: 13

Original Poster
Rep: Reputation: 0
'strings' didn't work on my MSN log files either. It doesn't output anything.
Code:
$ cat Testlog\ 2009.txt | head -5
��<?xml version="1.0" encoding="UTF-16LE"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- Note: this file is an XHTML 1.0 document constructed to be rendered properly in most web browsers
that do not support XHTML natively. Be careful to respect the XHTML 1.0 syntax if you manually edit this file -->

$ strings Testlog\ 2009.txt 
$
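
A possible explanation for the empty output, offered as a sketch and not as a verified diagnosis of the attached file: by default strings looks for runs of at least four printable single-byte characters, and a zero byte after every letter interrupts each run. GNU strings (from binutils) has an -e option to choose the character size, where "l" means 16-bit little endian. The sample file below is made up, and other strings implementations may not accept -e at all.

```shell
# Hypothetical sample: "hello" as 16-bit little-endian characters after ff fe.
sample=$(mktemp)
printf '\377\376h\000e\000l\000l\000o\000\n\000' > "$sample"

plain=""
wide=""
if command -v strings >/dev/null 2>&1; then
    # Default mode: single-byte characters, so no run of four printable bytes.
    plain=$(strings "$sample") || plain=""
    # -e l: read the data as 16-bit little-endian characters (GNU strings).
    wide=$(strings -e l "$sample" 2>/dev/null) || wide=""
    echo "default:   '$plain'"
    echo "with -e l: '$wide'"
else
    echo "strings is not installed"
fi
```

If that works on the sample, "strings -e l file | grep whatever" would be the matching variant of the command suggested above.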
 
Old 11-04-2009, 05:24 PM   #8
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1978
Well... I'm going to venture into a little explanation... hoping someone more experienced will correct my inaccuracies!

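The attached file is an XML document encoded in UTF-16. In this encoding every character is stored as one or two 16-bit units, and for plain ASCII characters one byte of the unit is always zero, hence the "zero padding" noted by pixellany. hexdump (and od) simply dump the file byte by byte, which is why zeros alternate with the letters in their output. A UTF-16-aware program (gedit, a web browser) decodes the byte pairs back into characters, and a terminal typically ignores the zero bytes that cat sends to it, so the text looks normal there. grep, however, searches for the bytes of 'hello' directly one after the other, and that sequence never occurs in the file. It would also explain why the copy saved from gedit worked: file reports it as ASCII text, so gedit evidently wrote it out with one byte per character.

Moreover, note that the first two bytes represent the so-called Byte Order Mark, a sequence often used in UTF-16 encoding to denote the endianness of the binary data. In the example posted above they are "ff fe", that is the Unicode character U+FEFF stored with its low byte first, which denotes little-endian byte order. They show up as the two odd characters in front of <?xml in the cat output.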
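
As a rough illustration of reading the Byte Order Mark yourself, here is a minimal sketch. The function name utf16_byte_order and the two one-character sample files are made up for this example, and the check is deliberately naive: it looks only at the first two bytes, so it cannot tell UTF-16 from a UTF-32 file that happens to start with the same bytes.

```shell
# utf16_byte_order FILE
# Hypothetical helper: report which UTF-16 byte order mark, if any, starts FILE.
utf16_byte_order() {
    # Read only the first two bytes and print them as hex, e.g. "fffe".
    bom=$(od -A n -t x1 -N 2 "$1" | tr -d '[:space:]')
    case "$bom" in
        fffe) echo "little endian" ;;
        feff) echo "big endian" ;;
        *)    echo "no UTF-16 BOM" ;;
    esac
}

# Two made-up samples holding the single letter "A" in each byte order.
le=$(mktemp)
be=$(mktemp)
printf '\377\376A\000' > "$le"
printf '\376\377\000A' > "$be"

utf16_byte_order "$le"
utf16_byte_order "$be"
```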

This information is also included in the header of the XML file:
Code:
$ head -1 Testlog\ 2009.txt
��<?xml version="1.0" encoding="UTF-16LE"?>
but keep in mind that this declaration can be misleading, since the author can write anything about the encoding in the XML header and then save the document in a different encoding!

Finally, to solve the grep problem (or any other parsing problem), you can convert the file from one encoding to another using the POSIX iconv command. Here is an example that converts to UTF-8:
Code:
$ grep xml Testlog\ 2009.txt
$ iconv -f UTF-16 -t UTF-8 Testlog\ 2009.txt > Testlog\ 2009_UTF-8.txt
$ grep xml Testlog\ 2009_UTF-8.txt
<?xml version="1.0" encoding="UTF-16LE"?>
<html xmlns="http://www.w3.org/1999/xhtml">
$
See man 1p iconv for details (iconv is a binary utility provided by glibc).
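
For a one-off search across many logs, the conversion can also be done on the fly instead of writing a second copy of every file. The following is a sketch under a few assumptions: the helper name search_utf16 is invented for this example, every file is assumed to start with a Byte Order Mark so that "-f UTF-16" can work out the byte order, and the carriage returns are stripped because the hexdump above shows CR LF line endings.

```shell
# search_utf16 PATTERN FILE...
# Hypothetical helper: case-insensitive search through UTF-16 files,
# printing "file: line" for every match without saving converted copies.
search_utf16() {
    pattern=$1
    shift
    for f in "$@"; do
        # Convert to UTF-8, drop the CR of each CR LF pair, search,
        # then prefix every matching line with the name of its file.
        iconv -f UTF-16 -t UTF-8 "$f" |
            tr -d '\r' |
            grep -i -- "$pattern" |
            while IFS= read -r line; do
                printf '%s: %s\n' "$f" "$line"
            done
    done
}
```

In the directory with the logs it would be called as: search_utf16 'string' *.html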

Hope this helps. Bye

Last edited by colucix; 11-04-2009 at 05:25 PM.
 
Old 11-04-2009, 05:34 PM   #9
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,186

Rep: Reputation: 346
Did you try cat ...html | grep -i <target>? If cat works, that should work.

If you want to know the name of file where the match occurs, look at the find command using xargs. (See info find for details and examples.)
 
Old 11-04-2009, 05:36 PM   #10
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 738
Wow!!! Is there **anything** that guy doesn't know.......

What was wrong with good old ascii? Why do we need Unicode, UTF, and all that stuff? Where's my Apple-II?
 
Old 11-04-2009, 06:16 PM   #11
Rob00
LQ Newbie
 
Registered: Oct 2009
Posts: 13

Original Poster
Rep: Reputation: 0
PTrenholme: Yes, I tried that and it doesn't work. On a single file, grep finds no matches. On *.html, it says "Binary file standard output matches"

Impressive and interesting reply Guru colucix and iconv works great too, thanks.
 
Old 11-04-2009, 07:16 PM   #12
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 15,988

Rep: Reputation: 2217
Yes, awesome colucix - and not a mention of awk anywhere ...
 
  

