LinuxQuestions.org
Old 10-30-2012, 03:27 AM   #1
Vi_rod
Member
 
Registered: Dec 2011
Posts: 42

Rep: Reputation: Disabled
To capture email ids from a doc file


I am trying to capture email IDs from all the doc files in a folder and write them to a separate text file. For this I run:

cat /opt/data/ca/aa.doc | egrep "Email |email |@"


But the output I get is: Binary file (standard input) matches. Please let me know what is wrong with my approach.
 
Old 10-30-2012, 03:39 AM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387
Most Unix/Linux tools, like grep, need a plain text file as input. Windows doc format isn't plain text.

You could try using grep's --binary-files=text option, but this can print binary data to your terminal, which in turn can corrupt your terminal session:
Code:
egrep --binary-files=text "Email |email |@" /opt/data/ca/aa.doc
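A minimal sketch of a safer variant, assuming GNU grep: -a is the short form of --binary-files=text, and piping the result through tr drops non-printable bytes before they reach the terminal. The helper name grep_binary_as_text is made up for illustration. Word files may store their text in a 16-bit encoding, in which case the address will not be found this way and a conversion to plain text is still needed.

```shell
# Hypothetical helper: search a binary file as text, but keep only printable
# characters and newlines in the output so the terminal is not disturbed.
grep_binary_as_text() {
    # $1 = extended regular expression, $2 = file to search
    grep -a -E "$1" "$2" | LC_ALL=C tr -cd '[:print:]\n'
}

# Usage with the file from the question:
# grep_binary_as_text 'Email |email |@' /opt/data/ca/aa.doc
```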
Otherwise you need to convert the doc to plain text and use that as input.
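One way to do the conversion for a whole folder is a small loop. This is only a sketch: a converter such as antiword or catdoc is assumed to be installed and to print the document text on standard output. The function name docs_to_text and the directories in the usage lines are invented for the example.

```shell
# Hypothetical helper: convert every .doc in a folder to a .txt file.
# $1 = folder with .doc files, $2 = output folder,
# $3 = converter that writes plain text to stdout (assumed default: antiword)
docs_to_text() {
    src=$1; dest=$2; converter=${3:-antiword}
    mkdir -p "$dest"
    for doc in "$src"/*.doc; do
        [ -e "$doc" ] || continue          # no .doc files: nothing to do
        name=$(basename "$doc" .doc)
        "$converter" "$doc" > "$dest/$name.txt"
    done
}

# Usage (paths are examples only):
# docs_to_text /opt/data/ca /tmp/ca_txt
# grep -E "Email |email |@" /tmp/ca_txt/*.txt
```

Passing the converter as an argument keeps the loop independent of which tool happens to be installed.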
 
2 members found this post helpful.
Old 10-30-2012, 06:02 AM   #3
Vi_rod
Member
 
Registered: Dec 2011
Posts: 42

Original Poster
Rep: Reputation: Disabled
Thanks. I converted the docs to HTML, tried again, and got this output:

cat sur_sep_2012_.html | egrep "@"

@page { size: 8.5in 11in; margin-right: 0.81in; margin-top: 0.88in; margin-bottom: 0.59in }
@page:first { }
</FONT><FONT COLOR="#0000ff"><FONT FACE="Verdana, sans-serif"><FONT COLOR="#000000"><FONT SIZE=1 STYLE="font-size: 8pt"><SPAN STYLE="text-decoration: none">surk2@gmail.com</SPAN></FONT></FONT></FONT></FONT></FONT></P>

How can I get only the email ID as output, without the tags?
 
Old 10-30-2012, 06:28 AM   #4
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387
You can use grep's -o option (print only the matched, non-empty parts of a line) together with a more restrictive regular expression:
Code:
egrep -o "[A-Za-z0-9\.]+@[A-Za-z0-9\.]+" sur_sep_2012_.html
The regular expression assumes that an email address consists of one or more letters (capitals included), digits and/or dots on both sides of the @ sign. Addresses containing other characters, such as - or _, will be cut off at that character.

Example run on the small snippet of data you posted:
Code:
egrep -o "[A-Za-z0-9\.]+@[A-Za-z0-9\.]+"  sur_sep_2012_.html
surk2@gmail.com
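
To get back to the original goal of collecting the addresses in a separate text file, the same kind of command can be run over several files and redirected. A sketch, assuming GNU grep (grep -E behaves like egrep); the function name collect_emails and the output file name are placeholders. The bracket expressions are slightly wider than the ones above (they also allow _ % + and -), which is an assumption about what the addresses look like, not a full email validator.

```shell
# Hypothetical helper: print each address found in the given files once.
collect_emails() {
    # -E: extended regex, -o: only the matched text, -h: no file name prefix
    grep -E -o -h '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+' "$@" | sort -u
}

# Usage: write the unique addresses to a text file
# collect_emails *.html > emails.txt
```

sort -u removes duplicates, which helps when the same address appears in more than one document.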
 
1 member found this post helpful.
Old 10-30-2012, 06:49 AM   #5
Vi_rod
Member
 
Registered: Dec 2011
Posts: 42

Original Poster
Rep: Reputation: Disabled
Wow, exactly the result required! Thanks
 
Old 10-30-2012, 07:33 AM   #6
Vi_rod
Member
 
Registered: Dec 2011
Posts: 42

Original Poster
Rep: Reputation: Disabled
How would I show in the output which file each email ID came from?
 
Old 10-30-2012, 07:59 AM   #7
Wim Sturkenboom
Senior Member
 
Registered: Jan 2005
Location: Roodepoort, South Africa
Distribution: Slackware 10.1/10.2/12, Ubuntu 12.04, Crunchbang Statler
Posts: 3,786

Rep: Reputation: 282
If I understand you correctly, you want to see the name of the file in which each email address was found. Check (e)grep's -H option:

Code:
egrep -H -o "[A-Za-z0-9\.]+@[A-Za-z0-9\.]+"  sur_sep_2012_.html
According to the man page, this is the default when grep searches through multiple files. So both commands below should give the same result.
Code:
egrep -H -o "[A-Za-z0-9\.]+@[A-Za-z0-9\.]+" *
egrep -o "[A-Za-z0-9\.]+@[A-Za-z0-9\.]+" *
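
As a sketch of what the prefixed output looks like, the helper below (emails_with_file is an invented name) always passes -H, so each line has the form file:address even when only one file is given. GNU grep is assumed.

```shell
# Hypothetical helper: list "file:address" pairs for the given files.
emails_with_file() {
    # -H: always print the file name in front of each match
    grep -E -o -H '[A-Za-z0-9.]+@[A-Za-z0-9.]+' "$@"
}

# Usage: save the pairs for all HTML files in the current folder
# emails_with_file *.html > emails_by_file.txt
```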
 
Old 11-01-2012, 03:08 AM   #8
Vi_rod
Member
 
Registered: Dec 2011
Posts: 42

Original Poster
Rep: Reputation: Disabled
Yes, I tried what you said. -H was something I was unaware of.
 
Old 11-01-2012, 03:18 AM   #9
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387
@Vi_rod: -H, -o and --binary-files= are just three options. Grep has a lot of other useful options. Have a look here: GNU Grep
 
  

