LinuxQuestions.org

Vi_rod 10-30-2012 02:27 AM

To capture email ids from a doc file
 
I am trying to capture email IDs from all the doc files in a folder and write them out to a separate txt file. For this I run:

cat /opt/data/ca/aa.doc | egrep "Email |email |@"


But the output I get is "Binary file (standard input) matches". Please let me know what is wrong with my approach.

druuna 10-30-2012 02:39 AM

Most Unix/Linux tools, like grep, expect plain text as input. The Word doc format is binary, not plain text, so grep only reports that the file matches instead of printing the matching lines.

You could try grep's --binary-files=text option, but this can write binary data to your terminal, which in turn can garble your terminal session:
Code:

egrep --binary-files=text "Email |email |@" /opt/data/ca/aa.doc
Otherwise you need to convert the doc to plain text and use that as input.
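
As a rough illustration of the "convert first" idea, the sketch below replaces every non-printable byte with a newline using tr before searching. It is only a crude stand-in for a real doc-to-text converter (it will not help if the document stores its text in a 16-bit encoding), and the sample file contents and address are made up for the demo:

```shell
# Build a small synthetic "binary" file: text surrounded by NUL and control bytes.
# This is an assumption for the demo, not a real .doc file.
tmp=$(mktemp)
printf 'junk\000\001Email: someone@example.com\000\002more\n' > "$tmp"

# Turn every byte that is not printable ASCII or a newline into a newline,
# so grep sees plain text and prints the matching line.
result=$(LC_ALL=C tr -c '[:print:]\n' '\n' < "$tmp" | grep -E 'Email |email |@')
echo "$result"

rm -f "$tmp"
```

Because the binary bytes never reach grep or the terminal, this avoids the risk mentioned for --binary-files=text.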

Vi_rod 10-30-2012 05:02 AM

Thanks. So I converted the docs to HTML, tried again, and got this output:

cat sur_sep_2012_.html | egrep "@"

@page { size: 8.5in 11in; margin-right: 0.81in; margin-top: 0.88in; margin-bottom: 0.59in }
@page:first { }
</FONT><FONT COLOR="#0000ff"><FONT FACE="Verdana, sans-serif"><FONT COLOR="#000000"><FONT SIZE=1 STYLE="font-size: 8pt"><SPAN STYLE="text-decoration: none">surk2@gmail.com</SPAN></FONT></FONT></FONT></FONT></FONT></P>

How can I get only the email ID as output, without the tags?

druuna 10-30-2012 05:28 AM

You can use grep's -o option (Print only the matched (non-empty) parts) and you need a more restrictive regular expression:
Code:

egrep -o "[A-Za-z0-9.]+@[A-Za-z0-9.]+" sur_sep_2012_.html
The regular expression assumes that an email address consists of one or more letters (upper or lower case), digits, and/or dots on both sides of the @ sign. Addresses containing other characters, such as an underscore or a hyphen, will only be matched partially.

Example run on the small snippet of data you posted:
Code:

egrep -o "[A-Za-z0-9.]+@[A-Za-z0-9.]+" sur_sep_2012_.html
surk2@gmail.com
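
If the addresses may contain characters such as an underscore, plus sign, or hyphen, a somewhat broader pattern can be tried. The one below is only a sketch and not a full address validator; the sample line and address are invented for the demo. It uses grep -E, which is equivalent to egrep:

```shell
# Broader (but still approximate) pattern: extra characters in the local part,
# hyphens in the domain, and a final dot followed by at least two letters.
pattern='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# Hypothetical HTML fragment standing in for the converted document.
sample='<SPAN>first.last_name+tag@mail-server.example.org</SPAN>'

found=$(printf '%s\n' "$sample" | grep -E -o "$pattern")
echo "$found"
```

The hyphen is placed last inside each bracket expression so that it is taken literally rather than as a range.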


Vi_rod 10-30-2012 05:49 AM

Wow, exactly the result required! Thanks

Vi_rod 10-30-2012 06:33 AM

How would I display which file each email ID came from in the output?

Wim Sturkenboom 10-30-2012 06:59 AM

If I understand you correctly, you want to see the filename in which the email address was found. Check (e)grep's -H option:

Code:

egrep -H -o "[A-Za-z0-9.]+@[A-Za-z0-9.]+" sur_sep_2012_.html
According to the man page, this is the default when grep searches through multiple files. So both commands below should give the same result.
Code:

egrep -H -o "[A-Za-z0-9.]+@[A-Za-z0-9.]+" *
egrep -o "[A-Za-z0-9.]+@[A-Za-z0-9.]+" *
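
To see what the filename prefix looks like, here is a small sketch run against two throwaway files. The file names and addresses are made up, and the pattern is the one from this thread written without the backslash (inside brackets a dot is already literal). The last command uses -h to drop the filename again and collects a sorted, de-duplicated list, which could be redirected into a txt file:

```shell
# Create two hypothetical HTML files in a temporary directory.
dir=$(mktemp -d)
printf '<P>alice@example.com</P>\n' > "$dir/a.html"
printf '<P>bob@example.org</P>\n' > "$dir/b.html"

re='[A-Za-z0-9.]+@[A-Za-z0-9.]+'

# Several files: grep prefixes each match with "filename:" by default.
with_names=$(cd "$dir" && grep -E -o "$re" *.html)
echo "$with_names"

# -h suppresses the prefix; sort -u removes duplicate addresses.
# Append "> emails.txt" to the pipeline to save the list to a file.
only_addresses=$(cd "$dir" && grep -E -h -o "$re" *.html | sort -u)
echo "$only_addresses"

rm -rf "$dir"
```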


Vi_rod 11-01-2012 02:08 AM

Yes, I tried what you said. -H was something I was unaware of.

druuna 11-01-2012 02:18 AM

@Vi_rod: -H, -o and --binary-files= are just three options. Grep has a lot of other useful options. Have a look here: GNU Grep


All times are GMT -5.