Capturing email IDs from a .doc file
I am trying to capture email IDs from all the .doc files in a folder and write them out to a separate text file. For this I run:
cat /opt/data/ca/aa.doc | egrep "Email |email |@"
But the output I get is:
Binary file (standard input) matches
Please let me know what is wrong with my approach. |
Most Unix/Linux tools, like grep, expect plain text as input. The Microsoft Word .doc format is a binary format, not plain text.
You could try grep's --binary-files=text option, but this can print raw binary data to your terminal, which in turn can garble your terminal session:
Code:
egrep --binary-files=text "Email |email |@" /opt/data/ca/aa.doc |
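If the raw bytes are the worry, one possible workaround is to pass grep's output through tr so that only printable characters reach the terminal. This is a minimal sketch and not something posted in the thread: the function name safe_grep_email is made up for illustration, -a is GNU grep's short form of --binary-files=text, and it assumes the addresses sit in the file as plain single-byte text, which is not guaranteed for every .doc.

```shell
# Hypothetical helper (the name is an assumption, not a standard tool).
# Prints the lines of a binary file that mention an email, with every
# non-printable byte removed so the terminal is not disturbed.
safe_grep_email() {
    # LC_ALL=C makes both tools treat the input as raw bytes.
    LC_ALL=C grep -a -E "Email |email |@" "$1" | LC_ALL=C tr -cd '[:print:]\n'
}

# Usage, with the path from the question:
# safe_grep_email /opt/data/ca/aa.doc
```

The matching lines can still contain leftover printable junk from the document, so this only protects the terminal; it does not isolate the addresses.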
Thanks. So I converted the docs to HTML, tried again, and got this output:
cat sur_sep_2012_.html | egrep "@"
@page { size: 8.5in 11in; margin-right: 0.81in; margin-top: 0.88in; margin-bottom: 0.59in }
@page:first { }
</FONT><FONT COLOR="#0000ff"><FONT FACE="Verdana, sans-serif"><FONT COLOR="#000000"><FONT SIZE=1 STYLE="font-size: 8pt"><SPAN STYLE="text-decoration: none">surk2@gmail.com</SPAN></FONT></FONT></FONT></FONT></FONT></P>
How can I get only the email ID as output, without the tags? |
You can use grep's -o option (Print only the matched (non-empty) parts) and you need a more restrictive regular expression:
Code:
egrep -o "[A-Za-z0-9\.]+@[A-Za-z0-9\.]+" sur_sep_2012_.html
Example run on the small snippet of data you posted:
Code:
egrep -o "[A-Za-z0-9\.]+@[A-Za-z0-9\.]+" sur_sep_2012_.html |
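The pattern in the post above only allows letters, digits and dots, so an address containing an underscore, hyphen or plus sign would be cut short. A slightly broader pattern might look like the sketch below. It is an illustration rather than a full email validator: the function name extract_emails and the exact character classes are assumptions chosen for this example. Note also that inside a bracket expression a dot is already literal, so it needs no backslash.

```shell
# Hypothetical helper: print each distinct address found in the given files.
extract_emails() {
    # Local part: letters, digits and . _ % + -
    # Domain: letters, digits, dots and hyphens, ending in a dot plus 2+ letters
    # -h suppresses file names so sort -u can drop duplicates across files
    grep -h -o -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' "$@" | LC_ALL=C sort -u
}

# Usage, with the file name from the thread:
# extract_emails sur_sep_2012_.html
```

Requiring a dot followed by letters at the end of the domain also keeps a sentence-ending period out of the match. The CSS @page lines from the earlier output are still skipped, because nothing precedes the @ sign there.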
Wow, exactly the result required! Thanks
|
How would I show, in the output, which file each email ID came from?
|
If I understand you correctly, you want to see the name of the file in which each email address was found. Check (e)grep's -H option:
Code:
egrep -H -o "[A-Za-z0-9\.]+@[A-Za-z0-9\.]+" sur_sep_2012_.html
Or, to search every file in the current directory:
Code:
egrep -H -o "[A-Za-z0-9\.]+@[A-Za-z0-9\.]+" * |
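To tie this back to the original goal of collecting the addresses from a whole folder into a text file, the -H and -o options can be combined with a redirect. The following is only a sketch under some assumptions: the function name emails_by_file is invented for this example, the documents are assumed to have been converted to .html already, and the output file should live outside the glob (a .txt name is fine) so that it is not searched itself.

```shell
# Hypothetical helper: emails_by_file DIR OUTFILE
# Writes one "filename:address" line per match found in DIR's .html files.
emails_by_file() {
    dir=$1
    out=$2
    # -H prefixes each match with its file name, -o keeps only the match
    grep -H -o -E '[A-Za-z0-9.]+@[A-Za-z0-9.]+' "$dir"/*.html > "$out"
}

# Usage (both paths are placeholders):
# emails_by_file /path/to/converted /path/to/emails.txt
```

Files without any address simply contribute no lines to the output file.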
Yes, I tried what you said. -H was something I was unaware of.
|
@Vi_rod: -H, -o and --binary-files= are just three of grep's options. Grep has a lot of other useful ones. Have a look here: GNU Grep
|