LinuxQuestions.org
Old 12-27-2007, 08:27 PM   #1
xiawinter
LQ Newbie
 
Registered: Aug 2007
Posts: 27

Rep: Reputation: 15
how to grep a "keyword" beside another keyword?


Hi All,
This is my first post in LQ.

My problem is this:

I need to grep a keyword from a string or a document, but only where that keyword appears beside another keyword.
Suppose s="I Love LQLQLQLQLQ very much".

I want to grep (LQ)+, but only where LQ is beside "Love".
So my solution is egrep "Love (LQ)+" | egrep -o "(LQ)+"

But I have thousands of such patterns, and it is very troublesome to do it this way. Do you have any suggestions?

I want to do this task in bash only, and I need a quick solution (something like grep).

Any ideas on this will be greatly appreciated. Thanks in advance.
 
Old 12-27-2007, 08:31 PM   #2
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
For regular grep use "\(<pattern1>\)\+<pattern2>" (the \+ form is a GNU extension; plain + is a literal character in basic syntax). For egrep or grep -E use "(<pattern1>)+<pattern2>".
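A minimal sketch of the two syntaxes, using the sample line from the first post (the variable names are only for illustration). In basic syntax the group is written \( \), and \{1,\} is the POSIX spelling of "one or more":

```shell
#!/bin/sh
# Sample line taken from the first post.
s='I Love LQLQLQLQLQ very much'

# Extended syntax (grep -E): ( ) groups and + repeats, no backslashes needed.
ere_match=$(printf '%s\n' "$s" | grep -oE 'Love (LQ)+')

# Basic syntax: groups are \( \), and \{1,\} means "one or more".
bre_match=$(printf '%s\n' "$s" | grep -o 'Love \(LQ\)\{1,\}')

# Both variables now hold the same matched text.
printf '%s\n' "$ere_match" "$bre_match"
```

Both commands print the same match, "Love LQLQLQLQLQ"; the -o option prints only the matched part of the line rather than the whole line.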
 
Old 12-27-2007, 09:09 PM   #3
xiawinter
LQ Newbie
 
Registered: Aug 2007
Posts: 27

Original Poster
Rep: Reputation: 15
Thanks jschiwal.

I tried egrep -o "(l)cd" document.txt, but it still reported "lcd" rather than "cd".

I want only "cd" reported, but only when that "cd" is part of "lcd", not when it is part of "mcd" or "bcd".

I also don't follow the "<>" notation. Is it used to represent a pattern?

Thanks anyway, J.
 
Old 12-27-2007, 09:39 PM   #4
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
No, the <pattern> notation is just how I indicated that "pattern" is a placeholder and not meant literally.

Code:
"(l)cd"
Your test is for lcd, not cd. Your example here is oversimplified to the point of being a literal.

To use your first example: "Love (LQ)+ " will match "Love LQ " and "Love LQLQ " and "Love LQLQLQ ". If you get too many matches, then your search criteria aren't selective enough. Maybe you wanted "Love (LQ)+ very much" instead. This wouldn't match "Love LQ daily", where the first pattern would.
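As a rough illustration (the three sample lines are made up), counting matching lines with grep -c shows how the longer pattern narrows the result:

```shell
#!/bin/sh
# Hypothetical sample input: two lines contain "Love LQ...", one does not.
lines='I Love LQ very much
I Love LQLQ daily
I Love Linux very much'

# Loose pattern: any "Love LQ..." run followed by a space.
loose=$(printf '%s\n' "$lines" | grep -cE 'Love (LQ)+ ')

# Selective pattern: the trailing context is part of the regex.
strict=$(printf '%s\n' "$lines" | grep -cE 'Love (LQ)+ very much')

echo "loose=$loose strict=$strict"
```

This prints loose=2 strict=1: the "daily" line satisfies the short pattern but not the longer one.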

By the way, there is a "regex" manpage. There is also a gawk info file that includes a regular expression chapter.
 
Old 12-28-2007, 03:49 AM   #5
xiawinter
LQ Newbie
 
Registered: Aug 2007
Posts: 27

Original Poster
Rep: Reputation: 15
Thanks, J.

Actually, I did understand that the pattern "Love (LQ)+" would match both "Love LQ very much" and "Love LQ daily".

My question now is:

How can I extract the "Love LQ" from "Love LQ very much", but not the "Love LQ" from "Love LQ daily"?

The simple way is
Code:
grep "Love LQ very much" | grep -o "Love LQ"
But each pattern has a different condition attached to it, so I hope I can write the patterns and their conditions in the same file or in different files and grep for them all at once.

I hope that makes my question clear.

Thanks for your quick reply.

Last edited by xiawinter; 12-28-2007 at 08:09 PM.
 
Old 12-28-2007, 08:21 PM   #6
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
If you include a pattern that is too small you will get false positives. Since the following text is a part of the pattern, you should include it in the regular expression. The (LQ)+ part will allow variations. The 'very much' part should still be a part of the regex so that you don't have false positives.
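One way to keep the context in the regex and still print only the keyword is a capture group. The sketch below uses sed rather than grep; extract_lq is a hypothetical helper name and the sample lines are invented:

```shell
#!/bin/sh
# extract_lq: print only the (LQ)+ run from lines where it sits between
# "Love " and " very much". Lines without that context print nothing.
extract_lq() {
    # \( \) captures and \{1,\} means "one or more"; -n together with the
    # p flag prints only lines where the substitution succeeded.
    sed -n 's/.*Love \(\(LQ\)\{1,\}\) very much.*/\1/p'
}

printf '%s\n' 'I Love LQLQ very much' 'I Love LQ daily' | extract_lq
```

Only the first sample line produces output (LQLQ); the "daily" line is skipped because the trailing context does not match. Note that this reports at most one hit per input line.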

Restricting a match by its context is somewhat similar to something you may run into when using sed. You might want to edit a pattern, but only within a range of lines delimited by other patterns:
Code:
echo '<script type="text/javascript">
/* Start Of Javascript Generated By WP-Polls 2.12 */
/* <![CDATA[ */
if(site_url != 'http://blogs.zdnet.com/hardware' || ajax_url != '/hardware/index.php') {
var site_url = 'http://blogs.zdnet.com/hardware';
var ajax_url = '/hardware/index.php';
}
/* ]]> */
/* End Of Javascript Generated By WP-Polls 2.12 */
</script>' | sed '/<script type="text\/javascript">/,/<\/script>/s/hardware/software/g'
<script type="text/javascript">
/* Start Of Javascript Generated By WP-Polls 2.12 */
/* <![CDATA[ */
if(site_url != http://blogs.zdnet.com/software || ajax_url != /software/index.php) {
var site_url = http://blogs.zdnet.com/software;
var ajax_url = /software/index.php;
}
/* ]]> */
/* End Of Javascript Generated By WP-Polls 2.12 */
</script>
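This example changes every "hardware" to "software" in an HTML document, but only on lines inside a JavaScript block. (The single quotes from the script are missing in the output because the whole input is itself wrapped in single quotes for echo.)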
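The same address-range idea on a smaller, made-up document. This is only a sketch: replace_in_script is a hypothetical name, and a quoted here-document is used so that the single quotes in the input survive:

```shell
#!/bin/sh
# replace_in_script: rewrite "hardware" to "software", but only on lines
# between an opening <script tag and the closing </script> tag.
replace_in_script() {
    # /start/,/end/ is a sed address range; the s command after it is
    # applied only to lines inside that range (both end lines included).
    sed '/<script/,/<\/script>/s/hardware/software/g'
}

# The quoted delimiter 'EOF' keeps the shell from touching the content.
replace_in_script <<'EOF'
<p>hardware news</p>
<script type="text/javascript">
var site_url = 'http://blogs.zdnet.com/hardware';
</script>
<p>more hardware</p>
EOF
```

Only the line inside the script block is changed; the two paragraph lines outside the range keep the word "hardware".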

Last edited by jschiwal; 12-28-2007 at 08:24 PM.
 
Old 12-28-2007, 08:22 PM   #7
xiawinter
LQ Newbie
 
Registered: Aug 2007
Posts: 27

Original Poster
Rep: Reputation: 15
Hi,

I would like to explain my problem further.

I want to match and count strings like "intel", but "intel" should not be part of another word like "intelligence" (likewise "cd", but not inside "lcd").

If I grep with patterns like "[^a-zA-Z]?intel[^a-zA-Z]?", the output might be ",intel", ".intel" or "@cd", "~cd". Because my document is in Chinese, any Chinese characters beside "intel" or "cd" should fit the pattern.
What I actually want is only "intel" and "cd", not ",intel" and ".cd", so that I can count the number of lines for each pattern using sort | uniq -c.

Now I use two steps to do the task:
Code:
egrep -o -f pattern1.txt document.txt >temp
egrep -o -f pattern2.txt temp|sort|uniq -c >keywords.txt
rm temp
where pattern1.txt holds the patterns I described above ("[^a-zA-Z]?intel[^a-zA-Z]?"), and pattern2.txt holds only the core words ("intel").

I hope this can be done with a single pattern, without creating a temp file and using two pattern files, because I need to do this for every English word, which consumes a lot of resources.

Thanks for your ideas on this problem.
 
Old 12-29-2007, 12:50 AM   #8
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
If you want to count the number of occurrences of each word in a document, it will probably work out best if you replace spaces with newlines (maybe using "tr"). You will need to join hyphenated words first, however.

Look at something like this for ideas:
Code:
man -Tascii uniq | tr ' ' '\n' | tr -s '\n' | sort -f | uniq -ic  | less
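Applied to the keyword question, the same split-then-count idea might look like the sketch below. It rests on some assumptions: count_keywords is a hypothetical helper, the keywords are plain ASCII words listed one per line in a file, and matching is case-sensitive. It counts occurrences rather than lines, and it needs no temp file:

```shell
#!/bin/sh
# count_keywords KEYWORD_FILE DOCUMENT
# Counts whole-word occurrences of each keyword listed in KEYWORD_FILE.
count_keywords() {
    # tr -cs replaces every run of bytes that are not ASCII letters
    # (punctuation, spaces, multi-byte characters) with a single newline,
    # leaving each ASCII word alone on its own line.
    # grep -x keeps only lines that equal a keyword exactly (-F: fixed
    # strings, -f: read them from the file), so "intelligence" and "lcd"
    # are never counted as "intel" or "cd".
    LC_ALL=C tr -cs 'a-zA-Z' '\n' < "$2" | grep -xF -f "$1" | sort | uniq -c
}
```

It would be called as count_keywords keywords.txt document.txt, where both file names are placeholders. Because everything that is not an ASCII letter becomes a separator, ",intel" and "@cd" are reduced to "intel" and "cd" before counting.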
 
  


All times are GMT -5. The time now is 08:37 AM.
