LinuxQuestions.org
Old 12-03-2010, 09:06 AM   #1
antcore99
LQ Newbie
 
Registered: Dec 2010
Posts: 7

Rep: Reputation: 0
Grep question


I have a long file that is structured like this:

Code:
This is about soccer #soccer_generic #soccer_intro . More information is here in more text.
This is line 2 #another_hash_tag #hastag_2 . And here is even more text.
I'd like to obtain a list of the hashtags used in that text, like this:

Code:
#soccer_generic 
#soccer_intro
#another_hash_tag 
#hastag_2
I've tested every variation I could come up with on:
Quote:
egrep -oh '#.*?\S' filename
The problem seems to be with multiple hashtags on a single line. What am I doing wrong? Is awk a better option?
 
Old 12-03-2010, 09:12 AM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
In awk I would do something like this:
Code:
awk 'BEGIN{RS="[[:space:]]"}/^#/'
Not sure about the problem with grep: what is the output of your command?
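
A regular expression as RS is not guaranteed everywhere (POSIX leaves a multi-character RS unspecified; gawk, for one, accepts it). Where that is a concern, the same one-word-per-record idea can be sketched with tr and grep. The variable names and the inline sample text are only for illustration:

```shell
# Sample text from the first post, held in a variable for this sketch.
sample='This is about soccer #soccer_generic #soccer_intro . More information is here in more text.
This is line 2 #another_hash_tag #hastag_2 . And here is even more text.'

# Put every whitespace-separated word on its own line,
# then keep only the words that start with '#'.
tags=$(printf '%s\n' "$sample" | tr -s '[:space:]' '\n' | grep '^#')

printf '%s\n' "$tags"
```

Like the awk one-liner, this keeps any punctuation glued to a tag, so it relies on the tags being followed by whitespace.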
 
Old 12-03-2010, 09:32 AM   #3
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187

Rep: Reputation: 354
In your sample text it looks (to me) like the "hash tag" block is (always?) terminated by a " . ".

colucix's suggestion assumes
  • That the tags contain no space separators and
  • That there are no [[:space:]]# sequences following the " . "

If this is the case, the suggested solution will work. If it is not the case, please try to describe your problem more completely.
 
Old 12-03-2010, 09:35 AM   #4
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Here is a working grep:
Code:
grep -E -o '#[^ ]+'
Matching any character that is not a blank space limits the match to a single word, whereas the .* pattern includes spaces as well and matches everything up to the end of the line. Hope this helps.
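
The difference between the two patterns can be seen on the sample text itself. This is only a sketch: the variable names are made up for the example, and '#.*' stands in for the original '#.*?\S' on the assumption that grep -E has no lazy quantifier, so the '?' there does not shorten the match.

```shell
# Sample text from the first post, held in a variable for this sketch.
sample='This is about soccer #soccer_generic #soccer_intro . More information is here in more text.
This is line 2 #another_hash_tag #hastag_2 . And here is even more text.'

# Greedy: '.*' runs from the first '#' to the end of the line,
# so each input line produces one long match.
greedy=$(printf '%s\n' "$sample" | grep -E -o '#.*')

# Negated class: '[^ ]+' stops at the first space,
# so every tag becomes a match of its own.
per_tag=$(printf '%s\n' "$sample" | grep -E -o '#[^ ]+')

printf 'greedy:\n%s\n\nper tag:\n%s\n' "$greedy" "$per_tag"
```

The -o option prints each match on its own line, which is why the second command yields one tag per line even when a line holds several.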
 
Old 12-03-2010, 09:35 AM   #5
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 556
Code:
sasha@reactor: grep -o '#\w*' tags
#soccer_generic
#soccer_intro
#another_hash_tag
#hastag_2
sasha@reactor:
Appears to work like this, but is pretty simple and quickly thrown together so there may be a fatal flaw in it.
 
Old 12-03-2010, 09:36 AM   #6
antcore99
LQ Newbie
 
Registered: Dec 2010
Posts: 7

Original Poster
Rep: Reputation: 0
@colucix
Thanks, your solution did the trick.

This is what the grep command from the start post gives back for the example text from the start post:
Quote:
#soccer_generic #soccer_intro . More information is here in more text.
#another_hash_tag #hastag_2 . And here is even more text.
@PTrenholme
My objective was to collect the hash tags separately, not as a block. colucix's awk solution was correct for this purpose. We do not know how to do this in grep yet, though.

Last edited by antcore99; 12-03-2010 at 09:38 AM. Reason: typo, clarification
 
Old 12-03-2010, 09:46 AM   #7
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by GrapefruiTgirl
sasha@reactor: grep -o '#\w*' tags
Indeed it works and it's better than mine, since it excludes any punctuation immediately following the hashed tag. Nice.
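
A small comparison of the two patterns, with a made-up input line that has punctuation right after the tags and a stray '#'. It assumes a grep that understands \w (GNU grep does; it is not part of POSIX), and the third pattern is a suggested variation rather than something posted in this thread:

```shell
# Hypothetical input: tags followed by punctuation, plus a lone '#'.
line='Read #intro, then #rules. Issue # 5 is open.'

# Everything up to the next space: trailing punctuation stays attached.
space_delim=$(printf '%s\n' "$line" | grep -E -o '#[^ ]+')

# Word characters only: punctuation is dropped, but because '*' allows
# zero characters, a lone '#' also counts as a match.
word_star=$(printf '%s\n' "$line" | grep -o '#\w*')

# Requiring at least one word character skips the lone '#'.
word_plus=$(printf '%s\n' "$line" | grep -E -o '#\w+')

printf '[^ ]+ :\n%s\n\n\\w*  :\n%s\n\n\\w+  :\n%s\n' \
    "$space_delim" "$word_star" "$word_plus"
```

Whether the lone '#' matters depends on the data; for the sample file in this thread all three patterns find the same four tags.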
 
Old 12-03-2010, 09:48 AM   #8
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by antcore99
We do not know how to do this in grep yet, though.
You have to follow this thread quickly. We are very fast!
 
Old 12-03-2010, 10:19 AM   #9
antcore99
LQ Newbie
 
Registered: Dec 2010
Posts: 7

Original Poster
Rep: Reputation: 0
Fast you are! Thank you GrapefruiTgirl, your solution is simple but effective.
 
Old 12-07-2010, 03:02 AM   #10
antcore99
LQ Newbie
 
Registered: Dec 2010
Posts: 7

Original Poster
Rep: Reputation: 0
Addendum

A quick addition to this question: How would one go about obtaining a count of occurrences after each tag? Like so:

Code:
#soccer_generic (3)
#soccer_intro (1)
#another_hash_tag (8)
#hastag_2 (1)
 
Old 12-07-2010, 03:50 AM   #11
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Using awk you can easily count the occurrences of each tag, whereas grep can only count the matches all together:
Code:
awk '{for ( i = 1; i<=NF; i++ ) if ( $i ~ /^#\w/ ){ sub(/[[:punct:]]+$/,"",$i); _[$i]++ }} END{ for ( i in _ ) printf "%s (%d)\n",i,_[i]}' file
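
If a single awk program is not a requirement, a similar tally can be sketched by feeding grep -o into sort and uniq -c. The sample text and variable names here are invented for the example, and the bracket expression is used in place of \w only to avoid depending on that extension:

```shell
# Made-up sample in which one tag appears three times.
sample='Kickoff #soccer_generic #soccer_intro . Some text.
Half time #soccer_generic #another_hash_tag . More text.
Full time #soccer_generic, done.'

# 1. grep -o prints every tag on its own line (punctuation excluded).
# 2. sort groups identical tags next to each other.
# 3. uniq -c prefixes each distinct tag with its count.
# 4. awk swaps the columns into the "tag (count)" layout.
counts=$(printf '%s\n' "$sample" \
    | grep -E -o '#[[:alnum:]_]+' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ printf "%s (%d)\n", $2, $1 }')

printf '%s\n' "$counts"
```

The result comes out in alphabetical order of the tags, because uniq only merges adjacent lines and so needs sorted input.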

Last edited by colucix; 12-07-2010 at 04:43 AM. Reason: Added code to remove punctuation after each matching field.
 
Old 12-07-2010, 02:08 PM   #12
antcore99
LQ Newbie
 
Registered: Dec 2010
Posts: 7

Original Poster
Rep: Reputation: 0
Thank you.
 
Old 12-08-2010, 10:45 PM   #13
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
And for good measure an alternative approach ;}
Code:
awk 'BEGIN{RS="[[:space:]]";ORS="\n"}/^#/{a[$1]++}END{for (b in a){printf "%s (%s)\n", b,a[b]}}' soccer
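
One thing to keep in mind with either awk version: the order in which "for (x in array)" visits the entries is not specified, so the tags may be listed in any order. A sketch of one way to get a stable listing, most frequent tag first, is to print the count in front, sort, and reformat afterwards. The sample text, the variable names and the choice of sort keys are assumptions for this example:

```shell
# Made-up sample in which one tag appears three times.
sample='Kickoff #soccer_generic #soccer_intro . Some text.
Half time #soccer_generic #another_hash_tag . More text.
Full time #soccer_generic, done.'

ranked=$(printf '%s\n' "$sample" \
    | awk '{
          for (i = 1; i <= NF; i++)
              if ($i ~ /^#/) {
                  tag = $i
                  sub(/[^A-Za-z0-9_]+$/, "", tag)   # drop trailing punctuation
                  n[tag]++
              }
      }
      END { for (tag in n) printf "%d %s\n", n[tag], tag }' \
    | LC_ALL=C sort -k1,1nr -k2,2 \
    | awk '{ printf "%s (%d)\n", $2, $1 }')

printf '%s\n' "$ranked"
```

The first sort key orders by count, descending; the second breaks ties alphabetically so the output is the same on every run.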

Last edited by Tinkster; 12-08-2010 at 10:47 PM.
 
  


All times are GMT -5.
