[SOLVED] Grep question

antcore99 · 12-03-2010, 09:06 AM

I have a long file that is structured like this:

Code:

This is about soccer #soccer_generic #soccer_intro . More information is here in more text.
This is line 2 #another_hash_tag #hastag_2 . And here is even more text.

I'd like to obtain a list of the hashtags used in that text, like this:

Code:

#soccer_generic 
#soccer_intro
#another_hash_tag 
#hastag_2

I've tested every variation I could come up with on:

Quote:

egrep -oh '#.*?\S' filename

The problem seems to be with multiple hashtags on a single line. What am I doing wrong? Is AWK a better option?

colucix · 12-03-2010, 09:12 AM

In awk I would do something like this:

Code:

awk 'BEGIN{RS="[[:space:]]"}/^#/'

Not sure about the problem with grep: what is the output of your command?

PTrenholme · 12-03-2010, 09:32 AM

In your sample text it looks (to me) like the "hash tag" block is (always?) terminated by a " . ".

colucix's suggestion assumes:

1. That the tags contain no space separators, and
2. That there are no [[:space:]]# sequences following the " . "

If this is the case, the suggested solution will work. If it is not the case, please try to describe your problem more completely.
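To make the second assumption concrete, here is a small sketch. It does not use the awk command itself; it swaps in a tr/grep pipeline that splits on whitespace in the same spirit, and the sample line is made up for illustration (it is not from the original file).

```shell
# Hypothetical sample line: a "#10" appears after the " . " terminator.
line='Intro #soccer_generic . He wears #10 today.'

# Put every whitespace-separated word on its own line, then keep the
# words that start with '#'. This stands in for the awk RS approach.
words=$(printf '%s\n' "$line" | tr -s '[:space:]' '\n' | grep '^#')

printf '%s\n' "$words"
```

Both "#soccer_generic" and "#10" are printed, so any word starting with # after the " . " would be reported as a tag too.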

colucix · 12-03-2010, 09:35 AM

Here is a working grep:

Code:

grep -E -o '#[^ ]+'

Matching any character that is not a blank space limits the match to a single word, whereas the .* pattern includes spaces as well and matches everything up to the end of the line. Hope this helps.
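A minimal sketch of the difference, using the first sample line from the thread. The variable names are only for illustration, and the greedy pattern is simplified to '#.*' rather than the exact expression from the first post.

```shell
# Sample line taken from the first post.
line='This is about soccer #soccer_generic #soccer_intro . More information is here in more text.'

# Greedy: '.*' runs to the end of the line, so the whole rest of the
# line comes back as a single match.
greedy=$(printf '%s\n' "$line" | grep -E -o '#.*')

# Negated class: each match stops at the first space, one tag per match.
per_tag=$(printf '%s\n' "$line" | grep -E -o '#[^ ]+')

printf 'greedy:\n%s\n' "$greedy"
printf 'per tag:\n%s\n' "$per_tag"
```

The greedy variant prints everything from the first # to the end of the line, while the negated class prints the two tags on separate lines.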

GrapefruiTgirl · 12-03-2010, 09:35 AM

Code:

sasha@reactor: grep -o '#\w*' tags
#soccer_generic
#soccer_intro
#another_hash_tag
#hastag_2
sasha@reactor:

Appears to work like this, but is pretty simple and quickly thrown together so there may be a fatal flaw in it.

antcore99 · 12-03-2010, 09:36 AM

@colucix
Thanks, your solution did the trick.

This is what the grep command from the first post returns for the example text:

Quote:

#soccer_generic #soccer_intro . More information is here in more text.
#another_hash_tag #hastag_2 . And here is even more text.

@PTrenholme
My objective was to collect the hash tags separately, not as a block. colucix's awk solution was correct for this purpose. We do not know how to do this in grep yet, though.

colucix · 12-03-2010, 09:46 AM

Quote:

Originally Posted by GrapefruiTgirl

sasha@reactor: grep -o '#\w*' tags

Indeed it works and it's better than mine, since it excludes any punctuation immediately following the hashed tag. Nice.
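A quick sketch of that difference. The sample line is hypothetical (the original file has a space after every tag), and \w is assumed to be available as it is in GNU grep.

```shell
# Hypothetical sample line: tags directly followed by punctuation.
line='Kickoff at noon #match_day, see also #fixtures.'

# '[^ ]+' stops only at a space, so trailing punctuation is kept.
with_punct=$(printf '%s\n' "$line" | grep -E -o '#[^ ]+')

# '\w*' matches only letters, digits and underscores, so it stops
# before the comma and the full stop.
word_only=$(printf '%s\n' "$line" | grep -o '#\w*')

printf '%s\n' "$with_punct"
printf '%s\n' "$word_only"
```

One edge case to be aware of: '#\w*' also matches a lone #, since * allows zero word characters. Something like grep -E -o '#\w+' would require at least one.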

colucix · 12-03-2010, 09:48 AM

Quote:

Originally Posted by antcore99

We do not know how to do this in grep yet, though.

You have to follow this thread quickly. We are very fast!

antcore99 · 12-03-2010, 10:19 AM

Fast you are! Thank you GrapefruiTgirl, your solution is simple but effective.

antcore99 · 12-07-2010, 03:02 AM

A quick addition to this question: How would one go about obtaining a count of occurrences after each tag? Like so:

Code:

#soccer_generic (3)
#soccer_intro (1)
#another_hash_tag (8)
#hastag_2 (1)

colucix · 12-07-2010, 03:50 AM

Using awk you can easily count each tag occurrence, whereas grep can count the matching patterns all together:

Code:

awk '{for ( i = 1; i<=NF; i++ ) if ( $i ~ /^#\w/ ){ sub(/[[:punct:]]+$/,"",$i); _[$i]++ }} END{ for ( i in _ ) printf "%s (%d)\n",i,_[i]}' file
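For comparison, grep on its own cannot count per tag, but piping its output through sort and uniq -c can. This is only a sketch: count_tags is a made-up helper name, and it assumes the '#\w*' pattern from earlier in the thread (GNU grep).

```shell
# count_tags FILE: print each hashtag in FILE followed by its count,
# formatted as "#tag (N)". The function name is hypothetical.
count_tags() {
    grep -o '#\w*' "$1" |
        sort |
        uniq -c |
        awk '{ print $2 " (" $1 ")" }'
}
# grep -o emits one hashtag per line, sort groups identical tags,
# uniq -c prefixes each distinct tag with its count, and the final
# awk step swaps the two columns into the requested layout.
```

Called as count_tags filename, it lists the tags in sorted order, whereas the order of an awk for (i in array) loop is unspecified.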

antcore99 · 12-07-2010, 02:08 PM

Thank you.

Tinkster · 12-08-2010, 10:45 PM

And for good measure an alternative approach ;}

Code:

awk 'BEGIN{RS="[[:space:]]";ORS="\n"}/^#/{a[$1]++}END{for (b in a){printf "%s (%s)\n", b,a[b]}}' soccer