[SOLVED] grep for pattern following the nth occurrence of a character in a file

cosminel · 09-28-2013, 04:42 PM

Hello everyone,

After days of searching articles, forums, etc., I still can't get grep to do what I want. I have some files that contain data in the following format, and I am interested in my_string and my_string_2 as shown below:

data;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc

Things to consider:
- "data" may contain anything and the length may vary
- as it is clearly shown the data strings are separated by ; or ;; or ;;;;
- sometimes I want grep to look for the 2nd "my_string", sometimes for "my_string_2" as they will represent user input in a script, something like: "Enter [my_string] or leave blank" and "Enter [my_string_2] or leave blank"

So basically I want to grep for the 2nd "my_string" or "my_string_2". The only constant, non-changing marker I have in all this is the ";" character. So what I know for sure is that after the 7th ";" the 2nd "my_string" will always follow and after the 15th ";" "my_string_2" will always follow.

Is it possible to do the above with grep?

Thank you in advance.

Firerat · 09-28-2013, 04:56 PM

I really don't understand what you are trying to do

why do you want the n'th match?

can you post your script so I can get an idea of what you want

I have a feeling you really want awk, but full(er) context will help

Code:

awk -F\; '{printf "%s %s\n",$8,$16}' InputFile

example on posting code

[code]
awk -F\; '{printf "%s %s\n",$8,$16}' InputFile
[/code]
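
To make the field numbering concrete: with a single ";" as the separator, the text after the Nth semicolon is awk field N+1, and empty fields still count. A minimal sketch using the sample line from the first post (the names `sample` and `show_fields` are made up here for illustration):

```bash
# Sample line copied from the first post (illustrative data only).
sample='data;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc'

# Print the field that follows the 7th and the 15th semicolon.
# With -F';' every semicolon is a separator, so empty fields keep their number.
show_fields() {
    awk -F';' '{ printf "%s %s\n", $8, $16 }'
}

printf '%s\n' "$sample" | show_fields
```

On that line this prints "my_string my_string_2".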

allend · 09-28-2013, 08:24 PM

grep is probably not the tool you want to use, as the regular expression matching is 'greedy'.
Another alternative is the 'cut' command.

Code:

bash-4.2$ echo 'data;data;data;data;data;;my_string;data;data;data;data;;;;my_string_2;;;;' | cut -d';' -f7,15
my_string;my_string_2

cosminel · 09-29-2013, 02:54 AM

Thank you for your replies.

I was hoping that grep has the ability to do what I want using a more complicated extended regexp which I can't determine at this point.

Firerat, the position of my_string changes its significance, this is why I want grep to match it at precisely that position. Furthermore, in several cases my_string = my_string_2 and as I said, depending on the user input, the meaning of the value differs.

If what I need grep to do is not possible, I will try the awk instead.

Firerat · 09-29-2013, 06:09 AM

with awk you can test each field, you can then report which field it is
but
at the moment I still do not understand what you want from your description

show us your code and some input data, multiple lines.
so we have some context

but here I give you an awk ( not certain it fits with what you want/need )

Code:

awk -F\; -v string1="my_string" -v string2="my_string_2" '{
    for (i = 1; i <= NF; i++) {
        if ($i == string1) print "String1 found at field " i
        if ($i == string2) print "String2 found at field " i
    }
}' Input

grail · 09-29-2013, 08:01 AM

I think Firerat is on the mark, my only addition would be to alter the separator to include one or more semicolons:

Code:

awk -F";+" ...
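
One thing to check before switching to -F";+": the field numbers change, because a run of semicolons then counts as a single separator. The sketch below, run on the sample line from the first post, reports where a value lands under each separator. `sample` and `field_positions` are made-up names for illustration.

```bash
# Sample line copied from the first post (illustrative data only).
sample='data;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc'

# field_positions SEPARATOR VALUE
# Reads lines on stdin and prints, per line, the numbers of the fields that
# are exactly VALUE. A separator longer than one character is used as a regex.
field_positions() {
    awk -F"$1" -v want="$2" '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i == want) out = out (out == "" ? "" : " ") i
        print out
    }'
}

printf '%s\n' "$sample" | field_positions ';'  my_string_2   # single ";": empty fields are counted
printf '%s\n' "$sample" | field_positions ';+' my_string_2   # ";+": runs of ";" collapse into one
```

On that sample line, my_string_2 is field 16 with ";" and field 12 with ";+". With ";+" the numbers also depend on which fields happen to be empty on a given line, so a fixed field number is only safe if the empty fields are always in the same places.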

Firerat · 09-29-2013, 08:58 AM

an alternative might be to put your data into an array

e.g.

Code:

MyArray=( $(sed -e 's/^/"/' -e 's/;/" "/g' -e s/$/\"/ Input ))

Edit: Forget the above

Code:

while read -r -d ';' Element || [[ -n $Element ]]; do MyArray+=("$Element"); done < Input

Code:

echo "Number of elements in MyArray= ${#MyArray[@]}"
echo -e "Array :-\n${MyArray[@]}"
echo "Note: Arrays start at 0 "
for ((i=0;i<${#MyArray[@]};i++));do
    echo "${i} = ${MyArray[$i]}"
done

echo "remove all \""
for ((i=0;i<${#MyArray[@]};i++));do
    echo "${i} = ${MyArray[$i]//\"}"
done

http://www.tldp.org/LDP/Bash-Beginners-Guide/html/
http://www.tldp.org/LDP/abs/html/
http://mywiki.wooledge.org/BashGuide
http://www.gnu.org/software/bash/manual/bashref.html

specifically
http://www.tldp.org/LDP/abs/html/arrays.html
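
If the array route is taken, splitting one line at a time keeps the element index tied to the semicolon count. A minimal bash sketch, assuming the sample line from the first post; `sample`, `split_line` and `Fields` are hypothetical names:

```bash
# Sample line copied from the first post (illustrative data only).
sample='data;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc'

# Split one line on ";" into the global array Fields.
# IFS is changed only for this read; -r keeps backslashes literal.
# Empty fields between two semicolons become empty array elements.
split_line() {
    IFS=';' read -r -a Fields <<< "$1"
}

split_line "$sample"
# Arrays start at 0, so index N holds the text after the Nth semicolon.
echo "after 7th semicolon:  ${Fields[7]}"
echo "after 15th semicolon: ${Fields[15]}"
```

For a whole file, the same function could be called inside a `while IFS= read -r line` loop, one line per iteration.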

cosminel · 09-30-2013, 12:14 AM

Thank you for your help. I will try to see which proposed solution returns the desired result.

To tell you the truth I thought it would be easier to write instructions for returning the whole line if the searched string is found at the nth semicolon (which is used as a separator).

Firerat, the information I have in those files is written in such a way that "my_string = received data" and "my_string_2 = sent data", and this can be determined solely on where they are positioned inside the line, having the semicolons as separators for all the data strings.

Also note that my_string and my_string_2 are interchangeable.

All I want is to extend a script that I made in order to contain these prompts:

"Enter received data string or leave blank:"
"Enter sent data string or leave blank:"

As the searched string may be positioned at the "received data" location or at the "sent data" location (which is determined by the nth semicolon), I want the returned results to conform to the user's choices when using grep to search the files based on the above prompts.

I hope this clarifies what I aim to do.

Firerat · 09-30-2013, 12:48 AM

I think either would work

if you are still stuck,

post a sample script, with sample data
along with some user input to test it

cosminel · 10-07-2013, 03:07 PM

I finally found some time to investigate your solutions. I found the command string that I was looking for:

grep string file* | awk -F";+" '$13 ~ "string" {print $0}'

Now, the trick is to pass the string which is a user input variable into the awk command. This is where I'm currently stuck. I looked over Firerat's command, searched the web but for the life of me I cannot figure out how to pass the script variable into awk. I do not understand the syntax. Here is part of my script:

Code:

#!/bin/bash

cd /root

read -p "Enter received data or leave blank: " rcvdata
read -p "Enter sent data or leave blank: " sntdata

if [ -z $sntdata ]; then
	grep $rcvdata testfile* | awk -F";+" '$13 ~ "$rcvdata" {print $0}'
fi

As you can see "rcvdata" and "sntdata" are user generated variables. Now, from what I understand I need to pass the script variable "rcvdata" to awk with -v (and here is the point where I get completely lost)

Firerat · 10-07-2013, 03:43 PM

awk --help

Code:

Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:		GNU long options: (standard)
	-f progfile		--file=progfile
	-F fs			--field-separator=fs
	-v var=val		--assign=var=val
Short options:		GNU long options: (extensions)
	-b			--characters-as-bytes
	-c			--traditional
	-C			--copyright
	-d[file]		--dump-variables[=file]
	-e 'program-text'	--source='program-text'
	-E file			--exec=file
	-g			--gen-pot
	-h			--help
	-L [fatal]		--lint[=fatal]
	-n			--non-decimal-data
	-N			--use-lc-numeric
	-O			--optimize
	-p[file]		--profile[=file]
	-P			--posix
	-r			--re-interval
	-S			--sandbox
	-t			--lint-old
	-V			--version

man awk

Code:

......
       -v var=val
       --assign var=val
              Assign the value val to the variable var, before execution of the program begins.  Such variable values are available to the BEGIN block of an AWK program.
......

Code:

#!/bin/bash

cd /root

read -p "Enter received data or leave blank: " rcvdata
read -p "Enter sent data or leave blank: " sntdata

if [ -z $sntdata ]; then
	grep $rcvdata testfile* | awk -F";+" '$13 ~ "$rcvdata" {print $0}'
        #^^^ You do not need this,            ^^^^^^^^^^^^ that does it
fi

Code:

#!/bin/bash

cd /root

read -p "Enter received data or leave blank: " rcvdata
read -p "Enter sent data or leave blank: " sntdata

if [ -z "$sntdata" ]; then
	awk -v Foo="$rcvdata" -F";+" '$13 ~ Foo {print $0}' testfile*
     # or
     #  awk -F";+" '$13 ~ "'"$rcvdata"'" {print $0}' testfile*
     # the text inside the single quotes is protected from shell expansion
     # echo awk -F";+" '$13 ~ "$rcvdata" {print $0}' testfile*
     # echo awk -F";+" '$13 ~ "'"$rcvdata"'" {print $0}' testfile*
     # see the difference the '' makes
     # don't think of them as being around "$rcvdata", think of "$rcvdata" as being outside the ''
fi
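
Building on the -v approach, both prompts could be handled in one awk call by treating an empty answer as "do not filter on this position". This is only a sketch: the function name `search_by_position` is made up, and the field numbers 8 and 16 (with a single ";" as separator) come from the sample line in the first post, not from the real data, where `$13` with -F";+" is used.

```bash
# search_by_position RCV SNT FILE...
# Prints lines whose 8th field equals RCV and whose 16th field equals SNT.
# An empty RCV or SNT disables the test for that position; if both are
# empty, every line is printed.
# Field numbers 8 and 16 are assumptions taken from the thread's sample line.
search_by_position() {
    local rcv=$1 snt=$2
    shift 2
    awk -F';' -v rcv="$rcv" -v snt="$snt" '
        (rcv == "" || $8  == rcv) &&
        (snt == "" || $16 == snt)
    ' "$@"
}

# Example call with the variables and file glob from the script above:
#   search_by_position "$rcvdata" "$sntdata" testfile*
```

Note that `==` is an exact comparison, whereas the `~` used above treats the user input as a regular expression and also matches when the input is only part of the field.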

GazL · 10-07-2013, 03:59 PM

Quote:

Originally Posted by cosminel

So basically I want to grep for the 2nd "my_string" or "my_string_2". The only constant, non-changing marker I have in all this is the ";" character. So what I know for sure is that after the 7th ";" the 2nd "my_string" will always follow and after the 15th ";" "my_string_2" will always follow.

Is it possible to do the above with grep?

Thank you in advance.

If I'm understanding your requirements correctly then this grep string looks like it does what you're asking.

Code:

gazl@ws1:/tmp$ cat testdata
matchboth;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc
nomatch;data;data;data;;data;my_string;other_string;data;data;data;data;;;;other_string_2;;;;etc;etc
match8th;data;data;data;;data;my_string;my_string;data;data;data;data;;;;other_string_2;;;;etc;etc
match16th;data;data;data;;data;my_string;other_string;data;data;data;data;;;;my_string_2;;;;etc;etc
gazl@ws1:/tmp$ string1="my_string"
gazl@ws1:/tmp$ string2="my_string_2"
gazl@ws1:/tmp$ grep "\(^\([^;]*;\)\{7\}${string1};.*\)\|\(^\([^;]*;\)\{15\}${string2};.*\)" < testdata
matchboth;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc
match8th;data;data;data;;data;my_string;my_string;data;data;data;data;;;;other_string_2;;;;etc;etc
match16th;data;data;data;;data;my_string;other_string;data;data;data;data;;;;my_string_2;;;;etc;etc
gazl@ws1:/tmp$
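
The same idea can be wrapped in a small function that takes the field number as a parameter. This is a sketch rather than a drop-in replacement: `grep_at_field` is a made-up name, it uses `grep -E` so the groups need no backslashes, and unlike the command above it also accepts the string as the last field on the line.

```bash
# grep_at_field N STRING FILE...
# Prints lines whose Nth ";"-separated field is exactly STRING.
# STRING is placed into the regex unescaped, so this assumes it contains
# no regex metacharacters (or that they are meant as such).
grep_at_field() {
    local n=$1 str=$2
    shift 2
    # Skip N-1 complete fields from the start of the line, then require
    # STRING followed by ";" or by the end of the line.
    grep -E "^([^;]*;){$((n - 1))}${str}(;|\$)" "$@"
}

# Example calls against the testdata file shown above:
#   grep_at_field 8  "$string1" testdata
#   grep_at_field 16 "$string2" testdata
```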

cosminel · 10-07-2013, 04:16 PM

Oh, I see now, Firerat: I needed to define an awk variable that holds the value of the variable defined in the script.

Either do this or use the ' ' to separate. The syntax format is killing me since I am a total beginner.

I already knew that I could grab the data without grep, but I had the impression that using awk alone would slow down the search considerably. I haven't had the chance to test this in the working environment (a server with loads of data). So I just temporarily thought of letting grep (or fgrep) grab the data and then passing the results to awk.

Thank you for your input GazL. I have to say, awk looks cleaner at this point

I will test grep/fgrep against awk on the production server to see which is the fastest and by what amount.

Thank you guys for your help. After further testing, if I don't get stuck somewhere, I will mark the thread as solved, as I understand it's a good thing to do.

GazL · 10-07-2013, 04:24 PM

Quote:

Originally Posted by cosminel

Thank you for your input GazL. I have to say, awk looks cleaner at this point

it usually does.

Regexes never look pretty.

I'd be interested to see the results of your benchmarking of awk v grep if you'd be kind enough to come back and let us know.

cosminel · 10-07-2013, 04:58 PM

Sure thing! Once things settle down around here I will begin testing and get back to you with my findings.

I am also wondering how much the speed of awk is affected by the complexity of the command that involves it.

But I will begin with a plain string search.
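
A rough way to set up such a comparison in bash is sketched below. The helper names (`via_awk`, `via_grep`, `compare_speed`) and the field number 8 are assumptions for illustration. The sketch first checks that both commands select the same lines, since timing two commands that return different results says little.

```bash
# Two candidate searches; both print lines whose 8th ";"-separated field
# is exactly the given string. grep -h suppresses file name prefixes so the
# output is comparable with awk's when several files are given.
via_awk()  { local s=$1; shift; awk -F';' -v s="$s" '$8 == s' "$@"; }
via_grep() { local s=$1; shift; grep -h -E "^([^;]*;){7}${s}(;|\$)" "$@"; }

# compare_speed STRING FILE...
# Verifies that both commands agree, then times each with bash's "time".
compare_speed() {
    if [ "$(via_awk "$@" | cksum)" != "$(via_grep "$@" | cksum)" ]; then
        echo "outputs differ, timing would be meaningless" >&2
        return 1
    fi
    time via_awk  "$@" > /dev/null
    time via_grep "$@" > /dev/null
}

# Example call:
#   compare_speed my_string testfile*
```

Timings depend on file size, disk caching and the awk and grep implementations installed, so it is worth running each command several times on the real data before drawing conclusions.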