LinuxQuestions.org - [SOLVED] grep for pattern following the nth occurrence of a character in a file

Page 1 of 2

- Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)

- - grep for pattern following the nth occurrence of a character in a file (https://www.linuxquestions.org/questions/linux-newbie-8/grep-for-pattern-following-the-nth-occurence-of-a-character-in-a-file-4175478909/)

cosminel

09-28-2013 04:42 PM

grep for pattern following the nth occurrence of a character in a file

Hello everyone,

After days of searching articles, forums etc I still can't get grep to do what I want. I have some files that contain data in the following format and I am interested in the my_string and my_string_2 as shown below:

data;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc

Things to consider:
- "data" may contain anything and the length may vary
- as it is clearly shown the data strings are separated by ; or ;; or ;;;;
- sometimes I want grep to look for the 2nd "my_string", sometimes for "my_string_2" as they will represent user input in a script, something like: "Enter [my_string] or leave blank" and "Enter [my_string_2] or leave blank"

So basically I want to grep for the 2nd "my_string" or "my_string_2". The only constant, non-changing marker I have in all this is the ";" character. So what I know for sure is that the 2nd "my_string" will always follow the 7th ";" and "my_string_2" will always follow the 15th ";".

Is it possible to do the above with grep?

Thank you in advance.

Firerat

09-28-2013 04:56 PM

I really don't understand what you are trying to do

why do you want the n'th match?

can you post your script so I can get an idea of what you want

I have a feeling you really want awk, but full(er) context will help

Code:

awk -F\; '{printf "%s %s\n",$8,$16}' InputFile

example on posting code ;)

[code]
awk -F\; '{printf "%s %s\n",$8,$16}' InputFile
[/code]
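
As a rough illustration of what that command selects, here is a minimal sketch run against the sample line from the first post. awk numbers fields from 1, so the text after the 7th ";" is $8 and the text after the 15th ";" is $16. The variable `line` and the helper name `fields_8_and_16` are made up for this example.

```shell
# Sample line copied from the first post.
line='data;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc'

# Hypothetical helper: print the fields that follow the 7th and 15th semicolon.
# With a single-character separator awk keeps empty fields, so ";;" counts as
# two separators with an empty field between them.
fields_8_and_16() {
    awk -F';' '{ printf "%s %s\n", $8, $16 }'
}

printf '%s\n' "$line" | fields_8_and_16
```

On that line the two values printed should be the second my_string and my_string_2.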

allend

09-28-2013 08:24 PM

grep is probably not the tool you want to use, as the regular expression matching is 'greedy'.
Another alternative is the 'cut' command.

Code:

bash-4.2$ echo 'data;data;data;data;data;;my_string;data;data;data;data;;;;my_string_2;;;;' | cut -d';' -f7,15

my_string;my_string_2
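
Note that the sample line in this post is not quite the one from the first post, which is why fields 7 and 15 work here. A minimal sketch of the same cut approach on the original sample line, where the wanted values sit in fields 8 and 16 (the helper name `pick_fields` is hypothetical):

```shell
# Sample line copied from the first post.
line='data;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc'

# Hypothetical helper: $1 is a cut field list such as "8,16".
# cut treats every ";" as a separator, so empty fields are counted too.
pick_fields() {
    cut -d';' -f"$1"
}

printf '%s\n' "$line" | pick_fields 8,16
```

cut joins the selected fields with the input delimiter, so the result is a single ";"-separated line.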

cosminel

09-29-2013 02:54 AM

Thank you for your replies.

I was hoping that grep has the ability to do what I want using a more complicated extended regexp which I can't determine at this point.

Firerat, the position of my_string changes its significance, this is why I want grep to match it at precisely that position. Furthermore, in several cases my_string = my_string_2 and as I said, depending on the user input, the meaning of the value differs.

If what I need grep to do is not possible, I will try the awk instead.

Firerat

09-29-2013 06:09 AM

Quote:

Originally Posted by cosminel (Post 5036690)

with awk you can test each field, you can then report which field it is
but
at the moment I still do not understand what you want from your description

show us your code and some input data, multiple lines.
so we have some context

but here I give you an awk ( not certain it fits with what you want/need )

Code:

awk -F\; -v string1="my_string" -v string2="my_string_2" '{
    for (i = 1; i <= NF; i++) {
        if ($i == string1) print "String1 found at field " i
        if ($i == string2) print "String2 found at field " i
    }
}' Input

grail

09-29-2013 08:01 AM

I think Firerat is on the mark, my only addition would be to alter the separator to include one or more semicolons:

Code:

awk -F";+" ...
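
One thing to keep in mind with a separator like ";+" is that runs of semicolons collapse into one, so empty fields disappear and the field numbers shift. A small sketch, assuming the sample line from the first post (the helper names `count_fields` and `show_field` are invented for this example):

```shell
# Sample line copied from the first post.
line='data;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc'

# Hypothetical helper: $1 is the separator handed to awk -F; prints the field count.
count_fields() {
    awk -F"$1" '{ print NF }'
}

# Hypothetical helper: $1 is the separator, $2 the field number to print.
show_field() {
    awk -F"$1" -v n="$2" '{ print $n }'
}

printf '%s\n' "$line" | count_fields ';'     # every ";" separates, empty fields count
printf '%s\n' "$line" | count_fields ';+'    # runs of ";" act as one separator

printf '%s\n' "$line" | show_field ';' 16    # my_string_2 when empty fields count
printf '%s\n' "$line" | show_field ';+' 12   # the same value once runs collapse
```

With ";+" the position of a value depends on how many fields happen to be empty on that line, so this only gives stable field numbers if the same fields are empty on every line.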

Firerat

09-29-2013 08:58 AM

an alternative might be to put your data into an array

e.g.

Code:

MyArray=( $(sed -e 's/^/"/' -e 's/;/" "/g' -e s/$/\"/ Input ))

Edit: Forget the above

Code:

while read -r -d\; Element || [ -n "$Element" ];do MyArray+=("$Element");done < Input

Code:

echo "Number of elements in MyArray= ${#MyArray[@]}"
echo -e "Array :-\n${MyArray[@]}"
echo "Note: Arrays start at 0 "
for ((i=0;i<${#MyArray[@]};i++));do
    echo "${i} = ${MyArray[$i]}"
done

echo "remove all \""
for ((i=0;i<${#MyArray[@]};i++));do
    echo "${i} = ${MyArray[$i]//\"}"
done

http://www.tldp.org/LDP/Bash-Beginners-Guide/html/
http://www.tldp.org/LDP/abs/html/
http://mywiki.wooledge.org/BashGuide
http://www.gnu.org/software/bash/manual/bashref.html

specifically
http://www.tldp.org/LDP/abs/html/arrays.html
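
For a single line, a possible variation on the array idea is to let read split on ";" directly. This is only a sketch and needs bash (arrays and here-strings are not available in plain sh); the names `line` and `fields` are arbitrary. Because bash arrays start at 0, the index equals the number of semicolons that precede the value.

```shell
#!/bin/bash
# Sample line copied from the first post.
line='data;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc'

# Split the line on ";" into the array "fields".
# IFS is changed only for this one read command; -r keeps backslashes literal.
# Empty fields between consecutive semicolons become empty array elements.
IFS=';' read -r -a fields <<< "$line"

echo "Number of fields: ${#fields[@]}"
echo "After the 7th semicolon:  ${fields[7]}"
echo "After the 15th semicolon: ${fields[15]}"
```

For a whole file this would go inside a `while IFS= read -r line` loop, one split per line.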

cosminel

09-30-2013 12:14 AM

Thank you for your help. I will try to see which proposed solution returns the desired result.

To tell you the truth I thought it would be easier to write instructions for returning the whole line if the searched string is found at the nth semicolon (which is used as a separator).

Firerat, the information I have in those files is written in such a way that "my_string = received data" and "my_string_2 = sent data", and this can be determined solely on where they are positioned inside the line, having the semicolons as separators for all the data strings.

Also note that my_string and my_string_2 are interchangeable.

All I want is to extend a script that I made in order to contain these prompts:

"Enter received data string or leave blank:"
"Enter sent data string or leave blank:"

As the searched string may be positioned at the "received data" location or at the "sent data" location (which is determined by the nth semicolon), I want the returned results to conform to the user's choices when using grep to search the files based on the above prompts.

I hope this clarifies what I aim to do.

Firerat

09-30-2013 12:48 AM

I think either would work

if you are still stuck,

post a sample script, with sample data
along with some user input to test it

cosminel

10-07-2013 03:07 PM

I finally found some time to investigate your solutions. I found the command string that I was looking for:

grep string file* | awk -F";+" '$13 ~ "string" {print $0}'

Now, the trick is to pass the string which is a user input variable into the awk command. This is where I'm currently stuck. I looked over Firerat's command, searched the web but for the life of me I cannot figure out how to pass the script variable into awk. I do not understand the syntax. Here is part of my script:

Code:

#!/bin/bash

cd /root

read -p "Enter received data or leave blank: " rcvdata
read -p "Enter sent data or leave blank: " sntdata

if [ -z $sntdata ]; then
        grep $rcvdata testfile* | awk -F";+" '$13 ~ "$rcvdata" {print $0}'
fi

As you can see "rcvdata" and "sntdata" are user generated variables. Now, from what I understand I need to pass the script variable "rcvdata" to awk with -v (and here is the point where I get completely lost)

Firerat

10-07-2013 03:43 PM

awk --help

Code:

Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:          GNU long options: (standard)
        -f progfile             --file=progfile
        -F fs                   --field-separator=fs
        -v var=val              --assign=var=val
Short options:          GNU long options: (extensions)
        -b                      --characters-as-bytes
        -c                      --traditional
        -C                      --copyright
        -d[file]                --dump-variables[=file]
        -e 'program-text'       --source='program-text'
        -E file                 --exec=file
        -g                      --gen-pot
        -h                      --help
        -L [fatal]              --lint[=fatal]
        -n                      --non-decimal-data
        -N                      --use-lc-numeric
        -O                      --optimize
        -p[file]                --profile[=file]
        -P                      --posix
        -r                      --re-interval
        -S                      --sandbox
        -t                      --lint-old
        -V                      --version

man awk

Code:

......

      -v var=val

      --assign var=val

              Assign the value val to the variable var, before execution of the program begins.  Such variable values are available to the BEGIN block of an AWK program.

......

Code:



#!/bin/bash

cd /root

read -p "Enter received data or leave blank: " rcvdata
read -p "Enter sent data or leave blank: " sntdata

if [ -z $sntdata ]; then
        grep $rcvdata testfile* | awk -F";+" '$13 ~ "$rcvdata" {print $0}'
        #^^^ You do not need this,            ^^^^^^^^^^^^ that does it
fi

Code:

#!/bin/bash

cd /root

read -p "Enter received data or leave blank: " rcvdata
read -p "Enter sent data or leave blank: " sntdata

# quote the variable so the test also works when the input contains spaces
if [ -z "$sntdata" ]; then
    awk -v Foo="$rcvdata" -F";+" '$13 ~ Foo {print $0}' testfile*
    #or
    #  awk -F";+" '$13 ~ "'"$rcvdata"'" {print $0}' testfile*
    # anything inside the single quotes is protected from shell expansion
    # run these two with echo to see the difference the '' makes:
    # echo awk -F";+" '$13 ~ "$rcvdata" {print $0}' testfile*
    # echo awk -F";+" '$13 ~ "'"$rcvdata"'" {print $0}' testfile*
    # don't think of the '' as being around "$rcvdata", think of "$rcvdata" as being outside the ''
fi
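
To make the quoting point concrete, here is a small sketch that can be pasted into a terminal. It assumes the sample line from the first post with a plain ";" separator, so the fields of interest are 8 and 16 rather than 13. The helper names `regex_match` and `exact_match` are invented, and Foo is just the variable name used above.

```shell
# Sample line copied from the first post.
line='data;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc'
rcvdata='my_string'

# What awk actually receives as program text in each case:
echo '$13 ~ "$rcvdata"'        # single quotes: the shell leaves $rcvdata untouched
echo '$13 ~ "'"$rcvdata"'"'    # quotes closed and reopened: the value is spliced in

# Hypothetical helper: $1 = field number, $2 = user string, passed in with -v.
# "~" is a regular expression match, so it also succeeds on a partial match.
regex_match() {
    awk -F';' -v n="$1" -v Foo="$2" '$n ~ Foo { print "match" }'
}

# Same idea, but "==" only succeeds when the whole field equals the string.
exact_match() {
    awk -F';' -v n="$1" -v Foo="$2" '$n == Foo { print "match" }'
}

printf '%s\n' "$line" | regex_match 8  "$rcvdata"   # field 8 is my_string
printf '%s\n' "$line" | regex_match 16 "$rcvdata"   # field 16 is my_string_2, still matches
printf '%s\n' "$line" | exact_match 16 "$rcvdata"   # prints nothing
```

Since the thread mentions that my_string and my_string_2 can overlap, the difference between "~" and "==" may matter. Also note that -v interprets backslash escapes in the value, which could surprise you if the user input contains backslashes.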

GazL

10-07-2013 03:59 PM

Quote:

Originally Posted by cosminel (Post 5036541)

So basically I want to grep for the 2nd "my_string" or "my_string_2". The only constant, non-changing marker I have in all this is the ";" character. So what I know for sure is that the 2nd "my_string" will always follow the 7th ";" and "my_string_2" will always follow the 15th ";".

Is it possible to do the above with grep?

Thank you in advance.

If I'm understanding your requirements correctly then this grep string looks like it does what you're asking.

Code:

gazl@ws1:/tmp$ cat testdata
matchboth;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc
nomatch;data;data;data;;data;my_string;other_string;data;data;data;data;;;;other_string_2;;;;etc;etc
match8th;data;data;data;;data;my_string;my_string;data;data;data;data;;;;other_string_2;;;;etc;etc
match16th;data;data;data;;data;my_string;other_string;data;data;data;data;;;;my_string_2;;;;etc;etc
gazl@ws1:/tmp$ string1="my_string"
gazl@ws1:/tmp$ string2="my_string_2"
gazl@ws1:/tmp$ grep "\(^\([^;]*;\)\{7\}${string1};.*\)\|\(^\([^;]*;\)\{15\}${string2};.*\)" < testdata
matchboth;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc
match8th;data;data;data;;data;my_string;my_string;data;data;data;data;;;;other_string_2;;;;etc;etc
match16th;data;data;data;;data;my_string;other_string;data;data;data;data;;;;my_string_2;;;;etc;etc
gazl@ws1:/tmp$
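
The pattern is easier to read when split up: `[^;]*;` is one field plus its separator, repeating that group N times skips N semicolons from the start of the line, and the search string must come immediately after. Below is a sketch of the same idea written with extended regular expressions (grep -E) and wrapped in a made-up helper, `grep_after_nth`. Unlike the command above it handles one position per call and also accepts the string at the very end of a line.

```shell
# Test data copied from the session above.
testdata='matchboth;data;data;data;;data;my_string;my_string;data;data;data;data;;;;my_string_2;;;;etc;etc
nomatch;data;data;data;;data;my_string;other_string;data;data;data;data;;;;other_string_2;;;;etc;etc
match8th;data;data;data;;data;my_string;my_string;data;data;data;data;;;;other_string_2;;;;etc;etc
match16th;data;data;data;;data;my_string;other_string;data;data;data;data;;;;my_string_2;;;;etc;etc'

# Hypothetical helper.
# $1: number of semicolons to skip, $2: string expected in the field that follows.
# ^([^;]*;){N}  skips exactly N fields, each ending in ";"
# (;|$)         requires the field to end right after the string
grep_after_nth() {
    grep -E "^([^;]*;){$1}$2(;|\$)"
}

printf '%s\n' "$testdata" | grep_after_nth 7 my_string
printf '%s\n' "$testdata" | grep_after_nth 15 my_string_2
```

The search string is pasted into the regular expression as-is, so characters such as "." or "*" in the user input would be treated as regex operators rather than literal text.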

cosminel

10-07-2013 04:16 PM

Oh I see now Firerat, I needed to define an awk variable that holds the value of the script variable :) Either that, or use the ' ' to splice it in. The syntax is killing me since I am a total beginner.

I already knew that I could grab the data without grep, but I had the impression that using awk alone would slow down the search considerably. I haven't had the chance to test this in the working environment (a server with loads of data). So I temporarily thought of letting grep (or fgrep) grab the data and then passing the results to awk.

Thank you for your input GazL. I have to say, awk looks cleaner at this point :)

I will test grep/fgrep against awk on the production server to see which is the fastest and by what amount.

Thank you guys for your help. After further testing, if I don't get stuck somewhere, I will mark the thread as solved, as I understand it's a good thing to do.

GazL

10-07-2013 04:24 PM

Quote:

Originally Posted by cosminel (Post 5041715)

Thank you for your input GazL. I have to say, awk looks cleaner at this point :)

it usually does. :) Regexes never look pretty.

I'd be interested to see the results of your benchmarking of awk v grep if you'd be kind enough to come back and let us know.

cosminel

10-07-2013 04:58 PM

Sure thing! Once things settle down around here I will begin testing and get back to you with my findings.

I am also wondering how much the speed of awk is affected by the complexity of the command that involves it.

But I will begin with a plain string search.

All times are GMT -5. The time now is 08:32 PM.
