LinuxQuestions.org

fs11 01-14-2007 12:13 AM

sed challenge..datamining
 
Hello All,

I am working on a project and ran into this problem. I need to extract certain information from a text file.
For example:
YES_CHICK
agagagagadagatdgatagatfgatagatagag
agagagagatagatagatagtagatatagtatagta
fgafgatatatatatattgtgatgatgatgatgat
YES_HUMAN
sgsgsgsgasgafafafatsfgsgsfsgsfgsfsg
fgsfstfstsgtsgstsgstsgtsgstsgstsgts
gstsstsgstsgtsgstsgstsgtsgsststgstgs
YES_DEMON
fgsdgddghudghgdgshghdghsghgdhgdsgdhghd
fgdshdghdshgdshgdsgdhghdsghdshgdsgsh
gshgdhsgdhgsdgshghdsghdsghgdhgshgdhgsh


I want to extract the info from it. For example, if the user query is YES_HUMAN, then I get all the lines after YES_HUMAN up until YES_DEMON (not included).


I have worked with sed many times before, but I am having trouble doing this. I am sure it is possible.

If you think it is not possible, what other options do I have? Any C++ code would also be of great help.

thanks and Regards to all
FAHAD SAEED

ilikejam 01-14-2007 12:59 AM

Hi.

Have a read of this:
http://enterprise.linux.com/article....33253&from=rss
The 'Searching, browsing, and exporting records' bit should be particularly interesting.

Dave

fs11 01-14-2007 01:46 AM

It still won't work... :( Any other ideas?

ghostdog74 01-14-2007 02:10 AM

Just one way:
Code:

sed -n "/YES_CHICK/,/YES_HUMAN/{/YES_*/!p}" yourfile
Sorry, I forgot to say my sed is GNU-based.
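
For anyone who wants the one-liner unpacked, here is a minimal sketch of the same range idea wrapped in a shell function. The name `between_markers` and its three arguments are made up for illustration. The fixed parts of the sed script are in single quotes: in an interactive bash, a `!` inside double quotes can trigger history expansion before sed ever sees it, which is one possible (unconfirmed) reason a double-quoted command might fail. The inner pattern is written `/YES_/`, because `YES_*` as a regular expression means "YES followed by any number of underscores".

```shell
# between_markers START END FILE   (hypothetical helper, not from the thread)
# Prints the lines from the first line matching START up to the next line
# matching END, leaving out every line that contains "YES_".
between_markers() {
    # -n            : print nothing unless told to
    # /START/,/END/ : select the range of lines between the two matches
    # /YES_/!p      : inside that range, print lines that do NOT contain YES_
    # START and END are pasted into the script as regular expressions, so this
    # sketch assumes they contain no "/" or other characters special to sed.
    sed -n '/'"$1"'/,/'"$2"'/{/YES_/!p;}' "$3"
}
```

Usage would look like `between_markers YES_CHICK YES_HUMAN yourfile`. Note that the caller still has to know which marker comes next.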

jlinkels 01-14-2007 02:54 PM

The code by Ghostdog doesn't work for me in bash. I admit I don't understand the code either; otherwise I would have tried to fix it.

In these cases, I think awk is your friend. It really pays off to grasp the concept of awk. Once you do, it only takes a few minutes to create a script for this kind of processing. Awk was written for this purpose. :) Writing this post took me longer than writing the script.

This is the script:

Code:

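BEGIN {
        pflag=0         # 1 while we are inside the requested section
}

{
        # Any header line ends the section currently being printed
        if ($0 ~ /YES_/){
                pflag=0
        }

        # The requested header (passed in with -v flavour=...) starts printing
        if ($0 == flavour) {
                pflag=1
        }

        # Print the data lines, but not the header itself
        if (pflag == 1 && $0 !~ flavour ){
                print $0
        }
}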

With this input file
Code:

YES_CHICKEN
1. chicken chicken
2. chicken chicken
3. chicken chicken
4. chicken chicken
YES_HUMAN
1. human human
2. human human
3. human human
4. human human
YES_BIRD
1. bird bird
2. bird bird
3. bird bird
4. bird bird

yesfile is the input file containing your data strings. yes.awk is the awk script file.

It gives this output:

donald_pc:/tmp$ cat yesfile | awk -v flavour=YES_ -f yes.awk

donald_pc:/tmp$ cat yesfile | awk -v flavour=YES_m -f yes.awk

donald_pc:/tmp$ cat yesfile | awk -v flavour=YES_BIRD -f yes.awk
1. bird bird
2. bird bird
3. bird bird
4. bird bird

donald_pc:/tmp$ cat yesfile | awk -v flavour=YES_CHICKEN -f yes.awk
1. chicken chicken
2. chicken chicken
3. chicken chicken
4. chicken chicken

donald_pc:/tmp$


If you want the query string to show up before the data lines, change
if (pflag == 1 && $0 !~ flavour ){
to
if (pflag == 1){

On the command line, "-v flavour" passes a command line parameter to the awk script.

I know that there are awk gurus who can do this much more elegantly, and put it all on one line. This script is readable though :D

Let me know if this works for you

jlinkels
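
A more compact variant of the same flag idea, offered as a sketch only. It is wrapped in a shell function called `section` (a made-up name) and assumes, as in the sample file above, that every header line starts with `YES_` and that the query is the complete header line.

```shell
# section QUERY FILE   (hypothetical helper name)
section() {
    awk -v flavour="$1" '
        # Header line: remember whether it is the requested one, then skip it
        /^YES_/ { pflag = ($0 == flavour); next }
        # Data line: print it only while the flag is set
        pflag
    ' "$2"
}
```

With the sample yesfile, `section YES_BIRD yesfile` should print the four bird lines, and a partial query such as `YES_` should print nothing, because the comparison with the header is exact.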

fs11 01-14-2007 07:58 PM

Thank you all :) That was very nice of you.

The code did work for the input file that I gave.

BUT there is one more hurdle: the original data file is


Code:

>YES_CHICKEN
1. chicken chicken
2. chicken chicken
3. chicken chicken
4. chicken chicken
>YES_HUMAN
1. human human
2. human human
3. human human
4. human human
>YES_BIRD
1. bird bird
2. bird bird
3. bird bird
4. bird bird


and the code that jlinkels gave did not work with this data file,
so I tried to modify it like this:

Code:

BEGIN {
        pflag=0
}

{
        if ($0 ~ />YES_/){
                pflag=0
        }

        if ($0 == flavour) {
                pflag=1
        }


        if (pflag == 1 && $0 !~ flavour ){
                print $0
        }
}

and to get the output I typed this:

Code:

cat yesfile | awk -v flavour=>YES_BIRD -f yes.awk
BUT it won't work for the new data file... please help!

homey 01-14-2007 08:18 PM

If you are going to put > before the field, then quote the value in the awk statement; otherwise the shell treats the > as an output redirection.

Code:

cat file.txt
>YES_CHICKEN
1. chicken chicken
2. chicken chicken
3. chicken chicken
4. chicken chicken
>YES_HUMAN
1. human human
2. human human
3. human human
4. human human
>YES_BIRD
1. bird bird
2. bird bird
3. bird bird
4. bird bird

Code:

awk -v flavour=">YES_BIRD" -f yes.awk file.txt
1. bird bird
2. bird bird
3. bird bird
4. bird bird
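
To see why the quotes matter, here is a small sketch that shows what the shell itself does with the unquoted and the quoted form. The function name `demo_quoting` is made up. It uses printf instead of awk so that only the shell's parsing is on display, and it runs in a scratch directory because the unquoted form creates a file.

```shell
# demo_quoting   (hypothetical helper name, takes no arguments)
demo_quoting() {
    # Work in a throwaway directory: the unquoted form writes a file.
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        # Unquoted: the shell sees the word "flavour=" plus a redirection
        # of standard output into a file named YES_BIRD.
        printf '%s\n' flavour=>YES_BIRD
        echo "file YES_BIRD contains: $(cat YES_BIRD)"
        # Quoted: the whole string reaches the program as one argument.
        printf '%s\n' 'flavour=>YES_BIRD'
    )
    rm -rf "$dir"
}
```

In the unquoted case the program only ever receives `flavour=`, so in the awk command the variable ends up empty and the output lands in a file called YES_BIRD instead of on the terminal.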

Edit: By the way, the sed command works on my FC6 box
Code:

sed -n '/>YES_HUMAN/,/>YES_BIRD/{/>YES_BIRD/!p}' file.txt
>YES_HUMAN
1. human human
2. human human
3. human human
4. human human

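The sed command above hard-codes the next header (>YES_BIRD) and still prints the >YES_HUMAN line. As a sketch, and assuming that every header line starts with `>YES_`, the range can instead end at whatever header comes next, with all header lines dropped. `sed_section` is a made-up name.

```shell
# sed_section QUERY FILE   (hypothetical helper name)
# QUERY is the full header line, for example '>YES_HUMAN'. It is pasted into
# the sed script as a regular expression, so this sketch assumes it contains
# no "/" or other characters that are special to sed.
sed_section() {
    # Range: from the exact QUERY line to the next line starting with >YES_
    # (sed starts looking for the end of the range on the line after the start).
    # If no later header exists, the range runs to the end of the file.
    # Inside the range, print every line that is not itself a header.
    sed -n '/^'"$1"'$/,/^>YES_/{/^>YES_/!p;}' "$2"
}
```

Called as `sed_section '>YES_HUMAN' file.txt`, with the query in single quotes so the shell does not treat > as a redirection.
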

fs11 01-14-2007 08:26 PM

Thank you so much for all of the help :)

The code did work for the modified data file.


However, this didn't work on my RedHat Linux 3.3:

Code:

sed -n "/YES_CHICK/,/YES_HUMAN/{/YES_*/!p}" yourfile


All times are GMT -5.