[SOLVED] sed, grep, gawk :: Finding First Instance Of String Before A Given Line Number
I have two files..
Code:
enwiki.xml #~4G (awk was complaining about file size until grail suggested gawk)
verbs.fr #
verbs.fr lists the line numbers where a string matched in an earlier grep pass
what i need is:
taking a line number from verbs.fr, say 53343242: in enwiki.xml, look backwards from that line until i find the first occurrence of the string '<page>', then look at lines after the line number until i find the first occurrence of '</page>', and finally print out the chunk between them
the big problem is that the line number is necessary because the string can occur more than once; if the string were unique for each instance i could slap together a regular expression to find the chunk
i can write my own script but it would be slow..
my knowledge of sed, grep, and (g)awk is limited so i always assume there is some quick and easy way to do what i need with them
if writing my own script is the route necessary..
i was planning something like:
stopping at a line n distance before my desired line, like n=17, then stepping through those 17 lines; if '</page>' or '<page>' occurs more than once before the specified line number, adjust the value of n until '<page>' occurs only once before the desired line, then step through until '</page>' appears, note the line numbers s and e for '<page>' and '</page>', and sed -n "${s},${e}p" FILE..
again, but it will be slooow
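for reference, a minimal one-pass gawk sketch of the kind of one-liner being asked for (an illustration only, assuming the wanted line always falls inside a page and that the tags sit on lines of their own):
Code:
# keep a buffer from the most recent <page>; once past line n,
# the first </page> closes the chunk we want
gawk -v n=53343242 '
/<page>/ { buf = "" }
{ buf = buf $0 ORS }
/<\/page>/ && NR >= n { printf "%s", buf; exit }
' enwiki.xml
it still scans from line 1 to reach line n, which is exactly the slowness at issue here.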
As you are using Ubuntu, you may find you are using mawk, which is the default there. So you could try installing gawk and see if that eliminates the file size issue.
I would, however, probably use something like Perl or Ruby, as they have modules/libraries which can parse XML and hence would probably make the whole process easier.
i'd rather stay away from any parsers, was hoping to get it together in a one-liner, but if a full-fledged script is necessary i can reconsider the parser..
Quote:
verbs.fr lists the line numbers where a string matched in an earlier grep pass
what i need is:
taking a line number from verbs.fr, say 53343242: in enwiki.xml, look backwards from that line until i find the first occurrence of the string '<page>', then look at lines after the line number until i find the first occurrence of '</page>', and finally print out the chunk between them
If you can use GNU grep's --byte-offset option when making verbs.fr, you can save a lot of time. Finding a line number requires scanning the whole file up to that line, but given a byte offset you can jump right there (not using awk though).
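A sketch of the idea (pattern and offset are illustrative; note that grep's byte offsets are 0-based while tail -c +N counts from 1):
Code:
# build verbs.fr with byte offsets instead of line numbers: "OFFSET:matched line"
grep --byte-offset '===Verb==={{fr-verb' enwiki.xml > verbs.fr
# later, jump straight to a match without rescanning everything before it
off=53343242                      # hypothetical offset taken from verbs.fr
tail -c +$(( off + 1 )) enwiki.xml | head -n 3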
danielbmartin this is awesome.. exactly what i expected and wanted to see
and syg00's exit was a great optimisation for how i was going to use it
ntubski, --byte-offset is HUGE.. i wanted something like this after i read that awk's NR counts from line 1 each time it runs
i first tried loading the file into an array so i could keep count of the last match with an index counter,
but the load made it ridiculously slow and the whole attempt was riddled with other issues.. --byte-offset would have fixed this by letting me determine which byte to start the next search from
so many times i am working on little projects and want a way to 'stream' naively and at my own pace
--byte-offset will help a great deal
i ended up writing a bunch of one liners and just strung them through a bash script creating tons of temporary files
i thought if i could trust the data wholly then there should be an equal number of opening <page>s and closing </page>s
i could..
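that sanity check is quick, since grep -c counts matching lines (enough if the dump keeps one tag per line):
Code:
# a well-formed dump should print two equal counts
grep -c '<page>' enwiki.xml
grep -c '</page>' enwiki.xml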
I am curious. Whilst I had thought of the solutions presented by Daniel and syg00, my understanding was that the following information:
Quote:
taking a line number from verbs.fr, say: 53343242
meant that the file contained line numbers and not the string you are looking for, hence the //{} format would not work.
Obviously you can compare the number with NR; however, this rules out setting RS to the page delimiters, as that would also alter the NR count, and I am guessing the number supplied is based on individual lines of data.
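To illustrate the conflict (gawk accepts a multi-character RS; NR then counts records, i.e. pages, not file lines):
Code:
# each record is now a whole page, so a line number taken from verbs.fr
# can no longer be compared against NR
gawk -v RS='</page>' '/===Verb===/ { print "match in page (record) " NR }' enwiki.xml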
So using this information, please confirm how these solutions have aided you?
grail, i've been kind of straddling this post between my home box and my work box.. we have ubuntu set up at work
the project is my own though, and i suppose i have been vague about the data, but mostly unintentionally
i hoped that if i could get many languages from one source the data would be homogeneous enough.. naturally i have already found a couple of inconsistencies in styling and syntax, but far fewer than if i were sourcing data from all over for each language.. that way adding new languages to the aide would be painless
that is a pseudo`unique string that can be located in the entire ~4G database
i went through the four gig data with a script that stripped the \n after the string '===Verb===' so i could grep -n '===Verb==={{fr-verb'
now i have a file listing, with an internal line number, all of the word pages that contain a french verb
so two separate files: one containing only the data i want, with line numbers, and one containing all of the data, without line numbers
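that preprocessing, as it later appears (commented out) in the script at the end of this thread, with filenames substituted for its variables:
Code:
# join '===Verb===' with the line that follows it so one grep can match both parts
perl -pe 's/(?<=\=\=\=Verb\=\=\=)\n//' enwiktionary.xml > enwik.mod.singline.xml
# record matches with their line numbers, then drop conjugated forms
grep -n '===Verb==={{fr-verb' enwik.mod.singline.xml > tmp.fr.verbose
sed '/verb-form/d' tmp.fr.verbose > byteverb.fr.only.lst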
due to the work and home relationship i have yet to be in the right place to test the (g)awk suggestions, and my elation and thanks were based solely on the functionality the post implied; from the example data and the output it looked like it would do what i was asking
also, like i said, by the time i saw the responses i had already podged together a walking script that got me what i needed, so testing the suggestions would have been purely academic
when i get to a place where i can run the suggested one-liner against my data i'll post how i used the suggestion and what the output looked like
certainly when i go to add russian, fyodor!, and german, franz!, i'll return to this thread and see what i can use, and how, to speed up the process of turning wiktionary's data schema into my own
Quote:
... exit was a great optimisation for how i was going to use it
Be aware that if you are going to be re-reading the same file, there are significant advantages in ordering the reads.
The way page-cache works, if you can make use of data already resident in RAM, you save having to do any of the physical I/O on the subsequent reads. Sounds obvious, but let's say you can hold 40% of the file in page-cache. If you read say 30% of the file and exit, then subsequently read 10% (from the start), you do no (physical) I/O. Way to go.
However, if you first read 90% of the file, page-cache starts flushing, and when you then read the 10%, it all has to be read again from disk.
Makes a big difference if you're doing it a lot.
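A rough way to watch this happen (sizes are hypothetical for a ~4G file; the second read should return almost instantly because its prefix is still resident):
Code:
time head -c 1200M enwiki.xml > /dev/null   # ~30% prefix: cold read, real disk I/O
time head -c 400M enwiki.xml > /dev/null    # ~10% prefix: served from page-cache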
that [solved] tag was a bit premature and way overzealous..
when i said my own solution worked but was slow.. that turned out to be incorrect as well,
when i looked at the data it was incomplete and weird,
and reworking the script to perform correctly came with a theoretical runtime of 6 days
as for the awk suggestions..
after running some tests i see why grail was confused by my confirmation of the awk suggestions,
they worked, but with a number of false positives:
sometimes the line number was present in an unrelated way, like in an ID number or timestamp or some such,
so i massaged the data to include a unique string with the line number:
Code:
5553343--:LINE
5553344--:LINE
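a hypothetical reconstruction of that tagging step (the exact command isn't shown; the marker only needs to be a string that cannot occur naturally in the data):
Code:
# prefix every line with its line number plus an un-collidable marker
awk '{ print NR "--:LINE", $0 }' enwik.mod.singline.xml > enwik.tagged.xml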
then re`awk'd, but it was too slow
so i tried reworking,
all efforts failed
i got the runtime down to 26 hours but still too long so i decided to start over,
and looked into ntubski's suggestion of --byte-offset
my goal was speed and i was trying anything,
i was --byte-offset'ing, then --byte-offset'ing the --byte-offset file itself,
this shit was turtles all the way down, but whatever i tried sloshed along
then i found the --byte-offset equivalent flag in tail and head: -c; tail -c +INT | head -c INT
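which makes extracting an arbitrary byte range a two-command job (offsets are hypothetical; the extra 8 bytes cover '</page>' plus its newline):
Code:
start=53343242    # byte offset of a '<page>' line
stop=53350000     # byte offset of the matching '</page>' line
tail -c +"$start" enwiki.xml | head -c $(( stop - start + 8 ))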
so my plan became: binary-search the --byte-offset of the --byte-offset file to find the right split using tail -c, then use that split to tail -c the original and extract the bit i desired
with some work here and there i pulled off a ~4+ minute run where all of the data i wanted was present,
such results seem crazy when compared to what it originally took me
here's the script for posterity:
Code:
#!/bin/bash
## from the file given as the first argument, extract the chunks that fall between
## the two byte offsets bracketing each wanted line
theorig="$1"
thislang="$2"
theout="$2.verb.lst"
bytebyte="byte.bytechunk.lst"
bytechunk="bytechunk.lst"
modded="enwik.mod.singline.xml"
theins="byteverb.fr.only.lst"
#modded="tmp.enwik.mod.singline.xml"
#theins="tmp.byteverb.fr.only.lst"
#perl -pe 's/(?<=\=\=\=Verb\=\=\=)\n//' $theorig > $modded
#grep -n "===Verb==={{fr-verb" $modded > tmp.fr.verbose
#sed "/verb-form/d" tmp.fr.verbose > $theins
wc -c $modded
#bytebyte="tmp.byte.bytechunk.lst"
#bytechunk="tmp.bytechunk.lst"
#grep --byte-offset "<page>" $1 > tmp.bytepage.lst
#grep -o '[0-9]*' tmp.bytepage.lst > tmp.bytepage.only.lst
#grep --byte-offset "</page>" $1 > tmp.bytestop.lst
#grep -o '[0-9]*' tmp.bytestop.lst > tmp.bytestop.only.lst
#paste tmp.bytepage.only.lst tmp.bytestop.only.lst > $bytechunk    # "START STOP" per page
#grep --byte-offset "" "$bytechunk" > "$bytebyte"                  # offset of every line of the offsets file
count=0
while read line
do
    IFS=":"                             # $theins entries look like "NUMBER:match"
    thisln=(${line})
    i=${thisln[0]}                      # the number we are hunting for
    found=0
    bot=0
    findtop=($(tail -1 $bytebyte))
    top=${findtop[0]}                   # offset of the last entry: upper bound of the search
    unset IFS
    while [[ $found -eq 0 ]]; do
        incount=0
        guess=$(( bot + (top - bot) / 2 ))
        # jump to byte $guess of the offsets file and take the next complete
        # "START STOP" pair (the first line read may be a partial line)
        while read line; do
            arr=(${line})
            if [[ $top -eq 0 ]]; then
                theset=($line)
                break
            fi
            if [[ $incount -eq 1 ]]; then
                theset=(${line})
                break
            fi
            let incount=incount+1
        done < <(tail -c +"$guess" $bytechunk | head -n 2)
        # hit: the wanted number falls inside this <page>/</page> pair
        if [ "${theset[0]}" -lt "${i}" ] && [ "${theset[1]}" -gt "${i}" ]; then
            echo -en "\r"${theset[0]}..$i..${theset[1]}..$count
            ( tail -c "+${theset[0]}" "$modded" | head -c $(( ${theset[1]}-${theset[0]}+21 )) >> $theout ) &
            found=1
        fi
        # pair lies past the wanted number: search the lower half
        if [ "${theset[0]}" -gt "${i}" ] && [ "${theset[1]}" -gt "${i}" ]; then
            top=$(( guess - 1 ))
            if [[ $top -lt 0 ]]; then
                top=0
            fi
        fi
        # pair lies before the wanted number: search the upper half
        if [ "${theset[0]}" -lt "${i}" ] && [ "${theset[1]}" -lt "${i}" ]; then
            bot=$(( guess + 1 ))
        fi
    done
    let count=count+1
done < "$theins"
#rm ./tmp*
#./newlang ../orig/enwiktionary-latest-pages-articles.xml fr
thanks for all the help, and goddamn --byte-offset is awesome
hmmm ... to say your code has confused me is a bit of an understatement
I tried to re-write it to see if I could understand it better that way, but would you be able to tell me what data, and in what format, is stored in both byte.bytechunk.lst and bytechunk.lst?
I am guessing from some of the work they are numbers, but with the tests you are showing it has become confusing.
What I will say is that you switch between [], [[]], (()) and let, which makes it rather difficult to follow your logic. If you would provide the detail above I would be happy to show you an alternative using basically the same code.