Old 05-12-2014, 01:18 PM   #1
cin_
Member
 
Registered: Dec 2010
Posts: 281

Rep: Reputation: 24
sed, grep, gawk :: Finding First Instance Of String Before A Given Line Number


I have two files:

Code:
enwiki.xml    # ~4G (awk was complaining about the file size until grail suggested gawk)
verbs.fr      #
verbs.fr lists the line numbers where a string matched in an earlier grep pass.

What I need is:

Take a line number from verbs.fr, say 53343242. In enwiki.xml, look backwards from that line until I find the first occurrence of the string '<page>', then look forwards from that line until I find the first occurrence of '</page>', and finally print out the chunk in between:

Code:
???  #    :<page>
?         : blah
?         : blah
53343242# :  MY_STRING
?         : blah
?         : blah
??? #     :</page>
The big problem is that the line number is necessary because the string can occur more than once; if the string were unique for each instance, I could slap together a regular expression to find the chunk.

I can write my own script, but it would be slow.

My knowledge of sed, grep, and (g)awk is limited, so I always assume there is some quick and easy way to do what I need with them.

If writing my own script is the necessary route, I was planning something like this:
stop at a line some distance n before my desired line (say n=17), then step through those n lines; if <page> or </page> occurs more than once before the specified line number, adjust the value of n until <page> occurs exactly once before the desired line; then step forward until </page> appears, note the line numbers for <page> and </page>, and run sed -n '<start>,<end>p' FILE.
Again, though, it will be slooow.
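A rough, untested sketch of that windowing idea (n, the line number, and the filenames are placeholders):
Code:
line=53343242; n=17
start=$(( line - n ))
# last '<page>' at or before the target line (empty if the window is too small: grow n and retry)
open=$( sed -n "${start},${line}p" enwiki.xml | grep -n '<page>' | tail -1 | cut -d: -f1 )
from=$(( start + open - 1 ))
# first '</page>' at or after the target line (again, grow n if the window missed it)
close=$( sed -n "${line},$(( line + n ))p" enwiki.xml | grep -n '</page>' | head -1 | cut -d: -f1 )
to=$(( line + close - 1 ))
sed -n "${from},${to}p" enwiki.xml
Three sed passes over a ~4G file per lookup, hence the slooow.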

Last edited by cin_; 05-12-2014 at 02:01 PM. Reason: gramm`err
 
Old 05-12-2014, 01:29 PM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
As you are using Ubuntu, you may find you are using mawk, which is the default there. So you could try installing gawk and see if that eliminates the file size issue.
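For example, to check which implementation awk points at and pull in gawk (Debian/Ubuntu):
Code:
readlink -f "$(command -v awk)"   # e.g. /usr/bin/mawk on a default install
sudo apt-get install gawk         # then invoke your script with gawk instead of awk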

I would, however, probably use something like Perl or Ruby, as they have modules/libraries which can parse XML and hence would probably make the whole process easier.
 
1 member found this post helpful.
Old 05-12-2014, 01:41 PM   #3
cin_
Member
 
Registered: Dec 2010
Posts: 281

Original Poster
Rep: Reputation: 24
gawk

Spot on with the gawk fix.

I'd rather stay away from any parsers; I was hoping to get it together in a one-liner, but if a full-fledged script is necessary I can reconsider the parser.
 
Old 05-12-2014, 03:34 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Rep: Reputation: 2081
Quote:
Originally Posted by cin_
verbs.fr lists the line numbers where a string matched in an earlier grep pass.

What I need is:

Take a line number from verbs.fr, say 53343242. In enwiki.xml, look backwards from that line until I find the first occurrence of the string '<page>', then look forwards from that line until I find the first occurrence of '</page>', and finally print out the chunk in between.
If you can use GNU grep's --byte-offset option when making verbs.fr, you can save a lot of time. Finding a line number requires scanning the whole file up to that line, but given a byte offset you can jump right there (not with awk, though).
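Something like this, for illustration (MY_STRING and the offset are stand-ins):
Code:
grep --byte-offset 'MY_STRING' enwiki.xml > verbs.fr   # each line looks like OFFSET:matching line
# later, jump straight to a match; grep's offsets are 0-based, tail -c + is 1-based
offset=1234567
tail -c +$(( offset + 1 )) enwiki.xml | head -n 5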
 
1 member found this post helpful.
Old 05-12-2014, 04:33 PM   #5
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
With this InFile ...
Code:
???  #    :<page>
?         : blah
?         : blah
53343232# :  HERMAN'S_STRING
?         : blah
?         : blah
??? #     :</page>
???  #    :<page>
?         : blah
?         : blah
53343242# :  MY_STRING
?         : blah
?         : blah
??? #     :</page>
???  #    :<page>
?         : blah
?         : blah
53343252# :  EDWARD'S_STRING
?         : blah
?         : blah
??? #     :</page>
???  #    :<page>
?         : blah
?         : blah
53343262# :  DAVID'S_STRING
?         : blah
?         : blah
??? #     :</page>
... this awk ...
Code:
awk 'BEGIN{RS="<page>|</page>"}/53343242/' $InFile >$OutFile
... produced this OutFile ...
Code:
?         : blah
?         : blah
53343242# :  MY_STRING
?         : blah
?         : blah
??? #     :
Daniel B. Martin
 
1 member found this post helpful.
Old 05-12-2014, 06:05 PM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,124

Rep: Reputation: 4120
If one is concerned about reading the entire file, it could be modified to stop after reading the required "record":
Code:
awk 'BEGIN{RS="<page>|</page>"}/53343242/ {print ; exit}' $InFile >$OutFile
 
2 members found this post helpful.
Old 05-13-2014, 03:50 AM   #7
cin_
Member
 
Registered: Dec 2010
Posts: 281

Original Poster
Rep: Reputation: 24
danielbmartin, this is awesome.. exactly what I expected and wanted to see,
and syg00's exit was a great optimisation for how I was going to use it.

ntubski, --byte-offset is HUGE.. I wanted something like this after I read that awk's NR counts from line 1 each time it runs.
I first tried loading the file into an array so I could keep count of the last match with an index counter,
but the load made it ridiculously slow and the whole attempt was riddled with other issues.. --byte-offset would have fixed this by letting me determine which byte to start the next search from.

So many times I am working on little projects and want a way to 'stream' naively and at my own pace;
--byte-offset will help a great deal.

I ended up writing a bunch of one-liners and just strung them together through a bash script, creating tons of temporary files.

I thought that if I could trust the data wholly, there should be an equal number of opening <page> and closing </page> tags.
I could..
Code:
grep -n "<page>" enwik.xml > page.lst
grep -o '[0-9]*' page.lst > linepage.lst
grep -n "</page>" enwik.xml > stop.lst
grep -o '[0-9]*' stop.lst > linestop.lst

wc -l line*.lst
 173421 linepage.lst
 173421 linestop.lst
 346842 total

paste linepage.lst linestop.lst > chunk.lst
Code:
#!/bin/bash
# for each verb line number from verbs.fr, find the <page>/</page> line pair
# in chunk.lst that brackets it, and print that range from enwik.xml

while read line; do
 IFS=":"
 thisln=(${line})            # split grep -n output "LINENUM:match" on the colon
 lnum=${thisln[0]}
 unset IFS
 while read otln; do
  IFS=" "
  thisot=(${otln})           # split the "START STOP" pair from chunk.lst
  unset IFS
  sm=${thisot[0]}
  lg=${thisot[1]}
  if [[ $lnum -gt $sm ]] && [[ $lnum -lt $lg ]]; then
   sed -n "${sm},${lg}p" enwik.xml >> frverbs.lst
   break
  fi
 done < "chunk.lst"
done < "verbs.fr"
Oof.. ugly.
.. and, like I knew it would be, it was slooow, but a short walk later I had what I was looking for.
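In hindsight, a single gawk pass that buffers each page and checks FNR against the wanted line numbers would probably have done the same job.. untested sketch, where linenumbers.lst is hypothetical (bare line numbers, e.g. from cut -d: -f1 verbs.fr):
Code:
gawk 'NR == FNR   { want[$1]; next }      # first file: remember every wanted line number
      /<page>/    { buf = ""; keep = 0 }  # second file: start buffering a new page
                  { buf = buf $0 ORS }
      FNR in want { keep = 1 }            # a wanted line falls inside this page
      /<\/page>/ && keep { printf "%s", buf }
     ' linenumbers.lst enwiki.xml > frverbs.lst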


Thanks for all of the excellent help.

Last edited by cin_; 05-13-2014 at 04:01 AM. Reason: gramm`err
 
Old 05-13-2014, 10:45 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
I am curious. Whilst I had thought of the solutions presented by Daniel and syg00, my understanding was that the following information:
Quote:
Take a line number from verbs.fr, say 53343242.
meant that the file contained line numbers, and not the string you are looking for, hence the /pattern/{action} format would not work.
Obviously you can compare the number with NR; however, that removes the idea of setting RS to the page alternatives, as doing so would also alter the NR count, and I am guessing the number supplied is based on individual lines of data.
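For instance, something like this (untested) compares against NR directly, without touching RS:
Code:
gawk -v n=53343242 '/<page>/ && NR <= n { p = NR }    # remember the last <page> line before line n
                    NR == n             { print p; exit }' enwiki.xml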

So, given this information, could you please confirm how these solutions have aided you?
 
Old 05-13-2014, 03:29 PM   #9
cin_
Member
 
Registered: Dec 2010
Posts: 281

Original Poster
Rep: Reputation: 24
enwik

grail, I've been kind of straddling this thread between my home box and my work box.. we have Ubuntu set up at work.
The project is my own, though, and I suppose I have been vague about the data, but mostly unintentionally.

I'm using Wiktionary to create a foreign-language learning aid:
http://dumps.wikimedia.org/enwiktionary/latest/

I hoped that if I could get many languages from one source, the data would be homogeneous enough.. naturally I have already found a couple of inconsistencies in styling and syntax, but far fewer than if I were sourcing data from all over for each language.. the point being that adding new languages to the aid should be painless.

So I was looking at the French verb être:
http://en.wiktionary.org/wiki/%C3%AAtre
If you hit edit next to the word Verb, this important bit shows itself..
Code:
===Verb===
{{fr-verb|type=auxiliary|sort=etre}}
That is a pseudo-unique string that can be located in the entire ~4G database.

I went through the four-gig data with a script that stripped the \n after the string '===Verb===' so that I could grep -n '===Verb==={{fr-verb'.
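Roughly (filenames here are illustrative; the real commands appear, commented out, in the script further down the thread):
Code:
perl -pe 's/(?<=\=\=\=Verb\=\=\=)\n//' enwiki.xml > enwik.mod.singline.xml
grep -n '===Verb==={{fr-verb' enwik.mod.singline.xml > fr.verbose
sed '/verb-form/d' fr.verbose > verbs.fr   # drop the fr-verb-form (conjugated form) matches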
Now I have a file listing, with internal line numbers, all of the word pages that contain a French verb.

So, two separate files: one containing only the data I want, with line numbers, and one containing all of the data, without line numbers.

Due to the work-and-home relationship, I have yet to be in the right place to test the (g)awk suggestions, and my elation and thanks were based solely on the functionality implied in the posts; from the example data and the output, it looked like it would do what I was asking.

Also, like I said, by the time I saw the responses I had already podged together a walking script that got me what I needed, so testing the suggestions would have been purely academic.

When I get to a place where I can run the suggested one-liner against my data, I'll post how I used the suggestion and what the output looked like.

Certainly when I go to add Russian (Fyodor!) and German (Franz!), I'll return to this thread and see what I can use, and how, to speed up the process of turning Wiktionary's data schema into my own.

Last edited by cin_; 05-13-2014 at 03:31 PM. Reason: gramm`err
 
Old 05-14-2014, 04:12 AM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
Thanks for the information. Makes a little more sense now. Good luck with your project.
 
Old 05-14-2014, 06:13 AM   #11
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,124

Rep: Reputation: 4120
Quote:
Originally Posted by cin_
... exit was a great optimisation for how I was going to use it
Be aware that if you are going to be re-reading the same file, there are significant advantages in ordering the reads.
The way the page cache works, if you can make use of data already resident in RAM, you avoid doing any physical I/O on the subsequent reads. It sounds obvious, but let's say you can hold 40% of the file in the page cache. If you read, say, 30% of the file and exit, then subsequently read 10% (from the start), you do no (physical) I/O. Way to go.
However, if you first read 90% of the file, the page cache starts flushing, and when you then read the 10%, it all has to be read again from disk.
It makes a big difference if you're doing it a lot.
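For instance (hypothetical filenames), simply sorting the offset list keeps the reads sequential, which is the friendliest pattern for the page cache:
Code:
sort -n offsets.lst | while read -r off; do
    tail -c +$(( off + 1 )) enwiki.xml | head -c 4096   # placeholder for the per-offset processing
done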
 
Old 05-22-2014, 11:28 PM   #12
cin_
Member
 
Registered: Dec 2010
Posts: 281

Original Poster
Rep: Reputation: 24
premature zeal

That [solved] tag was a bit premature and way overzealous..
When I said my own solution worked but was slow.. that turned out to be incorrect as well:
when I looked at the data, it was incomplete and weird,
and reworking the script to perform correctly came with a theoretical runtime of 6 days.

As for the awk suggestions..
After running some tests, I see why grail was confused by my confirmation of the awk suggestions:
they worked, but with a number of false positives.
Sometimes the line number was present in an unrelated way, like in an ID number or timestamp or some such,
so I massaged the data to include a unique string with the line number:
Code:
5553343--:LINE
5553344--:LINE
then re-awk'd, but it was too slow,
so I tried reworking it;
all efforts failed.

I got the runtime down to 26 hours, but that was still too long, so I decided to start over
and looked into ntubski's suggestion of --byte-offset.

My goal was speed and I was trying anything:
I was --byte-offset'ing, then --byte-offset'ing the --byte-offset file itself..
it was turtles all the way down, but whatever I tried sloshed along.
Then I found the --byte-offset equivalent flag in tail and head: -c; tail -c +INT | head -c INT.
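e.g. pulling the bytes between two offsets (placeholder numbers; tail -c +N is 1-based):
Code:
start=1234; stop=5678
tail -c +"$start" enwiki.xml | head -c $(( stop - start ))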

So my plan became: binary-search the byte offsets of the --byte-offset file itself, using tail -c, to find the right split, then use that split to tail -c the original file and extract the bit I desired.

With some work here and there I pulled off a run of roughly 4 minutes in which all of the data I wanted was present..
such results seem crazy when compared to what it originally took me.

Here's the script, for posterity:
Code:
#!/bin/bash
## from the file given as the first argument, extract the chunk bracketed by the byte-offset pair that contains each wanted match

theorig="$1"
thislang="$2"
theout="$2.verb.lst"
bytebyte="byte.bytechunk.lst"
bytechunk="bytechunk.lst"

modded="enwik.mod.singline.xml"
theins="byteverb.fr.only.lst"

#modded="tmp.enwik.mod.singline.xml"
#theins="tmp.byteverb.fr.only.lst"
#perl -pe 's/(?<=\=\=\=Verb\=\=\=)\n//' $theorig > $modded
#grep -n "===Verb==={{fr-verb" $modded > tmp.fr.verbose
#sed "/verb-form/d" tmp.fr.verbose > $theins

wc -c $modded

#bytebyte="tmp.byte.bytechunk.lst"
#bytechunk="tmp.bytechunk.lst"
#grep --byte-offset "<page>" $1 > tmp.bytepage.lst
#grep -o [0-9]* tmp.bytepage.lst > tmp.bytepage.only.lst
#grep --byte-offset "</page>" $1 > tmp.bytestop.lst
#grep -o [0-9]* tmp.bytestop.lst > tmp.bytestop.only.lst

#paste tmp.bytepage.only.lst tmp.bytestop.only.lst > $bytechunk
#grep --byte-offset "" "$bytechunk" > "$bytebyte"

count=0
while read line
do
  IFS=":"
  thisln=(${line})
  i=${thisln[0]}
  found=0
  bot=0
  findtop=($(tail -1 $bytebyte))
  top=${findtop[0]}
  unset IFS
  while [[ $found -eq 0 ]]; do
    incount=0
    guess=$((bot + ((top - bot)/2)))   # binary search over the offsets into bytechunk.lst
      while read line; do              # tail -c may land mid-line, so normally take the second, complete line
        arr=(${line})
        if [[ $top -eq 0 ]]; then
          theset=($line)
          break
        fi
        if [[ $incount -eq 1 ]];then
          theset=(${line})
          break
        fi
        let incount=incount+1
      done < <(tail -c +"$guess" $bytechunk | head -n 2)
    if [ "${theset[0]}" -lt "${i}" ] && [ "${theset[1]}" -gt "${i}" ];then
      echo -en "\r"${theset[0]}..$i..${theset[1]}..$count
      ( tail -c "+${theset[0]}" "$modded" | head -c $(( ${theset[1]}-${theset[0]}+21 )) >> $theout ) &
      found=1
    fi

    if [ "${theset[0]}" -gt "${i}" ] && [ "${theset[0]}" -gt "${i}" ];then
      top=$(( $guess-1 ))
      if [[ $top -lt 0 ]]; then
        top=0
      fi
    fi

    if [ "${theset[0]}" -lt "${i}" ] && [ "${theset[1]}" -lt "${i}" ];then
      bot=$(( $guess+1 ))
    fi
  done
  let count=count+1
done < "$theins" 
#rm ./tmp*       
#./newlang ../orig/enwiktionary-latest-pages-articles.xml fr
Thanks for all the help.. and goddamn, --byte-offset is awesome.

Last edited by cin_; 05-22-2014 at 11:38 PM. Reason: gramm`err
 
Old 05-23-2014, 10:28 AM   #13
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
Hmmm... to say your code has confused me is a little of an understatement.

I tried to rewrite it to see if I could understand it better that way, but would you be able to tell me what data, and in what format, is stored in both byte.bytechunk.lst and bytechunk.lst?
I am guessing from some of the work that they are numbers, but with the tests you are showing it has become confusing.

What I will say is that you switch between [ ], [[ ]], (( )) and let, which makes it rather difficult to follow your logic. If you provide the detail above, I would be happy to show you an alternative using basically the same code.
 
Old 05-28-2014, 07:04 AM   #14
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,358

Rep: Reputation: 2751
Not entirely clear what you're after, but this will grab a range of lines by pattern, within a line-number range:
Code:
sed -n '2,${/^<page>/,/^<\/page>/p;}' t.t
HTH
 
  

