[SOLVED] sed, grep, gawk :: Finding First Instance Of String Before A Given Line Number
I have two files..
Code:
enwiki.xml #~4G (awk was complaining about file size until grail suggested gawk)
verbs.fr #
verbs.fr lists the line numbers where a string matched in an earlier grep pass
what i need is:
taking a line number from verbs.fr, say 53343242: in enwiki.xml, look backwards from that line until i find the first occurrence of the string '<page>', then look at lines after the line number until i find the first occurrence of '</page>', and finally print out the chunk between them
the big problem is that the line number is necessary because the string can occur more than once; if the string were unique for each instance i could slap together a regular expression to find the chunk
i can write my own script but it would be slow..
my knowledge of sed, grep, and (g)awk is limited so i always assume there is some quick and easy way to do what i need with them
if writing my own script is the route necessary..
i was planning something like:
stopping at a line n distance before my desired line, like n=17, then stepping through those 17 lines; if '</page>' or '<page>' occurs more than once before the specified line number, adjust the value of n until '<page>' occurs only once before the desired line, then step through until '</page>' appears, note the line numbers s and e for '<page>' and '</page>', and sed -n "${s},${e}p" FILE..
again, but it will be slooow
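for reference, a minimal one-pass gawk sketch of the kind of one-liner being asked for (an illustration only, assuming the wanted line always falls inside a page and that the tags sit on lines of their own):
Code:
# keep a buffer from the most recent <page>; once past line n,
# the first </page> closes the chunk we want
gawk -v n=53343242 '
/<page>/ { buf = "" }
{ buf = buf $0 ORS }
/<\/page>/ && NR >= n { printf "%s", buf; exit }
' enwiki.xml
it still scans from line 1 to reach line n, which is exactly the slowness at issue here.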
As you are using Ubuntu, you may find you are using mawk, which is the default there. So you could try installing gawk and see if that eliminates the file size issue.
I would, however, probably use something like Perl or Ruby, as they have modules/libraries which can parse XML and hence would probably make the whole process easier.
i'd rather stay away from any parsers, was hoping to get it together in a one-liner, but if a full-fledged script is necessary i can reconsider the parser..
Quote:
verbs.fr lists the line numbers where a string matched in an earlier grep pass
what i need is:
taking a line number from verbs.fr, say 53343242: in enwiki.xml, look backwards from that line until i find the first occurrence of the string '<page>', then look at lines after the line number until i find the first occurrence of '</page>', and finally print out the chunk between them
If you can use GNU grep's --byte-offset option when making verbs.fr, you can save a lot of time. Finding a line number requires scanning the whole file up to that line, but given a byte offset you can jump right there (not using awk though).
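A sketch of the idea (pattern and offset are illustrative; note that grep's byte offsets are 0-based while tail -c +N counts from 1):
Code:
# build verbs.fr with byte offsets instead of line numbers: "OFFSET:matched line"
grep --byte-offset '===Verb==={{fr-verb' enwiki.xml > verbs.fr
# later, jump straight to a match without rescanning everything before it
off=53343242                      # hypothetical offset taken from verbs.fr
tail -c +$(( off + 1 )) enwiki.xml | head -n 3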
danielbmartin this is awesome.. exactly what i expected and wanted to see
and syg00's exit was a great optimisation for how i was going to use it
ntubski, --byte-offset is HUGE.. i wanted something like this after i read that awk's NR counts from line 1 each time it runs
i first tried loading the file into an array so i could keep count of the last match with an index counter,
but the load made it ridiculously slow and the whole attempt was riddled with other issues.. --byte-offset would have fixed this by letting me determine which byte to start the next search from
so many times i am working on little projects and want a way to 'stream' naively and at my own pace
--byte-offset will help a great deal
i ended up writing a bunch of one liners and just strung them through a bash script creating tons of temporary files
i thought if i could trust the data wholly then there should be an equal number of opening <page>s and closing </page>s
i could..
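that sanity check is quick, since grep -c counts matching lines (enough if the dump keeps one tag per line):
Code:
# a well-formed dump should print two equal counts
grep -c '<page>' enwiki.xml
grep -c '</page>' enwiki.xml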
I am curious. Whilst I had thought of the solutions presented by Daniel and syg00, my understanding was that the following information:
Quote:
taking a line number from verbs.fr, say: 53343242
meant that the file contained line numbers and not the string you are looking for, hence the //{} format would not work.
Obviously you can compare the number with NR; however, this rules out setting RS to the page delimiters, as that would also alter the NR count, and I am guessing the number supplied is based on individual lines of data.
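To illustrate the conflict (gawk accepts a multi-character RS; NR then counts records, i.e. pages, not file lines):
Code:
# each record is now a whole page, so a line number taken from verbs.fr
# can no longer be compared against NR
gawk -v RS='</page>' '/===Verb===/ { print "match in page (record) " NR }' enwiki.xml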
So using this information, please confirm how these solutions have aided you?
grail, i've been kind of straddling this post between my home box and my work box.. we have ubuntu set up at work
the project is my own though, and i suppose i have been vague about the data, but mostly unintentionally
i hoped that if i could get many languages from one source the data would be homogeneous enough.. naturally i have already found a couple of inconsistencies in styling and syntax, but far fewer than if i were sourcing data from all over for each language.. that way adding new languages to the aide would be painless
that is a pseudo`unique string that can be located in the entire ~4G database
i went through the four gig data with a script that stripped the \n after the string '===Verb===' so i could grep -n '===Verb==={{fr-verb'
now i have a file listing, with an internal line number, all of the word pages that contain a french verb
so two separate files: one containing only the data i want, with line numbers, and one containing all of the data, without line numbers
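that preprocessing, as it later appears (commented out) in the script at the end of this thread, with filenames substituted for its variables:
Code:
# join '===Verb===' with the line that follows it so one grep can match both parts
perl -pe 's/(?<=\=\=\=Verb\=\=\=)\n//' enwiktionary.xml > enwik.mod.singline.xml
# record matches with their line numbers, then drop conjugated forms
grep -n '===Verb==={{fr-verb' enwik.mod.singline.xml > tmp.fr.verbose
sed '/verb-form/d' tmp.fr.verbose > byteverb.fr.only.lst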
due to the work and home relationship i have yet to be in the right place to test the (g)awk suggestions, and my elation and thanks were based solely on the functionality the post implied; from the example data and the output it looked like it would do what i was asking
also, like i said, by the time i saw the responses i had already podged together a walking script that got me what i needed, so testing the suggestions would have been purely academic
when i get to a place where i can run the suggested one-liner against my data i'll post how i used the suggestion and what the output looked like
certainly when i go to add russian, fyodor!, and german, franz!, i'll return to this thread and see what i can use, and how, to speed up the process of turning wiktionary's data schema into my own
Quote:
... exit was a great optimisation for how i was going to use it
Be aware that if you are going to be re-reading the same file, there are significant advantages in ordering the reads.
The way page-cache works, if you can make use of data already resident in RAM, you save having to do any of the physical I/O on the subsequent reads. Sounds obvious, but let's say you can hold 40% of the file in page-cache. If you read say 30% of the file and exit, then subsequently read 10% (from the start), you do no (physical) I/O. Way to go.
However, if you first read 90% of the file, page-cache starts flushing, and when you then read the 10%, it all has to be read again from disk.
Makes a big difference if you're doing it a lot.
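A rough way to watch this happen (sizes are hypothetical for a ~4G file; the second read should return almost instantly because its prefix is still resident):
Code:
time head -c 1200M enwiki.xml > /dev/null   # ~30% prefix: cold read, real disk I/O
time head -c 400M enwiki.xml > /dev/null    # ~10% prefix: served from page-cache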
that [solved] tag was a bit premature and way overzealous..
when i said my own solution worked but was slow.. that turned out to be incorrect as well,
when i looked at the data it was incomplete and weird,
and reworking the script to perform correctly came with a theoretical runtime of 6 days
as for the awk suggestions..
after running some tests i see why grail was confused by my confirmation of the awk suggestions,
they worked, but with a number of false positives:
sometimes the line number was present in an unrelated way, like in an ID number or timestamp or some such,
so i massaged the data to include a unique string with the line number:
Code:
5553343--:LINE
5553344--:LINE
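a hypothetical reconstruction of that tagging step (the exact command isn't shown; the marker only needs to be a string that cannot occur naturally in the data):
Code:
# prefix every line with its line number plus an un-collidable marker
awk '{ print NR "--:LINE", $0 }' enwik.mod.singline.xml > enwik.tagged.xml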
then re`awk'd, but it was too slow
so i tried reworking,
all efforts failed
i got the runtime down to 26 hours but still too long so i decided to start over,
and looked into ntubski's suggestion of --byte-offset
my goal was speed and i was trying anything,
i was --byte-offset'ing, then --byte-offset'ing the --byte-offset file itself,
this shit was turtles all the way down, but whatever i tried sloshed along
then i found the --byte-offset equivalent flag in tail and head: -c; tail -c +INT | head -c INT
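which makes extracting an arbitrary byte range a two-command job (offsets are hypothetical; the extra 8 bytes cover '</page>' plus its newline):
Code:
start=53343242    # byte offset of a '<page>' line
stop=53350000     # byte offset of the matching '</page>' line
tail -c +"$start" enwiki.xml | head -c $(( stop - start + 8 ))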
so my plan became: binary-search the --byte-offset of the --byte-offset file to find the right split using tail -c, then use that split to tail -c the original and extract the bit i desired
with some work here and there i pulled off a ~4+ minute run where all of the data i wanted was present,
such results seem crazy when compared to what it originally took me
here's the script for posterity:
Code:
#!/bin/bash
## from the file given as the first argument, extract the chunks that fall between
## the two byte offsets bracketing each wanted line
theorig="$1"
thislang="$2"
theout="$2.verb.lst"
bytebyte="byte.bytechunk.lst"
bytechunk="bytechunk.lst"
modded="enwik.mod.singline.xml"
theins="byteverb.fr.only.lst"
#modded="tmp.enwik.mod.singline.xml"
#theins="tmp.byteverb.fr.only.lst"
#perl -pe 's/(?<=\=\=\=Verb\=\=\=)\n//' $theorig > $modded
#grep -n "===Verb==={{fr-verb" $modded > tmp.fr.verbose
#sed "/verb-form/d" tmp.fr.verbose > $theins
wc -c $modded
#bytebyte="tmp.byte.bytechunk.lst"
#bytechunk="tmp.bytechunk.lst"
#grep --byte-offset "<page>" $1 > tmp.bytepage.lst
#grep -o '[0-9]*' tmp.bytepage.lst > tmp.bytepage.only.lst
#grep --byte-offset "</page>" $1 > tmp.bytestop.lst
#grep -o '[0-9]*' tmp.bytestop.lst > tmp.bytestop.only.lst
#paste tmp.bytepage.only.lst tmp.bytestop.only.lst > $bytechunk    # "START STOP" per page
#grep --byte-offset "" "$bytechunk" > "$bytebyte"                  # offset of every line of the offsets file
count=0
while read line
do
    IFS=":"                             # $theins entries look like "NUMBER:match"
    thisln=(${line})
    i=${thisln[0]}                      # the number we are hunting for
    found=0
    bot=0
    findtop=($(tail -1 $bytebyte))
    top=${findtop[0]}                   # offset of the last entry: upper bound of the search
    unset IFS
    while [[ $found -eq 0 ]]; do
        incount=0
        guess=$(( bot + (top - bot) / 2 ))
        # jump to byte $guess of the offsets file and take the next complete
        # "START STOP" pair (the first line read may be a partial line)
        while read line; do
            arr=(${line})
            if [[ $top -eq 0 ]]; then
                theset=($line)
                break
            fi
            if [[ $incount -eq 1 ]]; then
                theset=(${line})
                break
            fi
            let incount=incount+1
        done < <(tail -c +"$guess" $bytechunk | head -n 2)
        # hit: the wanted number falls inside this <page>/</page> pair
        if [ "${theset[0]}" -lt "${i}" ] && [ "${theset[1]}" -gt "${i}" ]; then
            echo -en "\r"${theset[0]}..$i..${theset[1]}..$count
            ( tail -c "+${theset[0]}" "$modded" | head -c $(( ${theset[1]}-${theset[0]}+21 )) >> $theout ) &
            found=1
        fi
        # pair lies past the wanted number: search the lower half
        if [ "${theset[0]}" -gt "${i}" ] && [ "${theset[1]}" -gt "${i}" ]; then
            top=$(( guess - 1 ))
            if [[ $top -lt 0 ]]; then
                top=0
            fi
        fi
        # pair lies before the wanted number: search the upper half
        if [ "${theset[0]}" -lt "${i}" ] && [ "${theset[1]}" -lt "${i}" ]; then
            bot=$(( guess + 1 ))
        fi
    done
    let count=count+1
done < "$theins"
#rm ./tmp*
#./newlang ../orig/enwiktionary-latest-pages-articles.xml fr
thanks for all the help, and goddamn --byte-offset is awesome
hmmm ... to say your code has confused me is a bit of an understatement
I tried to re-write it to see if I could understand it better that way, but would you be able to tell me what data, and in what format, is stored in both byte.bytechunk.lst and bytechunk.lst?
I am guessing from some of the work they are numbers, but with the tests you are showing it has become confusing.
What I will say is that you switch between [], [[]], (()) and let, which makes it rather difficult to follow your logic. If you would provide the detail above I would be happy to show you an alternative using basically the same code.