LinuxQuestions.org
Old 11-10-2009, 10:50 PM   #16
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 2,012

Rep: Reputation: 115

ah, right. parsing -> word splitting -> evaluation
 
Old 11-11-2009, 06:20 AM   #17
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Original Poster
Rep: Reputation: 743
Quote:
Originally Posted by ghostdog74
ok on small files, but will choke on big files.
I would guess that my solution could have the same issue (the whole file winds up in the SED buffers.)

I have tried to find a SED solution that goes line by line, but no luck so far.
 
Old 11-11-2009, 07:31 AM   #18
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Quote:
Originally Posted by pixellany
That just might be more obfuscated than my SED solution.....
It's about BUFFER.
 
Old 11-11-2009, 07:43 AM   #19
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Original Poster
Rep: Reputation: 743
Quote:
Originally Posted by ghostdog74
Less code doesn't mean it's always legible or understandable.
Code:
awk 'NR>1&&$1=="the"{print ""}{ printf "%s ",$0}' words.txt
Because printf doesn't insert a newline unless you tell it to, the output will be the lines concatenated together until the keyword "the" is found, at which point a newline is printed. This is much simpler to understand than the bunch of sed secret code.
Before I can run side-by-side tests, the above needs to be modified to remove extra spaces and to have a (linefeed or EOF?) at the end.

Note the following:
Code:
[mherring@Ath play]$ awk 'NR>1&&$1=="the"{print ""}{ printf "%s ",$0}' words.txt
the house is blue
the cat is  hungry
the sun is bright
the
the  cat gone
the [mherring@Ath play]$ sed -n '${H;x;s/\n/ /g;s/^ *//;s/ \+/ /g;s/ the/\nthe/g;p};H' words.txt
the house is blue
the cat is hungry
the sun is bright
the
the cat gone
the
[mherring@Ath play]$
Note that the awk output lacks a trailing newline, so the next shell prompt lands on the same line as the final "the". It's also hard to see, but it leaves in some extra spaces (where there are blank lines in the original file).
 
Old 11-11-2009, 07:43 AM   #20
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by pixellany
I would guess that my solution could have the same issue (the whole file winds up in the SED buffers.)

I have tried to find a SED solution that goes line by line, but no luck so far.
On the other hand, you could preprocess using tr to get rid of the newlines, or even sed to put everything on one line and trim whitespace before splitting it up again. Nothing says sed has to load a seemingly-endless line all at once if the pattern doesn't require it.
Kevin Barry
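A rough sketch of the tr-then-sed idea, for what it's worth. It assumes GNU sed (for `\n` in the replacement and the `\>` end-of-word match); `flatten_and_split` is just a made-up name and the sample input is invented, since the real words.txt isn't shown. Whether sed copes with the resulting single very long line on a huge file is a separate question this doesn't answer.

```shell
# Sketch only: assumes GNU sed (\n in the replacement, \> for end of word).
flatten_and_split() {
  # tr -s: newlines become spaces, and runs of spaces are squeezed to one.
  # The trailing echo restores the final newline that tr translated away.
  { tr -s ' \n' '  '; echo; } |
    sed 's/^ //; s/ $//; s/ the\>/\nthe/g'
}

# Hypothetical sample input: one word per line, with a blank line in the middle.
printf '%s\n' the house is blue the cat is '' hungry the the cat gone |
  flatten_and_split
```

Matching ` the\>` rather than ` the` keeps words like "thesis" from being split, and leaving the following space unconsumed lets two consecutive "the" lines each start their own output line.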
 
Old 11-11-2009, 07:59 AM   #21
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Original Poster
Rep: Reputation: 743
Quote:
Originally Posted by ghostdog74
ok on small files, but will choke on big files.
Quote:
Originally Posted by pixellany
I would guess that my solution could have the same issue (the whole file winds up in the SED buffers.)

I have tried to find a SED solution that goes line by line, but no luck so far.
My SED solution fails with a file size of 15MB, but works at 4MB. The alpha-geek will want to figure out exactly what file size breaks it and why.......

Last edited by pixellany; 11-11-2009 at 08:00 AM.
 
Old 11-11-2009, 08:51 AM   #22
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 2,012

Rep: Reputation: 115
Should be able to handle arbitrarily large files, but not so much very long lines.
Code:
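#!/bin/bash
# Join input lines with spaces; print a finished line each time a new " the" arrives.
# The regex lives in a variable because bash 3.2+ treats a quoted =~ pattern as a literal string.
RE='^ the.* the'
while read -r LINE ; do
   BUFFER="$BUFFER $LINE"
   BUFFER="${BUFFER//  / }"                # collapse double spaces left by blank lines
   while [[ "$BUFFER" =~ $RE ]] ; do
      BUFFER="${BUFFER:1}"                 # drop the leading space
      BUFFER="${BUFFER/ the/$'\n' the}"    # break before the next " the"
      echo "${BUFFER%%$'\n'*}"             # print the completed line
      BUFFER="${BUFFER#*$'\n'}"            # keep the remainder for the next round
   done
done
BUFFER="${BUFFER:1}"
if [[ "${BUFFER:$((${#BUFFER}-1))}" == " " ]] ; then
   BUFFER="${BUFFER:0:$((${#BUFFER}-1))}"  # trim one trailing space
fi
BUFFER="${BUFFER// the/$'\n'the}"
echo "$BUFFER"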
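A side note on `=~` for anyone trying this on a newer bash: since bash 3.2, a quoted right-hand side is matched as a plain string rather than as a regex, so a pattern containing `^` or `.*` generally has to be expanded unquoted, e.g. from a variable. A minimal sketch of the difference (bash only; the sample string is made up):

```shell
#!/bin/bash
# Sketch, assuming bash 3.2 or later with default compatibility settings.
s=' the cat the'

# Quoted right-hand side: compared as a literal string, so ^ and .* are not regex operators.
if [[ $s =~ "^ the.* the" ]]; then quoted=match; else quoted=no-match; fi

# Pattern stored in a variable and expanded unquoted: treated as a regex.
re='^ the.* the'
if [[ $s =~ $re ]]; then unquoted=match; else unquoted=no-match; fi

echo "quoted: $quoted, unquoted: $unquoted"
# prints: quoted: no-match, unquoted: match
```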
 
Old 11-11-2009, 09:06 AM   #23
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by pixellany
Before I can run side-by-side tests, the above needs to be modified to remove extra spaces and to have a (linefeed or EOF?) at the end.
Code:
$ awk 'NR>1&&$1=="the"{print ""}NF{printf "%s ",$0}END{print ""}' w.txt
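For anyone reading along, here is roughly how that one-liner reads when spelled out with comments. The function name `join_on_the` and the sample input are made up for illustration; the thread's real w.txt isn't shown.

```shell
# Sketch: the same awk program, wrapped in a hypothetical function and commented.
join_on_the() {
  awk '
    NR > 1 && $1 == "the" { print "" }   # a line whose first field is "the" ends the previous output line
    NF { printf "%s ", $0 }              # NF is 0 on blank lines, so they are skipped; no newline is added
    END { print "" }                     # finish the last output line with a newline
  '
}

# Made-up input: one word per line, with one blank line.
printf '%s\n' the house is blue the cat is '' hungry | join_on_the
```

Because `$1 == "the"` compares the whole first field, a line such as "thesis" does not start a new output line. Each output line still ends with one trailing space, since the printf format always appends one.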
 
Old 11-11-2009, 09:09 AM   #24
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by tuxdev
Should be able to handle arbitrarily large files, but not so much very long lines.
It can't handle data like this (there is a "the" in "thesis"):
Code:
the
thesis
is done
 
Old 11-11-2009, 12:09 PM   #25
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Original Poster
Rep: Reputation: 743
Code:
[mherring@mystical play]$ time for (( n=0 ; n<1000 ; n++ )) do awk 'NR>1&&$1=="the"{print ""}NF{printf "%s ",$0}END{print ""}';done <words >outfile

real    0m3.566s
user    0m2.703s
sys     0m0.603s
[mherring@mystical play]$ time for (( n=0 ; n<1000 ; n++ )) do sed -n '${H;x;s/\n/ /g;s/^ *//;s/ \+/ /g;s/ the/\nthe/g;p};H';done <words >outfile

real    0m3.472s
user    0m2.693s
sys     0m0.510s
[mherring@mystical play]$
The ratio of the times seems to hold up to a file size ~100X larger.

So, SED wins by a teeny margin---not enough to justify the higher "inscrutable factor".
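One thing that may be worth checking in the timing loop above: `<words` is attached to the whole `for` loop, so the 1000 runs share a single open file and, as far as I can tell, only the first run actually sees any input; the rest start at end-of-file. A minimal sketch that reopens the file on every pass (`bench` is a made-up helper name, not something from the thread):

```shell
# Hypothetical helper: run a command n times, reopening the input file on each pass.
bench() {
  n=$1
  infile=$2
  shift 2
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" < "$infile"      # redirect inside the loop so every run reads from the start
    i=$((i + 1))
  done
}

# Example call in bash, using the awk program and input file name from this thread:
# time bench 1000 words awk 'NR>1&&$1=="the"{print ""}NF{printf "%s ",$0}END{print ""}' >outfile
```

With the redirection inside the loop, the output file should contain 1000 copies of the result, which is an easy sanity check that every run did real work.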


Any other problems we can chew on?
 
Old 11-11-2009, 12:23 PM   #26
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Pixellany, you know the rules - no homework here on LQ.

On a serious note, I think we need more threads like that!
 
Old 11-11-2009, 12:37 PM   #27
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Original Poster
Rep: Reputation: 743
Homework??!!!----at my age? Hmmmmm
 
Old 11-11-2009, 12:47 PM   #28
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Quote:
Originally Posted by pixellany
Homework??!!!----at my age? Hmmmmm
That's what they all say, LOL
 
Old 11-11-2009, 06:17 PM   #29
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 2,012

Rep: Reputation: 115
Code:
#!/bin/bash
while read -r LINE ; do
   BUFFER=" $BUFFER $LINE "
   BUFFER="${BUFFER//  / }"
   BUFFER="${BUFFER:1}"
   BUFFER="${BUFFER// the /$'\n'the }"
   if [[ "$BUFFER" =~ $'\n' ]] ; then
      echo "${BUFFER%%$'\n'*}"
      BUFFER="${BUFFER#*$'\n'}"
   fi
done
BUFFER="${BUFFER:0:$((${#BUFFER}-1))}"
BUFFER="${BUFFER// the /$'\n'the }"
echo "$BUFFER"
The sed version actually has the "the" vs. "thesis" problem too.
 
Old 11-11-2009, 06:24 PM   #30
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Original Poster
Rep: Reputation: 743
Quote:
Originally Posted by tuxdev
The sed version actually has the "the" vs. "thesis" problem too.
You mean my original thing?

I assume that the fix would be to replace "the" with "\bthe\b"
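A quick sketch of that idea, assuming GNU sed (which is what provides `\b`, as well as the `\+` and the `\n` in the replacement that the one-liner already relies on). Only the last substitution changes; `join_words` is just a wrapper name for illustration, and the input is the "thesis" sample from earlier in the thread.

```shell
# Sketch, assuming GNU sed. Same one-liner as before, with \b added around "the".
join_words() {
  sed -n '${H;x;s/\n/ /g;s/^ *//;s/ \+/ /g;s/ \bthe\b/\nthe/g;p};H'
}

printf '%s\n' the thesis 'is done' | join_words
# prints: the thesis is done
```

Note that `\b` also counts punctuation as a boundary, so something like "the-end" would still be split after the space before "the".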
 
  

