LinuxQuestions.org
10-23-2008, 05:03 PM   #1
lodi
LQ Newbie
 
Registered: Oct 2008
Posts: 2

Rep: Reputation: 0
Sed one-liner to drop data from beginning of file?


Hello, I'm trying to drop all data from the beginning of a file up to the first occurrence of a specific opening XML tag. I need this operation to run as fast as possible, since it will be used on huge files (several GB) that, for the most part, don't have any newlines in them.

This is the best I can come up with so far, and it doesn't quite work:

sed '1,/<foo / s/^.*<foo /<foo /'

when I run it on this file:

---
asdga sdf
asdf asf a
garbage garbage</foo><foo xmlns=...</foo><foo ...></foo>
---

I get

---
asdga sdf
asdf asf a
<foo ...></foo>
---

So it doesn't remove the garbage at the beginning, and it also removes too many 'foo' tags because of the greedy pattern match.

How can I get sed to match *everything* up to a token, not just line-by-line? Or alternatively, is there some other command I can use that would still run as fast?

Thanks.
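
A minimal reproduction of the two effects described above, as a sketch using made-up sample lines and hypothetical function names (`greedy_demo`, `range_demo`). sed applies each command to one line at a time, and `.*` always takes the longest possible match:

```shell
# Greedy match: ".*" runs to the LAST "<foo " on the line,
# so every earlier tag on that line is swallowed.
greedy_demo() {
  printf '%s\n' 'garbage</foo><foo a></foo><foo b></foo>' |
    sed 's/^.*<foo /<foo /'
}

# Line-by-line: lines before the match are inside the range 1,/<foo /,
# but the s command finds nothing to replace on them,
# so they are printed unchanged.
range_demo() {
  printf '%s\n' 'junk' 'garbage<foo a>' |
    sed '1,/<foo / s/^.*<foo /<foo /'
}

greedy_demo
range_demo
```

The first call keeps only the last tag of its line; the second still prints the `junk` line.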
 
10-23-2008, 06:10 PM   #2
forrestt
Senior Member
 
Registered: Mar 2004
Location: Cary, NC, USA
Distribution: Fedora, Kubuntu, RedHat, CentOS, SuSe
Posts: 1,288

Rep: Reputation: 99
With the limited amount of input data, the following seems to work:

Code:
sed -n -e '/<foo /,/<foo / s/^.*<foo /<foo /' -e '/<foo /,$  p' foo
HTH

Forrest

Last edited by forrestt; 10-23-2008 at 06:36 PM. Reason: left a space out of second -e start parameter
 
10-24-2008, 03:05 AM   #3
burschik
Member
 
Registered: Jul 2008
Posts: 159

Rep: Reputation: 31
csplit may be faster, though I haven't checked.
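
For reference, a sketch of what the csplit route might look like. The helper name `drop_leading_lines` and the `part` prefix are assumptions for illustration. Note that csplit cuts at line boundaries only, so garbage on the same line as the first `<foo ` survives, and on a file with no newlines it would not help at all:

```shell
# Hypothetical helper: split FILE at the first line that contains "<foo "
# and print the second piece (that line and everything after it).
drop_leading_lines() {
  dir=$(mktemp -d) || return 1
  # -s: do not print the size of each piece
  # -f: prefix for the output pieces (part00, part01)
  csplit -s -f "$dir/part" "$1" '/<foo /' && cat "$dir/part01"
  rm -rf "$dir"
}
```

Usage would be `drop_leading_lines foo.xml > cleaned.xml`; whether this beats sed on speed is untested here.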
 
10-24-2008, 09:44 AM   #4
lodi
LQ Newbie
 
Registered: Oct 2008
Posts: 2

Original Poster
Rep: Reputation: 0
Thanks forrestt, but that throws away every foo except one. I need to preserve all the foo tags I can (basically I just have to drop some malformed xml from the beginning of the file, and then keep processing from the first opening tag).

Thanks for the help though.
 
10-24-2008, 03:21 PM   #5
jan61
Member
 
Registered: Jun 2008
Posts: 235

Rep: Reputation: 47
Moin,

I don't know whether my "solution" ;-) fits your needs; it's a little bit strange. sed has no non-greedy quantifier; for that you need Perl-compatible regexes (which is why you should check whether you can do the job in Perl). I went another way:
Code:
sed -r '1,/<foo /{s/$/|/;s/(<foo )/|\1/;s/[^|]*\|//;/^\|*$/d;s/\|//}' foo.xml
To explain it: I started with a line selection like yours:
Code:
jan@jack:~/tmp> sed -r '1,/<foo /{ # start a code block
s/$/|/ # append a | at the end of lines
s/(<foo )/|\1/ # insert a | before the first "<foo "
s/[^|]*\|// # remove everything up to the first | in every line
/^\|*$/d # now delete empty lines (which don't contain a "<foo ")
s/\|// # remove the remaining | at the end of the "<foo " line
}' foo.xml
hth
Jan
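
A variant of the same idea that avoids relying on `|` never appearing in the data: use a newline as the marker, since a newline cannot occur inside a line that sed has just read. This is a sketch assuming GNU sed (`\n` in the replacement and inside a bracket expression is a GNU extension); `strip_head` is a hypothetical name:

```shell
# Hypothetical helper: print the input starting at the first "<foo ".
# Reads the files given as arguments, or stdin if there are none.
strip_head() {
  sed -n '
    /<foo /{
      # Mark the first "<foo " on this line with a newline.
      s/<foo /\n&/
      # Drop everything up to and including the marker.
      s/^[^\n]*\n//
      # From here on, print every line unchanged until input ends.
      :rest
      p
      n
      b rest
    }' "$@"
}
```

For example, `strip_head foo.xml > cleaned.xml`. Lines before the first match are never printed because of `-n`, and the loop keeps all later tags intact.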
 
10-24-2008, 10:32 PM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Code:
awk '/<\/foo>/{
    start = match($0,"</foo>")
    print substr($0,RSTART)
    f=1
    next
}f' file
 
10-29-2008, 05:22 PM   #7
jan61
Member
 
Registered: Jun 2008
Posts: 235

Rep: Reputation: 47
Moin,

Quote:
Originally Posted by ghostdog74
Code:
awk '/<\/foo>/{
    start = match($0,"</foo>")
    print substr($0,RSTART)
    f=1
    next
}f' file
This solution doesn't work if a line after the garbage at the head of the file does not contain a </foo>. lodi wanted to clean up an XML file containing garbage at the beginning, up to the first opening "<foo " tag. You must extend the script so that after the first match all lines are printed (and so that it matches the opening tag, not "</foo>"). Untested:
Code:
awk '
  # Before the first match: find the opening tag and print from it onward.
  !found && /<foo / {
    match($0, /<foo /)
    print substr($0, RSTART)
    found = 1
    next
  }
  # After the first match: print every line unchanged, exactly once.
  found' file
Jan
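
Since the original files are several GB with almost no newlines, any line-oriented tool has to hold a huge line in memory. One possible workaround, sketched here under the assumption that awk is gawk or mawk (both treat a multi-character `RS` as a regular expression; POSIX awk does not guarantee this), is to make the tag itself the record separator. `strip_head_stream` is a hypothetical name:

```shell
# Hypothetical helper: print the input starting at the first "<foo ".
# Assumes an awk that accepts a multi-character RS (gawk, mawk).
strip_head_stream() {
  awk 'BEGIN { RS = "<foo "; ORS = "" }
       # Record 1 is the leading garbage; skip it.
       # Every later record lost its "<foo " separator, so put it back.
       NR > 1 { print RS $0 }' "$@"
}
```

Each later record is printed with the separator restored in front, and newlines inside the data pass through untouched. Memory use then follows the longest stretch between two `<foo ` tags rather than the longest line.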
 