LinuxQuestions.org


lodi 10-23-2008 05:03 PM

Sed one-liner to drop data from beginning of file?
 
Hello, I'm trying to drop all data from the beginning of a file up to the first occurrence of a specific opening XML tag. This needs to run as fast as possible, since it will be used on huge files (several GB) that, for the most part, contain no newlines.

This is the best I can come up with so far, and it doesn't quite work:

sed '1,/<foo / s/^.*<foo /<foo /'

when I run it on this file:

---
asdga sdf
asdf asf a
garbage garbage</foo><foo xmlns=...</foo><foo ...></foo>
---

I get

---
asdga sdf
asdf asf a
<foo ...></foo>
---

So it doesn't remove the garbage at the beginning, and it also removes too many 'foo' tags because the pattern match is greedy.
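Both effects can be reproduced in isolation. The following is a minimal sketch, assuming a POSIX shell and sed; the sample line and the variable names are made up for illustration and are not part of the original data.

```shell
# Made-up one-line sample shaped like the data in the post.
line='garbage</foo><foo a="1"></foo><foo b="2"></foo>'

# Greedy: ^.* extends to the LAST "<foo ", so the first tag is removed as well.
greedy=$(printf '%s\n' "$line" | sed 's/^.*<foo /<foo /')
printf '%s\n' "$greedy"

# Line-based: a preceding line without "<foo " is inside the 1,/<foo / range,
# but the substitution finds nothing to replace there, so it is printed as is.
ranged=$(printf 'junk\n%s\n' "$line" | sed '1,/<foo / s/^.*<foo /<foo /')
printf '%s\n' "$ranged"
```

The first command prints only <foo b="2"></foo>; the second prints the untouched "junk" line followed by the same truncated tag.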

How can I get sed to match *everything* up to a token, not just line-by-line? Or alternatively, is there some other command I can use that would still run as fast?

Thanks.

forrestt 10-23-2008 06:10 PM

With the limited amount of input data, the following seems to work:

Code:

sed -n -e '/<foo /,/<foo / s/^.*<foo /<foo /' -e '/<foo /,$  p' foo
HTH

Forrest

burschik 10-24-2008 03:05 AM

csplit may be faster, though; I haven't checked.
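For reference, a minimal sketch of what a csplit call could look like, assuming a csplit with the standard -s and -f options. The function name and the "part_" prefix are arbitrary choices made here. Note that csplit cuts at line boundaries only, so any garbage that precedes the tag on the matching line stays in place, and no speed comparison has been made.

```shell
# Hypothetical helper: split FILE at the first line containing "<foo ".
# Writes part_00 (the lines before the match) and part_01 (the matching
# line and everything after it) into the current directory.
split_at_first_foo() {
  # -s: do not print the piece sizes, -f: prefix for the output files
  csplit -s -f part_ "$1" '/<foo /'
}
```

After running split_at_first_foo on the sample file, part_01 would start with the "garbage garbage</foo><foo ..." line, so a second step is still needed to trim inside that line.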

lodi 10-24-2008 09:44 AM

Thanks forrestt, but that throws away every foo except one. I need to preserve all the foo tags I can (basically, I just have to drop some malformed XML from the beginning of the file and then keep processing from the first opening tag).

Thanks for the help though.
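One way to express "drop everything before the first opening tag and keep the rest untouched" is to look up the byte offset of the tag and copy from there. This is only a sketch under assumptions: it relies on GNU grep (the -a, -b, -o and -m options) and on tail -c, and the function name is made up. grep still has to read the matching line, which may be expensive for a multi-gigabyte line; the tail step itself just streams bytes.

```shell
# Hypothetical helper: print FILE from the first "<foo " onward.
from_first_foo() {
  # -a: treat the file as text, -b: print byte offsets,
  # -o: one output line per match (offset:match),
  # -m 1: stop after the first matching line.
  # Only the offset of the very first match is kept.
  offset=$(grep -a -b -o -m 1 '<foo ' "$1" | head -n 1 | cut -d: -f1)
  [ -n "$offset" ] || return 1    # no opening tag found
  # grep offsets are 0-based, tail -c +N is 1-based, hence the +1.
  tail -c "+$((offset + 1))" "$1"
}
```

Usage would be along the lines of from_first_foo big.xml > cleaned.xml, where both file names are placeholders.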

jan61 10-24-2008 03:21 PM

Hi,

I don't know whether my "solution" ;-) fits your needs; it's a little bit strange. sed has no non-greedy quantifier; for that you need Perl-compatible regular expressions (which is why you should check whether you can do the job in Perl). I went another way:
Code:

sed -r '1,/<foo /{s/$/|/;s/(<foo )/|\1/;s/[^|]*\|//;/^\|*$/d;s/\|$//}' foo.xml
To explain it: I started with a line selection like yours (this assumes the garbage itself contains no | character):
Code:

sed -r '1,/<foo /{ # start a code block for line 1 up to the first "<foo " line
s/$/|/ # append a | at the end of each line
s/(<foo )/|\1/ # insert a | before the first "<foo "
s/[^|]*\|// # remove everything up to and including the first | in every line
/^\|*$/d # now delete empty lines (those which do not contain a "<foo ")
s/\|$// # remove the remaining | at the end of the "<foo " line
}' foo.xml

hth
Jan
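To make the marker trick easier to follow, here is a sketch that wraps it in a function and traces the third sample line through each step. It assumes GNU sed (for -r) and that no | occurs before the first "<foo " in the data. The function name is hypothetical, and the last substitution is anchored with $ so that only the appended marker is removed.

```shell
# Trace for the line that contains the first "<foo ":
#   input            garbage garbage</foo><foo xmlns=...</foo><foo ...></foo>
#   s/$/|/           garbage garbage</foo><foo xmlns=...</foo><foo ...></foo>|
#   s/(<foo )/|\1/   garbage garbage</foo>|<foo xmlns=...</foo><foo ...></foo>|
#   s/[^|]*\|//      <foo xmlns=...</foo><foo ...></foo>|
#   s/\|$//          <foo xmlns=...</foo><foo ...></foo>
# Lines without "<foo " end up empty after the third step and are deleted.
strip_to_first_foo() {
  sed -r '1,/<foo /{
    s/$/|/
    s/(<foo )/|\1/
    s/[^|]*\|//
    /^\|*$/d
    s/\|$//
  }' "$1"
}
```

Lines after the first "<foo " line fall outside the 1,/<foo / range and are printed unchanged.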

ghostdog74 10-24-2008 10:32 PM

Code:

awk '/<\/foo>/{
    start = match($0,"</foo>")
    print substr($0,RSTART)
    f=1
    next
}f' file


jan61 10-29-2008 05:22 PM

Hi,

Quote:

Originally Posted by ghostdog74 (Post 3321360)
Code:

awk '/<\/foo>/{
    start = match($0,"</foo>")
    print substr($0,RSTART)
    f=1
    next
}f' file


this solution doesn't work if the first real line after the garbage at the head of the file does not contain a "</foo>": everything before the first closing tag is dropped, and every later line containing "</foo>" is cut again at that tag. lodi wanted to clean up an XML file containing garbage at the beginning, up to the first opening "<foo " tag. You must change the script so that it cuts only once, at the first opening tag (not at "</foo>"), and prints all following lines unchanged (untested):
Code:

awk ' BEGIN { found = 0; }
  found == 0 && match($0, /<foo[ >]/) {
    # first opening tag: drop everything before it on this line
    print substr($0, RSTART);
    found = 1;
    next;
  }
  found == 1 { print $0; } ' file

Jan
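For the multi-gigabyte, mostly newline-free input described in the first post, a line-based script has to hold a whole "line" in memory. A possible alternative, sketched here under assumptions, is to let awk split the input on the tag itself. This needs an awk that accepts a multi-character RS (gawk and mawk do; POSIX does not guarantee it), and the function name is made up.

```shell
# Hypothetical helper: stream FILE from the first "<foo " onward.
# Records are the chunks between successive "<foo " strings, so awk only
# holds one chunk at a time instead of one (possibly huge) line.
stream_from_first_foo() {
  awk 'BEGIN { RS = "<foo "; ORS = "" }
       # Record 1 is the garbage before the first tag: skip it.
       # Every later record lost its leading "<foo " to RS: put it back.
       NR > 1 { print RS $0 }' "$1"
}
```

If the file contains no "<foo " at all, this prints nothing, because the whole file is then record 1.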


All times are GMT -5.