Problem with RegEx using sed

citygrid · 03-27-2010, 06:13 AM

I'm trying to isolate a number from a text file using sed. The text file looks like this:

-GARBAGE-GARBAGE-GARBAGE- Number of frames: 183933 frames Codec -GARBAGE-GARBAGE-GARBAGE-

I tried the following:

Code:

sed "s/^.*Number of frames: //g; s/ frames Codec.*$//g" "info.txt" > "frames.txt"

Strangely, it only seems to be stripping off the end, but not the beginning, like so:
-GARBAGE-GARBAGE-GARBAGE- Number of frames: 183933

I'm obviously not using the command correctly, so what am I doing wrong?

If anyone has alternatives using awk or grep, I'd be open to those as well, but for future reference I'm curious to know why my argument above is not working the way I expect it to.

Thanks in advance!

syg00 · 03-27-2010, 06:31 AM

Almost always a bad specification - maybe there are two spaces somewhere, maybe a <tab>, ...
Use as little specific data as possible. If you just want the number, just specify numbers - something like (untested)

Code:

sed -r 's/[^[:digit:]]*([[:digit:]]+).*/\1/' "info.txt" > "frames.txt"
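
A note on digit-only patterns in sed, sketched on a made-up sample line rather than the real info.txt: a leading `.*` is greedy, and a digit group written with `*` is allowed to match nothing, so the prefix can swallow the whole line and leave the group empty. Restricting the prefix to non-digits and requiring at least one digit avoids that. `-E` is used here as the portable spelling of GNU sed's `-r`.

Code:

```shell
# Made-up sample line, for illustration only.
line='abc 123 def 456'

# Greedy prefix: .* consumes the whole line, so the group captures nothing.
greedy=$(printf '%s\n' "$line" | sed -E 's/.*([[:digit:]]*).*/\1/')

# Prefix limited to non-digits; the group requires at least one digit.
first=$(printf '%s\n' "$line" | sed -E 's/[^[:digit:]]*([[:digit:]]+).*/\1/')

printf 'greedy=[%s] first=[%s]\n' "$greedy" "$first"
```

The second form returns only the first number on each line, so on a file with several numeric fields it still needs a label to key on.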

citygrid · 03-27-2010, 06:56 AM

Quote:

Originally Posted by syg00

Almost always a bad specification - maybe there are two spaces somewhere, maybe a <tab>, ...

Thanks for replying.

I thought of that, of course, and went out in search of \t and double spaces, but there aren't any. The real mystery is that it's finding the expression. If I put in:

Code:

sed "s/Number of frames: /SOMEWORD/g; s/ frames Codec.*$//g" "info.txt" > "frames.txt"

it returns:
-GARBAGE-GARBAGE-GARBAGE- SOMEWORD183933

So it would seem that it has something to do with finding the beginning of the file (which is one line).

Any other ideas?

Quote:

Use as little specific data as possible. If you just want the number, just specify numbers - something like (untested)

Code:

sed -r 's/[^[:digit:]]*([[:digit:]]+).*/\1/' "info.txt" > "frames.txt"

This doesn't work, unfortunately, because "info.txt" contains a lot of numeric information about a video, such as number of frames, resolution, duration, audio bitrate, etc., so I wouldn't only be getting what I needed (which is the total number of frames).

syg00 · 03-27-2010, 07:05 AM

What happens if you try your original attempt without the anchor? And are you using GNU sed?

citygrid · 03-27-2010, 07:40 AM

Quote:

Originally Posted by syg00

What happens if you try your original attempt without the anchor? And are you using GNU sed?

It returns
-GARBAGE-GARBAGE-GARBAGE-......€183933

I'm not very experienced with Linux, but I imagine I'm using GNU sed. I'm on Ubuntu and am typing the command into the terminal.

citygrid · 03-27-2010, 08:31 AM

It's bizarre. It has something to do with the anchor not finding the beginning of the line/file, and I can't figure it out. Even putting this in directly:

Code:

sed "s/^.*183933//g; s/ frames Codec.*$//g" "info.txt" > "frames.txt"

didn't return an empty file, as I would expect, but still gave me the whole file up to and including the number.

Anyway, knowing that I could at least replace the expression "Number of frames: " let me do this:

Code:

sed "s/Number of frames: /\n/g; s/ frames Codec.*$//g" "info.txt" | head -2 | tail -1 > "frames.txt"

so I've solved my problem, albeit in a convoluted manner, but it still doesn't give me any insight into why the first expression doesn't work. If anyone can explain this, please let me know.

In any case, syg00, thank you for taking the time to help me out!

syg00 · 03-27-2010, 05:39 PM

That implies your leading search text occurs more than once (per line) in the data - try something like

Code:

sed -r 's/.*Number of frames: ([[:digit:]]+).*/\1/' "info.txt" > "frames.txt"
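
To see how keying on the label behaves, here is a minimal sketch run against a made-up two-line sample (the second line and its resolution field are hypothetical). The `-n` flag with the `p` suffix prints only lines where the substitution happened, and `-E` is the portable spelling of `-r`. The awk variant is one possible answer to the earlier request for alternatives: it splits the line on the label and reads the first word that follows.

Code:

```shell
# Made-up sample; only the first line mirrors the format from the question.
sample=$(mktemp)
printf '%s\n' \
  '-GARBAGE-GARBAGE- Number of frames: 183933 frames Codec -GARBAGE-GARBAGE-' \
  'Resolution: 720x480' > "$sample"

# sed: capture the digits that follow the label, print only matching lines.
frames_sed=$(sed -n -E 's/.*Number of frames: ([[:digit:]]+).*/\1/p' "$sample")

# awk: split on the label, then take the first space-separated word after it.
frames_awk=$(awk -F'Number of frames: ' 'NF > 1 { split($2, w, " "); print w[1] }' "$sample")

printf 'sed=%s awk=%s\n' "$frames_sed" "$frames_awk"
rm -f "$sample"
```

Both variants ignore the resolution line because it does not contain the label.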

crts · 03-27-2010, 08:43 PM

Quote:

Originally Posted by citygrid

I'm trying to isolate a number from a text file using sed. The text file looks like this:

-GARBAGE-GARBAGE-GARBAGE- Number of frames: 183933 frames Codec -GARBAGE-GARBAGE-GARBAGE-

I tried the following:

Code:

sed "s/^.*Number of frames: //g; s/ frames Codec.*$//g" "info.txt" > "frames.txt"

Strangely, it only seems to be stripping off the end, but not the beginning, like so:
-GARBAGE-GARBAGE-GARBAGE- Number of frames: 183933

I'm obviously not using the command correctly, so what am I doing wrong?

If anyone has alternatives using awk or grep, I'd be open to those as well, but for future reference I'm curious to know why my argument above is not working the way I expect it to.

Thanks in advance!

Hi,

I copied and pasted your data into a file and executed your command. It worked fine, i.e. I got 183933 as output. I am using sed version 4.1.5; my bash version is 3.2.39.
I noticed that you are using "double-quotes" instead of 'single-quotes' so maybe your sed instruction just fell victim to some expansion issues?
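
One way to check the quoting theory is to print the sed script exactly as the shell hands it over. In this sketch the two spellings come out identical, because a `$` followed by `/` is not a parameter expansion inside double quotes; a script containing something like `$1` or a backtick would differ. The variable names are only for illustration.

Code:

```shell
# The script from the question, once in double quotes and once in single quotes.
dq="s/^.*Number of frames: //g; s/ frames Codec.*$//g"
sq='s/^.*Number of frames: //g; s/ frames Codec.*$//g'

printf 'double: %s\nsingle: %s\n' "$dq" "$sq"

# Compare what the shell actually produced for each spelling.
if [ "$dq" = "$sq" ]; then
  echo "the shell passes the same script either way"
fi
```

Single quotes remain the safer habit for sed scripts, since they rule out expansion altogether.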

citygrid · 03-27-2010, 09:17 PM

Quote:

Originally Posted by crts

Hi,
I copied and pasted your data into a file and executed your command. It worked fine, i.e. I got 183933 as output. I am using sed version 4.1.5; my bash version is 3.2.39.
I noticed that you are using "double-quotes" instead of 'single-quotes' so maybe your sed instruction just fell victim to some expansion issues?

Thanks for replying.

I did indeed try the single quote option, but it gave me the same output.

The funny thing is that when I pasted exactly what I wrote above, the regex worked for me, too. This led me to believe that there was something funky going on with the original output file from the video encoding program rather than the regular expression itself.

Anyway, I played around with it a bit, and found that if I resaved the text file as UTF-8 in gedit, then the original argument that I posted worked.

So in the end, it's simply a question of character encoding, it seems.

Unfortunately, I don't know enough about the subject to understand why it monkeyed up the regex or how to fix the problem in the future, so if someone could enlighten me, I'd be much obliged.

Thanks again for your responses!
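
A likely explanation, though the thread never confirms it: the encoder wrote bytes that are not valid UTF-8, and in a UTF-8 locale GNU sed's `.` does not match such bytes, so `^.*` cannot reach from the start of the line across them to the label. Running sed in the C locale makes it work byte by byte. The sketch below assumes a single stray byte (octal 200) standing in for whatever the real file contained.

Code:

```shell
# Hypothetical sample: one line containing a stray non-UTF-8 byte (octal 200).
tmp=$(mktemp)
printf 'GARBAGE\200GARBAGE- Number of frames: 183933 frames Codec -GARBAGE\200-\n' > "$tmp"

# LC_ALL=C makes sed treat every byte as one character, so .* can cross the stray byte.
frames=$(LC_ALL=C sed 's/^.*Number of frames: //; s/ frames Codec.*$//' "$tmp")

echo "$frames"
rm -f "$tmp"
```

If the file's real encoding is known, converting it first is an alternative, for example `iconv -f WINDOWS-1252 -t UTF-8 info.txt` (the source encoding here is a guess; `file -i info.txt` may help identify it on Linux). Resaving as UTF-8 in gedit did the same job by hand.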