LinuxQuestions.org
Old 12-26-2006, 12:48 PM   #1
EneWolverine
LQ Newbie
 
Registered: Dec 2006
Posts: 4

Rep: Reputation: 0
Extracting data from file using sed


Hi all,

I've just started to work with shell scripting and sed today. I'm trying to figure out how I can extract data from a file. The file I want to extract data from is a streaminfo file from my satellite receiver box (a Dreambox), and its structure looks like this:

[html tags]
[html tags]VPID:[html tags]123h[html tags]APID[html tags]:[html tags]678b
[html tags]

That's the gist of it. I've found out how to strip the html tags from the file using sed but actually I don't really even need to do that.

What I want to do is for instance search for the string "VPID" in the file above and then extract the sequence of numbers/letters that follows that. I need to extract it in the form of a variable as I'm running this from a shell script and I want to use that extracted data later on in the shell script.

If I strip the html tags then the file looks like:

VPID: 123h APID: 678b

I've not quite got the hang of regex yet, and I was wondering if there's a simple way to get this data out of the file.

Any help would be greatly appreciated.

Thanks in advance,

The Wolverine
 
Old 12-26-2006, 09:30 PM   #2
jlinkels
Senior Member
 
Registered: Oct 2003
Location: Bonaire
Distribution: Debian Lenny/Squeeze/Wheezy/Sid
Posts: 4,053

Rep: Reputation: 484
Something like this:

line=$(grep VPID yourfile | sed 's/html_tag//g')
vpid=$(echo "$line" | awk '{print $2}')
apid=$(echo "$line" | awk '{print $4}')

You don't have to eliminate the html tags if they are constant; you can change the $2 or $4 in the awk statement instead. $n means the nth whitespace-separated field in a line.
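To make the field numbering concrete, here is a minimal sketch that runs awk on the stripped sample line from the first post. The variable names are only for illustration.

```shell
#!/bin/sh
# Sample line as it looks once the html tags are stripped (from the first post).
line='VPID: 123h APID: 678b'

# Show how awk numbers the whitespace-separated fields.
echo "$line" | awk '{ for (i = 1; i <= NF; i++) print i ": " $i }'

# Field 2 follows "VPID:" and field 4 follows "APID:".
vpid=$(echo "$line" | awk '{print $2}')
apid=$(echo "$line" | awk '{print $4}')
echo "vpid=$vpid apid=$apid"
```

If other words come before VPID on the line, the field numbers shift accordingly.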

jlinkels
 
Old 12-27-2006, 05:15 AM   #3
EneWolverine
LQ Newbie
 
Registered: Dec 2006
Posts: 4

Original Poster
Rep: Reputation: 0
Hi,

Yeah, the only problem is that the field position of the data in the line is not constant. This is because earlier in the line there's a field called "Service:", followed by the name of the channel, which can be one, two, or sometimes even three words. So the field position of the data after VPID, for instance, varies, and I don't know how to locate the data.

Is there no way of finding and locating the string "VPID:" for instance and then counting x number of fields after this instead of from the beginning of the line? Perhaps if I add a newline command before "VPID:" it might work as then on the next line the field VPID is the first field. Is there a way of adding a line break?

The Wolverine
 
Old 12-27-2006, 06:53 AM   #4
jlinkels
Senior Member
 
Registered: Oct 2003
Location: Bonaire
Distribution: Debian Lenny/Squeeze/Wheezy/Sid
Posts: 4,053

Rep: Reputation: 484
If you want to have a line which starts with VPID then do this:

grep VPID yourfile | sed 's/html_tag//g' | awk '{st = match($0, "VPID"); print substr($0, st)}'

What awk does here is find VPID in your string and print the substring starting at the position where VPID starts.

This does not solve the problem if there are more and different html tags between VPID and the value. I think you have to substitute (sed) out all possible html tags to get a known position for the value, counting from VPID.
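Another option, once the tags are gone, is to let awk look for the label itself and print the field that follows it, so the position within the line no longer matters. This is a sketch only: the sample line and the helper name field_after are made up for illustration, and it assumes the label and its value end up as separate whitespace-delimited fields.

```shell
#!/bin/sh
# Made-up sample: a channel name of variable length comes before VPID.
line='Service: Sun TV VPID: 123h APID: 678b'

# field_after LABEL: scan the fields and print the one right after LABEL.
field_after() {
    echo "$line" | awk -v label="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == label) { print $(i + 1); exit }
    }'
}

vpid=$(field_after "VPID:")
apid=$(field_after "APID:")
echo "vpid=$vpid apid=$apid"
```

The same call works whether the channel name is one, two, or three words long.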

jlinkels
 
Old 12-27-2006, 09:13 AM   #5
EneWolverine
LQ Newbie
 
Registered: Dec 2006
Posts: 4

Original Poster
Rep: Reputation: 0
Yep, I've managed to strip out all the html tags now and locate the string I need after VPID. The only problem now is that I want to put this in an expect script. Unfortunately the expect script doesn't like the single quotes in the send command for some reason. Do you know how I can escape these quote marks, or if there is another way to send sed/awk commands in an expect script?

Thanks,

The Wolverine
 
Old 12-27-2006, 05:29 PM   #6
jlinkels
Senior Member
 
Registered: Oct 2003
Location: Bonaire
Distribution: Debian Lenny/Squeeze/Wheezy/Sid
Posts: 4,053

Rep: Reputation: 484
I have never used expect; all I know about it is that I have heard of it. I also have no idea what string you are sending to what, so I am afraid I am lost here, sorry.

jlinkels
 
Old 12-27-2006, 06:15 PM   #7
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
Quote:
[html tags]VPID:[html tags]123h[html tags]
It might help if you posted some actual lines, plus what you want for the output.
You can do away with the grep part and let sed do the selecting:
Code:
sed -n '/VPID/s/.*VPID:<html_tag>\(...h\).*/VPID: \1/p' yourfile >input_file_for_expect

The \( \) pair saves what is in between, and you can reuse it in the replacement part as \1. Here I assumed that there is only one VPID number per line. Generally, what you do is identify a string or regular expression that acts as an anchor around the part you want to save. The parts of the regex outside the \( ... \) are the anchors.
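As a small illustration of the anchor-and-capture idea, the sketch below pulls both values into shell variables. It assumes the tags have already been stripped and that each value is the run of non-space characters after its label. The sample line is made up.

```shell
#!/bin/sh
# Made-up sample line with the html tags already stripped.
line='Service: Sun TV VPID: 123h APID: 678b'

# "VPID: " is the anchor; \([^ ]*\) captures the non-space run after it.
# -n plus the p flag prints only lines where the substitution succeeded.
vpid=$(printf '%s\n' "$line" | sed -n 's/.*VPID: *\([^ ]*\).*/\1/p')
apid=$(printf '%s\n' "$line" | sed -n 's/.*APID: *\([^ ]*\).*/\1/p')

echo "vpid=$vpid apid=$apid"
```

If a line has no VPID at all, sed prints nothing and the variable is simply empty, which is easy to test for in the calling script.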

This is similar to how I filter the names of the backed-up files out of a saved .K3B file and pipe them to xargs to remove them.
Code:
sed -e '/^<url>/!d' -e 's/<url>\(.*\)<\/url>/\1/' maindata.xml | tr '\n' '\000' | xargs -0 rm
Remember to put a backslash before a forward slash if you match '/' in the pattern, as in a closing tag. I left out the sed commands which handle '&amp;' -> '&', '&gt;' -> '>' and '&lt;' -> '<' so the line I posted wouldn't get too long. The '-e' option precedes each sed command, allowing you to run more than one command on each line in a single sed invocation.

The first sed command, -e '/^<url>/!d', deletes the lines that don't start with "<url>", i.e. the lines that don't contain filenames. The "<url>" here is literally what is in the xml file. The "<url>" and "<\/url>" parts in the second command are the anchors. In between is the name of the file that was backed up.

Similar to your expect issue, a filename may contain whitespace. So I pipe the output through the "tr" command to replace newlines with NUL characters. The output then is just as if it came from a find command using the "-print0" argument, so I can use "xargs -0".
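To see why the NUL separator matters, here is a self-contained sketch that runs the same pipeline in a throwaway directory. The file names and the cut-down maindata.xml are invented for the demonstration, and it assumes an xargs that supports -0 (GNU and BSD versions do).

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real is deleted.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Sample files; two of them have a space in the name.
touch "plain.txt" "two words.txt" "keep me.txt"

# Invented stand-in for the project file: one <url> line per backed-up file.
cat > maindata.xml <<'EOF'
<files>
<url>plain.txt</url>
<url>two words.txt</url>
</files>
EOF

# Keep only the <url> lines, strip the tags, and hand NUL-separated names to rm.
sed -e '/^<url>/!d' -e 's/<url>\(.*\)<\/url>/\1/' maindata.xml |
    tr '\n' '\000' | xargs -0 rm

# Record what is left, then clean up the throwaway directory.
remaining=$(ls)
echo "$remaining"
cd / && rm -rf "$workdir"
```

Only "keep me.txt" and maindata.xml are left at the end. With plain xargs (no -0), "two words.txt" would be split into two arguments and rm would look for files named "two" and "words.txt".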

Last edited by jschiwal; 12-27-2006 at 06:23 PM.
 
Old 12-29-2006, 09:23 AM   #8
EneWolverine
LQ Newbie
 
Registered: Dec 2006
Posts: 4

Original Poster
Rep: Reputation: 0
Hi,

thanks for the input. I've managed to figure out a set of commands to extract the data that I want. Unfortunately I still can't get this to work in my expect script. Take for instance the sed command; the expect script doesn't seem to like the single quotes. Also the script doesn't seem to like the way I'm stripping the html out of the document.

Here is the text document that I want to process:

Code:
<html><META http-equiv=Content-Type content="text/html; charset=UTF-8">
<head><title>Stream Info</title><link rel="stylesheet" type="text/css" href="/webif.css"></head><body bgcolor=#ffffff><!-- 1:0:1:1d1f:2fa8:13e:820000:0:0:0:-->
<table cellspacing=5 cellpadding=0 border=0><tr><td>Name:</td><td>Sun TV</td></tr><tr><td>Provider:</td><td>Globecast NE</td></tr><tr><td>Service reference:</td><td>1:0:1:1d1f:2fa8:13e:820000:0:0:0:</td></tr><tr><td>VPID:</td><td>1901h (6401d)</td></tr><tr><td>APID:</td><td>190bh (6411d)</td></tr><tr><td>PCRPID:</td><td>1901h (6401d)</td></tr><tr><td>TPID:</td><td>ffffffffh (-1d)</td></tr><tr><td>TSID:</td><td>2fa8h</td></tr><tr><td>ONID:</td><td>013eh</td></tr><tr><td>SID:</td><td>1d1fh</td></tr><tr><td>PMT:</td><td>0106h</td></tr><tr><td>Video Format:<td>720x576 (4:3)</td></tr></table></body></html>
When I next log in to my server I'll post the bash shell script and expect script that I've written.

Any ideas would be appreciated.

Thanks,

The Wolverine
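
For the stream info page posted above, the table cells themselves can serve as anchors, so the html does not need to be stripped first. The following is a minimal sketch, not a tested Dreambox solution: it assumes every value sits in the cell directly after its label and ends in "h", and the helper name get_pid is made up. The sample is an excerpt of the posted page.

```shell
#!/bin/sh
# Excerpt of the stream info page posted in this thread.
html='<table cellspacing=5 cellpadding=0 border=0><tr><td>Name:</td><td>Sun TV</td></tr><tr><td>VPID:</td><td>1901h (6401d)</td></tr><tr><td>APID:</td><td>190bh (6411d)</td></tr><tr><td>PCRPID:</td><td>1901h (6401d)</td></tr></table>'

# get_pid LABEL: print the hex value in the cell that follows "<td>LABEL:</td>".
get_pid() {
    printf '%s\n' "$html" |
        sed -n "s/.*<td>$1:<\/td><td>\([0-9a-f]*h\).*/\1/p"
}

vpid=$(get_pid VPID)
apid=$(get_pid APID)
echo "VPID=$vpid APID=$apid"
```

This prints VPID=1901h APID=190bh. The values could then be handed to an expect script as ordinary arguments, for example expect set_pids.exp "$vpid" "$apid" (set_pids.exp is a hypothetical name), where Tcl reads them with [lindex $argv 0] and [lindex $argv 1]. That keeps sed and awk out of the expect script entirely. This may help because inside a Tcl double-quoted string, $ and [ ] trigger substitution, which could be what trips up a command such as awk '{print $2}', rather than the single quotes themselves.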
 
  


All times are GMT -5. The time now is 12:26 PM.
