LinuxQuestions.org
Old 01-31-2012, 11:13 PM   #1
sirdeneb
LQ Newbie
 
Registered: Jan 2012
Posts: 6

Rep: Reputation: Disabled
Get a dynamic string from a file


Hi,

I would like to read a string from a file that comes right after (or a few characters after) another particular string. That particular string is static but can be anywhere in the document.
I've tried the grep command, but it only returns the whole line, even with the -u argument (most probably I'm too much of a newbie to use it correctly).
I've seen that awk should be good for this, but I'm not sure I understand how.
Should I read the file character by character and then return the dynamic string I want?

Thanks
 
Old 01-31-2012, 11:17 PM   #2
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 361
You will need to give an example of the input you have and the output you want.

Likely a tool such as sed or awk will do what you describe... depending on the complexity of the input. There's no way to know with a description that boils down to "I want to find string #1 that comes after string #2, but string #2 can appear anywhere."

Not enough detail.
 
Old 01-31-2012, 11:34 PM   #3
sirdeneb
LQ Newbie
 
Registered: Jan 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
Thank you for your answer.
And, alright, I just didn't want to be too confusing...

Here is my precise situation.
I would like to get the time of the sunset and the sunrise from this page for the current day: http://www.timeanddate.com/worldcloc...my.html?n=2416
So, as a first step I download the page thanks to the wget command. Here is a part of the file:
Quote:
...h class="sep" rowspan=2>Sunrise</th><th class="sep" rowspan=2>Sunset</th><th class="sep" rowspan=2>This day</th><th class="sep" rowspan=2>Difference</th><th class="sep" rowspan=2>Time</th><th class="sep" rowspan=2>Altitude</th><th class="sep NULL">Distance</th></tr><tr class="head"><th class="sep smaller">(10<sup>6</sup> km)</th></tr></thead><tbody><tr class=c0><td>Feb 1, 2012</td><td>7:07 AM</td><td>5:12 PM</td><td>10h 04m 40s</td><td>+ 2m 12s</td><td>12:09 PM</td><td>31.8&deg; </td><td>147.400</td></tr><tr class=c1><td>Feb 2, 2012</td><td>7:06 AM</td><td>5:13 PM</td><td>10h 06m 54s</td><td>+ 2m 14s</td><td>12:10 PM</td><td>32.0&deg; </td><td>147.421</td></tr><tr class=c0><td>Feb 3, 2012</td><td>7:05 AM</td><td>5:15 PM</td><td>10h 09m 10s</td><td>+ 2m 16s</td><td>12:10 PM</td><td>32.3&deg; </td><td>147.442</td></tr><tr class=c1><td>Feb 4, 2012</td...
Then, I would like to read the sunrise time in the downloaded file. If you look at the HTML code, it's located right after the string "Feb 1, 2012</td><td>" (</td><td> marking the end of one cell and the start of the next, I guess), and it's 7:07 AM here. I just need the hours and the minutes in order to save them in two different variables.

Hope that makes sense now

Thanks again
 
Old 01-31-2012, 11:51 PM   #4
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 361
Yes, sed can handle this, though, if you're not familiar with regular expressions, the command may look like gobbledygook.

For this example, I assume your web page is saved as "webpage.html" -- substitute as appropriate in the command
Code:
sed -n "s@.*Feb 1, 2012</td><td>\([0-9:]*\).*@\1@p" webpage.html
On my machine, given the sample input you provided, that command returns "7:07". It can be modified to return the information in a different way depending on your need.

Also, I'm going to continue experimenting with the command a bit so that it will automatically pull the right time based on the current date.

EDIT:
And here is the command to find tomorrow's sunrise--"tomorrow" as in the upcoming day based on the date that you run the command:
Code:
sed -n "s@.*$(date -d 'tomorrow' '+%b %-e, %Y')</td><td>\([0-9:]*\).*@\1@p" webpage.html
Again, it will return something like 7:07. If you need something different to help you split the result into your two variables, tell me what output format would be the most useful for you.
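
For comparison, the same "grab what follows a fixed marker" idea can be sketched in Python. This is only an illustration, not part of the sed answer: the SAMPLE string is a shortened copy of the row quoted earlier in the thread, and time_after is a made-up helper name.

```python
import re

# Shortened copy of the sample rows from the downloaded page.
SAMPLE = (
    "<tr class=c0><td>Feb 1, 2012</td><td>7:07 AM</td><td>5:12 PM</td></tr>"
    "<tr class=c1><td>Feb 2, 2012</td><td>7:06 AM</td><td>5:13 PM</td></tr>"
)

def time_after(marker, text):
    """Return the run of digits and colons that directly follows marker, or None."""
    # re.escape makes sure the marker is matched literally, not as a pattern.
    match = re.search(re.escape(marker) + r"([0-9:]+)", text)
    return match.group(1) if match else None

print(time_after("Feb 1, 2012</td><td>", SAMPLE))
```

Like the sed command, this depends on the exact text of the marker, so it breaks just as easily if the page layout changes.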

Last edited by Dark_Helmet; 01-31-2012 at 11:58 PM.
 
Old 01-31-2012, 11:57 PM   #5
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,554
Blog Entries: 3

Rep: Reputation: 816
You'll get better results if you use a scripting language like Python, PHP, Perl, Ruby -- anything, really, that can both read the page directly over the network, parse it for you, and split it into a document object tree.

I've had good results with PHP and Tidy. You only need to find the correct table (you can search based on cell contents!), then the correct row and cell, to extract the data. You don't need to have Apache installed to use PHP, just use the command-line version of PHP, php5-cli (or similar).
 
Old 02-01-2012, 12:05 AM   #6
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 361
I agree with Nominal. Ideally, something to parse the html would be better.

If the sed command suits your needs, then you can use it, as long as you understand that it's "picky." Slight changes in the format of the html data may cause the command to fail (i.e. not find a match or return the wrong data). An html parser is more flexible in that regard, though it would require a little more effort to set up.
 
Old 02-01-2012, 12:24 AM   #7
sirdeneb
LQ Newbie
 
Registered: Jan 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
Quote:
You'll get better results if you use a scripting language like Python, PHP, Perl, Ruby -- anything, really, that can both read the page directly over the network, parse it for you, and split it into a document object tree.

I've had good results with PHP and Tidy. You only need to find the correct table (you can search based on cell contents!), then the correct row and cell, to extract the data. You don't need to have Apache installed to use PHP, just use the command-line version of PHP, php5-cli (or similar).
Probably, but I don't know how to use any of those languages... I can barely write something in shell...

Quote:
Yes, sed can handle this, though, if you're not familiar with regular expressions, the command may look like gobbledygook.

For this example, I assume your web page is saved as "webpage.html" -- substitute as appropriate in the command
Code:
sed -n "s@.*Feb 1, 2012</td><td>\([0-9:]*\).*@\1@p" webpage.html
On my machine, given the sample input you provided, that command returns "7:07". It can be modified to return the information in a different way depending on your need.

Also, I'm going to continue experimenting with the command a bit so that it will automatically pull the right time based on the current date.

EDIT:
And here is the command to find tomorrow's sunrise--"tomorrow" as in the upcoming day based on the date that you run the command:
Code:
sed -n "s@.*$(date -d 'tomorrow' '+%b %-e, %Y')</td><td>\([0-9:]*\).*@\1@p" webpage.html
Again, it will return something like 7:07. If you need something different to help you split the result into your two variables, tell me what output format would be the most useful for you.
That's awesome!

Now, I'm trying to understand how it works.
s@.* means whatever before the expression and remove it
.*@\1@p means whatever after the expression and remove it
\([0-9:]*\) means to select only characters which are numbers and :

Is that correct?
Actually what I need is the hour of the sunrise and then the minutes of the sunrise. And after that the hour of the sunset and then the minutes of the sunset. The sunset is located in the next cell.
Quote:
Feb 1, 2012</td><td>7:07 AM</td><td>5:12 PM
So for today:
Quote:
H=sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>\([0-9]\).*@\1@p" webpage.html # hour of the sunrise
M=sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>\(:[0-9]\).*@\1@p" webpage.html # minute of the sunrise
sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>$H:$M AM</td><td>\([0-9]\).*@\1@p" webpage.html # hour of the sunset
sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>$H:$M AM</td><td>\(:[0-9]\).*@\1@p" webpage.html # hour of the sunset
I feel that's not correct...

Last edited by sirdeneb; 02-01-2012 at 12:25 AM.
 
Old 02-01-2012, 01:00 AM   #8
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 361
Quote:
Originally Posted by sirdeneb
Now, I'm trying to understand how it works.
s@.* means whatever before the expression and remove it
.*@\1@p means whatever after the expression and remove it
\([0-9:]*\) means to select only characters which are numbers and :
Bits and pieces of what you describe are right. Let me go through it and explain from the top.

Ok, so the command (minus the input file):
Code:
sed -n "s@.*Feb 1, 2012</td><td>\([0-9:]*\).*@\1@p"
So, we're invoking sed. The '-n' option tells sed not to print anything by default. That is, we must explicitly tell sed when to print something. The 'p' at the end tells sed to print. In this case, whenever the rest of the command matches, print the result of the manipulation.

Easy part done. The command we give sed is for substitution--the 's@ ... @ ... @' format. In a nutshell, it means "look for a text pattern that matches what is between the first and second '@' symbols and replace what matched with the text between the second and third '@' symbols."

Let me skip ahead and describe the pattern between the second and third '@' symbols first (because it's simple). The '\1' is a backreference to the text matched between the first and second '@' symbols. Specifically, it stands for the text matched by the first set of parentheses. In this case: '\([0-9:]*\)' The parentheses here are "escaped" with a backslash to tell sed that they are not meant as literal parentheses to match--that they should instead be interpreted as grouping matched text for a later backreference. So, just keep in mind that '\1' means whatever matches that specific portion of the text.

For the text between the first and second '@' symbols, a '.' is treated as a wildcard and matches any single character. The asterisk '*' is a modifier that means to match zero or more of the previous pattern. So, '.*' will match as much as possible from the beginning of the line.

The matching is forced to stop when the pattern includes the literal text for the date and subsequent tags: 'Feb 1, 2012</td><td>'. So the first '.*' matches everything up to the literal text. The literal text matches itself, and that brings us back to the parenthetical pattern.

The escaped parentheses were explained earlier, and they have no impact on the text to match. So, ignore them from that perspective. That leaves: '[0-9:]*'

That pattern means to match any numeric digit from 0 to 9 and a colon as a single character. And again, the star modifies that to mean "zero or more of the previous expression". So, collectively, it will match any combination of digits and colons.

Lastly, the trailing '.*' does the same thing that the first one did, and will match everything else remaining on the line.

So, the entire line will be replaced by the sequence of digits and/or colons matched in the parenthetical.

If any of that is not clear, let me know and I'll try to explain it a little better.
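
The walk-through above can also be tried out in Python, whose re.sub behaves much like sed's substitution. This is a sketch for experimenting, not the command from this post. Note one difference: Python's regex syntax uses plain parentheses for grouping, where sed's basic syntax needs the backslashes. The line variable is a shortened copy of the sample row.

```python
import re

line = "<tr class=c0><td>Feb 1, 2012</td><td>7:07 AM</td><td>5:12 PM</td></tr>"

# Same structure as the sed pattern: leading .*, literal text, one group, trailing .*
pattern = r".*Feb 1, 2012</td><td>([0-9:]*).*"

# The pattern matches the whole line, so the whole line is replaced by group 1.
result = re.sub(pattern, r"\1", line, count=1)
print(result)
```

A line that does not contain the date is returned unchanged by re.sub, whereas sed with '-n' and 'p' prints nothing for it.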

The subsequent command I gave uses shell substitution to handle the date. Specifically, the bash shell replaces "$(date -d 'tomorrow' '+%b %-e, %Y')" with the output of the corresponding date command. That is to say, sed never sees the "$( ... )" but only sees the output of the command--which should match the style of date shown in the html file.
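
The date text the page uses ("Feb 1, 2012", with no zero padding on the day) can be produced outside the shell as well. Here is a small Python sketch; page_date is a hypothetical helper name, and it assumes the default C locale so that the month abbreviation comes out in English.

```python
from datetime import date, timedelta

def page_date(day):
    """Format a date the way the page writes it, e.g. 'Feb 1, 2012'."""
    # %b is the abbreviated month name; day.day is a plain integer, so there
    # is no padding to strip (the '-' in '%-e' is a GNU extension of date).
    return f"{day:%b} {day.day}, {day.year}"

today = date.today()
tomorrow = today + timedelta(days=1)
print(page_date(today), "|", page_date(tomorrow))
```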

Quote:
So for today:
Code:
H=sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>\([0-9]\).*@\1@p" webpage.html # hour of the sunrise
M=sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>\(:[0-9]\).*@\1@p" webpage.html # minute of the sunrise
sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>$H:$M AM</td><td>\([0-9]\).*@\1@p" webpage.html # hour of the sunset
sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>$H:$M AM</td><td>\(:[0-9]\).*@\1@p" webpage.html # hour of the sunset
Note: the '-d' should be removed from the date command. That option is used to specify a date string to override the default "now" that the command uses.

First command: yes, assuming that you live in a locale where sunrise will never occur after 9:59 AM (i.e. two digits for the hour). Note that storing the output in a variable requires command substitution, H=$( ... ), as in the next example.
Second command: close... Try:
Code:
M=$(sed -n "s@.*$(date '+%b %-e, %Y')</td><td>[0-9]*:\([0-9]*\).*@\1@p" webpage.html)
Third command: probably fine--same caveat as the first command. Though I would encourage you to enclose your variable references with curly braces. For instance:
Code:
sed -n "s@.*$(date '+%b %-e, %Y')</td><td>${H}:${M} AM</td><td>\([0-9]\).*@\1@p" webpage.html
Fourth command: same as with the second (and the curly braces):
Code:
sed -n "s@.*$(date '+%b %-e, %Y')</td><td>${H}:${M} AM</td><td>[0-9]*:\([0-9]*\).*@\1@p" webpage.html
I have not run those commands to verify, but I will do so in a moment.

EDIT:
I would be lax if I did not show you this, given what you're obviously trying to accomplish. Try running the following example script. It uses a modified sed command and some shell redirection "magic" to assign all four of your variables at one time:
Code:
#!/bin/bash

read -r sunriseHour sunriseMinute sunsetHour sunsetMinute < <( sed -n "s@.*$(date '+%b %-e, %Y')</td><td>\([0-9]*\):\([0-9]*\) AM</td><td>\([0-9]*\):\([0-9]*\).*@\1 \2 \3 \4@p" webpage.html )

echo "Sunrise hour: ${sunriseHour}"
echo "Sunrise minute: ${sunriseMinute}"
echo "Sunset hour: ${sunsetHour}"
echo "Sunset minute: ${sunsetMinute}"
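
The "four values from one pattern" idea in the script above can be sketched in Python too, with one capture group per value. This is an illustration only: sun_times is a made-up name, SAMPLE is a shortened copy of the sample row, and, like the sed pattern, it assumes the sunrise cell ends in " AM".

```python
import re

SAMPLE = "<td>Feb 1, 2012</td><td>7:07 AM</td><td>5:12 PM</td>"

def sun_times(day_text, html):
    """Return (sunrise_hour, sunrise_minute, sunset_hour, sunset_minute) or None."""
    pattern = (
        re.escape(day_text)
        + r"</td><td>([0-9]+):([0-9]+) AM</td><td>([0-9]+):([0-9]+)"
    )
    match = re.search(pattern, html)
    # groups() returns all four captures as strings, in order.
    return match.groups() if match else None

print(sun_times("Feb 1, 2012", SAMPLE))
```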
As a side note, since I'm learning python, I may post an html-parsing python script for you to use. It's more an exercise for me, but maybe you'll find it useful.

Last edited by Dark_Helmet; 02-01-2012 at 01:18 AM.
 
Old 02-01-2012, 01:55 AM   #9
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 361
Like I said, I'd post a python script...

This was written using a 2.6.6 interpreter. It will not work on a 3.2 interpreter.

I named mine suntimes.py. Anyway, you run it like so:
Code:
chmod u+x suntimes.py
./suntimes.py "Feb 2, 2012" webpage.html
You only need to execute the "chmod" once.

It will output:
Code:
7 06 5 13
And those are the sunrise hour, sunrise minute, sunset hour, and sunset minute for the date specified.

So, you could use it along with that shell redirection "magic" I mentioned in my previous reply's EDIT section.

The script:
Code:
#!/usr/bin/python

from HTMLParser import HTMLParser
import re
import sys

class myParser( HTMLParser ):
    def __init__( self ):
        self.reset()
        self.checkForDate = 0
        self.dateFound = 0
        self.sunriseFound = 0
        self.sunTimes = []

    def SetParseForDate( self, wantedDate ):
        self.targetDate = wantedDate

    def handle_data( self, data ):
        if( self.sunriseFound < 2 and data == "Sunrise" ):
            self.checkForDate = 1
        elif( self.sunriseFound < 2 and self.checkForDate == 1 ):
            if( data == self.targetDate ):
                self.dateFound = 1
                self.checkForDate = 0
        elif( self.sunriseFound < 2 and self.dateFound == 1 ):
            reMatch = re.search("([0-9]+):([0-9]{2})", data )
            if( reMatch != None ):
                self.sunTimes = self.sunTimes + [ reMatch.group(1), reMatch.group(2) ]
                self.sunriseFound = self.sunriseFound + 1
                if( self.sunriseFound == 2 ):
                    print ' '.join( self.sunTimes )


if( len( sys.argv ) != 3 ):
    print ( "This script requires a date and filename--in that order--to run" )
    sys.exit( 1 )

targetDate = sys.argv[1]
targetFile = sys.argv[2]

try:
    sunriseSunsetHtml = open( targetFile, "r" )
except:
    print ( "Unable to open {0} for reading".format( targetFile ) )
    sys.exit( 2 )

parser = myParser()
parser.SetParseForDate( targetDate )
for dataLine in sunriseSunsetHtml:
    parser.feed( dataLine )
In a nutshell, it scans the individual element data in the html. When it finds "Sunrise" (presumably as the header for the table) the script sets a flag and starts checking for a match on the date given. When the date is found, another flag is set, and the script starts looking for the next two pieces of data that match "[0-9]+:[0-9]{2}". Those two matching times are printed with a space separating each component.
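
Since the script above targets Python 2, here is a rough Python 3 sketch of the same flag-based approach. It loosely follows the logic described above but is not a line-for-line port: the class name SunTimesParser and the trimmed SAMPLE table are assumptions made for the example, and it parses a string rather than a file.

```python
from html.parser import HTMLParser
import re

class SunTimesParser(HTMLParser):
    """Collect the first two H:MM times that follow the wanted date cell."""

    def __init__(self, wanted_date):
        super().__init__()
        self.wanted_date = wanted_date
        self.seen_header = False   # set once the "Sunrise" header text is seen
        self.seen_date = False     # set once the wanted date cell is seen
        self.times = []            # flat list: [hour, minute, hour, minute]

    def handle_data(self, data):
        if len(self.times) >= 4:
            return                 # both times already collected
        if data == "Sunrise":
            self.seen_header = True
        elif self.seen_header and not self.seen_date:
            if data == self.wanted_date:
                self.seen_date = True
        elif self.seen_date:
            match = re.search(r"([0-9]+):([0-9]{2})", data)
            if match:
                self.times += [match.group(1), match.group(2)]

# Trimmed stand-in for the downloaded page.
SAMPLE = (
    "<table><thead><tr><th>Date</th><th>Sunrise</th><th>Sunset</th></tr></thead>"
    "<tbody><tr class=c0><td>Feb 1, 2012</td><td>7:07 AM</td><td>5:12 PM</td></tr>"
    "<tr class=c1><td>Feb 2, 2012</td><td>7:06 AM</td><td>5:13 PM</td></tr></tbody></table>"
)

parser = SunTimesParser("Feb 2, 2012")
parser.feed(SAMPLE)
parser.close()
print(" ".join(parser.times))
```

The main changes from Python 2 are the module name (html.parser instead of HTMLParser), print as a function, and calling the parent constructor with super().__init__().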

Last edited by Dark_Helmet; 02-01-2012 at 01:57 AM.
 
Old 02-01-2012, 06:59 PM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 1,421

Rep: Reputation: 360
Alternate solution using xmlstarlet:

Code:
#!/bin/bash
date='Feb 1, 2012'
page=astronomy.html

{ IFS=': ' read sunrise_hour sunrise_minute AM; IFS=': ' read sunset_hour sunset_minute PM; } < \
    <(xml fo --html "$page" | xml sel -T -t -m \
    "//th[. = 'Sunrise']/ancestor::table[1]//tr[td = '$date']/td[contains(., ':')][position() <= 2]" \
    -v . -n )

echo "sunrise = hour ${sunrise_hour} minute ${sunrise_minute}"
echo "sunset = hour ${sunset_hour} minute ${sunset_minute}"
 
Old 02-05-2012, 05:46 PM   #11
sirdeneb
LQ Newbie
 
Registered: Jan 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
Thank you so much Dark_Helmet for these explanations!
However, I must admit there is some stuff that goes straight over my head...
And I'm not even talking about the Python script...

I know using bash is not the best for my application and it's tricky but I'm using your one line command and that's freaking awesome!

Thanks a lot for your help
 
Old 02-05-2012, 07:20 PM   #12
sirdeneb
LQ Newbie
 
Registered: Jan 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
Quote:
First command: yes. Assuming that you live in a locale where sunrise will never occur after 9:59 AM (i.e. two digits for the hour)
Where I live, the sunrise never occurs after 10am, nevertheless the sunset occurs after 9pm during the summer...
But I'm not even sure I understand what you mean. The problem doesn't arise for the minutes, which obviously always have two digits... And the pattern is the same as for the hours.
 
Old 02-05-2012, 07:39 PM   #13
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 241
There are also web services more parsing friendly:
http://www.earthtools.org/webservices.htm
(see New York example)

Or use perl, with Astro::Sunrise module, and don't download anything
http://search.cpan.org/~rkhill/Astro....91/Sunrise.pm
 
Old 02-05-2012, 07:44 PM   #14
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 361
Quote:
... nevertheless the sunset occurs after 9pm during the summer...
And the pattern is the same as for the hours
Not quite. The pattern is different for the hours versus the minutes. If you look closely at the commands for the hours (commands #1 and #3 in the response you're referring to) and the commands for the minutes (commands #2 and #4) there is a small, but significant difference.

For this example, I'll only focus on commands #1 and #2...

From the revised commands provided in my response:
Code:
H=$(sed -n "s@.*$(date '+%b %-e, %Y')</td><td>\([0-9]\).*@\1@p" webpage.html) # hour of the sunrise
M=$(sed -n "s@.*$(date '+%b %-e, %Y')</td><td>[0-9]*:\([0-9]*\).*@\1@p" webpage.html) # minute of the sunrise
Notice that there is an asterisk ('*') inside the parentheses for the minutes command. Also notice there is no corresponding asterisk inside the parentheses for the hours command.

The asterisk means match zero or more of the previous pattern. In this case, the previous pattern is any digit (0 through 9). So, the minutes command would match zero digits, one digit, two consecutive digits, three consecutive digits, etc.

Without the asterisk, the pattern will only match one digit. Therefore, the hour command, because it does not use an asterisk, will only match one digit immediately after the '<td>' tag for the hour. So if you have an input of '10:02', the minutes will be correct (two consecutive digits, '02'), but the hour will be wrong, because the pattern matches only one digit after the '<td>' -- in this case '1'.

Add the asterisk inside the parentheses for any hour pattern where you anticipate needing more than one digit to represent the hour.
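
The difference is easy to see by trying both patterns on a two-digit hour. Below is a quick Python sketch of it; the cell string is an invented example, and Python groups need no backslashes, but the digit patterns behave the same way as in sed.

```python
import re

cell = "<td>10:02 AM</td>"

one_digit = re.search(r"<td>([0-9])", cell).group(1)        # exactly one digit
any_digits = re.search(r"<td>([0-9]*)", cell).group(1)      # zero or more digits
minutes = re.search(r"<td>[0-9]*:([0-9]*)", cell).group(1)  # digits after the colon

print(one_digit, any_digits, minutes)
```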

Last edited by Dark_Helmet; 02-05-2012 at 07:47 PM.
 
Old 02-18-2012, 03:01 PM   #15
sirdeneb
LQ Newbie
 
Registered: Jan 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
Thanks all for your replies, especially Dark_Helmet; I learnt a lot!

Though I'll go with the simplest solution for me which is not the most efficient, but at least I understand a little bit what I'm doing...
 
  


All times are GMT -5.