I would like to read a string from a file that appears right after (or a few characters after) another particular string. That particular string is static but can be anywhere in the document.
I've tried the grep command, but it only returns the whole line, even with the -u argument (most probably I'm too much of a newbie to use it correctly).
I've read that awk should work for this, but I'm not sure I understand how.
Should I read the file character by character and then return the dynamic string I want?
You will need to give an example of the input you have and the output you want.
Likely a tool such as sed or awk will do what you describe... depending on the complexity of the input. There's no way to know with a description that boils down to "I want to find string #1 that comes after string #2, but string #2 can appear anywhere."
Thank you for your answer.
And, alright, I just didn't want to be too confusing...
Here is my precise situation.
I would like to get the time of the sunset and the sunrise from this page for the current day: http://www.timeanddate.com/worldcloc...my.html?n=2416
So, as a first step, I download the page with the wget command. Here is a part of the file:
Then, I would like to read the sunrise time from the downloaded file. If you look at the HTML code, it's located right after the string "Feb 1, 2012</td><td>" (</td><td> marking the end of one cell and the start of the next, I guess), and it's 7:07 AM here. I just need the hours and the minutes, so that I can save them in two different variables.
Yes, sed can handle this, though, if you're not familiar with regular expressions, the command may look like gobbledygook.
For this example, I assume your web page is saved as "webpage.html" -- substitute as appropriate in the command
Code:
sed -n "s@.*Feb 1, 2012</td><td>\([0-9:]*\).*@\1@p" webpage.html
On my machine, given the sample input you provided, that command returns "7:07". It can be modified to return the information in a different way depending on your need.
Also, I'm going to continue experimenting with the command a bit so that it will automatically pull the right time based on the current date.
EDIT:
And here is the command to find tomorrow's sunrise--"tomorrow" as in the upcoming day based on the date that you run the command:
Code:
sed -n "s@.*$(date -d 'tomorrow' '+%b %-e, %Y')</td><td>\([0-9:]*\).*@\1@p" webpage.html
Again, it will return something like 7:07. If you need something different to help you split the result into your two variables, tell me what output format would be the most useful for you.
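For splitting a result such as 7:07 into two variables, one option is simply to cut it at the colon. A minimal Python sketch, assuming the extracted text always has the form H:MM (the function name split_time is made up for this illustration):

```python
def split_time(text):
    """Split a time such as '7:07' into an integer hour and minute."""
    hour_text, minute_text = text.strip().split(":")
    return int(hour_text), int(minute_text)


hour, minute = split_time("7:07")
print(hour, minute)
```

Converting to integers drops the leading zero of the minutes; keep the pieces as strings if the zero matters.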
Last edited by Dark_Helmet; 01-31-2012 at 11:58 PM.
You'll get better results if you use a scripting language like Python, PHP, Perl, Ruby -- anything, really, that can both read the page directly over the network, parse it for you, and split it into a document object tree.
I've had good results with PHP and Tidy. You only need to find the correct table (you can search based on cell contents!), then the correct row and cell, to extract the data. You don't need to have Apache installed to use PHP, just use the command-line version of PHP, php5-cli (or similar).
I agree with Nominal. Ideally, something to parse the html would be better.
If the sed command suits your needs, then you can use it, as long as you understand that it's "picky." Slight changes in the format of the html data may cause the command to fail (i.e. not find a match or return the wrong data). An html parser is more flexible in that regard, though it would require a little more effort to set up.
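To give a rough idea of what the parser approach could look like, here is a small Python sketch. It collects the text of every table cell and returns the cells that follow the wanted date. The class and function names (CellCollector, cells_after) and the sample row are assumptions made for this illustration; the real page may be laid out differently.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell
        if self._in_cell:
            self.cells[-1] += data


def cells_after(html, wanted, count=2):
    """Return the 'count' cells following the cell whose text equals 'wanted'."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    cells = [cell.strip() for cell in collector.cells]
    if wanted not in cells:
        return None
    start = cells.index(wanted) + 1
    return cells[start:start + count]


# Hypothetical row: an extra attribute and a line break would defeat the
# sed pattern, but do not bother the parser.
sample = '<tr><td class="d">Feb 1, 2012</td>\n<td>7:07 AM</td><td>5:12 PM</td></tr>'
print(cells_after(sample, "Feb 1, 2012"))
```

Because the lookup is based on cells rather than on the exact characters between them, small changes in the markup are less likely to break it.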
Probably, but I don't know how to use any of those languages... I can barely write something in shell...
That's awesome!
Now, I'm trying to understand how it works.
s@.* means whatever before the expression and remove it
.*@\1@p means whatever after the expression and remove it
\([0-9:]*\) means to select only characters which are numbers and :
Is that correct?
Actually what I need is the hour of the sunrise and then the minutes of the sunrise. And after that the hour of the sunset and then the minutes of the sunset. The sunset is located in the next cell.
Quote:
Feb 1, 2012</td><td>7:07 AM</td><td>5:12 PM
So for today:
Quote:
H=sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>\([0-9]\).*@\1@p" webpage.html # hour of the sunrise
M=sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>\(:[0-9]\).*@\1@p" webpage.html # minute of the sunrise
sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>$H:$M AM</td><td>\([0-9]\).*@\1@p" webpage.html # hour of the sunset
sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>$H:$M AM</td><td>\(:[0-9]\).*@\1@p" webpage.html # hour of the sunset
Quote:
Now, I'm trying to understand how it works.
s@.* means whatever before the expression and remove it
.*@\1@p means whatever after the expression and remove it
\([0-9:]*\) means to select only characters which are numbers and :
Bits and pieces of what you describe are right. Let me go through it and explain from the top.
Ok, so the command (minus the input file):
Code:
sed -n "s@.*Feb 1, 2012</td><td>\([0-9:]*\).*@\1@p"
So, we're invoking sed. The '-n' option tells sed not to print anything by default. That is, we must explicitly tell sed when to print something. The 'p' at the end tells sed to print. In this case, whenever the rest of the command matches, print the result of the manipulation.
Easy part done. The command we give sed is for substitution--the 's@ ... @ ... @' format. In a nutshell, it means "look for a text pattern that matches what is between the first and second '@' symbols and replace what matched with the text between the second and third '@' symbols."
Let me skip ahead and describe the pattern between the second and third '@' symbols first (because it's simple). The '\1' is a backreference to the text between the first and second '@' symbols. Specifically, it refers to the text matched between the first set of parentheses. In this case: '\([0-9:]*\)'. The parentheses here are "escaped" with a backslash to tell sed that they are not meant as literal parentheses to match--that they should instead be interpreted as grouping matched text for a later backreference. So, just keep in mind that '\1' means whatever matched that specific portion of the pattern.
For the text between the first and second '@' symbols, a '.' is treated as a wildcard and matches any single character. The asterisk '*' is a modifier that means to match zero or more of the previous pattern. So, '.*' will match as much as possible from the beginning of the line.
The matching is forced to stop when the pattern includes the literal text for the date and subsequent tags: 'Feb 1, 2012</td><td>'. So the first '.*' matches everything up to the literal text. The literal text matches itself, and that brings us back to the parenthetical pattern.
The escaped parentheses were explained earlier, and they have no impact on the text to match. So, ignore them from that perspective. That leaves: '[0-9:]*'
That pattern means to match a single character that is either a numeric digit from 0 to 9 or a colon. And again, the star modifies that to mean "zero or more of the previous expression". So, collectively, it will match any combination of digits and colons.
Lastly, the trailing '.*' does the same thing that the first one did, and will match everything else remaining on the line.
So, the entire line will be replaced by the sequence of digits and/or colons matched in the parenthetical.
If any of that is not clear, let me know and I'll try to explain it a little better.
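If it helps to experiment outside of sed, the same substitution can be imitated in Python. This is only a sketch: the function name sed_like is made up, and the sample lines (including the Jan 31 row and the surrounding tr/td tags) are assumptions modelled on the row quoted in this thread.

```python
import re

# Same structure as the sed pattern: a leading '.*', the literal date and
# tags, a captured run of digits/colons, then a trailing '.*'.
# (Python writes the grouping parentheses without backslashes.)
PATTERN = re.compile(r".*Feb 1, 2012</td><td>([0-9:]*).*")


def sed_like(lines):
    """Imitate sed -n with the p flag: keep only lines where a substitution happened."""
    results = []
    for line in lines:
        # Replace the whole matched line with the captured group
        new_line, count = PATTERN.subn(r"\1", line, count=1)
        if count:
            results.append(new_line)
    return results


sample = [
    "<tr><td>Jan 31, 2012</td><td>7:08 AM</td><td>5:11 PM</td></tr>",
    "<tr><td>Feb 1, 2012</td><td>7:07 AM</td><td>5:12 PM</td></tr>",
]
print(sed_like(sample))
```

Only the second line contains the literal date, so only that line is replaced by its captured group and kept.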
The subsequent command I gave uses shell substitution to handle the date. Specifically, the bash shell replaces "$(date -d 'tomorrow' '+%b %-e, %Y')" with the output of the corresponding date command. That is to say, sed never sees the "$( ... )" but only sees the output of the command--which should match the style of date shown in the html file.
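For comparison, here is a rough Python sketch that builds the same kind of date text ("Feb 1, 2012": abbreviated month, day without padding, four-digit year). The function name page_date and the hard-coded English month list are assumptions for this illustration; the list is there so the result does not depend on the system locale.

```python
from datetime import date, timedelta

# English month abbreviations, as they appear in the page
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def page_date(day):
    """Format a date the way the page writes it, e.g. 'Feb 1, 2012'."""
    return "{0} {1}, {2}".format(MONTHS[day.month - 1], day.day, day.year)


today = date.today()
tomorrow = today + timedelta(days=1)  # plays the role of date -d 'tomorrow'
print(page_date(today))
print(page_date(tomorrow))
```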
Quote:
So for today:
Code:
H=sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>\([0-9]\).*@\1@p" webpage.html # hour of the sunrise
M=sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>\(:[0-9]\).*@\1@p" webpage.html # minute of the sunrise
sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>$H:$M AM</td><td>\([0-9]\).*@\1@p" webpage.html # hour of the sunset
sed -n "s@.*$(date -d'+%b %-e, %Y')</td><td>$H:$M AM</td><td>\(:[0-9]\).*@\1@p" webpage.html # hour of the sunset
Note: the '-d' should be removed from the date command. That option is used to specify a date string to override the default "now" that the command uses.
First command: yes. Assuming that you live in a locale where sunrise will never occur after 9:59 AM (i.e. two digits for the hour)
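Second command: close... Try matching the hour and the colon outside the parentheses and capturing only the minutes, i.e. '[0-9]*:\([0-9]*\)' in place of '\(:[0-9]\)'.
Third command: probably--same caveat as the first command. Though I would encourage you to enclose your variable references with curly braces. For instance: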
Code:
sed -n "s@.*$(date '+%b %-e, %Y')</td><td>${H}:${M} AM</td><td>\([0-9]\).*@\1@p" webpage.html
Fourth command: same as with the second (and the curly braces):
Code:
sed -n "s@.*$(date '+%b %-e, %Y')</td><td>${H}:${M} AM</td><td>[0-9]*:\([0-9]*\).*@\1@p" webpage.html
I have not run those commands to verify, but I will do so in a moment.
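The patterns above can also be tried against the sample row outside of sed. A rough Python sketch, assuming the row looks exactly like the sample quoted earlier (sunrise marked AM, sunset marked PM); the function name sun_times is made up here, and all four values are captured in one pass instead of four:

```python
import re


def sun_times(html, day_text):
    """Return (sunrise hour, sunrise minute, sunset hour, sunset minute) or None."""
    pattern = (
        re.escape(day_text)  # the date is literal text, not a pattern
        + r"</td><td>([0-9]+):([0-9]+) AM</td><td>([0-9]+):([0-9]+) PM"
    )
    match = re.search(pattern, html)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


row = "Feb 1, 2012</td><td>7:07 AM</td><td>5:12 PM"
print(sun_times(row, "Feb 1, 2012"))
```

The hours are returned as written on the page, so the sunset hour is still on a 12-hour clock.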
EDIT:
I would be lax if I did not show you this, given what you're obviously trying to accomplish. Try running the following example script. It uses a modified sed command and some shell redirection "magic" to assign all four of your variables at one time:
As a side note, since I'm learning python, I may post an html-parsing python script for you to use. It's more an exercise for me, but maybe you'll find it useful.
Last edited by Dark_Helmet; 02-01-2012 at 01:18 AM.
And those are the sunrise hour, sunrise minute, sunset hour, and sunset minute for the date specified.
So, you could use it along with that shell redirection "magic" I mentioned in my previous reply's EDIT section.
The script:
Code:
#!/usr/bin/env python3
from html.parser import HTMLParser
import re
import sys


class myParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.targetDate = None
        self.checkForDate = 0
        self.dateFound = 0
        self.sunriseFound = 0
        self.sunTimes = []

    def SetParseForDate(self, wantedDate):
        self.targetDate = wantedDate

    def handle_data(self, data):
        if self.sunriseFound < 2 and data == "Sunrise":
            # Table header seen: start checking the data for the wanted date
            self.checkForDate = 1
        elif self.sunriseFound < 2 and self.checkForDate == 1:
            if data == self.targetDate:
                self.dateFound = 1
                self.checkForDate = 0
        elif self.sunriseFound < 2 and self.dateFound == 1:
            # The next two pieces of data holding a time are sunrise and sunset
            reMatch = re.search("([0-9]+):([0-9]{2})", data)
            if reMatch is not None:
                self.sunTimes = self.sunTimes + [reMatch.group(1), reMatch.group(2)]
                self.sunriseFound = self.sunriseFound + 1
                if self.sunriseFound == 2:
                    print(' '.join(self.sunTimes))


if len(sys.argv) != 3:
    print("This script requires a date and filename--in that order--to run")
    sys.exit(1)

targetDate = sys.argv[1]
targetFile = sys.argv[2]

try:
    sunriseSunsetHtml = open(targetFile, "r")
except IOError:
    print("Unable to open {0} for reading".format(targetFile))
    sys.exit(2)

parser = myParser()
parser.SetParseForDate(targetDate)

# Feed the file to the parser one line at a time
for dataLine in sunriseSunsetHtml:
    parser.feed(dataLine)
parser.close()
sunriseSunsetHtml.close()
In a nutshell, it scans the individual element data in the html. When it finds "Sunrise" (presumably as the header for the table) the script sets a flag and starts checking for a match on the date given. When the date is found, another flag is set, and the script starts looking for the next two pieces of data that match "[0-9]+:[0-9]{2}". Those two matching times are printed on one line, with a space separating each component.
Last edited by Dark_Helmet; 02-01-2012 at 01:57 AM.
Thank you so much Dark_Helmet for these explanations!
However, I must admit there is some stuff that goes straight over my head...
And I'm not even talking about the Python script...
I know using bash is not the best for my application and it's tricky but I'm using your one line command and that's freaking awesome!
Quote:
First command: yes. Assuming that you live in a locale where sunrise will never occur after 9:59 AM (i.e. two digits for the hour)
Where I live, the sunrise never occurs after 10am, nevertheless the sunset occurs after 9pm during the summer...
But I'm not sure I understand what you mean. Indeed, the problem doesn't arise for the minutes, which obviously always have two digits... And the pattern is the same as for the hours.
Quote:
... nevertheless the sunset occurs after 9pm during the summer...
And the pattern is the same as for the hours
Not quite. The pattern is different for the hours versus the minutes. If you look closely at the commands for the hours (commands #1 and #3 in the response you're referring to) and the commands for the minutes (commands #2 and #4) there is a small, but significant difference.
For this example, I'll only focus on commands #1 and #2...
From the revised commands provided in my response:
Code:
H=$(sed -n "s@.*$(date '+%b %-e, %Y')</td><td>\([0-9]\).*@\1@p" webpage.html) # hour of the sunrise
M=$(sed -n "s@.*$(date '+%b %-e, %Y')</td><td>[0-9]*:\([0-9]*\).*@\1@p" webpage.html) # minute of the sunrise
Notice that there is an asterisk ('*') inside the parentheses for the minutes command. Also notice there is no corresponding asterisk inside the parentheses for the hours command.
The asterisk means match zero or more of the previous pattern. In this case, the previous pattern is any digit (0 through 9). So, the minutes command would match zero digits, one digit, two consecutive digits, three consecutive digits, etc.
Without the asterisk, the pattern will only match one digit. Therefore, the hour command, because it does not use an asterisk, will only match one digit immediately after the '<td>' tag for the hour. So, if you have an input of '10:02', then the minutes will be correct: two consecutive digits, '02'. However, the hour would be wrong, because it would match only one digit after the '<td>' -- in this case '1'.
Add the asterisk inside the parentheses for any hour pattern where you anticipate needing more than one digit to represent the hour.
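The difference is easy to see by trying both forms on a one-digit and a two-digit hour. A small Python sketch (the names capture_hour, ONE_DIGIT and ANY_DIGITS are made up for this illustration, and Python writes the grouping parentheses without backslashes):

```python
import re

ONE_DIGIT = r"([0-9])"    # like \([0-9]\) in the sed command
ANY_DIGITS = r"([0-9]*)"  # like \([0-9]*\) in the sed command


def capture_hour(cell_text, pattern):
    """Return what the group captures at the start of the cell text."""
    match = re.match(pattern, cell_text)
    return match.group(1) if match else None


for text in ("7:07 AM", "10:02 AM"):
    print(text, "->", capture_hour(text, ONE_DIGIT), capture_hour(text, ANY_DIGITS))
```

For '10:02 AM' the one-digit form captures only '1', while the form with the asterisk captures '10'.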
Last edited by Dark_Helmet; 02-05-2012 at 07:47 PM.