LinuxQuestions.org
12-03-2009, 05:28 AM   #1
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Rep: Reputation: 15
BASH scripting: problem with file formatting.


I am trying to do a script which depends on

Code:
cat file | while read i;
do
....
.... 
done

I do some scripting on the html output. An example source file can be found here (which would be $s in the line of code below)
http://www.dodgybloke.co.uk/11191S2E

Code:
cat $s | egrep 'RB|RT' |  sed '1,2d' |  sed -e :a -e 's/<[^>]*>//g;/</N;//ba' |  sed 's/^[ \t]*//' |  tr ',' '\n' >> $s.trackingfound
The file produced looks like this when a simple command such as
Code:
 cat *.trackingfound >> broken
is performed to get all the data into one file:

Code:
george@linux-z40o:~> cat broken

RB116413492HK

RB116413492HK
RT040029841HK
RT040029461HK
RT040029841HK
RT040029461HK
However closer examination reveals this is how read is seeing it.
Code:
george@linux-z40o:~/> cat broken | while read i; do  echo "*S*" $i "*E*"; done
 *E*
 *E*RB116413492HK
 *E*
 *E*RB116413492HK
*S* RT040029841HK *E*
 *E*RT040029461HK
*S* RT040029841HK *E*
 *E*RT040029461HK
As you can see, the file is, as the name suggests, completely broken: only the last line and the third-from-last line are read correctly into my script. I would like to know how to fix it, or how to get each "word" into a variable using another method.

sed and awk commands to remove blank lines have been fruitless.
Perhaps I need to put everything back onto a single line, then re-separate at each RB or RT, or after every 13th character. In that case, some of the commands described above can probably be changed.

As you can see, I am trying to parse the "tracking numbers" from the html.
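
The overprinted " *E*" in the output above is what a terminal shows when a line ends in a carriage return, so one plausible reading (an assumption, since the thread never confirms it) is that the HTML source has DOS-style CRLF line endings. The sketch below strips carriage returns and blank lines before the read loop. The file contents and the names sample_file, cleaned, and count are hypothetical stand-ins, not taken from the original script.

```shell
#!/bin/sh
# Assumption: the data has CRLF line endings, as the overprinted output suggests.
# sample_file is a hypothetical stand-in for the "broken" file.
sample_file=$(mktemp)
printf 'RB116413492HK\r\n\r\nRT040029841HK\n' > "$sample_file"

# Remove every carriage return, then drop the lines that are now empty.
cleaned=$(tr -d '\r' < "$sample_file" | sed '/^$/d')

# Read the cleaned numbers one per line; the here-document keeps the loop
# in the current shell, so count is still set afterwards.
count=0
while IFS= read -r number; do
    count=$((count + 1))
    echo "*S* $number *E*"
done <<EOF
$cleaned
EOF

rm -f "$sample_file"
```

If the assumption holds, each number should now appear between the *S* and *E* markers on its own line.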

Last edited by suse_nerd; 12-03-2009 at 05:42 AM. Reason: updated to include source html
 
12-03-2009, 08:11 AM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1950
Let me see if I understand correctly. You want to extract only the tracking numbers from the source code of the html documents like the example page you gave, right? And the numbers you want always start with RB or RT?

I tried the extraction string you posted on the page you gave, and it gave me the following output:
Code:
testpage=$(wget -O- http://www.dodgybloke.co.uk/11191S2E)

echo "$testpage" | egrep 'RB|RT' |  sed '1,2d' |  sed -e :a -e 's/<[^>]*>//g;/</N;//ba' |  sed 's/^[ \t]*//' |  tr ',' '\n'

size="2">RB116413492HK
href="http://app3.hongkongpost.com/CGI/mt/genresult.jsp?tracknbr=RB116413492HK" target=_blank>
RB116413492HK
Something tells me you don't want all that extra garbage. Besides, I think you're making it much more complicated than it needs to be. I can get the tracking number with just the following command:
Code:
$ sed -rn '0,/tracknbr/ s/^.*=((RB|RT)[^"]+).*/\1/p' <<<"$testpage"

RB116413492HK
"0,/tracknbr/" says to only search the file up to the first line that has "tracknbr" in it, then it uses the s/// expression to extract the actual number. You may have to modify it a little if the input can vary significantly.
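
To see the address range at work without fetching the page, here is a small sketch on made-up input. The sample_page text is hypothetical: it only imitates the fragments shown earlier in this post, and the second link line is invented to show that lines after the first "tracknbr" match are ignored. It assumes GNU sed, since both -r and the 0,/regex/ address are GNU extensions.

```shell
#!/bin/sh
# Hypothetical sample input; the real page is not reproduced here.
sample_page='<td><font size="2">RB116413492HK</font></td>
<a href="http://app3.hongkongpost.com/CGI/mt/genresult.jsp?tracknbr=RB116413492HK" target=_blank>
<a href="http://app3.hongkongpost.com/CGI/mt/genresult.jsp?tracknbr=RT040029841HK" target=_blank>'

# 0,/tracknbr/ limits the substitution to lines up to and including the
# first one containing "tracknbr"; the RT line after it is never examined.
first_number=$(printf '%s\n' "$sample_page" |
    sed -rn '0,/tracknbr/ s/^.*=((RB|RT)[^"]+).*/\1/p')

echo "$first_number"
```

On this sample the first line has no "=" directly followed by RB or RT, so only the first link line produces output.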

Finally, for efficiency it's better to avoid pipes and external commands like cat whenever possible. A pipe also runs the commands after it in subshells, so variables set inside the loop are lost once it finishes, which can be confusing. So your while loop is better written this way:
Code:
while read i;
do
....
.... 
done <file
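
The subshell effect is easy to reproduce. The sketch below counts lines both ways, using a hypothetical three-line temporary file; the variable names are illustrative only. In bash with default settings the piped loop's counter is still 0 after the loop, while the redirected loop's counter keeps its value.

```shell
#!/bin/sh
# Hypothetical three-line input file.
input_file=$(mktemp)
printf 'one\ntwo\nthree\n' > "$input_file"

# Pipe version: the loop runs in a subshell, so its counter is lost afterwards.
piped_count=0
cat "$input_file" | while IFS= read -r line; do
    piped_count=$((piped_count + 1))
done

# Redirect version: the loop runs in the current shell, so the counter survives.
redirect_count=0
while IFS= read -r line; do
    redirect_count=$((redirect_count + 1))
done < "$input_file"

echo "piped: $piped_count, redirected: $redirect_count"
rm -f "$input_file"
```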

Last edited by David the H.; 12-03-2009 at 08:14 AM. Reason: fixed formatting error
 
12-07-2009, 07:32 AM   #3
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Original Poster
Rep: Reputation: 15
Many thanks for the reply. It has fixed the problem. I thought I would upload my entire script. It all works fine, but I expect there are better ways of doing it. I tried changing the commands to what you suggested, but that didn't work, though that could have been because of other problems.

http://www.dodgybloke.co.uk/trackingscript.sh
 
12-07-2009, 08:08 AM   #4
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241
No need for a complicated regex:
Code:
# wget -q -O- http://www.dodgybloke.co.uk/11191S2E | awk -F"tracknbr=" '/tracknbr=/{sub(/\".*/,"",$2);print $2}'
RB116413492HK
RB116413492HK
 
12-07-2009, 05:31 PM   #5
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by ghostdog74
no need complicated regex

Hi ghostdog, which line are you saying I could replace? It is not as simple as just getting the file from the above site, which was provided as an example only. The lynx script logs in to dealextreme.com and gets the file.

The complicated regex gets rid of duplicate lines like those above, because the next part of the script checks each tracking number against the Hongkong Post website and would otherwise check the same tracking number twice.

Code:
 sed '$!N; /^\(.*\)\n\1$/!P; D'
Deletes duplicate consecutive lines (like uniq)

Code:
 sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
Deletes duplicate non-consecutive lines
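
Since the goal described here is to avoid checking the same tracking number twice, the following is a minimal sketch of another way to drop repeats, using a common awk idiom instead of the sed one-liners. It is not the method used in the posted script. The sample list reuses the numbers from the first post, and the variable names are hypothetical.

```shell
#!/bin/sh
# Sample data taken from the "broken" listing in the first post.
numbers='RB116413492HK
RB116413492HK
RT040029841HK
RT040029461HK
RT040029841HK
RT040029461HK'

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true only for first occurrences; awk prints those and skips repeats,
# whether or not the repeats are adjacent. Input order is preserved.
unique_numbers=$(printf '%s\n' "$numbers" | awk '!seen[$0]++')

echo "$unique_numbers"
```

Unlike sort -u, this keeps the numbers in the order they first appeared.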
 
Tags
awk, bash, formatting, line, lines, newline, read, sed
