Old 09-27-2011, 10:39 PM   #1
flackend
LQ Newbie
 
Registered: Jun 2009
Location: Ohio
Distribution: Ubuntu 9.04
Posts: 10

Rep: Reputation: 0
Return text, regex, grep, awk...


Hi, I have a text file that contains records. Here's an example:

Code:
"1000","0932010","Google Search Engine  http://www.google.com/";"Yahoo! Search Engine http://www.yahoo.com/"
"1001","9189000","Linux Questions Forum  http://www.linuxquestions.org/questions/"
I'd like to isolate and curl each URL with a bash script. I could write a slightly brute-force awk command using head and tail to iterate, but I'd rather learn to use regular expressions.

I looked into sed, but it seems that it's used for replacing or modifying the returned string. And my understanding of grep is that it only returns complete lines.

What command do I need to use to isolate the URLs?

Thanks for any help!

Last edited by flackend; 09-28-2011 at 10:33 PM.
 
Old 09-27-2011, 11:52 PM   #2
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,478

Rep: Reputation: 1888
You will need to provide more information: the first line has 2 urls; do you want both?

awk, sed or grep can all return just the portion you are looking for, as can bash for that matter. Along with an example of the information you would want from the data above, maybe you could show where you are getting stuck with any one of these and someone will help.
 
1 member found this post helpful.
Old 09-28-2011, 07:52 PM   #3
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1947
Here's a quick summary of your options:

grep applies a regex pattern to each line, and if it matches, it prints out the whole line by default. However, it can optionally print out just the part that matches, or the lines that don't match, or that line plus the ones just before or after it. Its main limitation is that it can't match a string, then print out only part of the matching section (i.e. you can't search for "abc=xyz" and print only "xyz").
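
For example, a minimal sketch of the -o (print only the match) behavior, assuming the sample records above are saved as records.txt (a hypothetical name):

Code:
$ grep -o 'http://[^"]*' records.txt
http://www.google.com/
http://www.yahoo.com/
http://www.linuxquestions.org/questions/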

sed is a more flexible editing tool. It applies regex patterns to each line of input, and can print, delete, insert, alter, or extract text from them. It can do the substring extraction that grep cannot. While it does have some multi-line editing capability as well, it's not well designed for it.
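
For instance, a minimal sketch of the substring extraction described above (search for "abc=xyz", print only "xyz"):

Code:
$ echo 'foo abc=xyz bar' | sed -n 's/.*abc=\([^ ]*\).*/\1/p'
xyz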

awk is not just a program, but a full scripting language, and comes in several variations based on the interpreter used. Linux generally uses gawk. awk divides the input up into records, and then subdivides those records into fields. These are user defined, but by default are one line per record, and one word per field. You can then manipulate these fields with an impressive collection of functions.
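
A minimal sketch using a record from the first post, with comma as the field separator:

Code:
$ echo '"1001","9189000","Linux Questions Forum  http://www.linuxquestions.org/questions/"' | awk -F, '{print $3}'
"Linux Questions Forum  http://www.linuxquestions.org/questions/"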

The text processing capabilities of bash and other shells are string-based and fairly powerful. If you can break the text down into clearly defined, reasonably sized chunks (a single line from a file, for example), then you can store them in variables and operate on them with various internal and external tools. Bash is one of the more advanced shells, and has full regex ability built into its [[..]] extended test pattern, in addition to other string manipulation features.
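
A minimal sketch of that built-in regex test; BASH_REMATCH holds the matched text:

Code:
line='"1001","9189000","Linux Questions Forum  http://www.linuxquestions.org/questions/"'
re='http://[^"]+'
if [[ $line =~ $re ]]; then
    echo "${BASH_REMATCH[0]}"    # http://www.linuxquestions.org/questions/
fi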



Now, since you used quote tags above, which don't preserve formatting, rather than [code][/code] tags, which do, I'm not yet sure about the exact layout of the text you posted. Does each line contain a single, unbroken url, can it contain more than one, or can the urls span multiple lines? If the latter, then it's going to require some additional work to extract them.

The basic procedure will probably be very simple. Just extract the urls using one of the above tools, store them in variables, and run your curl commands on them.

If the file contains unbroken urls, then a simple bash loop iterating through the lines would do the job. But if there can be multi-line urls, or if the file is very large, then it would probably be better to use sed or awk to extract them first, then run the loop on the results.
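
For the simple single-line case, a rough sketch (again assuming a hypothetical records.txt):

Code:
while IFS= read -r line; do
    # grep -Eo prints each matching url on its own line
    grep -Eo 'http://[^"]+' <<<"$line" | while IFS= read -r url; do
        curl "$url"
    done
done < records.txt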
 
1 member found this post helpful.
Old 09-29-2011, 01:12 AM   #4
flackend
LQ Newbie
 
Registered: Jun 2009
Location: Ohio
Distribution: Ubuntu 9.04
Posts: 10

Original Poster
Rep: Reputation: 0
Here is my script (it's also attached along with "testrecords.txt"):
Code:
#!/bin/bash

clear

# DEFINE # ITERATE # through records in text file
while read currentLine
do
	# DEFINE # current line resrcCount
	resrcCount=`echo $currentLine | grep -o http:// | wc -l`

	# Does current line contain external resources?
	if [ $resrcCount -lt 1 ]

	# NO #
	then

		# print record to HTML (containing only no-resource records)
		echo No resources.

	# YES #
	else
		
		# ITERATE # through record's resources
		for (( i = 1; i <= $resrcCount; i++ ))
		do
			# DEFINE # current resource URL
			resrcURL="http://"`echo $currentLine | awk -F, '{print $3}' | awk -F\" '{print $'$(($i * 2))'}' | awk 'BEGIN {FS="http://"};{print $2}'`
			
			echo $resrcURL

		done

	fi
	
	# Add a blank line between records
	printf "\n"

done < testrecords.txt
I didn't realize grep, awk, and sed could return what I need. Thanks! I'll see if I can find the right syntax online somewhere.

If anyone knows what I need without putting time into it, I'd prefer to not have to spend any more time on Google, haha.

I also noticed that the bottom line of my testrecords.txt is never processed, so I added an extra line "END" as a workaround, but I'd prefer to know why it's skipping the last line. Any idea?

Thanks, @David the H. and @grail!
Attached Files
File Type: txt testrecords.txt (466 Bytes, 6 views)
File Type: txt ProcessRecords.txt (1.1 KB, 3 views)

Last edited by flackend; 09-29-2011 at 01:13 AM.
 
Old 09-29-2011, 03:36 AM   #5
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,478

Rep: Reputation: 1888
hmm .. not sure why it had to be so complicated:
Code:
#!/bin/bash

urls=($(egrep 'http://[^"]+' testrecords.txt))

for url in ${urls[*]}
do
    curl "$url"
done
Of course I am not sure what sort of curl operation you want but I am sure you get the idea.

The only issue that may arise with this relates to whether or not a url has quotes in it.
 
1 member found this post helpful.
Old 09-29-2011, 09:25 PM   #6
flackend
LQ Newbie
 
Registered: Jun 2009
Location: Ohio
Distribution: Ubuntu 9.04
Posts: 10

Original Poster
Rep: Reputation: 0
When I run this command:

Code:
echo some text http://google.com/ | egrep 'http://[^"]+'
I get this:

Code:
some text http://google.com/
I need to get only the URL:

Code:
http://google.com/


The code I posted before (below) takes "0123","4567","Webpage Title http://www.example.com/" and returns only http://www.example.com/. But like you said, it's sloppy and overly complicated.

Code:
resrcURL="http://"`echo $currentLine | awk -F, '{print $3}' | awk -F\" '{print $'$(($i * 2))'}' | awk 'BEGIN {FS="http://"};{print $2}'`
If I figure out the regex, is there a switch or some way to tell egrep to return only the URL, not the entire line? I haven't found any evidence that there is.

Thanks for your help!
 
Old 09-29-2011, 09:50 PM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1947
Well, I had to get some sleep so grail beat me to the solution, but my rewrite is very similar. I tried to keep it closer to the basic structure of the original, however:

Code:
#!/bin/bash

infile=$1
IFS=$'\n'

# DEFINE # ITERATE # through records in text file
while read currentLine ; do

     resrcURL=( $( grep -Eo 'http[^"]+' <<<"$currentLine" ) )

     if (( ${#resrcURL[@]} )); then

          # ITERATE # through record's resources
          echo "${resrcURL[*]}"

     else

          # print record to HTML (containing only no-resource records)
          echo "No resources."

     fi

     # Add a blank line between records
     echo

done <"$infile"
Points of comment:

1) The regex in this case can be very simple. Since every url starts with http and ends in a ", all you have to do is grab everything after "http" that's not a quote mark. And fortunately grep can cleanly match the whole url and print it. (I think grail forgot to add the -o option to his grep command, by the way.)

2) $(..) is highly recommended over `..`

3) Like grail, I simply loaded every url on the line into an array first. Then I could simply test the array variable to see if there was anything there, and print the appropriate string.

4) grail used a loop to print out the urls, but I used a slightly different technique. The pattern "${array[*]}" prints all existing array elements, separated by the first character in IFS. Since I set IFS to newline, that means it will print one per line (see the short demo after these points). You may want to change it back to a loop if you want to perform other actions on the urls.

5) echo prints a newline by default; easier than using printf.

6) It's usually more convenient in the long run to use variables to specify input files. It makes it easier to modify them later, without having to go through the whole script.
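
A tiny demo of point 4 (the array contents here are made up for illustration):

Code:
arr=( http://www.google.com/ http://www.yahoo.com/ )
IFS=$'\n'; echo "${arr[*]}"    # one url per line
IFS=' ';   echo "${arr[*]}"    # both urls on one line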

Edit after seeing the previous post:

egrep is just an alias for grep -E, as the manpage states, so all grep options are still valid. As I mentioned before, grail just forgot to add the -o option.

Last edited by David the H.; 09-29-2011 at 09:55 PM. Reason: as stated
 
1 member found this post helpful.
Old 09-29-2011, 10:16 PM   #8
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,478

Rep: Reputation: 1888
So firstly, David is on the money about the missing -o; I didn't copy and paste.

I would also point out that the example you provided won't work for either script, or even your own, as there are no quotes in it.
 
Old 09-30-2011, 12:58 AM   #9
flackend
LQ Newbie
 
Registered: Jun 2009
Location: Ohio
Distribution: Ubuntu 9.04
Posts: 10

Original Poster
Rep: Reputation: 0
Thank you!

@grail you're right about my example... I didn't realize what you meant at first. Now that I've made sense of how the regex pattern works, I understand what you meant about the quotes.

@David the H. you went above and beyond! Thanks!



The last-line-of-the-text-file-not-being-read issue is solved... and explained below.
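
In short (the usual cause of this symptom, for the record): read returns a nonzero exit status when the final line has no trailing newline, so a plain while read loop exits before processing it. A common workaround is to also test whether the variable is non-empty:

Code:
while read currentLine || [ -n "$currentLine" ]; do
    echo "$currentLine"
done < testrecords.txt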

Last edited by flackend; 09-30-2011 at 01:55 AM.
 
  

