LinuxQuestions.org


flackend 09-27-2011 10:39 PM

Return text, regex, grep, awk...
 
Hi, I have a text file that contains records. Here's an example:

Code:

"1000","0932010","Google Search Engine  http://www.google.com/";"Yahoo! Search Engine http://www.yahoo.com/"
"1001","9189000","Linux Questions Forum  http://www.linuxquestions.org/questions/"

I'd like to isolate and curl each URL with a bash script. I could use a somewhat brute-force awk command with head and tail to iterate, but I'd like to learn to use regular expressions.

I looked into sed, but it seems that it's used for replacing or modifying the returned string. And my understanding of grep is that it only returns complete lines.

What command do I need to use to isolate the URLs?

Thanks for any help!

grail 09-27-2011 11:52 PM

You will need to provide more information, as the first line has 2 urls; do you want both?

awk, sed, or grep can all return just the portion you are looking for, as can bash for that matter.
Along with an example of the information you want from the sample above, perhaps you could show where you are getting stuck using any one of these, and someone will help.

David the H. 09-28-2011 07:52 PM

Here's a quick summary of your options:

grep applies a regex pattern to each line, and if it matches, it prints out the whole line by default. However, it can optionally print out just the part that matches, or the lines that don't match, or the matching line plus the ones just before or after it. Its main limitation is that it can't match a string and then print out only part of the matching section (i.e. you can't search for "abc=xyz" and print only "xyz").
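
For example, grep's -o option could pull just the urls out of your sample records (a minimal sketch, assuming they're saved as testrecords.txt):

Code:

# -o prints only the matching part of each line; -E enables extended regex
grep -oE 'http://[^"]+' testrecords.txt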

sed is a more flexible editing tool. It applies regex patterns to each line of input, and can print, delete, insert, alter, or extract text from them. It can do the substring extraction that grep cannot. While it does have some multi-line editing capability as well, it's not well designed for it.
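
A minimal sketch of that kind of substring extraction, using the hypothetical "abc=xyz" input from above:

Code:

# -n suppresses automatic printing; the p flag prints only when the substitution matched
echo 'abc=xyz' | sed -n 's/.*abc=\([^ ]*\).*/\1/p'
# prints: xyz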

awk is not just a program, but a full scripting language, and comes in several variations based on the interpreter used. Linux generally uses gawk. awk divides the input up into records, and then subdivides those records into fields. These are user defined, but by default are one line per record, and one word per field. You can then manipulate these fields with an impressive collection of functions.
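
A small sketch of the record/field idea, using comma as the field separator on your sample data (assuming it's saved as testrecords.txt):

Code:

# default RS makes each line a record; -F, splits each record's fields on commas
awk -F, '{print "record " NR " has " NF " fields; first field is " $1}' testrecords.txt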

The text processing capabilities of bash and other shells are string-based and fairly powerful. If you can break the text down into clearly defined, reasonably-sized chunks (a single line from a file, for example), then you can store them in variables and operate on them with various internal and external tools. Bash is one of the more advanced shells, and has full regex ability built into its [[..]] extended test construct, in addition to other string manipulation features.
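
A tiny sketch of the [[..]] regex test (the sample line is just pasted from your first post):

Code:

line='"1001","9189000","Linux Questions Forum  http://www.linuxquestions.org/questions/"'
re='http://[^"]+'

# BASH_REMATCH[0] holds whatever the regex matched
if [[ $line =~ $re ]]; then
    echo "${BASH_REMATCH[0]}"
fi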



Now, since you used quote tags above, which don't preserve formatting, rather than [code][/code] tags, which do, I'm not yet sure about the exact layout of the text you posted. Does each line contain a single, unbroken url, can lines contain more than one, or can the urls span multiple lines? If the latter, then it's going to require some additional work to extract them.

The basic procedure will probably be very simple. Just extract the urls using one of the above tools, store them in variables, and run your curl commands on them.

If the file contains unbroken urls, then a simple bash loop iterating through the lines would do the job. But if there can be multi-line urls, or if the file is very large, then it would probably be better to use sed or awk to extract them first, then run the loop on the results.
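
A rough sketch of that extract-first approach, assuming unbroken urls and a file called testrecords.txt:

Code:

# pull the urls out first, one per line, then loop over the results
grep -oE 'http://[^"]+' testrecords.txt | while read -r url
do
    curl "$url"    # whatever curl operation you actually need goes here
done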

flackend 09-29-2011 01:12 AM

Here is my script (it's also attached along with "testrecords.txt"):
Code:

#!/bin/bash

clear

# DEFINE # ITERATE # through records in text file
while read currentLine
do
        # DEFINE # current line resrcCount
        resrcCount=`echo $currentLine | grep -o http:// | wc -l`

        # Does current line contain external resources?
        if [ $resrcCount -lt 1 ]

        # NO #
        then

                # print record to HTML (containing only no-resource records)
                echo No resources.

        # YES #
        else
               
                # ITERATE # through record's resources
                for (( i = 1; i <= $resrcCount; i++ ))
                do
                        # DEFINE # current resource URL
                        resrcURL="http://"`echo $currentLine | awk -F, '{print $3}' | awk -F\" '{print $'$(($i * 2))'}' | awk 'BEGIN {FS="http://"};{print $2}'`
                       
                        echo $resrcURL

                done

        fi
       
        # Add a blank line between records
        printf "\n"

done < testrecords.txt

I didn't realize grep, awk, and sed could return what I need. Thanks! I'll see if I can find the right syntax online somewhere.

If anyone knows what I need without putting time into it, I'd prefer to not have to spend any more time on Google, haha.

I also noticed that the bottom line of my testrecords.txt is never processed, so I added an extra line "END", but I'd prefer to know why it's skipping the last line. Any idea?

Thanks, @David the H. and @grail!

grail 09-29-2011 03:36 AM

hmm .. not sure why it had to be complicated:
Code:

#!/bin/bash

urls=($(egrep 'http://[^"]+' testrecords.txt))

for url in ${urls[*]}
do
    curl "$url"
done

Of course I am not sure what sort of curl operation you want but I am sure you get the idea.

The only issue to arise with this will relate to whether or not a url has quotes in it.

flackend 09-29-2011 09:25 PM

When I run this command:

Code:

echo some text http://google.com/ | egrep 'http://[^"]+'

I get this:

Code:

some text http://google.com/

I need to get only the URL:

Code:

http://google.com/


The code I posted before (below) takes "0123","4567","Webpage Title http://www.example.com/" and returns only http://www.example.com/. But like you said it's sloppy and overly complicated.

Code:

resrcURL="http://"`echo $currentLine | awk -F, '{print $3}' | awk -F\" '{print $'$(($i * 2))'}' | awk 'BEGIN {FS="http://"};{print $2}'`
If I figure out the regex, is their a switch or some way to tell egrep to return only the URL, not the entire line? I haven't found any evidence that their is.

Thanks for your help!

David the H. 09-29-2011 09:50 PM

Well, I had to get some sleep so grail beat me to the solution, but my rewrite is very similar. I tried to keep it closer to the basic structure of the original, however:

Code:

#!/bin/bash

infile=$1
IFS=$'\n'

# DEFINE # ITERATE # through records in text file
while read currentLine ; do

    resrcURL=( $( grep -Eo 'http[^"]+' <<<"$currentLine" ) )

    if (( ${#resrcURL[@]} )); then

          # ITERATE # through record's resources
          echo "${resrcURL[*]}"

    else

          # print record to HTML (containing only no-resource records)
          echo "No resources."

    fi

    # Add a blank line between records
    echo

done <"$infile"

Points of comment:

1) The regex in this case can be very simple. Since every url starts with http and ends in a ", all you have to do is grab everything after "http" that's not a quotemark. And fortunately grep can cleanly match the whole url and print it. (I think grail forgot to add the -o option to his grep command, by the way. ;))

2) $(..) is highly recommended over `..`

3) Like grail, I simply loaded every url on the line into an array first. Then I could simply test the array variable to see if there was anything there, and print the appropriate string.

4) grail used a loop to print out the urls, but I used a slightly different technique. The pattern "${array[*]}" prints all existing array elements, separated by the first character in IFS. Since I set IFS to newline, that means it will print one per line (see the small sketch after this list). You may want to change it back to a loop if you want to perform other actions on the urls.

5) echo prints a newline by default; easier than using printf.

6) It's usually more convenient in the long run to use variables to specify input files. It makes it easier to modify them later, without having to go through the whole script.
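
To illustrate point 4, a tiny standalone sketch of the IFS / "${array[*]}" behaviour (the urls are just placeholders):

Code:

urls=( http://www.google.com/ http://www.yahoo.com/ )

IFS=$'\n'
echo "${urls[*]}"    # elements joined by newline: one url per line

IFS=','
echo "${urls[*]}"    # same elements, now joined by commas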

Edit after seeing the previous post:

egrep is just an alias for grep -E, as its manpage states, so all grep options are still valid. As I mentioned before, grail just forgot to add the -o option.

grail 09-29-2011 10:16 PM

So firstly, David is on the money about the missing -o, as I didn't copy and paste :doh:

I would also point out that the example you provided won't work for either script, or even your own, as there are no quotes in it.

flackend 09-30-2011 12:58 AM

Thank you!

@grail you're right about my example... I didn't realize what you meant. Now that I've made sense of how the regex pattern works, I understand what you meant about the quotes.

@David the H. you went above and beyond! Thanks!



Last line of text file not being read issue solved...and explained.
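
For reference, the usual cause is a missing newline after the final line: read returns a non-zero status when it hits end-of-file, so the while loop stops even though the variable was filled. A common workaround looks something like this (a sketch, not necessarily the exact explanation referenced above):

Code:

# process the last line even when the file has no trailing newline
while read -r currentLine || [[ -n $currentLine ]]
do
    echo "$currentLine"
done < testrecords.txt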

