Help with Sed/Grep/Awk for file parsing

StupidNewbie · 03-15-2012, 02:56 PM

Hi everyone,

I need some help with a program I'm working on. Say I want to read in a file that looks like:

Code:

site:nameofsite
username:nameofuser
ipaddress:ipofsite
password:somehashedvalue
site:nameofsite2
username:nameofuser2
ipaddress:ipofsite2
password:somehashedvalue2
site:nameofsite3
username:nameofuser3
ipaddress:ipofsite3
password:somehashedvalue3

I can run grep and do something like:

Code:

cat file | grep 'site' | cut -d ':' -f2

and get "nameofsite" for each site. However, if I then put:

Code:

grep 'username' | cut -d ':' -f2

I get nothing. In fact, the program hangs and I have to press Ctrl+C to get out of it. The output I am trying to get is:

Code:

nameofsite
nameofuser
ipofsite
somehashedvalue
nameofsite2
nameofuser2
ipofsite2
somehashedvalue2
nameofsite3
nameofuser3
ipofsite3
somehashedvalue3

so that I can assign those items to variables. I have tried using sed to strip everything up to and including the ':' (i.e. sed 's/.*://'), but unfortunately the file I'm actually parsing is a bit more complicated than the one above. I'm using this as a simplified example, as I feel there must be a way to make grep go back and search again from the top of the file for a new string.

Does anyone have any idea how to make that happen?

Thanks in advance!!!

colucix · 03-15-2012, 03:08 PM

Quote:

Originally Posted by StupidNewbie

Code:

grep 'username' | cut -d ':' -f2

I get nothing.

The command is simply missing the file name. In that case grep reads from standard input (the keyboard), and you have to end the input with Ctrl-D; Ctrl-C interrupts the whole process instead. Anyway, why don't you show us a piece of the real input? Maybe we can give some more help.

David the H. · 03-15-2012, 03:14 PM

Code:

grep 'username' | cut -d ':' -f2

There is no filename or stdin input given here for grep to read, so it sits there waiting for you to give it some.

Code:

grep 'username' inputfile | cut -d ':' -f2

Notice that this also avoids the Useless Use Of Cat that your first grep command is guilty of.
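
As a minimal sketch of the point above, the snippet below recreates part of the sample data from the first post in a temporary file (the mktemp path and the variable names are illustrative and not from the thread). It shows grep reading a named file and, on the assumption that the goal is simply every value in file order, a single cut over the whole file so that no second pass is needed. Anchoring the pattern with ^ is an added refinement, not something the original command did.

```shell
# Recreate two records of the sample from the first post in a temp file.
infile=$(mktemp)
cat > "$infile" <<'EOF'
site:nameofsite
username:nameofuser
ipaddress:ipofsite
password:somehashedvalue
site:nameofsite2
username:nameofuser2
ipaddress:ipofsite2
password:somehashedvalue2
EOF

# grep is given the file name, so it does not wait on the keyboard.
usernames=$(grep '^username:' "$infile" | cut -d ':' -f2-)

# One pass over the file keeps every value in its original order.
all_values=$(cut -d ':' -f2- "$infile")

printf '%s\n' "$usernames"
printf '%s\n' "$all_values"
rm -f "$infile"
```

Using -f2- rather than -f2 keeps any later colons inside a value, which matters for values such as URLs.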

StupidNewbie · 03-16-2012, 12:34 PM

Thanks for the replies. David, I'm not actually using cat. I am using tail -500 logfile.log (it's a log file) | grep 'stuff' | cut -d blah blah blah

Unfortunately I can't get a sample of the exact output because it is on a private system, but it basically takes this format:

[mm/dd/yyyy hh:mm:ss] Creating a connection config for: SITE
[mm/dd/yyyy hh:mm:ss] Set parameter: PARAM
[mm/dd/yyyy hh:mm:ss] Set parameter: PARAM
[mm/dd/yyyy hh:mm:ss] Set parameter: PARAM
[mm/dd/yyyy hh:mm:ss] Set url = URL
[mm/dd/yyyy hh:mm:ss] Creating a connection config for: SITE
[mm/dd/yyyy hh:mm:ss] Set parameter: PARAM
[mm/dd/yyyy hh:mm:ss] Set parameter: PARAM
[mm/dd/yyyy hh:mm:ss] Set parameter: PARAM
[mm/dd/yyyy hh:mm:ss] Set url = URL
[mm/dd/yyyy hh:mm:ss] Creating a connection config for: SITE
[mm/dd/yyyy hh:mm:ss] Set parameter: PARAM
[mm/dd/yyyy hh:mm:ss] Set parameter: PARAM
[mm/dd/yyyy hh:mm:ss] Set parameter: PARAM
[mm/dd/yyyy hh:mm:ss] Set url = URL

I need PARAM, PARAM, PARAM and URL for each site. The desired output would be

SITE
PARAM
PARAM
PARAM
URL

SITE
PARAM
PARAM
PARAM
URL

SITE
PARAM
PARAM
PARAM
URL

And actually, it doesn't even need to be output; I just need those things separated out so that I can manipulate them. It seems I can get all of the sites, all of one param or another, or all the URLs using grep and cut/sed, but I can't get them ordered the way I want because once the file has been "grepped" once, grep doesn't start again from the top. I hope this isn't too vague. I would love to be able to post the actual log file, but I just can't do it.

Reuti · 03-16-2012, 12:39 PM

If it’s always the last column:

Code:

$ awk '{ print $NF }' file
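
A small sketch of what this does, run on placeholder lines in the shape described above (the timestamps and values are made up). With awk's default field splitting, $NF is the last whitespace-separated word, so a value that itself contains a space is reduced to its final word.

```shell
# Placeholder log lines in the shape described above; the values are made up.
log='[03/16/2012 12:00:00] Creating a connection config for: SITE
[03/16/2012 12:00:01] Set parameter: PARAM
[03/16/2012 12:00:02] Set parameter: two words
[03/16/2012 12:00:03] Set url = URL'

# $NF is the last whitespace-separated field of each line.
last_fields=$(printf '%s\n' "$log" | awk '{ print $NF }')
printf '%s\n' "$last_fields"
```

The third line comes out as "words" only, which is the limitation to watch for if any parameter contains spaces.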

danielbmartin · 03-16-2012, 12:54 PM

Try this...

Code:

|tr = : |sed 's/.*://'
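
A hedged sketch of how this pipeline behaves, using made-up lines in the shape of the log described above. Because tr converts every "=" to ":" and the sed expression is greedy, everything through the last ":" on the line is removed, so a value that contains "=" or ":" itself is cut down to its final piece.

```shell
# Made-up lines in the shape of the log described above.
log='[03/16/2012 12:00:00] Creating a connection config for: SITE
[03/16/2012 12:00:01] Set parameter: name=cn=user,ou=a
[03/16/2012 12:00:02] Set url = URL'

# tr turns every "=" into ":", then sed deletes through the LAST ":".
stripped=$(printf '%s\n' "$log" | tr = : | sed 's/.*://')
printf '%s\n' "$stripped"
```

The first and third lines keep a leading space in front of SITE and URL, and the second line is reduced to "a".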

Daniel B. Martin

StupidNewbie · 03-16-2012, 02:14 PM

Thanks guys! Both of these look to have potential but neither of them worked quite like I'd expected. The reason is that some of the PARAMS have special characters in them (for example one of them is a DN string like cn=username,ou=a,ou=b,ou=c,dc=a,dc=b,dc=c)

With awk, I was able to get everything except one of the PARAMs which happens to have spaces in it (I assume because awk is using the last field and using a space as the delimiter?)

With tr I was able to get SITE and only one of the PARAMs, I'm guessing because some of the PARAMs have colons in them. I've come up with some stuff I can post without giving away too much info. This is the exact format the log file follows (punctuation and everything):

Code:

[mm/dd/yyyy hh:mm:ss] Creating a connection config for: SITE1
[mm/dd/yyyy hh:mm:ss] Set parameter: some.stuff.i.dont.need
[mm/dd/yyyy hh:mm:ss] Set parameter: java.naming.security.principal=CN=user,OU=a,OU=b,DC=a,DC=b,DC=c,DC=d,DC=e,DC=f
[mm/dd/yyyy hh:mm:ss] Set parameter: java.naming.security.credentials=somehashedvalue
[mm/dd/yyyy hh:mm:ss] Set parameter: some.more.stuff.i.dont.need
[mm/dd/yyyy hh:mm:ss] Set java.naming.provider.url = http://www.example.com/
Creating a connection config for: SITE2
[mm/dd/yyyy hh:mm:ss] Set parameter: some.stuff.i.dont.need
[mm/dd/yyyy hh:mm:ss] Set parameter: java.naming.security.principal=CN=user,OU=a,OU=b,OU=c,DC=a,DC=b,DC=c,DC=d,DC=e
[mm/dd/yyyy hh:mm:ss] Set parameter: java.naming.security.credentials=somehashedvalue
[mm/dd/yyyy hh:mm:ss] Set parameter: some.more.stuff.i.dont.need
[mm/dd/yyyy hh:mm:ss] Set java.naming.provider.url = http://www.example2.com/
Creating a connection config for: SITE3
[mm/dd/yyyy hh:mm:ss] Set parameter: some.stuff.i.dont.need
[mm/dd/yyyy hh:mm:ss] Set parameter: java.naming.security.principal=CN=user,OU=a,OU=b,OU=c,OU=d,DC=a,DC=b,DC=c,DC=d
[mm/dd/yyyy hh:mm:ss] Set parameter: java.naming.security.credentials=somehashedvalue
[mm/dd/yyyy hh:mm:ss] Set parameter: some.more.stuff.i.dont.need
[mm/dd/yyyy hh:mm:ss] Set java.naming.provider.url = http://www.example3.com/

What I need is the following:

SITE1
CN=user,OU=a,OU=b,DC=a,DC=b,DC=c,DC=d,DC=e,DC=f
somehashedvalue
http://www.example.com/

SITE2
CN=user,OU=a,OU=b,OU=c,DC=a,DC=b,DC=c,DC=d,DC=e
somehashedvalue
http://www.example2.com/

SITE3
CN=user,OU=a,OU=b,OU=c,OU=d,DC=a,DC=b,DC=c,DC=d
somehashedvalue
http://www.example3.com/

Note that the OU structures are different and will vary depending on the site, so unfortunately I do not have a fixed number of fields for that line. Likewise, notice that there are a couple of lines in the middle of each block which I don't need, although that might be OK because I can probably grep them out if I can get everything else right. Unfortunately the format's not uniform, but if I can get close I might be able to figure the rest out on my own. I'm still playing with awk and tr to see if I can get this to work, but in the meantime, if you guys are able to get the output above from the log sample above it, I should be in business!

Thanks again for all the help.

StupidNewbie · 03-16-2012, 03:21 PM

I got it guys! I ended up just piping a bunch of sed commands together after using awk to print out the last field, with ": " (a colon followed by a space) as the field separator! I used a little bit of each of your replies combined with a bit of my own tweaking!

Here is my final command (I tested it on the real log file and with a little tweaking I got it to work like I planned). This assumes that "testfile" has the format given above:

Code:

cat testfile | awk -F ": " '{ print $NF }' | grep -v 'need' | sed 's/.*java.naming.security//' | sed 's/.*principal=//' | sed 's/.*credentials=//' | sed 's/.*provider.url = //'

THANKS!!!

David the H. · 03-17-2012, 11:17 AM

There's generally no need to mix and match grep, sed, and awk. sed can do everything grep can do and more, and awk is a full text-processing scripting language that can completely replace the other two, and then some.

grep and sed can also be handed multiple expressions at once, using the "-e" option.

Also, don't forget that "." is a regex operator, meaning "match any character", so you have to escape it or use a bracket expression if you want to match a literal period.

Code:

sed -e '/need/d' -e 's/.*for: //' -e 's/.*java\.naming\.security//' -e 's/.*principal=//' -e 's/.*credentials=//' -e 's/.*provider\.url = //' infile.txt
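
To illustrate the point about the unescaped ".", here is a minimal sketch with two made-up test strings. It counts matching lines with grep -c; the same matching rules apply to the patterns used in sed.

```shell
# "." matches any character, so this pattern also hits "providerXurl".
loose=$(printf '%s\n' 'provider.url' 'providerXurl' | grep -c 'provider.url')

# Escaping the dot (or writing [.]) matches only a literal period.
strict=$(printf '%s\n' 'provider.url' 'providerXurl' | grep -c 'provider\.url')
bracket=$(printf '%s\n' 'provider.url' 'providerXurl' | grep -c 'provider[.]url')

printf 'unescaped: %s, escaped: %s, bracket: %s\n' "$loose" "$strict" "$bracket"
```

The unescaped pattern matches both lines, while the escaped and bracketed forms match only the one with a real period.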

It's possible to compact the command even more if you use extended regular expressions (the -r option in GNU sed). Then you can use parentheses to group a list of alternative values to match (separated by "|").

Code:

sed -r -e '/need/d' -e 's/.*(for: |java\.naming\.security|principal=|credentials=|provider\.url = )//' infile.txt
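
For comparison, and only as a sketch of the earlier remark that awk alone can do the job, here is an awk-only version. It assumes the log format posted earlier in the thread; the sample lines and timestamps are placeholders, and printing a blank line before each new site is an assumption based on the desired output shown.

```shell
log='[03/16/2012 12:00:00] Creating a connection config for: SITE1
[03/16/2012 12:00:01] Set parameter: some.stuff.i.dont.need
[03/16/2012 12:00:02] Set parameter: java.naming.security.principal=CN=user,OU=a,DC=b
[03/16/2012 12:00:03] Set parameter: java.naming.security.credentials=somehashedvalue
[03/16/2012 12:00:04] Set java.naming.provider.url = http://www.example.com/
Creating a connection config for: SITE2
[03/16/2012 12:00:05] Set parameter: java.naming.security.principal=CN=user two,OU=a,DC=b
[03/16/2012 12:00:06] Set java.naming.provider.url = http://www.example2.com/'

records=$(printf '%s\n' "$log" | awk '
    # sub() deletes everything up to and including the marker text.
    /config for: /           { sub(/.*config for: /, ""); if (seen++) print ""; print; next }
    /security\.principal=/   { sub(/.*security\.principal=/, ""); print; next }
    /security\.credentials=/ { sub(/.*security\.credentials=/, ""); print; next }
    /provider\.url = /       { sub(/.*provider\.url = /, ""); print; next }
')
printf '%s\n' "$records"
```

Lines that match none of the four patterns are simply not printed, so the unwanted parameters need no separate filter, and a value containing spaces survives intact.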

Here are a few useful sed references:
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt

Here are a few useful awk references:
http://www.grymoire.com/Unix/Awk.html
http://www.gnu.org/software/gawk/man...ode/index.html
http://www.pement.org/awk/awk1line.txt
http://www.catonmat.net/blog/awk-one...ined-part-one/

A couple of regular expressions tutorials:
http://mywiki.wooledge.org/RegularExpression
http://www.grymoire.com/Unix/Regular.html

StupidNewbie · 03-18-2012, 01:42 PM

Thanks David. Even though I got this to work, I will give that a shot too. I tried using sed by itself before and it just became so cluttered and cryptic that I couldn't keep track of what I was replacing. Also, there were some quirks, like sed not properly interpreting braces {} to make a pattern repeat a specific number of times. That became an issue with the DN string, since DC= and OU= each repeat an unknown number of times. Anyway, I will give your code a shot and see if it looks cleaner and works the same way. Thanks!
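
On the brace quirk: in sed's default (basic) regular expressions an interval is written with backslashed braces, \{n\}, while plain {n} is an interval only in extended mode (-E, or -r in GNU sed). The sketch below uses a made-up DN to show both spellings, and also that no repeat count is needed at all when the number of OU= and DC= parts varies.

```shell
dn='principal=CN=user,OU=a,OU=b,DC=a,DC=b,DC=c'

# Basic regex: the group and the interval both need backslashes.
bre=$(printf '%s\n' "$dn" | sed 's/\(OU=[a-z],\)\{2\}/<two OUs>,/')

# Extended regex (-E): plain parentheses and braces.
ere=$(printf '%s\n' "$dn" | sed -E 's/(OU=[a-z],){2}/<two OUs>,/')

# Unknown repeat count: strip the prefix and keep the rest untouched.
rest=$(printf '%s\n' "$dn" | sed 's/^principal=//')

printf '%s\n' "$bre" "$ere" "$rest"
```

The first two commands print the same line, and the third leaves the whole DN intact regardless of how many components it has.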

Reuti · 03-18-2012, 01:52 PM

NB: Your first post showed Ubuntu and the later ones Mac OS X as the system you are posting from. On a Mac the bundled sed is the BSD version, which has no -r option. If you are using it there, you can compile GNU sed yourself, as I did for exactly that purpose.
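
A possible alternative that avoids compiling anything: BSD sed spells the extended-regex option -E, and GNU sed accepts -E as well, so the compact command from earlier in the thread can be written with -E instead of -r. A minimal sketch, with placeholder sample lines standing in for the real log:

```shell
log='[03/16/2012 12:00:00] Creating a connection config for: SITE1
[03/16/2012 12:00:01] Set parameter: some.stuff.i.dont.need
[03/16/2012 12:00:02] Set parameter: java.naming.security.principal=CN=user,OU=a,DC=b
[03/16/2012 12:00:03] Set parameter: java.naming.security.credentials=somehashedvalue
[03/16/2012 12:00:04] Set java.naming.provider.url = http://www.example.com/'

# Same expressions as the -r version, only the option letter changes.
values=$(printf '%s\n' "$log" |
    sed -E -e '/need/d' \
        -e 's/.*(for: |java\.naming\.security|principal=|credentials=|provider\.url = )//')
printf '%s\n' "$values"
```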