How do I - not print

ZimMonkey · 08-20-2009, 09:10 AM

I'm trying to clean up a fairly messy script and need a pointer in the right direction. Here's what I'm working with...

file
gibberish 5 Monkey $gibberish $gibberish
gibberish 8 Santa Claus $gibberish $gibberish
gibberish 2 Evil Robot Army $gibberish $gibberish
gibberish 7 Global Thermal Nuclear War $gibberish $gibberish

I want to get rid of the gibberish, (and $gibberish). Here's what I did (and it's messy)

for filename in *; do
awk '$4 ~ /\$/' $filename | awk '$5 ~ /\$/ {print $3,$2}' > a
awk '$5 ~ /\$/' $filename | awk '$6 ~ /\$/ {print $3,$4,$2}' > b
awk '$6 ~ /\$/' $filename | awk '$7 ~ /\$/ {print $3,$4,$5,$2}' > c
awk '$7 ~ /\$/' $filename | awk '$8 ~ /\$/ {print $3,$4,$5,$6,$2}' > d
cat a b c d > $filename
done

In my case the awk list is actually much longer because the gibberish extends out to 15 fields. My only saving grace is that the pattern is the same, where I don't want the first, or last 2 fields. Is there a cleaner way to use awk so it prints everything but the first and last 2 fields? Or am I stuck with the ugly mess?

Thanks

Zim

centosboy · 08-20-2009, 09:22 AM


This might help: a Perl one-liner.
Shorter, quicker, cleaner.

Code:

perl -ne 's/(gibberish |\$gibberish)//g; print' filename

Or, if you're confident, edit the file in place; the .bak suffix keeps a backup of the original.

Code:

perl -pi.bak -e 's/(gibberish |\$gibberish)//g' filename

ilikejam · 08-20-2009, 09:27 AM

Hi.

How about something like:
$ awk '{ for (i=3;i<(NF-2);i++) { printf "%s ", $i }; if (i == (NF-2)) print $i }' /path/to/input/file

Dave

ZimMonkey · 08-20-2009, 06:00 PM

Thanks for the replies.

ilikejam, I tried your code and it removed the first and last fields, leaving the second-to-last field of gibberish still there. I'll try to do some tweaking.

centosboy, I just don't know enough about Perl to go down that road just yet. I'm still trying to get a handle on awk, so it will be a little while before I make that jump.

Thank you both.

ilikejam · 08-20-2009, 06:37 PM

Uh, that's odd. Using the 'file' you gave in your original post, I get back:

Code:

Monkey
Santa Claus
Evil Robot Army
Global Thermal Nuclear War

from the awk line I posted.

Dave
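
To reproduce this result, here is a minimal sketch. It assumes the four sample lines from the first post are saved to a scratch file; the `sample` and `out` variable names are only for illustration.

```shell
# Recreate the simplified sample from the first post in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
gibberish 5 Monkey $gibberish $gibberish
gibberish 8 Santa Claus $gibberish $gibberish
gibberish 2 Evil Robot Army $gibberish $gibberish
gibberish 7 Global Thermal Nuclear War $gibberish $gibberish
EOF

# Print fields 3 through NF-2: the loop emits each field before NF-2
# followed by a space, then the final print emits field NF-2 and a newline.
out=$(awk '{ for (i=3;i<(NF-2);i++) { printf "%s ", $i }; if (i == (NF-2)) print $i }' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

On this simplified input the command keeps only the words between the number and the two `$gibberish` fields, which matches the output quoted above.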

ZimMonkey · 08-20-2009, 07:36 PM

ilikejam, thanks again for your response. I seem to be having a few issues with this. The "file" that I gave was obviously a generalization of the problem that I'm having. To be more accurate, the files that I'm trying to make neater look like this...

2097772 81264 BOOT 1983603 4/30/2007 1 $2.30 $2.30
2612268 023031COUPLING COUPLING, SPLINED HYDRAULIC MOTOR BRIDGE 2032363 6/25/2007 1 $4.60 $4.60
266586 60583203 BULB, PANEL LIGHT 2008627 5/29/2007 1 $0.50 $0.50
1995423 SP16F COLLAR, SPLIT 2 PIECE 1935593 3/9/2007 2 $3.80 $7.60

Where the outcome needs to be effectively description, then part #

BOOT 81264
COUPLING, SPLINED HYDRAULIC MOTOR BRIDGE 023031COUPLING
BULB, PANEL LIGHT 60583203
COLLAR, SPLIT 2 PIECE SP16F

When I use the script you wrote, (for the first one) I get

BOOT 1983603 4/30/2007 1 $2.30

Fields 1 and 2 and the final field are removed. When I copied and pasted this here, I noticed there are spaces after the final number. I don't know if that has anything to do with anything. I do know that when I use my lengthy code it does work. I wouldn't think that the lines needed to be in order of length, or do they? Do the lines have to go sequentially in length for this to work (4 fields, 5, 6, 7), as they were in my post?

Sorry for not making things more clear from the start, I was hoping to be able to learn from your example, and tweak it to suit my needs. Apparently my vagueness caused confusion. I'll keep on trying.

Thanks again,

Zim
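
One possible explanation, offered as a guess rather than a diagnosis: if each line ends with a space followed by a carriage return (as a file exported from Windows might), common awk implementations count the lone carriage return as an extra field, so NF is one higher than it looks. That would account for fields 3 through NF-2 still including one price. The sketch below builds such a line by hand for illustration; it is not taken from the real file.

```shell
# Hypothetical line: the first sample record with " \r" appended,
# mimicking a trailing space plus a DOS-style carriage return.
line=$(printf '2097772 81264 BOOT 1983603 4/30/2007 1 $2.30 $2.30 \r')

# Eight visible fields, but awk reports nine because "\r" is not
# whitespace for the default field separator.
nf_raw=$(printf '%s\n' "$line" | awk '{ print NF }')

# Stripping the carriage return first restores the expected count.
nf_clean=$(printf '%s\n' "$line" | tr -d '\r' | awk '{ print NF }')

echo "with CR: $nf_raw, without CR: $nf_clean"
```

Running `od -c` on the real file would show whether a `\r` (or some other invisible character) is actually present at the end of each line.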

centosboy · 08-21-2009, 05:12 AM

Quote:

Originally Posted by ZimMonkey

Thanks for the replies.

ilikejam, I tried your code and it removed the first and last fields, leaving the second-to-last field of gibberish still there. I'll try to do some tweaking.

centosboy, I just don't know enough about Perl to go down that road just yet. I'm still trying to get a handle on awk, so it will be a little while before I make that jump.

Thank you both.


Fair enough, but my example doesn't need much Perl knowledge, just regular expressions and what the extra Perl flags mean, which perl -h lists anyway.

ilikejam · 08-21-2009, 05:33 AM

Ah. OK.

Try:

Code:

awk '{ for (i=3;i<(NF-4);i++) { printf "%s ",$i }; print $2 }'
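
As a quick check, here is a sketch that runs this command against the four sample lines posted earlier in the thread. It assumes the lines contain no stray trailing characters; the `sample` and `out` names are only for illustration.

```shell
# The sample records from the thread, saved to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2097772 81264 BOOT 1983603 4/30/2007 1 $2.30 $2.30
2612268 023031COUPLING COUPLING, SPLINED HYDRAULIC MOTOR BRIDGE 2032363 6/25/2007 1 $4.60 $4.60
266586 60583203 BULB, PANEL LIGHT 2008627 5/29/2007 1 $0.50 $0.50
1995423 SP16F COLLAR, SPLIT 2 PIECE 1935593 3/9/2007 2 $3.80 $7.60
EOF

# The loop prints fields 3 through NF-5 (the description), each followed
# by a space; print $2 then appends the part number and a newline.
out=$(awk '{ for (i=3;i<(NF-4);i++) { printf "%s ",$i }; print $2 }' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

The loop bound `i<(NF-4)` is what drops the last five fields, regardless of how many words the description has.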

theNbomr · 08-21-2009, 09:33 AM

Quote:

Originally Posted by ZimMonkey

The "file" that I gave was obviously a generalization of the problem that I'm having. To be more accurate, the files that I'm trying to make neater look like this...

2097772 81264 BOOT 1983603 4/30/2007 1 $2.30 $2.30
2612268 023031COUPLING COUPLING, SPLINED HYDRAULIC MOTOR BRIDGE 2032363 6/25/2007 1 $4.60 $4.60
266586 60583203 BULB, PANEL LIGHT 2008627 5/29/2007 1 $0.50 $0.50
1995423 SP16F COLLAR, SPLIT 2 PIECE 1935593 3/9/2007 2 $3.80 $7.60

Where the outcome needs to be effectively description, then part #

BOOT 81264
COUPLING, SPLINED HYDRAULIC MOTOR BRIDGE 023031COUPLING
BULB, PANEL LIGHT 60583203
COLLAR, SPLIT 2 PIECE SP16F

When I use the script you wrote, (for the first one) I get

BOOT 1983603 4/30/2007 1 $2.30

Your edited sample may have been obvious to you, but to everyone else, it wasn't.

When trying to solve these kinds of problems, it is very useful to express the requirements in terms of how you might perform the modifications if you were doing it manually. Using language that describes the input in terms of fields is a good start, and describing how one might identify specific fields of interest is also useful.
So, for example, you might describe the input as "whitespace delimited fields". In this case, that is actually mostly inaccurate, since one field evidently has embedded whitespace. Now, then, the challenge is to unambiguously describe how to break down the elements of the input. It looks like we can still use the concept of whitespace-delimited fields if we measure the location of the fields in different ways, and perhaps in terms of what we do not want. It looks like we do not want the first field. It looks like we do not want the last 5 fields. And, finally we want to print the result in a different order from the input data.

If this is an accurate description of the problem, then the regular expressions and field-indexing gymnastics have practically written themselves. You have said that you aren't up to the task of seeing this as a Perl problem, but I can tell you that if the description of the method for solving the problem is correct, then Perl has some constructs and elements that are particularly well suited to this problem. So, is the description I presented accurate? Should we proceed on to the solution?

--- rod.
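
One way to test a description like this before writing the real command is to have awk report, for each line, which fields the description would select. This is only a sketch under the assumptions stated in the post (drop the first field, drop the last 5); the `lines` and `report` names are illustrative.

```shell
# Two of the sample lines from the thread.
lines='2097772 81264 BOOT 1983603 4/30/2007 1 $2.30 $2.30
266586 60583203 BULB, PANEL LIGHT 2008627 5/29/2007 1 $0.50 $0.50'

# NF is the field count, so the description spans fields 3 through NF-5
# and field 2 is the part number.
report=$(printf '%s\n' "$lines" | awk '{
    printf "NF=%d part=%s description=fields 3..%d\n", NF, $2, NF - 5
}')
printf '%s\n' "$report"
```

If the reported ranges match what you would pick out by hand, the description is probably accurate.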

ZimMonkey · 08-21-2009, 01:27 PM

ilikejam, thanks for your help.

theNbomr, I will fully admit that I'm learning as I go here. I was actually hoping for - as I said - a pointer, not someone to spoonfeed me the code (I can't deny that it saved me a lot of time). Even in your own post you gave me the pointers of field indexing, and regular expressions.

So to make the question more clear...

I have this in a file...

2097772 81264 BOOT 1983603 4/30/2007 1 $2.30 $2.30
2612268 023031COUPLING COUPLING, SPLINED HYDRAULIC MOTOR BRIDGE 2032363 6/25/2007 1 $4.60 $4.60
266586 60583203 BULB, PANEL LIGHT 2008627 5/29/2007 1 $0.50 $0.50
1995423 SP16F COLLAR, SPLIT 2 PIECE 1935593 3/9/2007 2 $3.80 $7.60

I need to get to this...

BOOT 81264
COUPLING, SPLINED HYDRAULIC MOTOR BRIDGE 023031COUPLING
BULB, PANEL LIGHT 60583203
COLLAR, SPLIT 2 PIECE SP16F

This is the code that I used which is quite messy, and I would like a pointer or suggestion as to how to make my code less messy. I do not want someone to write the code for me, I would like a pointer in the right direction.

for filename in *; do
awk '$4 ~ /\$/' $filename | awk '$5 ~ /\$/ {print $3,$2}' > a
awk '$5 ~ /\$/' $filename | awk '$6 ~ /\$/ {print $3,$4,$2}' > b
awk '$6 ~ /\$/' $filename | awk '$7 ~ /\$/ {print $3,$4,$5,$2}' > c
awk '$7 ~ /\$/' $filename | awk '$8 ~ /\$/ {print $3,$4,$5,$6,$2}' > d
cat a b c d > $filename
done

It appears that there are some extra spaces at the end of each line. I don't know if that makes any difference, but it might be helpful. So as best as I can describe, I want to remove the first, and last 5 fields, then place the second field at the end of the line. To be more accurate still, the file has a "field range" from 5 to 20. So my awk list is very bulky. As I already stated, my code works, I would like some pointers on how to clean it up. How should I proceed?

I'm rather new to Linux, so right now Perl is not on the table, but when the time comes, I will be asking about it too.

Thanks for your help,

Zim

theNbomr · 08-21-2009, 07:11 PM

Quote:

So as best as I can describe, I want to remove the first, and last 5 fields, then place the second field at the end of the line. To be more accurate still, the file has a "field range" from 5 to 20.

Perfect. You have described in unambiguous terms what you wish to do. Now, you simply have to translate those terms to code (being unambiguous really helps with that part). The tricky part of your problem is that there are a variable number of fields, and you want to reference the fields numbering backward from the last field. You can index your fields using awk's built-in 'NF' variable. It holds the number of fields in the current line, so $NF is the last field. Indexing the second last, third last, etc. would involve indexing with 'NF-1', 'NF-2', etc. So, your fields are named like

Code:

$NF
 or
$(NF-0)
 or
$(NF-5)

Replacing the '0' with a variable 'i', you can print something like

Code:

print $(NF-i)
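
To see what `$(NF-i)` selects, here is a small sketch that passes `i` in from the shell with awk's `-v` option and applies it to the first sample line. The variable names `line`, `last`, `date_field` and `desc_end` are only for illustration.

```shell
line='2097772 81264 BOOT 1983603 4/30/2007 1 $2.30 $2.30'

# i=0 selects the last field, i=1 the one before it, and so on.
last=$(printf '%s\n' "$line" | awk -v i=0 '{ print $(NF-i) }')
date_field=$(printf '%s\n' "$line" | awk -v i=3 '{ print $(NF-i) }')
desc_end=$(printf '%s\n' "$line" | awk -v i=5 '{ print $(NF-i) }')

echo "$last | $date_field | $desc_end"
```

With five trailing fields to discard, `$(NF-5)` lands on the last word of the description, whatever the length of the line.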

Since i is a variable, you can modify it, such as using it as a loop counter:

Code:

for( i = 0; i < NF; i++ ){

}
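
As an illustration of the loop on its own, this sketch walks a short made-up line from its last field to its first. It is not the solution to the original problem, just the counter in action; the `reversed` name is illustrative.

```shell
# Walk a short made-up line from the last field to the first.
reversed=$(printf 'alpha beta gamma\n' | awk '{
    for (i = 0; i < NF; i++) {
        print i, $(NF-i)   # i counts back from the end of the line
    }
}')
printf '%s\n' "$reversed"
```

Changing the start and end values of `i` is what lets you skip fields at either end of the line.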

I could put it all together for you, but there is enough there to point you in the right direction, and still leave plenty of room for learning.

While your code is somewhat 'messy', if it works, there's nothing wrong with it. It is good to try to improve upon your work, and learn new things.

--- rod.
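
For readers who want to see the pointers assembled, here is one possible sketch of the whole clean-up loop. It assumes every line has at least six fields and no stray trailing characters; `clean_file` and `parts.txt` are hypothetical names, and a single temporary file stands in for the a, b, c, d files of the original script.

```shell
# Hypothetical helper: rewrite one file so each line becomes
# "description part-number".
clean_file() {
    tmp=$(mktemp) || return 1
    # Fields 3..NF-5 are the description; field 2 is the part number.
    if awk '{ for (i = 3; i <= NF - 5; i++) printf "%s ", $i; print $2 }' "$1" > "$tmp"; then
        cat "$tmp" > "$1"    # overwrite in place, keeping the file's permissions
    fi
    rm -f "$tmp"
}

# Demonstration on a scratch directory holding one sample file.
dir=$(mktemp -d)
printf '%s\n' \
    '2097772 81264 BOOT 1983603 4/30/2007 1 $2.30 $2.30' \
    '1995423 SP16F COLLAR, SPLIT 2 PIECE 1935593 3/9/2007 2 $3.80 $7.60' \
    > "$dir/parts.txt"

for filename in "$dir"/*; do
    clean_file "$filename"
done

result=$(cat "$dir/parts.txt")
printf '%s\n' "$result"
rm -rf "$dir"
```

Because the loop bound depends on NF, the same command covers every line length, so the list of per-length awk calls is no longer needed.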