LinuxQuestions.org


marky9074 11-04-2012 01:30 AM

Script to remove dynamic array from file - sed or grep
 
Hi guys, I am trying to search a file for a list of matching lines and then delete all but one of them. All of these lines are 80 characters long (padded with spaces), so I need to retain the padding.

Code:

H0019 1  1 06466745.12N 00513800.45E                                         
H0019 1  2 06467464.46N 00512968.25E                                         
H0019 1  3 06467783.25N 00512599.43E                                         
H0019 1  4 06467963.08N 00512391.38E                                         
H0019 1  5 06468682.42N 00511559.18E                                         
H0019 1  6 06469001.21N 00511190.36E                                         
H0019 1  7 06469189.22N 00510972.86E                                         
H0019 1  8 06469499.84N 00510613.50E                                         
H0019 1  9 06470186.48N 00509819.12E                                         
H0019 1  10 06470505.27N 00509450.30E                                         
H0019 1  11 06470693.28N 00509232.80E                                         
H0019 1  12 06471012.07N 00508863.98E                                         
H0019 1  13 06471715.06N 00508050.69E                                         
H0019 1  14 06472033.85N 00507681.87E                                         
H0019 1  15 06472221.86N 00507464.37E                                         
H0019 1  16 06472924.85N 00506651.08E                                         
H0019 1  17 06473243.64N 00506282.26E                                         
H0019 1  18 06473423.47N 00506074.21E                                         
H0019 1  19 06474142.81N 00505242.01E                                         
H0019 1  20 06474461.60N 00504873.19E                                         
H0019 1  21 06474649.61N 00504655.69E                                         
H0019 1  22 06474960.23N 00504296.33E                                         
H0019 1  23 06475646.87N 00503501.95E                                         
H0019 1  24 06476333.24N 00502707.38E

The above would be the result if I used grep to look for H0019 in the file. I could use

Code:

grep -n "H0019" file | cut -f1 -d:
or

Code:

sed -n '/H0019/=' file
to return a list of row numbers. But how can I get this to loop around and delete (presumably using sed) all rows bar the last one? In addition I want to rename the last one in this example.

Code:

H0019 1  24
to
Code:

H0019 1  1
Any help would be much appreciated. I guess I should also handle multiple files, rather than assume I am only going to process one file at a time.

Thanks,

Mark

pixellany 11-04-2012 04:37 AM

To return only the last line (in a file or data stream):
Code:

sed -n '$p' filename
To also modify the last line:
Code:

sed -n '$s/old/new/p' filename

marky9074 11-04-2012 06:41 AM

OK, so I cobbled together a few things to get something similar to what I require:

Code:

grep H0019 header.p2 | sed -n '$s/H0019/E0019/p' > temp
grep -v H0019 header.p2 > temp2
sed -i 's/E0019/H0019/g' temp
sed -i '/H0018/ r temp' temp2

But it still doesn't get around the fact that, in my example, I will get back

Code:

H0019 1  24
Instead of

Code:

H0019 1  1
Plus, I have now ended up with another file rather than doing it in the stream :/

grail 11-04-2012 06:50 AM

Can we go back a step ... does the file in question only contain those 5 lines? Or are we likely to find the pattern amongst other lines and so need to preserve other data?

marky9074 11-04-2012 08:19 AM

No, the files have lots of other lines in them, this is just the header of the file, but the record is unique, so we can work with just searching for H0019. It would be impossible for H0019 to be in the data part of the file, so there is no need to think about preservation etc..

Cheers,

Mark

grail 11-04-2012 09:38 AM

Sorry to harp on, but does this mean the position in the file of the last entry is important, or only the fact that you end up with 'H0019 1' in the file, i.e. you could print the first and delete the rest?

marky9074 11-04-2012 10:24 AM

I can't print the first and delete the rest, as the only line that is correct is the last one... but as the array length changes in every file, it is difficult to pin down. For example, say in one file H0019 is present in rows 10-20: I want to delete 10-19 and keep 20, but rename it as H0019 1 rather than H0019 10 (every time there is an H0019 it increments by one, so given that I only want one record it will always be H0019 1). In the next file the rows could be 10-30, and I would keep row 30, etc.

Hope that makes sense...
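
The requirement described above can be sketched as a two-pass awk script: the first pass records the line number of the last H0019 record, the second drops the others, renumbers the survivor and pads it back to 80 columns. This is only a sketch under assumptions: the function name keep_last_h0019 is made up for illustration, and the pattern assumes the second field is always 1, as in the sample data.

```shell
# Hypothetical helper: print the file with only its last H0019 record kept,
# renumbered to 1 and padded to 80 columns.
keep_last_h0019() {
    awk '
        # Pass 1: remember the line number of the last H0019 record.
        NR == FNR { if (/^H0019/) last = FNR; next }
        # Pass 2: drop every H0019 record except the last one.
        /^H0019/ && FNR != last { next }
        FNR == last {
            sub(/^H0019 1 +[0-9]+/, "H0019 1  1")   # renumber to 1
            sub(/ +$/, "")                           # strip the old padding
            $0 = sprintf("%-80s", $0)                # pad back to 80 columns
        }
        { print }
    ' "$1" "$1"
}

# Possible use over several files (the *.p2 glob is an assumption):
# for f in *.p2; do keep_last_h0019 "$f" > "$f.new" && mv "$f.new" "$f"; done
```

Reading the file twice avoids having to buffer anything, at the cost of a second pass over the data.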

grail 11-04-2012 11:12 AM

Well, I was originally trying to come up with a solution that would deliver your value in one read of the file, but as we do not know whether another entry will be found until we hit it, we may not be able to replace the line where it is in the file (is this a problem?). My idea would be to place the entry at the end of the file, i.e. as the last line ... is this any good?

Otherwise the multi-pass idea you are presently using would have to do.
Here is another idea on the multi-pass:
Code:

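# Line number of the last H0019 record
l=$(awk '/H0019/{x=NR}END{print x}' file)

# Delete the earlier H0019 lines, then edit the last one (\$ stops the shell expanding sed's last-line address)
sed -i "1,$((l-1)){/H0019/d};$l,\${/H0019/s/.$/1/}" file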


marky9074 11-04-2012 12:20 PM

As awk returns the line we want to keep, and the starting row is always static, could we then do it in a single pass by deleting the unwanted rows before sed substitutes the text?

David the H. 11-04-2012 12:46 PM

Since sed's workflow is one-way, probably the easiest thing to do is start working from the end of the file.

Code:

tac file | sed '0,/H0019/! { /H0019/d } ; /H0019/ s/[0-9]\+$/1/' | tac
tac prints the lines of the file in reverse order. The first sed expression then ignores (!) everything from the start of the (reversed) file to the first desired entry, and deletes any found in the rest of it. The second expression modifies the one line remaining. Finally just re-reverse the file with another tac command.
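
Where the GNU-only `0,/re/` address is not available, the same reverse-and-filter idea can be sketched with awk doing the filtering between the two tac calls. This is a sketch under assumptions: the function name last_h0019_via_tac is invented for illustration, the records are assumed to look like `H0019 1  N ...`, and tac must exist on the system.

```shell
# Hypothetical helper: reverse the file, keep only the first H0019 hit
# (which is the last record in the original order), then reverse again.
last_h0019_via_tac() {
    tac "$1" |
    awk '
        /^H0019/ {
            if (seen++) next                        # later hits are earlier records: drop them
            sub(/^H0019 1 +[0-9]+/, "H0019 1  1")   # first hit is the last record: renumber it
            $0 = sprintf("%-80s", $0)               # pad back to 80 columns
        }
        { print }
    ' |
    tac
}
```

Because the renumbered line can only get shorter, padding with `%-80s` is enough to restore the fixed width.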


Edit: Here's another option I came up with that uses ed. It's a bit clunky, but at least only a single command is required. There are probably other, better ways to do it.

Code:

printf '%s\n' '?H0019? s/[0-9]\+$/1#/' 'g/H0019.*[^#]$/ d' '/H0019.*#$/ s/#$//' '%p' | ed -s file
'?..?' is like '/../', except that it searches backwards through the file. Since ed starts with the last line as the working line, it means that it will match the last entry in the file. We then modify it to end with '1#'.

Next, we globally delete all lines that match the pattern, except the one with the '#' at the end.

Finally we remove the '#' from the remaining line and output the result. '%p' prints the entire file to stdout. Change it to 'w' to write the changes back to the original file.


How to use ed:
http://wiki.bash-hackers.org/howto/edit-ed
http://snap.nlc.dcccd.edu/learn/nlc/ed.html
(also read the info page)

marky9074 11-04-2012 02:16 PM

Interesting, the tac option just seemed to vape all H0019 lines in my example....

Edit: If I change it to:

Code:

tac file | sed '1,/H0019/! { /H0019/d } ; /H0019/ s/[0-9]\+$/1/' | tac
Adding the '1' after sed, it keeps the line I want, but doesn't rename it..

Ahh, I see my example was wrong initially.. I've updated it in the original post.

I'm playing with the ed one now, but it is complaining at the end (I am using busybox/mobaxterm) about file not found.. what is the -s switch for?

grail 11-05-2012 11:04 AM

hmmm ... so I am confused again, mainly by the example data (which has now changed).

Your example seems to imply that all lines that contain H0019 will be consecutive. Is this correct?

If we assume only 5 lines of your new example, could it perhaps look like the following:
Code:

blah blah
foo bar
H0019 1  1 06466745.12N 00513800.45E                                         
H0019 1  2 06467464.46N 00512968.25E                                         
H0019 1  3 06467783.25N 00512599.43E                                         
H0019 1  4 06467963.08N 00512391.38E                                         
H0019 1  5 06468682.42N 00511559.18E
more stuff here
and here

If the above is likely, then the following awk command would create a new file with the relevant data:
Code:

awk '/^H0019/{x=1;$3=1;l=$0}x && !/^H0019/{print l;x=0}!x' old_file > new_file

marky9074 11-05-2012 12:07 PM

Hi there,

Yes, sorry about that. I didn't realise I had messed up my example, so I reposted it with a little more detail, but you're correct: there is stuff above and below.

Will try awk now!

Edit: Ok that works, but the whole line is 80 characters (padded by spaces to the end), and the substitution just has the data part (and has shuffled up a couple of characters at the start). That said, the part after the E is always the same number of characters, so should be easy to pad out?

Thanks,

Mark

grail 11-06-2012 09:29 AM

Yes, awk has a printf statement, so you can format the output as you prefer.
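
As a rough illustration of that suggestion, the single-pass approach can rebuild the kept record with explicit spacing and pad it to 80 columns via `%-80s`. This is a sketch, not the original poster's final script: the function name reformat_h0019 is hypothetical, and it assumes the H0019 lines are consecutive and have exactly the five fields shown in the sample data.

```shell
# Hypothetical helper: single pass, keeps the last line of a consecutive
# H0019 block and prints it renumbered and padded to 80 columns.
reformat_h0019() {
    awk '
        /^H0019/ {
            # Rebuild the record with the original spacing, then pad it.
            l = sprintf("%-80s", sprintf("%s %s  1 %s %s", $1, $2, $4, $5))
            x = 1
            next
        }
        x { print l; x = 0 }    # first line after the block: emit the kept record
        { print }
        END { if (x) print l }  # the block ran to the end of the file
    ' "$1"
}
```

Building the line from the individual fields sidesteps the whitespace collapse that happens when awk reassembles `$0` after a field assignment.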

David the H. 11-06-2012 09:58 AM

Quote:

Originally Posted by marky9074 (Post 4822107)
Adding the '1' after sed, it keeps the line I want, but doesn't rename it..

This is where it becomes important to state the environment you're using, if it's non-standard in some way. The '0' address is a GNU addition to sed (it allows an address range to work even if the 2nd pattern appears on line one), and is likely not available in the busybox implementation, which generally strips its commands down to only the most basic features.
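
The difference between the two address forms can be seen on a tiny input whose first line already matches the pattern. This sketch needs GNU sed, since `0,/re/` is an extension; the variable names are only for illustration.

```shell
# Requires GNU sed: the 0,/re/ address form is an extension.
sample=$(printf '%s\n' H0019 other H0019)

# 0,/H0019/ may end on line 1, so only that line is deleted.
with_zero=$(printf '%s\n' "$sample" | sed '0,/H0019/d')

# 1,/H0019/ starts looking for the end on line 2, so the range runs to line 3.
with_one=$(printf '%s\n' "$sample" | sed '1,/H0019/d')

printf 'with 0: [%s]\nwith 1: [%s]\n' "$with_zero" "$with_one"
```

With `1,/H0019/` the whole sample is deleted, which is why the two forms only behave alike when the pattern does not occur on the first line of the (reversed) input.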

Quote:

I'm playing with the ed one now, but it is complaining at the end (I am using busybox/mobaxterm) about file not found.. what is the -s switch for?
-s is the "silent" option. It simply allows you to feed scripted commands into it without getting unnecessary feedback.

Again though, you'll need to check the busybox documentation to see what features its versions of the commands support.

After seeing your revised input data, here's an update to my ed command too.

Code:

printf '%s\n' '?H0019? s/\(.\{8\}\).../#@\1  1/' 'g/^H0019/d' '/^#@/ s///' '%p' | ed -s filename
Since the column positions now appear to be fixed, it becomes easier. I changed the first command so that it simply matches the first 11 columns of the line. Then it keeps the first 8 and replaces the last three with ' 1'. It also adds the unique string #@ to the beginning of the line this time.

The second expression can now simply globally delete all lines that match the pattern, except for the one with #@ on it, naturally. The third command again follows up by removing the extra string, only now it too can be greatly simplified.


All times are GMT -5.