deleting multiple consecutive lines in files

tabbygirl1990 · 12-09-2013, 11:42 AM

hi guys,

i'm running a matlab code that can take a couple of days to run, so to speed things up i break up the input file into 6 pieces and then run the code on 6 different machines. when the results are done, i "stitch" the six output files together. well, sometimes i overlap the stitching and end up with one extra line, and that messes up some other stuff that i do to make pretty plots. so i need a script that will check my "stitching" and remove any line whose first column repeats the value in the previous line (only the first column needs to match).

so far I've tried

Code:

awk 'a !~ $1; {a=$1}'

but it's not exactly right, and i think it checks the whole line, not just the first column. so to filter on just the first column i think i want to combine it with something like

Code:

 awk -F ' ' '$1== I DON'T KNOW, BUT IT CHECKS THE LINE ABOVE IT {count++}END{print count}'

this awk line also checks the whole line, and because of random processing the whole line won't necessarily be the same as its preceding line, but the first column will always be the same

here's some example data

Code:

0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
1 2 4 77.11 4.92 0 1.76 7.12
2 2 6 71.99 4.07 5 1 4.99

so in this case the 2nd and 3rd lines differ in the 3rd, 4th, 5th, and 8th columns, so the uniq command won't work either. but in the end i need to eliminate either line 2 or line 3, it doesn't really matter which

or maybe there is a better way to "cat" the files so the stitching comes out right, idk ???
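On the idea of fixing this at "cat" time: a rough sketch, assuming the pieces are plain text files joined in order and that an overlap only ever shows up as the first line of a piece repeating the first column of the line before it. The file names part1.txt and part2.txt are made up for illustration, and the rows are the example data above split in two.

```shell
# Two hypothetical output pieces whose boundary overlaps on first column 1.
dir=$(mktemp -d)
printf '%s\n' '0 2 4 75.87 3.33 0 1 5.23' '1 2 7 76.01 5.11 0 1.76 7.11' > "$dir/part1.txt"
printf '%s\n' '1 2 4 77.11 4.92 0 1.76 7.12' '2 2 6 71.99 4.07 5 1 4.99' > "$dir/part2.txt"

# FNR == 1 is the first line of each file; NR > 1 excludes the very first file.
# At such a boundary, skip the line if its first column repeats the last one seen.
stitched=$(awk 'FNR == 1 && NR > 1 && $1 == prev { next } { print; prev = $1 }' \
    "$dir/part1.txt" "$dir/part2.txt")

printf '%s\n' "$stitched"
```

This only looks at file boundaries, so repeated first-column values inside a single piece are left alone.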

thanks soooo much, tabby

druuna · 12-09-2013, 11:49 AM

Have a look at this:

Code:

awk '!($1 in _) { _[$1]; print }' input

Example run with your small data set:

Code:

$ awk '!($1 in _) { _[$1]; print }' input 
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
2 2 6 71.99 4.07 5 1 4.99

colucix · 12-09-2013, 11:53 AM

I think the correct awk one-liner is a good solution:

Code:

awk '!_[$1]++' file

Here the $1-th element of the array _ is increased by one every time $1 is encountered. The NOT operator (the exclamation mark) causes awk to print the record only the first time $1 is encountered (0, which is FALSE, becomes TRUE, whereas any value greater than 0, which is TRUE, becomes FALSE). Hope this helps.
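One thing worth noting: the array-based one-liners drop every later line whose first column has been seen anywhere earlier in the file, not only the line directly above. If only back-to-back repeats should go, a variant along these lines might be closer to the original request. This is a sketch; the fifth row is invented purely to show the difference, and prev is just an illustrative variable name.

```shell
# Sample data from the thread plus one made-up row whose first column (0)
# reappears later, NOT adjacent to the earlier 0.
data=$(mktemp)
printf '%s\n' \
  '0 2 4 75.87 3.33 0 1 5.23' \
  '1 2 7 76.01 5.11 0 1.76 7.11' \
  '1 2 4 77.11 4.92 0 1.76 7.12' \
  '2 2 6 71.99 4.07 5 1 4.99' \
  '0 9 9 70.00 1.00 0 1 1.00' > "$data"

# Global de-duplication on column 1: keeps the first line per distinct value.
global=$(awk '!_[$1]++' "$data")

# Consecutive-only de-duplication: print a line unless its first column
# equals the first column of the line directly before it.
# NR == 1 guards the first line, because an unset prev can compare equal
# to a first column of 0.
consecutive=$(awk 'NR == 1 || $1 != prev { print } { prev = $1 }' "$data")

printf 'global:\n%s\n\nconsecutive:\n%s\n' "$global" "$consecutive"
```

With this toy data the global version keeps 3 lines and the consecutive-only version keeps 4, since the trailing 0 row is not next to the first one.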

sag47 · 12-09-2013, 12:27 PM

You can use the Matlab Parallel Computing Toolbox to speed up your computations. Since your job lends itself to data parallelism, it should be relatively easy.

What you're currently attempting is a poor man's map reduce (Which essentially splits the data, processes in parallel, and assembles the results). You should look up map reduce in Matlab to accomplish a similar effect. Matlab can handle this easily without you having to run it on a bunch of machines independently. If you truly need to use more than one machine then run Matlab in Cluster Configuration.

This is one of those types of questions where you ask about a tree but what you really want is the forest. i.e. you want your calculations to run faster using parallelism but you're trying to re-invent a solution.

I can see using this poor man's map reduce on a single threaded system where single threading is your only option. An example of a single thread application which processes data and stats would be R. In this case it can be automated. See the split command in Linux to automatically split data. One can launch independent processes so you can launch multiple processes on the same machine and get a similar result. As long as there are multiple cores on your system I recommend launching 1 process per core for processing your data. e.g. if you have an 8 core machine then you should split your data into 8 parts and launch 8 processes. However, in the case of matlab I recommend using its built-in parallelism functions.
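The split-and-launch approach described above could be sketched roughly as follows. This assumes the input is line-oriented and each line can be processed independently; process_chunk is a hypothetical placeholder for the real (slow) computation, and the toy input is just the numbers 1 to 20.

```shell
# Sketch: split an input file into line-aligned chunks, process each chunk in
# its own background process, then stitch the results back together in order.
workdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 20; i++) print i }' > "$workdir/input.txt"

n=6                                      # desired number of parallel processes
lines=$(wc -l < "$workdir/input.txt")
per_chunk=$(( (lines + n - 1) / n ))     # ceiling division, lines per chunk

# split writes chunk_aa, chunk_ab, ... with at most per_chunk lines each
split -l "$per_chunk" "$workdir/input.txt" "$workdir/chunk_"

process_chunk() {
    # placeholder work (hypothetical): double every value
    awk '{ print $1 * 2 }' "$1" > "$1.out"
}

for chunk in "$workdir"/chunk_??; do
    process_chunk "$chunk" &             # one background process per chunk
done
wait                                     # block until every chunk is done

# The glob expands in sorted order, so the pieces are joined in input order.
cat "$workdir"/chunk_??.out > "$workdir/output.txt"
```

Because split cuts on line boundaries and never repeats a line, the stitched output should have no overlapping rows to clean up afterwards.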

SAM

tabbygirl1990 · 12-09-2013, 01:24 PM

Quote:

Originally Posted by sag47

You can use the Matlab Parallel Computing Toolbox to speed up your computations. Since your job lends itself to data parallelism, it should be relatively easy.

What you're currently attempting is a poor man's map reduce (Which essentially splits the data, processes in parallel, and assembles the results). You should look up map reduce in Matlab to accomplish a similar effect. Matlab can handle this easily without you having to run it on a bunch of machines independently. If you truly need to use more than one machine then run Matlab in Cluster Configuration.

This is one of those types of questions where you ask about a tree but what you really want is the forest. i.e. you want your calculations to run faster using parallelism but you're trying to re-invent a solution.

I can see using this poor man's map reduce on a single threaded system where single threading is your only option. An example of a single thread application which processes data and stats would be R. In this case it can be automated. See the split command in Linux to automatically split data. One can launch independent processes so you can launch multiple processes on the same machine and get a similar result. As long as there are multiple cores on your system I recommend launching 1 process per core for processing your data. e.g. if you have an 8 core machine then you should split your data into 8 parts and launch 8 processes. However, in the case of matlab I recommend using its built-in parallelism functions.

SAM

hi SAM,

my computer has 6 cpu so i'll look into your matlab parallel processing stuff on my machine. i run several sets of 6 across a network on other 6 cpu machines so i'll ask our IT guy if i can do this across the network, thanks!

---------- Post added 12-09-13 at 01:25 PM ----------

Quote:

Originally Posted by druuna

Have a look at this:

Code:

awk '!($1 in _) { _[$1]; print }' input

Example run with your small data set:

Code:

$ awk '!($1 in _) { _[$1]; print }' input 
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
2 2 6 71.99 4.07 5 1 4.99

that looks fairly straightforward, thanks!

tabbygirl1990 · 12-09-2013, 01:27 PM

Quote:

Originally Posted by colucix

I think the correct awk one-liner is a good solution:

Code:

awk '!_[$1]++' file

Here the $1-th element of the array _ is increased by one every time $1 is encountered. The NOT operator (the exclamation mark) causes awk to print the record only the first time $1 is encountered (0, which is FALSE, becomes TRUE, whereas any value greater than 0, which is TRUE, becomes FALSE). Hope this helps.

as always colucix comes up with the shortest easiest answer

though it did take me a few minutes to wrap my head around the logic.

thanks guys!

tabby

schneidz · 12-09-2013, 01:41 PM

just for shiggles:

Code:

[schneidz@mom ~]$ cat tabbygirl.txt 
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
1 2 4 77.11 4.92 0 1.76 7.12
2 2 6 71.99 4.07 5 1 4.99
[schneidz@mom ~]$ cat tabbygirl.ksh 
#!/bin/bash

# print a line only when its first column differs from the previous line's
col1=schneidz; while read -r line
do
 col1old=$col1
 col1=$(echo "$line" | awk '{print $1}')
 if [ "$col1" != "$col1old" ]
 then
  echo "$line"
 fi
done < "$1"
[schneidz@mom ~]$ ./tabbygirl.ksh tabbygirl.txt 
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
2 2 6 71.99 4.07 5 1 4.99

sag47 · 12-09-2013, 01:45 PM

Quote:

Originally Posted by tabbygirl1990

my computer has 6 cpu so i'll look into your matlab parallel processing stuff on my machine. i run several sets of 6 across a network on other 6 cpu machines so i'll ask our IT guy if i can do this across the network, thanks!

I'd also like to mention that if you take advantage of parallel computing within matlab your processing times will be much faster than your existing method. There is a lot of overhead when you launch all of those matlab processes (setup and breakdown of a whole matlab instance). You will not encounter as much overhead with threading. Your speed-up will be significantly more than what you're experiencing with your current (likely manual) split and run. Even if your split, run, and assemble is automated, using threading will be significantly faster because of less overhead.

I hypothesize that you running 6 threads in parallel on your single machine will likely be equivalent or better than your current attempt. Things will only get better by adding in clustering where you can split the data into even more parts (e.g. running 36 threads in parallel across 6x 6-core machines).

tabbygirl1990 · 12-09-2013, 02:06 PM

Quote:

Originally Posted by sag47

I'd also like to mention that if you take advantage of parallel computing within matlab your processing times will be much faster than your existing method. There is a lot of overhead when you launch all of those matlab processes (setup and breakdown of a whole matlab instance). You will not encounter as much overhead with threading. Your speed-up will be significantly more than what you're experiencing with your current (likely manual) split and run. Even if your split, run, and assemble is automated, using threading will be significantly faster because of less overhead.

I hypothesize that you running 6 threads in parallel on your single machine will likely be equivalent or better than your current attempt. Things will only get better by adding in clustering where you can split the data into even more parts (e.g. running 36 threads in parallel across 6x 6-core machines).

thanks SAM! tabby

tabbygirl1990 · 12-10-2013, 09:28 AM

Quote:

Originally Posted by sag47

I'd also like to mention that if you take advantage of parallel computing within matlab

good morning, i snooped around a bit last night and the parallel computing thing looks mighty scary and waaaaay past my ability. i called a friend of mine who knows a lot of matlab and he said he'd never messed with it. so the way i've been doing it is sloooow and brute force, but it works

sag47 · 12-10-2013, 05:57 PM

There are many examples of parallel computing. Typing help parfor in matlab shows the format of the parfor statement as one example. Here's a small example usage of parfor. If you wanted it to run with at most 6 workers over the loop then it would look something like this...

Code:

parfor (i = 1:lots, 6)   % second argument: maximum number of workers
   out(:,i)=do(something);
end

Where the 6 designates the maximum number of workers to run over the for loop. This will automatically split up the data and assemble it, similar to OpenMPI.

If you're new to Parallel programming or computing in general then I recommend you pick up a copy of "An Introduction to Parallel Programming" by Peter S. Pacheco ISBN: 978-0-12-374260-5. You're doing it the painful way with your current method.