Location: a warm beach, cool ocean breeze, nice waves, and a Margaritta
Distribution: RHEL 5.5 Tikanga
Posts: 63
deleting multiple consecutive lines in files
hi guys,
I'm running a Matlab code that can take a couple of days to run, so to speed things up I break the input file into 6 pieces and then run the code on 6 different machines. When the runs are done, I "stitch" the six output files together. Sometimes I overlap the stitching and end up with one extra line, and that messes up some other stuff that I do to make pretty plots. So I need a script that will check my stitching and remove any line whose first-column value repeats the first-column value of the line before it; only the first column should be compared.
so far I've tried
Code:
awk 'a !~ $1; {a=$1}'
but it's not exactly right, and I think it checks the whole line, not just the first column. To filter on just the first column, I think I want to combine it with something like
Code:
awk -F ' ' '$1== I DON'T KNOW, BUT IT CHECKS THE LINE ABOVE IT {count++}END{print count}'
I think this awk line would check the whole line too. Because of random processing, the whole line won't necessarily be the same as its preceding line, but the first column will always be the same.
So in this case the 2nd and 3rd lines are not identical in the 3rd, 4th, 5th, and 8th columns, so the uniq command won't work either. In the end I need to eliminate either line 2 or line 3; it doesn't really matter which.
Or maybe there is a better way, just by using "cat" on the two files so the stitching comes out right, idk ???
thanks soooo much, tabby
Last edited by tabbygirl1990; 12-09-2013 at 11:45 AM.
I think the correct awk one-liner is a good solution:
Code:
awk '!_[$1]++' file
Here the element of the array _ indexed by $1 is incremented every time that value of $1 is encountered. Because ++ is a post-increment, the test is evaluated with the old count: the NOT operator (the exclamation mark) turns 0 (FALSE) into TRUE the first time a value of $1 is seen, so awk prints the record, and turns any count greater than 0 (TRUE) into FALSE on later occurrences, so those records are skipped. Hope this helps.
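A small sketch to compare the one-liner with a variant that only looks at the line directly above, which is what the original question literally asked for. The function names drop_consecutive and drop_repeated and the five-line sample are made up for illustration; they are not from the thread. The difference only shows up if a first-column value can come back later in the file: the array version drops it wherever it reappears, the consecutive version keeps it.

```shell
#!/bin/sh
# Hypothetical helper names; the sample data below is invented for illustration.

# Drop a line only when its first column equals the first column
# of the line immediately before it.
drop_consecutive() {
    awk 'NR == 1 || $1 != prev { print } { prev = $1 }' "$@"
}

# The one-liner from the post above (array renamed to "seen"):
# keep the first line for each first-column value, drop every later repeat.
drop_repeated() {
    awk '!seen[$1]++' "$@"
}

sample='1 a
2 b
2 c
3 d
1 e'

echo "consecutive repeats removed:"
printf '%s\n' "$sample" | drop_consecutive

echo "all repeats removed:"
printf '%s\n' "$sample" | drop_repeated
```

If the first column only ever increases across the stitched files, both versions should give the same result, so the shorter one-liner is probably all that is needed. Both helpers also accept file names, e.g. drop_consecutive stitched.txt > cleaned.txt (file names assumed).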
You can use the Matlab Parallel Computing Toolbox to speed up your computations. Since your job is data-parallel, it should be relatively easy.
What you're currently attempting is a poor man's map reduce (which essentially splits the data, processes the pieces in parallel, and assembles the results). You should look up map reduce in Matlab to accomplish a similar effect. Matlab can handle this easily without you having to run it on a bunch of machines independently. If you truly need to use more than one machine, then run Matlab in a cluster configuration.
This is one of those questions where you ask about a tree but what you really want is the forest, i.e. you want your calculations to run faster using parallelism, but you're trying to re-invent a solution.
I can see using this poor man's map reduce with a single-threaded application, where single threading is your only option. An example of a single-threaded application that processes data and statistics would be R. In that case the approach can be automated: see the split command in Linux to split the data automatically, then launch several independent processes on the same machine to get a similar result. As long as there are multiple cores on your system, I recommend launching 1 process per core, e.g. if you have an 8-core machine then you should split your data into 8 parts and launch 8 processes. However, in the case of Matlab I recommend using its built-in parallelism functions.
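For the single-threaded case, the split, run, and assemble steps could be sketched roughly as below. This is only an illustration under assumptions: split_and_run is a hypothetical name, the worker is any command that reads a piece on stdin and writes its result to stdout, and the input is assumed to be a non-empty, line-oriented file whose lines can be processed independently.

```shell
#!/bin/sh
# Hypothetical helper: split_and_run INPUT PARTS COMMAND [ARGS...]
# Splits INPUT into about PARTS pieces by line count, runs COMMAND on each
# piece in the background, then prints the results in the original order.
split_and_run() {
    input=$1
    parts=$2
    shift 2

    workdir=$(mktemp -d) || return 1

    # Lines per piece, rounded up so that at most PARTS pieces are created.
    total=$(wc -l < "$input")
    per_part=$(( (total + parts - 1) / parts ))
    [ "$per_part" -gt 0 ] || per_part=1

    # split names the pieces part.aa, part.ab, ... in input order.
    split -l "$per_part" "$input" "$workdir/part."

    # One background process per piece.
    for piece in "$workdir"/part.*; do
        "$@" < "$piece" > "$piece.out" &
    done
    wait    # block until every background worker has finished

    # The glob sorts alphabetically, which matches the order of the pieces.
    cat "$workdir"/part.*.out
    rm -rf "$workdir"
}

# Example (file name assumed): upper-case a text file in 8 parallel pieces.
# split_and_run data.txt 8 tr 'a-z' 'A-Z' > result.txt
```

Because each piece gets whole lines and the outputs are joined in order, this kind of stitching should not produce an overlapping line in the first place.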
Original Poster
Quote:
Originally Posted by sag47
You can use the Matlab Parallel Computing Toolbox to speed up your computations. Since your job is capable of data parallelism threading it should be relatively easy.
SAM
hi SAM,
My computer has 6 CPUs, so I'll look into your Matlab parallel processing stuff on my machine. I run several sets of 6 across a network on other 6-CPU machines, so I'll ask our IT guy if I can do this across the network. Thanks!
---------- Post added 12-09-13 at 01:25 PM ----------
Quote:
Originally Posted by colucix
I think the correct awk one-liner is a good solution:
Code:
awk '!_[$1]++' file
As always, colucix comes up with the shortest, easiest answer, though it did take me a few minutes to wrap my head around the logic.
I'd also like to mention that if you take advantage of parallel computing within Matlab, your processing times will be much faster than with your existing method. There is a lot of overhead when you launch all of those Matlab processes (setup and teardown of a whole Matlab instance), and you will not encounter as much overhead with threading. Your speed-up will be significantly greater than what you're getting from your current (likely manual) split and run. Even if your split, run, and assemble steps are automated, using threading will be significantly faster because of the lower overhead.
I hypothesize that running 6 threads in parallel on your single machine will likely be equivalent to or better than your current approach. Things will only get better by adding clustering, where you can split the data into even more parts (e.g. running 36 threads in parallel across six 6-core machines).
Original Poster
Quote:
Originally Posted by sag47
I'd also like to mention that if you take advantage of parallel computing within matlab
Good morning. I snooped around a bit last night and the parallel computing thing looks mighty scary and waaaaay past my ability. I called a friend of mine who knows a lot of Matlab and he said he'd never messed with it. So the way I've been doing it is sloooow and brute force, but it works.
There are many examples of parallel computing. Running help parfor in Matlab shows the format of the parfor statement as one example. Here's a small example of parfor usage. If you wanted the loop to run with 6 workers, it would look something like this...
Code:
parfor (i = 1:lots, 6)   % 6 = maximum number of workers
    out(:,i) = do(something);   % do(something) is a placeholder for your per-column work
end
Here the 6 sets the maximum number of workers to run over the for loop. This will automatically split up the data and assemble it, similar to OpenMPI.
If you're new to parallel programming, or to computing in general, then I recommend you pick up a copy of "An Introduction to Parallel Programming" by Peter S. Pacheco, ISBN: 978-0-12-374260-5. You're doing it the painful way with your current method.