LinuxQuestions.org
Old 12-09-2013, 12:42 PM   #1
tabbygirl1990
Member
 
Registered: Jul 2013
Location: a warm beach, cool ocean breeze, nice waves, and a Margaritta
Distribution: RHEL 5.5 Tikanga
Posts: 63

Rep: Reputation: 1
deleting multiple consecutive lines in files


hi guys,

i'm running a matlab code that can take a couple of days to run, so to speed things up i break the input file into 6 pieces and then run the code on 6 different machines. when the runs are done, i "stitch" the six output files together. well, sometimes i overlap the stitching and end up with one extra line, and that messes up some other stuff that i do to make pretty plots. so i need a script that will check my "stitching" and remove a line when its first column has the same value as the first column of the previous line (only the first column matters).

so far I've tried

Code:
awk 'a !~ $1; {a=$1}'
but it's not exactly right, and i think it checks the whole line, not just the first column. so to filter on just the first column i think i want to combine it with something like

Code:
 awk -F ' ' '$1== I DON'T KNOW, BUT IT CHECKS THE LINE ABOVE IT {count++}END{print count}'
this awk line also checks the whole line, and because of random processing the whole line won't necessarily be the same as its preceding line, but the first column will always be the same

here's some example data

Code:
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
1 2 4 77.11 4.92 0 1.76 7.12
2 2 6 71.99 4.07 5 1 4.99
so in this case the 2nd and 3rd lines differ in the 3rd, 4th, 5th, and 8th columns, so uniq won't work either. but in the end i need to eliminate either line 2 or line 3, it doesn't really matter which

or maybe there is a better way to "cat" the two files so the stitching comes out right, idk ???

thanks soooo much, tabby

Last edited by tabbygirl1990; 12-09-2013 at 12:45 PM.
 
Old 12-09-2013, 12:49 PM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387
Have a look at this:
Code:
awk '!($1 in _) { _[$1]; print }' input
Example run with your small data set:
Code:
$ awk '!($1 in _) { _[$1]; print }' input 
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
2 2 6 71.99 4.07 5 1 4.99
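Note that the one-liner above keeps only the first line for each first-column value anywhere in the file, not just when the repeat sits on the line directly above. For stitched output where the first column never comes back to an earlier value, that gives the same result. If the first column can legitimately repeat further down, a variant that compares only against the previous line might look like the sketch below. The scratch file, the variable names and the fifth data line are illustrative additions, not part of the original data.

```shell
# Sample data from the first post, plus one extra (made-up) line at the end
# whose first column repeats a value seen earlier but not on the line above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
1 2 4 77.11 4.92 0 1.76 7.12
2 2 6 71.99 4.07 5 1 4.99
1 9 9 70.00 1.00 0 1 1.00
EOF

# Keep a line unless its first column equals the first column of the line
# directly above it. The NR == 1 guard always keeps the first line; without
# it, an unset "prev" can compare equal to a leading 0 and drop that line.
consecutive=$(awk 'NR == 1 || $1 != prev { print } { prev = $1 }' "$sample")

# For comparison: the "seen anywhere" approach also drops the last line,
# because the value 1 already appeared earlier in the file.
anywhere=$(awk '!($1 in seen) { seen[$1]; print }' "$sample")

printf '%s\n' "$consecutive"
```

With this input the previous-line variant keeps four lines (first columns 0, 1, 2, 1), while the seen-anywhere variant keeps three.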
 
Old 12-09-2013, 12:53 PM   #3
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
I think this awk one-liner is a good solution:
Code:
awk '!_[$1]++' file
Here the element of the array _ indexed by $1 is increased by one every time $1 is encountered. Because ++ is a post-increment, the test sees the value from before the increment. The NOT operator (the exclamation mark) causes awk to print the record only the first time $1 is encountered (0, which is FALSE, becomes TRUE, whereas any value greater than 0, which is TRUE, becomes FALSE). Hope this helps.
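To make the counter logic visible, here is a small sketch that prints, for each line of the sample data, the count stored for its first column before the increment and whether the one-liner would print the line. The names sample, trace, seen and before are made up for the illustration.

```shell
# Sample data from the first post, written to a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
1 2 4 77.11 4.92 0 1.76 7.12
2 2 6 71.99 4.07 5 1 4.99
EOF

# "before" holds the counter value from before the increment, which is the
# value that the ! operator tests in the one-liner.
trace=$(awk '{
    before = seen[$1]++
    printf "%s count_before=%d printed=%s\n", $1, before, (before ? "no" : "yes")
}' "$sample")

printf '%s\n' "$trace"
```

Only the third line reports a non-zero count, so it is the only one the one-liner suppresses.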
 
Old 12-09-2013, 01:27 PM   #4
sag47
Senior Member
 
Registered: Sep 2009
Location: Orange County, CA
Distribution: Kubuntu x64, Raspbian, CentOS
Posts: 1,831
Blog Entries: 36

Rep: Reputation: 451
You can use the Matlab Parallel Computing Toolbox to speed up your computations. Since your job is capable of data parallelism, threading it should be relatively easy.

What you're currently attempting is a poor man's map reduce (Which essentially splits the data, processes in parallel, and assembles the results). You should look up map reduce in Matlab to accomplish a similar effect. Matlab can handle this easily without you having to run it on a bunch of machines independently. If you truly need to use more than one machine then run Matlab in Cluster Configuration.

This is one of those types of questions where you ask about a tree but what you really want is the forest. i.e. you want your calculations to run faster using parallelism but you're trying to re-invent a solution.

I can see using this poor man's map reduce on a single-threaded system where single threading is your only option. An example of a single-threaded application which processes data and stats would be R. In this case it can be automated. See the split command in Linux to automatically split data. You can launch multiple independent processes on the same machine and get a similar result. As long as there are multiple cores on your system, I recommend launching 1 process per core for processing your data, e.g. if you have an 8-core machine then you should split your data into 8 parts and launch 8 processes. However, in the case of matlab I recommend using its built-in parallelism functions.
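As a rough sketch of that split, run and reassemble idea in plain shell: the file names, the chunk_ prefix and the process_chunk function below are all hypothetical, and the awk doubling step only stands in for the real per-chunk work. Because split cuts the input at exact line boundaries, the pieces do not overlap and concatenating the outputs in name order should give no duplicated line.

```shell
# Scratch directory with a small stand-in input file (20 numbered lines).
workdir=$(mktemp -d)
seq 1 20 > "$workdir/input.txt"

# Number of pieces; one per core is the suggestion above.
parts=4
total=$(wc -l < "$workdir/input.txt")
per=$(( (total + parts - 1) / parts ))   # lines per piece, rounded up

# Produces chunk_aa, chunk_ab, ... with no line shared between pieces.
split -l "$per" "$workdir/input.txt" "$workdir/chunk_"

# Placeholder for the real work: doubles the first column of each line.
process_chunk() {
    awk '{ print $1 * 2 }' "$1" > "$1.out"
}

# Launch one background process per piece, then wait for all of them.
for f in "$workdir"/chunk_??; do
    process_chunk "$f" &
done
wait

# The glob expands in sorted order, so the outputs are stitched in input order.
cat "$workdir"/chunk_??.out > "$workdir/result.txt"
```

This uses split -l with a computed line count rather than a chunk-count option, on the assumption that older versions of split may not have one.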

SAM

Last edited by sag47; 12-09-2013 at 01:33 PM.
 
Old 12-09-2013, 02:24 PM   #5
tabbygirl1990
Member
 
Registered: Jul 2013
Location: a warm beach, cool ocean breeze, nice waves, and a Margaritta
Distribution: RHEL 5.5 Tikanga
Posts: 63

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by sag47
You can use the Matlab Parallel Computing Toolbox to speed up your computations. Since your job is capable of data parallelism, threading it should be relatively easy.

What you're currently attempting is a poor man's map reduce (Which essentially splits the data, processes in parallel, and assembles the results). You should look up map reduce in Matlab to accomplish a similar effect. Matlab can handle this easily without you having to run it on a bunch of machines independently. If you truly need to use more than one machine then run Matlab in Cluster Configuration.

This is one of those types of questions where you ask about a tree but what you really want is the forest. i.e. you want your calculations to run faster using parallelism but you're trying to re-invent a solution.

I can see using this poor man's map reduce on a single-threaded system where single threading is your only option. An example of a single-threaded application which processes data and stats would be R. In this case it can be automated. See the split command in Linux to automatically split data. You can launch multiple independent processes on the same machine and get a similar result. As long as there are multiple cores on your system, I recommend launching 1 process per core for processing your data, e.g. if you have an 8-core machine then you should split your data into 8 parts and launch 8 processes. However, in the case of matlab I recommend using its built-in parallelism functions.

SAM
hi SAM,

my computer has 6 cpus so i'll look into your matlab parallel processing stuff on my machine. i run several sets of 6 across a network on other 6 cpu machines, so i'll ask our IT guy if i can do this across the network, thanks!

---------- Post added 12-09-13 at 01:25 PM ----------

Quote:
Originally Posted by druuna
Have a look at this:
Code:
awk '!($1 in _) { _[$1]; print }' input
Example run with your small data set:
Code:
$ awk '!($1 in _) { _[$1]; print }' input 
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
2 2 6 71.99 4.07 5 1 4.99
that looks fairly straightforward, thanks!
 
Old 12-09-2013, 02:27 PM   #6
tabbygirl1990
Member
 
Registered: Jul 2013
Location: a warm beach, cool ocean breeze, nice waves, and a Margaritta
Distribution: RHEL 5.5 Tikanga
Posts: 63

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by colucix
I think this awk one-liner is a good solution:
Code:
awk '!_[$1]++' file
Here the element of the array _ indexed by $1 is increased by one every time $1 is encountered. Because ++ is a post-increment, the test sees the value from before the increment. The NOT operator (the exclamation mark) causes awk to print the record only the first time $1 is encountered (0, which is FALSE, becomes TRUE, whereas any value greater than 0, which is TRUE, becomes FALSE). Hope this helps.
as always colucix comes up with the shortest, easiest answer, though it did take me a few minutes to wrap my head around the logic.

thanks guys!

tabby
 
Old 12-09-2013, 02:41 PM   #7
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 5,027

Rep: Reputation: 845
just for shiggles:
Code:
[schneidz@mom ~]$ cat tabbygirl.txt 
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
1 2 4 77.11 4.92 0 1.76 7.12
2 2 6 71.99 4.07 5 1 4.99
[schneidz@mom ~]$ cat tabbygirl.ksh 
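#!/bin/bash
# Print a line only when its first column differs from the previous line's.

col1=schneidz   # sentinel that should never match real data
while read -r line
do
 col1old=$col1
 col1=${line%%[[:space:]]*}   # first whitespace-separated field
 if [ "$col1" != "$col1old" ]
 then
  printf '%s\n' "$line"
 fi
done < "$1"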
[schneidz@mom ~]$ ./tabbygirl.ksh tabbygirl.txt 
0 2 4 75.87 3.33 0 1 5.23
1 2 7 76.01 5.11 0 1.76 7.11
2 2 6 71.99 4.07 5 1 4.99
 
Old 12-09-2013, 02:45 PM   #8
sag47
Senior Member
 
Registered: Sep 2009
Location: Orange County, CA
Distribution: Kubuntu x64, Raspbian, CentOS
Posts: 1,831
Blog Entries: 36

Rep: Reputation: 451
Quote:
Originally Posted by tabbygirl1990
my computer has 6 cpus so i'll look into your matlab parallel processing stuff on my machine. i run several sets of 6 across a network on other 6 cpu machines, so i'll ask our IT guy if i can do this across the network, thanks!
I'd also like to mention that if you take advantage of parallel computing within matlab, your processing times will be much faster than with your existing method. There is a lot of overhead when you launch all of those matlab processes (setup and breakdown of a whole matlab instance). You will not encounter as much overhead with threading. Your speed-up will be significantly more than what you're experiencing with your current (likely manual) split and run. Even if your split, run, and assemble steps are automated, using threading will be significantly faster because of less overhead.

I hypothesize that running 6 threads in parallel on your single machine will likely be equivalent to or better than your current attempt. Things will only get better by adding in clustering, where you can split the data into even more parts (e.g. running 36 threads in parallel across 6x 6-core machines).

Last edited by sag47; 12-09-2013 at 03:04 PM.
 
Old 12-09-2013, 03:06 PM   #9
tabbygirl1990
Member
 
Registered: Jul 2013
Location: a warm beach, cool ocean breeze, nice waves, and a Margaritta
Distribution: RHEL 5.5 Tikanga
Posts: 63

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by sag47
I'd also like to mention that if you take advantage of parallel computing within matlab, your processing times will be much faster than with your existing method. There is a lot of overhead when you launch all of those matlab processes (setup and breakdown of a whole matlab instance). You will not encounter as much overhead with threading. Your speed-up will be significantly more than what you're experiencing with your current (likely manual) split and run. Even if your split, run, and assemble steps are automated, using threading will be significantly faster because of less overhead.

I hypothesize that running 6 threads in parallel on your single machine will likely be equivalent to or better than your current attempt. Things will only get better by adding in clustering, where you can split the data into even more parts (e.g. running 36 threads in parallel across 6x 6-core machines).
thanks SAM! tabby
 
Old 12-10-2013, 10:28 AM   #10
tabbygirl1990
Member
 
Registered: Jul 2013
Location: a warm beach, cool ocean breeze, nice waves, and a Margaritta
Distribution: RHEL 5.5 Tikanga
Posts: 63

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by sag47
I'd also like to mention that if you take advantage of parallel computing within matlab
good morning, i snooped around a bit last night and the parallel computing thing looks mighty scary and waaaaay past my ability. i called a friend of mine who knows a lot of matlab and he said he'd never messed with it. so the way i've been doing it is sloooow and brute force, but it works
 
Old 12-10-2013, 06:57 PM   #11
sag47
Senior Member
 
Registered: Sep 2009
Location: Orange County, CA
Distribution: Kubuntu x64, Raspbian, CentOS
Posts: 1,831
Blog Entries: 36

Rep: Reputation: 451
There are many examples of parallel computing. Running help parfor in matlab shows the format of the parfor statement as one example. Here's a small example usage of parfor. If you wanted that to run with 6 workers over the loop then it would look something like this...

Code:
parfor (i = 1:lots, 6)
   out(:,i)=do(something);
end
Where the 6 designates the maximum number of workers to run over the for loop. This will automatically split up the data and assemble it similar to OpenMPI.

If you're new to Parallel programming or computing in general then I recommend you pick up a copy of "An Introduction to Parallel Programming" by Peter S. Pacheco ISBN: 978-0-12-374260-5. You're doing it the painful way with your current method.

Last edited by sag47; 12-10-2013 at 06:59 PM.
 
  

