Old 05-03-2017, 05:25 PM   #1
ballsystemlord
Member
 
Registered: Aug 2014
Distribution: Devuan
Posts: 214

Rep: Reputation: Disabled
MPI the correct solution?


Hello,
I have a copy of the osm planet data from the open street map project. I need to update it with osmosis, a java app.
It is 700 GiB decompressed, which means I must decompress the data to a pipe, run the update from the pipe, and then pipe the updated data to a recompressor (sketched below). The osm file is compressed using bzip2.
I have tested bzip2, pbzip2, and lbzip2; lbzip2 is the fastest on my multicore machine by a large margin.
My problem is that the jobs would run for at least a whole day and night on an uncomfortably loud processor (actually, it's the fan that is loud).
So, I wanted to run the processes across my whole collection of Linux computers, totalling 15 cores.
I've tried to read about clustering and looked at Corosync, Heartbeat, and Pacemaker, then realized that these were not what I wanted and looked into MPI, but it seems that at least the C lbzip2 application would have to be specifically programmed to take advantage of such power. I don't know if the Java tool osmosis needs any extra code.
So, how would I go about this best?
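
For reference, the single-machine pipeline I have in mind looks roughly like this (the osmosis task names and their ordering are from memory, so treat them as illustrative only):
Code:
# Decompress, apply an update, and recompress, all through pipes.
# osmosis task names/ordering are from memory -- check the osmosis docs.
lbzip2 -dc planet-latest.osm.bz2 \
  | osmosis --read-xml-change file=changes.osc \
            --read-xml file=- \
            --apply-change \
            --write-xml file=- \
  | lbzip2 -9 > planet-updated.osm.bz2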

Thanks
 
Old 05-05-2017, 08:55 AM   #2
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,668
Blog Entries: 4

Rep: Reputation: 3945
Which of the three steps is taking the time? Or is it the transmission of data between the parties?

Is the nature of the update such that it can be performed against only "a slice" of a file? Or do you envision having multiple CPUs doing self-contained pieces of the total job in parallel?

I believe that gzip and other such algorithms are "fast enough" to be "fast enough" not to present a bottleneck. And, there are advantages to transmitting compressed data. So, maybe, the input file might be "bursted" into a bunch of individually-(re-)compressed jobs representing discrete units of work. The various "worker bees" remove them from a shared queue, decompress them, update them, recompress the updates, and send the updated files downstream through another shared queue. (The exact nature of this queue "to be determined," but it's undoubtedly available "off the shelf.") A final-assembly process retrieves the updated pieces and reassembles them into a final deliverable.
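
Purely as an illustration of the pattern (not a recommendation; the host names and the helper script below are made up): GNU parallel can act as exactly this sort of off-the-shelf queue, shipping per-chunk jobs to several machines over ssh and collecting the results.
Code:
# Hypothetical sketch: farm per-chunk jobs out to three worker machines.
# "update-chunk.sh" is a stand-in for whatever performs the actual update
# and is assumed to already exist on each worker.
ls chunks/*.osm.bz2 | parallel -S box1,box2,box3 \
    --transferfile {} --return {.}.updated.bz2 --cleanup \
    'bzcat {} | ./update-chunk.sh | bzip2 -9 > {.}.updated.bz2'
The queueing and shipping-around of work units is the already-solved part; the hard part is slicing the input into self-contained pieces.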

Basically, a "batch job monitor" or "cluster manager" software tool should be available off-the-shelf (open source ...), to provide the necessary workload-management and plumbing pieces needed to do this job, which is actually a very common one.

I can't speak to whether "MPI" (whatever that is ...) is "the correct solution." You'll need to do more research based on your expert knowledge of the intended task.

(Also: has anyone out there built and shared an "open street-maps updater engine?" Never assume that what you want to do hasn't already been done.)

Last edited by sundialsvcs; 05-05-2017 at 08:56 AM.
 
Old 05-08-2017, 04:14 PM   #3
ballsystemlord
Member
 
Registered: Aug 2014
Distribution: Devuan
Posts: 214

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by sundialsvcs View Post
Which of the three steps is taking the time? Or is it the transmission of data between the parties?
The decompression and recompression stages of the main archive and the decompression and parsing of the updates. Transmission time is negligible.
Quote:
Originally Posted by sundialsvcs View Post
Is the nature of the update such that it can be performed against only "a slice" of a file? Or do you envision having multiple CPUs doing self-contained pieces of the total job in parallel?
I envision multiple CPUs each working on the decompression and recompression stages. I'm not certain whether the updating stage is single-threaded or not; I think I read that it is, but I'm not 100% certain.

Quote:
Originally Posted by sundialsvcs View Post
I believe that gzip and other such algorithms are "fast enough" to be "fast enough" not to present a bottleneck. And, there are advantages to transmitting compressed data. So, maybe, the input file might be "bursted" into a bunch of individually-(re-)compressed jobs representing discrete units of work. The various "worker bees" remove them from a shared queue, decompress them, update them, recompress the updates, and send the updated files downstream through another shared queue. (The exact nature of this queue "to be determined," but it's undoubtedly available "off the shelf.") A final-assembly process retrieves the updated pieces and reassembles them into a final deliverable.
Actually, gzip *is* the bottleneck because there are so many updates: 10 updates means 10 threads, 100 updates means 100 threads. Plus there is the bzip2 step, which, using 6 threads, takes about 14 hours to decompress and recompress (simultaneously). I benchmarked it.
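
To be concrete about what I mean by decompressing and recompressing simultaneously, the benchmark was essentially of this shape (the file name and thread counts are only examples):
Code:
# Decompress and recompress through a pipe, 6 threads on each side.
time lbzip2 -dc -n 6 planet-latest.osm.bz2 | lbzip2 -9 -n 6 > /dev/null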

Quote:
Originally Posted by sundialsvcs View Post
Basically, a "batch job monitor" or "cluster manager" software tool should be available off-the-shelf (open source ...), to provide the necessary workload-management and plumbing pieces needed to do this job, which is actually a very common one.

I can't speak to whether "MPI" (whatever that is ...) is "the correct solution." You'll need to do more research based on your expert knowledge of the intended task.
I know that; that's why I'm asking here. I can't determine which tool to use or which is applicable. I just lack the experience.

Quote:
Originally Posted by sundialsvcs View Post
(Also: has anyone out there built and shared an "open street-maps updater engine?" Never assume that what you want to do hasn't already been done.)
I mentioned in my original post that osmosis is the "open street-maps updater engine".
 
Old 05-08-2017, 06:00 PM   #4
perfectsecurity
LQ Newbie
 
Registered: May 2017
Posts: 21

Rep: Reputation: Disabled
Just throwing stuff out there: Hadoop is an open-source, Java-based programming framework that supports the processing and storage of large data sets in a distributed computing environment. It is a parallel-processing framework used to run map/reduce jobs. Hadoop is a general-purpose framework that supports multiple models, which can be used instead of its default map/reduce model. One such model is Apache Spark, which replaces the batch map/reduce model with real-time stream data processing and faster interactive queries. Ganglia is for distributed, high-performance monitoring of clusters and grids.
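
As a very rough sketch of the shape a map-only Hadoop Streaming job could take for this (the jar path, HDFS paths and helper script are made up; check your own install):
Code:
# Hypothetical map-only streaming job: the input is a text file listing one
# chunk path per line; each mapper reads paths from stdin and runs an
# ordinary shell script against the corresponding chunks.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -input /osm/chunk-list.txt \
    -output /osm/update-logs \
    -mapper 'bash update-chunks.sh' \
    -file update-chunks.sh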
 
Old 05-09-2017, 12:01 PM   #5
Reuti
Senior Member
 
Registered: Dec 2004
Location: Marburg, Germany
Distribution: openSUSE 15.2
Posts: 1,339

Rep: Reputation: 260
I forgot to mention one option: there is ScaleMP, which builds a large SMP machine out of several nodes, and they offer a Foundation Free edition. But it's quite unclear what they mean by the 4-processor limit in several editions; if they split all the cores across the 8 possible machines, it would mean distributing them to only a few VMs, as even with 8 nodes you get a minimum of 8 cores, of course.

Nevertheless: how many cores are we talking about here? With multiple cores in two machines, you could pipe the output of the first machine to the second machine over plain ssh and do the compression on the other side, using all the cores in each machine.
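
For example, something in this direction (host and file names are just placeholders; the osmosis update would sit in the middle of the pipe on the first machine):
Code:
# Machine A decompresses (and would also run the update step in this pipe);
# machine B spends all of its cores on the recompression.
lbzip2 -dc planet-latest.osm.bz2 \
  | ssh nodeB 'lbzip2 -9 -c > planet-updated.osm.bz2'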

-- Reuti
 
  

