Linux - General
This Linux forum is for general Linux questions and discussion.
Hello,
I have a copy of the osm planet data from the open street map project. I need to update it with osmosis, a java app.
It is 700 GiB decompressed, which means that I must decompress the data to a pipe, update from the pipe, and then pipe the updated data to a recompressor. The osm file is compressed using bzip2.
I have tested bzip2, pbzip2 and lbzip2; lbzip2 is the fastest on my multicore machine by a large margin.
My problem is that the jobs would run for at least a whole day and night on an uncomfortably loud processor (actually, the fan is loud).
So, I wanted to run the processes across my whole collection of Linux computers, totalling 15 cores.
I've tried to read about clustering and looked at Corosync, Heartbeat and Pacemaker, then realized that these were not what I wanted, and looked into MPI, but it seems that at least the C lbzip2 application must be specifically programmed to take advantage of such power. I don't know if the java tool osmosis needs any extra code.
So, how would I go about this best?
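For reference, a sketch of the pipeline I have in mind. The osmosis options below are illustrative rather than verified (check `osmosis --help` for the exact task names), and `planet.osm.bz2` / `changes.osc` are placeholder filenames:

```shell
# Decompress, apply updates, recompress -- all in one pipeline, so the
# ~700 GiB of XML never hits the disk uncompressed.
# NOTE: the osmosis arguments are a sketch, not verified against the docs.
bzcat planet.osm.bz2 \
  | osmosis --read-xml file=- \
            --read-xml-change file=changes.osc \
            --apply-change \
            --write-xml file=- \
  | bzip2 -c > planet-updated.osm.bz2
```

The decompressor and recompressor each saturate their own cores while osmosis works in the middle, which is exactly where the bottleneck shows up.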
Which of the three steps is taking the time? Or is it the transmission of data between the parties?
Is the nature of the update such that it can be performed against only "a slice" of a file? Or do you envision having multiple CPUs doing self-contained pieces of the total job in parallel?
I believe that gzip and other such algorithms are "fast enough" not to present a bottleneck. And, there are advantages to transmitting compressed data. So, maybe, the input file might be "bursted" into a bunch of individually-(re-)compressed jobs representing discrete units of work. The various "worker bees" remove them from a shared queue, decompress them, update them, recompress the updates, and send the updated files downstream through another shared queue. (The exact nature of this queue "to be determined," but it's undoubtedly available "off the shelf.") A final-assembly process retrieves the updated pieces and reassembles them into a final deliverable.
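One off-the-shelf way to sketch that queue is GNU parallel, which can split a stream into blocks and farm each block out to whatever core is free, including cores on remote hosts over ssh. The hostnames and block size here are placeholders, and whether the *update* step can run per-block depends on the file format; this sketch only distributes the recompression:

```shell
# Split the decompressed stream into ~100 MB blocks and recompress each
# block on whichever local or remote core is free. ':' in --sshlogin
# means "this machine"; host1/host2 are placeholder hostnames.
# --keep-order reassembles the blocks in their original order.
bzcat planet.osm.bz2 \
  | parallel --pipe --block 100M --keep-order \
             --sshlogin :,user@host1,user@host2 \
             'bzip2 -c' \
  > planet-recompressed.osm.bz2
```

This works because a concatenation of independent bzip2 streams is itself a valid bzip2 file, so the per-block outputs can simply be glued together.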
Basically, a "batch job monitor" or "cluster manager" software tool should be available off-the-shelf (open source ...), to provide the necessary workload-management and plumbing pieces needed to do this job, which is actually a very common one.
I can't speak to whether "MPI" (whatever that is ...) is "the correct solution." You'll need to do more research based on your expert knowledge of the intended task.
(Also: has anyone out there built and shared an "open street-maps updater engine?" Never assume that what you want to do hasn't already been done.)
Last edited by sundialsvcs; 05-05-2017 at 08:56 AM.
Quote:
Originally Posted by sundialsvcs
Which of the three steps is taking the time? Or is it the transmission of data between the parties?
The decompression and recompression of the main archive, and the decompression and parsing of the updates. Transmission time is negligible.
Quote:
Originally Posted by sundialsvcs
Is the nature of the update such that it can be performed against only "a slice" of a file? Or do you envision having multiple CPUs doing self-contained pieces of the total job in parallel?
I envision multiple CPUs working on the decompression and recompression stages. I'm not certain whether the updating stage is single-threaded; I think I read that it is, but I'm not 100% certain.
Quote:
Originally Posted by sundialsvcs
I believe that gzip and other such algorithms are "fast enough" not to present a bottleneck. And, there are advantages to transmitting compressed data. So, maybe, the input file might be "bursted" into a bunch of individually-(re-)compressed jobs representing discrete units of work. The various "worker bees" remove them from a shared queue, decompress them, update them, recompress the updates, and send the updated files downstream through another shared queue. (The exact nature of this queue "to be determined," but it's undoubtedly available "off the shelf.") A final-assembly process retrieves the updated pieces and reassembles them into a final deliverable.
Actually, gzip *is* the bottleneck, because there are so many updates: 10 updates mean 10 threads, 100 updates mean 100 threads. Plus bzip2, which, when using 6 threads, takes about 14 hours to decompress and recompress (simultaneously). I benchmarked it.
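For anyone wanting to reproduce that kind of benchmark, a minimal sketch, assuming lbzip2 on the machine (the filename is a placeholder; `-n` sets lbzip2's worker-thread count):

```shell
# Time a simultaneous decompress + recompress round trip, 6 threads on
# each side. Output goes to /dev/null so only CPU time is measured.
time ( lbzip2 -dc -n 6 planet.osm.bz2 | lbzip2 -c -n 6 > /dev/null )
```

Because both ends of the pipe run concurrently, the wall-clock time reflects the slower of the two stages, which is what matters for the full update pipeline.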
Quote:
Originally Posted by sundialsvcs
Basically, a "batch job monitor" or "cluster manager" software tool should be available off-the-shelf (open source ...), to provide the necessary workload-management and plumbing pieces needed to do this job, which is actually a very common one.
I can't speak to whether "MPI" (whatever that is ...) is "the correct solution." You'll need to do more research based on your expert knowledge of the intended task.
I know that; that's why I'm asking here. I can't determine which tool to use or which is applicable. I just lack the experience.
Quote:
Originally Posted by sundialsvcs
(Also: has anyone out there built and shared an "open street-maps updater engine?" Never assume that what you want to do hasn't already been done.)
I mentioned in my original post that osmosis *is* the "open street-maps updater engine".
Just throwing stuff out there: Hadoop is an open-source, Java-based programming framework that supports the processing and storage of large data sets in a distributed computing environment. It is a parallel-processing framework used to run map/reduce jobs. Hadoop is general-purpose and supports multiple models beyond its default map/reduce model. One such model is Apache Spark, which replaces batch map/reduce with real-time stream processing and faster interactive queries. Ganglia provides distributed, high-performance monitoring of clusters and grids.
I forgot to mention one option: there is ScaleMP, which builds a large SMP machine out of several nodes, and they offer a free Foundation edition. But it's quite unclear what they mean by the 4-processor limit in several editions; if they split all cores among the 8 possible machines, it would mean distributing them to only a few VMs, since even with 8 nodes you get a minimum of 8 cores, of course.
Nevertheless: how many cores are we talking about here? With multiple cores in two machines, you could pipe the output of the first machine to the second machine over plain ssh and do the compression on the other side, using all cores in each machine.
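A minimal sketch of that two-machine split, assuming lbzip2 is installed on both ends and a hypothetical second host named `box2` is reachable over ssh:

```shell
# Machine 1 decompresses with all of its cores; the decompressed stream
# travels over ssh to machine 2, which recompresses with all of its
# cores. 'box2' and the filenames are placeholders.
lbzip2 -dc planet.osm.bz2 \
  | ssh box2 'lbzip2 -c > planet-recompressed.osm.bz2'
```

The osmosis update step could sit on either side of the ssh hop, wherever spare CPU is available; the only cost is shipping the uncompressed stream across the link.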