Linux - General
This Linux forum is for general Linux questions and discussion.
Hello,
I have a copy of the osm planet data from the open street map project. I need to update it with osmosis, a java app.
It is 700 GiB decompressed, which means that I must decompress the data to a pipe, update from the pipe, and then pipe the updated data to a recompressor. The osm file is compressed using bzip2.
I have tested bzip2, pbzip2 and lbzip2; lbzip2 is the fastest on my multicore machine by a large margin.
My problem is that the jobs would run for at least a whole day and night on an uncomfortably loud processor (actually, the fan is loud).
So, I wanted to run the processes across my whole collection of Linux computers, totalling 15 cores.
I've tried to read about clustering and looked at Corosync, Heartbeat and Pacemaker, then realized that these were not what I wanted, and looked into MPI, but it seems that at least the C lbzip2 application must be specifically programmed to take advantage of such power. I don't know if the java tool osmosis needs any extra code.
So, how would I go about this best?
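For reference, a sketch of the pipeline I have in mind. The osmosis options below are illustrative rather than verified (check `osmosis --help` for the exact task names), and `planet.osm.bz2` / `changes.osc` are placeholder filenames:

```shell
# Decompress, apply updates, recompress -- all in one pipeline, so the
# ~700 GiB of XML never hits the disk uncompressed.
# NOTE: the osmosis arguments are a sketch, not verified against the docs.
bzcat planet.osm.bz2 \
  | osmosis --read-xml file=- \
            --read-xml-change file=changes.osc \
            --apply-change \
            --write-xml file=- \
  | bzip2 -c > planet-updated.osm.bz2
```

The decompressor and recompressor each saturate their own cores while osmosis works in the middle, which is exactly where the bottleneck shows up.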
Which of the three steps is taking the time? Or is it the transmission of data between the parties?
Is the nature of the update such that it can be performed against only "a slice" of a file? Or do you envision having multiple CPUs doing self-contained pieces of the total job in parallel?
I believe that gzip and other such algorithms are "fast enough" not to present a bottleneck. And, there are advantages to transmitting compressed data. So, maybe, the input file might be "bursted" into a bunch of individually-(re-)compressed jobs representing discrete units of work. The various "worker bees" remove them from a shared queue, decompress them, update them, recompress the updates, and send the updated files downstream through another shared queue. (The exact nature of this queue "to be determined," but it's undoubtedly available "off the shelf.") A final-assembly process retrieves the updated pieces and reassembles them into a final deliverable.
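One off-the-shelf way to sketch that queue is GNU parallel, which can split a stream into blocks and farm each block out to whatever core is free, including cores on remote hosts over ssh. The hostnames and block size here are placeholders, and whether the *update* step can run per-block depends on the file format; this sketch only distributes the recompression:

```shell
# Split the decompressed stream into ~100 MB blocks and recompress each
# block on whichever local or remote core is free. ':' in --sshlogin
# means "this machine"; host1/host2 are placeholder hostnames.
# --keep-order reassembles the blocks in their original order.
bzcat planet.osm.bz2 \
  | parallel --pipe --block 100M --keep-order \
             --sshlogin :,user@host1,user@host2 \
             'bzip2 -c' \
  > planet-recompressed.osm.bz2
```

This works because a concatenation of independent bzip2 streams is itself a valid bzip2 file, so the per-block outputs can simply be glued together.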
Basically, a "batch job monitor" or "cluster manager" software tool should be available off-the-shelf (open source ...), to provide the necessary workload-management and plumbing pieces needed to do this job, which is actually a very common one.
I can't speak to whether "MPI" (whatever that is ...) is "the correct solution." You'll need to do more research based on your expert knowledge of the intended task.
(Also: has anyone out there built and shared an "open street-maps updater engine?" Never assume that what you want to do hasn't already been done.)
Last edited by sundialsvcs; 05-05-2017 at 08:56 AM.
Quote:
Originally Posted by sundialsvcs
Which of the three steps is taking the time? Or is it the transmission of data between the parties?
The decompression and recompression of the main archive, and the decompression and parsing of the updates. Transmission time is negligible.
Quote:
Originally Posted by sundialsvcs
Is the nature of the update such that it can be performed against only "a slice" of a file? Or do you envision having multiple CPUs doing self-contained pieces of the total job in parallel?
I envision multiple CPUs working on the decompression and recompression stages. I'm not certain whether the updating stage is single-threaded; I think I read that it is, but I'm not 100% certain.
Quote:
Originally Posted by sundialsvcs
I believe that gzip and other such algorithms are "fast enough" not to present a bottleneck. And, there are advantages to transmitting compressed data. So, maybe, the input file might be "bursted" into a bunch of individually-(re-)compressed jobs representing discrete units of work. The various "worker bees" remove them from a shared queue, decompress them, update them, recompress the updates, and send the updated files downstream through another shared queue. (The exact nature of this queue "to be determined," but it's undoubtedly available "off the shelf.") A final-assembly process retrieves the updated pieces and reassembles them into a final deliverable.
Actually, gzip *is* the bottleneck, because there are so many updates: 10 updates mean 10 threads, 100 updates mean 100 threads. Plus bzip2, which, when using 6 threads, takes about 14 hours to decompress and recompress (simultaneously). I benchmarked it.
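For anyone wanting to reproduce that kind of benchmark, a minimal sketch, assuming lbzip2 on the machine (the filename is a placeholder; `-n` sets lbzip2's worker-thread count):

```shell
# Time a simultaneous decompress + recompress round trip, 6 threads on
# each side. Output goes to /dev/null so only CPU time is measured.
time ( lbzip2 -dc -n 6 planet.osm.bz2 | lbzip2 -c -n 6 > /dev/null )
```

Because both ends of the pipe run concurrently, the wall-clock time reflects the slower of the two stages, which is what matters for the full update pipeline.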
Quote:
Originally Posted by sundialsvcs
Basically, a "batch job monitor" or "cluster manager" software tool should be available off-the-shelf (open source ...), to provide the necessary workload-management and plumbing pieces needed to do this job, which is actually a very common one.
I can't speak to whether "MPI" (whatever that is ...) is "the correct solution." You'll need to do more research based on your expert knowledge of the intended task.
I know that; that's why I'm asking here. I can't determine which tool to use or which is applicable. I just lack the experience.
Quote:
Originally Posted by sundialsvcs
(Also: has anyone out there built and shared an "open street-maps updater engine?" Never assume that what you want to do hasn't already been done.)
I mentioned in my original post that osmosis *is* the "open street-maps updater engine".
Just throwing stuff out there: Hadoop is an open-source, Java-based programming framework that supports the processing and storage of large data sets in a distributed computing environment. It is a parallel-processing framework used to run map/reduce jobs. Hadoop is general-purpose and supports multiple models beyond its default map/reduce model. One such model is Apache Spark, which replaces batch map/reduce with real-time stream processing and faster interactive queries. Ganglia provides distributed, high-performance monitoring of clusters and grids.
I forgot to mention one option: there is ScaleMP, which builds a large SMP machine out of several nodes, and they offer a free Foundation edition. But it's quite unclear what they mean by the 4-processor limit in several editions; if they split all cores among the 8 possible machines, it would mean distributing them to only a few VMs, since even with 8 nodes you get a minimum of 8 cores, of course.
Nevertheless: how many cores are we talking about here? With multiple cores in two machines, you could pipe the output of the first machine to the second machine over plain ssh and do the compression on the other side, using all cores in each machine.
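A minimal sketch of that two-machine split, assuming lbzip2 is installed on both ends and a hypothetical second host named `box2` is reachable over ssh:

```shell
# Machine 1 decompresses with all of its cores; the decompressed stream
# travels over ssh to machine 2, which recompresses with all of its
# cores. 'box2' and the filenames are placeholders.
lbzip2 -dc planet.osm.bz2 \
  | ssh box2 'lbzip2 -c > planet-recompressed.osm.bz2'
```

The osmosis update step could sit on either side of the ssh hop, wherever spare CPU is available; the only cost is shipping the uncompressed stream across the link.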