LinuxQuestions.org


halfpower 08-17-2022 06:05 PM

Do chained Bash commands block each other?
 
Below is an example that extracts links from a web page.
Code:

bunzip2 really_big_file.bz2 --stdout\
| jq .text\
| cut -c2- | rev | cut -c2- | rev\
| sed -E 's:(\\n): :g;'\
| grep -P "[a-zA-Z]"

It's clear to me that each command will start working before the others have completed, but will they block each other too? Will curl, grep, and sed run simultaneously on different logical cores?

rtmistler 08-17-2022 06:18 PM

The pipes take care of it.

The output of the first one is the input to the next one. And so on.

Yes, they will run at the same time, but each pipe feeds the following command, which depends on the earlier process's output.

What does it matter how a CPU chooses to run processes?

dugan 08-17-2022 07:15 PM

They run simultaneously, and each one blocks the one to its right.

Sed blocks until it gets a line from grep, and grep blocks until it gets a line from curl.
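
You can see it with a toy pipeline (a rough sketch, not your exact commands): all three stages are launched immediately, but the later ones just sit blocked on their pipes until the first one writes a line.

Code:

# All three stages start at once; tr and grep block reading their
# pipes until the first stage finally prints something (~3 seconds).
{ sleep 3; printf 'hello\nworld\n'; } \
| tr 'a-z' 'A-Z' \
| grep HELLO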

halfpower 08-17-2022 08:31 PM

Quote:

Originally Posted by rtmistler (Post 6374398)
What does it matter how a CPU chooses to run processes?

When the primary factor limiting execution time is CPU power, it can affect total runtime. Theoretically, it should become a big issue when there's a large amount of data, a lot of commands chained together, and many unutilized CPU cores.

NevemTeve 08-17-2022 11:50 PM

Right here the network speed will be the limiting factor.

Turbocapitalist 08-18-2022 12:35 AM

And it goes without saying that processing HTML without a proper parser can be quite brittle. An unexpected space or line break, although valid HTML, will choke the sed script.

Code:

curl https://example.com \
| tidy -numeric -asxml \
| xmlstarlet sel -N xhtml="http://www.w3.org/1999/xhtml" \
        -t -m '//xhtml:a[@href]'  -v 'concat(@href," ",.)' -n

The xmlstarlet utility is just one option. There are parsers for Perl, Python3, and other scripting languages.

PS. clownflare tries to block me from posting the 'concat' part above. >:(

pan64 08-18-2022 01:29 AM

Quote:

Originally Posted by halfpower (Post 6374420)
When the primary factor limiting execution time is CPU power, it can affect total runtime. Theoretically, it should become a big issue when there's a large amount of data, a lot of commands chained together, and many unutilized CPU cores.

No. Obviously the computing capabilities will limit the speed of execution, but Linux (the kernel) will be able to utilize all the cores (not only one), so there will be no unutilized cores (don't forget, the whole system is running, with at least several hundred processes).

boughtonp 08-18-2022 06:52 AM

Quote:

Originally Posted by halfpower (Post 6374397)
Below is an example that extracts links from a web page.

Which is ugly, flawed, and unnecessarily slow.

Asking about CPU core allocation is a micro-optimization; if you have a performance issue, you have bigger factors to address first.

Is there a larger problem that prompted this line of thinking, or is it just a "what if" thought?


teckk 08-18-2022 08:03 AM

Depends on the web page, but grep will do by itself what the OP is trying to do anyway, although a web browser or an HTML/XML parser will do it better.

Code:

agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"

url1="https://www.linuxquestions.org/questions/programming-9/do-chained-bash-commands-block-each-other-4175715802/"

lynx -useragent="$agent" -dump -source "$url1" | grep -Eo "(http|https).*" > file1.txt

lynx -useragent="$agent" -dump -listonly "$url1" > file2.txt


url2="https://en.m.wikipedia.org/wiki/Carbon"

curl -LA "$agent" "$url2" | grep -oE 'href="([^"#]+)"' > file3.txt

The more pipes, the more subshells.

dugan 08-18-2022 10:52 AM

Are you actually having a performance issue that you’re trying to solve here, or was this meant to be a model?

Keep in mind that IPC isn't fast. The typical performance-centric approach is to break up the data, distribute each chunk to a program that does not know about the others and that processes its part of the data to a central location, and then wait for all of the individual programs to finish.
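
A rough sketch of that shape, assuming the data happens to be line-oriented (the filenames and chunk size here are made up):

Code:

# Split the input into chunks, run one independent jq worker per chunk
# (up to one per core), then gather the results once they all finish.
# Only works if the data can be cut on line boundaries, e.g. JSON Lines.
split --lines=100000 big_input.jsonl chunk_
printf '%s\n' chunk_?? | xargs -P "$(nproc)" -I{} sh -c 'jq -r .text "{}" > "{}.out"'
cat chunk_*.out > combined.out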

BTW, you do understand that the Linux kernel puts processes to "sleep" (in quotes because it's a technical word) when they're waiting for input, right?
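
For example (my own quick check): start a pipeline whose first stage produces nothing for a while, then look at what ps reports.

Code:

# The downstream stages sit in state "S" (interruptible sleep) while
# they wait on their pipes; they are not spinning on a core.
sleep 30 | cat | wc -c &
ps -o pid,stat,cmd --ppid $$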

halfpower 08-18-2022 04:16 PM

Quote:

Originally Posted by NevemTeve (Post 6374450)
Right here the network speed will be the limiting factor.

I was trying to demonstrate the issue. This might be a better example:
Code:

bunzip2 really_big_file.bz2 --stdout\
| jq .text\
| sed -E 's:(\\n): :g;'


dugan 08-18-2022 04:37 PM

Neither example would benefit from multiple cores. The reasons have been given many times, but jq in particular needs to wait for the entire curl or bunzip2 command to finish before it even starts.

Also, you can use bzcat instead of bunzip2 --stdout.
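
i.e. your second example would just become:

Code:

bzcat really_big_file.bz2 \
| jq .text \
| sed -E 's:(\\n): :g;'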

Quote:

Originally Posted by halfpower (Post 6374420)
When the primary factor limiting execution time is CPU power

It's not. End of story.

Or have you actually seen top/iostat/sar output showing that?

halfpower 08-18-2022 04:57 PM

Quote:

Originally Posted by dugan (Post 6374575)
Are you actually having a performance issue that you’re trying to solve here, or was this meant to be a model?

The question is more theoretical in nature. The code (which has been edited) is only intended to illustrate the issue.

Quote:

Originally Posted by dugan (Post 6374575)
Keep in mind that IPC isn't fast. The typical performance-centric approach is to break up the data, distribute each chunk to a program that does not know about the others and that processes its part of the data to a central location, and then wait for all of the individual programs to finish.

Some data is stored in a monolithic format. At the present time, I have no method for on-the-fly splitting.

Quote:

Originally Posted by dugan (Post 6374575)
BTW, you do understand that the Linux kernel puts processes to "sleep" (in quotes because it's a technical word) when they're waiting for input, right?

My question was whether the processes would block each other's execution. In other words: if my command launches 15 processes, are 14 of them sleeping at any given time? If they are, that is very sub-optimal.

boughtonp 08-18-2022 05:00 PM

Quote:

Originally Posted by halfpower (Post 6374629)
I was trying to demonstrate the issue.

What issue?

Put another way: You don't have a performance issue until you can demonstrate a measurable issue.

Whether the answer to your question is yes or no, what difference is it going to make?

If it runs quickly enough, nobody cares what core it executes on.

If it doesn't run quickly enough, switching to forced parallel execution is going to make the code less maintainable, and will very likely have less of an impact than optimising whatever algorithm(s) might be involved and/or using a lower-level language for the task.


Quote:

This might be a better example:
Code:

bunzip2 really_big_file.bz2 --stdout\
| jq .text\
| sed -E 's:(\\n): :g;'


It's really not.

Even if you add the missing -z argument to sed (and the backslash shouldn't be escaped, and the group is unnecessary), it's far simpler to use tr to replace newlines.

But if we pretend you did use tr, you still have to consider that jq (unlike grep/sed) will wait for stdin to complete before parsing the object, so it doesn't demonstrate any meaningful simultaneous execution.
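
Roughly what I mean (untested sketch, keeping your made-up filename):

Code:

# tr does the newline-to-space replacement; jq still reads its whole
# input before emitting anything, so the last stage mostly sleeps
# until jq finishes.
bunzip2 --stdout really_big_file.bz2 \
| jq .text \
| tr '\n' ' '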


dugan 08-18-2022 05:22 PM

Quote:

Originally Posted by halfpower (Post 6374637)
My question was whether the processes would block each other's execution. In other words: if my command launches 15 processes, are 14 of them sleeping at any given time? If they are, that is very sub-optimal.

Yes. The short answer is yes, and it's been explained to you exactly how that works. How many times do you need to hear "yes" before it gets through to you?

I'm starting to think you're deliberately ignoring the actual answers.

