LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 08-17-2022, 06:05 PM   #1
halfpower
Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 234

Rep: Reputation: 30
Question Do chained Bash commands block each other?


Below is an example that extracts links from a web page.
Code:
bunzip2 really_big_file.bz2 --stdout \
| jq .text \
| cut -c2- | rev | cut -c2- | rev \
| sed -E 's:(\\n): :g' \
| grep -P "[a-zA-Z]"
It's clear to me that each command will start before the others have completed, but will they also block each other? Will bunzip2, jq, and sed run simultaneously on different logical cores?

Last edited by halfpower; 08-18-2022 at 04:24 PM. Reason: Replace code example to better illustrate the problem
 
Old 08-17-2022, 06:18 PM   #2
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,707
Blog Entries: 13

Rep: Reputation: 4769
The pipes take care of it.

The output of the first one is the input to the next one. And so on.

Yes, they will run at the same time, but each process depends on the one before it in the pipe for its input.
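A quick way to see the simultaneous start is a sketch like this (`stage` is a hypothetical helper function, not a real command): each stage announces itself on stderr the moment it starts, then just passes its input through.

```shell
# Each stage announces itself on stderr as soon as it is forked, then
# copies stdin to stdout. All three announcements appear immediately,
# before the data has finished flowing end-to-end.
stage() { echo "stage $1 started" >&2; cat; }

seq 100000 | stage one | stage two | stage three > /dev/null
```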

What does it matter how a CPU chooses to run processes?

Last edited by rtmistler; 08-17-2022 at 06:25 PM.
 
Old 08-17-2022, 07:15 PM   #3
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 10,632

Rep: Reputation: 5023
They run simultaneously, and each one blocks the one to its right.

Sed blocks until it gets a line from grep, and grep blocks until it gets a line from curl.
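That blocking is easy to observe with a small sketch: the consumer can do nothing until the producer writes, so the pipeline's wall time is dominated by the producer's sleep.

```shell
# sed starts immediately but then blocks reading its stdin;
# the whole pipeline takes about as long as the 2-second sleep.
t0=$(date +%s)
( sleep 2; echo hello ) | sed 's/hello/HELLO/'
t1=$(date +%s)
echo "elapsed: $((t1 - t0))s"
```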
 
Old 08-17-2022, 08:31 PM   #4
halfpower
Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 234

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by rtmistler View Post
What does it matter how a CPU chooses to run processes?
When CPU power is the primary factor limiting execution time, it can affect total runtime. Theoretically, it should become a big issue when there is a large amount of data, many commands chained together, and many unutilized CPU cores.
 
Old 08-17-2022, 11:50 PM   #5
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,352
Blog Entries: 1

Rep: Reputation: 1656
Right here the network speed will be the limiting factor.
 
1 member found this post helpful.
Old 08-18-2022, 12:35 AM   #6
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 6,414
Blog Entries: 3

Rep: Reputation: 3347
And it goes without saying that processing HTML without a proper parser can be quite brittle. An unexpected space or line break, although valid HTML, will choke the sed script.

Code:
curl https://example.com \
| tidy -numeric -asxml \
| xmlstarlet sel -N xhtml="http://www.w3.org/1999/xhtml" \
        -t -m '//xhtml:a[@href]'  -v 'concat(@href," ",.)' -n
The xmlstarlet utility is just one option. There are parsers for Perl, Python 3, and other scripting languages.

PS. clownflare tries to block me from posting the 'concat' part above. >:(
 
1 member found this post helpful.
Old 08-18-2022, 01:29 AM   #7
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 18,979

Rep: Reputation: 6447
Quote:
Originally Posted by halfpower View Post
When CPU power is the primary factor limiting execution time, it can affect total runtime. Theoretically, it should become a big issue when there is a large amount of data, many commands chained together, and many unutilized CPU cores.
No. Obviously the computing capability will limit the speed of execution, but Linux (the kernel) is able to utilize all the cores, not only one, so there will be no unutilized cores. (Don't forget that the whole system is running: at least several hundred processes.)
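One way to watch the scheduler do this (a Linux-specific sketch using procps `ps`; the pipeline is just a stand-in CPU load): start a CPU-bound pipeline in the background, then ask `ps` which processor each stage is currently on.

```shell
# Start a CPU-bound pipeline in the background...
yes | gzip | gzip -d | wc -c > /dev/null &
sleep 1
# ...then list each stage's current processor. The psr column typically
# shows the stages spread across different cores.
ps --ppid $$ -o pid,psr,comm
# Stop the generator; the downstream stages then drain and exit.
pkill -P $$ yes
```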
 
1 member found this post helpful.
Old 08-18-2022, 06:52 AM   #8
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 2,784

Rep: Reputation: 2048
Quote:
Originally Posted by halfpower View Post
Below is an example that extracts links from a web page.
Which is ugly, flawed, and unnecessarily slow.

Asking about CPU core allocation is a micro-optimization; if you have a performance issue, there are bigger factors to address first.

Is there a larger problem that prompted this line of thinking, or is it just a "what if" thought?

 
Old 08-18-2022, 08:03 AM   #9
teckk
Senior Member
 
Registered: Oct 2004
Distribution: Arch
Posts: 4,366
Blog Entries: 5

Rep: Reputation: 1505
Depends on the web page; grep will do by itself what the OP is trying to do, although a web browser or an HTML/XML parser will do it better.

Code:
agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"

url1="https://www.linuxquestions.org/questions/programming-9/do-chained-bash-commands-block-each-other-4175715802/"

lynx -useragent="$agent" -dump -source "$url1" | grep -Eo "(http|https).*" > file1.txt

lynx -useragent="$agent" -dump -listonly "$url1" > file2.txt


url2="https://en.m.wikipedia.org/wiki/Carbon"

curl -LA "$agent" "$url2" | grep -oE 'href="([^"#]+)"' > file3.txt
The more pipes, the more subshells.
 
Old 08-18-2022, 10:52 AM   #10
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 10,632

Rep: Reputation: 5023
Are you actually having a performance issue that you’re trying to solve here, or was this meant to be a model?

Keep in mind that IPC isn't fast. The typical performance-centric approach is to break up the data, distribute the chunks to worker programs that know nothing about each other, have each worker write its results to a central location, and then wait for all of the workers to finish.
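That approach can be sketched with plain coreutils (the file names and the `tr` "worker" are hypothetical stand-ins): split the input, run independent workers in parallel with `xargs -P`, then gather the results.

```shell
# Break a line-oriented input into 4 roughly equal chunks (GNU split).
split -n l/4 big_input.txt chunk.

# Run one independent worker per chunk, up to 4 at a time; tr stands in
# for whatever per-chunk processing the real job would do.
printf '%s\n' chunk.?? | xargs -P 4 -I{} sh -c 'tr a-z A-Z < "{}" > "{}.out"'

# Gather the per-chunk results into one place, then clean up.
cat chunk.??.out > result.txt
rm chunk.?? chunk.??.out
```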

BTW, you do understand that the Linux kernel puts processes to "sleep" (in quotes because it's a technical word) when they're waiting for input, right?

Last edited by dugan; 08-18-2022 at 11:50 AM.
 
Old 08-18-2022, 04:16 PM   #11
halfpower
Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 234

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by NevemTeve View Post
Right here the network speed will be the limiting factor.
I was trying to demonstrate the issue. This might be a better example:
Code:
bunzip2 really_big_file.bz2 --stdout \
| jq .text \
| sed -E 's:(\\n): :g'
 
Old 08-18-2022, 04:37 PM   #12
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 10,632

Rep: Reputation: 5023
Neither example would benefit from multiple cores. The reasons have been said many times, but jq in particular needs to wait for the entire curl or bunzip command to finish before it even starts.

Also, you can use bzcat instead of bunzip2 --stdout.

Quote:
Originally Posted by halfpower View Post
When the primary factor limiting execution time is CPU power
It's not. End of story.

Or have you actually seen top/iostat/sar output showing that?

Last edited by dugan; 08-18-2022 at 04:54 PM.
 
Old 08-18-2022, 04:57 PM   #13
halfpower
Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 234

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by dugan View Post
Are you actually having a performance issue that you're trying to solve here, or was this meant to be a model?
The question is more theoretical in nature. The code (which has been edited) is only intended to illustrate the issue.

Quote:
Keep in mind that IPC isn't fast. The typical performance-centric approach is to break up the data, distribute each chunk to a program that does not know about the others and which processes its part of the data to a central location, and then wait for all of the individual programs to finish.
Some data is stored in a monolithic format. At present, I have no method for on-the-fly splitting.

Quote:
Originally Posted by dugan View Post
BTW, you do understand that the Linux kernel puts processes to "sleep" (in quotes because it's a technical word) when they're waiting for input, right?
My question was whether the processes would block one another's execution. In other words: if my command launches 15 processes, are 14 of them sleeping at any given time? If they are, that is very suboptimal.
 
Old 08-18-2022, 05:00 PM   #14
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 2,784

Rep: Reputation: 2048
Quote:
Originally Posted by halfpower View Post
I was trying to demonstrate the issue.
What issue?

Put another way: You don't have a performance issue until you can demonstrate a measurable issue.

Whether the answer to your question is yes or no, what difference is it going to make?

If it runs quickly enough, nobody cares what core it executes on.

If it doesn't run quickly enough, switching to forced parallel execution is going to make the code less maintainable, and will very likely have less of an impact than optimising whatever algorithm(s) might be involved and/or using a lower-level language for the task.


Quote:
This might be a better example:
Code:
bunzip2 really_big_file.bz2 --stdout \
| jq .text \
| sed -E 's:(\\n): :g'
It's really not.

Even if you add the missing -z argument to sed, the backslash shouldn't be escaped and the group is unnecessary, but it's far simpler to use tr to replace newlines.

But if we pretend you did use tr, you still have to consider that jq (unlike grep/sed) will wait for stdin to complete before parsing the object, so it doesn't demonstrate any meaningful simultaneous execution.
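For completeness, the tr variant in isolation (the `printf` is a stand-in for the upstream stages, on the assumption that the goal is collapsing actual newlines into spaces):

```shell
# tr replaces every newline with a space in a single streaming pass.
printf 'first line\nsecond line\n' | tr '\n' ' '
```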


Last edited by boughtonp; 08-18-2022 at 05:02 PM.
 
Old 08-18-2022, 05:22 PM   #15
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 10,632

Rep: Reputation: 5023
Quote:
Originally Posted by halfpower View Post
My question was whether the processes would block one another's execution. In other words: if my command launches 15 processes, are 14 of them sleeping at any given time? If they are, that is very suboptimal.
Yes. The short answer is yes, and it's been explained to you exactly how that works. How many times do you need to hear "yes" before it gets through to you?

I'm starting to think you're deliberately ignoring the actual answers.

Last edited by dugan; 08-18-2022 at 07:58 PM.
 
  


Tags
asynchronous task, blocking, command line, concurrency, pipes


