[SOLVED] Select lines where the first two words are identical

danielbmartin · 10-13-2023, 08:56 AM

Have: a file in which each line contains two or more blank-delimited words.

Want: an OutFile which contains lines from the InFile where the first two words are identical.

This is a learning exercise, nothing more.

With this InFile ...

Code:

Daniel George
Henry  Frank
Linda Carol Mary Debbie Michelle
Samuel   Samuel  (my Uncle Sam)
Irving Simon Simon
 Harold Harold
Edward  Edward
Robert Richard Robert
 David David
Davi David
 David Davi
avid David
David avid

... the desired OutFile is ...

Code:

Samuel   Samuel  (my Uncle Sam)
 Harold Harold
Edward  Edward
 David David

Note that the irregular spacing is preserved.

This awk works.

Code:

awk '{if ($1==$2) print}' $InFile >$OutFile

This concise awk also works.

Code:

awk '$1==$2' $InFile >$OutFile
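
One caveat, offered as a sketch rather than a correction: awk compares two fields numerically when both look like numbers, so a line such as "1.0 1" would be selected, and an empty line is selected too because both fields are empty. Assuming a POSIX awk, appending an empty string to each field forces a string comparison, and NF >= 2 skips lines with fewer than two words. The sample input here is hypothetical.

Code:

```shell
#!/bin/bash
# Hypothetical input: a numeric look-alike pair and an empty line are included.
out=$(printf '%s\n' 'Daniel George' '' '1.0 1' ' Harold Harold' 'Edward  Edward' |
    awk 'NF >= 2 && ($1 "") == ($2 "")')   # "" forces string comparison

# awk prints $0 untouched, so leading and internal blanks survive.
printf '%s\n' "$out"
```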

This sed works.

Code:

sed -rn '/^(.+) *\b\1\b/p' $InFile >$OutFile

This grep works.

Code:

egrep '^(.+) *\b\1\b' $InFile >$OutFile

This bash almost works...

Code:

while read InLine   # Read one line from the InFile.
  do
    arr=($InLine)
# first two words
    W1="${arr[@]:0:1}"
    W2="${arr[@]:1:1}"
    if [ "$W1" == "$W2" ]
      then
      echo $InLine
    fi
  done <$InFile  # End of bash loop

... but the irregular spacing is lost.

1) Corrections and suggested improvements are welcome.

2) Please show how the bash solution could be changed to preserve the irregular spacing.

Thank you.

Daniel B. Martin

smallpond · 10-13-2023, 09:30 AM

You want

Code:

echo "$InLine"

to preserve $InLine as a single item.
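
A small sketch of the difference, using one hypothetical sample line. Unquoted, the variable is split into words and echo rejoins them with single blanks; quoted, it is passed as one argument.

Code:

```shell
#!/bin/bash
# Hypothetical sample line with irregular internal spacing.
InLine='Samuel   Samuel  (my Uncle Sam)'

# Unquoted: word splitting happens, echo joins the words with one blank each.
unquoted=$(echo $InLine)

# Quoted: echo receives a single argument, so the spacing is kept.
quoted=$(echo "$InLine")

printf '[%s]\n' "$unquoted" "$quoted"
```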

pan64 · 10-13-2023, 10:03 AM

Code:

while read -r InLine   # Read one line from the InFile.
  do
    read -r -a arr <<< "$InLine"
# first two words

    if [ "${arr[1]}" == "${arr[2]}" ]
      then
      echo "$InLine"
    fi
  done <"$InFile"  # End of bash loop

Use shellcheck to find problems in bash scripts.

danielbmartin · 10-13-2023, 11:48 AM

Quote:

Originally Posted by smallpond

You want

Code:

echo "$InLine"

to preserve $InLine as a single item.

Thank you, smallpond, for this correction. It results in a partial improvement. Instead of this...

Code:

Samuel Samuel (my Uncle Sam)
Harold Harold
Edward Edward
David David

... it produced this ...

Code:

Samuel   Samuel  (my Uncle Sam)
Harold Harold
Edward  Edward
David David

... but we really want this ...

Code:

Samuel   Samuel  (my Uncle Sam)
 Harold Harold
Edward  Edward
 David David

Daniel B. Martin

danielbmartin · 10-13-2023, 12:16 PM

Quote:

Originally Posted by pan64

Code:

while read -r InLine   # Read one line from the InFile.
  do
    read -r -a arr <<< "$InLine"
# first two words

    if [ "${arr[1]}" == "${arr[2]}" ]
      then
      echo "$InLine"
    fi
  done <"$InFile"  # End of bash loop

Did you test this?

Daniel B. Martin

pan64 · 10-13-2023, 12:35 PM

Quote:

Originally Posted by danielbmartin

Did you test this?

Daniel B. Martin

No, I didn't. You can do that. I have a bad habit of posting untested and/or almost working solutions. Better to take it as an idea, not a solution.

NevemTeve · 10-14-2023, 12:35 AM

Perhaps this:

Code:

 while IFS= read -r InLine; do
...

MadeInGermany · 10-14-2023, 07:54 AM

Your egrep would match a line

Code:

Daniel George Daniel George

I think the egrep should be

Code:

egrep '^ *([^ ]+) +\1\b' $InFile

to exactly match the first (non-blank!) field. Still the \b would match

Code:

Carol Carol-Anne

So 100% precise is

Code:

egrep '^ *([^ ]+) +\1( |$)' $InFile
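
A quick check of that last pattern against a few hypothetical lines. This is only a sketch: it uses grep -E in place of egrep, and it assumes a grep that accepts back-references in extended regular expressions, as GNU grep does.

Code:

```shell
#!/bin/bash
# Pattern from the post; grep -E stands in for egrep here.
pattern='^ *([^ ]+) +\1( |$)'

# Hypothetical test lines: the first three should be rejected.
matches=$(printf '%s\n' \
    'Daniel George Daniel George' \
    'Carol Carol-Anne' \
    'Davi David' \
    ' Harold Harold' \
    'Samuel   Samuel  (my Uncle Sam)' |
    grep -E "$pattern")

printf '%s\n' "$matches"
```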

With the default IFS, which contains whitespace characters, read strips leading and trailing whitespace from each line.
The fix is

Code:

while IFS= read -r InLine

This sets an empty IFS for that read command only.
To split a line into whitespace-separated fields you want the default IFS.
If you always want to capture the first 2 fields then you can use 3 variables:

Code:

while read -r f1 f2 rest

The 3rd variable consumes the remainder (including any further embedded separators).
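
Putting the two reads together might look like the sketch below. The function name select_dup and the sample lines are made up for illustration, and the default IFS is assumed for the inner read.

Code:

```shell
#!/bin/bash
# select_dup is a hypothetical helper name, not from the thread.
# It prints the lines of stdin whose first two fields are equal.
select_dup() {
    local InLine f1 f2 rest
    # IFS= applies to this read only, so leading blanks stay in InLine.
    while IFS= read -r InLine; do
        # Default IFS here: split a copy of the line into two fields plus the rest.
        read -r f1 f2 rest <<< "$InLine"
        # Require a non-empty first field so blank lines are not selected.
        if [ -n "$f1" ] && [ "$f1" = "$f2" ]; then
            printf '%s\n' "$InLine"
        fi
    done
}

result=$(printf '%s\n' 'Daniel George' ' Harold Harold' \
    'Samuel   Samuel  (my Uncle Sam)' 'Carol Carol-Anne' | select_dup)
printf '%s\n' "$result"
```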

pan64 · 10-14-2023, 09:28 AM

Quote:

Originally Posted by danielbmartin

Did you test this?

Daniel B. Martin

Code:

#!/bin/bash

InFile="$1"

while IFS='' read -r InLine   # Read one line from the InFile.
do
    read -r -a arr <<< "$InLine"
    # first two words

    if [ "${arr[0]}" == "${arr[1]}" ]
    then
        echo "$InLine"
    fi
done <"$InFile"  # End of bash loop

Here is a tested version

danielbmartin · 10-14-2023, 01:34 PM

Thank you NevemTeve and MadeInGermany for specific suggestions. Special thanks to pan64 for a complete and concise solution.

This was a useful exercise, one which illustrates the superiority of awk, at least in this specific case.

SOLVED!

Daniel B. Martin

NevemTeve · 10-14-2023, 01:57 PM

You can do it without arrays as well:

Code:

#!/bin/sh
# https://www.linuxquestions.org/questions/programming-9/select-lines-where-the-first-two-words-are-identical-4175729833/

while IFS= read -r InLine
do
    set -- $InLine

    if [ "$1" = "$2" ]
    then
        echo "$InLine"
    fi
done <<"DONE"
Daniel George
Henry  Frank
Linda Carol Mary Debbie Michelle
Samuel   Samuel  (my Uncle Sam)
Irving Simon Simon
 Harold Harold
Edward  Edward
Robert Richard Robert
 David David
Davi David
 David Davi
avid David
David avid
DONE

MadeInGermany · 10-15-2023, 03:18 AM

The $InLine is subject to filename generation, unless you turn it off:

Code:

set -f # disable filename generation (globbing)
while IFS= read -r InLine
do
    set -- $InLine # only word splitting
    ...
done
...
set +f # enable filename generation
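
To see why this matters, here is a sketch with a hypothetical input line whose two words are both glob patterns. It assumes a scratch directory containing only the two files it creates.

Code:

```shell
#!/bin/bash
# Work in a scratch directory holding two hypothetical files.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch a.txt b.txt

InLine='* *'          # two identical words, both glob patterns

set -- $InLine        # globbing on: each * expands to the file names
with_glob=$#

set -f                # disable filename generation
set -- $InLine        # only word splitting happens now
without_glob=$#
first=$1
set +f                # enable filename generation again

echo "with globbing: $with_glob words, without: $without_glob words"
cd - >/dev/null && rm -rf "$tmpdir"
```

With globbing on, $1 and $2 become a.txt and b.txt, so the line would be rejected even though its first two words are identical.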

danielbmartin · 10-15-2023, 09:29 PM

Quote:

Originally Posted by NevemTeve

You can do it without arrays as well:

Code:

while IFS= read -r InLine
do
    set -- $InLine
    if [ "$1" = "$2" ]
    then
        echo "$InLine"
    fi
done <<"DONE"

Lovely! Thank you!!

Daniel B. Martin

NevemTeve · 10-16-2023, 01:38 AM

Quote:

Originally Posted by MadeInGermany

The $InLine is subject to filename generation, unless you turn it off:

Thank you for pointing this out.
Note: in real-life usage, when changing a global setting I'd use a subshell so as not to interfere with other parts of the script, e.g.

Code:

do (
    set -f
    set -- $InLine

    if [ "$1" = "$2" ]
    then
        echo "$InLine"
    fi
) done
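
A complete version of that fragment could look like the sketch below. The function name select_dup_subshell and the sample lines are hypothetical, and the "$#" test is an addition to skip lines with fewer than two words.

Code:

```shell
#!/bin/bash
# select_dup_subshell is a hypothetical helper name, not from the thread.
select_dup_subshell() {
    local InLine
    while IFS= read -r InLine
    do
        (
            set -f            # affects this subshell only
            set -- $InLine    # word splitting without globbing
            # Skip lines with fewer than two words (added guard).
            if [ "$#" -ge 2 ] && [ "$1" = "$2" ]
            then
                printf '%s\n' "$InLine"
            fi
        )
    done
}

result=$(printf '%s\n' ' Harold Harold' '* *' 'Daniel George' | select_dup_subshell)
printf '%s\n' "$result"

# The parent shell's noglob option was never touched.
case $- in (*f*) parent=noglob ;; (*) parent=glob ;; esac
```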

pan64 · 10-16-2023, 02:16 AM

In bash, do and done are not required; the solution in post #12 (using set) is more efficient. Additionally, you might need to save the initial set of options and restore them at the end, which is not required if you use a subshell.
So it is probably still better to use an array.
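
For the save-and-restore route, one possible sketch follows. It assumes noglob is the only option being changed, so it remembers just that one flag via $-. The function name first_two_equal and the sample lines are hypothetical.

Code:

```shell
#!/bin/bash
# first_two_equal is a hypothetical helper name, not from the thread.
# It returns success when the first two words of its argument are identical.
first_two_equal() {
    local line=$1 was_noglob=0 rc
    case $- in (*f*) was_noglob=1 ;; esac   # remember the caller's setting
    set -f                                  # split without globbing
    set -- $line                            # changes this function's $1, $2, ...
    [ "$#" -ge 2 ] && [ "$1" = "$2" ]
    rc=$?
    [ "$was_noglob" -eq 1 ] || set +f       # restore only if we changed it
    return "$rc"
}

while IFS= read -r InLine
do
    if first_two_equal "$InLine"
    then
        printf '%s\n' "$InLine"
    fi
done <<'DONE'
 Harold Harold
* *
Daniel George
DONE
```

No subshell is forked per line here, at the cost of the extra bookkeeping around set -f.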