[SOLVED] Select lines where the first two words are identical
Have: a file in which each line contains two or more blank-delimited words.
Want: an OutFile which contains lines from the InFile where the first two words are identical.
This is a learning exercise, nothing more.
With this InFile ...
Code:
Daniel George
Henry Frank
Linda Carol Mary Debbie Michelle
Samuel Samuel (my Uncle Sam)
Irving Simon Simon
Harold Harold
Edward Edward
Robert Richard Robert
David David
Davi David
David Davi
avid David
David avid
... the desired OutFile is ...
Code:
Samuel Samuel (my Uncle Sam)
Harold Harold
Edward Edward
David David
Note that the irregular spacing is preserved.
This awk works.
Code:
awk '{if ($1==$2) print}' $InFile >$OutFile
This concise awk also works.
Code:
awk '$1==$2' $InFile >$OutFile
This sed works.
Code:
sed -rn '/^(.+) *\b\1\b/p' $InFile >$OutFile
This grep works.
Code:
egrep '^(.+) *\b\1\b' $InFile >$OutFile
This bash almost works...
Code:
while read InLine # Read one line from the InFile.
do
arr=($InLine)
# first two words
W1="${arr[@]:0:1}"
W2="${arr[@]:1:1}"
if [ "$W1" == "$W2" ]
then
echo $InLine
fi
done <$InFile # End of bash loop
... but the irregular spacing is lost.
1) Corrections and suggested improvements are welcomed.
2) Please show how the bash solution could be changed
to preserve the irregular spacing.
Thank you.
Daniel B. Martin
Last edited by danielbmartin; 10-13-2023 at 08:58 AM.
Reason: Cosmetic improvement.
while read -r InLine # Read one line from the InFile.
do
read -r -a arr <<< "$InLine"
# first two words (bash arrays are zero-based)
if [ "${arr[0]}" == "${arr[1]}" ]
then
echo "$InLine"
fi
done <"$InFile" # End of bash loop
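The quoted "$InLine" in the echo above is what keeps the irregular spacing. A minimal sketch of the difference, using a made-up sample line (the variable names line, unquoted and quoted are only for illustration):

```shell
#!/bin/bash
# Hypothetical sample line with irregular spacing (not the thread's InFile).
line='Samuel   Samuel  (my Uncle   Sam)'

# Unquoted: the shell splits the value into words and echo rejoins them
# with single spaces, so runs of blanks collapse.
unquoted=$(echo $line)

# Quoted: the value is passed as one argument, spacing intact.
quoted=$(printf '%s\n' "$line")

printf '[%s]\n' "$unquoted"
printf '[%s]\n' "$quoted"
```

The first printf shows single spaces between the words; the second shows the line exactly as stored.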
to exactly match the first (non-blank!) field. Still, the \b would also match
Code:
Carol Carol-Anne
So a fully precise pattern is
Code:
egrep '^ *([^ ]+) +\1( |$)' $InFile
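To see the difference between the two patterns, here is a small sketch run against three made-up lines. It assumes GNU grep (both \b and back-references in -E patterns are extensions) and uses grep -E, since recent GNU grep releases warn that egrep is obsolescent. The variable names sample, loose and strict are only for illustration.

```shell
#!/bin/bash
# Hypothetical test lines; only the second has identical first two words.
sample=$(printf '%s\n' 'Carol Carol-Anne' 'Carol  Carol Anne' 'Carolyn Carol')

# Word-boundary pattern from the first post: \b is also satisfied
# between "l" and "-", so "Carol Carol-Anne" slips through.
loose=$(printf '%s\n' "$sample" | grep -E '^(.+) *\b\1\b')

# Stricter pattern: the second field must be followed by a space
# or by the end of the line.
strict=$(printf '%s\n' "$sample" | grep -E '^ *([^ ]+) +\1( |$)')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

With these inputs the loose pattern selects the first two lines, while the strict pattern selects only "Carol  Carol Anne".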
If IFS contains whitespace characters (the default IFS does), read strips leading and trailing whitespace from the line.
The fix is
Code:
while IFS= read -r InLine
This sets an empty IFS for the following read command.
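A quick way to check this for yourself; the input string and the variable names trimmed and intact are made up for the demonstration:

```shell
#!/bin/bash
# Hypothetical input line with leading and trailing blanks.
input='   Harold   Harold  '

# Default IFS: read trims leading and trailing whitespace
# (inner spacing survives because there is only one variable).
read -r trimmed <<< "$input"

# Empty IFS for this one command only: the line is stored untouched.
IFS= read -r intact <<< "$input"

printf '[%s]\n' "$trimmed"
printf '[%s]\n' "$intact"
```

Because the assignment is a prefix to the read command, IFS in the rest of the script is unchanged.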
For reading whitespace-separated fields you want the default IFS.
If you always want to capture 2 fields then you can use 3 variables:
Code:
while read -r f1 f2 rest
The 3rd variable consumes the remainder (including any further embedded separators).
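Putting the two ideas together, one possible sketch reads the raw line with an empty IFS and then splits a copy of it into three variables with the default IFS. The function name same_first_two is hypothetical, and skipping blank lines is my assumption (awk '$1==$2' would print them).

```shell
#!/bin/bash
# same_first_two is a hypothetical helper name, not from the thread.
# Reads lines on stdin; prints those whose first two fields are identical.
same_first_two() {
    local line f1 f2 rest
    # The "|| [[ -n $line ]]" part also handles a last line without a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        # Default IFS here: leading blanks and repeated separators are skipped.
        read -r f1 f2 rest <<< "$line"
        # Assumption: require a non-empty first field so blank lines are dropped.
        if [[ -n $f1 && $f1 == "$f2" ]]; then
            printf '%s\n' "$line"   # print the untouched line
        fi
    done
}

result=$(printf '%s\n' 'Daniel George' 'Harold    Harold' 'David  David  x' 'Davi David' |
    same_first_two)
printf '%s\n' "$result"
```

The right-hand side of == is quoted so that it is compared as a literal string rather than as a glob pattern.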
#!/bin/bash
InFile="$1"
while IFS='' read -r InLine # Read one line from the InFile.
do
read -r -a arr <<< "$InLine"
# first two words
if [ "${arr[0]}" == "${arr[1]}" ]
then
echo "$InLine"
fi
done <"$InFile" # End of bash loop
#!/bin/sh
# https://www.linuxquestions.org/questions/programming-9/select-lines-where-the-first-two-words-are-identical-4175729833/
while IFS= read -r InLine
do
set -- $InLine
if [ "$1" = "$2" ]
then
echo "$InLine"
fi
done <<"DONE"
Daniel George
Henry Frank
Linda Carol Mary Debbie Michelle
Samuel Samuel (my Uncle Sam)
Irving Simon Simon
Harold Harold
Edward Edward
Robert Richard Robert
David David
Davi David
David Davi
avid David
David avid
DONE
The unquoted $InLine is subject to filename generation (globbing) unless you turn it off:
Code:
set -f # disable filename generation (globbing)
while IFS= read -r InLine
do
set -- $InLine # only word splitting
...
done
...
set +f # enable filename generation
In bash, do and done are not required; the solution in post #12 (using set) is more efficient. Additionally, you might need to save the initial set of options and restore them at the end, which is not required if you use a subshell.
So probably still better to use an array.
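For what it's worth, the subshell idea mentioned above could look roughly like this. It is only a sketch: the function name first_two_equal_sh is hypothetical, and the blank-line check is my assumption. The sample input deliberately contains a line starting with "*", which would expand to file names if globbing were left on.

```shell
#!/bin/bash
# first_two_equal_sh is a hypothetical helper name, not from the thread.
# The function body is a subshell "( ... )", so set -f and the changed
# positional parameters disappear when the function returns.
first_two_equal_sh() (
    set -f                      # disable filename generation in the subshell
    while IFS= read -r line; do
        set -- $line            # word splitting only; globbing is off
        # Assumption: skip blank lines, where $1 and $2 are both empty.
        if [ -n "$1" ] && [ "$1" = "$2" ]; then
            printf '%s\n' "$line"
        fi
    done
)

result=$(printf '%s\n' '*  *  stars' 'Edward Edward' 'Robert Richard' |
    first_two_equal_sh)
printf '%s\n' "$result"
```

There is no set +f and no saving of options, because the calling shell's settings were never touched.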