LinuxQuestions.org
Old 10-13-2023, 08:56 AM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Select lines where the first two words are identical


Have: a file in which each line contains two or more blank-delimited words.

Want: an OutFile which contains lines from the InFile where the first two words are identical.

This is a learning exercise, nothing more.

With this InFile ...
Code:
Daniel George
Henry  Frank
Linda Carol Mary Debbie Michelle
Samuel   Samuel  (my Uncle Sam)
Irving Simon Simon
 Harold Harold
Edward  Edward
Robert Richard Robert
 David David
Davi David
 David Davi
avid David
David avid
... the desired OutFile is ...
Code:
Samuel   Samuel  (my Uncle Sam)
 Harold Harold
Edward  Edward
 David David
Note that the irregular spacing is preserved.

This awk works.
Code:
awk '{if ($1==$2) print}' $InFile >$OutFile
This concise awk also works.
Code:
awk '$1==$2' $InFile >$OutFile
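Why the awk one-liner keeps the spacing: with the default field separator, awk skips leading blanks and treats a run of blanks as one separator when it builds $1 and $2, while a pattern with no action prints the unmodified record. A small sketch, using a made-up three-line sample rather than the real InFile:

```bash
# Hypothetical sample; note the doubled blanks on the first line and the
# leading blank on the second.
sample=$(printf '%s\n' 'Edward  Edward' ' Harold Harold' 'Davi David')

# $1 and $2 are compared after whitespace splitting; the bare pattern
# prints the whole original record, spacing included.
result=$(printf '%s\n' "$sample" | awk '$1 == $2')
printf '%s\n' "$result"
```

One caveat worth checking on your awk: fields that look like numbers are normally compared numerically, so a line such as `1 1.0` would probably be selected too. Writing `$1"" == $2""` forces a string comparison.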
This sed works.
Code:
sed -rn '/^(.+) *\b\1\b/p' $InFile >$OutFile
This grep works.
Code:
egrep '^(.+) *\b\1\b' $InFile >$OutFile
This bash almost works...
Code:
while read InLine   # Read one line from the InFile.
  do
    arr=($InLine)
# first two words
    W1="${arr[@]:0:1}"
    W2="${arr[@]:1:1}"
    if [ "$W1" == "$W2" ]
      then
      echo $InLine
    fi
  done <$InFile  # End of bash loop
... but the irregular spacing is lost.

1) Corrections and suggested improvements are welcomed.

2) Please show how the bash solution could be changed
to preserve the irregular spacing.

Thank you.

Daniel B. Martin

.

Last edited by danielbmartin; 10-13-2023 at 08:58 AM. Reason: Cosmetic improvement.
 
Old 10-13-2023, 09:30 AM   #2
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,146

Rep: Reputation: 1264
You want
Code:
echo "$InLine"
to preserve $InLine as a single item.
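To see the difference, here is a small sketch using one line from the sample InFile. Unquoted, the shell splits $InLine into words and echo rejoins them with single blanks; quoted, the string is passed as one argument. The variable names unquoted and quoted are made up for the illustration.

```bash
InLine='Samuel   Samuel  (my Uncle Sam)'   # one line from the sample InFile

unquoted=$(echo $InLine)     # word splitting: runs of blanks collapse to one
quoted=$(echo "$InLine")     # single argument: spacing passed through

printf '[%s]\n' "$unquoted" "$quoted"
```

Quoting only protects what is already in the variable. A leading blank that read has stripped earlier cannot be brought back this way.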
 
2 members found this post helpful.
Old 10-13-2023, 10:03 AM   #3
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,899

Rep: Reputation: 7318
Code:
while read -r InLine   # Read one line from the InFile.
  do
    read -r -a arr <<< "$InLine"
# first two words

    if [ "${arr[1]}" == "${arr[2]}" ]
      then
      echo "$InLine"
    fi
  done <"$InFile"  # End of bash loop
Use shellcheck to find problems in bash scripts.
 
Old 10-13-2023, 11:48 AM   #4
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Rep: Reputation: 660
Quote:
Originally Posted by smallpond View Post
You want
Code:
echo "$InLine"
to preserve $InLine as a single item.
Thank you, smallpond, for this correction. It results in a partial improvement. Instead of this...
Code:
Samuel Samuel (my Uncle Sam)
Harold Harold
Edward Edward
David David
... it produced this ...
Code:
Samuel   Samuel  (my Uncle Sam)
Harold Harold
Edward  Edward
David David
... but we really want this ...
Code:
Samuel   Samuel  (my Uncle Sam)
 Harold Harold
Edward  Edward
 David David
Daniel B. Martin

.
 
Old 10-13-2023, 12:16 PM   #5
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Rep: Reputation: 660
Quote:
Originally Posted by pan64 View Post
Code:
while read -r InLine   # Read one line from the InFile.
  do
    read -r -a arr <<< "$InLine"
# first two words

    if [ "${arr[1]}" == "${arr[2]}" ]
      then
      echo "$InLine"
    fi
  done <"$InFile"  # End of bash loop
Did you test this?

Daniel B. Martin

.
 
Old 10-13-2023, 12:35 PM   #6
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,899

Rep: Reputation: 7318
Quote:
Originally Posted by danielbmartin View Post
Did you test this?

Daniel B. Martin

.
No, I didn't. You can do that. I have a bad habit of posting untested and/or almost working solutions. Better to take it as an idea, not a solution.
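For readers following along: the likely problem with the loop in post #3 is that bash arrays are zero-based, so the first two words are ${arr[0]} and ${arr[1]}, whereas ${arr[1]} and ${arr[2]} are the second and third. A minimal sketch with one line from the sample InFile (the names first, second and third are made up):

```bash
# bash-specific: read -a and here-strings are not POSIX sh.
read -r -a arr <<< ' Harold Harold'   # leading blank is skipped when splitting

first=${arr[0]}      # first word
second=${arr[1]}     # second word
third=${arr[2]-}     # unset here, expands to empty: the line has two words

printf 'first=%s second=%s third=%s\n' "$first" "$second" "$third"
```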
 
Old 10-14-2023, 12:35 AM   #7
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,868
Blog Entries: 1

Rep: Reputation: 1869
Perhaps this:
Code:
 while IFS= read -r InLine; do
...
 
1 member found this post helpful.
Old 10-14-2023, 07:54 AM   #8
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,804

Rep: Reputation: 1203
Your egrep would match a line
Code:
Daniel George Daniel George
I think the egrep should be
Code:
egrep '^ *([^ ]+) +\1\b' $InFile
to exactly match the first (non-blank!) field. Still the \b would match
Code:
Carol Carol-Anne
So 100% precise is
Code:
egrep '^ *([^ ]+) +\1( |$)' $InFile
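The two patterns can be compared on the edge cases named above. This is a sketch under some assumptions: the sample lines are made up apart from the two taken from the InFile, grep -E is used in place of egrep (recent GNU grep marks egrep as obsolescent), and back-references inside -E patterns are a GNU extension rather than something POSIX guarantees.

```bash
# Two edge cases from this post plus two lines that should be selected.
sample=$(printf '%s\n' \
    'Daniel George Daniel George' \
    'Carol Carol-Anne' \
    'Samuel   Samuel  (my Uncle Sam)' \
    ' Harold Harold')

# Original pattern: (.+) may span several words, so the first line slips through.
loose=$(printf '%s\n' "$sample" | grep -E '^(.+) *\b\1\b')

# Anchored to the first non-blank field, ended by a blank or the end of line.
strict=$(printf '%s\n' "$sample" | grep -E '^ *([^ ]+) +\1( |$)')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The strict pattern keeps only the Samuel and Harold lines. It treats only the space character as a delimiter, so tab-separated input would need [[:blank:]] in place of the literal spaces.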
With the default IFS (space, tab, newline), read strips leading and trailing whitespace from each line, which is why the leading blank is lost.
The fix is
Code:
while IFS= read -r InLine
This sets an empty IFS for the following read command.
For reading whitespace-separated fields you want the default IFS.
If you always want to capture 2 fields then you can use 3 variables:
Code:
while read -r f1 f2 rest
The 3rd variable consumes the remainder (including any further embedded separators).
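The two ideas can be combined: read each line verbatim with an empty IFS, then split a copy of it with the default IFS. The following is one possible arrangement, not necessarily the one intended in this post; the function name select_same_first_two and the variable names are made up for illustration.

```bash
# Hypothetical helper; bash-specific because of local and the here-string.
select_same_first_two() {
    local line f1 f2 rest
    while IFS= read -r line; do            # empty IFS: keep leading blanks
        read -r f1 f2 rest <<< "$line"     # default IFS: split into fields
        # Require a non-empty first word so blank lines are not selected.
        if [ -n "$f1" ] && [ "$f1" = "$f2" ]; then
            printf '%s\n' "$line"
        fi
    done
}

result=$(printf '%s\n' ' David David' 'Davi David' 'Edward  Edward' |
    select_same_first_two)
printf '%s\n' "$result"
```

Because no variable is expanded unquoted, this form performs no filename generation on the input. With a file it would be called as `select_same_first_two <"$InFile" >"$OutFile"`.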
 
1 member found this post helpful.
Old 10-14-2023, 09:28 AM   #9
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,899

Rep: Reputation: 7318
Quote:
Originally Posted by danielbmartin View Post
Did you test this?

Daniel B. Martin

.
Code:
#!/bin/bash

InFile="$1"

while IFS='' read -r InLine   # Read one line from the InFile.
do
    read -r -a arr <<< "$InLine"
    # first two words

    if [ "${arr[0]}" == "${arr[1]}" ]
    then
        echo "$InLine"
    fi
done <"$InFile"  # End of bash loop
Here is a tested version
 
1 member found this post helpful.
Old 10-14-2023, 01:34 PM   #10
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Rep: Reputation: 660
Thank you NevemTeve and MadeInGermany for specific suggestions. Special thanks to pan64 for a complete and concise solution.

This was a useful exercise, one which illustrates the superiority of awk, at least in this specific case.

SOLVED!

Daniel B. Martin

.
 
Old 10-14-2023, 01:57 PM   #11
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,868
Blog Entries: 1

Rep: Reputation: 1869
You can do it without arrays as well:
Code:
#!/bin/sh
# https://www.linuxquestions.org/questions/programming-9/select-lines-where-the-first-two-words-are-identical-4175729833/

while IFS= read -r InLine
do
    set -- $InLine

    if [ "$1" = "$2" ]
    then
        echo "$InLine"
    fi
done <<"DONE"
Daniel George
Henry  Frank
Linda Carol Mary Debbie Michelle
Samuel   Samuel  (my Uncle Sam)
Irving Simon Simon
 Harold Harold
Edward  Edward
Robert Richard Robert
 David David
Davi David
 David Davi
avid David
David avid
DONE

Last edited by NevemTeve; 10-14-2023 at 01:58 PM.
 
1 member found this post helpful.
Old 10-15-2023, 03:18 AM   #12
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,804

Rep: Reputation: 1203
The $InLine is subject to filename generation, unless you turn it off:
Code:
set -f # disable filename generation (globbing)
while IFS= read -r InLine
do
    set -- $InLine # only word splitting
    ...
done
...
set +f # enable filename generation
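A small demonstration of the hazard, as a sketch: it works in a scratch directory created with mktemp so the glob has exactly one known file to match. The input line `* x` and the variable names are made up for the illustration.

```bash
olddir=$PWD
tmpdir=$(mktemp -d)            # scratch directory with one known file
cd "$tmpdir" || exit 1
touch x

InLine='* x'                   # hypothetical input line with a glob character

set -- $InLine                 # globbing on: * expands to the file name x
with_glob="$1 $2"              # "x x", a false match

set -f                         # disable filename generation
set -- $InLine                 # word splitting only
without_glob="$1 $2"           # "* x", compared as written
set +f                         # enable it again

printf '%s\n' "$with_glob" "$without_glob"
cd "$olddir" && rm -rf "$tmpdir"
```

With globbing on, the line looks as if its first two words were identical, although in the file they are `*` and `x`.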
 
2 members found this post helpful.
Old 10-15-2023, 09:29 PM   #13
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Original Poster
Rep: Reputation: 660
Quote:
Originally Posted by NevemTeve View Post
You can do it without arrays as well:
Code:
while IFS= read -r InLine
do
    set -- $InLine
    if [ "$1" = "$2" ]
    then
        echo "$InLine"
    fi
done <<"DONE"
Lovely! Thank you!!

Daniel B. Martin

.
 
Old 10-16-2023, 01:38 AM   #14
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,868
Blog Entries: 1

Rep: Reputation: 1869
Quote:
Originally Posted by MadeInGermany View Post
The $InLine is subject to filename generation, unless you turn it off:
Thank you for pointing this out.
Note: in real-life use, when changing a global setting I'd use a subshell so as not to interfere with other parts of the script, e.g.
Code:
while IFS= read -r InLine
do (
    set -f              # affects only this subshell
    set -- $InLine      # word splitting only

    if [ "$1" = "$2" ]
    then
        echo "$InLine"
    fi
) done <"$InFile"
 
Old 10-16-2023, 02:16 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,899

Rep: Reputation: 7318
In bash the subshell written as `do (` ... `) done` is not required; the solution in post #12 (calling set -f once around the loop) is more efficient because it does not start a subshell for every line. On the other hand, you might then need to save the initial set of options and restore them at the end, which is not required if you use a subshell.
So probably still better to use an array.
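For completeness, saving and restoring the option could look roughly like this. It is a sketch, not something posted in this thread: the variable names restore_glob and matches are made up, and the loop counts matches instead of printing them to keep the example short.

```bash
# Remember whether globbing was already disabled ("f" appears in $- if so).
case $- in
    *f*) restore_glob=false ;;   # noglob was already on: leave it on
    *)   restore_glob=true  ;;
esac

set -f
matches=0
while IFS= read -r InLine; do
    set -- $InLine               # word splitting only
    if [ $# -ge 2 ] && [ "$1" = "$2" ]; then
        matches=$((matches + 1))
    fi
done <<'EOF'
Edward  Edward
* *
Davi David
EOF

if "$restore_glob"; then set +f; fi
echo "$matches"
```

The sample counts two matches, including the literal `* *` line, which is compared as written because globbing is off. The array version needs none of this bookkeeping, which supports the conclusion above.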
 
  



Tags
awk, bash, grep, sed, text processing


