Old 12-08-2011, 10:20 AM   #1
Linux command(s) to eliminate redundant words in a line

The input file is text of this form:
meat beef
flavor vanilla flavor chocolate
color blue color brown color green color red color yellow
music classical music jazz
wrench socket
vegetable potato vegetable broccoli vegetable carrot
automobile mercedes benz automobile toyota automobile rolls royce

The objective is to eliminate the redundant words (if any).
The key word is always the first word in the record.
However, the redundant words are not always in positions 3, 5, 7, etc.

Desired output file:
meat beef
flavor vanilla chocolate
color blue brown green red yellow
music classical jazz
wrench socket
vegetable potato broccoli carrot
automobile mercedes benz toyota rolls royce

Intuition points to sed but the syntax baffles me.

Please advise.
Old 12-08-2011, 10:40 AM   #2
Something like
while read -r line; do
    key=$(echo "$line" | grep -o '^[a-zA-Z]*')   # first word of the line
    echo "$line" | sed "s/ $key\b//g"            # drop later copies (\b needs GNU sed)
done < text_file
Completely untested, I'm afraid. I'm at work on Windows, but I can check it later this evening.

Last edited by Snark1994; 12-08-2011 at 10:41 AM.
Old 12-08-2011, 10:48 AM   #3
A Perl one-liner:
perl -pae '%s=(); $_ = join(" ", grep { !$s{$_}++ } @F) . "\n"' input_file
Surely there is a simpler way.
Old 12-08-2011, 11:01 AM   #4
Nominal Animal
Cedrik's Perl one-liner is certainly more compact, but I think an awk one-liner would be easier to grok:
awk '{ printf("%s", $1) ; for (i = 2; i <= NF; i++) if ($i != $1) printf(" %s", $i); printf("\n") }' input-file
Print the first field. Then, print each following field (preceded by the field separator) if it does not match the first field. (You can also use if (tolower($i) != tolower($1)) if you want a case-insensitive comparison.) End the record with a newline.
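The case-insensitive variant mentioned above could look roughly like this. It is only a sketch: the function name dedup_key_ci is made up for illustration, and it assumes fields are separated by whitespace as in the sample input.

```shell
#!/bin/sh
# dedup_key_ci: hypothetical helper name, not from the thread.
# Reads records on stdin and drops every later field that equals the
# first field when both are compared in lower case.
dedup_key_ci() {
    awk '{
        printf("%s", $1)                      # always keep the key word
        for (i = 2; i <= NF; i++)
            if (tolower($i) != tolower($1))   # compare ignoring case
                printf(" %s", $i)
        printf("\n")
    }'
}

# Example: prints "Color blue brown green"
printf '%s\n' 'Color blue color brown COLOR green' | dedup_key_ci
```

The key keeps whatever capitalization it had in its first position; only the later copies are dropped.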
Old 12-08-2011, 02:07 PM   #5
while read first rest; do
    echo $first ${rest//$first/}
done< <(sed -r 's/^([^ ]+) (.*)$/\1 \2/')
Use it as a filter, reading from stdin:

me@localhost:~$ script < text.txt
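One thing to watch with ${rest//$first/}: it removes the key wherever it appears, including inside longer words (a key of red would also be cut out of redwood). A word-by-word comparison avoids that. The following is a minimal sketch in plain POSIX sh; the function name dedup_key_words is hypothetical, and it assumes words are separated by spaces or tabs.

```shell
#!/bin/sh
# dedup_key_words: hypothetical helper name, not from the thread.
# Reads records on stdin; keeps the first word and every later word
# that is not an exact copy of it.
dedup_key_words() {
    while read -r first rest; do
        out=$first
        set -f                      # no filename expansion while splitting $rest
        for word in $rest; do
            [ "$word" = "$first" ] || out="$out $word"
        done
        set +f
        printf '%s\n' "$out"
    done
}

# Example: prints "red redwood rose"
printf '%s\n' 'red redwood red rose' | dedup_key_words
```

Because it reads stdin, it can be used as a filter in the same way, e.g. dedup_key_words < text.txt.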

Last edited by Juako; 12-08-2011 at 07:17 PM. Reason: sed expr
Old 12-08-2011, 04:45 PM   #6
Just for fun, I composed this without looking at any other responses. Surely someone else already did it better, but let's see.

test$ awk '{keyword = $1; record = keyword; position = 1; while (position++ < NF) {if ($position != keyword) {record = record FS $position}} print record}' input_file.txt
meat beef
flavor vanilla chocolate
color blue brown green red yellow
music classical jazz
wrench socket
vegetable potato broccoli carrot
automobile mercedes benz toyota rolls royce
Looks like Nominal Animal's solution #4 resembles my own, though it prefers for over while and prints the new record one field at a time. The bit about tolower(...) applies to my solution as well, of course.

Last edited by Telengard; 12-08-2011 at 04:55 PM.
Old 12-08-2011, 09:01 PM   #7
This sed also seems to work:

sed -r ':a;s/^([^ ]+) (.*) \1(.*)$/\1 \2\3/g;ta' txt
It loops over the line until every occurrence of the first word, except the first one, has been removed, then continues to the next line.

Edit: the above fails on lines that consist only of two or more identical words. To cope with that situation, use:

sed -r ':a;s/^([^ ]+) (.*)\1(.*)$/\1 \2\3/g;ta;s/[ ]+/ /g' txt
This version does not assume that a duplicate of the first word is preceded by non-duplicate content plus a space; it simply groups whatever comes before the duplicate (including a possible extra space). Any extra spaces left in the edited line are squeezed to a single space by the second 's' expression.
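A caveat with both expressions: \1 is not tied to word boundaries, so a key such as color would also be cut out of colorful. A possible variant that only removes whole words is sketched below. It assumes GNU sed (-E is the newer spelling of -r) and single spaces between words; the function name dedup_key_sed is made up for illustration.

```shell
#!/bin/sh
# dedup_key_sed: hypothetical helper name, not from the thread.
# Assumes GNU sed and single-space separated words.
#   ^([^ ]+)   the key (first word)
#   ( .*)?     optional words between the key and a later copy
#    \1( |$)   a later copy of the key, followed by a space or end of line
# Each pass removes one copy; "ta" jumps back to :a while a substitution succeeded.
dedup_key_sed() {
    sed -E ':a;s/^([^ ]+)( .*)? \1( |$)/\1\2\3/;ta'
}

# Example: prints "color red colorful" and then "a"
printf '%s\n' 'color red colorful color' 'a a a' | dedup_key_sed
```

Since the copy is removed together with its leading space, no second 's' expression is needed to clean up doubled spaces.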

Last edited by Juako; 12-08-2011 at 09:18 PM.


