LinuxQuestions.org

danielbmartin 12-08-2011 10:20 AM

Linux command(s) to eliminate redundant words in a line
 
The input file is text of this form:
meat beef
flavor vanilla flavor chocolate
color blue color brown color green color red color yellow
music classical music jazz
wrench socket
vegetable potato vegetable broccoli vegetable carrot
automobile mercedes benz automobile toyota automobile rolls royce

The objective is to eliminate the redundant words (if any).
The key word is always the first word in the record.
However, the redundant words are not always in positions 3, 5, 7, etc. (see the last record).

Desired output file:
meat beef
flavor vanilla chocolate
color blue brown green red yellow
music classical jazz
wrench socket
vegetable potato broccoli carrot
automobile mercedes benz toyota rolls royce

Intuition points to sed but the syntax baffles me.

Please advise.

Snark1994 12-08-2011 10:40 AM

Something like
Code:

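while read -r line; do
    key=${line%% *}    # first word of the record
    # drop every later whole-word copy of the key (\b needs GNU sed)
    echo "$key$(echo "${line#"$key"}" | sed "s/ $key\b//g")"
done < text_file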

Completely untested, I'm afraid, I'm at work on Windows :) I can check it later this evening.

Cedrik 12-08-2011 10:48 AM

A Perl one-liner:
Code:

perl -pae '%s=();$_=join(" ",grep{!$s{$_}++}@F)."\n"' input_file
Surely there is a simpler way ;)

Nominal Animal 12-08-2011 11:01 AM

Cedrik's Perl one-liner is certainly more compact, but I think an awk one-liner would be easier to grok:
Code:

awk '{ printf("%s", $1) ; for (i = 2; i <= NF; i++) if ($i != $1) printf(" %s", $i); printf("\n") }' input-file
Print the first field. Then print each following field, preceded by a space, if it does not match the first field. (Use if (tolower($i) != tolower($1)) instead if you want a case-insensitive comparison.) End the record with a newline.
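
A minimal sketch of that case-insensitive variant, wrapped in a shell function so it can be used as a filter. The function name dedupe_ci and the sample line are made up for illustration; only the tolower() comparison comes from the post above.

```shell
# Hypothetical helper: drop later copies of the first word, ignoring case.
# Reads the named files, or stdin when no file is given.
dedupe_ci() {
    awk '{
        key = tolower($1)          # lower-cased keyword for comparison
        out = $1                   # keep the keyword as originally written
        for (i = 2; i <= NF; i++)
            if (tolower($i) != key)
                out = out " " $i   # keep every field that is not the keyword
        print out
    }' "$@"
}
```

For example, printf 'Color blue COLOR brown\n' | dedupe_ci should print "Color blue brown".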

Juako 12-08-2011 02:07 PM

Code:

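#!/bin/bash
# Note: the expansion removes the keyword as a substring, so it also
# strips it from inside longer words (e.g. "art" from "artist").
while read -r first rest; do
    echo $first ${rest//"$first"/}    # unquoted: word splitting squeezes spaces
done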

Use it as a filter on stdin:

me@localhost:~$ ./script < text.txt
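
One caveat: ${rest//$first/} deletes the keyword wherever it appears, even inside a longer word, so a record such as "art artist" would come out as "art ist". A rough sketch of a whole-word alternative in plain POSIX shell follows. The function name dedupe_keyword is hypothetical, and the sketch assumes fields are separated by spaces or tabs and that every line ends with a newline.

```shell
# Hypothetical helper: compare whole words instead of substrings.
# The body is a ( ... ) subshell so that set -f does not leak to the caller.
dedupe_keyword() (
    set -f                                   # no filename expansion on $rest
    while read -r key rest; do
        out=$key
        for word in $rest; do                # unquoted on purpose: split into words
            [ "$word" = "$key" ] || out="$out $word"
        done
        printf '%s\n' "$out"
    done
)
```

It is used the same way as the script above, e.g. dedupe_keyword < text.txt.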

Telengard 12-08-2011 04:45 PM

Just for fun, I composed this without looking at any other responses. Surely someone else already did it better, but let's see.
:)

Code:

test$ awk '{keyword = $1; record = keyword; position = 1; while (position++ < NF) {if ($position != keyword) {record = record FS $position}} print record}' input_file.txt
meat beef
flavor vanilla chocolate
color blue brown green red yellow
music classical jazz
wrench socket
vegetable potato broccoli carrot
automobile mercedes benz toyota rolls royce
test$

EDIT
Looks like Nominal Animal's solution #4 resembles my own, though preferring for over while and printing the new record one field at a time. The bit about tolower(...) applies to my solution as well, of course ;)

Juako 12-08-2011 09:01 PM

This sed also seems to work:

Code:

sed -r ':a;s/^([^ ]+) (.*) \1(.*)$/\1 \2\3/g;ta' txt
It loops on the line until every occurrence of the first word except the first one has been removed, then continues to the next line.

edit: the above will fail on lines consisting only of two or more identical words. To cope with that situation use:

Code:

sed -r ':a;s/^([^ ]+) (.*)\1(.*)$/\1 \2\3/g;ta;s/[ ]+/ /g' txt
This version does not assume that a duplicate of the first word is preceded by non-duplicate content plus a space; the second group simply captures whatever precedes the duplicate, including a possible extra space. The second 's' expression then squeezes any leftover runs of spaces into a single space.
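
Both expressions treat the first word as plain text rather than as a whole word: the first also matches it at the start of a longer word ("art" in "artist"), and the second matches it anywhere ("art" in "smart"). A possible whole-word variant is sketched below. It assumes GNU sed (for -E with back-references) and single spaces between words, and the function name dedupe_sed is made up for illustration.

```shell
# Hypothetical helper: remove later whole-word copies of the first word.
# \1 is the keyword, \2 is whatever precedes the duplicate (possibly nothing),
# and \3 is the space or end of line that must follow the duplicate.
dedupe_sed() {
    sed -E ':a; s/^([^ ]+)( .*)? \1( |$)/\1\2\3/; ta' "$@"
}
```

Each pass removes one duplicate together with the space in front of it, so no second clean-up expression should be needed; a line such as "a a a" collapses to "a".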


All times are GMT -5.