Linux command(s) to eliminate redundant words in a line
The input file is text of this form:
meat beef
flavor vanilla flavor chocolate
color blue color brown color green color red color yellow
music classical music jazz
wrench socket
vegetable potato vegetable broccoli vegetable carrot
automobile mercedes benz automobile toyota automobile rolls royce
The objective is to eliminate the redundant words (if any).
The key word is always the first word in the record.
However, the redundant words are not always in positions 3, 5, 7, etc., because some values span more than one word (see the automobile line).
Desired output file:
meat beef
flavor vanilla chocolate
color blue brown green red yellow
music classical jazz
wrench socket
vegetable potato broccoli carrot
automobile mercedes benz toyota rolls royce
Intuition points to sed but the syntax baffles me.
Cedrik's Perl one-liner is certainly more compact, but I think an awk one-liner would be easier to grok:
Code:
awk '{ printf("%s", $1) ; for (i = 2; i <= NF; i++) if ($i != $1) printf(" %s", $i); printf("\n") }' input-file
Print the first field. Then, print each following field (preceded by the field separator) if it does not match the first field. (You can also use if (tolower($i) != tolower($1)) if you want a case-insensitive comparison.) End the record with a newline.
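As a rough sketch of the case-insensitive variant mentioned above, the same one-liner can be wrapped in a small shell function that reads from standard input. The function name dedupe_ci is only an illustrative choice, not something from this thread:

```shell
# Hypothetical helper: drop every later field that matches the first field,
# comparing case-insensitively. Reads records from standard input.
dedupe_ci() {
    awk '{
        printf("%s", $1)                      # always keep the key word
        for (i = 2; i <= NF; i++)
            if (tolower($i) != tolower($1))   # skip repeats of the key word
                printf(" %s", $i)
        printf("\n")                          # end the record
    }'
}

# Example: prints "Color blue brown green"
printf '%s\n' 'Color blue color brown COLOR green' | dedupe_ci
```

Because awk splits on runs of whitespace by default, this also tolerates multiple spaces or tabs between words.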
Just for fun, I composed this without looking at any other responses. Surely someone else already did it better, but let's see.
Code:
test$ awk '{keyword = $1; record = keyword; position = 1; while (position++ < NF) {if ($position != keyword) {record = record FS $position}} print record}' input_file.txt
meat beef
flavor vanilla chocolate
color blue brown green red yellow
music classical jazz
wrench socket
vegetable potato broccoli carrot
automobile mercedes benz toyota rolls royce
test$
EDIT
Looks like Nominal Animal's solution #4 resembles my own, though it prefers for over while and prints the new record one field at a time. The bit about tolower(...) applies to my solution as well, of course.
Code:
sed -r ':a;s/^([^ ]+) (.*) \1(.*)$/\1 \2\3/g;ta' txt
It will loop on the line until all occurrences of the first word except that first occurrence are removed from it, then continue to the next line.
Edit: the above will fail on lines composed only of two or more equal words (more generally, whenever a duplicate immediately follows the first word). To cope with that situation use:
Code:
sed -r ':a;s/^([^ ]+) (.*)\1(.*)$/\1 \2\3/g;ta;s/[ ]+/ /g' txt
This version does not assume that a duplicate of the first word is preceded by non-duplicate content plus a space; it simply groups everything that may exist before the duplicate (including a possible extra space). Any extra spaces left in the line are then collapsed by the second 's' expression.
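One caveat: both sed commands match the first word as a plain string, so a longer word that merely contains it (for example colorful on a color line) would be trimmed as well. The following is a minimal sketch of a whole-word variant, assuming GNU sed and single-space separators as in the sample input. The function name dedupe_sed is hypothetical:

```shell
# Hypothetical helper (GNU sed assumed, as in the commands above).
# Removes later copies of the first word only when they are whole words:
#   ^([^ ]+)   the key word
#   ( .*)?     optional text between the key word and a duplicate
#    \1( |$)   a duplicate preceded by a space and followed by a space or end of line
# The loop (:a ... ta) repeats until no whole-word duplicate is left.
dedupe_sed() {
    sed -r ':a;s/^([^ ]+)( .*)? \1( |$)/\1\2\3/;ta'
}

# Example: prints "color blue brown colorful red"
printf '%s\n' 'color blue color brown colorful color red' | dedupe_sed
```

Since the duplicate is removed together with its leading space, no second 's' expression is needed to clean up doubled or trailing spaces.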