extract common words in the first column of multiple text files
Hi,
I am new to Linux programming. I'm attempting to come up with a simple script to extract words that appear in the first column of all the text files in a given folder.
I have multiple text files, each containing 3 columns. I want to extract only those words that are present in all the files.
For example:
file1.txt
rs12 band 0.003
rs43 band 0.044
rs67 band 0.011
etc.
Welcome to the forum. The kind of programming you mention is common to all modern systems: in other words, you'll find the tools for it on all systems except Windows.
In order to accomplish the task you describe, you could look at either perl or awk. With awk, there are some built-in variables associated with the input files; specifically, look at FNR and FILENAME. With perl, you have a lot of options, but you might start with the -p or -n option and do a one-liner. In perl, the equivalent of awk's FILENAME is $ARGV, if you are using the -p or -n loop for a one-liner.
Code:
man awk
man perlrun
Let us know which language you have chosen and your approach. Show us the code and we can help you over the hard parts.
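Just to give a feel for those two awk variables (a throwaway illustration, not a solution, and the file names are made up): FNR restarts from 1 at the top of every input file, and FILENAME holds the name of the file currently being read, so you can tell when awk has moved on to the next file.
Code:
awk 'FNR == 1 { print "now reading", FILENAME }' file1.txt file2.txt file3.txt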
So first you need to extract the first word on each line, for each file.
Easy with awk:
Code:
awk '{print $1}' file
Then you need to compare your findings and find the one (or several) that occur(s) in every file.
Not so easy; I'd use bash, because I'm familiar with it, but I guess it could be done with awk, too. Or perl. Or...
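For instance, a rough shell sketch of that second step (untested; it assumes the files match file*.txt, that columns are whitespace-separated, and that a word never appears twice in the same file) would be to take the unique first-column words of each file and count how many files each word turned up in:
Code:
for f in file*.txt; do awk '{print $1}' "$f" | sort -u; done | sort | uniq -c
Any word whose count equals the number of files is present in all of them.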
As an afterthought, one way to deal with possible duplicates within a single file is to use a second array that tracks what has already been seen in the current file, and to clear that array whenever a new file is started:
Code:
awk '{
    # a new input file has started: clear the per-file "seen" array
    if ( f != FILENAME ) { delete c; f = FILENAME }
    # count each first-column word at most once per file
    if ( ! c[$1]++ ) { a[$1]++ }
}
# ARGC - 1 is the number of file arguments, so a count that high
# means the word appeared in every file
END { for (r in a) if ( a[r] == ARGC - 1 ) print r }' file?.txt
Python has an object called a set. Sets have a method (or function) called intersection.
What this does is take multiple sets and find the values that "intersect". That is, {1, 2, 3}, {3, 4, 5} and {3, 6, 12} will produce {3}.
Sets can only hold unique values: {1, 2, 3} is OK, but {1, 2, 2, 3} will result in {1, 2, 3}.
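For instance, in an interactive Python session (the values here are just made up to look like your rs... identifiers):
Code:
>>> {'rs12', 'rs43', 'rs67'}.intersection({'rs12', 'rs99'}, {'rs12', 'rs43'})
{'rs12'}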
Does this sound like your problem? I wrote an example program that does what you're looking for. Most of it is collecting each file's first column into its own set. Then finally it runs source_set.intersection(*other_sets) to produce the common strings (or words).
Code:
$ ./similar_words.py file*.txt
rs12
Code:
#!/usr/bin/env python3
import csv
import sys
list_unique_words = list()

# collect the unique first-column words of each file into its own set
for _file in sys.argv[1:]:
    with open(_file) as f:
        unique_words = {row[0] for row in csv.reader(f, delimiter=' ')}
    list_unique_words.append(unique_words)

# the words present in every file are the intersection of all the sets
source_set = list_unique_words[0]
other_sets = list_unique_words[1:]
for word in source_set.intersection(*other_sets):
    print(word)
EDIT:
This seemed useful to me, so I made it a little more generic.
Code:
./similar_words.py -h
usage: Parse files for similar columns [-h] [-c COLUMN] [-d DELIMITER]
                                       [-f FILE]

optional arguments:
  -h, --help            show this help message and exit
  -c COLUMN, --column COLUMN
  -d DELIMITER, --delimiter DELIMITER
  -f FILE, --file FILE
This does the default behavior:
Code:
./similar_words.py -d ' ' -c 1 file*.txt
rs12
... But you can now use --delimiter (-d) to change the delimiter and --column (-c) to change which column. You can also use --file (-f) to specify a file, but it's useless unless you use it more than once (e.g. -f file1.txt -f file2.txt).
Code:
#!/usr/bin/env python3
import csv
import sys
import argparse
parser = argparse.ArgumentParser('Parse files for similar columns')
parser.add_argument('-c', '--column',
                    type=int,
                    default=1,
                    )
parser.add_argument('-d', '--delimiter',
                    type=str,
                    default=' ',
                    )
parser.add_argument('-f', '--file',
                    type=str,
                    action='append',
                    )
args, other_args = parser.parse_known_args()

# files come either from repeated -f options or from the leftover arguments
files = args.file if args.file else other_args

list_unique_words = list()
for _file in files:
    with open(_file) as f:
        unique_words = {row[args.column - 1]
                        for row in csv.reader(f, delimiter=args.delimiter)}
    list_unique_words.append(unique_words)

source_set = list_unique_words[0]
other_sets = list_unique_words[1:]
for word in source_set.intersection(*other_sets):
    print(word)