08-04-2017, 04:53 AM | #1
LQ Newbie | Registered: Aug 2017 | Posts: 3
extract common words in the first column of multiple text files
Hi,
I am new to Linux programming. I'm attempting to write a simple script that extracts the words appearing in the first column of all the text files in a given folder.
I have multiple text files, each containing 3 columns. I want to extract only those words that are present in all the files.
For example:
file1.txt
rs12 band 0.003
rs43 band 0.044
rs67 band 0.011
etc.
file2.txt
rs33 naps 0.0045
rs98 naps 0.0004
rs12 naps 0.01
etc.
file3.txt
rs67 alle 0.003
rs12 alle 0.002
rs98 alle 0.00003
etc
I have many files like this, and I want an output containing the words common to the first column of all of them.
output.txt
rs12
Any help/suggestions are highly appreciated.
Thank you.
08-04-2017, 05:27 AM | #2
LQ Guru | Registered: Apr 2005 | Distribution: Linux Mint, Devuan, OpenBSD | Posts: 7,756
Welcome to the forum. The tools you need are common to all modern systems; in other words, you'll find them everywhere except Windows.
To accomplish the task you describe, you could look at either Perl or Awk. With awk, there are built-in variables associated with the input files; look at FNR and FILENAME in particular. With Perl you have a lot of options, but you might start with the -p or -n switch and write a one-liner. In Perl, the equivalent of awk's FILENAME is $ARGV when you are using the -p or -n loop for a one-liner.
Code:
man awk
man perlrun
Let us know which language you have chosen and your approach. Show us the code and we can help you over the hard parts.
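For example, FNR resets to 1 at the start of each input file, so it can be used to spot file boundaries. A minimal illustration, not the full solution (output shown for just the three example lines of each sample file above):
Code:
$ awk 'FNR == 1 { print "--", FILENAME } { print FNR, $1 }' file1.txt file2.txt
-- file1.txt
1 rs12
2 rs43
3 rs67
-- file2.txt
1 rs33
2 rs98
3 rs12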
Last edited by kakistocrat; 08-04-2017 at 05:32 AM.
Reason: $ARGV
1 member found this post helpful.
08-04-2017, 06:07 AM | #3
LQ Addict | Registered: Dec 2013 | Posts: 19,872
So first you need to extract the first word on each line, for each file.
Easy with awk:
Code:
awk '{print $1}' file
Then you need to compare your findings and find the one (or several) that occur(s) in every file.
Not so easy; I'd use bash, because I'm familiar with it, but I guess it could be done with awk, too. Or Perl. Or...
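A minimal bash sketch of that comparison step (assuming the files are named file1.txt, file2.txt, and so on, and nothing else matches file*.txt): de-duplicate each file's first column, then keep the words whose total count equals the number of files.
Code:
files=(file*.txt)
n=${#files[@]}                       # number of input files
for f in "${files[@]}"; do
    awk '{print $1}' "$f" | sort -u  # each file's first-column words, once each
done | sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }'
With the three sample files this prints rs12; the sort -u inside the loop guards against a word repeating within a single file.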
1 member found this post helpful.
08-04-2017, 06:43 AM | #4
Member | Registered: May 2016 | Location: Belgrade, Serbia | Distribution: Debian | Posts: 229
Code:
awk '{print $1}' file1 file2 file3 | sort | uniq -d
You would learn more by doing your best to find the answer yourself than by picking it up ready-made like this. Just saying, with no ill intention at all.
2 members found this post helpful.
08-04-2017, 07:02 AM | #5
LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,448
How does that ensure the output is in all files, not just in multiple files?
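To illustrate with the sample files from the original post: rs67 and rs98 each appear in the first column of two of the three files, so uniq -d (which reports any line occurring more than once) lists them alongside rs12.
Code:
$ awk '{print $1}' file1.txt file2.txt file3.txt | sort | uniq -d
rs12
rs67
rs98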
1 member found this post helpful.
08-04-2017, 07:16 AM | #6
LQ Newbie | Registered: Aug 2017 | Posts: 3 | Original Poster
Thank you all for the suggestions. It seems to work fine.
I first made separate files by extracting the first columns using
Code:
awk '{print $1}' file
then I used
Code:
awk 'END { for (r in _) if (_[r] == ARGC - 1) print r }
{ _[$0]++ }' filename1 filename2 filename3 ... > output_common.txt
This works perfectly fine. However, I am doing it somewhat inefficiently, since I am using two steps.
08-04-2017, 07:30 AM | #7
LQ Guru | Registered: Apr 2005 | Distribution: Linux Mint, Devuan, OpenBSD | Posts: 7,756
You can do it in one step if you use only the first column as the key instead of the whole line.
Code:
awk '{ a[$1]++ } END { for (r in a ) if ( a[r] == ARGC - 1) print r }' file?.txt
However, if the key turns up more than once in the same file, it will throw the results off.
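With the three sample files from the original post this should print just the shared key, since ARGC - 1 equals the number of input files:
Code:
$ awk '{ a[$1]++ } END { for (r in a) if (a[r] == ARGC - 1) print r }' file?.txt
rs12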
3 members found this post helpful.
08-04-2017, 07:40 AM | #8
LQ Newbie | Registered: Aug 2017 | Posts: 3 | Original Poster
[solved]
That was really cool!
Thanks.
1 member found this post helpful.
08-04-2017, 08:00 AM | #9
LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,448
nm ...
Last edited by syg00; 08-05-2017 at 11:20 PM.
08-04-2017, 11:46 PM | #10
LQ Guru | Registered: Apr 2005 | Distribution: Linux Mint, Devuan, OpenBSD | Posts: 7,756
As an afterthought: one way to deal with possible duplicates within a single file is to use a second array that counts keys within the current file only, clearing it whenever a new file is started:
Code:
awk '{
    # new input file: clear the per-file "seen" array
    if ( f != FILENAME ) { delete c; f = FILENAME }
    # count each key at most once per file
    if ( ! c[$1]++ ) { a[$1]++ }
}
END { for (r in a) if (a[r] == ARGC - 1) print r }' file?.txt
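A quick way to see the difference (a hypothetical test that duplicates a key inside one file; awk's for-in output order may vary):
Code:
$ printf 'rs67 band 0.090\n' >> file1.txt   # rs67 now occurs twice in file1.txt
$ awk '{ a[$1]++ } END { for (r in a) if (a[r] == ARGC - 1) print r }' file?.txt
rs12
rs67
$ awk '{
    if ( f != FILENAME ) { delete c; f = FILENAME }
    if ( ! c[$1]++ ) { a[$1]++ }
} END { for (r in a) if (a[r] == ARGC - 1) print r }' file?.txt
rs12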
08-05-2017, 07:52 PM | #11
LQ Guru | Registered: Jan 2005 | Location: USA and Italy | Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint | Posts: 5,524
Deleted
Last edited by AwesomeMachine; 08-05-2017 at 07:54 PM.
08-06-2017, 09:43 PM | #12
Member | Registered: Mar 2015 | Distribution: Linux Mint | Posts: 634
Python3 Program
Python has an object type called a set, and sets have a method called intersection.
It takes multiple sets and finds the values that "intersect", i.e. those present in all of them: {1, 2, 3}, {3, 4, 5}, {3, 6, 12} will produce {3}.
Sets can only hold unique values: {1, 2, 3} is OK, but {1, 2, 2, 3} will result in {1, 2, 3}.
Does this sound like your problem? I wrote an example program that does what you're looking for. Most of it is collecting each file's first column into its own set; finally it runs source_set.intersection(*other_sets) to produce the common strings (or words).
Code:
$ ./similar_words.py file*.txt
rs12
Code:
#!/usr/bin/env python3
import csv
import sys

list_unique_words = list()
for _file in sys.argv[1:]:
    with open(_file) as f:
        unique_words = {row[0] for row in csv.reader(f, delimiter=' ')}
    list_unique_words.append(unique_words)

source_set = list_unique_words[0]
other_sets = list_unique_words[1:]
for word in source_set.intersection(*other_sets):
    print(word)
EDIT:
This seemed useful to me, so I made it a little more generic.
Code:
$ ./similar_words.py -h
usage: Parse files for similar columns [-h] [-c COLUMN] [-d DELIMITER]
                                       [-f FILE]
optional arguments:
  -h, --help            show this help message and exit
  -c COLUMN, --column COLUMN
  -d DELIMITER, --delimiter DELIMITER
  -f FILE, --file FILE
This does the default behavior:
Code:
./similar_words.py -d ' ' -c 1 file*.txt
rs12
... But you can now use --delimiter (-d) to change the delimiter and --column (-c) to change which column. You can also use --file (-f) to specify a file, but it's only useful if you use it more than once (e.g. -f file1.txt -f file2.txt).
Code:
#!/usr/bin/env python3
import csv
import sys
import argparse

parser = argparse.ArgumentParser('Parse files for similar columns')
parser.add_argument('-c', '--column',
                    type=int,
                    default=1,
                    )
parser.add_argument('-d', '--delimiter',
                    type=str,
                    default=' ',
                    )
parser.add_argument('-f', '--file',
                    type=str,
                    action='append',
                    )
args, other_args = parser.parse_known_args()
files = args.file if args.file else other_args

list_unique_words = list()
for _file in files:
    with open(_file) as f:
        unique_words = {row[args.column - 1]
                        for row in csv.reader(f, delimiter=args.delimiter)}
    list_unique_words.append(unique_words)

source_set = list_unique_words[0]
other_sets = list_unique_words[1:]
for word in source_set.intersection(*other_sets):
    print(word)
Last edited by Sefyir; 08-09-2017 at 10:06 AM.