LinuxQuestions.org
LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 08-04-2017, 04:53 AM   #1
vinsweet
LQ Newbie
 
Registered: Aug 2017
Posts: 3

Rep: Reputation: Disabled
extract common words in the first column of multiple text files


Hi,
I am new to Linux programming. I am attempting to come up with a simple script to extract the words that appear in the first column of all the text files in a given folder.

I have multiple text files, each containing 3 columns. I want to extract only those words that are present in all the files.

For example:

file1.txt
rs12 band 0.003
rs43 band 0.044
rs67 band 0.011
etc.

file2.txt
rs33 naps 0.0045
rs98 naps 0.0004
rs12 naps 0.01
etc.

file3.txt
rs67 alle 0.003
rs12 alle 0.002
rs98 alle 0.00003
etc

I have many files like this and I want an output that contains common words in the first column of all files.

output.txt
rs12

Any help/suggestions are highly appreciated.
thank you
 
Old 08-04-2017, 05:27 AM   #2
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 5,674
Blog Entries: 3

Rep: Reputation: 2906
Welcome to the forum. The tools you'll need are common to all modern systems: in other words, you'll find them on all systems except Windows.

In order to accomplish the task you describe, you could look at either perl or awk. With awk, there are some built-in variables associated with the input files; look at FNR and FILENAME in particular. With perl you have a lot of options, but you might start with the -p or -n option and do a one-liner. In perl, the equivalent of awk's FILENAME is $ARGV, if you are using the -p or -n loop for a one-liner.

Code:
man awk
man perlrun
Let us know which language you have chosen and your approach. Show us the code and we can help you over the hard parts.
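For instance, here is a quick illustration of those awk variables (the sample files below are made up to resemble the thread's data): FNR restarts at 1 for each input file, NR keeps counting across all files, and FILENAME names the file currently being read.

```shell
# Two small made-up sample files
printf 'rs12 band 0.003\nrs43 band 0.044\n' > demo1.txt
printf 'rs33 naps 0.0045\nrs12 naps 0.01\n' > demo2.txt

# FNR resets per file, NR does not, FILENAME tracks the current file
awk '{ print FILENAME, FNR, NR, $1 }' demo1.txt demo2.txt
```

Comparing FNR against NR (or watching FILENAME change) is the usual way an awk script detects that it has moved on to the next input file.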

Last edited by Turbocapitalist; 08-04-2017 at 05:32 AM. Reason: $ARGV
 
1 members found this post helpful.
Old 08-04-2017, 06:07 AM   #3
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 17,420
Blog Entries: 10

Rep: Reputation: 5237
so first you need to extract the first word on each line, for each file.
easy with awk:
Code:
awk '{print $1}' file
then you need to compare your findings and find the one (or several) that occur(s) in every file.
not so easy; i'd use bash, because i'm familiar with it, but i guess it could be done with awk, too. or perl. or...
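One bash sketch of that second step (the sample files here are made up to match the thread's example): extract column 1 of each file, sort it, and repeatedly intersect with comm, keeping only the lines common to every file.

```shell
# Made-up sample files matching the example in the thread
printf 'rs12 band 0.003\nrs43 band 0.044\nrs67 band 0.011\n' > file1.txt
printf 'rs33 naps 0.0045\nrs98 naps 0.0004\nrs12 naps 0.01\n' > file2.txt
printf 'rs67 alle 0.003\nrs12 alle 0.002\nrs98 alle 0.00003\n' > file3.txt

# Seed the running intersection with the first file's column 1
awk '{print $1}' file1.txt | sort -u > common.txt
for f in file2.txt file3.txt; do
    awk '{print $1}' "$f" | sort -u > cur.txt
    comm -12 common.txt cur.txt > next.txt   # keep lines present in both
    mv next.txt common.txt
done
cat common.txt
```

comm -12 suppresses the lines unique to either input, so after looping over every file only the words present in all of them remain.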
 
1 members found this post helpful.
Old 08-04-2017, 06:43 AM   #4
dejank
Member
 
Registered: May 2016
Location: Belgrade, Serbia
Distribution: Debian
Posts: 229

Rep: Reputation: Disabled
Code:
awk '{print $1}' file1 file2 file3 | sort | uniq -d
You would learn more by doing your best to find the answer yourself than by picking it up like this. Just saying, with no ill intention at all.
 
2 members found this post helpful.
Old 08-04-2017, 07:02 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 19,780

Rep: Reputation: 3573
How does that ensure the output is in all the files, not just in more than one file?
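A small demonstration of the problem (with hypothetical files): rs99 appears in only two of the three files, yet uniq -d still prints it, because -d only requires a line to repeat somewhere in the combined input, not to appear in every file.

```shell
# Hypothetical files: rs12 is in all three, rs99 in only two
printf 'rs12 x 1\nrs99 x 2\n' > a.txt
printf 'rs12 y 3\nrs99 y 4\n' > b.txt
printf 'rs12 z 5\nrs55 z 6\n' > c.txt

# uniq -d prints any repeated line, so rs99 slips through
awk '{print $1}' a.txt b.txt c.txt | sort | uniq -d
```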
 
1 members found this post helpful.
Old 08-04-2017, 07:16 AM   #6
vinsweet
LQ Newbie
 
Registered: Aug 2017
Posts: 3

Original Poster
Rep: Reputation: Disabled
Thank you all for the suggestions. It seems to work fine.

First I made separate files by extracting the first columns using
Code:
awk '{print $1}' file
then I used
Code:
awk 'END { for (r in _) if (_[r] == ARGC - 1) print r }
{ _[$0]++ }' filename1 filename2 filename3 ... > output_common.txt
This is working perfectly fine. However, I am not doing it very efficiently, as I am using 2 steps.
 
Old 08-04-2017, 07:30 AM   #7
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 5,674
Blog Entries: 3

Rep: Reputation: 2906
You can do it in one step if you use only the first column as the key instead of the whole line.

Code:
awk '{ a[$1]++ } END { for (r in a ) if ( a[r] == ARGC - 1) print r }' file?.txt
However, if the key turns up more than once in the same file, it will throw the results off.
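A sketch of how duplicates skew the count (the files below are made up): rs43 is in both files but appears twice in dup1.txt, so its count ends up at 3 rather than ARGC - 1 (which is 2 here), and it is wrongly left out of the output.

```shell
# Made-up files: rs43 occurs twice in dup1.txt, once in dup2.txt
printf 'rs12 a 1\nrs43 a 2\nrs43 b 3\n' > dup1.txt
printf 'rs12 c 4\nrs43 c 5\n' > dup2.txt

# rs43 is in both files but its count (3) != ARGC - 1 (2), so only rs12 prints
awk '{ a[$1]++ } END { for (r in a) if (a[r] == ARGC - 1) print r }' dup1.txt dup2.txt
```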
 
3 members found this post helpful.
Old 08-04-2017, 07:40 AM   #8
vinsweet
LQ Newbie
 
Registered: Aug 2017
Posts: 3

Original Poster
Rep: Reputation: Disabled
[solved]

that was really cool!

Thanks.
 
1 members found this post helpful.
Old 08-04-2017, 08:00 AM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 19,780

Rep: Reputation: 3573
nm ...

Last edited by syg00; 08-05-2017 at 11:20 PM.
 
Old 08-04-2017, 11:46 PM   #10
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 5,674
Blog Entries: 3

Rep: Reputation: 2906
As an afterthought, one way to deal with possible duplicates within a single file is to use a second array that counts only within the current file, clearing it whenever a new file is started:

Code:
awk '{ 
       if ( f != FILENAME ) { delete c; f=FILENAME }; 
       if ( ! c[$1]++) { a[$1]++ } 
     } 
     END { for (r in a ) if ( a[r] == ARGC - 1) print r }' file?.txt
 
Old 08-05-2017, 07:52 PM   #11
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,513

Rep: Reputation: 1009
Deleted

Last edited by AwesomeMachine; 08-05-2017 at 07:54 PM.
 
Old 08-06-2017, 09:43 PM   #12
Sefyir
Member
 
Registered: Mar 2015
Distribution: Linux Mint
Posts: 633

Rep: Reputation: 316
Python3 Program

Python has an object called a set, and sets have a method called intersection.

What this does is take multiple sets and find the values that "intersect". That is, {1, 2, 3}, {3, 4, 5} and {3, 6, 12} will produce {3}.
Sets can only hold unique values: {1, 2, 3} is ok, but {1, 2, 2, 3} will result in {1, 2, 3}.
Does this sound like your problem? I wrote an example program that does what you're looking for. Most of it is collecting each file's first column into its own set. Then finally it runs set.intersection(other_sets) to produce the common strings (or words).

Code:
$ ./similar_words.py file*.txt
rs12
Code:
#!/usr/bin/env python3

import csv
import sys

list_unique_words = list()
for _file in sys.argv[1:]:
    with open(_file) as f:
        unique_words = {row[0] for row in csv.reader(f, delimiter=' ')}
        list_unique_words.append(unique_words)

source_set = list_unique_words[0]
other_sets = list_unique_words[1:]
for word in source_set.intersection(*other_sets):
    print(word)
EDIT:
This seemed useful to me, so I made it a little more generic.

Code:
./similar_words.py -h
usage: Parse files for similar columns [-h] [-c COLUMN] [-d DELIMITER]
                                       [-f FILE]

optional arguments:
  -h, --help            show this help message and exit
  -c COLUMN, --column COLUMN
  -d DELIMITER, --delimiter DELIMITER
  -f FILE, --file FILE
This does the default behavior:
Code:
./similar_words.py -d ' ' -c 1 file*.txt
rs12
... But you can now use --delimiter (-d) to change the delimiter and --column (-c) to change which column. You can also use --file (-f) to specify a file, but it's useless unless you use it more than once (e.g. -f file1.txt -f file2.txt).

Code:
#!/usr/bin/env python3

import csv
import sys
import argparse

parser = argparse.ArgumentParser('Parse files for similar columns')
parser.add_argument('-c', '--column',
        type=int,
        default=1,
        )
parser.add_argument('-d', '--delimiter',
        type=str,
        default=' ',
        )
parser.add_argument('-f', '--file',
        type=str,
        action='append',
        )
args, other_args = parser.parse_known_args()
files = args.file if args.file else other_args

list_unique_words = list()
for _file in files:
    with open(_file) as f:
        unique_words = {row[args.column - 1] 
                for row in csv.reader(f, delimiter=args.delimiter)}
        list_unique_words.append(unique_words)

source_set = list_unique_words[0]
other_sets = list_unique_words[1:]
for word in source_set.intersection(*other_sets):
    print(word)

Last edited by Sefyir; 08-09-2017 at 10:06 AM.
 