LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 04-27-2017, 08:32 PM   #1
bioinfo17
LQ Newbie
 
Registered: Apr 2017
Posts: 10

Rep: Reputation: Disabled
Question combine multiple files using a particular column but files can have different rows


Hi,

I have multiple files in the format below (3 shown as an example; I have ~90 files):

file1.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
bat 7 5 -6 1 0 1

file2.txt
name a b c d e f
apple 1 2 -3 4 5 4
ant 4 -46 5 2 6 2
bat 7 5 -6 1 0 1

file3.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
ant 6 4 -2 5 8 6

I would like to merge the files based on column c, but the files can have different rows, so print 0 where a row is not present in a file. To make it clear, the output should be:

results.txt file1 file2 file3
apple 2 2 2
cat 6 0 6
ant 0 -46 -2
bat 5 5 0

Preferably with an awk command, that would be great! Thanks
 
Old 04-27-2017, 09:17 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 19,777

Rep: Reputation: 3572
Quote:
Originally Posted by bioinfo17 View Post
to make it clear, the ouput should be:
Not very polite for a first post.

You make some effort, show us what you did, explain why it didn't suit, and maybe we can point you in the right direction.
 
Old 04-27-2017, 09:27 PM   #3
bioinfo17
LQ Newbie
 
Registered: Apr 2017
Posts: 10

Original Poster
Rep: Reputation: Disabled
Tried to be short and discreet; I didn't mean to be rude or impolite. I personally don't describe things very well, hence the comment "to make it clear, the output should be" was for others to understand what I was trying to ask.

I have tried using the following command:

Code:
awk '
BEGIN {
    f = 0
}
{
    if ($1 == "name") {
        f++; getline;
    }
    i = $1
    row[i] = row[i] "\t" (($3 == "") ? "0" : $3)
}
END {
    for (i in row) print i, row[i]
}' file1.txt file2.txt file3.txt

This worked to some extent, but it did not print "0" in the columns where the corresponding row was missing. Maybe it is not the right code to use. Any help from the experts out there would be great.
 
Old 04-27-2017, 09:42 PM   #4
Laserbeak
Member
 
Registered: Jan 2017
Location: Manhattan, NYC NY
Distribution: Mac OS X, iOS, Solaris
Posts: 508

Rep: Reputation: 143
I'm not sure if it'd be helpful here, but you might want to look into some of the database libraries out there... I don't really have much experience with them but there are some that even allow you to use SQL to query text files like these. Just a thought....

Otherwise I'd probably use perl to load the data into hash(es) then process them.
 
Old 04-27-2017, 09:57 PM   #5
bioinfo17
LQ Newbie
 
Registered: Apr 2017
Posts: 10

Original Poster
Rep: Reputation: Disabled
I'm a newbie in the programming field, so I'm not sure whether I would be able to follow SQL. Would you mind sharing the perl script, please? Thanks
 
Old 04-27-2017, 11:00 PM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 19,777

Rep: Reputation: 3572
Quote:
Originally Posted by bioinfo17 View Post
This worked to some extent but did not print "0" in the columns that didn't match to the corresponding rows.
The idea is ok, but you are testing the wrong thing - $3 only exists for records that are read in. You need to test records you didn't read in, which of course is nonsense.

Off the top of my head you would need an index for each file - even the filename would do. You need to accommodate fields that appear (only) in later file(s). I'll have to think about it after I do the yard. There are others here who will have better contributions.
 
Old 04-27-2017, 11:16 PM   #7
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=14, FreeBSD_12{.0|.1}
Posts: 5,650
Blog Entries: 11

Rep: Reputation: 3685
Quote:
Originally Posted by syg00 View Post
The idea is ok, but you are testing the wrong thing - $3 only exists for records that are read in. You need to test records you didn't read in, which of course is nonsense.

Off the top of my head you would need an index for each file - even the filename would do. You need to accommodate fields that appear (only) in later file(s). I'll have to think about it after I do the yard. There are others here who will have better contributions.
This is the essential thing that must be handled - you must test for things that do not exist in all files, and generate a value (zero is used in the OP example) for the non-existent cases.

Two ways to handle this come to mind:

* A multi-pass process which first collects all names then scans for the values
* An array based function which collects all the names into a single array, and the values for each file in separate arrays indexed to the name array - in a single pass

Multi-pass would be messy, but the array approach could be fairly easily done in most scripting languages.
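The single-pass, array-based approach described above can be sketched in awk (a sketch only, not from the thread; it recreates the sample files for illustration and reads $3, the field the first post's own attempt uses):

```shell
#!/bin/sh
# Recreate the sample files from the first post (illustrative data only).
printf 'name a b c d e f\napple 1 2 -3 4 5 4\ncat 4 6 5 2 6 2\nbat 7 5 -6 1 0 1\n' > file1.txt
printf 'name a b c d e f\napple 1 2 -3 4 5 4\nant 4 -46 5 2 6 2\nbat 7 5 -6 1 0 1\n' > file2.txt
printf 'name a b c d e f\napple 1 2 -3 4 5 4\ncat 4 6 5 2 6 2\nant 6 4 -2 5 8 6\n' > file3.txt

# One values array keyed by (file number, name), plus a list of names in
# first-seen order; missing (file, name) pairs become 0 when printing.
result=$(awk '
FNR == 1 { nf++; next }                  # header line: count the new file
{
    if (!($1 in seen)) { seen[$1] = 1; names[++n] = $1 }
    val[nf, $1] = $3                     # $3 is the field the first post reads
}
END {
    for (i = 1; i <= n; i++) {
        line = names[i]
        for (f = 1; f <= nf; f++)
            line = line " " ((f, names[i]) in val ? val[f, names[i]] : 0)
        print line
    }
}' file1.txt file2.txt file3.txt)
printf '%s\n' "$result"
```

With the thread's sample data this prints `apple 2 2 2`, `cat 6 0 6`, `bat 5 5 0`, `ant 0 -46 4`. Note the last ant value is 4, not the -2 shown in the first post: file3's $3 for ant is 4, and the expected -2 is that row's $4, an inconsistency in the sample.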

I have a bash script in mind which I will try to work out if no one beats me to it...
 
Old 04-28-2017, 12:34 AM   #8
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 7.7 (?), Centos 8.1
Posts: 17,873

Rep: Reputation: 2600
Also, @bioinfo17, please use CODE tags as per https://www.linuxquestions.org/quest...do=bbcode#code to post code or data.
It maintains formatting and makes things much clearer.
thank you.
 
Old 04-28-2017, 02:50 PM   #9
bioinfo17
LQ Newbie
 
Registered: Apr 2017
Posts: 10

Original Poster
Rep: Reputation: Disabled
@chrism01 - sure, will keep this in mind for next time, thank you

@syg00, @astrogeek - any solutions yet, please? I'm very keen to see what the bash script would look like; I tried my best but couldn't find a solution. Your help is much appreciated! Thanks
 
Old 04-28-2017, 03:38 PM   #10
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 5,661
Blog Entries: 3

Rep: Reputation: 2902
It's doable several ways in awk. Can you show what you have been trying?
 
Old 04-28-2017, 10:45 PM   #11
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 19,777

Rep: Reputation: 3572
Quote:
Originally Posted by bioinfo17 View Post
any solutions yet, please?
Be very careful not to appear to be demanding an answer immediately. People will respond when/if they feel like it, not necessarily when you want.
People are giving of their time and knowledge - accept what you get gracefully.

As you have chosen to simply concatenate the values into the row array, to handle this in one pass I would add code to your test for a new file. If you split() each row[] member it will return a count of the elements it created - this can be compared to "f", and if they are not equal, just concatenate a "0". This needs to be done for every row[] at each change of file, so the processing load will increase as the number of files/fields grows; no big deal, I imagine.
Note that similar code will be required in the END block to handle the last file.
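A sketch of the split()-and-pad idea described above, building on the first post's attempt (illustrative, not from the thread; one extra step is assumed here - a name that first appears in a later file also needs its earlier files zero-filled):

```shell
#!/bin/sh
# Sample files from the first post (illustrative data only).
printf 'name a b c d e f\napple 1 2 -3 4 5 4\ncat 4 6 5 2 6 2\nbat 7 5 -6 1 0 1\n' > file1.txt
printf 'name a b c d e f\napple 1 2 -3 4 5 4\nant 4 -46 5 2 6 2\nbat 7 5 -6 1 0 1\n' > file2.txt
printf 'name a b c d e f\napple 1 2 -3 4 5 4\ncat 4 6 5 2 6 2\nant 6 4 -2 5 8 6\n' > file3.txt

# At each new header line, pad every stored row whose split() count is
# short; the END block does the same for the last file.
result=$(awk '
$1 == "name" {                       # a header line marks a new file
    f++
    for (i in row)                   # pad names absent from earlier files
        while (split(row[i], parts, "\t") < f - 1)
            row[i] = row[i] "\t0"
    next
}
{
    if (!($1 in row)) {              # first appearance: zero-fill prior files
        row[$1] = ""
        for (j = 1; j < f; j++)
            row[$1] = row[$1] (j > 1 ? "\t" : "") "0"
    }
    row[$1] = (row[$1] == "" ? "" : row[$1] "\t") $3
}
END {
    for (i in row)                   # pad names absent from the last file
        while (split(row[i], parts, "\t") < f)
            row[i] = row[i] "\t0"
    for (i in row) { gsub(/\t/, " ", row[i]); print i, row[i] }
}' file1.txt file2.txt file3.txt | LC_ALL=C sort)
printf '%s\n' "$result"
```

The output is piped through sort only because awk's for-in order is unspecified; the values themselves match the single-pass approaches elsewhere in the thread.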
 
Old 04-28-2017, 11:09 PM   #12
bioinfo17
LQ Newbie
 
Registered: Apr 2017
Posts: 10

Original Poster
Rep: Reputation: Disabled
I do appreciate people using their valuable time to help me. @syg00 You might be an expert but please don't just discourage newbies like me from asking questions in this forum. Well, thanks for the explanation! I posted here thinking of getting a quick solution to my question and of course I have tried and searched similar forums to get a solution. PLEASE STOP JUDGING!!!! I have seen many other posts in this forum where newbies like me have got their answers quick enough without being judged. I have been trying to get a solution and hopefully I will get one soon.
 
Old 04-29-2017, 12:11 AM   #13
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 5,661
Blog Entries: 3

Rep: Reputation: 2902
Quote:
Originally Posted by syg00 View Post
As you have chosen to simply concatenate the values into the row array, to handle this in one pass I would add code to your test for new file...
bioinfo17, I'll add that the built-in variable for the name of the currently open file is FILENAME. You'll need your own variable to carry the name of the previous file; when the two differ, you've started a new file. This is true even the first time you do the comparison, as your variable will be empty until you fill it.

I'm fairly sure that any solution here will involve arrays, maybe even a two-dimensional array that you build in the main loop and then print in the END { } clause.

Code:
man awk
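The FILENAME comparison described above might look like this (a sketch, not from the thread; it keeps first-seen name order and reads $3 as in the first post's attempt):

```shell
#!/bin/sh
# Sample files from the first post (illustrative data only).
printf 'name a b c d e f\napple 1 2 -3 4 5 4\ncat 4 6 5 2 6 2\nbat 7 5 -6 1 0 1\n' > file1.txt
printf 'name a b c d e f\napple 1 2 -3 4 5 4\nant 4 -46 5 2 6 2\nbat 7 5 -6 1 0 1\n' > file2.txt
printf 'name a b c d e f\napple 1 2 -3 4 5 4\ncat 4 6 5 2 6 2\nant 6 4 -2 5 8 6\n' > file3.txt

# prev starts out empty, so the very first record also triggers the
# new-file branch, exactly as noted above.
result=$(awk '
FILENAME != prev { prev = FILENAME; f++ }
FNR == 1 { next }                        # skip each header line
{
    if (!($1 in order)) order[$1] = ++n  # remember first-seen position
    val[f, $1] = $3
}
END {
    for (name in order) {
        line = name
        for (j = 1; j <= f; j++)
            line = line " " ((j, name) in val ? val[j, name] : 0)
        out[order[name]] = line
    }
    for (k = 1; k <= n; k++) print out[k]
}' file1.txt file2.txt file3.txt)
printf '%s\n' "$result"
```

val[f, $1] is awk's pseudo-two-dimensional array (subscripts joined with SUBSEP), and `(j, name) in val` is the existence test that supplies the zero for missing rows.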

Last edited by Turbocapitalist; 04-29-2017 at 12:12 AM.
 
Old 04-29-2017, 01:02 AM   #14
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=14, FreeBSD_12{.0|.1}
Posts: 5,650
Blog Entries: 11

Rep: Reputation: 3685
Sorry that I have not had the opportunity to think about this today, and I am nearing the end of my day.

What I have in mind is something like this... (WARNING! Incomplete, not tested and not completely thought through!)

Code:
COL=0
declare -A namearray
for file in "$@"; do
        # A nameref (bash 4.3+) lets colarray0, colarray1, ... be filled
        # through one name; eval would be the older alternative
        declare -A "colarray${COL}"
        declare -n colarray="colarray${COL}"
        while read -ra line; do
                if [[ -z ${namearray[${line[0]}]+set} ]]; then  # name not in array
                        namearray[${line[0]}]="${line[0]}"
                fi
                colarray[${line[0]}]="${line[3]}"
        done < "$file"
        unset -n colarray
        ((COL++))
done

# After reading all files, iterate namearray (outer loop) and
# colarray0..colarray$((COL-1))[name] (inner loop); print value and a space
# with no newline, inserting zero for missing column values...
#
The COL variable is essentially the same as the per-file index mentioned in proposed awk solutions, which becomes the column index for the output values.

namearray simply guarantees that we iterate all names for every file (column) when we print.

Now that I have typed it out this far, I see there may be array scoping problems, and I find I use eval quite a bit when referencing complex bash associative arrays...

So, if anyone cares to extend this idea, feel free and call it your own!

Otherwise, I suggest you stay with awk unless you hit some unexpected roadblock.

Good luck!

Last edited by astrogeek; 04-29-2017 at 01:02 AM. Reason: typos
 
Old 04-29-2017, 01:22 AM   #15
bioinfo17
LQ Newbie
 
Registered: Apr 2017
Posts: 10

Original Poster
Rep: Reputation: Disabled
Thanks @Turbocapitalist and @Astrogeek. Much appreciated
 
  

