combine mutiple files using a particular column but files can have different rows
Linux - NewbieThis Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
A more efficient version that removes the repeated calls to sort in the second loop.
I'd still recommend awk, and would like to see bioinfo17's latest awk script.
But with the shell script, two changes would be to use tempfile to generate the names of the temp files for $tmp1, $tmp2, and $tmp3. Also, tail takes a -n option which will help with portability.
Thanks for the critique! There is always room for improvement.
Using 'tempfile' is best for a long-lived solution, but during development a defined name is easier to track.
Adding -n to 'tail' is good practice, but I have no qualms about posting code with syntax applicable to GNU coreutils in a forum titled 'LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie'.
My more immediate concern would be making it so the user can select the target column.
BEGIN { FS = "\t" }
FNR==1 { ++file }
{
a[$1,file] = $2 FS $3
++seen[$1]
}
END {
for (j in seen) {
split(j, b, SUBSEP)
s = b[1] FS b[2]
for (i=1; i<=file; ++i) {
s = s FS (j SUBSEP i in a ? a[j,i] : "NA" FS "NA")
}
print s
}
}
Interesting example. Here's how I would have approached it, though I'm not sure if either way is more efficient:
Code:
#!/usr/bin/awk -f
BEGIN {
# set Output Field Separator to a tab
OFS="\t";
}
# count each file as they are started
f != FILENAME {
f = FILENAME;
c++;
}
# save third column for each element for current file in 2D array
$1 && FNR > 1 {
a[$1][c]=$3;
}
# skip printing the record
{
next;
}
# print out saved data one row and column at a time
END {
for (k in a) {
printf("%s%s", k, OFS);
for (b=1; b <= c; b++) {
printf("%d%s", a[k][b], OFS);
}
printf "\n";
}
}
PS. Remember [code] [/code] tags!
Last edited by Turbocapitalist; 05-02-2017 at 01:31 AM.
The solution is not *simple*.
I went for the "one go". That means: collect all information that is needed, and print in an END section.
First, I need a two-dimensional array S[field1,filename] to store all the field3.
A one-dimensional array holding a string is not possible because you want the "holes" become a zero.
Then, because the filename does not need to be printed, I decided to go for a file number fn instead.
The array becomes S[field1,fn] this 1. saves some memory and 2. some awk versions (not GNU awk) would print a for(i in Array) in random order, while a for (i=1; i<=fn; i++) keeps the order.
Code:
awk '
# translate each FILENAME to a filenumber: increase fn when FILENAME changes
FILENAME!=pFN { pFN=FILENAME; fn++ }
# in the example there was a header line, work on all other lines
$1!="name" {
# store the field3 values in S
S[$1,fn]=$3
# the helper array F1 remembers all $1 that are met, this will allow to detect the missing values
F1[$1]
}
# all input files done
END {
# for all $1 that were met
for (i in F1) {
# preset output string
out=i
# go through all fn (filename numbers)
for (j=1; j<=fn; j++) {
# get the correct field3 from S if present else 0 and append to the output string
out=out OFS (((i,j) in S) ? S[i,j] : 0)
}
# print the output string
print out
}
}' file?.txt
I suggest you study this a dozen times. I put some comments.
Just seeing this is very similar to post #23.
Last edited by MadeInGermany; 05-02-2017 at 03:51 PM.
I've multiple files in the format below (shown 3 as an example, have ~90 files):
file1.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
bat 7 5 -6 1 0 1
file2.txt
name a b c d e f
apple 1 2 -3 4 5 4
ant 4 -46 5 2 6 2
bat 7 5 -6 1 0 1
file3.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
ant 6 4 -2 5 8 6
would like to merge files based on column c, but files can have different rows, hence print 0 if the same row is not present, to make it clear, the ouput should be:
results.txt file1 file2 file3
apple 2 2 2
cat 6 0 6
ant 0 -46 -2
bat 5 5 0
preferably with awk command would be great!! thanks
Clear as mud. Your "output should be" seems to be column B, not c. And omits the "name" item, unless that was informative and not actually in the file.
This seems like something better suited to a multi-pass process. If only for the first pass to identify every possible item. Possibly a 2nd pass to normalize the files so they have ALL items (sorted?). Then it would make sense to use awk, IMO. Anything less would be a debug and validation nightmare. Unless there's some sort of array / table / database / ??? that is in play and not yet mentioned. Implied that the output if not directly interacting with such a thing outputs a CSV to be imported after the fact. As in your specs are vague at best. And your example is wrong according to specs.
If you're just learning then this one may not be as clear as it is useful:
Code:
out=out OFS (((i,j) in S) ? S[i,j] : 0)
It's basically concatenating three variables. Because the variables are next to each other without commas, there will be no OFS in between. Since the first one is itself, it is in practice adding the last two variables to the first one. The OFS is a built-in variable containing the Output Field Separator. That is what goes between fields and, if you print with commas, variables on the way out. The last part with all the parenthesis is really an if-then statement written in a common shorthand. If x, then y, else z.
Code:
x ? y : z
As a whole, the end of the line produces either the contents of S[i,j] or a 0. But the ((i,j) in S) does something a bit different than I thought, so I'll leave that to others. Though it looks like a check to see if S[i,j] is defined. The loops provide the numbers for i and j.
Last edited by Turbocapitalist; 05-03-2017 at 12:08 AM.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.