Combine multiple files using a particular column, but files can have different rows
Hi,
I have multiple files in the format below (3 shown as an example; I have ~90 files):
file1.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
bat 7 5 -6 1 0 1
file2.txt
name a b c d e f
apple 1 2 -3 4 5 4
ant 4 -46 5 2 6 2
bat 7 5 -6 1 0 1
file3.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
ant 6 4 -2 5 8 6
I would like to merge the files based on column c, but the files can have different rows, hence print 0 if the same row is not present. To make it clear, the output should be:
results.txt
file1 file2 file3
apple 2 2 2
cat 6 0 6
ant 0 -46 -2
bat 5 5 0
A solution using awk would be preferred. Thanks!!
Tried to be short and discreet; didn't mean to be rude or impolite. I personally don't describe things very well, hence I made the comment "to make it clear, the output should be" for others to understand what I was trying to ask.
I have tried using the following command:
Code:
awk '
BEGIN {
    f = 0
}
{
    if ($1 == "name") {
        f++; getline
    }
    i = $1
    row[i] = row[i] "\t" (($3 == "") ? "0" : $3)
}
END {
    for (i in row) print i, row[i]
}' file1.txt file2.txt file3.txt
This worked to some extent, but it did not print "0" in the columns where the corresponding row was missing. Maybe it is not the right code to use. Any help from the experts out there would be great.
I'm not sure if it'd be helpful here, but you might want to look into some of the database libraries out there... I don't really have much experience with them but there are some that even allow you to use SQL to query text files like these. Just a thought....
Otherwise I'd probably use perl to load the data into hash(es) then process them.
This worked to some extent, but it did not print "0" in the columns where the corresponding row was missing.
The idea is OK, but you are testing the wrong thing - $3 only exists for records that are read in. You would need to test records you didn't read in, which of course is nonsense.
Off the top of my head, you would need an index for each file - even the filename would do. You also need to accommodate fields that appear (only) in later file(s). I'll have to think about it after I do the yard. There are others here who will have better contributions.
The idea is OK, but you are testing the wrong thing - $3 only exists for records that are read in. You would need to test records you didn't read in, which of course is nonsense.
Off the top of my head, you would need an index for each file - even the filename would do. You also need to accommodate fields that appear (only) in later file(s). I'll have to think about it after I do the yard. There are others here who will have better contributions.
This is the essential thing that must be handled - you must test for things that do not exist in all files, and generate a value (zero is used in the OP example) for the non-existent cases.
Two ways to handle this come to mind:
* A multi-pass process which first collects all names then scans for the values
* An array based function which collects all the names into a single array, and the values for each file in separate arrays indexed to the name array - in a single pass
Multi-pass would be messy (a rough sketch follows), but the array approach could be fairly easily done in most scripting languages.
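For what it's worth, here is an untested sketch of the multi-pass idea in awk. It leans on awk's handling of variable assignments placed between file names on the command line, so the same file list is passed twice: pass 1 collects the names, pass 2 collects the values. The file names and the use of $3 as the value column are taken from the thread; everything else is only a sketch.
Code:
awk '
$1 == "name" { if (pass == 2) f++; next }   # skip headers; count files on pass 2
pass == 1    { names[$1]++; next }          # pass 1: collect every name seen anywhere
             { val[f, $1] = $3 }            # pass 2: store the value per (file, name)
END {
    for (n in names) {
        line = n
        for (j = 1; j <= f; j++)            # emit 0 where a (file, name) pair is absent
            line = line "\t" (((j, n) in val) ? val[j, n] : 0)
        print line
    }
}' pass=1 file1.txt file2.txt file3.txt pass=2 file1.txt file2.txt file3.txt
The names come out of for (n in names) in no particular order, and the file1 file2 file3 header row is omitted; both are easy to add if needed.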
I have a bash script in mind which I will try to work out if no one beats me to it...
@chrism01 - sure, will keep this in mind for next time, thank you.
@syg00, @astrogeek - any solutions yet, please? Very keen to see what the bash script would look like; I tried my best but couldn't find a solution. Your help is much appreciated!!!! Thanks.
Be very careful not to appear to be demanding an answer immediately. People will respond when/if they feel like it - not necessarily when you want.
People are giving of their time and knowledge - accept what you get gracefully.
As you have chosen to simply concatenate the values into the row array, to handle this in one pass I would add code to your test for new file. If you split() each row[] member, it will return a count of the elements it created - this can be compared to "f", and if they are not equal, just concatenate the "0". This needs to be done for every row[] at each change of file, so the processing load will increase as the number of files/fields grows; no big deal, I imagine.
Note that similar code will be required in the END block to handle the last file.
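A rough, untested sketch along those lines (using next, rather than the getline in the original attempt, to skip the headers, and adding leading zeros for names that first appear in a later file):
Code:
awk '
function pad(    i, n, tmp) {
    # split() returns the number of fields it creates; because each row[]
    # string starts with a tab, subtract 1 for the empty leading field.
    # Any row holding fewer than f values was missing from a file just
    # read, so append a "0" for each missed file.
    for (i in row) {
        n = split(row[i], tmp, "\t") - 1
        while (n < f) { row[i] = row[i] "\t0"; n++ }
    }
}
$1 == "name" { pad(); f++; next }      # header line marks a change of file
{
    if (!($1 in row))                  # name first appears in a later file:
        for (j = 1; j < f; j++)        # give it a leading "0" per earlier file
            row[$1] = row[$1] "\t0"
    row[$1] = row[$1] "\t" $3
}
END {
    pad()                              # same padding for the last file
    for (i in row) print i row[i]
}' file1.txt file2.txt file3.txt
As with the original attempt, for (i in row) returns the names in no particular order, and the file1 file2 file3 header row is omitted; pipe the output through sort, or print a header in a BEGIN block, if needed.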
I do appreciate people using their valuable time to help me. @syg00 You might be an expert, but please don't just discourage newbies like me from asking questions in this forum. Well, thanks for the explanation! I posted here thinking I would get a quick solution to my question, and of course I have tried and searched similar forums for a solution. PLEASE STOP JUDGING!!!! I have seen many other posts in this forum where newbies like me got their answers quickly enough without being judged. I have been trying to get a solution and hopefully I will get one soon.
As you have chosen to simply concatenate the values into the row array, to handle this in one pass I would add code to your test for new file...
bioinfo17, I'll add that the built-in variable for the name of the currently open file is FILENAME. You'll need your own variable to carry the name of the previous file; when the two differ, you've started a new file. This holds even the first time you compare your variable against the first file, as your variable will be empty until you fill it.
I'm rather sure that any solution here will involve arrays, and maybe in some cases a two-dimensional array that you build in the main loop and then print in the END { } clause.
Code:
man awk
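For instance, a minimal (untested) sketch of that file-change test:
Code:
FILENAME != prev { f++; prev = FILENAME }   # fires on the first record of each file
Because prev is empty to begin with, the rule also fires for the first file, so f ends up as a 1-based per-file index of the kind discussed above.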
Sorry that I have not had the opportunity to think about this today, and I am nearing the end of my day.
What I have in mind is something like this... (WARNING! Incomplete, not tested, and not completely thought through!)
Code:
COL=0
declare -A namearray
for file in "$@"; do
    colarray="colarray${COL}"
    declare -A "${colarray}"
    while read -ra line; do
        [[ ${line[0]} == name ]] && continue          # skip the header row
        if [[ -z "${namearray[${line[0]}]}" ]]; then  # name not in array yet
            namearray[${line[0]}]="${line[0]}"
        fi
        # NOTE: this assigns to the literal array "colarray", not the
        # dynamically declared "colarray${COL}" - see the scoping caveat below
        colarray[${line[0]}]="${line[2]}"             # third field, i.e. $3 in the awk attempts
    done < "$file"
    ((COL++))
done
# After reading all files, iterate namearray (outer loop) and colarray${0..COL}[name] (inner loop);
# print value + space with no newline, inserting zero for missing column values...
The COL variable is essentially the same as the per-file index mentioned in the proposed awk solutions; it becomes the column index for the output values.
namearray simply guarantees that we iterate all names for every file (column) when we print.
Now that I have typed it out this far, I see there may be array scoping problems, and I find I use eval quite a bit when referencing complex bash associative arrays...
So, if anyone cares to extend this idea, feel free, and call it your own!
Otherwise, I suggest you stay with awk unless you hit some unexpected roadblock.
Good luck!