Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place! |
Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
|
|
|
04-29-2017, 05:19 AM
|
#16
|
LQ 5k Club
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,449
|
Just for fun to see if this could be done with 'join'.
Code:
#!/bin/sh
tmp1="tempfile1"
tmp2="tempfile2"
tmp3="tempfile3"
list="templist"
output="myoutput"
for f in *.txt; do
tail +2 "$f" | sort > "$tmp1"
if [ -f "$list" ]; then
sort "$list" | join -v2 -o 2.1 - "$tmp1" >> "$list"
else
join -o 1.1 "$tmp1" "$tmp1" > "$list"
fi
done
for f in *.txt; do
tail +2 "$f" | sort > "$tmp1"
if [ -f "$output" ]; then
sort "$list" | join -a1 -e0 -o 2.3 - "$tmp1" > "$tmp2"
paste -d " " "$output" "$tmp2" > "$tmp3"
mv "$tmp3" "$output"
else
sort "$list" | join -a1 -e0 -o "0 2.3" - "$tmp1" > "$output"
fi
done
rm "$tmp1" "$tmp2" "$list"
|
|
1 members found this post helpful.
|
04-29-2017, 05:33 AM
|
#17
|
LQ 5k Club
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,449
|
PS - I know it is awful and wasteful and far better for the OP to continue in awk.
|
|
|
04-30-2017, 02:26 PM
|
#18
|
LQ Newbie
Registered: Apr 2017
Posts: 10
Original Poster
Rep:
|
will give this a try, thanks allend
|
|
|
04-30-2017, 08:03 PM
|
#19
|
LQ 5k Club
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,449
|
A more efficient version that removes the repeated calls to sort in the second loop. Also adds the file names in a header.
Code:
#!/bin/sh
tmp1="tempfile1"
tmp2="tempfile2"
tmp3="tempfile3"
list="templist"
output="myoutput"
# Get a list of all rows apart from header row
for f in *.txt; do
tail +2 "$f" | sort > "$tmp1"
if [ -f "$list" ]; then
sort "$list" | join -v2 -o 2.1 - "$tmp1" >> "$list"
else
join -o 1.1 "$tmp1" "$tmp1" > "$list"
fi
done
# Save the sorted list
sort "$list" > "$tmp1"
mv "$tmp1" "$list"
# Start a header for the output file
header="Row"
# Extract column 3 from the files into output file
for f in *.txt; do
header="$header $f"
tail +2 "$f" | sort > "$tmp1"
if [ -f "$output" ]; then
join -a1 -e0 -o 2.3 "$list" "$tmp1" > "$tmp2"
paste -d " " "$output" "$tmp2" > "$tmp3"
mv "$tmp3" "$output"
else
join -a1 -e0 -o 0 2.3 "$list" "$tmp1" > "$output"
fi
done
# Add the header to the output file
echo "$header" > "$tmp1"
cat "$output" >> "$tmp1"
mv "$tmp1" "$output"
# Cleanup
rm "$tmp2" "$list"
|
|
1 members found this post helpful.
|
05-01-2017, 12:09 AM
|
#20
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,519
|
Quote:
Originally Posted by allend
A more efficient version that removes the repeated calls to sort in the second loop.
|
I'd still recommend awk, and would like to see bioinfo17's latest awk script.
But with the shell script, two changes would be to use tempfile to generate the names of the temp files for $tmp1, $tmp2, and $tmp3. Also, tail takes a -n option which will help with portability.
|
|
1 members found this post helpful.
|
05-01-2017, 08:03 AM
|
#21
|
LQ 5k Club
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,449
|
Thanks for the critique! There is always room for improvement.
Using 'tempfile' is best for a long-lived solution, but during development a defined name is easier to track.
Adding -n to 'tail' is good practice, but I have no qualms about posting code with syntax applicable to GNU coreutils in a forum titled 'LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie'.
My more immediate concern would be making it so the user can select the target column.
Snap on awk.
Last edited by allend; 05-01-2017 at 08:04 AM.
|
|
|
05-01-2017, 02:25 PM
|
#22
|
LQ Newbie
Registered: Apr 2017
Posts: 10
Original Poster
Rep:
|
Another similar code using awk (not mine):
BEGIN { FS = "\t" }
FNR==1 { ++file }
{
a[$1,file] = $2 FS $3
++seen[$1]
}
END {
for (j in seen) {
split(j, b, SUBSEP)
s = b[1] FS b[2]
for (i=1; i<=file; ++i) {
s = s FS (j SUBSEP i in a ? a[j,i] : "NA" FS "NA")
}
print s
}
}
Thanks everyone for your help.
|
|
|
05-02-2017, 12:29 AM
|
#23
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,519
|
Interesting example. Here's how I would have approached it, though I'm not sure if either way is more efficient:
Code:
#!/usr/bin/awk -f
BEGIN {
# set Output Field Separator to a tab
OFS="\t";
}
# count each file as they are started
f != FILENAME {
f = FILENAME;
c++;
}
# save third column for each element for current file in 2D array
$1 && FNR > 1 {
a[$1][c]=$3;
}
# skip printing the record
{
next;
}
# print out saved data one row and column at a time
END {
for (k in a) {
printf("%s%s", k, OFS);
for (b=1; b <= c; b++) {
printf("%d%s", a[k][b], OFS);
}
printf "\n";
}
}
PS. Remember [code] [/code] tags!
Last edited by Turbocapitalist; 05-02-2017 at 12:31 AM.
|
|
3 members found this post helpful.
|
05-02-2017, 09:13 AM
|
#24
|
LQ 5k Club
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,449
|
Very sweet!
I suggest small changes to avoid the addition of unnecessary OFS at the end of a line.
Code:
printf("%s", k);
for (b=1; b <= c; b++) {
printf("%s%d", OFS, a[k][b]);
|
|
3 members found this post helpful.
|
05-02-2017, 02:08 PM
|
#25
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,519
|
Quote:
Originally Posted by allend
I suggest small changes to avoid the addition of unnecessary OFS at the end of a line.
|
Good catch! Thanks. I also think that the other awk script has a more concise way of counting files:
Code:
# count each file as they are started
FNR==1 { ++c; }
|
|
|
05-02-2017, 02:44 PM
|
#26
|
Senior Member
Registered: Dec 2011
Location: Simplicity
Distribution: Mint/MATE
Posts: 2,928
|
The solution is not *simple*.
I went for the "one go". That means: collect all information that is needed, and print in an END section.
First, I need a two-dimensional array S[field1,filename] to store all the field3.
A one-dimensional array holding a string is not possible because you want the "holes" become a zero.
Then, because the filename does not need to be printed, I decided to go for a file number fn instead.
The array becomes S[field1,fn] this 1. saves some memory and 2. some awk versions (not GNU awk) would print a for(i in Array) in random order, while a for (i=1; i<=fn; i++) keeps the order.
Code:
awk '
# translate each FILENAME to a filenumber: increase fn when FILENAME changes
FILENAME!=pFN { pFN=FILENAME; fn++ }
# in the example there was a header line, work on all other lines
$1!="name" {
# store the field3 values in S
S[$1,fn]=$3
# the helper array F1 remembers all $1 that are met, this will allow to detect the missing values
F1[$1]
}
# all input files done
END {
# for all $1 that were met
for (i in F1) {
# preset output string
out=i
# go through all fn (filename numbers)
for (j=1; j<=fn; j++) {
# get the correct field3 from S if present else 0 and append to the output string
out=out OFS (((i,j) in S) ? S[i,j] : 0)
}
# print the output string
print out
}
}' file?.txt
I suggest you study this a dozen times. I put some comments.
Just seeing this is very similar to post #23.
Last edited by MadeInGermany; 05-02-2017 at 02:51 PM.
|
|
1 members found this post helpful.
|
05-02-2017, 02:54 PM
|
#27
|
LQ Newbie
Registered: Apr 2017
Posts: 10
Original Poster
Rep:
|
Many thanks Turbocapitalist and allend for the codes. The codes were very helpful
|
|
|
05-02-2017, 03:28 PM
|
#28
|
Senior Member
Registered: Feb 2003
Distribution: debian
Posts: 4,137
|
Quote:
Originally Posted by bioinfo17
Hi,
I've multiple files in the format below (shown 3 as an example, have ~90 files):
file1.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
bat 7 5 -6 1 0 1
file2.txt
name a b c d e f
apple 1 2 -3 4 5 4
ant 4 -46 5 2 6 2
bat 7 5 -6 1 0 1
file3.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
ant 6 4 -2 5 8 6
would like to merge files based on column c, but files can have different rows, hence print 0 if the same row is not present, to make it clear, the ouput should be:
results.txt file1 file2 file3
apple 2 2 2
cat 6 0 6
ant 0 -46 -2
bat 5 5 0
preferably with awk command would be great!! thanks
|
Clear as mud. Your "output should be" seems to be column B, not c. And omits the "name" item, unless that was informative and not actually in the file.
This seems like something better suited to a multi-pass process. If only for the first pass to identify every possible item. Possibly a 2nd pass to normalize the files so they have ALL items (sorted?). Then it would make sense to use awk, IMO. Anything less would be a debug and validation nightmare. Unless there's some sort of array / table / database / ??? that is in play and not yet mentioned. Implied that the output if not directly interacting with such a thing outputs a CSV to be imported after the fact. As in your specs are vague at best. And your example is wrong according to specs.
|
|
1 members found this post helpful.
|
05-02-2017, 04:19 PM
|
#29
|
LQ Newbie
Registered: Apr 2017
Posts: 10
Original Poster
Rep:
|
code by MadeInGermany worked wonders - thanks heaps. I need to sit down and learn each step ("steep" learning curve for me).
|
|
|
05-02-2017, 11:03 PM
|
#30
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,519
|
If you're just learning then this one may not be as clear as it is useful:
Code:
out=out OFS (((i,j) in S) ? S[i,j] : 0)
It's basically concatenating three variables. Because the variables are next to each other without commas, there will be no OFS in between. Since the first one is itself, it is in practice adding the last two variables to the first one. The OFS is a built-in variable containing the Output Field Separator. That is what goes between fields and, if you print with commas, variables on the way out. The last part with all the parenthesis is really an if-then statement written in a common shorthand. If x, then y, else z.
As a whole, the end of the line produces either the contents of S[i,j] or a 0. But the ((i,j) in S) does something a bit different than I thought, so I'll leave that to others. Though it looks like a check to see if S[i,j] is defined. The loops provide the numbers for i and j.
Last edited by Turbocapitalist; 05-02-2017 at 11:08 PM.
|
|
1 members found this post helpful.
|
All times are GMT -5. The time now is 12:40 AM.
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|