Combine multiple files using a particular column, but files can have different rows

allend · 04-29-2017, 05:19 AM

Just for fun to see if this could be done with 'join'.

Code:

#!/bin/sh

tmp1="tempfile1"
tmp2="tempfile2"
tmp3="tempfile3"
list="templist"
output="myoutput"

for f in *.txt; do
  tail +2 "$f" | sort > "$tmp1"
  if [ -f "$list" ]; then
    sort "$list" | join -v2 -o 2.1 - "$tmp1" >> "$list"
  else
    join -o 1.1 "$tmp1" "$tmp1" > "$list"
  fi
done

for f in *.txt; do
  tail +2 "$f" | sort > "$tmp1"
  if [ -f "$output" ]; then
    sort "$list" | join -a1 -e0 -o 2.3 - "$tmp1" > "$tmp2"
    paste -d " " "$output" "$tmp2" > "$tmp3"
    mv "$tmp3" "$output"
  else
    sort "$list" | join -a1 -e0 -o "0 2.3" - "$tmp1" > "$output"
  fi
done

rm "$tmp1" "$tmp2" "$list"

allend · 04-29-2017, 05:33 AM

PS - I know it is awful and wasteful and far better for the OP to continue in awk.

bioinfo17 · 04-30-2017, 02:26 PM

will give this a try, thanks allend

allend · 04-30-2017, 08:03 PM

A more efficient version that removes the repeated calls to sort in the second loop. Also adds the file names in a header.

Code:

#!/bin/sh

tmp1="tempfile1"
tmp2="tempfile2"
tmp3="tempfile3"
list="templist"
output="myoutput"

# Get a list of all rows apart from header row
for f in *.txt; do
  tail +2 "$f" | sort > "$tmp1"
  if [ -f "$list" ]; then
    sort "$list" | join -v2 -o 2.1 - "$tmp1" >> "$list"
  else
    join -o 1.1 "$tmp1" "$tmp1" > "$list"
  fi
done

# Save the sorted list
sort "$list" > "$tmp1"
mv "$tmp1" "$list"

# Start a header for the output file
header="Row"

# Extract column 3 from the files into output file
for f in *.txt; do
  header="$header $f"
  tail +2 "$f" | sort > "$tmp1"
  if [ -f "$output" ]; then
    join -a1 -e0 -o 2.3 "$list" "$tmp1" > "$tmp2"
    paste -d " " "$output" "$tmp2" > "$tmp3"
    mv "$tmp3" "$output"
  else
    join -a1 -e0 -o "0 2.3" "$list" "$tmp1" > "$output"
  fi
done

# Add the header to the output file
echo "$header" > "$tmp1"
cat "$output" >> "$tmp1"
mv "$tmp1" "$output"

# Cleanup
rm "$tmp2" "$list"

Turbocapitalist · 05-01-2017, 12:09 AM

Quote:

Originally Posted by allend

A more efficient version that removes the repeated calls to sort in the second loop.

I'd still recommend awk, and would like to see bioinfo17's latest awk script.

But with the shell script, two changes would be to use tempfile to generate the names of the temp files for $tmp1, $tmp2, and $tmp3. Also, tail takes a -n option which will help with portability.
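A minimal sketch of those two changes, assuming mktemp is available. It is swapped in here for tempfile, which is not installed everywhere, and demo.txt is a made-up input file:

```shell
#!/bin/sh
# Sketch only: mktemp stands in for tempfile (a substitution, not what the
# script above uses), and demo.txt is a made-up input file.
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT

tmp1="$workdir/tmp1"
printf 'name a b\ncat 4 6\napple 1 2\n' > "$workdir/demo.txt"

# tail -n +2 starts output at line 2, i.e. it skips the header row
tail -n +2 "$workdir/demo.txt" | sort > "$tmp1"
cat "$tmp1"
```

Putting all temp files under one scratch directory means a single trap can clean up, even if the script stops early.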

allend · 05-01-2017, 08:03 AM

Thanks for the critique! There is always room for improvement.
Using 'tempfile' is best for a long-lived solution, but during development a defined name is easier to track.
Adding -n to 'tail' is good practice, but I have no qualms about posting code with syntax applicable to GNU coreutils in a forum titled 'LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie'.

My more immediate concern would be making it so the user can select the target column.
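A rough sketch of one way to do that with join, assuming the column number arrives as the first script argument. The name col and the two demo files are made up:

```shell
#!/bin/sh
# Sketch only: "col" and the two demo files are made up for illustration.
col=${1:-3}                     # target column, defaulting to 3 as in the script

dir=$(mktemp -d) || exit 1
printf 'ant\napple\nbat\n' > "$dir/list"            # sorted row names
printf 'apple 1 2 -3\nbat 7 5 -6\n' > "$dir/data"   # sorted rows, header removed

# The field list is passed to -o as one quoted argument, so the column
# number can be spliced into it
result=$(join -a1 -e0 -o "0 2.$col" "$dir/list" "$dir/data")
echo "$result"
rm -rf "$dir"
```

With the default column this should print one line per name, with 0 for ant because it is missing from the data file.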

Snap on awk.

bioinfo17 · 05-01-2017, 02:25 PM

Another similar script using awk (not mine):

Code:

BEGIN { FS = "\t" }
FNR==1 { ++file }
{
    a[$1,file] = $2 FS $3
    ++seen[$1]
}
END {
    for (j in seen) {
        split(j, b, SUBSEP)
        s = b[1] FS b[2]
        for (i=1; i<=file; ++i) {
            s = s FS (j SUBSEP i in a ? a[j,i] : "NA" FS "NA")
        }
        print s
    }
}

Thanks everyone for your help.

Turbocapitalist · 05-02-2017, 12:29 AM

Interesting example. Here's how I would have approached it, though I'm not sure if either way is more efficient:

Code:

#!/usr/bin/awk -f

BEGIN {
    # set Output Field Separator to a tab
    OFS="\t";
}

# count each file as they are started
f != FILENAME {
    f = FILENAME;
    c++;
}

# save third column for each element for current file in 2D array
# (the a[x][y] arrays-of-arrays syntax needs gawk 4.0 or later)
$1 && FNR > 1 {
    a[$1][c]=$3;
}

# skip printing the record
{
    next;
}

# print out saved data one row and column at a time
END {
    for (k in a) {
        printf("%s%s", k, OFS);
        for (b=1; b <= c; b++) {
            printf("%d%s", a[k][b], OFS);
        }
        printf "\n";
    }
}

PS. Remember [code] [/code] tags!

allend · 05-02-2017, 09:13 AM

Very sweet!
I suggest small changes to avoid the addition of unnecessary OFS at the end of a line.

Code:

        printf("%s", k);
        for (b=1; b <= c; b++) {
            printf("%s%d", OFS, a[k][b]);

Turbocapitalist · 05-02-2017, 02:08 PM

Quote:

Originally Posted by allend

I suggest small changes to avoid the addition of unnecessary OFS at the end of a line.

Good catch! Thanks. I also think that the other awk script has a more concise way of counting files:

Code:

# count each file as they are started
FNR==1 { ++c; }

MadeInGermany · 05-02-2017, 02:44 PM

The solution is not *simple*.
I went for the "one go". That means: collect all the information that is needed, and print it in an END section.
First, I need a two-dimensional array S[field1,filename] to store all the field3 values.
A one-dimensional array holding a string is not possible because you want the "holes" to become a zero.
Then, because the filename does not need to be printed, I decided to go for a file number fn instead.
The array becomes S[field1,fn]. This 1. saves some memory and 2. keeps the file order: awk visits the elements of a for (i in Array) in an unspecified order, while a for (i=1; i<=fn; i++) keeps the order.

Code:

awk '
# translate each FILENAME to a filenumber: increase fn when FILENAME changes
FILENAME!=pFN { pFN=FILENAME; fn++ }
# in the example there was a header line, work on all other lines
$1!="name" {
# store the field3 values in S
  S[$1,fn]=$3
# the helper array F1 remembers all $1 that are met, this will allow to detect the missing values
  F1[$1]
}
# all input files done
END {
# for all $1 that were met
  for (i in F1) {
# preset output string
    out=i
# go through all fn (filename numbers)
    for (j=1; j<=fn; j++) {
# get the correct field3 from S if present else 0 and append to the output string
      out=out OFS (((i,j) in S) ? S[i,j] : 0)
    }
# print the output string
    print out
  }
}' file?.txt

I suggest you study this a dozen times. I added some comments.
I just noticed that this is very similar to post #23.
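To try it out, here is a sketch that recreates the three sample files from the first post and runs the same logic on them. The scratch directory, the merged variable and the final sort are additions for the demo, not part of the script above:

```shell
#!/bin/sh
# Sketch only: the scratch directory, "merged" and the trailing sort are
# additions for this demo.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

printf '%s\n' 'name a b c d e f' 'apple 1 2 -3 4 5 4' 'cat 4 6 5 2 6 2' 'bat 7 5 -6 1 0 1' > file1.txt
printf '%s\n' 'name a b c d e f' 'apple 1 2 -3 4 5 4' 'ant 4 -46 5 2 6 2' 'bat 7 5 -6 1 0 1' > file2.txt
printf '%s\n' 'name a b c d e f' 'apple 1 2 -3 4 5 4' 'cat 4 6 5 2 6 2' 'ant 6 4 -2 5 8 6' > file3.txt

# Same logic as the commented script; sort only makes the row order repeatable
merged=$(awk '
FILENAME!=pFN { pFN=FILENAME; fn++ }
$1!="name" { S[$1,fn]=$3; F1[$1] }
END {
  for (i in F1) {
    out=i
    for (j=1; j<=fn; j++) out=out OFS (((i,j) in S) ? S[i,j] : 0)
    print out
  }
}' file?.txt | sort)
echo "$merged"

cd / && rm -rf "$dir"
```

Since $3 is the column headed b, the ant row gets 4 for file3 here, where the listing in the first post shows -2.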

bioinfo17 · 05-02-2017, 02:54 PM

Many thanks, Turbocapitalist and allend, for the code. It was very helpful.

Shadow_7 · 05-02-2017, 03:28 PM

Quote:

Originally Posted by bioinfo17

Hi,

I've multiple files in the format below (shown 3 as an example, have ~90 files):

file1.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
bat 7 5 -6 1 0 1

file2.txt
name a b c d e f
apple 1 2 -3 4 5 4
ant 4 -46 5 2 6 2
bat 7 5 -6 1 0 1

file3.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
ant 6 4 -2 5 8 6

I would like to merge the files based on column c, but files can have different rows, so print 0 if a row is not present. To make it clear, the output should be:

results.txt file1 file2 file3
apple 2 2 2
cat 6 0 6
ant 0 -46 -2
bat 5 5 0

preferably with awk command would be great!! thanks

Clear as mud. Your "output should be" seems to be column B, not c. And omits the "name" item, unless that was informative and not actually in the file.

This seems like something better suited to a multi-pass process, if only for a first pass to identify every possible item. Possibly a 2nd pass to normalize the files so they have ALL items (sorted?). Then it would make sense to use awk, IMO. Anything less would be a debug and validation nightmare, unless there's some sort of array / table / database / ??? in play that has not been mentioned yet. I'm assuming that the output, if it does not feed such a thing directly, is a CSV to be imported after the fact. In short, your specs are vague at best, and your example is wrong according to the specs.

bioinfo17 · 05-02-2017, 04:19 PM

code by MadeInGermany worked wonders - thanks heaps. I need to sit down and learn each step ("steep" learning curve for me).

Turbocapitalist · 05-02-2017, 11:03 PM

If you're just learning then this one may not be as clear as it is useful:

Code:

out=out OFS (((i,j) in S) ? S[i,j] : 0)

It's basically concatenating three values. Because they sit next to each other without commas, awk joins them directly and adds no separator of its own, which is why OFS is written out explicitly. Since the first one is out itself, in practice the line appends the last two values to out. OFS is a built-in variable containing the Output Field Separator. That is what goes between fields and, if you print with commas, between the printed values. The last part with all the parentheses is really an if-then-else statement written in a common shorthand. If x, then y, else z.

Code:

x ? y : z
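A tiny sketch of both points, with made-up values: adjacent values are joined with nothing in between, commas in print put OFS between the values, and ?: picks one of two values.

```shell
#!/bin/sh
# Sketch only: the values are made up for illustration.
result=$(awk 'BEGIN {
  OFS = "-"
  x = "a"; y = "b"; n = 0
  print x y                   # adjacent values: joined with nothing in between
  print x, y                  # commas in print: OFS goes between the values
  print x OFS y               # writing OFS out by hand gives the same line
  print (n ? "yes" : "no")    # n is 0, i.e. false, so the else part is used
}')
echo "$result"
```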

As a whole, the end of the line produces either the contents of S[i,j] or a 0. But the ((i,j) in S) does something a bit different than I thought, so I'll leave that to others, though it looks like a check to see if S[i,j] is defined. The loops provide the values for i (a name from $1) and j (a file number).
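A minimal sketch of that check, assuming it behaves like awk's usual in membership test: (i,j) in S is true only when S already has an element with that combined index, and the test itself does not create one, whereas simply reading S[i,j] does. The array contents are made up.

```shell
#!/bin/sh
# Sketch only: the array contents are made up for illustration.
result=$(awk 'BEGIN {
  S["cat", 1] = 6
  # the in test looks the index up without creating it
  before = (("cat", 2) in S) ? "present" : "absent"
  # merely reading a missing element creates it as an empty entry...
  x = S["cat", 2]
  # ...so the same test now succeeds
  after = (("cat", 2) in S) ? "present" : "absent"
  print before, after
}')
echo "$result"
```

That would be why the script tests with in before reading S[i,j]: the missing combinations stay missing and can be printed as 0.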