LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 04-29-2017, 05:19 AM   #16
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,367

Rep: Reputation: 2748

Just for fun to see if this could be done with 'join'.
Code:
#!/bin/sh

tmp1="tempfile1"
tmp2="tempfile2"
tmp3="tempfile3"
list="templist"
output="myoutput"

for f in *.txt; do
  tail +2 "$f" | sort > "$tmp1"
  if [ -f "$list" ]; then
    sort "$list" | join -v2 -o 2.1 - "$tmp1" >> "$list"
  else
    join -o 1.1 "$tmp1" "$tmp1" > "$list"
  fi
done

for f in *.txt; do
  tail +2 "$f" | sort > "$tmp1"
  if [ -f "$output" ]; then
    sort "$list" | join -a1 -e0 -o 2.3 - "$tmp1" > "$tmp2"
    paste -d " " "$output" "$tmp2" > "$tmp3"
    mv "$tmp3" "$output"
  else
    sort "$list" | join -a1 -e0 -o "0 2.3" - "$tmp1" > "$output"
  fi
done

rm "$tmp1" "$tmp2" "$list"
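
The join options doing the work in the second loop can be tried in isolation. The following is only a minimal sketch with made-up sample data (the file names keys and data are assumptions, not part of the script above); it shows how -a1, -e0 and -o combine to print a 0 for rows that are missing from one file.

```shell
#!/bin/sh
# Hypothetical sample data, not taken from the thread.
LC_ALL=C; export LC_ALL    # sort and join must agree on the collation order
dir=$(mktemp -d) || exit 1

printf '%s\n' ant apple bat cat > "$dir/keys"             # sorted list of every row name
printf '%s\n' 'apple 1 2 -3' 'bat 7 5 -6' > "$dir/data"   # sorted rows from one input file

# -a1         also print lines of file 1 (keys) that have no partner in file 2
# -e0         replace fields that are missing with 0
# -o "0 2.3"  print the join field, then field 3 of file 2
joined=$(join -a1 -e0 -o "0 2.3" "$dir/keys" "$dir/data")
printf '%s\n' "$joined"

rm -r "$dir"
```

Rows present only in keys (ant and cat here) come out with 0 in the second column, which is the effect the script relies on.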
 
1 members found this post helpful.
Old 04-29-2017, 05:33 AM   #17
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,367

Rep: Reputation: 2748
PS - I know it is awful and wasteful and far better for the OP to continue in awk.
 
Old 04-30-2017, 02:26 PM   #18
bioinfo17
LQ Newbie
 
Registered: Apr 2017
Posts: 10

Original Poster
Rep: Reputation: Disabled
will give this a try, thanks allend
 
Old 04-30-2017, 08:03 PM   #19
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,367

Rep: Reputation: 2748
A more efficient version that removes the repeated calls to sort in the second loop. Also adds the file names in a header.
Code:
#!/bin/sh

tmp1="tempfile1"
tmp2="tempfile2"
tmp3="tempfile3"
list="templist"
output="myoutput"

# Get a list of all rows apart from header row
for f in *.txt; do
  tail +2 "$f" | sort > "$tmp1"
  if [ -f "$list" ]; then
    sort "$list" | join -v2 -o 2.1 - "$tmp1" >> "$list"
  else
    join -o 1.1 "$tmp1" "$tmp1" > "$list"
  fi
done

# Save the sorted list
sort "$list" > "$tmp1"
mv "$tmp1" "$list"

# Start a header for the output file
header="Row"

# Extract column 3 from the files into output file
for f in *.txt; do
  header="$header $f"
  tail +2 "$f" | sort > "$tmp1"
  if [ -f "$output" ]; then
    join -a1 -e0 -o 2.3 "$list" "$tmp1" > "$tmp2"
    paste -d " " "$output" "$tmp2" > "$tmp3"
    mv "$tmp3" "$output"
  else
    join -a1 -e0 -o "0 2.3" "$list" "$tmp1" > "$output"
  fi
done

# Add the header to the output file
echo "$header" > "$tmp1"
cat "$output" >> "$tmp1"
mv "$tmp1" "$output"

# Cleanup
rm "$tmp2" "$list"
 
1 members found this post helpful.
Old 05-01-2017, 12:09 AM   #20
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,292
Blog Entries: 3

Rep: Reputation: 3718
Quote:
Originally Posted by allend View Post
A more efficient version that removes the repeated calls to sort in the second loop.
I'd still recommend awk, and would like to see bioinfo17's latest awk script.

For the shell script, I would suggest two changes: use tempfile to generate the names of the temp files for $tmp1, $tmp2, and $tmp3, and give tail its -n option (tail -n +2), which will help with portability.
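
Both suggestions can be sketched together as follows. Note the assumptions: mktemp is used here in place of tempfile because it is available on more systems (that swap is not from the thread), and the sample input is invented for the demonstration.

```shell
#!/bin/sh
# Sketch only: mktemp stands in for tempfile (an assumption, see the note above).
tmp1=$(mktemp) || exit 1
input=$(mktemp) || exit 1

# Remove the temporary files when the script exits
trap 'rm -f "$tmp1" "$input"' EXIT

# Hypothetical sample file with a header row
printf '%s\n' 'name a b c' 'cat 4 6 5' 'apple 1 2 -3' > "$input"

# tail -n +2 starts the output at line 2, i.e. it skips the header row
tail -n +2 "$input" | sort > "$tmp1"
body=$(cat "$tmp1")
printf '%s\n' "$body"
```

Generated names avoid clobbering an unrelated file called tempfile1 in the working directory, and the trap keeps cleanup in one place.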
 
1 members found this post helpful.
Old 05-01-2017, 08:03 AM   #21
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,367

Rep: Reputation: 2748
Thanks for the critique! There is always room for improvement.
Using 'tempfile' is best for a long-lived solution, but during development a defined name is easier to track.
Adding -n to 'tail' is good practice, but I have no qualms about posting code with syntax applicable to GNU coreutils in a forum titled 'LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie'.

My more immediate concern would be making it so the user can select the target column.

Snap on awk.
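
One possible way to let the user choose the target column is to pass the column number into awk with -v. This is a rough sketch, not code from the thread; pick_column and the sample file are hypothetical names made up for the example.

```shell
#!/bin/sh
# pick_column COL FILE...   (hypothetical helper name)
# Prints "name value" for every non-header row, taking value from column COL.
pick_column() {
    col=$1
    shift
    awk -v col="$col" 'FNR > 1 { print $1, $col }' "$@"
}

dir=$(mktemp -d) || exit 1
printf '%s\n' 'name a b c' 'apple 1 2 -3' 'cat 4 6 5' > "$dir/file1.txt"

third=$(pick_column 3 "$dir/file1.txt")    # column headed "b"
fourth=$(pick_column 4 "$dir/file1.txt")   # column headed "c"
printf '%s\n' "$third" "$fourth"

rm -r "$dir"
```

In awk, $col means "the field whose number is stored in col", so the same program serves any column.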

Last edited by allend; 05-01-2017 at 08:04 AM.
 
Old 05-01-2017, 02:25 PM   #22
bioinfo17
LQ Newbie
 
Registered: Apr 2017
Posts: 10

Original Poster
Rep: Reputation: Disabled
Another, similar piece of awk code (not mine):

BEGIN { FS = "\t" }
FNR==1 { ++file }
{
    a[$1,file] = $2 FS $3
    ++seen[$1]
}
END {
    for (j in seen) {
        split(j, b, SUBSEP)
        s = b[1] FS b[2]
        for (i=1; i<=file; ++i) {
            s = s FS (j SUBSEP i in a ? a[j,i] : "NA" FS "NA")
        }
        print s
    }
}

Thanks everyone for your help.
 
Old 05-02-2017, 12:29 AM   #23
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,292
Blog Entries: 3

Rep: Reputation: 3718
Interesting example. Here's how I would have approached it, though I'm not sure if either way is more efficient:

Code:
#!/usr/bin/awk -f

BEGIN {
    # set Output Field Separator to a tab
    OFS="\t";
}

# count each file as they are started
f != FILENAME {
    f = FILENAME;
    c++;
}

# save third column for each element for current file in 2D array
# (a[$1][c] is a GNU awk array of arrays, so this needs gawk 4.0 or later)
$1 && FNR > 1 {
    a[$1][c]=$3;
}

# skip printing the record
{
    next;
}

# print out saved data one row and column at a time
END {
    for (k in a) {
        printf("%s%s", k, OFS);
        for (b=1; b <= c; b++) {
            printf("%d%s", a[k][b], OFS);
        }
        printf "\n";
    }
}
PS. Remember [code] [/code] tags!

Last edited by Turbocapitalist; 05-02-2017 at 12:31 AM.
 
3 members found this post helpful.
Old 05-02-2017, 09:13 AM   #24
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,367

Rep: Reputation: 2748
Very sweet!
I suggest small changes to avoid the addition of unnecessary OFS at the end of a line.
Code:
        printf("%s", k);
        for (b=1; b <= c; b++) {
            printf("%s%d", OFS, a[k][b]);
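
To see the effect of putting the separator in front of each value, here is a self-contained sketch. It is not the script from post #23: it uses a[k, b] plus a helper array (names) instead of gawk's a[k][b] so that it also runs on other awks, and the two sample files are invented.

```shell
#!/bin/sh
dir=$(mktemp -d) || exit 1
printf '%s\n' 'name a b' 'apple 1 2' 'cat 4 6' > "$dir/file1.txt"
printf '%s\n' 'name a b' 'apple 1 3'           > "$dir/file2.txt"

table=$(awk '
BEGIN { OFS = "\t" }
FNR == 1 { c++; next }                 # new file: bump the counter, skip its header
$1 != "" { a[$1, c] = $3; names[$1] }  # store column 3 and remember the row name
END {
    for (k in names) {
        printf("%s", k)                     # row name first, no separator yet
        for (b = 1; b <= c; b++)
            printf("%s%d", OFS, a[k, b])    # separator goes before each value
        printf("\n")
    }
}' "$dir/file1.txt" "$dir/file2.txt" | sort)
printf '%s\n' "$table"

rm -r "$dir"
```

Each line now ends with its last value rather than a stray tab, and a missing value still prints as 0 because %d formats an unset element as 0.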
 
3 members found this post helpful.
Old 05-02-2017, 02:08 PM   #25
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,292
Blog Entries: 3

Rep: Reputation: 3718
Quote:
Originally Posted by allend View Post
I suggest small changes to avoid the addition of unnecessary OFS at the end of a line.
Good catch! Thanks. I also think that the other awk script has a more concise way of counting files:

Code:
# count each file as they are started
FNR==1 { ++c; }
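
A quick way to convince yourself that this counts files: FNR restarts at 1 for every input file, while NR keeps counting across all of them. The sketch below uses two invented sample files. One caveat worth knowing is that an empty file contributes no records, so it is never counted.

```shell
#!/bin/sh
dir=$(mktemp -d) || exit 1
printf '%s\n' 'name a' 'apple 1'       > "$dir/file1.txt"   # 2 lines
printf '%s\n' 'name a' 'cat 4' 'bat 7' > "$dir/file2.txt"   # 3 lines

# c goes up once per file; NR is the total number of lines read
counts=$(awk 'FNR == 1 { ++c } END { print c, NR }' "$dir/file1.txt" "$dir/file2.txt")
printf '%s\n' "$counts"

rm -r "$dir"
```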
 
Old 05-02-2017, 02:44 PM   #26
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,780

Rep: Reputation: 1198
The solution is not *simple*.
I went for the "one go". That means: collect all information that is needed, and print in an END section.
First, I need a two-dimensional array S[field1,filename] to store all the field3 values.
A one-dimensional array holding a string is not possible because you want the "holes" to become a zero.
Then, because the filename does not need to be printed, I decided to go for a file number fn instead.
The array becomes S[field1,fn]. This (1) saves some memory and (2) keeps the columns in file order: awk (GNU awk included, by default) makes no promise about the order in which for (i in Array) visits the elements, while for (i=1; i<=fn; i++) keeps the order.
Code:
awk '
# translate each FILENAME to a filenumber: increase fn when FILENAME changes
FILENAME!=pFN { pFN=FILENAME; fn++ }
# in the example there was a header line, work on all other lines
$1!="name" {
# store the field3 values in S
  S[$1,fn]=$3
# the helper array F1 remembers all $1 that are met, this will allow to detect the missing values
  F1[$1]
}
# all input files done
END {
# for all $1 that were met
  for (i in F1) {
# preset output string
    out=i
# go through all fn (filename numbers)
    for (j=1; j<=fn; j++) {
# get the correct field3 from S if present else 0 and append to the output string
      out=out OFS (((i,j) in S) ? S[i,j] : 0)
    }
# print the output string
    print out
  }
}' file?.txt
I suggest you study this a dozen times. I have added some comments.
I have just noticed that this is very similar to post #23.
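
For anyone who wants to try it, the following sketch runs a condensed copy of the script above on the three sample files from the original question (quoted in post #28). The temporary directory and the final sort are additions for the demonstration; sort is there only because the order of for (i in F1) is not fixed.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL
dir=$(mktemp -d) || exit 1

cat > "$dir/file1.txt" <<'EOF'
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
bat 7 5 -6 1 0 1
EOF
cat > "$dir/file2.txt" <<'EOF'
name a b c d e f
apple 1 2 -3 4 5 4
ant 4 -46 5 2 6 2
bat 7 5 -6 1 0 1
EOF
cat > "$dir/file3.txt" <<'EOF'
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
ant 6 4 -2 5 8 6
EOF

merged=$(awk '
FILENAME != pFN { pFN = FILENAME; fn++ }     # count the input files
$1 != "name"    { S[$1, fn] = $3; F1[$1] }   # store field 3, remember the row name
END {
    for (i in F1) {
        out = i
        for (j = 1; j <= fn; j++)
            out = out OFS (((i, j) in S) ? S[i, j] : 0)   # 0 fills the holes
        print out
    }
}' "$dir"/file?.txt | sort)
printf '%s\n' "$merged"

rm -r "$dir"
```

With this data the ant row comes out as "ant 0 -46 4", since field 3 of the ant line in file3.txt is 4; the -2 shown in the question's expected output sits in field 4.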

Last edited by MadeInGermany; 05-02-2017 at 02:51 PM.
 
1 members found this post helpful.
Old 05-02-2017, 02:54 PM   #27
bioinfo17
LQ Newbie
 
Registered: Apr 2017
Posts: 10

Original Poster
Rep: Reputation: Disabled
Many thanks, Turbocapitalist and allend, for the code. It was very helpful.
 
Old 05-02-2017, 03:28 PM   #28
Shadow_7
Senior Member
 
Registered: Feb 2003
Distribution: debian
Posts: 4,137
Blog Entries: 1

Rep: Reputation: 874
Quote:
Originally Posted by bioinfo17 View Post
Hi,

I have multiple files in the format below (3 shown as an example; I have ~90 files):

file1.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
bat 7 5 -6 1 0 1

file2.txt
name a b c d e f
apple 1 2 -3 4 5 4
ant 4 -46 5 2 6 2
bat 7 5 -6 1 0 1

file3.txt
name a b c d e f
apple 1 2 -3 4 5 4
cat 4 6 5 2 6 2
ant 6 4 -2 5 8 6

I would like to merge the files based on column c, but the files can have different rows, so print 0 if a row is not present. To make it clear, the output should be:

results.txt file1 file2 file3
apple 2 2 2
cat 6 0 6
ant 0 -46 -2
bat 5 5 0

preferably with awk command would be great!! thanks
Clear as mud. Your "output should be" seems to be column B, not c. It also omits the "name" item, unless that was informative and not actually in the file.

This seems like something better suited to a multi-pass process, if only so that the first pass can identify every possible item. Possibly add a 2nd pass to normalize the files so they have ALL items (sorted?). Then it would make sense to use awk, IMO. Anything less would be a debug and validation nightmare, unless there's some sort of array / table / database / ??? in play that has not been mentioned yet. I assume that the output, if it does not interact with such a thing directly, is a CSV to be imported after the fact. In short, your specs are vague at best, and your example is wrong according to those specs.
 
1 members found this post helpful.
Old 05-02-2017, 04:19 PM   #29
bioinfo17
LQ Newbie
 
Registered: Apr 2017
Posts: 10

Original Poster
Rep: Reputation: Disabled
code by MadeInGermany worked wonders - thanks heaps. I need to sit down and learn each step ("steep" learning curve for me).
 
Old 05-02-2017, 11:03 PM   #30
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,292
Blog Entries: 3

Rep: Reputation: 3718Reputation: 3718Reputation: 3718Reputation: 3718Reputation: 3718Reputation: 3718Reputation: 3718Reputation: 3718Reputation: 3718Reputation: 3718Reputation: 3718
If you're just learning then this one may not be as clear as it is useful:

Code:
out=out OFS (((i,j) in S) ? S[i,j] : 0)
It's basically concatenating three variables. Because the variables are next to each other without commas, there will be no OFS in between. Since the first one is itself, it is in practice adding the last two variables to the first one. The OFS is a built-in variable containing the Output Field Separator. That is what goes between fields and, if you print with commas, variables on the way out. The last part with all the parentheses is really an if-then statement written in a common shorthand. If x, then y, else z.

Code:
x ? y : z
As a whole, the end of the line produces either the contents of S[i,j] or a 0. But the ((i,j) in S) does something a bit different than I thought, so I'll leave that to others. Though it looks like a check to see if S[i,j] is defined. The loops provide the numbers for i and j.
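
A small experiment may help here. The sketch below, with made-up keys and OFS set to "-" so it is easy to see, builds a string the same way and then shows that (i,j) in S only tests whether the key exists, whereas reading S[i,j] directly creates an empty element as a side effect.

```shell
#!/bin/sh
demo=$(awk 'BEGIN {
    OFS = "-"
    S["apple", 1] = 2

    # Juxtaposed values are concatenated; OFS appears only because it is written out
    out = "apple"
    out = out OFS ((("apple", 1) in S) ? S["apple", 1] : 0)   # key exists: use its value
    out = out OFS ((("apple", 2) in S) ? S["apple", 2] : 0)   # key missing: use 0
    print out

    # The "in" tests above did not add anything to S
    n = 0; for (k in S) n++
    print "after in: " n

    # Merely reading S["apple", 2] creates that element
    x = S["apple", 2]
    n = 0; for (k in S) n++
    print "after reference: " n
}')
printf '%s\n' "$demo"
```

That side effect is why the test is written with in rather than by comparing S[i,j] to an empty string: the array does not grow while it is being checked.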

Last edited by Turbocapitalist; 05-02-2017 at 11:08 PM.
 
1 members found this post helpful.
  

