Old 06-21-2012, 02:23 AM   #1
anon016
LQ Newbie
 
Bash: common lines in different files


Hi all,

I would like to share something a little tricky that I have not been able to solve yet.
I know it could be a real challenge.

Here is my problem:
I have 14 files, each with 17 columns. The number of lines in each file differs.
What I want to achieve is the following:
- Check the files for lines that have the same values in the first 5 columns;
- If a line with a certain 5-column key is not present in one of the files, that line should be added to every file where it is missing, with ALL the remaining columns set to zero.

Thanks for your time and support.

Here is an example:

INPUT

file1
a b c d e 4 . .
f g h i l 5 . .
m n o p q 6 . .

file2
a b c d e 4 . .
f g h i k 3 . .

file3
a b c d e 2 . .
f g h i l 1 . .
m n o p q 3 . .
r s t u v 5 . .

OUTPUT

file1
a b c d e 4 . .
f g h i l 5 . .
f g h i k 0 . .
m n o p q 6 . .
r s t u v 0 . .

file2
a b c d e 4 . .
f g h i k 3 . .
f g h i l 0 . .
m n o p q 0 . .
r s t u v 0 . .

file3
a b c d e 2 . .
f g h i l 1 . .
m n o p q 3 . .
r s t u v 5 . .
f g h i k 0 . .
 
Old 06-21-2012, 02:57 AM   #2
unSpawn
Moderator
 
So what have you done so far? Do post your (pseudo) script lines.
 
Old 06-21-2012, 03:12 AM   #3
anon016
LQ Newbie
 
I was thinking about something like process substitution and the awk command to format the data:

<(sort -b $file1 | awk '...')

Then somehow use the join command when equality does not occur.
This should be the way.
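Something along these lines, maybe (just a sketch of the idea: building one composite key lets join, which matches on a single field, compare all five columns at once):

Code:
# Prefix each line with a composite 5-field key, then sort on it.
keyed() { awk '{ print $1 "|" $2 "|" $3 "|" $4 "|" $5, $0 }' "$1" | sort -k1,1; }

# Lines of file2 whose 5-field key is missing from file1:
join -v 2 <(keyed file1) <(keyed file2)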
 
Old 06-21-2012, 03:38 AM   #4
pixellany
LQ Veteran
 
Quote:
What I want to achieve is the following:
- Check the files for lines that have the same values in the first 5 columns;
- If a line with a certain 5-column key is not present in one of the files, that line should be added to every file where it is missing, with ALL the remaining columns set to zero.
Let me re-state this to be sure I understand:
1. Check all files and find the lines with the same value in the 5th column.
2. If a file does not contain a line with a stated value in column 5, then add a line with that value in column 5, and zeros in all other columns.

Assuming that #2 is correct, then what do we do with the results of #1? (You don't need #1 to do #2)

Does the new line go at the end of the file?
 
Old 06-21-2012, 04:12 AM   #5
anon016
LQ Newbie
 
Ok, I will try to reformulate and explain this as best as I can.

1. Check all files for lines with the same values in ALL of the FIRST FIVE columns (e.g. a b c d e = a b c d e).
2. If, for example, file1 contains a b c d e but file2 doesn't, then a b c d e should be added to file2, with zeros in the remaining columns.
3. The aim is to have the same number of lines in each file using this criterion.
4. It doesn't matter if the new line goes at the end of the file. I can sort it later.

The posted example is still valid and I think it could help a lot.
For example:

file3 contains (r s t u v), but (r s t u v) is not present in file1 or file2, so it should be added to both file1 and file2 with zeros in all 12 remaining columns: (r s t u v 0 0 0 . .). And of course it remains (r s t u v 5 . .) in file3.

Thanks for the help.
 
Old 06-21-2012, 04:24 AM   #6
pixellany
LQ Veteran
 
First, I would write the pseudo-code for the whole thing---I imagine that you'll need some nested loops---e.g. the outer loop goes thru the files one by one. For each file, it grabs a line, gets the contents of the first 5 columns, and then loops thru all the other files checking for that pattern.

So---it looks like 3 nested loops. You can build something like this from the inside out or the outside in---in this case, I'd first write the loop that reads each file. But set up the overall structure in pseudo-code first.
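A rough skeleton of that structure might look like this (illustrative only---file1 file2 file3 stand in for the real 14 files, and single-space-separated fields without regex metacharacters are assumed):

Code:
#!/bin/bash
# Outer loop: each file. Middle loop: each line of that file.
# Inner loop: every other file, checked for the same 5-field key.
for f in file1 file2 file3; do
    while read -r c1 c2 c3 c4 c5 rest; do
        key="$c1 $c2 $c3 $c4 $c5"
        for g in file1 file2 file3; do
            [ "$g" = "$f" ] && continue
            # If no line in $g starts with the key, append it padded with zeros.
            grep -q "^$key " "$g" ||
                echo "$key 0 0 0 0 0 0 0 0 0 0 0 0" >> "$g"
        done
    done < "$f"
done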
 
Old 06-21-2012, 04:30 AM   #7
pan64
LQ Addict
 
Two-step process: collect all the lines (sorted) into one new file, then process each of your files against this common collection.
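For instance (a sketch of the idea; "allkeys" is just a scratch-file name, and single-space-separated columns are assumed):

Code:
#!/bin/bash
# Step 1: collect every distinct 5-field key, sorted, into one scratch file.
cut -d' ' -f1-5 file1 file2 file3 | sort -u > allkeys

# Step 2: append to each file the keys it lacks, padded with 12 zeros.
for f in file1 file2 file3; do
    comm -23 allkeys <(cut -d' ' -f1-5 "$f" | sort -u) |
        sed 's/$/ 0 0 0 0 0 0 0 0 0 0 0 0/' >> "$f"
done
rm allkeys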






_____________________________________
If someone helps you, or you approve of what's posted, click the "Add to Reputation" button, on the left of the post.
Happy with solution ... mark as SOLVED
(located in the "thread tools")
 
Old 06-21-2012, 04:33 AM   #8
anon016
LQ Newbie
 
Thanks, I'll try to do it in two steps.
Let's see.
 
Old 06-21-2012, 04:37 AM   #9
pixellany
LQ Veteran
 
Quote:
Originally Posted by pan64 View Post
Two-step process: collect all the lines (sorted) into one new file, then process each of your files against this common collection.
pan;
I suggest putting your signature stuff into an actual LQ signature (in your profile)---as it is, when I quote one of your posts, I get the signature too.....

Back to the topic:
Your approach and mine both wind up checking every file more than once for a given pattern (I don't know if there's a way around that). Do you have an opinion as to which is more efficient?
 
Old 06-21-2012, 04:46 AM   #10
pan64
LQ Addict
 
Quote:
Originally Posted by pixellany View Post
pan;
I suggest putting your signature stuff into an actual LQ signature (in your profile)---as it is, when I quote one of your posts, I get the signature too.....
Done, thanks.

Efficiency depends on the size of the files, and also on the time required to implement the solution. So maybe a dirty solution is good enough.
 
Old 06-21-2012, 05:40 AM   #11
pixellany
LQ Veteran
 
Sometimes, a "dirty" solution is also easier to understand......
 
Old 06-21-2012, 05:59 AM   #12
whizje
Member
 
If all the files are sorted, you can read them into an array and do a binary-search compare; each step halves the number of strings you still have to check. But I agree it might be best to find a dirty solution first and, if that works, optimize it.
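A sketch of that in bash (illustrative; find_key is a made-up name, namerefs need bash 4.3+, and LC_ALL=C keeps sort's order consistent with bash's string comparison):

Code:
#!/bin/bash
export LC_ALL=C

# Load the sorted 5-field keys of one file into an array, one key per element.
mapfile -t keys < <(awk '{ print $1, $2, $3, $4, $5 }' file1 | sort)

# Binary search: succeeds if $2 is present in the sorted array named by $1.
find_key() {
    local -n arr=$1
    local key=$2 lo=0 hi=$(( ${#arr[@]} - 1 )) mid
    while (( lo <= hi )); do
        mid=$(( (lo + hi) / 2 ))
        if [[ ${arr[mid]} == "$key" ]]; then
            return 0
        elif [[ ${arr[mid]} < "$key" ]]; then
            lo=$(( mid + 1 ))
        else
            hi=$(( mid - 1 ))
        fi
    done
    return 1
}

find_key keys "f g h i k" && echo present || echo missing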
 
Old 06-21-2012, 08:02 AM   #13
Nominal Animal
Senior Member
 
Here is my suggestion:
Code:
#!/usr/bin/awk -f
BEGIN {
    # Each line (any newline convention) is a separate record. Remove leading and trailing whitespace.
    RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"

    # Fields are separated by whitespace.
    FS = "[\t\v\f ]+"

    # Added records use newline as record separator,
    ORS = "\n"

    # and spaces as field separator.
    OFS = " "

    # After the five-field key, append this. Note the ORS at end.
    fill = OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" ORS

    # File counter.
    files = 0
}

# First record of a new file?
(FNR == 1) {
    filename[++files] = FILENAME
}

# At least five fields in this record?
(NF >= 5) {
    # Construct a key of the five first fields, separated by output field separator.
    key = $1 OFS $2 OFS $3 OFS $4 OFS $5

    # Add the key (without a value) to the global "seen" array.
    seen[key]

    # Add the per-file key to the "seenin" array, prefixed with the file number (shorter than name).
    seenin[files OFS key]
}

# After all input files have been processed:
END {
    # Loop over each input file:
    for (file = 1; file <= files; file++) {
        name = filename[file]

        # Check if each key is listed for this file; append if not.
        for (key in seen)
            if (!((file OFS key) in seenin))
                printf("%s%s", key, fill) >> name

        # Close the appended-to file.
        close(name)
    }
}
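To run it over all the input files in one pass, save it as, say, fill-missing.awk (the name is mine) and pass all the files as arguments. Note that a regular expression as RS is an extension beyond POSIX awk, so use gawk or another awk that supports it:

Code:
gawk -f fill-missing.awk file1 file2 file3    # ...or all 14 files at once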
Each combination of the first five fields forms a key. The key is saved into the seen array, across all input files. The input file (number) and the key together are saved as an index into another array, seenin. Note that all arrays in awk are associative; here, the indexes are strings.

The idea is that after all input records have been processed, the awk scriptlet loops over each input file and over each unique key (seen across all input files). Combining the file (number) and the key, we can check whether that key was listed for the current file. If not, we append it to the file.

Note that unlike most awk scripts, this one does modify its input files directly. (The missing records get appended at the end of the file.)

I/O-wise this script is quite efficient. It does use a lot of RAM, relatively speaking, though: there is one global array containing all the unique keys (first five fields) as strings, and another basically duplicating it for each file. If the input files are large (a large fraction of available RAM in total size), then that may pose a real problem.

One could significantly reduce the amount of memory used by the script by using the values in the seen array to note which files listed each key. There would then be no need for the seenin array, cutting the RAM requirements down to basically the set of unique five-field combinations (plus a character per file per combination). Currently the values are completely unused, not even set: just referencing an array element is enough in awk to define it, but it does not set any corresponding value.

For example, you could append a character to seen[key] corresponding to the current file, to note that the key was listed in that file: seen[key] = seen[key] sprintf("%c", 32 + files)
Then index(seen[key], sprintf("%c", 32 + file)) is zero if and only if file number file did not list key key. This should be safe portability-wise on all current Linux systems, for up to about 94 input files.
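In other words, the two rules would change roughly like this (an untested sketch of the idea, not a drop-in patch; everything else stays as in the script above):

Code:
# Mark this file in the key's value instead of using a second array.
(NF >= 5) {
    key = $1 OFS $2 OFS $3 OFS $4 OFS $5
    mark = sprintf("%c", 32 + files)
    if (!index(seen[key], mark))
        seen[key] = seen[key] mark
}

END {
    for (file = 1; file <= files; file++) {
        name = filename[file]
        for (key in seen)
            if (!index(seen[key], sprintf("%c", 32 + file)))
                printf("%s%s", key, fill) >> name
        close(name)
    }
}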

(You could go up to about 255 input files with that method if you make sure the locale is POSIX/C, by setting LANG=C and LC_ALL=C in the environment before running the script, and use sprintf("%c", files) -- remember, files starts at one. For higher counts, I'd save the file numbers separated by spaces, and look for the space-separated number instead of a character, in just about exactly the same way.)

Questions? Comments?
 
Old 06-21-2012, 08:52 AM   #14
pixellany
LQ Veteran
 
Nominal;

Not sure, but this may have been a homework problem---I hope you get a prize.....
 
Old 06-21-2012, 03:31 PM   #15
Nominal Animal
Senior Member
 
Quote:
Originally Posted by pixellany View Post
Not sure, but this may have been a homework problem
I seriously hope not.

However, the OP posed the problem as a Bash script, and mine is completely awk. From experience, I'd say any instructor should be able to notice immediately if a fledgling script programmer suddenly submits something like that for a homework question. One or two sharp questions would quickly reveal whether the programmer just copy-pasted it.

As a programming technique, I thought it was useful enough to show here. It is also a somewhat "unorthodox" approach, in that it first reads the input files and then appends to them in the END rule; I was hoping it might give others ideas on how to use a similar technique to solve other problems with awk.

Personally, I've used the "string of space-separated index numbers" variant I mentioned in the last part of my post in a few awk scripts, with very good results. The key there is to make sure you start with space-number-space, and then append each new number as number-space; then you can always safely check for any space-number-space in the string (via e.g. index(string, sprintf(" %d ", number)) in awk). It is not nearly as fast as associative arrays in awk, though; but when the indexes are potentially very long (like here, five input fields concatenated), it should save a lot of memory compared to awk's multidimensional arrays. (Multidimensional arrays have string indexes, with all indices concatenated with a separator to form the actual array index. For long indices, that wastes a lot of memory by repeating most of the same string over and over.)
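As a fragment, that variant looks roughly like this (illustrative, same caveats as above):

Code:
# Start the value as " N " for the first file that lists the key, then
# append "N " for each later file, so every number is framed by spaces.
(NF >= 5) {
    key = $1 OFS $2 OFS $3 OFS $4 OFS $5
    if (seen[key] == "")
        seen[key] = " " files " "
    else if (!index(seen[key], " " files " "))
        seen[key] = seen[key] files " "
}

# Later: file number 'file' listed key 'key' if and only if
# index(seen[key], sprintf(" %d ", file)) is nonzero.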
 
  

