Old 06-21-2012, 02:23 AM   #1
anon016
LQ Newbie
 
Bash: common lines in different files


Hi all,

I would like to share something a little tricky that I have not been able to solve yet.
I know it could be a real challenge.

Here is my problem:
I have 14 files, each with 17 columns. The number of lines in each file differs.
What I want to achieve is the following:
- Check the files for lines that have the same values in the first 5 columns;
- If a line with a certain 5-column key is not present in one of the files, that line should be added to every file where it is missing, with ALL the remaining columns set to zero.

Thanks for your time and support.

Here is an example:

INPUT

file1
a b c d e 4 . .
f g h i l 5 . .
m n o p q 6 . .

file2
a b c d e 4 . .
f g h i k 3 . .

file3
a b c d e 2 . .
f g h i l 1 . .
m n o p q 3 . .
r s t u v 5 . .

OUTPUT

file1
a b c d e 4 . .
f g h i l 5 . .
f g h i k 0 . .
m n o p q 6 . .
r s t u v 0 . .

file2
a b c d e 4 . .
f g h i k 3 . .
f g h i l 0 . .
m n o p q 0 . .
r s t u v 0 . .

file3
a b c d e 2 . .
f g h i l 1 . .
m n o p q 3 . .
r s t u v 5 . .
f g h i k 0 . .
 
Old 06-21-2012, 02:57 AM   #2
unSpawn
Moderator
 
So what have you done so far? Do post your (pseudo) script lines.
 
Old 06-21-2012, 03:12 AM   #3
anon016
LQ Newbie
 
I was thinking about something like process substitution and the awk command to format the data:

<(sort -b $file1 | awk '...')

Then somehow use the join command when equality does not occur.
This should be the way.
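Something along these lines, maybe (just a sketch of the idea: building one composite key lets join, which matches on a single field, compare all five columns at once):

Code:
# Prefix each line with a composite 5-field key, then sort on it.
keyed() { awk '{ print $1 "|" $2 "|" $3 "|" $4 "|" $5, $0 }' "$1" | sort -k1,1; }

# Lines of file2 whose 5-field key is missing from file1:
join -v 2 <(keyed file1) <(keyed file2)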
 
Old 06-21-2012, 03:38 AM   #4
pixellany
LQ Veteran
 
Quote:
What I want to achieve is the following:
- Check the files for lines that have the same values in the first 5 columns;
- If a line with a certain 5-column key is not present in one of the files, that line should be added to every file where it is missing, with ALL the remaining columns set to zero.
Let me re-state this to be sure I understand:
1. Check all files and find the lines with the same value in the 5th column.
2. If a file does not contain a line with a stated value in column 5, then add a line with that value in column 5, and zeros in all other columns.

Assuming that #2 is correct, then what do we do with the results of #1? (You don't need #1 to do #2)

Does the new line go at the end of the file?
 
Old 06-21-2012, 04:12 AM   #5
anon016
LQ Newbie
 
Ok, I will try to reformulate and explain this as best as I can.

1. Check all files for lines with the same values in ALL of the FIRST FIVE columns (e.g. a b c d e = a b c d e).
2. If, for example, file1 contains a b c d e but file2 doesn't, then a b c d e should be added to file2, with zeros in the remaining columns.
3. The aim is to have the same number of lines in each file using this criterion.
4. It doesn't matter if the new line goes at the end of the file. I can sort it later.

The posted example is still valid and I think it could help a lot.
For example:

file3 contains (r s t u v), but (r s t u v) is not present in file1 or file2, so it should be added to both file1 and file2 with zeros in all 12 remaining columns: (r s t u v 0 0 0 . .). And of course it remains (r s t u v 5 . .) in file3.

Thanks for the help.
 
Old 06-21-2012, 04:24 AM   #6
pixellany
LQ Veteran
 
First, I would write the pseudo-code for the whole thing---I imagine that you'll need some nested loops---e.g. the outer loop goes thru the files one by one. For each file, it grabs a line, gets the contents of the first 5 columns, and then loops thru all the other files checking for that pattern.

So---it looks like 3 nested loops. You can build something like this from the inside out or the outside in---in this case, I'd first write the loop that reads each file. But set up the overall structure in pseudo-code first.
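A rough skeleton of that structure might look like this (illustrative only---file1 file2 file3 stand in for the real 14 files, and single-space-separated fields without regex metacharacters are assumed):

Code:
#!/bin/bash
# Outer loop: each file. Middle loop: each line of that file.
# Inner loop: every other file, checked for the same 5-field key.
for f in file1 file2 file3; do
    while read -r c1 c2 c3 c4 c5 rest; do
        key="$c1 $c2 $c3 $c4 $c5"
        for g in file1 file2 file3; do
            [ "$g" = "$f" ] && continue
            # If no line in $g starts with the key, append it padded with zeros.
            grep -q "^$key " "$g" ||
                echo "$key 0 0 0 0 0 0 0 0 0 0 0 0" >> "$g"
        done
    done < "$f"
done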
 
Old 06-21-2012, 04:30 AM   #7
pan64
LQ Addict
 
Two-step process: collect all the lines (sorted) into one new file, then process each of your files against this common collection.
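For instance (a sketch of the idea; "allkeys" is just a scratch-file name, and single-space-separated columns are assumed):

Code:
#!/bin/bash
# Step 1: collect every distinct 5-field key, sorted, into one scratch file.
cut -d' ' -f1-5 file1 file2 file3 | sort -u > allkeys

# Step 2: append to each file the keys it lacks, padded with 12 zeros.
for f in file1 file2 file3; do
    comm -23 allkeys <(cut -d' ' -f1-5 "$f" | sort -u) |
        sed 's/$/ 0 0 0 0 0 0 0 0 0 0 0 0/' >> "$f"
done
rm allkeys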






_____________________________________
If someone helps you, or you approve of what's posted, click the "Add to Reputation" button, on the left of the post.
Happy with solution ... mark as SOLVED
(located in the "thread tools")
 
Old 06-21-2012, 04:33 AM   #8
anon016
LQ Newbie
 
Thanks, I'll try to do it in two steps.
Let's see.
 
Old 06-21-2012, 04:37 AM   #9
pixellany
LQ Veteran
 
Quote:
Originally Posted by pan64 View Post
Two-step process: collect all the lines (sorted) into one new file, then process each of your files against this common collection.
pan;
I suggest putting your signature stuff into an actual LQ signature (in your profile)---as it is, when I quote one of your posts, I get the signature too.....

Back to the topic:
Your approach and mine both wind up checking every file more than once for a given pattern (I don't know if there's a way around that). Do you have an opinion as to which is more efficient?
 
Old 06-21-2012, 04:46 AM   #10
pan64
LQ Addict
 
Quote:
Originally Posted by pixellany View Post
pan;
I suggest putting your signature stuff into an actual LQ signature (in your profile)---as it is, when I quote one of your posts, I get the signature too.....
Done, thanks.

Efficiency depends on the size of the files, and also on the time required to implement the solution. So maybe a dirty solution is good enough.
 
Old 06-21-2012, 05:40 AM   #11
pixellany
LQ Veteran
 
Sometimes, a "dirty" solution is also easier to understand......
 
Old 06-21-2012, 05:59 AM   #12
whizje
Member
 
If all the files are sorted, you can read them into an array and do a binary-search compare; each step halves the number of strings you still have to check. But I agree it might be best to find a dirty solution first and, if that works, optimize it.
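A sketch of that in bash (illustrative; find_key is a made-up name, namerefs need bash 4.3+, and LC_ALL=C keeps sort's order consistent with bash's string comparison):

Code:
#!/bin/bash
export LC_ALL=C

# Load the sorted 5-field keys of one file into an array, one key per element.
mapfile -t keys < <(awk '{ print $1, $2, $3, $4, $5 }' file1 | sort)

# Binary search: succeeds if $2 is present in the sorted array named by $1.
find_key() {
    local -n arr=$1
    local key=$2 lo=0 hi=$(( ${#arr[@]} - 1 )) mid
    while (( lo <= hi )); do
        mid=$(( (lo + hi) / 2 ))
        if [[ ${arr[mid]} == "$key" ]]; then
            return 0
        elif [[ ${arr[mid]} < "$key" ]]; then
            lo=$(( mid + 1 ))
        else
            hi=$(( mid - 1 ))
        fi
    done
    return 1
}

find_key keys "f g h i k" && echo present || echo missing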
 
Old 06-21-2012, 08:02 AM   #13
Nominal Animal
Senior Member
 
Here is my suggestion:
Code:
#!/usr/bin/awk -f
BEGIN {
    # Each line (any newline convention) is a separate record. Remove leading and trailing whitespace.
    RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"

    # Fields are separated by whitespace.
    FS = "[\t\v\f ]+"

    # Added records use newline as record separator,
    ORS = "\n"

    # and spaces as field separator.
    OFS = " "

    # After the five-field key, append this. Note the ORS at end.
    fill = OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" OFS "0" ORS

    # File counter.
    files = 0
}

# First record of a new file?
(FNR == 1) {
    filename[++files] = FILENAME
}

# At least five fields in this record?
(NF >= 5) {
    # Construct a key of the five first fields, separated by output field separator.
    key = $1 OFS $2 OFS $3 OFS $4 OFS $5

    # Add the key (without a value) to the global "seen" array.
    seen[key]

    # Add the per-file key to the "seenin" array, prefixed with the file number (shorter than name).
    seenin[files OFS key]
}

# After all input files have been processed:
END {
    # Loop over each input file:
    for (file = 1; file <= files; file++) {
        name = filename[file]

        # Check if each key is listed for this file; append if not.
        for (key in seen)
            if (!((file OFS key) in seenin))
                printf("%s%s", key, fill) >> name

        # Close the appended-to file.
        close(name)
    }
}
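To run it over all the input files in one pass, save it as, say, fill-missing.awk (the name is mine) and pass all the files as arguments. Note that a regular expression as RS is an extension beyond POSIX awk, so use gawk or another awk that supports it:

Code:
gawk -f fill-missing.awk file1 file2 file3    # ...or all 14 files at once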
Each combination of the first five fields forms a key. The key is saved into the seen array, across all input files. The input file (number) and the key together are saved as an index into another array, seenin. Note that all arrays in awk are associative; here, the indexes are strings.

The idea is that after all input records have been processed, the awk scriptlet loops over each input file and over each unique key (seen across all input files). Combining the file (number) and the key, we can check whether that key was listed for the current file. If not, we append it to the file.

Note that unlike most awk scripts, this one does modify its input files directly. (The missing records get appended at the end of the file.)

I/O-wise this script is quite efficient. It does use a lot of RAM, relatively speaking, though: there is one global array containing all the unique keys (first five fields) as strings, and another basically duplicating it for each file. If the input files are large (a large fraction of available RAM in total size), then that may pose a real problem.

One could significantly reduce the amount of memory used by the script by using the values in the seen array to note which files listed each key. There would then be no need for the seenin array, cutting the RAM requirements down to basically the set of unique five-field combinations (plus a character per file per combination). Currently the values are completely unused, not even set: just referencing an array element is enough in awk to define it, but it does not set any corresponding value.

For example, you could append a character to seen[key] corresponding to the current file, to note that the key was listed in that file: seen[key] = seen[key] sprintf("%c", 32 + files)
Then index(seen[key], sprintf("%c", 32 + file)) is zero if and only if file number file did not list key key. This should be safe portability-wise on all current Linux systems, for up to about 94 input files.
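In other words, the two rules would change roughly like this (an untested sketch of the idea, not a drop-in patch; everything else stays as in the script above):

Code:
# Mark this file in the key's value instead of using a second array.
(NF >= 5) {
    key = $1 OFS $2 OFS $3 OFS $4 OFS $5
    mark = sprintf("%c", 32 + files)
    if (!index(seen[key], mark))
        seen[key] = seen[key] mark
}

END {
    for (file = 1; file <= files; file++) {
        name = filename[file]
        for (key in seen)
            if (!index(seen[key], sprintf("%c", 32 + file)))
                printf("%s%s", key, fill) >> name
        close(name)
    }
}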

(You could go up to about 255 input files with that method if you make sure the locale is POSIX/C, by setting LANG=C and LC_ALL=C in the environment before running the script, and use sprintf("%c", files) -- remember, files starts at one. For higher counts, I'd save the file numbers separated by spaces, and look for the space-separated number instead of a character, in just about exactly the same way.)

Questions? Comments?
 
Old 06-21-2012, 08:52 AM   #14
pixellany
LQ Veteran
 
Nominal;

Not sure, but this may have been a homework problem---I hope you get a prize.....
 
Old 06-21-2012, 03:31 PM   #15
Nominal Animal
Senior Member
 
Quote:
Originally Posted by pixellany View Post
Not sure, but this may have been a homework problem
I seriously hope not.

However, the OP posed the problem as a Bash script, and mine is completely awk. From experience, I'd say any instructor should be able to notice immediately if a fledgling script programmer suddenly submits something like that for a homework question. One or two sharp questions would quickly reveal whether the programmer just copy-pasted it.

As a programming technique, I thought it was useful enough to show here. It is also a somewhat "unorthodox" approach, in that it first reads the input files and then appends to them in the END rule; I was hoping it might give others ideas on how to use a similar technique to solve other problems with awk.

Personally, I've used the "string of space-separated index numbers" variant I mentioned in the last part of my post in a few awk scripts, with very good results. The key there is to make sure you start with space-number-space, and then append each new number as number-space; then you can always safely check for any space-number-space in the string (via e.g. index(string, sprintf(" %d ", number)) in awk). It is not nearly as fast as associative arrays in awk, though; but when the indexes are potentially very long (like here, five input fields concatenated), it should save a lot of memory compared to awk's multidimensional arrays. (Multidimensional arrays have string indexes, with all indices concatenated with a separator to form the actual array index. For long indices, that wastes a lot of memory by repeating most of the same string over and over.)
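As a fragment, that variant looks roughly like this (illustrative, same caveats as above):

Code:
# Start the value as " N " for the first file that lists the key, then
# append "N " for each later file, so every number is framed by spaces.
(NF >= 5) {
    key = $1 OFS $2 OFS $3 OFS $4 OFS $5
    if (seen[key] == "")
        seen[key] = " " files " "
    else if (!index(seen[key], " " files " "))
        seen[key] = seen[key] files " "
}

# Later: file number 'file' listed key 'key' if and only if
# index(seen[key], sprintf(" %d ", file)) is nonzero.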
 
  

