LinuxQuestions.org - [SOLVED] Copy and replacing specific line from file1 to file2 line by line

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - Copy and replacing specific line from file1 to file2 line by line (https://www.linuxquestions.org/questions/programming-9/copy-and-replacing-specific-line-from-file1-to-file2-line-by-line-870158/)

Copy and replacing specific line from file1 to file2 line by line

I have two files, file1.traj and file2.traj. Both these files contain identical data and the data are arranged in same format in them. The first line of both files is a comment.

At line 7843 of both files there is a cartesian coordinate X, Y and Z ( three digits ). And at line 15685 there is another three digits. The number of lines in between two cartesian coordinates are 7841. And there are few hundreds of thousands of lines in a file.

What I need to do is copy the X Y Z coordinate (three digits) from file1.traj at line 7843 and paste into file2.traj at the same line number as in file1.traj. The next line will be 15685 from file1.traj and replace at line 15685 at file2.traj. And I dont want other lines (data) in file2.traj get altered. This sequence shall be going on until the end of the file. Means copy and substitude the selected lines from file1.traj into file2.traj.

I tried to use paste command but I cant do for specified line alone.

Here i showed the data format in the file. I used the line number for clarity purpose.

Code:

line.1    trajectory generated by ptraj

line.2      5.844  4.178  7.821  6.423  4.054  8.578  6.606  4.907  6.827  7.557

line.3      4.385  6.722  6.877  6.384  7.283  5.950  6.884  7.565  7.668  6.282

line.2      8.474  7.721  7.127  8.928  7.628  7.205  6.259  8.589  6.712  6.110

line.3      7.712  8.602  6.643  8.151  8.654  7.495  6.940  7.183  4.871  6.108

line.4      7.887  4.864  7.755  7.814  3.754  8.697  7.267  3.724  7.081  7.633

line.5      2.478  6.246  8.089  2.604  8.026  8.853  3.943  6.623  5.754  4.529

    .

    .

    .

    .          1.516  41.749  54.260  0.108  41.176  54.536  -0.626  40.627  53.818  -0.303

    .          41.920  42.179  3.556  3.251  41.623  3.530  2.472  42.558  2.678  3.304

    .          44.723  1.496  5.937  44.339  1.355  6.803  44.866  0.614  5.593  52.401

line.7842      86.323  2.974  52.385  85.816  3.785  51.879  85.808  2.359

line.7843    104.140 159.533  88.303

line.7844      4.792  5.052  8.317  5.279  4.463  8.898  5.663  5.341  7.220  6.267

line.7845      4.438  7.137  6.477  6.566  7.627  5.857  7.407  7.936  7.301  6.170

    .          8.741  7.647  7.020  9.023  7.315  7.107  6.475  8.171  6.435  6.413

    .          7.823  8.416  6.704  8.208  8.473  7.582  6.560  7.126  5.141  5.816

    .

    .

    .

    .          52.050  7.905  42.026  38.561  1.747  39.847  39.375  2.235  39.972  38.634

    .          1.382  38.965  0.810  0.477  39.394  0.717  -0.349  39.867  0.222  1.081

    .          39.847  43.073  5.033  2.756  43.387  5.428  1.942  42.256  4.598  2.511

line.15683    47.302  4.261  7.071  47.801  4.632  7.799  47.256  4.968  6.428  54.279

line.15684      0.498  3.477  53.964  0.612  2.580  53.500  0.612  4.021

line.15685    104.140 159.533  88.303

line.15686      4.970  4.868  7.979  5.342  4.250  8.612  5.988  5.450  7.184  6.903

line 15687      4.861  7.246  6.381  6.921  7.550  5.526  7.597  7.536  6.953  7.009

    .

    .

If the lines containing the XYZ coordinates are the only ones with three numbers, you can try to retrieve them along with the line number using grep:

Code:

grep -En '^[ ]*[0-9.]+[ ]+[0-9.]+[ ]+[0-9.]+[ ]*$' file1.traj

This takes in account leading and trailing spaces (if any) and any number of spaces between numbers. It assumes there are not tabs instead of spaces, otherwise use the generic pattern [:space:].

Once you've retrieved this information you can easily use sed with the c command to replace a specific line. Putting all together in a loop:

Code:

while read number line

do

  number=${number/:/}

  echo sed -i "${number}c ${line}" file2.traj

done < <(grep -En '^[ ]*[0-9.]+[ ]+[0-9.]+[ ]+[0-9.]+[ ]*$' file1.traj)

The echo statement is just for testing purposes, whereas the -i option of sed will edit the file in place. After having tested the loop works properly, remove the echo and run again. In any case keep a backup copy of the original file. You never know...! ;)

Thanks so much.
This script works fine as I wanted.
But there is only one thing lacking. That is about the position of the data.

The decimal points are not aligned straight. The replaced data should be pushed two spaces to the right hand site.

I trying to figure out this but in vein.

Regards
Vijay

Correct. The read statement uses the white space as field delimiter so that any leading space is removed from the line. If I interpret things correctly, the problem is to retain the original format of the XYZ line with leading blank spaces (if any), right?

In this case you have to change the IFS variable (see man bash for details) that is the Input Field Separator. This is actually mandatory to get the correct results, since if the XYZ line does not contain leading spaces, the line number is not read properly from the grep's output (I just didn't notice this before). In other words, suppose the grep command give something like:

Code:

7843:104.140 159.533  88.303

15685:  104.140 159.533  88.303

the read statement that uses blank space as delimiter will get:

Code:

number="7843:104.140" line="159.533  88.303"

number="15685:"      line="104.140 159.533  88.303"

respectively. Instead if the delimiter is : (colon) it will get:

Code:

number="7843"        line="104.140 159.533  88.303"

number="15685"        line="  104.140 159.533  88.303"

which is what we want. Another problem arises: the c command of sed removes the leading spaces unless you put a backslash in front of the line. But the line is referenced as a shell variable so that the first character after \ would be $. In this case the $ will be escaped and it will be interpreted literally resulting in a wrong substitution. For this reason you have to escape the backslash with another backslash.

Sorry for the confusion. It's not easy to explain clearly. Anyway, this is the code:

Code:

OLD_IFS="$IFS"

IFS=":"



while read number line

do

  sed -i "${number}c \\${line}" file2.traj

done < <(grep -En '^[ ]*[0-9.]+[ ]+[0-9.]+[ ]+[0-9.]+[ ]*$' file1.traj)



IFS="$OLD_IFS"

Cheers! :)

Dear Colucix,
Your solution is perfect. It is working exactly how want it.
Thanks so much for your kind.

Cheers

Dear Sir,

I have additional issue to ask related to the coding above.

The file that I am operating nearly has got 7,842,000 lines. Means the replacement has to taken place every 7842 lines and it should go 1000 times. When I calculate the time taken to do this job, it is around 17 hours in supercomputer.

So I wonder if there is possible to alter this code to speed up the process.

Is that possible to extract only the coordinates (lines with three numbers) from file1.traj into separate file (lets say coordinate.txt) and use the data from this file to substitute the same line in the file2.traj?

Awk should be much faster than a shell loop. Could you check (with a shorter files, say just 78421 lines) if this does what you want? (I tested it with a dictionary file, so I do believe it should work correctly.)

Code:

awk -v "other=file1.traj" '

    BEGIN {

        split("", replacement)

        r = 0

        while ((getline line < other) > 0) {

            r++

            if ((r > 1) && (r % 7842 == 1))

                replacement[r] = line

        }

    }



    {

        if ((NR > 1) && (NR % 7842 == 1))

            print replacement[NR]

        else

            print

    }

' file2.traj > new.traj

Note that this script will first read thorough the entire file1.traj file (keeping only the replacement lines in memory). When using huge data files, it'll take a while before it starts saving data to new.traj.

And here's a C program you can use, if you really have that large input files. It's probably faster than any scripting version. (It reads the input files in parallel, too, so there is no delay in output.)

Code:

#include <stdio.h>

#include <string.h>

#include <stdlib.h>

#include <errno.h>



int main(int argc, char *argv[])

{

    char    buffer1[65536];

    char    buffer2[65536];

    char    *line1;

    char    *line2;

    FILE    *in1;

    FILE    *in2;



    long    headerlines, lines1, lines2;

    long    lines, total;



    int      status = 0;

    char    dummy;



    if (argc != 6) {

        fprintf(stderr, "\n");

        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);

        fprintf(stderr, "      %s <header> <copy1> <copy2> file1 file2 [ > output ]\n", argv[0]);

        fprintf(stderr, "Where\n");

        fprintf(stderr, "      <header>    is the number of initial header lines,\n");

        fprintf(stderr, "      <copy1>    is the number of lines to copy from file1,\n");

        fprintf(stderr, "      <copy2>    is the number of lines to copy from file2.\n");

        fprintf(stderr, "This program reads lines from file1 and file2 in parallel.\n");

        fprintf(stderr, "First, <header> lines are copied from file1 to output.\n");

        fprintf(stderr, "Then, for as long as there is input in file1 and file2,\n");

        fprintf(stderr, "<copy1> lines are copied from file1, then <copy2> lines from file2.\n");

        fprintf(stderr, "The output will end whenever either file1 or file2 runs out.\n");

        fprintf(stderr, "If there is still lines in one but not other, a warning is printed.\n");

        fprintf(stderr, "\n");

        return 1;

    }



    if (sscanf(argv[1], "%ld %c", &headerlines, &dummy) != 1) {

        fprintf(stderr, "%s: Invalid number of initial header lines.\n", argv[1]);

        return 1;

    }

    if (headerlines < 0L) {

        fprintf(stderr, "%s: Invalid number of initial header lines.\n", argv[1]);

        return 1;

    }



    if (sscanf(argv[2], "%ld %c", &lines1, &dummy) != 1) {

        fprintf(stderr, "%s: Invalid number of lines from %s.\n", argv[2], argv[4]); 

        return 1;

    }

    if (lines1 < 1L) {

        fprintf(stderr, "%s: Invalid number of lines from %s.\n", argv[2], argv[4]); 

        return 1;

    }



    if (sscanf(argv[3], "%ld %c", &lines2, &dummy) != 1) {

        fprintf(stderr, "%s: Invalid number of lines from %s.\n", argv[3], argv[5]); 

        return 1;

    }

    if (lines2 < 1L) {

        fprintf(stderr, "%s: Invalid number of lines from %s.\n", argv[3], argv[5]); 

        return 1;

    }



    in1 = fopen(argv[4], "rb");

    if (!in1) {

        char const *const error = strerror(errno);

        fprintf(stderr, "%s: %s.\n", argv[4], error);

        return 1;

    }



    in2 = fopen(argv[5], "rb");

    if (!in2) {

        char const *const error = strerror(errno);

        fprintf(stderr, "%s: %s.\n", argv[5], error);

        fclose(in1);

        return 1;

    }



    total = 0L;



    lines = headerlines;

    while (lines > 0L) {

        line1 = fgets(buffer1, sizeof(buffer1), in1);

        line2 = fgets(buffer2, sizeof(buffer2), in2);

        if (!line1 || !line2)

            break;



        lines--;

        total++;

        fputs(line1, stdout);

    }



    while (line1 && line2) {



        lines = lines1;

        while (line1 && line2 && lines > 0L) {

            line1 = fgets(buffer1, sizeof(buffer1), in1);

            line2 = fgets(buffer2, sizeof(buffer2), in2);

            if (!line1 || !line2)

                break;



            lines--;

            total++;

            fputs(line1, stdout);

        }

        if (lines != 0L || !line1 || !line2)

            break;



        lines = lines2;

        while (line1 && line2 && lines > 0L) {

            line1 = fgets(buffer1, sizeof(buffer1), in1);

            line2 = fgets(buffer2, sizeof(buffer2), in2);

            if (!line1 || !line2)

                break;



            lines--;

            total++;

            fputs(line2, stdout);

        }

        if (lines != 0L || !line1 || !line2)

            break;

    }



    if (ferror(in1))

        fprintf(stderr, "%s: Read error.\n", argv[4]);

    if (ferror(in2))

        fprintf(stderr, "%s: Read error.\n", argv[5]);

    if (ferror(in1) || ferror(in2)) {

        fclose(in1);

        fclose(in2);

        return 1;

    }



    if (line1 || !feof(in1)) {

        fprintf(stderr, "Warning: %s had excess lines.\n", argv[4]);

        status |= 2;

    }

    if (fclose(in1)) {

        char const *const error = strerror(errno);        

        fprintf(stderr, "Warning: %s: %s.\n", argv[4], error);

        status |= 4;

    }



    if (line2 || !feof(in2)) {

        fprintf(stderr, "Warning: %s had excess lines.\n", argv[5]);

        status |= 8;

    }

    if (fclose(in2)) {

        char const *const error = strerror(errno);

        fprintf(stderr, "Warning: %s: %s.\n", argv[5], error);

        status |= 16;

    }



    return status;

}

Save the above code as e.g. mergelines.c, and compile it using e.g.

Code:

gcc -Wall -O3 -o mergelines mergelines.c

Run ./mergelines to see the usage. To duplicate the function of my awk script, run

Code:

./mergelines 1 7841 1 file2.traj file1.traj > new.traj

Hope you find this useful.

Thank you so much for your kind.

I tried the awk script so beneficent. It took about just 10 minutes to convert 10 files each with around 7 million lines of data.
Seems awk is so powerful. How I could get a grip on this language? Could you suggest any website which give good explanation on awk?

Regards

Vijay

Quote:

Originally Posted by vjramana (Post 4305904)

You're welcome!

I personally use The GNU Awk User Manual a lot when writing awk scripts.
I'd recommend first reading the Getting started section, then starting by writing some test scripts or scripts you already need or use for your data manipulation, and looking at the manual for interesting functions to use. I've especially found the Built-in variables section and the String functions section quite informative. Also, picking apart the awk scripts you find here might be fun.

Note that GNU awk (gawk) is more powerful than most other awk implementations, since it contains additional functions for e.g. sorting which other awks do not have. (If you read the GNU awk manual carefully, it does say which features are standard and which are gawk extensions.)

Then, when you feel a bit more comfortable, start looking at the examples in the manual. They are well explained, although a bit complex. I'd say they are more useful when you already are comfortable with writing simple awk scripts that modify or create data files.

Hope this helps.

If in India, you can buy "The UNIX programming environment" by Brian Kernighan and Rob Pike. It has got a very good general introduction to Unix (and awk) and is available in book stores.