Replacing part (lines) of a file (bash or perl)

zomane · 10-25-2007, 12:44 PM

Hello all,
I have two files.
File orig.txt (the bigger one, almost 7000 lines):

Code:

12345 $secondfield $thirdfield $nfield SDFFF
23456 $secondfield $thirdfield $nfield DFEDFRGFF
34567 8090988 33435 655646 SFFEFEFKKLKL
90783 5433543 54532543 5454 HJHJHGGH
76576 435345 534 5453566767 WEQRTQ
and so on ....

File repl.txt (the smaller one, 1000 lines):

Code:

34567 1111 3354 566  SFFEFEFKKLKL
90783 324324 255 54435 HJHJHGGH
76576 3232 4545 4554 WEQRTQ

Field names (to avoid confusion), in the order they appear in the example files:

USERNUMBER AMOUNT1 AMOUNT2 AMOUNT3 NAME

orig.txt contains all the USERNUMBERs from repl.txt, but the AMOUNT[1-3] values are different.
I want to replace each matching line in orig.txt with the corresponding line from repl.txt.
Can someone give me an idea how to do that?
First I tried this:

Code:

cat repl.txt | awk '{print $1}' > USERNUMBERS.txt 
for i in $(awk '{print $1}' < USERNUMBERS.txt ) ; do  grep -v $i orig.txt > removed_wrong_lines.txt ; done

I expected removed_wrong_lines.txt to contain only the correct lines, after which I would simply do

Code:

cat removed_wrong_lines.txt repl.txt > corrected.txt

but my experiment was unsuccessful.
I would be thankful for any suggestions on how to solve this.

cfaj · 10-25-2007, 07:41 PM

Is this what you want?

Code:

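awk '
# First file (repl.txt): store each line under its USERNUMBER
FNR == NR { x[$1] = $0; next }
# Second file (orig.txt): print the stored replacement if any, else the original line
{ print (x[$1]) ? x[$1] : $0 }
' repl.txt orig.txt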
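
As for why the first attempt did not work: each pass of the for loop runs `grep -v ... > removed_wrong_lines.txt`, and `>` truncates the file every time, so only the last USERNUMBER ends up filtered out. A minimal sketch of that remove-then-append idea done in a single pass is below. The scratch directory, the array name `seen` and the amounts on the 12345 line are made up for illustration; note that, unlike the command above, this variant moves the replaced lines to the end of the output.

```shell
# Work in a scratch directory so nothing real is overwritten
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Cut-down sample files (the 12345 amounts are invented for this sketch)
cat > orig.txt <<'EOF'
12345 100 200 300 SDFFF
34567 8090988 33435 655646 SFFEFEFKKLKL
90783 5433543 54532543 5454 HJHJHGGH
EOF
cat > repl.txt <<'EOF'
34567 1111 3354 566  SFFEFEFKKLKL
90783 324324 255 54435 HJHJHGGH
EOF

# Collect every USERNUMBER from repl.txt, then keep only the orig.txt
# lines whose first field is NOT one of them (one pass, one output file)
awk 'FNR == NR { seen[$1]; next } !($1 in seen)' repl.txt orig.txt > removed_wrong_lines.txt

# Append the replacement lines, as in the original plan
cat removed_wrong_lines.txt repl.txt > corrected.txt
cat corrected.txt
```

Comparing on the first field only also avoids a side effect of a plain `grep -v $i`, which would drop any line containing the number anywhere, for example inside an AMOUNT field.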

zomane · 10-26-2007, 01:19 AM

Thanks,
I have one question about your awk construct.
The order of the USERNUMBERs in the two files is not important, am I right?
I mean, if USERNUMBER_xxx is on line 3520 in orig.txt and on line 542 in repl.txt, this will not cause any confusion in the replacement.
If I understand all of the above correctly, it works for me, but if the answer to my question is "NO", then my first post was not formatted correctly.

PAix · 10-26-2007, 04:10 PM

Hi Zomane,
Cfaj's bit of code is not dependent on the ordering, as I will explain. I suspect that you asked about the ordering because you didn't fully understand how the code works. It took me a moment too, so for the benefit of others I will describe it.

Quote:

The files repl.txt and orig.txt are specified in that particular order for good reason. The files are read one after the other.
NR is the number of records read so far since the start of input, counted across all files.
FNR is the record number within the current input file; it restarts at 1 for each new file.
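
To see the two counters side by side, here is a small sketch. The file names first.txt and second.txt and their contents are invented for the demonstration; FILENAME is the built-in awk variable holding the name of the file currently being read.

```shell
# Work in a scratch directory with two tiny made-up files
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'a\nb\n' > first.txt
printf 'c\nd\ne\n' > second.txt

# Print the file name, NR and FNR for every record read
counters=$(awk '{ print FILENAME, NR, FNR }' first.txt second.txt)
printf '%s\n' "$counters"
# first.txt 1 1
# first.txt 2 2
# second.txt 3 1
# second.txt 4 2
# second.txt 5 3
```

NR and FNR agree only while the first file is being read, which is exactly what the FNR == NR pattern relies on.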

Code:

FNR==NR { x[$1] = $0; next }

This reads the first file (the shorter repl.txt). At this point FNR and NR are the same, so the complete file is read line by line into the array x[ ], indexed by $1, the contents of the first column, USERNUMBER. (This is not an index number; we are talking about an associative array. The key is a unique string or number associated with the record.)
USERNUMBER is assumed to be unique within the repl.txt file; otherwise subsequent occurrences will overwrite earlier ones during this first phase of populating the array. Clue: if your array has fewer records than the repl.txt file, then one or more duplicates have been found, but that's for you to worry about elsewhere if necessary. I just thought you should know about it.
So now our replacement array is full of replacement text.
At this point FNR and NR are still synchronised; from what you say, let us assume both are 1000.
The first record read from the second file, orig.txt, will see NR become 1001 while FNR becomes 1. Plainly the action in the first line of code will no longer be executed; only its pattern is still tested. Note that while FNR and NR matched and the array was being populated, the 'next' statement caused the next line of the file to be read in without proceeding to the next statement in the code.
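
If you do want to check repl.txt for duplicate USERNUMBERs before relying on the array, something along these lines should do it. This is only a sketch: the scratch directory, the array name `count` and the third sample line (a deliberate duplicate of 34567) are made up for the demonstration.

```shell
# Work in a scratch directory with a repl.txt that contains one duplicate key
tmp=$(mktemp -d) && cd "$tmp" || exit 1
cat > repl.txt <<'EOF'
34567 1111 3354 566 SFFEFEFKKLKL
90783 324324 255 54435 HJHJHGGH
34567 2222 1 1 SFFEFEFKKLKL
EOF

# count[$1]++ yields the number of earlier sightings of this key,
# so "== 1" is true exactly on the second occurrence: each duplicated
# USERNUMBER is reported once, however often it repeats
dups=$(awk 'count[$1]++ == 1 { print $1 }' repl.txt)
printf '%s\n' "$dups"
```

An empty result means every USERNUMBER in repl.txt is unique and nothing will be silently overwritten.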

Code:

{ print (x[$1]) ? x[$1] : $0 }

No pattern here indicates that this line of code should be executed for each and every line read in. Execution has been prevented so far because of the tight loop mentioned. Now, however, the pattern in the first line of code no longer matches, so no more Mr Tight Loop. Instead, welcome to the ternary operator:

Code:

selector ? if-true-exp : if-false-exp

In our code the brackets around the piece of code preceding the ? make sure the array lookup, using the value of USERNUMBER in the current record, is evaluated as the selector. If a replacement exists, print the value from the array. If it doesn't exist (no replacement), print the current (original) line. Strictly speaking, the test is on whether the stored value is non-empty rather than on whether the key exists, but every line stored from repl.txt is non-empty, so the effect is the same.
From this point the next record is read in and the code iterates naturally as described until the end of file.
The input files appear at the end of the code and are read in almost as if they were one, except for the behaviour of the record counters NR and FNR.
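
One small refinement, offered as a variation rather than a correction: merely looking up x[$1] creates an empty array entry for every USERNUMBER that had no replacement, whereas awk's `in` operator tests for the key without creating anything. A sketch with that test is below; the scratch directory and the amounts on the 12345 line are invented for the example.

```shell
# Work in a scratch directory with cut-down sample files
tmp=$(mktemp -d) && cd "$tmp" || exit 1
cat > orig.txt <<'EOF'
12345 100 200 300 SDFFF
34567 8090988 33435 655646 SFFEFEFKKLKL
90783 5433543 54532543 5454 HJHJHGGH
EOF
cat > repl.txt <<'EOF'
34567 1111 3354 566  SFFEFEFKKLKL
90783 324324 255 54435 HJHJHGGH
EOF

# Same two-phase logic, but the selector asks "is this key in the array?"
# The whole ternary is wrapped in brackets so print sees one expression
merged=$(awk '
FNR == NR { x[$1] = $0; next }
{ print (($1 in x) ? x[$1] : $0) }
' repl.txt orig.txt)
printf '%s\n' "$merged"
```

The output keeps the line order of orig.txt, with the 34567 and 90783 lines taken from repl.txt and the 12345 line left untouched.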

Oh, were I able to make the description as short as Cfaj's super couple of lines of code!

So it does everything that it says on the tin, and as long as you understand the nature of duplicates, you are thoroughly home and dry.

PAix