Old 03-21-2011, 12:15 PM   #1
Lowellj
LQ Newbie
 
Registered: Mar 2011
Posts: 6

Rep: Reputation: 0
match and combine 2 text files line by line


This solution works but is slow with large files. I am looking for a faster solution.

The two files both contain filenames. One of them also has associated tag data, which I want to append to the matching filenames in the other file.

file1:
unique_filename:tag1,tag2,etc
unique_filename:tag1,tag2,etc
unique_filename:tag1,tag2,etc

file2:
"file_path","unique_filename"
"file_path","unique_filename"
"file_path","unique_filename"

I update file2 by matching each unique_filename and appending its tag data, with some formatting:

appended file2:
"file_path","unique_filename","tag1,tag2,etc"
"file_path","unique_filename","tag1,tag2,etc"
"file_path","unique_filename","tag1,tag2,etc"

Here is the SLOW code

# Slow: sed rewrites all of file2.csv once for every line of file1.csv
while read -r inputline
do
    filename="$(echo "$inputline" | cut -d: -f1)"
    tags="$(echo "$inputline" | cut -d: -f2)"
    sed -e "s|\"$filename\"|\"$filename\",\"$tags\"|" -i file2.csv
done < file1.csv
 
Old 03-21-2011, 12:51 PM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
Hi and welcome to LinuxQuestions!

How big are the files? Is there an exact match between the file names in the first file and those in the second file? Do the files have the same number of lines? If so, the following awk code should work, and it should be faster, too:
Code:
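# First file (file1.csv): map each filename to its tags
FNR == NR {
  _[$1] = $2
}

# Second file (file2.csv): look up the filename and append its tags
FNR < NR {
  split($0,__,"\"")   # __[4] is the second quoted field, the filename
  print $0 ",\"" _[__[4]] "\""
}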
To run this code (assuming you saved it in a file called test.awk), do:
Code:
awk -F: -f test.awk file1.csv file2.csv
The order of the file arguments matters: file1.csv must come first.
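The order matters because the script tells the two files apart only by comparing FNR with NR. A minimal sketch of how those counters behave, using hypothetical two-line sample files (the names and contents below are made up for illustration):

```shell
#!/bin/sh
# Hypothetical two-line sample files, created only for this demonstration.
dir=$(mktemp -d)
printf '%s\n' 'a.txt:red' 'b.txt:blue' > "$dir/file1.csv"
printf '%s\n' '"/tmp","a.txt"' '"/tmp","b.txt"' > "$dir/file2.csv"

# NR counts records across all inputs; FNR restarts at 1 for each file.
# FNR == NR is therefore true only while awk is reading the first file.
result=$(awk '{ print NR, FNR, (FNR == NR ? "first-file" : "second-file") }' \
    "$dir/file1.csv" "$dir/file2.csv")
printf '%s\n' "$result"
rm -rf "$dir"
```

This prints "1 1 first-file", "2 2 first-file", "3 1 second-file" and "4 2 second-file", one per line. Swapping the two file arguments would make awk treat file2.csv as the lookup table.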
 
Old 03-21-2011, 01:00 PM   #3
Lowellj
LQ Newbie
 
Registered: Mar 2011
Posts: 6

Original Poster
Rep: Reputation: 0
The files may be as large as 20,000 lines each, and they will not be exact matches, although the filenames can be matched so that I can append the tag data to the correct filenames.

Some of the filenames have no tag data, so they are omitted from the first file. I may be able to retain them, and then the files might have exactly the same number of lines in matching order. I'm not sure whether this is possible, though; I'd have to try it and see.

If it is the best method available, it might be worth trying.
 
Old 03-21-2011, 01:12 PM   #4
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
20,000 lines are trivial for awk, unless they are very, very long. Anyway, the suggested code needs a small modification so that file names without tags are printed unchanged:
Code:
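# First file (file1.csv): map each filename to its tags
FNR == NR {
  _[$1] = $2
}

# Second file (file2.csv): append tags only when the filename has an entry
FNR < NR {
  split($0,__,"\"")
  if ( __[4] in _ )
    print $0 ",\"" _[__[4]] "\""
  else
    print
}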
I don't know how familiar you are with awk. Feel free to ask for any clarification about the code.
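As a rough illustration of the modified script, here is a sketch that runs it on hypothetical sample data in which one filename (img002.jpg) has no entry in file1. The file names, paths and tags are assumptions made up for the example:

```shell
#!/bin/sh
# Hypothetical sample data: img002.jpg deliberately has no tags in file1.csv.
dir=$(mktemp -d)
cat > "$dir/file1.csv" <<'EOF'
img001.jpg:beach,sunset
img003.jpg:family
EOF
cat > "$dir/file2.csv" <<'EOF'
"/photos","img001.jpg"
"/photos","img002.jpg"
"/photos","img003.jpg"
EOF

# The script from the post above, saved as test.awk.
cat > "$dir/test.awk" <<'EOF'
FNR == NR { _[$1] = $2 }
FNR < NR {
  split($0,__,"\"")
  if ( __[4] in _ )
    print $0 ",\"" _[__[4]] "\""
  else
    print
}
EOF

result=$(awk -F: -f "$dir/test.awk" "$dir/file1.csv" "$dir/file2.csv")
printf '%s\n' "$result"
rm -rf "$dir"
```

The lines for img001.jpg and img003.jpg gain a third quoted field with their tags, while the img002.jpg line is printed exactly as it was.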
 
Old 03-21-2011, 02:08 PM   #5
Lowellj
LQ Newbie
 
Registered: Mar 2011
Posts: 6

Original Poster
Rep: Reputation: 0
Thank you so much for your help.

This is my first exposure to awk.
I will need to learn awk code and syntax to understand the script.

I did try to run it as you instructed. It seems to output only the contents of file2. If I can get it working, I will need to output it to a file.

Is the code you posted supposed to handle two files with differing numbers of lines?

I have been able to sort both files by filename, and there appear to be some lines missing from file2 compared with file1.
Since the filenames contain a numerical sequence, I could try inserting dummy lines to bring both files into line-by-line agreement and then delete the dummy lines when complete, although this might be more trouble than it is worth.

Last edited by Lowellj; 03-21-2011 at 02:09 PM.
 
Old 03-21-2011, 02:22 PM   #6
Lowellj
LQ Newbie
 
Registered: Mar 2011
Posts: 6

Original Poster
Rep: Reputation: 0
It looks like it is working now, and it is quite speedy.

My file2 actually has more fields than described above, so I changed the [4] to [33] accordingly.

Now I need to write the output to a text file, and I am running it from within a ./ script (probably not described properly, but you get my meaning).

Thanks again for your help so far. It looks like I might want to use awk for more things in the future. It's pretty fast :-)
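To see why the index had to change, it can help to print what split() produces when the separator is a double quote. A small sketch on a hypothetical two-field line (the path and filename are made up); which index is right for a real file2 depends on how many quote characters precede the filename:

```shell
#!/bin/sh
# Hypothetical line in the same "file_path","unique_filename" layout.
line='"/data","report.txt"'

# Split on the double-quote character and print every piece with its index.
result=$(printf '%s\n' "$line" | awk '{
    n = split($0, parts, "\"")
    for (i = 1; i <= n; i++)
        printf "[%d]=<%s>\n", i, parts[i]
}')
printf '%s\n' "$result"
```

For this line the pieces are [1]=<>, [2]=</data>, [3]=<,>, [4]=<report.txt> and [5]=<>, so the filename sits at index 4. Running the same loop on one real line of file2 shows which index holds the filename there.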
 
Old 03-21-2011, 02:55 PM   #7
Lowellj
LQ Newbie
 
Registered: Mar 2011
Posts: 6

Original Poster
Rep: Reputation: 0
OK, I got it working in the script file by simply invoking the above command and keeping the separate test.awk file.

One problem, though: I can't open the text file created with awk in text editors. vi works, but others don't.

awk -F: -f test.awk file1.csv file2.csv > file3.csv
 
Old 03-21-2011, 03:54 PM   #8
Lowellj
LQ Newbie
 
Registered: Mar 2011
Posts: 6

Original Poster
Rep: Reputation: 0
yay, I got it working
 
Old 03-21-2011, 05:08 PM   #9
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
Yep! You've done it! Glad to see it works!

Regarding awk, it is indeed worth learning. It can speed up a lot of tasks. I use it intensively in production environments and in day-to-day work. If you'd like to learn it, I suggest the official GNU awk manual, a great piece of documentation: http://www.gnu.org/software/gawk/manual/ For a lighter approach: http://www.grymoire.com/Unix/Awk.html.
 
Old 03-21-2011, 09:21 PM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,250

Rep: Reputation: 2684
Just to show some of the variety in what awk can do:
Code:
awk -F: 'FNR==NR{_[$1]=$2;next}match($0,/^[^,]+,"([^"]+)/,f) && f[1] in _{$0=$0",\""_[f[1]]"\""}1' file1.csv file2.csv
You can of course put that in your test.awk file. Note that the three-argument form of match() requires GNU awk.
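The three-argument match() is a gawk extension, so here is a sketch of the same idea using only POSIX awk features. It is not the poster's code: the array names tags and q and the sample data are assumptions chosen for readability.

```shell
#!/bin/sh
# Hypothetical sample data; b.txt has no tags on purpose.
dir=$(mktemp -d)
printf '%s\n' 'a.txt:red,blue' > "$dir/file1.csv"
printf '%s\n' '"/tmp","a.txt"' '"/tmp","b.txt"' > "$dir/file2.csv"

result=$(awk -F: '
FNR == NR { tags[$1] = $2; next }      # first file: remember tags, skip the rest
{
    split($0, q, "\"")                 # q[4] is the second quoted field
    if (q[4] in tags)
        $0 = $0 ",\"" tags[q[4]] "\""  # append the tags as a new quoted field
}
1                                      # a bare true pattern prints the record
' "$dir/file1.csv" "$dir/file2.csv")
printf '%s\n' "$result"
rm -rf "$dir"
```

The next statement stops the remaining rules from running on lines of the first file, which is why no FNR < NR test is needed, and the trailing 1 is an always-true pattern whose default action is to print.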
 
  

