LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 09-05-2010, 11:27 AM   #1
bonzer21
LQ Newbie
 
Registered: Sep 2010
Location: Midlands, United Kingdom
Posts: 6

Rep: Reputation: 0
Merge lists with common heads/tails


Generally I have been able to find answers to most of my Linux problems on LQ, but today I have come across one that has finally prompted me to register, so hello to all.

My question is probably better explained by example; I have two files that look like this:

Code:
$ cat file1.txt 
mercury
venus
earth
mars
jupiter
saturn

$ cat file2.txt 
mars
jupiter
saturn
uranus
neptune
I wish to merge them into one file, like cat but without repeating the contiguous block of duplicated lines (if any), in this example resulting in the complete list of planets.

I have an idea of how to achieve this by repeatedly comparing the tail of file1 with the head of file2, starting with -n set to the number of lines in the smaller of the two files and working down to zero, but I'd be surprised if there isn't a simpler way of doing this, given how close the output of diff looks:

Code:
$ diff -y file1.txt file2.txt
mercury							      <
venus							      <
earth							      <
mars								mars
jupiter								jupiter
saturn								saturn
							      >	uranus
							      >	neptune
However, whatever options I specify, diff seems exclusively geared towards turning file1 into file2.

The general stipulations are as you'd expect: both files are of arbitrary length; file1 will generally be larger than file2, but this is not guaranteed; and the common lines (if any) will always be contiguous, occurring only at the end of file1 and the start of file2.
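
The tail/head comparison described above can be sketched as a small shell function. This is only a minimal sketch under stated assumptions, not a finished tool: the name merge_overlap is made up for illustration, both files are assumed to end with a newline, and the files are re-read once per candidate length, so it may be slow on very large logs.

```shell
# merge_overlap FILE1 FILE2   (hypothetical helper name)
# Print FILE1 followed by FILE2, dropping the longest head of FILE2 that is
# identical to the tail of FILE1. Neither input file is modified.
merge_overlap() (
    f1=$1
    f2=$2
    # Arithmetic expansion strips the padding some wc implementations add.
    n1=$(( $(wc -l < "$f1") ))
    n2=$(( $(wc -l < "$f2") ))
    # Start with the length of the shorter file and work down to zero.
    if [ "$n1" -lt "$n2" ]; then n=$n1; else n=$n2; fi
    while [ "$n" -gt 0 ]; do
        # Compare the last n lines of FILE1 with the first n lines of FILE2.
        if [ "$(tail -n "$n" "$f1")" = "$(head -n "$n" "$f2")" ]; then
            break
        fi
        n=$((n - 1))
    done
    cat "$f1"
    # tail -n +K starts printing at line K, so this skips the first n lines.
    tail -n "+$((n + 1))" "$f2"
)
```

Called as `merge_overlap file1.txt file2.txt > merged.txt`, this should give the complete list of planets for the example files, because the longest matching tail/head is the three lines mars, jupiter, saturn. Lines repeated elsewhere in either file are left alone, since only the junction is compared.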

Thanks in advance
 
Old 09-05-2010, 11:47 AM   #2
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Code:
cat file1 file2 | sort | uniq

Last edited by crts; 09-05-2010 at 12:07 PM.
 
Old 09-05-2010, 11:58 AM   #3
bonzer21
LQ Newbie
 
Registered: Sep 2010
Location: Midlands, United Kingdom
Posts: 6

Original Poster
Rep: Reputation: 0
Thanks crts, I didn't expect such a quick response!

I did look into using comm, but it does seem to require that the files be sorted. The real-world application of what I'm doing is with log files that don't have a timestamp (the lack of timestamp is unfortunately beyond my control), so the order must be maintained.
 
Old 09-05-2010, 12:08 PM   #4
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Quote:
Originally Posted by bonzer21
Thanks crts, I didn't expect such a quick response!

I did look into using comm, but it does seem to require that the files be sorted. The real-world application of what I'm doing is with log files that don't have a timestamp (the lack of timestamp is unfortunately beyond my control), so the order must be maintained.
Hi,

my initial suggestion did not work as I expected.
I edited it in the meantime. Please have a look at my first post again.
 
Old 09-05-2010, 12:20 PM   #5
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
I don't suppose it's possible that the only common lines in the files are in the contiguous section? If so, you could simply remove any and all duplicates from one file before catting them together. Maybe something like this:
Code:
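# -x matches whole lines only, so a line of file2 cannot knock out a longer line of file1 that merely contains it
grep -vxF -f file2 file1 ; cat file2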

Last edited by David the H.; 09-05-2010 at 12:21 PM. Reason: removed pointless comment about posting, after seeing corrected info
 
Old 09-05-2010, 12:36 PM   #6
bonzer21
LQ Newbie
 
Registered: Sep 2010
Location: Midlands, United Kingdom
Posts: 6

Original Poster
Rep: Reputation: 0
@crts: All fine except the sort; in my initial example I'd get the planets in alphabetical order. Not having timestamps is a real annoyance, as otherwise I would be able to sort | uniq and not have to worry about losing duplicate records.

@David: Sadly no, there will be common lines elsewhere in both files. I'd cat the files together as a last resort; the effort here is to handle where they overlap. The analogy I keep thinking of is stitching together photographs to make a panorama.

I wonder if I can take this anywhere (added --left-column):

Code:
$ diff --side-by-side --left-column file1.txt file2.txt 
mercury							      <
venus							      <
earth							      <
mars							      (
jupiter							      (
saturn							      (
							      >	uranus
							      >	neptune
All I need to do is kind of "push" the two columns together...
 
Old 09-05-2010, 12:44 PM   #7
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Quote:
Originally Posted by bonzer21
@crts: All fine except the sort, in my initial example I'd get the planets in alphabetical order. Not having timestamps is a real annoyance as otherwise I would be able to sort | uniq, and not have to worry about losing duplicate records.
Ok,

I did not see the sort problem before. Try this:

Code:
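# Append to file1 every line of file2 that is not already present in it.
# grep -F: fixed string, -x: whole line, -q: no output, only the exit status
while IFS= read -r line
do
	if ! grep -qxF -- "$line" file1; then
		printf '%s\n' "$line" >> file1
	fi
done < file2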
This will alter file1, so make a backup first.

Last edited by crts; 09-05-2010 at 12:45 PM.
 
Old 09-05-2010, 01:22 PM   #8
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Unfortunately, crts, that has a similar problem to what I posted above. It doesn't fully take into account duplicate lines occurring outside the contiguous zone. In a couple of tests I ran, duplicates existing in file2 appear to be lost.

The diff thing should work though, if a bit clumsily. The following sed expression should filter out the formatting; at least it works in the simple test.
Code:
diff --side-by-side --left-column file1.txt file2.txt | sed -rn -e 's/^(.*[^ \t])[ \t]+[<(]$/\1/p' -e 's/^[ \t]+>\t(.*)$/\1/p'
 
Old 09-05-2010, 01:39 PM   #9
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Quote:
Originally Posted by David the H.
Unfortunately, crts, that has a similar problem to what I posted above. It doesn't fully take into account duplicate lines occurring outside the contiguous zone.
This is the data I tested the above script with.
Code:
file1:

mercury
venus
earth
mars
jupiter
saturn


file2:

jupiter
saturn
uranus
neptune
mars
I was under the impression that mars has to be left out. Maybe the OP can state what the expected output for this scenario is. Is it
Code:
mercury
venus
earth
mars
jupiter
saturn
uranus
neptune
mars
by any chance?
 
Old 09-05-2010, 01:51 PM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
I'd say it's pretty clear from his previous posts that only the overlapping duplication where the two files meet should be affected, and all the other lines should be left alone. Like stitching together a panorama, as he said. So yes, the result should be like in your final example.

I think it would help to clear things up if we could get a real-life example of the text used, however. It would also let me confirm whether my regex is properly formatted.

Last edited by David the H.; 09-05-2010 at 01:55 PM. Reason: added comment
 
Old 09-05-2010, 03:12 PM   #11
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Quote:
Originally Posted by David the H.
I'd say it's pretty clear from his previous posts that only the overlapping duplication where the two files meet should be affected ....
Well, that did not seem obvious to me. Anyway, I tried your sed solution with the following samples:
Code:
file1
mercury
uranus
venus
jupiter
earth
mars
jupiter
saturn

file2
jupiter
saturn
uranus
neptune
mars

file3
mercury
uranus
venus
jupiter
earth
mars
jupiter
saturn
neptune
Now, if I have understood it correctly, merging file1 and file2 should look like
Code:
mercury
uranus
venus
jupiter
earth
mars
jupiter
saturn
uranus
neptune
mars
However, the contiguous
jupiter
saturn

does not appear in the diff/sed solution.
Here is another suggestion
Code:
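#!/bin/bash
# invoke as ./script.sh fileA fileB
# Appends fileB to fileA, skipping the lines at the head of fileB that
# repeat the tail of fileA. fileA is modified in place.

# Print the number of the last line in file $2 that equals $1 exactly
# (fixed string, whole line); print nothing if there is no such line.
lastLine() { grep -nxF -- "$1" "$2" | tail -n 1 | cut -d: -f1; }

count=0
lastOccurence=$(lastLine "$(head -n 1 "$2")" "$1")
while IFS= read -r line
do
	# Stop at the first line of fileB that does not continue the run in fileA
	if [[ -n $lastOccurence && $(lastLine "$line" "$1") == "$lastOccurence" ]];then
		(( count++ ))
		(( lastOccurence++ ))
	else
		break
	fi
done < "$2"

# lastOccurence is now one past the matched run; the run only counts as an
# overlap if it ends on the last line of fileA
if (( count > 0 && $(wc -l < "$1") == lastOccurence - 1 ));then
	sed -e "1,$count d" "$2" >> "$1"
else
	cat "$2" >> "$1"
fi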
The first file will be altered. I tested it with some combinations of the above mentioned sample files. It seems to work and merge only the contiguous parts at the head/tail of the processed files.

@OP: some feedback would be helpful ...

Last edited by crts; 09-06-2010 at 09:31 PM. Reason: typo
 
Old 09-05-2010, 03:42 PM   #12
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi.

This awk script is not correct, see later post for correct version.

This looks like a desire for a union. I know of no native command that does this, but awk could be used to perform a uniq without a sort (i.e. to preserve the order).

So for the files data1, data2 (side-by-side):
Code:
mercury	mars
venus	jupiter
earth	saturn
mars	uranus
jupiter	neptune
saturn
an awk script:
Code:
awk '
BEGIN	{ len = 1 }
	{ if ( $0 in a ) {
	    next
	  } else {
	    a[len++] = $0
	  }
	}
END	{ for (i=1;i<=len;i++ ) print a[i] }
' data1 data2
produces:
Code:
mercury
venus
earth
mars
jupiter
saturn
mars
jupiter
saturn
uranus
neptune
The condition being that the unique list must fit into memory.
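
For reference, a common awk idiom for an order-preserving uniq is sketched below. The wrapper name uniq_keep_order and the array name seen are only illustrative. Note that it drops every repeated line wherever it occurs, so it only suits data in which lines are never legitimately repeated, and like the script above it has to hold the unique lines in memory.

```shell
# uniq_keep_order FILE...   (hypothetical helper name)
# Print each distinct line once, in order of first appearance.
uniq_keep_order() {
    # seen[$0]++ evaluates to 0 the first time a line is met, so
    # !seen[$0]++ is true exactly once per distinct line and awk prints it.
    awk '!seen[$0]++' "$@"
}
```

For the two planet files, `uniq_keep_order data1 data2` should print the eight planets once each, in their original order.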

Best wishes ... cheers, makyo

Last edited by makyo; 09-05-2010 at 05:31 PM.
 
Old 09-05-2010, 03:53 PM   #13
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Hi,

this produces the same result as a
cat file1 file2

This is not what is desired.
 
Old 09-05-2010, 05:29 PM   #14
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Does not address the problem at hand. However, for a far better version of this awk code, see post # 33.

Hi, crts.

Thanks for noticing the blunder. Here's a better awk-uniq:
Code:
awk '
BEGIN { a = "" ; len = 1 }
      { if ( match(a,$0) ) {
          # print "already in a: " $0
          next
          } else {
          # print "adding " $0 " to " a
          if ( length(a) == 0 ) {
            a = $0
            } else {
            a = a ";" $0
            len++
          }
        }
      }
END   {
        split(a,b,";")
        for ( i = 1; i <= len ; i++ ) { print b[i] }
      }
' data1 data2
producing:
Code:
mercury
venus
earth
mars
jupiter
saturn
uranus
neptune
Best wishes ... cheers, makyo

Last edited by makyo; 09-06-2010 at 09:42 AM.
 
Old 09-05-2010, 05:42 PM   #15
bonzer21
LQ Newbie
 
Registered: Sep 2010
Location: Midlands, United Kingdom
Posts: 6

Original Poster
Rep: Reputation: 0
Thanks for all your efforts thus far. I think David's onto it, but I will use a different example that's closer to the real-world problem. Imagine a fictional access log of a web server without timestamps:

Code:
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
Log interpretation: two visitors; I'll call them Dawkins and Sagan.
  1. Dawkins visits my home page
  2. ...followed by Sagan
  3. Dawkins clicks through to my info page
  4. Sagan stays on my home page to view my source, refreshing a couple of times
  5. Dawkins returns to my home page

My situation is that I don't have the original log above. My aim is to recreate it from the excerpts that I do have, which are split across two files that may overlap slightly, as follows:
Code:
   $ cat log1.txt
A  123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
   123.456.789.012 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
   123.456.789.012 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
B  123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
C  987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
   987.654.321.098 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
   987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
   987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
   123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
   123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"

   $ cat log2.txt
   987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
   987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
   123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
   123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
B  123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
C  987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
C  987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
A  123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
B  123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
Sure enough, if the log files contained timestamps then each line would be unique, and I would be able to simply cat the files and pipe to uniq, and could even sort in between if the timestamps were at the start of each line.

The log files I have to work with are already in chronological order, but without timestamps I can't append to log1.txt a filtered log2.txt from grep as it would lose all duplicated lines, which is sometimes undesirable (as in the case of a web server log). I've attempted to label the duplicates that should remain with corresponding letters.

diff's output is tantalisingly close to what I'm looking for, but after trying it on my example above it appears to truncate long lines:

Code:
$ diff -y log1.txt log2.txt
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macinto <
123.456.789.012 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/5 <
123.456.789.012 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozil <
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 <
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compati <
987.654.321.098 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/4 <
987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozil	987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozil
987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0	987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0	123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozil	123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozil
							      >	123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0
							      >	987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compati
							      >	987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compati
							      >	123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macinto
							      >	123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0

Last edited by bonzer21; 09-05-2010 at 05:43 PM.
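
One possible way around the truncation seen in the side-by-side output, assuming GNU diff: its line-format options print the old, common and new lines in full, without the two-column layout. The function name diff_union is hypothetical. The sketch relies on diff aligning only the overlap, as it does in the side-by-side output above; if diff pairs up lines elsewhere in the files, some repeats that should remain will be merged.

```shell
# diff_union FILE1 FILE2   (hypothetical helper name; needs GNU diff)
# Print the lines diff reports as deleted, unchanged or added, in diff's
# order, each exactly as it appears in its file (%L = line plus newline).
diff_union() {
    diff --old-line-format='%L' \
         --unchanged-line-format='%L' \
         --new-line-format='%L' \
         "$1" "$2"
    # diff exits with 1 when the files differ; only >1 signals trouble.
    [ $? -le 1 ]
}
```

Usage would be `diff_union log1.txt log2.txt > merged.txt`. Because no column layout is produced, there is nothing to filter out afterwards with sed.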
 
  


All times are GMT -5. The time now is 02:25 PM.
