09-05-2010, 11:27 AM | #1
bonzer21
LQ Newbie
Registered: Sep 2010
Location: Midlands, United Kingdom
Posts: 6
Merge lists with common heads/tails
Generally I have been able to find answers to most of my Linux problems on LQ, but today I have come across one that has finally prompted me to register, so hello to all
My question is probably better explained by example; I have two files that look like this:
Code:
$ cat file1.txt
mercury
venus
earth
mars
jupiter
saturn
$ cat file2.txt
mars
jupiter
saturn
uranus
neptune
I wish to merge them into one file, like cat but without repeating the contiguous block of duplicated lines (if any); in this example the result would be the complete list of planets.
I have an idea of how to achieve this by repeatedly comparing the tail of file1 with the head of file2, starting with -n as the number of lines in the smaller of the two files and working it down to zero, but I'd be surprised if there isn't a simpler way of doing this given how close the output of diff looks:
Code:
$ diff -y file1.txt file2.txt
mercury     <
venus       <
earth       <
mars          mars
jupiter       jupiter
saturn        saturn
            > uranus
            > neptune
However, whatever options I specify, diff seems exclusively geared towards turning file1 into file2.
The general stipulations are as you'd expect: both files are of arbitrary length; generally file1 will be larger than file2, but this is not guaranteed; and the common lines (if any) will always be contiguous and occur only at the end of file1 and the start of file2.
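To make that idea concrete, a rough sketch of the head/tail comparison approach might look like this (merge_overlap.sh is just an illustrative name, and the script is only a starting point, not a polished tool):
Code:
#!/bin/bash
# merge_overlap.sh -- sketch of the "compare tail of file1 with head of file2" idea
# Usage: ./merge_overlap.sh file1.txt file2.txt > merged.txt

f1=$1
f2=$2

len1=$(wc -l < "$f1")
len2=$(wc -l < "$f2")

# the overlap can be at most as long as the shorter file
max=$(( len1 < len2 ? len1 : len2 ))

# try the largest possible overlap first and shrink it until the
# last n lines of file1 match the first n lines of file2
overlap=0
for (( n = max; n > 0; n-- )); do
    if [[ "$(tail -n "$n" "$f1")" == "$(head -n "$n" "$f2")" ]]; then
        overlap=$n
        break
    fi
done

# print file1 in full, then file2 minus the overlapping head
cat "$f1"
tail -n +"$(( overlap + 1 ))" "$f2"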
Thanks in advance 
09-05-2010, 11:47 AM | #2
crts
Senior Member
Registered: Jan 2010
Posts: 2,020
Code:
cat file1 file2 | sort | uniq
Last edited by crts; 09-05-2010 at 12:07 PM.
09-05-2010, 11:58 AM | #3
bonzer21
LQ Newbie
Registered: Sep 2010
Location: Midlands, United Kingdom
Posts: 6
Original Poster
Thanks crts, I didn't expect such a quick response!
I did look into using comm, but it does seem to require that the files be sorted. The real-world application of what I'm doing is with log files that don't have a timestamp (the lack of timestamp is unfortunately beyond my control), so the order must be maintained.
09-05-2010, 12:08 PM | #4
crts
Senior Member
Registered: Jan 2010
Posts: 2,020
Quote:
Originally Posted by bonzer21
Thanks crts, I didn't expect such a quick response!
I did look into using comm, but it does seem to require that the files be sorted. The real-world application of what I'm doing is with log files that don't have a timestamp (the lack of timestamp is unfortunately beyond my control), so the order must be maintained.
Hi,
my initial suggestion did not work as I expected. I edited it in the meantime; please have a look at my first post again.
09-05-2010, 12:20 PM | #5
David the H.
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
I don't suppose it's possible that the only common lines in the files will be in the contiguous section? Because then you could simply remove any and all duplicates from one file before catting them together. Maybe something like this:
Code:
grep -vF -f file2 file1 ; cat file2
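(A small refinement, assuming only whole-line matches should count: grep's -x option restricts the patterns to complete lines.)
Code:
grep -vxF -f file2 file1 ; cat file2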
Last edited by David the H.; 09-05-2010 at 12:21 PM.
Reason: removed pointless comment about posting, after seeing corrected info
09-05-2010, 12:36 PM | #6
bonzer21
LQ Newbie
Registered: Sep 2010
Location: Midlands, United Kingdom
Posts: 6
Original Poster
@crts: All fine except the sort; in my initial example I'd get the planets in alphabetical order. Not having timestamps is a real annoyance, as otherwise I would be able to sort | uniq and not have to worry about losing duplicate records.
@David: Sadly no, there will be common lines elsewhere in both files too. I'd cat the files together as a last resort; the effort here is to handle where they overlap. The analogy I keep thinking of is stitching together photographs to make a panorama.
I wonder if I can take this anywhere (added --left-column):
Code:
$ diff --side-by-side --left-column file1.txt file2.txt
mercury     <
venus       <
earth       <
mars        (
jupiter     (
saturn      (
            > uranus
            > neptune
All I need to do is kind of "push" the two columns together...
09-05-2010, 12:44 PM | #7
crts
Senior Member
Registered: Jan 2010
Posts: 2,020
Quote:
Originally Posted by bonzer21
@crts: All fine except the sort, in my initial example I'd get the planets in alphabetical order. Not having timestamps is a real annoyance as otherwise I would be able to sort | uniq, and not have to worry about losing duplicate records.
OK,
I did not see the sort problem before. Try this:
Code:
while read line
do
    if [[ $(grep "$line" file1) == "" ]]; then
        echo "$line" >> file1
    fi
done < file2
This will alter file1, so make a backup first.
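(Note that an unanchored grep matches the pattern anywhere within a line; if whole-line, fixed-string matching is what's wanted, the test could instead be written like this:)
Code:
if ! grep -qxF -- "$line" file1; then
    echo "$line" >> file1
fi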
Last edited by crts; 09-05-2010 at 12:45 PM.
09-05-2010, 01:22 PM | #8
David the H.
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
Unfortunately, crts, that has a similar problem to what I posted above: it doesn't fully take into account duplicate lines occurring outside the contiguous zone. In a couple of tests I ran, duplicates existing in file2 appear to be lost.
The diff thing should work, though, if a bit clumsily. The following sed expression should filter out the formatting; at least it works in a simple test.
Code:
sed -rn -e 's/^(.*[^[ \t])[ \t]+[<(]$/\1/p' -e 's/^[ \t]+>\t(.*)$/\1/p'
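Applied to the --left-column output from post #6, the whole thing would presumably run as a pipeline like the following, which for the toy planet files should print the merged list:
Code:
$ diff -y --left-column file1.txt file2.txt | sed -rn -e 's/^(.*[^[ \t])[ \t]+[<(]$/\1/p' -e 's/^[ \t]+>\t(.*)$/\1/p'
mercury
venus
earth
mars
jupiter
saturn
uranus
neptune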
09-05-2010, 01:39 PM | #9
crts
Senior Member
Registered: Jan 2010
Posts: 2,020
Quote:
Originally Posted by David the H.
Unfortunately, crts, that has a similar problem to what I posted above. It doesn't fully take into account duplicate lines occuring outside the contiguous zone.
This is the data I tested the above script with:
Code:
file1:
mercury
venus
earth
mars
jupiter
saturn
file2:
jupiter
saturn
uranus
neptune
mars
I was under the impression that mars has to be left out. Maybe the OP can state what the expected output for this scenario is. Is it
Code:
mercury
venus
earth
mars
jupiter
saturn
uranus
neptune
mars
by any chance?
09-05-2010, 01:51 PM | #10
David the H.
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
I'd say it's pretty clear from his previous posts that only the overlapping duplication where the two files meet should be affected, and all the other lines should be left alone. Like stitching together a panorama, as he said. So yes, the result should be like in your final example.
I think it would help to clear things up if we could get a real-life example of the text used, however. I could then also confirm whether my regex is properly formatted.
Last edited by David the H.; 09-05-2010 at 01:55 PM.
Reason: added comment
09-05-2010, 03:12 PM | #11
crts
Senior Member
Registered: Jan 2010
Posts: 2,020
Quote:
Originally Posted by David the H.
I'd say it's pretty clear from his previous posts that only the overlapping duplication where the two files meet should be affected ....
Well, that did not seem obvious to me. Anyway, I tried your sed solution with the following samples:
Code:
file1
mercury
uranus
venus
jupiter
earth
mars
jupiter
saturn
file2
jupiter
saturn
uranus
neptune
mars
file3
mercury
uranus
venus
jupiter
earth
mars
jupiter
saturn
neptune
Now, if I have understood it correctly, merging file1 and file2 should look like this:
Code:
mercury
uranus
venus
jupiter
earth
mars
jupiter
saturn
uranus
neptune
mars
However, the contiguous
jupiter
saturn
does not appear in the diff/sed solution.
Here is another suggestion:
Code:
#!/bin/bash
# invoke as ./script.sh fileA fileB
count=0
# line number of the last occurrence in fileA of fileB's first line
lastOccurence=$(grep -n "$(head -n 1 ${2})" "$1" | sed -nr '$ {s/^([0-9]*):.*/\1/;p}')
while read -r line
do
    # keep going while each line of fileB matches the next line of fileA
    if [[ $(grep -n "$line" "$1" | sed -nr '$ {s/^([0-9]*):.*/\1/;p}') == $lastOccurence ]];then
        (( count++ ))
        (( lastOccurence++ ))
    else
        (( lastOccurence-- ))
        break
    fi
done < "$2"
# only strip the overlap from fileB if the matching block ran to the very end of fileA
if [[ $(wc -l "$1"|awk '{print $1}') == $lastOccurence ]];then
    sed -e "1,$count d" "$2" >> "$1"
else
    cat "$2" >> "$1"
fi
The first file will be altered. I tested it with some combinations of the above-mentioned sample files. It seems to work and merges only the contiguous parts at the head/tail of the processed files.
@OP: some feedback would be helpful ...
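For the toy files from the first post, a run would look something like this (assuming the script is saved as script.sh and made executable; remember that file1.txt is modified in place):
Code:
$ ./script.sh file1.txt file2.txt
$ cat file1.txt
mercury
venus
earth
mars
jupiter
saturn
uranus
neptune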
Last edited by crts; 09-06-2010 at 09:31 PM.
Reason: typo
09-05-2010, 03:42 PM | #12
makyo
Member
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735
Hi.
This awk script is not correct; see a later post for the corrected version.
This looks like a desire for a union. No such native command exists that I know of, but awk could be used to perform a uniq without a sort (i.e. to preserve the order).
So for the files data1, data2 (side-by-side):
Code:
mercury     mars
venus       jupiter
earth       saturn
mars        uranus
jupiter     neptune
saturn
an awk script:
Code:
awk '
BEGIN { len = 1 }
{
  if ( $0 in a ) {
    next
  } else {
    a[len++] = $0
  }
}
END { for ( i = 1; i <= len; i++ ) print a[i] }
' data1 data2
produces:
Code:
mercury
venus
earth
mars
jupiter
saturn
mars
jupiter
saturn
uranus
neptune
The condition being that the unique list must fit into memory.
Best wishes ... cheers, makyo
Last edited by makyo; 09-05-2010 at 05:31 PM.
09-05-2010, 03:53 PM | #13
crts
Senior Member
Registered: Jan 2010
Posts: 2,020
Hi,
this produces the same result as a plain cat file1 file2. This is not what is desired.
09-05-2010, 05:29 PM | #14
makyo
Member
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735
Does not address the problem at hand. However, for a far better version of this awk code, see post #33.
Hi, crts.
Thanks for noticing the blunder. Here's a better awk-uniq:
Code:
awk '
BEGIN { a = "" ; len = 1 }
{ if ( match(a,$0) ) {
# print "already in a: " $0
next
} else {
# print "adding " $0 " to " a
if ( length(a) == 0 ) {
a = $0
} else {
a = a ";" $0
len++
}
}
}
END {
split(a,b,";")
for ( i = 1; i <= len ; i++ ) { print b[i] }
}
' data1 data2
producing:
Code:
mercury
venus
earth
mars
jupiter
saturn
uranus
neptune
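As an aside, the usual compact idiom for an order-preserving uniq in awk is an associative-array test, which should give the same eight-line output for these inputs; like the script above, though, it drops every repeat, not just an overlapping block:
Code:
awk '!seen[$0]++' data1 data2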
Best wishes ... cheers, makyo
Last edited by makyo; 09-06-2010 at 09:42 AM.
09-05-2010, 05:42 PM | #15
bonzer21
LQ Newbie
Registered: Sep 2010
Location: Midlands, United Kingdom
Posts: 6
Original Poster
Thanks for all your efforts thus far. I think David's onto it, but I will use a different example that's closer to the real-world problem. Imagine a fictional access log of a web server without timestamps:
Code:
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
Log interpretation: two visitors; I'll call them Dawkins and Sagan.
- Dawkins visits my home page
- ...followed by Sagan
- Dawkins clicks through to my info page
- Sagan stays on my home page to view my source, refreshing a couple of times
- Dawkins returns to my home page
My situation is that I don't have the original log above; my aim is to recreate it from the excerpts that I do have, which are split across two files that may overlap slightly, as follows:
Code:
$ cat log1.txt
A 123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
B 123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
C 987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
$ cat log2.txt
987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
B 123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
C 987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
C 987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
A 123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
B 123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
Sure enough, if the log files contained timestamps then each line would be unique; I could simply cat the files and pipe to uniq, and I could even sort as an intermediate step if the timestamps were at the start of each line.
The log files I have to work with are already in chronological order, but without timestamps I can't just append a grep-filtered log2.txt to log1.txt, as that would lose all duplicated lines, which is sometimes undesirable (as in the case of a web server log). I've attempted to label the duplicates that should remain with corresponding letters.
diff's output is tantalisingly close to what I'm looking for, but after trying it on my example above it appears to truncate long lines:
Code:
$ diff -y log1.txt log2.txt
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macinto <
123.456.789.012 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/5 <
123.456.789.012 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozil <
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 <
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compati <
987.654.321.098 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/4 <
987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozil 987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozil
987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozil 123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozil
> 123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0
> 987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compati
> 987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compati
> 123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macinto
> 123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0
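(Presumably this is the 130-column default width of side-by-side mode; something like the following might widen it enough to show the full lines:)
Code:
$ diff -y -W 250 log1.txt log2.txt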
Last edited by bonzer21; 09-05-2010 at 05:43 PM.