LinuxQuestions.org
Old 09-05-2010, 05:45 PM   #16
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928

And a highly inelegant version using diff & awk ;}
Code:
diff -y file1.txt file2.txt|awk '/</{print $1} !/[<>]/{print $1} />/{print $2}'
mercury
venus
earth
mars
jupiter
saturn
uranus
neptune

To work around diff's shortcoming (side-by-side output
is cut off at 130 print columns by default) you may be
able to work with a '-W 200'

Cheers,
Tink

Last edited by Tinkster; 09-05-2010 at 05:58 PM. Reason: grandmar
 
Old 09-05-2010, 05:53 PM   #17
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Hi makyo,

this still won't do. I also had some trouble understanding exactly what the OP wants. The panorama description is indeed helpful. Try running your command with the sample data from post #11. If you merge file1 and file2 the output should be
Code:
mercury
uranus
venus
jupiter
earth
mars
jupiter
saturn
uranus
neptune
mars
Think of jupiter and saturn as the seam.
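To make the seam idea concrete, here is a minimal sketch in plain sh. It is an illustration only, not the script from post #11, and the function name merge_overlap is made up. It looks for the longest run of lines that both ends the first file and starts the second, then appends only what follows that run.

```shell
# merge_overlap A B: print A, then the part of B that comes after the seam.
# Hypothetical helper for illustration; assumes both files end with a newline.
merge_overlap() {
    a=$1
    b=$2
    # $(( ... )) normalises wc output that may carry leading spaces.
    na=$(($(wc -l < "$a")))
    nb=$(($(wc -l < "$b")))

    # The seam cannot be longer than the shorter file.
    k=$na
    if [ "$nb" -lt "$k" ]; then
        k=$nb
    fi

    # Shrink k until the last k lines of A equal the first k lines of B.
    # Note: $(...) strips trailing newlines, so trailing blank lines
    # are not compared exactly.
    while [ "$k" -gt 0 ]; do
        if [ "$(tail -n "$k" "$a")" = "$(head -n "$k" "$b")" ]; then
            break
        fi
        k=$((k - 1))
    done

    cat "$a"
    tail -n +"$((k + 1))" "$b"
}
```

Called as merge_overlap file1 file2, it writes the merged result to stdout. When no seam is found (k ends at 0) the two files are simply concatenated.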

Last edited by crts; 09-05-2010 at 06:34 PM.
 
Old 09-05-2010, 06:13 PM   #18
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
@bonzer21: What is wrong with the script in post #11? I tested it with the sample data and it seems to perform OK. In order to work with the new data I fixed a quoting issue.
 
Old 09-05-2010, 06:17 PM   #19
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Quote:
Originally Posted by Tinkster
And a highly inelegant version using diff & awk ...
If you try to merge file3 and file2 from post #11 this should have the same result as a 'cat'. But this is not the case with diff/awk.
 
Old 09-05-2010, 06:22 PM   #20
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi.
Code:
       -W NUM  --width=NUM
              Output at most NUM (default 130) print columns.
-- excerpt from man diff, q.v.
If that does help you solve the problem, then it would be useful if you were to post the complete output you are expecting from the merging of the 2 sample log files ... cheers, makyo
 
Old 09-05-2010, 06:39 PM   #21
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
@bonzer21:
One issue that I noticed with my script is that you have to make sure your logfiles don't have any leading or trailing empty lines. Otherwise they will just be concatenated. So you might have to preprocess them. It is probably best to make sure that there are no blank lines at all.
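One possible way to do that preprocessing is sketched below. This is an assumption about how it could be done, not part of the script in post #11, and the helper name strip_blank is made up.

```shell
# strip_blank FILE: print FILE without empty or whitespace-only lines.
# Hypothetical helper for illustration.
strip_blank() {
    sed '/^[[:space:]]*$/d' "$1"
}
```

For example, strip_blank log2.txt > log2.clean.txt writes a cleaned copy and leaves the original capture untouched.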
 
Old 09-05-2010, 07:16 PM   #22
bonzer21
LQ Newbie
 
Registered: Sep 2010
Location: Midlands, United Kingdom
Posts: 6

Original Poster
Rep: Reputation: 0
Your script works exactly as required, crts - it reconstructs the original log file. Thank you.

Although minor, I know from past experience that wc's output is less awkward (gives no filename) when its input is stdin, avoiding the need for awk altogether.
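A small sketch of the wc behaviour described above, using a throwaway temporary file (created here purely for illustration):

```shell
# Hypothetical demo file; any text file will do.
tmp=$(mktemp)
printf 'one\ntwo\nthree\n' > "$tmp"

# With a filename argument, wc prints the count followed by the name.
wc -l "$tmp"

# Reading from stdin, wc prints only the count,
# so there is no filename to strip off with awk.
count=$(wc -l < "$tmp")
echo "$count"

rm -f "$tmp"
```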

Here's the whole story including crts's script, the input files and the final output:

Code:
$ cat script.sh 
#!/bin/bash
# invoke as ./script.sh fileA fileB

count=0
lastOccurence=$(grep -n "$(head -n 1 ${2})" "$1" | sed -nr '$ {s/^([0-9]*):.*/\1/;p}')
while read line
do
	if [[ $(grep -n "$line" "$1" | sed -nr '$ {s/^([0-9]*):.*/\1/;p}') ==  $lastOccurence ]]; then
		(( count++ ))
		(( lastOccurence++ ))
	else
		(( lastOccurence-- ))
		break
	fi
done < "$2"

if [[ $(wc -l < "$1") == $lastOccurence ]]; then
	sed -e "1,$count d" "$2" >> "$1"
else
	cat "$2" >> "$1"
fi

$ cat log1.txt
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"

$ cat log2.txt
987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"

$ ./script.sh log1.txt log2.txt

$ cat log1.txt
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
Tinkster - your diff/awk worked fine until it saw a space! I'd adjust it myself but I'm not that well versed in awk.
 
Old 09-05-2010, 07:25 PM   #23
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Glad it solved your problem. Just keep in mind the blank lines issue I mentioned earlier.
And don't forget to mark the thread as solved.

P.S.:
You might want to take another look at post #11. I changed the loop to
while read -r line

This way backslashes won't be interpreted as escape characters.
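A quick way to see the difference, using a made-up sample line that contains backslashes:

```shell
# A sample line containing backslashes (made up for illustration).
sample='C:\temp\new.log'

# Without -r, read treats each backslash as an escape character and drops it.
plain=$(printf '%s\n' "$sample" | { read line; printf '%s' "$line"; })

# With -r, the line is stored exactly as written.
raw=$(printf '%s\n' "$sample" | { read -r line; printf '%s' "$line"; })

printf 'read:    %s\n' "$plain"   # read:    C:tempnew.log
printf 'read -r: %s\n' "$raw"     # read -r: C:\temp\new.log
```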

Last edited by crts; 09-05-2010 at 07:28 PM.
 
Old 09-05-2010, 09:06 PM   #24
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Originally Posted by bonzer21
Tinkster - your diff/awk worked fine until it saw a space! I'd adjust it myself but I'm not that well versed in awk.
Heh. Fair enough. I wouldn't have gone down that alley
if I had read through the entire thread first, and seen
your actual log data rather than planet names ;}


Here's a version with 'diff - only'
Code:
$ diff --old-line-format='%L' --new-line-format='%L' --unchanged-line-format='%L'  -W 200  log1.txt log2.txt
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /favicon.ico" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /images/head.jpg" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /style.css" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"

I had no idea diff could do this until today - reading
the man-page in a desperate attempt to find a separator
I could output for use w/ awk (so the spaces weren't an
issue for awk's defaults).
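To see how these line-format options classify each line, here is a small sketch with two made-up files. It assumes GNU diff, which is where these long options come from. Each format gets a label in front of %L so the three classes are visible in the output.

```shell
# Two tiny sample files (names and contents are made up for illustration).
tmp=$(mktemp -d)
printf '%s\n' alpha beta gamma > "$tmp/a.txt"
printf '%s\n' beta gamma delta > "$tmp/b.txt"

# %L stands for the contents of the line, including its newline.
# diff exits with status 1 when the files differ, hence the "|| true".
labelled=$(diff --old-line-format='old:  %L' \
                --new-line-format='new:  %L' \
                --unchanged-line-format='same: %L' \
                "$tmp/a.txt" "$tmp/b.txt" || true)
printf '%s\n' "$labelled"
# old:  alpha
# same: beta
# same: gamma
# new:  delta

rm -rf "$tmp"
```

With a bare '%L' for all three formats, as in the command above, the labels disappear and what remains is every line of both files with the common lines printed once.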


Cheers,
Tink
 
Old 09-05-2010, 09:44 PM   #25
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Quote:
Originally Posted by Tinkster
Here's a version with 'diff - only' ...
Hi,

I hate to nag but what about this situation:
file1
Code:
mars
jupiter
saturn
neptune
file2
Code:
jupiter
saturn
uranus
deimos
If we go with the panorama picture then these two files should result in
Code:
mars
jupiter
saturn
neptune
jupiter
saturn
uranus
deimos
The tail of file1 does not match the head of file2. The diff command ignores that and puts them together as
Code:
mars
jupiter
saturn
neptune
uranus
deimos
I don't know if this is a realistic scenario with the actual data. But it does address a conceptual matter.
 
Old 09-05-2010, 10:21 PM   #26
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Originally Posted by crts
Hi,

I hate to nag but what about this situation:
file1
Code:
mars
jupiter
saturn
neptune
file2
Code:
jupiter
saturn
uranus
deimos
If we go with the panorama picture then these two files should result in
Code:
mars
jupiter
saturn
neptune
jupiter
saturn
uranus
deimos
The tail of file1 does not match the head of file2. The diff command ignores that and puts them together as
Code:
mars
jupiter
saturn
neptune
uranus
deimos
I don't know if this is a realistic scenario with the actual data. But it does address a conceptual matter.

Fair enough, too. Did bonzer21 specify what he'd like to
happen in that case?



Cheers,
Tink

Last edited by Tinkster; 09-05-2010 at 10:22 PM.
 
Old 09-05-2010, 10:29 PM   #27
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Quote:
Originally Posted by Tinkster
Fair enough, too. Did bonzer21 specify what he'd like to
happen in that case?



Cheers,
Tink
Nope, he did not. We also do not know how many log files are to be merged and whether he made sure that every log's tail matches the next one's head. If he did, then I'd say that the diff solution is definitely the elegant way to go.
 
Old 09-05-2010, 11:09 PM   #28
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
True that ... another "stuff-up" possibility that wasn't
discussed is what happens if user A has a habit of revisiting
pages in the same sequence, over and over again, and that
gets split into several files ... ;}


Logs w/o timing info seem quite pointless, really.


Cheers,
Tink
 
Old 09-06-2010, 06:43 AM   #29
bonzer21
LQ Newbie
 
Registered: Sep 2010
Location: Midlands, United Kingdom
Posts: 6

Original Poster
Rep: Reputation: 0
Rest assured, if I'd set this system up it would be timestamped to the millisecond. As it happens, I am applying approximate times to the logs I'm joining. They're not perfect, but they'll give an idea of relative times between each request.

I'll bring the example home a little more - the system I'm working with is embedded and proprietary. I can't view the whole log (I doubt the device even keeps the whole log), but I can view the last ~70 entries of it at any given time. It's as if the log were scrolling up like the credits of a movie but with varying speed, depending on how busy the web server is, and once an entry has scrolled off the top it's gone forever. In an effort to produce a much more useful, browsable log, I am dumping each "screen" of data every few seconds to text files. To avoid missing any entries, I'm capturing at a rate that is faster than the log is ever likely to scroll, which means that the text files I'm creating often heavily overlap.

As Tinkster pointed out, there will be times when there are duplicate entries in the log. If this happens to be the end of one capture and the start of another, it will be impossible for me to determine which are part of the overlap and which are new entries. Splitting my log example in a different place demonstrates this problem:

First capture:
Code:
123.456.789.012 - "GET /info.html" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /images/logo.gif" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
Second capture:
Code:
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
123.456.789.012 - "GET /" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
123.456.789.012 - "GET /style.css" HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-gb)"
There's no way to tell how many times the entry 987.654.321.098 - "GET /" HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" would have appeared in the original log - it could be two, three, or four times. In such a case, I am forced to make an assumption. I'll opt for the lesser value.
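The ambiguity can be made visible with a small sh sketch. It is an illustration only, and the function name overlap_candidates is made up. It lists every overlap length k for which the last k lines of the first capture equal the first k lines of the second.

```shell
# overlap_candidates A B: print every k for which the last k lines of A
# equal the first k lines of B. k=0 (plain concatenation) is always possible.
# Hypothetical helper for illustration; assumes no blank lines in the captures.
overlap_candidates() {
    a=$1
    b=$2
    # $(( ... )) normalises wc output that may carry leading spaces.
    na=$(($(wc -l < "$a")))
    nb=$(($(wc -l < "$b")))

    # An overlap cannot be longer than the shorter capture.
    max=$na
    if [ "$nb" -lt "$max" ]; then
        max=$nb
    fi

    k=0
    while [ "$k" -le "$max" ]; do
        if [ "$k" -eq 0 ] || [ "$(tail -n "$k" "$a")" = "$(head -n "$k" "$b")" ]; then
            echo "$k"
        fi
        k=$((k + 1))
    done
}
```

For the two captures above, saved to two files, this prints 0, 1 and 2, which correspond to four, three or two occurrences of the repeated entry. Taking the largest k gives the lesser value.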

crts - a very valid point. I was impressed to see that diff can do the job all by itself in this example, but I have been mindful that it is more of a comparison tool and you correctly show where it would slip up.
 
Old 09-06-2010, 07:22 AM   #30
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
I have come a bit late to this party but would like to submit an option that seems to work with the data from post #22 (i.e. I get the same output):
Code:
awk '!f{getline line < "file2";f=1}$0 == line{f=0}1;END{while((getline < "file2") > 0)print}' file1
 
  

