Sorting & finding unique tracebacks in Python logs

kshabbir · 02-29-2012, 02:52 AM

Hi,

I want to extract the paragraphs that start with ^Traceback, sort them, and find the unique ones. The problem is that the blocks can end differently and their lengths also vary.

Any help would be of great use.

====================================================
Traceback (most recent call last):
File "/home/shabbir/apps/django/core/handlers/base.py", line 111, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/home/shabbir/apps/web/views/sbf_views.py", line 213, in search
ctxt = search_browser_filter(request)
File "/home/shabbir/apps/web/views/sbf_views.py", line 560, in search_browser_filter
score=True, sort=sort, sort_order=sort_order, operation='/spell', request=request, **params)
File "/home/shabbir/apps/utils/solrutils.py", line 103, in solr_search
response = s.query(q, fields, highlight, score, sort, sort_order, **params)
File "/home/shabbir/apps/solr/core.py", line 495, in query
request, self.form_headers)
File "/home/shabbir/apps/solr/core.py", line 746, in _post
return check_response_status(self.conn.getresponse())
File "/home/shabbir/apps/solr/core.py", line 994, in check_response_status
raise ex
SolrException: HTTP code=500, reason=Internal Server Error
Traceback (most recent call last):
File "/home/shabbir/apps/django/core/handlers/base.py", line 111, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/home/shabbir/apps/django/views/decorators/cache.py", line 79, in _wrapped_view_func
response = view_func(request, *args, **kwargs)
File "/home/shabbir/apps/payments/views.py", line 64, in process_payment_hdfc
return payment_status(request,payment_attempt)
File "/home/shabbir/apps/payments/views.py", line 548, in payment_status
order.update_inventory(request, action='add', delta=delta_oi)
File "/home/shabbir/apps/orders/models.py", line 2355, in update_inventory
raise exp
InventoryError
====================================================

makyo · 03-06-2012, 07:50 PM

Hi.

Welcome to the forum.

Best and quickest answers are provided when you post representative sample input and expected output between CODE and /CODE tags, the symbols being surrounded by [ ] -- see the guide in the signature below:

Code:

this is a code block

Telling us what you have tried so far helps too.

Best wishes ... cheers, makyo

kshabbir · 03-07-2012, 01:20 AM

So far I have had a little success with the code below:

Code:

awk '/^Traceback/{if(NR!=1){for(i=0;i<j;i++)print a[i]>("file" k);close("file" k);j=0;k++}a[j++]=$0;next}{a[j++]=$0}END{for(i=0;i<j;i++)print a[i]>("file" k)}' i=0 k=1  <filename>

The problem is that it creates one file for every paragraph it finds in the input. Can someone help me find all the unique paragraphs and sort them by number of occurrences?

The desired output would look like this:

Code:

=========================Below error appeared 1 time=======================
Traceback (most recent call last):
File "/home/shabbir/apps/django/core/handlers/base.py", line 111, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/home/shabbir/apps/web/views/sbf_views.py", line 213, in search
ctxt = search_browser_filter(request)
File "/home/shabbir/apps/web/views/sbf_views.py", line 560, in search_browser_filter
score=True, sort=sort, sort_order=sort_order, operation='/spell', request=request, **params)
File "/home/shabbir/apps/utils/solrutils.py", line 103, in solr_search
response = s.query(q, fields, highlight, score, sort, sort_order, **params)
File "/home/shabbir/apps/solr/core.py", line 495, in query
request, self.form_headers)
File "/home/shabbir/apps/solr/core.py", line 746, in _post
return check_response_status(self.conn.getresponse())
File "/home/shabbir/apps/solr/core.py", line 994, in check_response_status
raise ex
SolrException: HTTP code=500, reason=Internal Server Error
=========================Below error appeared 1 time=======================
Traceback (most recent call last):
File "/home/shabbir/apps/django/core/handlers/base.py", line 111, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/home/shabbir/apps/django/views/decorators/cache.py", line 79, in _wrapped_view_func
response = view_func(request, *args, **kwargs)
File "/home/shabbir/apps/payments/views.py", line 64, in process_payment_hdfc
return payment_status(request,payment_attempt)
File "/home/shabbir/apps/payments/views.py", line 548, in payment_status
order.update_inventory(request, action='add', delta=delta_oi)
File "/home/shabbir/apps/orders/models.py", line 2355, in update_inventory
raise exp
InventoryError

Thanks in advance,
Shabbir

makyo · 03-07-2012, 08:56 AM

Hi.

Thanks for reposting.

Comments:

1) I did not find your data sample to be representative, so I created a different one that seems more appropriate in that there are several instances of duplicate and unique blocks.

2) I'm glad you tried something, but I didn't understand what you would finally do if you were able to create all the files that your solution produced.

3) Among the *nix commands, uniq can eliminate duplicates. However, it works on single lines, whereas you have blocks of lines, and it only removes adjacent duplicates, so the input must be sorted first. That combination is common enough that sort can remove duplicates itself (sort -u).

4) So one approach is to turn each of your blocks into one long line, sort, and eliminate duplicates. That can be done with standard commands, as the example below shows. This may require a few extra files (some only for illustration), but it does not need much memory and is very general. The steps are:
a) create the long lines (done with awk here), using some character to take the place of the embedded newlines,
b) sort, eliminate duplicates (sort -u),
c) expand blocks to original separated lines (tr).

5) Another approach is to write code that uses the entire contents of each block as a key and counts its occurrences. The awk and perl languages both have associative arrays (hashes) that make that easy. However, that is memory-intensive, because every distinct block is held in memory.
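
As a rough sketch of approach 5), the awk fragment below uses the full text of each block as the key of an associative array and counts how often it occurs. The function name count_blocks and the header text are my own choices for illustration (the header is modelled on the requested sample output), and I am assuming that any lines before the first Traceback can be ignored. The order of "for (b in count)" is unspecified in awk, so this sketch counts but does not sort.

```shell
#!/usr/bin/env bash
# Sketch: count identical traceback blocks with an awk associative array.
# count_blocks is a hypothetical helper name, not something from this thread.
count_blocks() {
  awk '
    # Record the finished block under its full text, then start afresh.
    function save() { if (block != "") count[block]++; block = "" }

    # A header line closes the previous block and opens a new one.
    /^Traceback \(most recent call last\):/ { save(); block = $0; next }

    # Other lines extend the current block; lines seen before the
    # first header are skipped (an assumption of this sketch).
    block != "" { block = block "\n" $0 }

    END {
      save()
      # Iteration order is unspecified, so the blocks come out unsorted.
      for (b in count)
        printf "=== Below error appeared %d time(s) ===\n%s\n", count[b], b
    }
  ' "$@"
}
```

It can be called as "count_blocks logfile" or at the end of a pipe. Memory use grows with the number of distinct blocks, which is the cost mentioned in 5).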

The script below is long because it shows the context, the intermediate results, and finally compares the result to the expected output. Concentrate on the inner part that obtains the solution.

Code:

#!/usr/bin/env bash

# @(#) s1	Demonstrate obtaining unique instances of multi-line blocks of text.

# Section 1, setup, pre-solution, $Revision: 1.25 $.
# Infrastructure details, environment, debug commands for forum posts. 
# Uncomment export command to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin" HOME=""
set +o nounset
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
db() { : ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
C=$HOME/bin/context && [ -f $C ] && $C awk sort tr
set -o nounset
pe

FILE=${1-data1}
E="expected-output.txt"

# Display sample of data file, with head & tail as a last resort.
db " Section 1: display of input data."
pe " || start sample [ specimen first:middle:last ] $FILE"
specimen 10 $FILE $E  2>/dev/null
pe " || end"

# Section 2, solution.
db " Section 2: solution."
pl " Place all line in a block into a single long line:"
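awk '
BEGIN	{ block = "" }
# Looser alternative pattern: /^Traceback/
# A header line starts a new block: emit the previous block, if any.
/^Traceback \(most recent call last\):/	{
	if ( block != "" ) print block
	block = $0 ; next 
	}
# Any other line is appended to the current block; "@" stands in for
# the newline, so it must not occur in the data.
	{ block = block "@" $0 }
END	{ if ( block != "" ) print block }
' "$FILE" |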
tee t1

# Sort the file, remove duplicates.
pl " Sort the long lines, leaving only unique lines:"
sort -u t1 |
tee t2

# Expand the blocks into separate lines.
pl " Separate the long lines into individual lines:"
tr '@' '\n' < t2 |
tee f1

# Section 3, post-solution, check results, clean-up, etc.
v1=$(wc -l <expected-output.txt)
v2=$(wc -l < f1)
pl " Comparison of $v2 created lines with $v1 lines of desired results:"
db " Section 3: validate generated calculations with desired results."

pl " Comparison with desired results:"
if [ ! -f expected-output.txt -o ! -s expected-output.txt ]
then
  pe " Comparison file \"expected-output.txt\" zero-length or missing."
  exit
fi
if cmp expected-output.txt f1
then
  pe " Succeeded -- files have same content."
else
  pe " Failed -- files not identical -- detailed comparison follows."
  if diff -b expected-output.txt f1
  then
    pe " Succeeded by ignoring whitespace differences."
  fi
fi

exit 0

producing:

Code:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
awk GNU Awk 3.1.5
sort (GNU coreutils) 6.10
tr (GNU coreutils) 6.10

 db,  Section 1: display of input data.
 || start sample [ specimen first:middle:last ] data1
Whole: 10:0:10 of 20 lines in file "data1"
Traceback (most recent call last):
Luci
commediene
extraordinaire
Traceback (most recent call last):
Desi
Traceback (most recent call last):
Fred
grouch
Traceback (most recent call last):
Luci
commediene
extraordinaire
Traceback (most recent call last):
Fred
grouch
Traceback (most recent call last):
Ethel
Traceback (most recent call last):
Little Ricki

Whole: 10:0:10 of 13 lines in file "expected-output.txt"
Traceback (most recent call last):
Desi
Traceback (most recent call last):
Ethel
Traceback (most recent call last):
Fred
grouch
Traceback (most recent call last):
Little Ricki
Traceback (most recent call last):
Luci
commediene
extraordinaire
 || end
 db,  Section 2: solution.

-----
 Place all line in a block into a single long line:
Traceback (most recent call last):@Luci@commediene@extraordinaire
Traceback (most recent call last):@Desi
Traceback (most recent call last):@Fred@grouch
Traceback (most recent call last):@Luci@commediene@extraordinaire
Traceback (most recent call last):@Fred@grouch
Traceback (most recent call last):@Ethel
Traceback (most recent call last):@Little Ricki

-----
 Sort the long lines, leaving only unique lines:
Traceback (most recent call last):@Desi
Traceback (most recent call last):@Ethel
Traceback (most recent call last):@Fred@grouch
Traceback (most recent call last):@Little Ricki
Traceback (most recent call last):@Luci@commediene@extraordinaire

-----
 Separate the long lines into individual lines:
Traceback (most recent call last):
Desi
Traceback (most recent call last):
Ethel
Traceback (most recent call last):
Fred
grouch
Traceback (most recent call last):
Little Ricki
Traceback (most recent call last):
Luci
commediene
extraordinaire

-----
 Comparison of 13 created lines with 13 lines of desired results:
 db,  Section 3: validate generated calculations with desired results.

-----
 Comparison with desired results:
 Succeeded -- files have same content.

Adapt as you need to for your data. See man pages for details.
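
One possible adaptation, since the request was also to order the blocks by number of occurrences: replace sort -u with sort, uniq -c, and a numeric reverse sort on the count. This is only a sketch under the same assumptions as the script above ("@" does not occur in the log). The function name rank_blocks and the header text are hypothetical, chosen to resemble the requested sample output, and lines before the first Traceback are skipped.

```shell
#!/usr/bin/env bash
# Sketch: rank traceback blocks by how often they occur.
# rank_blocks is a hypothetical helper name; "@" is assumed absent from the log.
rank_blocks() {
  # Step a) join each block into one long line, "@" replacing newlines.
  awk '
    /^Traceback \(most recent call last\):/ {
      if (block != "") print block
      block = $0; next
    }
    block != "" { block = block "@" $0 }
    END { if (block != "") print block }
  ' "$@" |
  # Step b) sort so duplicates are adjacent, count them, most frequent first.
  sort |
  uniq -c |
  sort -k1,1nr |
  # Step c) turn each counted long line back into a header plus its block.
  awk '{
    n = $1
    sub(/^ *[0-9]+ /, "")   # drop the count that uniq -c added
    gsub(/@/, "\n")         # restore the original line breaks
    printf "=== Below error appeared %d time(s) ===\n%s\n", n, $0
  }'
}
```

Called as "rank_blocks data1", this prints each distinct block once, preceded by a line giving its count, with the most frequent block first.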

Best wishes ... cheers, makyo

kshabbir · 03-08-2012, 02:05 AM

Makyo,

Thanks for the detailed explanation.

I will try your solution and get back to you.

Thanks again.

Regards,
Shabbir