LinuxQuestions.org
Old 12-15-2009, 03:43 AM   #1
txalin
LQ Newbie
 
Registered: Dec 2009
Posts: 2

Rep: Reputation: 0
Opening big log files for sorting.


Hi all,

I'm a newbie on these forums, so first of all I want to introduce myself. I'm a 29-year-old Spanish UNIX/Linux administrator with 7 years of experience administering UNIX machines and security devices.

My (latest) problem is related to a script that I'm trying to put into production, but I'm hitting a serious blocker. The goal of the script is to read a big Oracle XML log, sort it by a date field, and write the result to another file. The Oracle log is about 36 GB per day, and the sort is done on the fly.

The sort itself seems to be working fine, but I've run into one problem: my process, written in bash, starts at 12 PM, and sometimes it stops sorting at around 12 AM, more or less. The process is still active (ps -ef | grep sort), but the destination file doesn't receive anything, so I think that maybe, for an unknown reason, my process closes the original Oracle log after about 12 hours.

The way I open this log is with file descriptors, like this:

exec 0< logfile.log
while [ "$start" -eq 1 ]
do
    read -r line
    # ... process "$line" here ...
done

I think that maybe file descriptors have a low priority in my kernel and mine gets closed after a while, but I don't know how to verify this or which kernel parameter to modify. Does anyone know how to do this, or can you suggest a better way of keeping files open for reading?
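I guess I could check which files the process still holds open with something like the following (the script name is just an example of how I would find the PID), but I'm not sure this is the right way to verify it:

pid=$(pgrep -f sortlog.sh)            # "sortlog.sh" is a placeholder for my script's name
ls -l /proc/"$pid"/fd                 # every descriptor the process currently holds open
lsof -p "$pid" | grep logfile.log     # alternative check, if lsof is installed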

Thanks in advance and regards.
 
Old 12-15-2009, 04:01 AM   #2
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
If you are going to work with big files, use tools like awk to parse them, not bash's while loop. Also, since you have such a big file, is it not possible to use a database in your environment?
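For example, assuming each record is on its own line and the date sits in a <timestamp> tag (the tag name and output file are only guesses, adjust them to your real format), you can let awk pull out the sort key and let sort and cut do the rest instead of a bash read loop:

awk -F'</?timestamp>' '{ print $2 "\t" $0 }' logfile.log \
    | sort -t$'\t' -k1,1 \
    | cut -f2- > sorted.log

awk prepends the date as a key, sort orders on that key only, and cut strips the key off again.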
 
Old 12-15-2009, 04:09 AM   #3
AleLinuxBSD
Member
 
Registered: May 2006
Location: Italy
Distribution: Ubuntu, ArchLinux, Debian, SL, OpenBSD
Posts: 274

Rep: Reputation: 42
Perhaps a safer way would be to write a simple program in a programming language instead of using a bash script.
That way you have better control: you can specify a buffer that is flushed to disk when it fills up, so the impact on your system's memory stays low.
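Just to illustrate the idea in shell terms (a real program in C or Perl would do the same with far less overhead, and the 10,000-line threshold is only an example), the pattern is roughly:

buf=()
while IFS= read -r line
do
    buf+=("$line")
    if [ "${#buf[@]}" -ge 10000 ]; then
        printf '%s\n' "${buf[@]}" >> out.log    # flush the full buffer to disk
        buf=()
    fi
done < logfile.log
[ "${#buf[@]}" -gt 0 ] && printf '%s\n' "${buf[@]}" >> out.log    # flush whatever is left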

How many resources are available on your system after 24 hours of running that script?
(You can run top at the start and at the "end", or use more sophisticated tools to analyze the situation on your server.)
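For example, something simple like this, left running in the background, records a snapshot every 10 minutes (the file name and interval are only examples):

while true
do
    date >> usage.log
    top -b -n 1 | head -n 20 >> usage.log   # snapshot of load, memory and the busiest processes
    sleep 600                               # every 10 minutes
done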

Quote:
Originally Posted by ghostdog74 View Post
is it not possible to use a database in your environment?
Nice idea.
Perhaps this could help the OP:
Oracle log files : An introduction

Last edited by AleLinuxBSD; 12-15-2009 at 04:20 AM.
 
Old 12-15-2009, 04:33 AM   #4
txalin
LQ Newbie
 
Registered: Dec 2009
Posts: 2

Original Poster
Rep: Reputation: 0
Unfortunately, I cannot use a database on my system.

Regarding available resources, it takes about 20% of one CPU (8 cores are available on this machine), so it doesn't have a big impact on my system (in fact, 90% is idle right now).

Regarding writing a program in another language, which one do you prefer for this task? I could do it in C or Perl, but I don't know which one is better for this.

And thanks for your help
 
Old 12-15-2009, 04:40 AM   #5
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by txalin View Post
Regarding writing a program in another language, which one do you prefer for this task? I could do it in C or Perl, but I don't know which one is better for this.
And thanks for your help
If you only know C and Perl, go for Perl for the speed of development, plus the advantage of being able to use Perl modules (including those for Oracle, and especially for XML parsing). You can also extend Perl with C if you need raw speed. Otherwise, you can code it in pure C (for the speed factor), but you have to be prepared to spend some time on that.
 
Old 12-15-2009, 05:01 AM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,140

Rep: Reputation: 4123
Maybe you should consider a merge sort. Dynamically split the input into (disk) files and merge sort them. Should be less memory intensive.
File::Sort on CPAN looks like one candidate.
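Roughly the same idea with plain coreutils, in case Perl isn't settled yet (the '|' delimiter and the date in field 3 are assumptions about the log format; GNU sort will also do an external merge sort on its own if you give it a temp dir with enough space via -T):

split -l 1000000 logfile.log chunk_                        # cut the 36 GB log into ~1M-line pieces on disk
for f in chunk_*; do
    sort -t'|' -k3,3 "$f" > "$f.sorted" && rm "$f"         # sort each piece on the date field
done
sort -m -t'|' -k3,3 chunk_*.sorted > sorted.log            # merge the already-sorted pieces
rm chunk_*.sorted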
 
  

