Extracting a specific line from an ASCII file

heyyou · 03-17-2005, 07:34 PM

I have a (large) file that contains a specific string on the nth line. I need a fast command line or script that will output the _previous line_, i.e. line number n - 1.

It can be awk, Perl, ksh, ... as long as it can run under Solaris from the command line and it is very fast (and simple).

As an example, say the file looks like this:

line 1
line 2
line 3
....
this is the line I want to extract
abcXXXXXdef
....
line 100
line 101

When I execute:

yourscript XXXXX

the output is:

this is the line I want to extract

As a bonus question, it would be even better if the script could return specific strings from the previous line! The previous line's format looks like this:
[1 0 0 1 111.11 222.22] 0 0 333.33 444.44

yourscript XXXXX
would ideally return
111.11 222.22 333.33 444.44

Note that the 999.99 format can vary: it is any decimal number with any number of decimal digits.

Thanks

AltF4 · 03-17-2005, 08:10 PM

Code:

#!/usr/bin/perl -w

use strict;

my $LINE;               # Line buffer
my $LINENUM = 4;        # Line number to match

#
my $NARGS = $#ARGV + 1;
if ( $NARGS != 1 ) {
        print "USE: cooltool.pl filename\n";
        exit(1);
}
my $FILENAME = $ARGV[0];
open ( F , "<$FILENAME" ) or die "error opening $FILENAME\n";


while ( $LINE = <F> ) {
        if ( $. == $LINENUM ) {
                # substitute
                # "[1 0 0 1 111.11 222.22] 0 0 333.33 444.44"
                # by "111.11 222.22 333.33 444.44"
                $LINE =~ s/^\[\d+ \d+ \d+ \d+ (\d+\.?\d* \d+\.?\d*)\] \d+ \d+ (\d+\.?\d* \d+\.?\d*)/$1 $2/;
                print $LINE;
                last;
        }
}

close(F);
exit(0);

perfect_circle · 03-17-2005, 09:22 PM

Something like this will output the line preceding every line that matches a pattern (XXXXX):

Code:

#!/bin/bash
#usage: script <pattern> <filename>
for i in `grep -n -e "$1" "$2" | cut -d':' -f1`; do
    head -n $(($i-1)) "$2" | tail -n 1
done

dustu76 · 03-18-2005, 05:50 AM

This may be slightly OT. I simply tried extracting a line given the line number from a largish file (on Solaris). My program is:

Code:

#!/usr/bin/ksh

echo "Enter file name : \c"; read fname
lc=$(wc -l $fname |awk '{print $1}')
ml=$(($lc/2))
echo "Line count : $lc"
echo "Middle line : $ml"

echo "--------headtail---------"
time head -n ${ml} $fname |tail -1

echo "--------sed--------------"
time sed -n -e "${ml},${ml}p" $fname

echo "--------nl--------------"
time nl -ba -nln -s+ $fname |grep "^${ml}" |cut -d"+" -f2

echo "--------awk------------"
time nawk -v ml=$ml 'NR==ml {print}' $fname

The head/tail approach by perfect_circle was generally (8-10 runs) SLOWEST when the length of each line is short:

Code:

SF1B : /supmis/ora/11mar05 > sd
Enter file name : bbbb
Line count : 390614
Middle line : 195307
--------headtail---------
000401557901

real    0m1.66s
user    0m0.44s
sys     0m2.74s
--------sed--------------
000401557901

real    0m0.35s
user    0m0.15s
sys     0m0.20s
--------nl--------------
000401557901

real    0m0.93s
user    0m1.07s
sys     0m0.27s
--------awk------------
000401557901

real    0m0.87s
user    0m0.81s
sys     0m0.05s
SF1B : /supmis/ora/11mar05 >

BUT, the same approach was generally FASTEST when the lines were much longer:

Code:

SF1B : /supmis/ora/11mar05 > sd
Enter file name : 0004newpl.dat
Line count : 390614
Middle line : 195307
--------headtail---------
0004|504807296|000401557901|Y|N|08050|GAA05|01|735|RTL-INDIVIDUAL|GRP AC WITH AVGBAL= 5000|C|no|1. Up to Rs 1 lac|N|0|G|More Than 3 Months|27-JUL-2004|11-MAR-2005|VINOD VASANT PATIL|R1|N|INR|N|1|SBA|SBKIT|15-NOV-2004

real    0m2.85s
user    0m0.95s
sys     0m3.89s
--------sed--------------
0004|504807296|000401557901|Y|N|08050|GAA05|01|735|RTL-INDIVIDUAL|GRP AC WITH AVGBAL= 5000|C|no|1. Up to Rs 1 lac|N|0|G|More Than 3 Months|27-JUL-2004|11-MAR-2005|VINOD VASANT PATIL|R1|N|INR|N|1|SBA|SBKIT|15-NOV-2004

real    0m4.74s
user    0m2.24s
sys     0m2.47s
--------nl--------------
0004|504807296|000401557901|Y|N|08050|GAA05|01|735|RTL-INDIVIDUAL|GRP AC WITH AVGBAL= 5000|C|no|1. Up to Rs 1 lac|N|0|G|More Than 3 Months|27-JUL-2004|11-MAR-2005|VINOD VASANT PATIL|R1|N|INR|N|1|SBA|SBKIT|15-NOV-2004

real    0m2.98s
user    0m3.25s
sys     0m1.92s
--------awk------------
0004|504807296|000401557901|Y|N|08050|GAA05|01|735|RTL-INDIVIDUAL|GRP AC WITH AVGBAL= 5000|C|no|1. Up to Rs 1 lac|N|0|G|More Than 3 Months|27-JUL-2004|11-MAR-2005|VINOD VASANT PATIL|R1|N|INR|N|1|SBA|SBKIT|15-NOV-2004

real    0m3.22s
user    0m2.51s
sys     0m0.71s
SF1B : /supmis/ora/11mar05 >

Maybe there is nothing intriguing here & I'm just being picky (but if there is - I would like to know the reason)....
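
One difference between the commands timed above may be worth ruling out: head stops reading once it has emitted line ml, whereas the sed and nawk commands as written print line ml and then keep scanning to the end of the file. Whether that accounts for the timings shown is not established here. A minimal sketch of early-exit variants follows; the function names line_sed and line_awk are made up for illustration, and plain awk is assumed to accept -v (on Solaris that would be nawk, as in the script above).

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the thread).
# Each prints line $1 of file $2 and stops reading right after it.

line_sed() {
    # Two commands on the same address: print line N, then quit,
    # so sed does not continue to end of file.
    sed -n -e "${1}p" -e "${1}q" "$2"
}

line_awk() {
    # exit immediately after printing the requested record.
    # Assumption: this awk supports -v (use nawk on Solaris).
    awk -v n="$1" 'NR == n { print; exit }' "$2"
}
```

Usage would be along the lines of: time line_sed "$ml" "$fname" and time line_awk "$ml" "$fname", so the comparison against head/tail covers the same amount of input.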

heyyou · 03-18-2005, 10:27 AM

perfect_circle's elegant solution makes two passes over the file, doesn't it? First a grep, then a head.

Would a Perl script (a language I do not know) that makes only one pass be better? i.e. with pseudocode along these lines:

previous_line = blank
do while pattern not found and not EOF:
read new line
if new line matches *pattern_we_are_looking_for* then {output previous_line 4 parameters, then exit loop}
previous_line = current_line
end loop
exit
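
The pseudocode above can be sketched as a single-pass awk filter wrapped in shell functions. This is only an illustration under stated assumptions: the names prev_line and prev_fields are invented, the pattern is treated as an awk regular expression, the search stops at the first match, and the previous line is assumed to have exactly the layout shown in the first post, so that the four numbers are whitespace-separated fields 5, 6, 9 and 10. On Solaris, nawk would likely replace awk, since -v is used.

```shell
#!/bin/sh
# Hypothetical helpers; names and argument order are illustrative.
# prev_line   PATTERN FILE   prints the line before the first match
# prev_fields PATTERN FILE   prints the four decimal numbers from that line

prev_line() {
    # prev always holds the previous record; on the first match,
    # print it (unless the match is on line 1) and stop reading.
    awk -v pat="$1" '
        $0 ~ pat { if (NR > 1) print prev; exit }
        { prev = $0 }
    ' "$2"
}

prev_fields() {
    # Assumes the layout "[1 0 0 1 111.11 222.22] 0 0 333.33 444.44":
    # field 6 carries the closing bracket, which is stripped first.
    prev_line "$1" "$2" |
        awk '{ sub(/\]$/, "", $6); print $5, $6, $9, $10 }'
}
```

Called as prev_fields XXXXX file, this reads the file only up to the matching line. Because it selects fields by position rather than by number format, it does not care how many decimal digits each value has.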

perfect_circle · 03-19-2005, 07:20 AM

If efficiency and speed are what you need, you should write this in C. It's really simple.

AltF4 · 03-21-2005, 02:48 AM

Code:

#!/usr/bin/perl -w

# print line before pattern

use strict;

my $LINE;               # Line buffer
my $PATTERN = "^abc.*def";      # what to find

my $NARGS = $#ARGV + 1;
if ( $NARGS != 1 ) {
        print "USE: cooltool.pl filename\n";
        exit(1);
}
my $FILENAME = $ARGV[0];
open ( F , "<$FILENAME" ) or die "error opening $FILENAME\n";


my $LASTLINE;
while ( $LINE = <F> ) {
        # skip a match on line 1: there is no previous line to print
        if ( $LINE =~ /$PATTERN/ && defined $LASTLINE ) {
                # substitute
                # "[1 0 0 1 111.11 222.22] 0 0 333.33 444.44"
                # by "111.11 222.22 333.33 444.44"
                $LASTLINE =~ s/^\[\d+ \d+ \d+ \d+ (\d+\.?\d* \d+\.?\d*)\] \d+ \d+ (\d+\.?\d* \d+\.?\d*)/$1 $2/;
                print $LASTLINE;
                #last; # uncomment if you ONLY need to find the 1st occurrence
        }
        $LASTLINE = $LINE;
}

close(F);
exit(0);

jim mcnamara · 03-21-2005, 04:19 PM

Code:

head -linenumber | tail -1

For a really short cmd line