LinuxQuestions.org


suse_nerd 04-21-2010 05:28 AM

Need more efficient script PDF417 parsing
 
I wrote this

Code:

cat $1  | sed -e 's/[ ][ ]*/,/g' | sed 's/,/,\n/g' | sed -e '$s/.....$//' - >> processing
cat processing |  awk 'NR==1' | sed 's/^........//' >> processed
cat processing |  awk 'NR==2,NR==100' |  sed 's/^....//' >> processed
mv -i processed $1.csv

to process a particularly formatted PDF417 file.

It is formatted like this
$CENTAUR<code_length_8><batch_length_20><expiry_length_8><quantity_length_4>$

So a typical record looks like this
<code_length_8><batch_length_20><expiry_length_8><quantity_length_4>
(note, no spaces between each record or field if the field is present, otherwise white space for the length specified).

A file will never have more than 22 records.

The files I have been processing so far only have the quantity and code fields, so everything else is whitespace, and my script doesn't take the other fields into account. I would like it to produce a CSV file like this:

$CENTAUR<LF>
<code_length_8>,<batch_length_20>,<expiry_length_8>,<quantity_length_4><LF>
<code_length_8>,<batch_length_20>,<expiry_length_8>,<quantity_length_4><LF>
$<LF>

<LF> = linefeed/new line
<> separate fields only to show the format and are not actually present in the barcode output
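
The layout above can be converted in one pass, without the intermediate `processing` and `processed` files. Here is a minimal sketch, assuming the whole scan arrives as a single line, the header is the literal `$CENTAUR`, the footer is a single `$`, and every record is exactly 40 characters (8 + 20 + 8 + 4). The function name `pdf417_to_csv` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads one scanned line on stdin, writes CSV to stdout.
# Assumes a "$CENTAUR" header, a "$" footer, and 40-character records (8+20+8+4).
pdf417_to_csv() {
    # Drop CR/LF so the whole scan is one awk record, then add back one newline.
    { tr -d '\r\n'; echo; } |
    awk '{
        sub(/^\$CENTAUR/, "")   # strip the header
        sub(/\$$/, "")          # strip the footer
        print "$CENTAUR"
        # Walk the body in 40-character steps; an incomplete tail is ignored.
        for (i = 1; i + 39 <= length($0); i += 40)
            printf "%s,%s,%s,%s\n",
                substr($0, i, 8), substr($0, i + 8, 20),
                substr($0, i + 28, 8), substr($0, i + 36, 4)
        print "$"
    }'
}
```

It could be called as `pdf417_to_csv < "$1" > "$1.csv"`. Padding spaces are kept inside the fields, so an absent batch shows up as 20 spaces between two commas.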

It is actually a file scanned directly from a barcode: the PDF417 barcode is scanned and the output is saved in Notepad++. Can it be scanned directly into the script instead?


example file - not sure how much formatting will be retained on the forum.
Code:

$CENTAUR30298309                            000130287018                            000130318905                            000130295355                            000130295344                            000130295333                            000130209138                            000130210705                            000130217293                            000130273352                            000130292823                            000130292834                            000130293065                            000130293076                            000130293087                            000130293000                            000130293010                            000130293021                            000130292415                            000130292426                            000130292947                            000130292958                            0001$
There is white space in the example as some of the fields are not present - only the quantity and code are present here

Sergei Steshenko 04-21-2010 05:35 AM

Quote:

Originally Posted by suse_nerd (Post 3942446)
I wrote this

Code:

cat $1 | sed -e 's/[ ][ ]*/,/g' | sed 's/,/,\n/g' | sed -e '$s/.....$//' - >> processing
cat processing | awk 'NR==1' | sed 's/^........//' >> processed
cat processing | awk 'NR==2,NR==100' | sed 's/^....//' >> processed
mv -i processed $1.csv

to process a particularly formatted PDF417 file.

It is formatted like this
$CENTAUR<code_length_8><batch_length_20><expiry_length_8><quantity_length_4><code_length_8>....001$
(note, no spaces between each record or field)

The files I have been processing so far only have the quantity and code fields, so everything else is whitespace and my script doesn't take the other fields into account. I would like it to make a csvfile like this:

$CENTAUR
<code_length_8>,<batch_length_20>,<expiry_length_8>,<quantity_length_4>
<code_length_8>,<batch_length_20>,<expiry_length_8>,<quantity_length_4>
....
001$


It is actually a file scanned directly from a barcode output - so the PDF417 barcode is scanned and the file is saved in notepad++ - can it be scanned directly into the script?


example file - not sure how much formatting will be retained on the forum.
Code:

$CENTAUR30298309                            000130287018                            000130318905                            000130295355                            000130295344                            000130295333                            000130209138                            000130210705                            000130217293                            000130273352                            000130292823                            000130292834                            000130293065                            000130293076                            000130293087                            000130293000                            000130293010                            000130293021                            000130292415                            000130292426                            000130292947                            000130292958                            0001$

There is a whole bunch of Perl PDF-related modules (including parsing): http://search.cpan.org/search?query=PDF&mode=all .

suse_nerd 04-21-2010 05:45 AM

Quote:

Originally Posted by Sergei Steshenko (Post 3942449)
There is a whole bunch of Perl PDF-related modules (including parsing): http://search.cpan.org/search?query=PDF&mode=all .

Thanks, but this is about PDF417 barcodes, rather than PDF files.

Sergei Steshenko 04-21-2010 06:30 AM

Quote:

Originally Posted by suse_nerd (Post 3942456)
Thanks, but this is about PDF417 barcodes, rather than PDF files.

Your example shows whitespaces, but you wrote: "(note, no spaces between each record or field)".

Anyway, are the fields in question of constant width? If so, in Perl constant-width fields can be extracted with a regular expression like this:

Code:

if($line =~ m/^(.{3})(.{5})(.{8})/)
  {
  # $1 holds the first 3 characters, $2 the following 5, $3 the following 8
  print "\$1=$1 \$2=$2 \$3=$3\n";
  }

Or Perl's 'substr' function can be used.

suse_nerd 04-21-2010 06:36 AM

Hello,

Yes, to clarify:

There is no whitespace if all fields are present.
If fields are missing, like in my example, there are whitespaces to the length of the missing fields.

So would something like this work (after stripping the header and footer)?
Code:

while(<STDIN>)
{
if($line =~ m/^(.{8})(.{20})(.{8})(.{4})/)
  {print "\$1=$1\,\$2=$2\,\$3=$3\,\$4=$4\n";}
}

Note the header is
$CENTAUR
and the footer is just
$
not 001$

grail 04-21-2010 07:10 AM

Based on your example:
Code:

$CENTAUR30298309                            000130287018 ...
Would you show the desired output?

suse_nerd 04-21-2010 07:18 AM

The output needs to be captured as a file, yes. Ideally should prompt for a filename.

If the input is coming directly from the barcode scanner, this functionality would be ideal.

Output needs to be in a CSV-style format, see my OP, I have made some changes/updates to it.

Sergei Steshenko 04-21-2010 07:26 AM

Quote:

Originally Posted by suse_nerd (Post 3942514)
So would something like this work (after stripping the header and footer)
Code:

while(<STDIN>)
{
if($line =~ m/^(.{8})(.{20})(.{8})(.{4})/)
  {print "\$1=$1\,\$2=$2\,\$3=$3\,\$4=$4\n";}
}


Just replace

Code:

while(<STDIN>)
with

Code:

while(defined(my $line = <STDIN>))
.

After you have all the fields you can check them for being whitespaces only.
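
One possible way to act on that check, sketched in shell to match the original script rather than in Perl: strip the padding spaces from every comma-separated field, so a field that was all whitespace becomes empty. The name `trim_fields` is hypothetical, and the sketch assumes the fields themselves never contain commas.

```shell
#!/bin/sh
# Hypothetical filter: trims space padding from every comma-separated field,
# so an all-space field becomes an empty field.
# Assumes no field contains a comma of its own.
trim_fields() {
    sed -e 's/^ *//' -e 's/ *$//' -e 's/ *, */,/g'
}
```

Lines without commas, such as the `$CENTAUR` header and the `$` footer, pass through unchanged.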

If the header is of constant width, it can simply be considered an unneeded field. I.e. use something like

Code:

if($line =~ m/^.{10}(.{8})(.{20})(.{8})(.{4})/)
where '10' is the header width. Since there are no parentheses around it, the header won't be captured in the $N variables.

suse_nerd 04-21-2010 07:37 AM

Quote:

Originally Posted by Sergei Steshenko (Post 3942571)
{print "\$1=$1\,\$2=$2\,\$3=$3\,\$4=$4\n";}

Will this syntax definitely work? I need the commas between each field and the new line at the end.

I assume it will just discard the $ at the end of the file.

The Perl script runs but doesn't output anything. How do I get it to output?

Edit: There are no carriage returns; everything is output on one line. Is this script expecting an LF or CR? Also, it only needs to skip the header once for each output. Will this work?

Something like this?

Code:

#!/usr/bin/perl
open (MYFILE, '>>data.txt');

while(defined(my $line = <STDIN>))
{
if($line =~ m/^.{8}(.{8})(.{20})(.{8})(.{4})/)
  {print MYFILE "\$1=$1\,\$2=$2\,\$3=$3\,\$4=$4\n";}
}
  close (MYFILE);

Still not outputting anything though.

Sergei Steshenko 04-21-2010 08:39 AM

First and foremost - put

Code:

use strict;
use warnings;

just after

Code:

#!/usr/bin/perl
.

You do not need backslashes before commas.

You need to debug the script. First make sure the lines are indeed read from STDIN; to do that, put this just before the 'if' statement:
Code:

warn "\$line=$line";

suse_nerd 04-21-2010 09:16 AM


I got it to work for the first field only. How do you get it to loop until the end of the input?

Sergei Steshenko 04-21-2010 09:22 AM

Quote:

Originally Posted by suse_nerd (Post 3942743)
Code:

$perl out.pl
$CENTAUR16124319                    09082011000130001705                    2309
2011000130193694                    20042010000130209998                    2004
2010000130213907                    31012012000130217602                    0109
2011000130217613                    11092011000130222883                    1901
2012000130226217                    160420120001302355020                  3105
2011000130237348                    20042010000130237359                    2004
2010000130238544                    12082011000130238566                    2004
2010000130242020                    20042010000130278571                    2004
2010000130280336                    20042010000130288291                    0902
2011000130288316                    01072010000130288327                    1201
2011000130291955                    20042010000130293542                    2004
20100001$
$line=$CENTAUR16124319                    09082011000130001705
  23092011000130193694                    20042010000130209998
  20042010000130213907                    31012012000130217602
  01092011000130217613                    11092011000130222883
  19012012000130226217                    160420120001302355020
  31052011000130237348                    20042010000130237359
  20042010000130238544                    12082011000130238566
  20042010000130242020                    20042010000130278571
  20042010000130280336                    20042010000130288291
  09022011000130288316                    01072010000130288327
  12012011000130291955                    20042010000130293542
  200420100001$

At this point the script just waits and doesn't do anything. If I kill the program, data.txt is empty.


And why shouldn't it wait?

If you are invoking it as

Code:

perl out.pl
then it waits for something to come from STDIN, i.e. the keyboard. Why don't you feed the data into the script using a pipe or redirection?

For that matter, I do not understand why/how the script managed to print even one $line.

suse_nerd 04-21-2010 09:42 AM

Quote:

Originally Posted by Sergei Steshenko (Post 3942750)
For that matter, I do not understand why/how the script managed to print even one $line.

I think it's because it's waiting for a new line, so when I press the Enter key, it outputs the value of $line as specified in the warn statement.

I opened and closed the file within the if statement and now it writes to the file properly. However, it seems to work only on the first record and doesn't repeat the regex match until the end of the line.

So I get something like:

Code:

30293644,                    ,20042010,0001

Sergei Steshenko 04-21-2010 10:12 AM

Why don't you do two things:
  1. chmod +x your_script.pl
  2. ./your_script.pl < input_data_file
?

Adjust path to your script and to your input data file as needed in the above.

suse_nerd 04-21-2010 10:32 AM

Yeah, but I was hoping to be able to take input directly from the barcode scanner, not a file.


All times are GMT -5.