LinuxQuestions.org
Old 04-21-2010, 05:28 AM   #1
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Rep: Reputation: 15
Need more efficient script PDF417 parsing


I wrote this

Code:
cat $1  | sed -e 's/[ ][ ]*/,/g' | sed 's/,/,\n/g' | sed -e '$s/.....$//' - >> processing
cat processing |  awk 'NR==1' | sed 's/^........//' >> processed
cat processing |  awk 'NR==2,NR==100' |  sed 's/^....//' >> processed
mv -i processed $1.csv
to process a particularly formatted PDF417 file.

It is formatted like this
$CENTAUR<code_length_8><batch_length_20><expiry_length_8><quantity_length_4>$

So a typical record looks like this
<code_length_8><batch_length_20><expiry_length_8><quantity_length_4>
(note, no spaces between each record or field if the field is present, otherwise white space for the length specified).

A file will never have more than 22 records.

The files I have been processing so far only have the quantity and code fields, so everything else is whitespace and my script doesn't take the other fields into account. I would like it to make a CSV file like this:

$CENTAUR<LF>
<code_length_8>,<batch_length_20>,<expiry_length_8>,<quantity_length_4><LF>
<code_length_8>,<batch_length_20>,<expiry_length_8>,<quantity_length_4><LF>
$<LF>

<LF> = linefeed/new line
<> separate fields only to show the format and are not actually present in the barcode output

The file actually comes straight from a barcode scan: the PDF417 barcode is scanned and the result is saved in Notepad++. Can it be scanned directly into the script?


example file - not sure how much formatting will be retained on the forum.
Code:
$CENTAUR30298309                            000130287018                            000130318905                            000130295355                            000130295344                            000130295333                            000130209138                            000130210705                            000130217293                            000130273352                            000130292823                            000130292834                            000130293065                            000130293076                            000130293087                            000130293000                            000130293010                            000130293021                            000130292415                            000130292426                            000130292947                            000130292958                            0001$
There is white space in the example as some of the fields are not present - only the quantity and code are present here
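
For reference, the layout described above (the 8-character header `$CENTAUR`, then 40-character records split 8/20/8/4, then a closing `$`) can be handled in one pipeline without the temporary `processing` and `processed` files. This is only a sketch of one possible approach, not the poster's script: the function name `pdf417_to_csv` is made up for illustration, and it assumes exactly one payload arrives on standard input.

```shell
#!/bin/sh
# Sketch: convert one scanned payload (stdin) to the CSV layout described above.
# Assumes a single "$CENTAUR...$" payload and fixed 8/20/8/4 field widths.
pdf417_to_csv() {
  printf '%s\n' '$CENTAUR'                      # header on its own line
  { tr -d '\r\n'; echo; } |                     # join into one newline-terminated line
    sed -e 's/^\$CENTAUR//' -e 's/\$$//' |      # strip the header and the closing "$"
    fold -w 40 |                                # one 40-character record per line
    sed -e 's/^\(.\{8\}\)\(.\{20\}\)\(.\{8\}\)\(.\{4\}\)$/\1,\2,\3,\4/'
  printf '%s\n' '$'                             # footer on its own line
}
```

It could be called as `pdf417_to_csv < scan.txt > scan.csv`, where `scan.txt` is a placeholder name. Missing fields stay as runs of spaces between the commas, so every field keeps its fixed width.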

Last edited by suse_nerd; 04-21-2010 at 07:22 AM.
 
Old 04-21-2010, 05:35 AM   #2
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by suse_nerd View Post
I wrote this

Code:
cat $1 | sed -e 's/[ ][ ]*/,/g' | sed 's/,/,\n/g' | sed -e '$s/.....$//' - >> processing
cat processing | awk 'NR==1' | sed 's/^........//' >> processed
cat processing | awk 'NR==2,NR==100' | sed 's/^....//' >> processed
mv -i processed $1.csv

to process a particularly formatted PDF417 file.

It is formatted like this
$CENTAUR<code_length_8><batch_length_20><expiry_length_8><quantity_length_4><code_length_8>....001$
(note, no spaces between each record or field)

The files I have been processing so far only have the quantity and code fields, so everything else is whitespace and my script doesn't take the other fields into account. I would like it to make a csvfile like this:

$CENTAUR
<code_length_8>,<batch_length_20>,<expiry_length_8>,<quantity_length_4>
<code_length_8>,<batch_length_20>,<expiry_length_8>,<quantity_length_4>
....
001$


It is actually a file scanned directly from a barcode output - so the PDF417 barcode is scanned and the file is saved in notepad++ - can it be scanned directly into the script?


example file - not sure how much formatting will be retained on the forum.
Code:
$CENTAUR30298309                            000130287018                            000130318905                            000130295355                            000130295344                            000130295333                            000130209138                            000130210705                            000130217293                            000130273352                            000130292823                            000130292834                            000130293065                            000130293076                            000130293087                            000130293000                            000130293010                            000130293021                            000130292415                            000130292426                            000130292947                            000130292958                            0001$
There is a whole bunch of PDF-related Perl modules (including ones for parsing): http://search.cpan.org/search?query=PDF&mode=all .
 
Old 04-21-2010, 05:45 AM   #3
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by Sergei Steshenko View Post
There is a whole bunch of PDF-related Perl modules (including ones for parsing): http://search.cpan.org/search?query=PDF&mode=all .
Thanks, but this is about PDF417 barcodes, rather than PDF files.
 
Old 04-21-2010, 06:30 AM   #4
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by suse_nerd View Post
Thanks, but this is about PDF417 barcodes, rather than PDF files.
Your example shows whitespace, but you wrote: "(note, no spaces between each record or field)".

Anyway, are the fields in question of constant width? If so, constant-width fields can be extracted in Perl with a regular expression like this:

Code:
if($line =~ m/^(.{3})(.{5})(.{8})/)
  {
  # $1 holds the first 3 characters, $2 the following 5 characters, $3 the following 8 characters
  print "\$1=$1 \$2=$2 \$3=$3\n";
  }
Or Perl's 'substr' function can be used.
 
1 members found this post helpful.
Old 04-21-2010, 06:36 AM   #5
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Original Poster
Rep: Reputation: 15
Hello,

Yes, to clarify:

There is no whitespace if all fields are present.
If fields are missing, like in my example, there is whitespace to the length of the missing fields.

So would something like this work (after stripping the header and footer)?
Code:
while(<STDIN>)
{
if($line =~ m/^(.{8})(.{20})(.{8})(.{4})/)
  {print "\$1=$1\,\$2=$2\,\$3=$3\,\$4=$4\n";}
}
Note the header is
$CENTAUR
and the footer is just
$
not 001$

Last edited by suse_nerd; 04-21-2010 at 07:00 AM.
 
Old 04-21-2010, 07:10 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,999

Rep: Reputation: 3190
Based on your example:
Code:
$CENTAUR30298309                            000130287018 ...
Would you show the desired output?
 
1 members found this post helpful.
Old 04-21-2010, 07:18 AM   #7
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Original Poster
Rep: Reputation: 15
The output needs to be captured as a file, yes. Ideally should prompt for a filename.

If the input is coming directly from the barcode scanner, this functionality would be ideal.

Output needs to be in a CSV-style format, see my OP, I have made some changes/updates to it.
 
Old 04-21-2010, 07:26 AM   #8
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by suse_nerd View Post
Hello,

Yes, to clarify:

There is no whitespace if all fields are present.
If fields are missing, like in my example, there is whitespace to the length of the missing fields.

So would something like this work (after stripping the header and footer)?
Code:
while(<STDIN>)
{
if($line =~ m/^(.{8})(.{20})(.{8})(.{4})/)
  {print "\$1=$1\,\$2=$2\,\$3=$3\,\$4=$4\n";}
}
Note the header is
$CENTAUR
and the footer is just
$
not 001$
Just replace

Code:
while(<STDIN>)
with

Code:
while(defined(my $line = <STDIN>))
.

After you have all the fields you can check whether each one consists of whitespace only.

If the header is of constant width, it can simply be considered an unneeded field. I.e. use something like

Code:
if($line =~ m/^.{10}(.{8})(.{20})(.{8})(.{4})/)
where '10' is the header width (8 for a '$CENTAUR' header). Since there are no parentheses around that part, the header won't be captured in the $N variables.
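
The whitespace-only check mentioned in this post can also be done as a separate pass over the CSV lines once they exist. The sketch below is an assumption about what such a pass might look like; the function name `blank_fields_to_empty` is invented here. It turns any field made up entirely of spaces into an empty field and leaves every other line untouched.

```shell
#!/bin/sh
# Sketch: replace space-only CSV fields with empty fields.
# Reads CSV lines on stdin and writes the cleaned lines to stdout.
blank_fields_to_empty() {
  awk -F, 'BEGIN { OFS = "," }
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^ +$/) $i = ""   # field holds nothing but spaces
      print
    }'
}
```

Whether to keep the padding or drop it depends on what consumes the CSV. Keeping it preserves the fixed widths, while dropping it makes missing values easier to spot.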
 
1 members found this post helpful.
Old 04-21-2010, 07:37 AM   #9
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by Sergei Steshenko View Post
{print "\$1=$1\,\$2=$2\,\$3=$3\,\$4=$4\n";}
Will this syntax definitely work? I need the commas between each field and the new line at the end.

I assume it will just discard the $ at the end of the file.

The Perl script runs but doesn't output anything. How do I get it to output?

Edit: There are no carriage returns; everything is output on one line. Is this script expecting an LF or CR? Also, it only needs to skip the header once for each output. Will this work?

Something like this?

Code:
#!/usr/bin/perl
open (MYFILE, '>>data.txt');

while(defined(my $line = <STDIN>))
{
if($line =~ m/^.{8}(.{8})(.{20})(.{8})(.{4})/)
  {print MYFILE "\$1=$1\,\$2=$2\,\$3=$3\,\$4=$4\n";}
}
  close (MYFILE);
Still not outputting anything though.

Last edited by suse_nerd; 04-21-2010 at 08:31 AM.
 
Old 04-21-2010, 08:39 AM   #10
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by suse_nerd View Post
Will this syntax definitely work? I need the commas between each field and the new line at the end.

I assume it will just discard the $ at the end of the file.

The Perl script runs but doesn't output anything. How do I get it to output?

Edit: There are no carriage returns; everything is output on one line. Is this script expecting an LF or CR? Also, it only needs to skip the header once for each output. Will this work?

Something like this?

Code:
#!/usr/bin/perl
open (MYFILE, '>>data.txt');

while(defined(my $line = <STDIN>))
{
if($line =~ m/^.{8}(.{8})(.{20})(.{8})(.{4})/)
  {print MYFILE "\$1=$1\,\$2=$2\,\$3=$3\,\$4=$4\n";}
}
  close (MYFILE);
Still not outputting anything though.
First and foremost - put

Code:
use strict;
use warnings;
just after

Code:
#!/usr/bin/perl
.

You do not need backslashes before commas.

You need to debug the script. First make sure the lines are indeed read from STDIN; to do this, put the following just before the 'if' statement:
Code:
warn "\$line=$line";
.
 
1 members found this post helpful.
Old 04-21-2010, 09:16 AM   #11
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by Sergei Steshenko View Post
.

I got it to work for the first record only. How do I get it to loop until the end of the input?

Last edited by suse_nerd; 04-21-2010 at 09:21 AM.
 
Old 04-21-2010, 09:22 AM   #12
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by suse_nerd View Post
Code:
$perl out.pl
$CENTAUR16124319                    09082011000130001705                    2309
2011000130193694                    20042010000130209998                    2004
2010000130213907                    31012012000130217602                    0109
2011000130217613                    11092011000130222883                    1901
2012000130226217                    160420120001302355020                   3105
2011000130237348                    20042010000130237359                    2004
2010000130238544                    12082011000130238566                    2004
2010000130242020                    20042010000130278571                    2004
2010000130280336                    20042010000130288291                    0902
2011000130288316                    01072010000130288327                    1201
2011000130291955                    20042010000130293542                    2004
20100001$
$line=$CENTAUR16124319                    09082011000130001705
  23092011000130193694                    20042010000130209998
  20042010000130213907                    31012012000130217602
  01092011000130217613                    11092011000130222883
  19012012000130226217                    160420120001302355020
  31052011000130237348                    20042010000130237359
  20042010000130238544                    12082011000130238566
  20042010000130242020                    20042010000130278571
  20042010000130280336                    20042010000130288291
  09022011000130288316                    01072010000130288327
  12012011000130291955                    20042010000130293542
  200420100001$
At this point the script just waits and doesn't do anything. If I kill the program, data.txt is empty.

And why shouldn't it wait ?

If you are invoking it as

Code:
perl out.pl
, then it waits for something to come from STDIN, i.e. the keyboard. Why not feed the data into the script using a pipe or redirection?

For that matter, I do not understand why/how the script managed to print even one $line.
 
1 members found this post helpful.
Old 04-21-2010, 09:42 AM   #13
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by Sergei Steshenko View Post
For that matter, I do not understand why/how the script managed to print even one $line.
I think it's because it's waiting for a newline, so when I press the Enter key, it outputs the value of $line as specified in the warn statement.

I opened and closed the file within the if statement and now it outputs to the file properly. However, it seems to work only on the first record and doesn't loop the regex search/parse until the end of the line.

So I get something like:

Code:
30293644,                    ,20042010,0001
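
A likely reason only one record comes out is that the whole payload arrives as a single line, so a pattern anchored at the start of the line matches once and the rest of the line is never examined. One way around that is to step through the line 40 characters at a time. The sketch below shows the idea with awk rather than Perl, to stay close to the tools in the original script. The name `each_record` is hypothetical, and the 8/20/8/4 widths are taken from the first post.

```shell
#!/bin/sh
# Sketch: print every 40-character record of a payload line as one CSV row.
each_record() {
  awk '
    {
      body = $0
      sub(/\r$/, "", body)           # tolerate a trailing carriage return
      sub(/^\$CENTAUR/, "", body)    # drop the header if present
      sub(/\$$/, "", body)           # drop the closing "$" if present
      # An anchored match only ever sees the first 40 characters, so step
      # through the line one record at a time instead.
      for (pos = 1; pos + 39 <= length(body); pos += 40) {
        printf "%s,%s,%s,%s\n",
          substr(body, pos, 8), substr(body, pos + 8, 20),
          substr(body, pos + 28, 8), substr(body, pos + 36, 4)
      }
    }'
}
```

Any trailing fragment shorter than 40 characters is silently ignored in this sketch. A stricter version might report it as a malformed record.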
 
Old 04-21-2010, 10:12 AM   #14
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by suse_nerd View Post
I think it's because it's waiting for a newline, so when I press the Enter key, it outputs the value of $line as specified in the warn statement.

I opened and closed the file within the if statement and now it outputs to the file properly. However, it seems to work only on the first record and doesn't loop the regex search/parse until the end of the line.

So I get something like:

Code:
30293644,                    ,20042010,0001
Why not do two things:
  1. chmod +x your_script.pl
  2. ./your_script.pl < input_data_file
?

Adjust path to your script and to your input data file as needed in the above.

Last edited by Sergei Steshenko; 04-21-2010 at 10:15 AM.
 
Old 04-21-2010, 10:32 AM   #15
suse_nerd
Member
 
Registered: May 2008
Distribution: SuSe
Posts: 50

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by Sergei Steshenko View Post
Why not do two things:
  1. chmod +x your_script.pl
  2. ./your_script.pl < input_data_file
?

Adjust path to your script and to your input data file as needed in the above.
Yeah, but I was hoping to be able to take input directly from the barcode scanner, not a file.
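
If the scanner behaves like a keyboard, as the earlier terminal session suggests, then reading from a file and reading from the scanner are the same thing from the script's point of view: both arrive on standard input. The sketch below rests on that assumption and on the payload ending with Enter, whether sent by the scanner or pressed by hand. The function name `scan_to_csv` and the prompts are invented for illustration.

```shell
#!/bin/sh
# Sketch: prompt for a file name, read one scanned payload, write the CSV.
# Assumes the scanner types into the terminal like a keyboard.
scan_to_csv() {
  printf 'Output file name: ' >&2
  IFS= read -r outfile || return 1
  printf 'Scan the barcode, then press Enter if the scanner does not send one.\n' >&2
  # Accept a final line even if it is not newline-terminated.
  IFS= read -r payload || [ -n "$payload" ] || return 1
  printf '%s\n' "$payload" | awk '
    {
      sub(/\r$/, ""); sub(/^\$CENTAUR/, ""); sub(/\$$/, "")
      print "$CENTAUR"
      for (pos = 1; pos + 39 <= length($0); pos += 40)
        printf "%s,%s,%s,%s\n", substr($0, pos, 8), substr($0, pos + 8, 20), substr($0, pos + 28, 8), substr($0, pos + 36, 4)
      print "$"
    }' > "$outfile"
}
```

Run interactively, `scan_to_csv` asks for the file name first and then waits for the scan. The same function still works with redirected input, which makes it easy to try out against a saved file before using the scanner.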
 
  



Tags
awk, barcode, barcodes, bash, decode, perl, process, processing, script, sed

