Old 04-25-2006, 06:54 PM   #1
thanhvn
Member
 
Registered: Mar 2005
Location: CA
Distribution: RHEL3, FC4
Posts: 46

Rep: Reputation: 15
Strange Perl problem


I'm using Perl 5.8.7 on Cygwin 1.5.18 and recently I ran into a strange problem:

I have file1 whose contents are along the lines of:
...
XXX 10
YYY 12
ZZZ 17
...

I have file2 whose contents are along the lines of:
...
XXX
AAA
ZZZ
DDD
...

I have a Perl script that reads each line from file2, greps for a match in file1, then extracts the second field:
Code:
...
my $file1 = ## path to file1 ##
my $file2 = ## path to file2 ##
open IN, "<$file2" or die "blabblahblah"
while (<IN>) {
   chomp;
   my $d = $_; print "d=zzz${d}zzz\n";
   my $left = `grep $d $file1`; print "left=aaa${left}aaa\n";
   chomp $left; print "left=bbb${left}bbb\n";
   my $v = "-";
   if ($left ne "") { $v = (split)[1], $left; }
   print "v=ccc${v}ccc\n";
   ...
}
...
My debug print statements output something totally unexpected (I'm only going to show one attempted match for XXX below; the rest are similar):
d=zzzXXXzzz
left=aaaXXX 10
aaa
bbbt=bbbXXX 10
v=cccccc

Calling chomp on the newline-terminated string returned by grep totally messed up that string (as seen from the debug output). Subsequently, (split)[0] on that string returns XXX as expected (not shown here), but (split)[1] returns an empty string instead of 10. Does anyone know what is going on here or how to fix it? Thanks in advance.
 
Old 04-25-2006, 08:11 PM   #2
puffinman
Member
 
Registered: Jan 2005
Location: Atlanta, GA
Distribution: Gentoo, Slackware
Posts: 217

Rep: Reputation: 31
Here's something a little simpler that does it. Just call it with file1 and file2 as arguments (in that order). And please don't put two statements on one line in a program; it makes things impossible to read.
Code:
#!/usr/bin/perl
# Reads file1 first (key/value lines), then file2 (keys only).
# Lines from file1 fill the hash; a line from file2 whose key is
# already in the hash prints the stored value.
my %hash;
while (<>) {
  my ($key, $val) = split /\s+/;
  if ($hash{$key}) {
    print $hash{$key}, "\n";
  } else {
    $hash{$key} = $val;
  }
}
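With the sample data from the first post, calling it as, say, perl script.pl file1 file2 should print something like:
10
17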
 
Old 04-26-2006, 07:29 AM   #3
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
You can simply use grep for this:

grep -f key-file data-file
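With the sample files from the first post (and assuming file2 holds just the key names), grep -f file2 file1 would print the matching lines from file1:
XXX 10
ZZZ 17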
 
Old 04-26-2006, 07:44 AM   #4
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
Alternatively, use the proper Perl grep:

Code:
#!/usr/local/bin/perl -w

# file2 holds the keys, file1 holds the "KEY value" data lines
open KEYS, "<file2" or die "cannot open file2: $!";
open DATA, "<file1" or die "cannot open file1: $!";

@slurp = <DATA>;   # slurp the data lines
@keys  = <KEYS>;

chomp @keys;       # strip newlines from the keys before using them as patterns

foreach $key (@keys) {
    print grep /$key/, @slurp;   # print every data line matching this key
}
Look at map and grep in Perl - very good.
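For instance (a quick sketch):
Code:
my @lines = ("XXX 10\n", "YYY 12\n", "ZZZ 17\n");

my @matches = grep { /^XXX/ } @lines;      # grep keeps the matching elements: ("XXX 10\n")
my @values  = map  { (split)[1] } @lines;  # map transforms every element: (10, 12, 17)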

Last edited by bigearsbilly; 04-26-2006 at 07:46 AM.
 
Old 04-27-2006, 05:02 PM   #5
thanhvn
Member
 
Registered: Mar 2005
Location: CA
Distribution: RHEL3, FC4
Posts: 46

Original Poster
Rep: Reputation: 15
Basically, there are two problems in my script:

1) A user on Linux Forums pointed out to me that I used split on the default $_ instead of on $left

Incorrect:
Code:
if ($left ne "") { $v = (split)[1], $left; }
Should be:
Code:
if ($left ne "") { $v = (split /\s+/, $left)[1]; }
This caused $v to be undef/blank in the debug printouts.

2) Chomp has a problem with windows-style terminated strings, i.e. \r\f. Chomp only removes the \r leaving the \f intact, which causes the wraparound problem as shown in the debug output. The hanging \f also causes other string comparison problems.

Code:
my $left = `grep $d $file1`; 
print "left=aaa${left}aaa\n";
chomp $left; 
print "left=bbb${left}bbb\n";
Output:
left=aaaXXX 10
aaa
bbbt=bbbXXX 10

So beware when using chomp on non-unix strings/files.

Is it too much to ask for a chomp that works correctly on all three types of files (Unix, Windows, Mac), or does one already exist?
 
Old 04-27-2006, 06:15 PM   #6
puffinman
Member
 
Registered: Jan 2005
Location: Atlanta, GA
Distribution: Gentoo, Slackware
Posts: 217

Rep: Reputation: 31
Chomp removes the input record separator (special variable $/), which by default is "\n". Set it to whatever you want and chomp will remove it. Alternatively, you can use the regex

Code:
s/\s+$//
which will remove any and all whitespace characters (including carriage returns and line feeds) from the end of the string.
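For example, a rough sketch of both with a made-up string:
Code:
# 1) tell chomp what to strip by changing the input record separator
{
    local $/ = "\r\n";                 # DOS/Windows line ending
    chomp(my $line = "XXX 10\r\n");
    print "[$line]\n";                 # prints [XXX 10]
}

# 2) or strip any trailing whitespace, whatever the line ending was
my $line = "XXX 10\r\n";
$line =~ s/\s+$//;
print "[$line]\n";                     # prints [XXX 10]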
 
Old 04-29-2006, 01:36 AM   #7
thanhvn
Member
 
Registered: Mar 2005
Location: CA
Distribution: RHEL3, FC4
Posts: 46

Original Poster
Rep: Reputation: 15
puffinman, thanks for the suggestion. But I see a problem with each of the alternatives:

1) Setting $/

This means the script will only work correctly for one specific type of file (Unix, Windows, or Mac). Certainly, this alternative will not work if you don't know ahead of time which type of file your script will have to deal with, and it definitely won't work if your script needs to handle more than one type of file.

2) Using regex s/\s+$// instead of chomp

This of course will work with all types of files. I can even define my own chomp to do this regex if calling chomp is more convenient; something like the sketch below is what I have in mind. However, this highlights the problem of having to define (or redefine) common functions in Perl just to have my scripts work correctly cross-platform. If there are a dozen more like chomp, do I have to redefine them all for every single one of my scripts? Wouldn't it be better if the Perl language were implemented with cross-platform use in mind instead of shifting this burden to its programmers?
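Just a rough sketch of that custom chomp (my_chomp is my own name for it):
Code:
sub my_chomp {
    # strip a trailing \r\n, \n, or bare \r in place
    s/\r?\n$|\r$// for @_;
}

my $line = "XXX 10\r\n";
my_chomp($line);
print "[$line]\n";   # prints [XXX 10]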
 
Old 04-29-2006, 02:59 PM   #8
this213
Member
 
Registered: Dec 2001
Location: ./
Distribution: Fedora, CentOS, RHEL, Gentoo
Posts: 167

Rep: Reputation: 34
Quote:
Originally Posted by thanhvn
Chomp has a problem with windows-style terminated strings, i.e. \r\f.
Actually, it's \r\n, not \r\f. \f is a formfeed, not a newline. Anyway, chomp is operating exactly as it's supposed to, removing the trailing \n from a string. As long as you're aware of this, there isn't any problem; just strip out the \r's ($line =~ s/\r//g).

Quote:
Originally Posted by thanhvn
1) Setting $/

This means the script will only work correctly for one specific type of files (unix, windows, or mac). Certainly, this alternative will not work if you don't know ahead of time which type of files your script will have to deal with.
Exactly, so you need to normalize your incoming data so that it all has the expected format.
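For example (just a sketch, assuming $line holds a freshly read line):
Code:
$line =~ s/\r\n?/\n/g;   # turn DOS (\r\n) and old Mac (\r) endings into plain \n
chomp $line;             # plain chomp now behaves as expected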

Quote:
Originally Posted by thanhvn
However, this highlights the problem of having to define (or redefine) common functions in Perl just to have my scripts work correctly cross-platform. If there are a dozen more like chomp, do I have to redefine them all for every single one of my scripts? Wouldn't it be better if the Perl language were implemented with cross-platform use in mind instead of shifting this burden to its programmers?
No, I don't see M$ bending over backwards so their applications will run on Linux, but that's beside the point. There's quite a bit of functionality in Perl, and M$ can't support half of it simply because it doesn't have the facilities (such as socket programming). Since Perl started on Unix, is developed on Unix, and the majority of scripts run on Linux and Unix systems, that's where the focus of feature expansion resides. Further, since Perl "lives" in Linux, it has to try to stay backward compatible with earlier versions of Perl as much as possible, and changing chomp, a function that appears in just about every script out there that reads a file, would probably (or at least possibly) break all of those scripts (since the original developers would already have dealt with it only removing \n if they had to). If you're writing a cross-platform script, you just have to take that into consideration and code accordingly. I could perhaps see adding an alternative function to chomp that handles other platforms' line endings, but not replacing chomp altogether. The same goes for any other function that works just fine under *nix and not Windows - in fact, it wouldn't surprise me at all if there's already a Perl module that includes this functionality.

Every single developer I know also keeps a collection of code snippets, no matter what language they work in. If you're going to be writing a lot of cross platform scripts, perhaps you should invest some time into creating a few subs to keep around and just include in your scripts as needed.

Now, on to your code. To begin with, you need to work on your formatting; in particular, as has already been pointed out, hit the Enter key every once in a while (as in, after every ;, {, or }). Your 4th line down in the OP is missing a semicolon at the end. Finally, you look as though you're trying to fit as much code as possible into a small space: putting multiple statements on one line, using single-character variables and so forth. If you keep this up, you're going to get lost fast when you start writing larger scripts. Each statement should be on its own line, and each variable should have a descriptive name; this isn't C.

The following code solves your issue and does so in a clear and concise manner. Anyone with a passing knowledge of Perl should be able to read it without difficulty. This not only makes troubleshooting easier, but getting used to writing code like this lets you see at a glance what a script is doing.
Code:
#!/usr/bin/perl -w

my $keysfile = 'file1.txt'; ## the file containing the keys ##
my $datafile = 'file2.txt'; ## the file containing data ##

# this is boilerplate code for reading a file into an array
open(FILE,"$keysfile") || die "Cannot open $keysfile: $1";
my @keys = <FILE>;
close FILE;
chomp(@keys);

# ...so do it again for the data file
open(FILE,"$datafile") || die "Cannot open $datafile: $1";
my @data = <FILE>;
close FILE;
chomp(@data);

# Now, regardless of what each line is terminated with, it doesn't matter
# because each element is an array element. We're not searching for
# anything on the end of the element (except for maybe formatting later).

# you have 2 *sane* options here: Either spin the keys array and
# regex match it to the data array, or split the data into a hash.
# Personally, I think a hash is perfect for this because you can do a direct match
# in this situation
my %datahash = ();
foreach my $el (@data){
    my ($key, $value) = split(/\s+/, $el);
    $datahash{$key} = $value;
}

# Finally, all you need to do now is spin through the keys array
# and match that to the data hash. If you opted to not assign the data to a hash,
# you would simply perform a regex on the @data array here instead of the match
# to the %datahash key.
foreach my $key (@keys){
    my $data_value = ''; # reinit with each pass
    if( $datahash{$key} && $datahash{$key} ne '' ){
        $data_value = $datahash{$key};
        # now you have $key and $data_value and you can handle them
        # however you want
        print $key.' = '.$data_value."\n";
    }
}

exit;
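With the sample files from the first post, this should print something along the lines of:
XXX = 10
ZZZ = 17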
OK, I'm done now
 
  

