Old 04-25-2006, 07:54 PM   #1
thanhvn
Member
 
Registered: Mar 2005
Location: CA
Distribution: RHEL3, FC4
Posts: 46

Rep: Reputation: 15
Strange Perl problem


I'm using Perl 5.8.7 on Cygwin 1.5.18 and recently I ran into a strange problem:

I have file1 whose contents are along the lines of:
...
XXX 10
YYY 12
ZZZ 17
...

I have file2 whose contents are along the lines of:
...
XXX
AAA
ZZZ
DDD
...

I have a perl script which reads each line from file2, attempts a match in file1, then extracts the second field:
Code:
...
my $file1 = ## path to file1 ##
my $file2 = ## path to file2 ##
open IN, "<$file2" or die "blabblahblah"
while (<IN>) {
   chomp;
   my $d = $_; print "d=zzz${d}zzz\n";
   my $left = `grep $d $file1`; print "left=aaa${left}aaa\n";
   chomp $left; print "left=bbb${left}bbb\n";
   my $v = "-";
   if ($left ne "") { $v = (split)[1], $left; }
   print "v=ccc${v}ccc\n";
   ...
}
...
My debug print statements output something totally unexpected (I'm only going to show one attempted match for XXX below; the rest are similar):
d=zzzXXXzzz
left=aaaXXX 10
aaa
bbbt=bbbXXX 10
v=cccccc

Calling chomp on the newline-terminated string returned by grep totally messed up that string (as seen from the debug outputs). Subsequently, (split)[0] on that string returns XXX as expected (not shown here), but (split)[1] on that string returns a null string (instead of 10 as expected). Does anyone know what is going on here or how to fix it? Thanks in advance.
 
Old 04-25-2006, 09:11 PM   #2
puffinman
Member
 
Registered: Jan 2005
Location: Atlanta, GA
Distribution: Gentoo, Slackware
Posts: 217

Rep: Reputation: 30
Here's something a little simpler that does it. Just call it with file1 and file2 as arguments (in that order). And please don't put two statements on one line in a program, it makes things impossible to read.
Code:
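#!/usr/bin/perl
use strict;
use warnings;

my %hash;
while (<>) {
  # split ' ' skips leading whitespace and drops a trailing \r\n
  my ($key, $val) = split ' ';
  next unless defined $key;    # skip blank lines
  if (defined $hash{$key}) {
    # key was already seen with a value (from file1): print that value
    print $hash{$key}, "\n";
  } else {
    $hash{$key} = $val;
  }
}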
 
Old 04-26-2006, 08:29 AM   #3
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: FreeBSD, Debian, Mint, Puppy
Posts: 3,314

Rep: Reputation: 175
you can simply use grep for this,

grep -f key-file data-file
 
Old 04-26-2006, 08:44 AM   #4
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: FreeBSD, Debian, Mint, Puppy
Posts: 3,314

Rep: Reputation: 175
alternatively use the proper perl grep

Code:
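#!/usr/local/bin/perl -w

# file2 holds the keys, file1 holds the "KEY VALUE" lines
open KEYS, "<file2" or die "Cannot open file2: $!";
open DATA, "<file1" or die "Cannot open file1: $!";

@slurp = <DATA>;
@keys = <KEYS>;

chomp @keys;
s/\r$// for @keys;    # chomp leaves the \r of a Windows line ending behind

foreach $key (@keys) {
    next if $key eq '';    # an empty key would match every line
    # \Q...\E matches the key literally instead of as a regex
    print grep /\Q$key\E/, @slurp;
}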
look at map and grep for perl - v. good.
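As a rough illustration of the map and grep built-ins mentioned above, here is a minimal sketch that pulls out the second field for each key. The helper name lookup_values and the sample data are made up for this example and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "KEY VALUE" strings for every key
# that has a value in the data lines.
sub lookup_values {
    my ($data_lines, $keys) = @_;
    # map: turn each "KEY VALUE" line into a (key, value) pair for the hash;
    # split ' ' also discards a trailing \r\n
    my %value_for = map { (split ' ')[0, 1] } @$data_lines;
    # grep: keep only the keys that actually have a value
    my @found = grep { defined $value_for{$_} } @$keys;
    return map { $_ . ' ' . $value_for{$_} } @found;
}

my @data_lines = ("XXX 10\r\n", "YYY 12\r\n", "ZZZ 17\r\n");
my @keys       = qw(XXX AAA ZZZ DDD);
print "$_\n" for lookup_values(\@data_lines, \@keys);
```

With the sample data this prints the lines for XXX and ZZZ only. Unlike a regex match against every line, the hash lookup compares whole keys, so a key such as XX would not match XXX.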

Last edited by bigearsbilly; 04-26-2006 at 08:46 AM.
 
Old 04-27-2006, 06:02 PM   #5
thanhvn
Member
 
Registered: Mar 2005
Location: CA
Distribution: RHEL3, FC4
Posts: 46

Original Poster
Rep: Reputation: 15
Basically, there are two problems in my script:

1) A user on Linux Forums pointed out to me that I used split on the default $_ instead of $left

Incorrect:
Code:
if ($left ne "") { $v = (split)[1], $left; }
Should be:
Code:
if ($left ne "") { $v = (split /\s+/, $left)[1]; }
This caused $v to be undef/blank in the debug printouts, because $_ still held the one-word line read from file2, which has no second field.

2) Chomp has a problem with windows-style terminated strings, i.e. \r\f. Chomp only removes the \r leaving the \f intact, which causes the wraparound problem as shown in the debug output. The hanging \f also causes other string comparison problems.

Code:
Code:
my $left = `grep $d $file1`; 
print "left=aaa${left}aaa\n";
chomp $left; 
print "left=bbb${left}bbb\n";
Output:
left=aaaXXX 10
aaa
bbbt=bbbXXX 10

So beware when using chomp on non-unix strings/files.

Is it too much to ask for a chomp that works correctly on all three types of files: unix, windows, mac? Or does one already exist?
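Both problems can be reproduced in a few lines. This is only a sketch: the helper show is made up for illustration, and the sample string assumes grep returned a line ending in a carriage return followed by a newline.

```perl
use strict;
use warnings;

# Hypothetical helper: make control characters visible for debugging.
sub show {
    my ($s) = @_;
    $s =~ s/\r/\\r/g;
    $s =~ s/\n/\\n/g;
    return $s;
}

my $left = "XXX 10\r\n";      # assumed: line from a Windows-style file
chomp(my $chomped = $left);   # chomp removes only the trailing "\n"
print show($chomped), "\n";   # prints XXX 10\r

$_ = "XXX";                              # the line read from file2
my $wrong = (split)[1];                  # splits $_, which has one field
my $right = (split /\s+/, $chomped)[1];  # splits the grep result instead
print defined $wrong ? $wrong : 'undef', "\n";   # prints undef
print "$right\n";                                # prints 10
```

Printing the chomped string raw sends the leftover carriage return to the terminal, which moves the cursor back to column 0 so that the following text overwrites the start of the line. That is why the debug output looked wrapped around.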
 
Old 04-27-2006, 07:15 PM   #6
puffinman
Member
 
Registered: Jan 2005
Location: Atlanta, GA
Distribution: Gentoo, Slackware
Posts: 217

Rep: Reputation: 30
Chomp removes the input record separator (special variable $/), which by default is "\n". Set it to whatever you want and chomp will remove it. Alternatively, you can use the regex

Code:
s/\s+$//
which will remove any and all whitespace characters (including carriage returns and line feeds) from the end of the string.
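A small sketch of the two alternatives side by side. The helper names are invented for this example; only $/, chomp and the substitution come from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: chomp using a caller-supplied record separator.
sub chomp_with_separator {
    my ($line, $separator) = @_;
    local $/ = $separator;    # chomp removes one trailing copy of $/
    chomp $line;
    return $line;
}

# Hypothetical helper: the regex alternative, which ignores $/ entirely.
sub strip_trailing_whitespace {
    my ($line) = @_;
    $line =~ s/\s+$//;        # removes \r, \n, spaces and tabs at the end
    return $line;
}

print length(chomp_with_separator("XXX 10\r\n", "\r\n")), "\n";   # prints 6
print length(chomp_with_separator("XXX 10\n",   "\r\n")), "\n";   # prints 7
print length(strip_trailing_whitespace("XXX 10 \r\n")), "\n";     # prints 6
```

The second call shows the limitation of setting $/: with the separator set to "\r\n", a line ending in a bare "\n" is left untouched. The regex handles both, at the cost of also removing trailing spaces and tabs.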
 
Old 04-29-2006, 02:36 AM   #7
thanhvn
Member
 
Registered: Mar 2005
Location: CA
Distribution: RHEL3, FC4
Posts: 46

Original Poster
Rep: Reputation: 15
puffinman, thanks for the suggestion. But I see a problem with each of the alternatives:

1) Setting $/

This means the script will only work correctly for one specific type of files (unix, windows, or mac). Certainly, this alternative will not work if you don't know ahead of time which type of files your script will have to deal with. Also, this definitely won't work if your script needs to work with more than one type of files.

2) Using regex s/\s+$// instead of chomp

This of course will work with all types of files. I can even define my custom chomp to do this regex if calling chomp is more convenient. However, this highlights the problem of having to define (or redefine) common functions in Perl just to have my scripts work correctly cross-platform. If there are a dozen more like chomp, then I have to redefine them all for every single one of my scripts? Wouldn't it be better if the Perl language is implemented with cross-platform in mind instead of shifting this burden to its programmers?
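For reference, a custom chomp of the kind described in 2) could look roughly like this. The name portable_chomp is hypothetical, and this version removes exactly one line ending rather than all trailing whitespace, which is a slightly stricter variant of the regex discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing line ending, whether it is
# Windows (\r\n), unix (\n) or old Mac (\r) style.
sub portable_chomp {
    my ($line) = @_;
    $line =~ s/(?:\r\n|\n|\r)\z//;   # \z anchors at the very end of the string
    return $line;
}

for my $line ("XXX 10\r\n", "XXX 10\n", "XXX 10\r") {
    print portable_chomp($line), "\n";   # prints XXX 10 each time
}
```

Unlike the built-in chomp, this sketch returns the cleaned string instead of modifying its argument in place, so it would be called as $left = portable_chomp($left).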
 
Old 04-29-2006, 03:59 PM   #8
this213
Member
 
Registered: Dec 2001
Location: ./
Distribution: Fedora, CentOS, RHEL, Gentoo
Posts: 167

Rep: Reputation: 34
Quote:
Originally Posted by thanhvn
Chomp has a problem with windows-style terminated strings, i.e. \r\f.
Actually, it's \r\n, not \r\f. \f is a formfeed, not a newline. Anyway, chomp is operating exactly as it's supposed to, removing the trailing \n from a string. As long as you're aware of this, there isn't any problem; just strip out the leftover \r ($line =~ s/\r$//).

Quote:
Originally Posted by thanhvn
1) Setting $/

This means the script will only work correctly for one specific type of files (unix, windows, or mac). Certainly, this alternative will not work if you don't know ahead of time which type of files your script will have to deal with.
Exactly, so you need to normalize your incoming data so that it all has the expected format.
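One possible way to normalize on input, sketched here as an assumption rather than as the approach used in this thread, is to open the file with Perl's :crlf I/O layer, which translates \r\n pairs to \n as the file is read. The helper name read_lines_normalized and the temporary file are only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file into an array, translating \r\n to \n
# on the way in so that a plain chomp is enough afterwards.
sub read_lines_normalized {
    my ($path) = @_;
    open(my $fh, '<:crlf', $path) or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;
    return @lines;
}

# Build a small test file with mixed line endings.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                          # write the bytes exactly as given
print {$out} "XXX 10\r\nYYY 12\n";
close $out;

my @lines = read_lines_normalized($path);
print "[$_]\n" for @lines;             # prints [XXX 10] then [YYY 12]
```

This covers Windows and unix files. A file that uses a lone \r as its line ending (old Mac style) would still need separate handling.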

Quote:
Originally Posted by thanhvn
However, this highlights the problem of having to define (or redefine) common functions in Perl just to have my scripts work correctly cross-platform. If there are a dozen more like chomp, then I have to redefine them all for every single one of my scripts? Wouldn't it be better if the Perl language is implemented with cross-platform in mind instead of shifting this burden to its programmers?
No, I don't see M$ bending over backwards so their applications will run on Linux, but that's beside the point. There's quite a bit of functionality in Perl and M$ can't support half of it simply because it doesn't have the facilities (such as socket programming). Since Perl started on, is developed on and the majority of scripts run on Linux and Unix systems, that's where the focus of expansion of features resides. Further, since Perl "lives" in Linux, it has to at least try to be backward compatible as much as possible with earlier versions of Perl, and a new chomp function, a function which is in just about every script out there that reads a file, would probably (or at least possibly) break all of those scripts (since the original developers would already have dealt with it only removing \n if they had to). If you're writing a cross platform script, you just have to take that into consideration and code accordingly. I could perhaps see adding in a Linux-safe alternative function to chomp, but not in replacing chomp altogether. The same goes for any other function that works just fine under *nix and not Windows - in fact, it wouldn't surprise me at all if there's already a Perl module that includes this functionality.

Every single developer I know also keeps a collection of code snippets, no matter what language they work in. If you're going to be writing a lot of cross platform scripts, perhaps you should invest some time into creating a few subs to keep around and just include in your scripts as needed.

Now, on to your code. To begin with, you need to work on your formatting, especially, as has already been pointed out, hit the enter key every once in a while (as in, after every ; { or }). Your 4th line down in the OP is missing a semi-colon on the end. Finally, you look as though you're trying to fit as much code as possible into a small space; putting multiple statements on one line, using single character variables and so forth. If you keep this up, you're going to get lost fast when you start writing larger scripts. Each statement should be on its own line, and each variable should have a descriptive name, this isn't C.

The following code solves your issue and does so in a clear and concise manner. Anyone with passing knowledge of Perl should be able to read this without difficulty. This not only makes troubleshooting easier, but getting used to writing code like this allows you to see at a glance what the script is doing.
Code:
#!/usr/bin/perl -w
use strict;

my $keysfile = 'file1.txt'; ## the file containing the keys ##
my $datafile = 'file2.txt'; ## the file containing data ##

# this is boilerplate code for reading a file into an array
open(FILE, '<', $keysfile) || die "Cannot open $keysfile: $!";
my @keys = <FILE>;
close FILE;
chomp(@keys);
s/\r$// for @keys; # chomp only removes \n, so drop a Windows \r as well

# ...so do it again for the data file
open(FILE, '<', $datafile) || die "Cannot open $datafile: $!";
my @data = <FILE>;
close FILE;
chomp(@data);
s/\r$// for @data;

# Now it doesn't matter what each line was terminated with, because
# each array element holds only the content of its line. We're not searching
# for anything on the end of the element (except for maybe formatting later).

# you have 2 *sane* options here: Either spin the keys array and
# regex match it to the data array, or split the data into a hash.
# Personally, I think a hash is perfect for this because you can do a direct match
# in this situation
my %datahash = ();
foreach my $el (@data){
    my($key,$value) = split(/\s+/,$el);
    next unless defined $key; # skip blank lines
    $datahash{$key} = $value;
}

# Finally, all you need to do now is spin through the keys array
# and match that to the data hash. If you opted to not assign the data to a hash,
# you would simply perform a regex on the @data array here instead of the match
# to the %datahash key.
foreach my $key (@keys){
    my $data_value = ''; # reinit with each pass
    # test with defined rather than truth so that a value of 0 still counts
    if( defined $datahash{$key} && $datahash{$key} ne '' ){
        $data_value = $datahash{$key};
        # now you have $key and $data_value and you can handle them
        # however you want
        print $key.' = '.$data_value."\n";
    }
}

exit;
OK, I'm done now
 
  

