LinuxQuestions.org
Old 03-13-2011, 12:00 PM   #1
BarataPT
LQ Newbie
 
Registered: Mar 2011
Posts: 12

Rep: Reputation: 1
Perl - Regex with Array Elements


Hi,

I have .txt.gz files that store queries made on a browser, and my job is to analyze them.

The information is stored in an XML-like format.
Quote:
<browser>lwp-trivial/1.41</browser>
<http_code>200</http_code>
<keywords />
<city>Seattle</city>
Right now I'm analyzing the browser tag because I have to remove queries made by robots, spiders, etc.

My idea is to compare the contents of browser tag with an array that stores names of crawlers, spiders, feed readers etc.

Something like this:

Code:
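use strict;
use warnings;

my @myCrawlers = ("80legs", "bingbot");
while (my $line = <STDIN>) {
    # Only lines that carry a <browser> tag are of interest.
    next unless $line =~ m{^<browser>(.*)</browser>};
    my $agent = $1;    # copy $1 before another match can overwrite it
    foreach my $elem (@myCrawlers) {
        if ($agent =~ /$elem/) {
            doSomething($agent);    # placeholder for the real handling
        }
    }
}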
Note that there are about 140,000,000 queries, and the array will contain many crawlers, spiders, etc.

Is there a more efficient way to do this?
 
Old 03-13-2011, 01:30 PM   #2
timetraveler
Member
 
Registered: Apr 2010
Posts: 243
Blog Entries: 2

Rep: Reputation: 31
More efficient in what way? Time, money, CPU, memory, disk, patience?
You should check for bot existence by using a hash instead of iterating over an array.

my %bots = ( '80legs' => 1, 'bingbot' => 1 );

while (<STDIN>) {
    next unless m{^<browser>(.+)</browser>};
    my $elem = $1;
    # A hash lookup is a single step, but it only hits when $elem is exactly a key.
    if ($bots{$elem}) {
        doit();    # placeholder for the real handling
    }
}

Check out File::Map as well.
 
Old 03-13-2011, 02:24 PM   #3
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Are the contents in XML?
 
Old 03-13-2011, 02:30 PM   #4
BarataPT
LQ Newbie
 
Registered: Mar 2011
Posts: 12

Original Poster
Rep: Reputation: 1
Thanks for answering.

No, they are stored in .txt.gz files.

Since most user-agent strings contain the words crawler, spider, or bot, putting those words first in the data structure will save a lot of time. Beyond those, I'm only adding names that don't contain any of these words.
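
The ordering described above could be sketched roughly like this: test the three generic words with one regex first, and fall back to the list of specific names only when that fails. The case-insensitive match, the `is_robot` name, and the entries in `@specific` are assumptions for illustration, not something taken from the actual data.

```perl
use strict;
use warnings;

# Generic words that, per the post above, cover most automated user-agents.
# Matching case-insensitively is an assumption.
my $generic = qr/crawler|spider|bot/i;

# Illustrative extra names that contain none of the generic words.
my @specific = ('80legs', 'lwp-trivial');

# Returns 1 if the user-agent string looks like an automated client, else 0.
sub is_robot {
    my ($agent) = @_;
    return 1 if $agent =~ $generic;    # common case, checked first
    for my $name (@specific) {
        # Plain substring test, no regex needed for literal names.
        return 1 if index(lc $agent, lc $name) >= 0;
    }
    return 0;
}

for my $agent ('Mozilla/5.0 (compatible; bingbot/2.0)',
               'Mozilla/5.0 (X11; Linux x86_64) Firefox/3.6') {
    print is_robot($agent) ? "robot" : "human", ": $agent\n";
}
```

Whether the generic words really catch most of the traffic is something only a count over the real logs can confirm.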
 
Old 03-13-2011, 03:39 PM   #5
BarataPT
LQ Newbie
 
Registered: Mar 2011
Posts: 12

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by timetraveler
More efficient in what way? Time, money, CPU, memory, disk, patience?
You should check for bot existence by using a hash instead of iterating over an array.

my %bots = ( '80legs' => 1, 'bingbot' => 1 );

while (<STDIN>) {
    next unless m{^<browser>(.+)</browser>};
    my $elem = $1;
    # A hash lookup is a single step, but it only hits when $elem is exactly a key.
    if ($bots{$elem}) {
        doit();    # placeholder for the real handling
    }
}

Check out File::Map as well.
timetraveler, that would work fine if I stored the exact content of the user-agent string, but I am only storing a word that clearly identifies each user-agent, which I then use in the regex. A hash lookup would be faster, but many bots have different versions and identification names, so it is difficult to collect all the exact user-agent strings.
 
Old 03-13-2011, 03:42 PM   #6
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Quote:
Originally Posted by BarataPT
Thanks for answering.

No, they are stored in .txt.gz files.

Since most user-agent strings contain the words crawler, spider, or bot, putting those words first in the data structure will save a lot of time. Beyond those, I'm only adding names that don't contain any of these words.
You can compress/archive XML using any archiver/compressor, including 'tar' and 'gzip'.

So, my question (again) is: what is the format of the unpacked file? Is it XML?

Anyway, XML parsers already exist: http://search.cpan.org/search?query=XML+parser&mode=all . Even if it is not XML, it is so-called balanced text (as far as I can see): http://search.cpan.org/search?query=...anced&mode=all -> http://search.cpan.org/~adamk/Text-B...xt/Balanced.pm .
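
Whatever the inner format turns out to be, the gzip layer can be read as a stream without unpacking to disk first. A minimal sketch using the core module IO::Uncompress::Gunzip, assuming one tag per line as in the sample from the first post; the `browsers_in` helper and the temporary sample file are made up here so the snippet runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Build a tiny sample file so the sketch is self-contained
# (the records are the ones quoted in the first post).
my $sample = "<browser>lwp-trivial/1.41</browser>\n"
           . "<http_code>200</http_code>\n"
           . "<keywords />\n"
           . "<city>Seattle</city>\n";
my ($fh, $path) = tempfile(SUFFIX => '.txt.gz', UNLINK => 1);
close $fh;
gzip(\$sample => $path) or die "gzip failed: $GzipError";

# Stream the compressed file line by line and collect the <browser> values.
sub browsers_in {
    my ($file) = @_;
    my $z = IO::Uncompress::Gunzip->new($file)
        or die "gunzip failed: $GunzipError";
    my @found;
    while (defined(my $line = $z->getline())) {
        push @found, $1 if $line =~ m{^<browser>(.*)</browser>};
    }
    $z->close();
    return @found;
}

print "$_\n" for browsers_in($path);
```

Piping `gzip -dc file.txt.gz` into a script that reads STDIN, as the code in the first post does, is an equally reasonable route.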
 
Old 03-14-2011, 04:03 PM   #7
timetraveler
Member
 
Registered: Apr 2010
Posts: 243
Blog Entries: 2

Rep: Reputation: 31
So then
for (keys %bots) { dostuff() if $elem =~ /$_/; }
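
A possible alternative to looping over the names for every line is to join them into a single alternation and compile it once with `qr//`, so each user-agent string is tested with one match. This is only a sketch: `is_crawler` is a hypothetical helper, and `quotemeta` assumes the names are literal text rather than regex fragments.

```perl
use strict;
use warnings;

my @myCrawlers = ('80legs', 'bingbot');

# Join the names into one alternation and compile it once.
# quotemeta treats each name as literal text (an assumption; drop it
# if the list is meant to hold regex fragments).
my $alternation = join '|', map { quotemeta($_) } @myCrawlers;
my $crawler_re  = qr/$alternation/;

# Returns 1 when any listed name occurs in the user-agent string, else 0.
sub is_crawler {
    my ($agent) = @_;
    return $agent =~ $crawler_re ? 1 : 0;
}

for my $agent ('Mozilla/5.0 (compatible; bingbot/2.0)', 'lwp-trivial/1.41') {
    printf "%-40s %s\n", $agent, is_crawler($agent) ? 'crawler' : 'kept';
}
```

With 140 million lines it is worth benchmarking this against the per-name loop on a sample of the real data before committing to either.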

Last edited by timetraveler; 03-15-2011 at 02:52 AM. Reason: ..wrong answer...
 
All times are GMT -5.