Old 04-01-2009, 09:14 AM   #1
mrealty
LQ Newbie
 
Registered: Apr 2009
Posts: 6

Rep: Reputation: 0
Perl question: delete line from text file with duplicate match at beginning of line


Hi all:

Was wondering if any Perl gurus could help me with a quick log file adjustment. I have a text file that looks like this (tabs and newlines are shown explicitly so you can see what separates the data):

1234 {tab} purchase {tab} sale {newline}
4567 {tab} broken {tab} sale {newline}
4588 {tab} theft {tab} misc {newline}
1234 {tab} purchase {tab} audit {newline}

There are maybe 100 lines of text in this file at any given time. I need to delete all duplicate lines, looking only at the text before the first tab. It doesn't matter which one gets deleted, as long as no two lines begin with the same text before the first tab. So in this example, either the first "1234" line or the last "1234" line would need to be deleted. I already have code in my script that opens the files - I just need the code to read the text into an array, find matches based on the above criteria, and make the deletions.

If it would be easier, I can even do a system call and use SED (v4.1.5) and/or AWK (3.1.5) instead.

With kind regards.
 
Old 04-01-2009, 09:25 AM   #2
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Think about your problem from a different angle. Consider the first field (e.g. "1234") as a hash key, and the rest of the line as its value.

If you build your hash this way, then since keys are unique, there will be exactly one line per key: the key followed by that key's value.
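
A minimal sketch of why this works, using the sample data from the first post. The hash name %records is an assumption for illustration, not something from the thread. Assigning to a key that already exists replaces the old value rather than adding a second entry:

```perl
use strict;
use warnings;

# %records is a hypothetical name; keys are the first field,
# values are the rest of the line.
my %records;
$records{'1234'} = "purchase\tsale";
$records{'4567'} = "broken\tsale";
$records{'4588'} = "theft\tmisc";
$records{'1234'} = "purchase\taudit";   # same key again: replaces the earlier value

# Three distinct keys remain, and 1234 holds the last value assigned.
print scalar(keys %records), "\n";
print "1234\t$records{'1234'}\n";
```

Note that the later "1234" line wins here, which is fine for this problem since it doesn't matter which duplicate is dropped.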
 
Old 04-01-2009, 10:00 AM   #3
mrealty
LQ Newbie
 
Registered: Apr 2009
Posts: 6

Original Poster
Rep: Reputation: 0
Thanks for the speedy reply.

I see where you're coming from, but I still don't see the whole picture. I'm not clear on the conditions. Read a line; if the hash value exists, then that section of the array gets that line, but what if the hash value does not exist? Then insert? I'm not sure how to read it in to begin with. Do I need two arrays?

With kind regards.
 
Old 04-01-2009, 10:25 AM   #4
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
There is no array, and there is no "if" - just add the hash key => value unconditionally.

I.e.
  1. read the line;
  2. split it into the first field and the rest;
  3. unconditionally insert the first_field => the_rest into the hash.
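
The three steps above could be sketched roughly as follows. The subroutine name dedupe_by_first_field and the variable names are made up for illustration, and the sketch assumes the fields are tab-separated as described in the first post:

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of lines and returns a hash
# mapping first field => rest of the line.
sub dedupe_by_first_field {
    my @lines = @_;
    my %rest_of;
    for my $line (@lines) {
        chomp $line;                                   # 1. read the line (drop the newline)
        my ($first, $rest) = split /\t/, $line, 2;     # 2. split at the first tab only
        next unless defined $first && length $first;   # ignore empty lines
        $rest_of{$first} = defined $rest ? $rest : ''; # 3. insert unconditionally
    }
    return %rest_of;
}

my %rest_of = dedupe_by_first_field(
    "1234\tpurchase\tsale\n",
    "4567\tbroken\tsale\n",
    "4588\ttheft\tmisc\n",
    "1234\tpurchase\taudit\n",
);
print "$_\t$rest_of{$_}\n" for sort keys %rest_of;
```

The third argument to split (the limit of 2) keeps the remaining tabs inside the value, so the rest of the line can be printed back unchanged.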
 
Old 04-01-2009, 11:46 AM   #5
mrealty
LQ Newbie
 
Registered: Apr 2009
Posts: 6

Original Poster
Rep: Reputation: 0
Sorry, I'm old school and don't remember a hash data structure from the language I learned (Turbo Pascal, about 20 years ago). This is about 2 lines of code, correct?

I don't know how to "say" that pseudocode in Perl.

With kind regards.
 
Old 04-01-2009, 12:17 PM   #6
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Yes, it's about two lines in Perl.

Have you started learning Perl at all? I.e., have you written any Perl code with hashes?
 
Old 04-01-2009, 01:01 PM   #7
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
Here's a version with some commentary. It's more than two lines, but I'm not a fan of shortest possible code for its own sake.
Code:
#!/usr/bin/env perl
use strict;
use warnings;

my %file_hash;

while (<>) {
  next unless m/^\d/; # line doesn't begin with a digit; skip

  # split line into the leading digits and everything else;
  # assign the digits to $key and everything else to $value
  # (. does not match the newline, so $value excludes it)
  my ($key, $value) = m/^(\d+)(.*)/;

  # each line becomes one entry in the hash %file_hash;
  # since hash keys must be unique, any repeats overwrite
  # the previous duplicate (ie, the second 1234 overwrites
  # the first 1234 and the third would overwrite the second)
  $file_hash{$key} = $value;
}

# go through the hash and print out what's left
# (keys come back in no particular order, so the output
# will not necessarily match the order of the input file)
for my $key (keys %file_hash) {
  print $key, $file_hash{$key}, "\n";
}
If you save this code as, say, file_fixer, you can then run it by typing perl file_fixer file-name, substituting the name of your file for file-name. The output will print to your terminal. If the output is sane, you can save it with redirection in the shell: perl file_fixer file-name > new-file
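
The hash version above keeps the last of any duplicates and prints the lines in whatever order the hash returns them. If the original line order matters, a different and commonly used technique is to remember which keys have already been seen and keep only the first line for each. This is a sketch of that alternative, not the code from this post; the subroutine name keep_first_occurrence is made up for illustration, and it assumes tab-separated fields:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines whose first tab-separated
# field has not been seen before, in their original order.
sub keep_first_occurrence {
    my %seen;
    return grep {
        my ($first) = split /\t/, $_, 2;       # text before the first tab
        defined($first) && !$seen{$first}++;   # true only the first time
    } @_;
}

print keep_first_occurrence(
    "1234\tpurchase\tsale\n",
    "4567\tbroken\tsale\n",
    "4588\ttheft\tmisc\n",
    "1234\tpurchase\taudit\n",
);
```

With the sample data, the first three lines are printed and the second "1234" line is dropped.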
 
Old 04-01-2009, 07:46 PM   #8
mrealty
LQ Newbie
 
Registered: Apr 2009
Posts: 6

Original Poster
Rep: Reputation: 0
Telemachos! Thank you! You are a valuable asset to this forum. It worked beautifully (with a slight modification, as I didn't mention there was a header line in that file, but all is well). Thank you so much for explaining it too.

With kind regards.
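
The modification for the header line is not shown in the thread. One possible approach, assuming the header is always the first line of the file, is to set that line aside and de-duplicate only the remaining lines. The subroutine name dedupe_keep_header and the header text in the example are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: passes the first line through untouched and
# de-duplicates the rest by the text before the first tab.
# The last duplicate wins, as in the hash approach above.
sub dedupe_keep_header {
    my ($header, @body) = @_;
    my %line_for;
    for my $line (@body) {
        my ($key) = split /\t/, $line, 2;
        $line_for{$key} = $line if defined $key;
    }
    # Sorting the keys gives a stable output order.
    return ($header, map { $line_for{$_} } sort keys %line_for);
}

print dedupe_keep_header(
    "id\ttype\tcategory\n",          # made-up header line
    "1234\tpurchase\tsale\n",
    "4567\tbroken\tsale\n",
    "4588\ttheft\tmisc\n",
    "1234\tpurchase\taudit\n",
);
```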
 
  

