LinuxQuestions.org
LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Old 04-22-2015, 01:41 AM   #1
MisterJellyBeans
LQ Newbie
 
Registered: Apr 2015
Posts: 4

Rep: Reputation: Disabled
Remove lines that are subsets of other lines in File


Hello everyone,


Although it seems easy, I've been stuck on this problem for a while now and I can't figure out a way to get it done.

My problem is the following:

I have a file where each line is a sequence of IP addresses, example :

Line 1: 10.0.0.1 10.0.0.2
Line 2: 10.0.0.5 10.0.0.1 10.0.0.2
...

What I'd like to do is remove lines that are completely contained in other lines. In the previous example, "Line 1" would be deleted, as all of its addresses appear in "Line 2".

So far, I've worked with Python and set() objects to get the job done, but I've got more than 100K lines and the set lookups become time consuming as the program runs :/

Thanks for your help
 
Old 04-22-2015, 03:16 AM   #2
RMLinux
Member
 
Registered: Jul 2006
Posts: 260

Rep: Reputation: 37
copy the file and try this first.
# sed '/10\.0\.0\.2/d' source.txt > destination.txt

then try to open destination.txt.
 
Old 04-22-2015, 03:42 AM   #3
MisterJellyBeans
LQ Newbie
 
Registered: Apr 2015
Posts: 4

Original Poster
Rep: Reputation: Disabled
Hi,


The command sed '/10.0.0.2/d' source.txt > destination.txt will delete every line containing 10.0.0.2, regardless of whether that line also appears inside another line. In my situation, only lines that are subsets of other lines should be removed.

Thanks for the reply
 
Old 04-22-2015, 03:48 AM   #4
RMLinux
Member
 
Registered: Jul 2006
Posts: 260

Rep: Reputation: 37
work out the right regex expression and insert it between the delimiters in '/[REGEX expression here]/d'

or try vi/vim:

:g/10.0.0.2/d

Last edited by RMLinux; 04-22-2015 at 03:53 AM.
 
Old 04-22-2015, 07:29 AM   #5
eklavya
Member
 
Registered: Mar 2013
Posts: 619

Rep: Reputation: 136
Suppose your file is source.txt. First back up the file by making a copy, source_copy.txt.
Now run this script.
Code:
#!/bin/bash
filename=/path/of/source.txt
# Walk from the last line to the first so that deleting a line does
# not shift the numbers of the lines still to be checked.
i=$(wc -l < "$filename")
while [ "$i" -ge 1 ]
do
    line=$(sed -n "${i}p" "$filename")
    # Count how many lines contain this line's text as a substring.
    count=$(grep -cF "$line" "$filename")
    if [ "$count" -gt 1 ]
    then
        sed -i "${i}d" "$filename"
    fi
    i=$((i - 1))
done
Now check the output in source.txt
 
Old 04-22-2015, 08:15 AM   #6
MisterJellyBeans
LQ Newbie
 
Registered: Apr 2015
Posts: 4

Original Poster
Rep: Reputation: Disabled
Hello eklavya,


Thanks for your help. I've tried your script but unfortunately, the procedure is very slow. I've got a 7 MB file (91K lines) and after running for 2 minutes, fewer than 2K redundant lines had been deleted :/. Normally, the resulting file should contain around 50K lines. Anyway, I appreciate your help
 
Old 04-22-2015, 08:44 AM   #7
JeremyBoden
Senior Member
 
Registered: Nov 2011
Distribution: Debian
Posts: 1,149

Rep: Reputation: 237
It's not an easy thing to do:-

Essentially, first you need to sort each line in the file; then:-

compare the first line against subsequent lines, looking for a match on some substring of each of these lines
repeat for the next line - and so on.

Time to do this should be roughly proportional to the square of the number of lines in the file - disregarding program inefficiencies.

Try it on a 1,000 line file to get an estimate of run time.
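The quadratic approach described above can be sketched in Python (the OP's language; the function name and sample data are illustrative, not from the thread): treat each line as a set of IPs, check longer lines first, and keep a line only if no already-kept line contains all of its addresses.

```python
# Sketch of the pairwise subset-removal idea: longest lines first,
# so supersets are kept before their subsets are examined.
def remove_subset_lines(lines):
    sets = [frozenset(line.split()) for line in lines]
    # Visit indices longest-first so supersets are kept before subsets.
    order = sorted(range(len(lines)), key=lambda i: -len(sets[i]))
    kept = []  # indices of lines kept so far
    for i in order:
        if not any(sets[i] <= sets[j] for j in kept):
            kept.append(i)
    return [lines[i] for i in sorted(kept)]  # restore original order

print(remove_subset_lines(["10.0.0.1 10.0.0.2",
                           "10.0.0.5 10.0.0.1 10.0.0.2"]))
# -> ['10.0.0.5 10.0.0.1 10.0.0.2']
```

Equal lines count as subsets of each other, so exact duplicates are removed as well; the run time is still roughly quadratic in the number of lines, as the post says.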
 
Old 04-22-2015, 09:43 AM   #8
zhjim
Senior Member
 
Registered: Oct 2004
Distribution: Debian Squeeze x86_64
Posts: 1,748
Blog Entries: 11

Rep: Reputation: 233
Just two binaries needed. Take your time for it

Quote:
sort ./filename | uniq
Just have the output redirected into a new file.

Quote:
mom@rm:~/lan$ ls -lha
total 431M
drwxr-xr-x 2 mom mom 4.0K Apr 22 16:37 .
drwxr-xr-x 10 mom mom 4.0K Apr 22 16:07 ..
-rw-r--r-- 1 mom mom 338 Sep 3 2014 ap
-rw-r--r-- 1 mom mom 392M Apr 22 16:38 dump
-rw-r--r-- 1 mom mom 39M Apr 22 16:38 new
-rw-r--r-- 1 mom mom 23K Sep 3 2014 scan
mom@rm:~/lan$ man sort
mom@rm:~/lan$ time sort ./new | uniq
Host is up (0.00051s latency).
Host is up (0.00054s latency).
Host is up (0.00058s latency).
MAC Address: 00:183:07 (Microsoft)
MAC Address: 00:15:83:0A (Microsoft)
MAC Address: 00:21DD:76:90 (IBM)
Nmap scan report for mom (12.68.00.2)
Nmap scan report for mom (12.68.00.1)
Nmap scan report for mom (12.68.00.3)

real 0m8.720s
user 0m8.289s
sys 0m0.176s
mom@rm:~/lan$ time sort ./dump | uniq
Host is up (0.00051s latency).
Host is up (0.00054s latency).
Host is up (0.00058s latency).
MAC Address: 00:153:07 (Microsoft)
MAC Address: 00:153:0A (Microsoft)
MAC Address: 00:276:90 (IBM)
Nmap scan report for mom (92.18.0.)
Nmap scan report for mom (92.18.0.)
Nmap scan report for mom (92.18.0.)

real 1m30.868s
user 1m25.913s
sys 0m1.416s
 
Old 04-22-2015, 09:56 AM   #9
eklavya
Member
 
Registered: Mar 2013
Posts: 619

Rep: Reputation: 136
If there are extra spaces or tabs between the data, the above script will not remove such lines and will count them as different lines.
Suppose
Quote:
Line 1: 10.0.0.1 10.0.0.2
Line 2: 10.0.0.5 10.0.0.1 10.0.0.2
Line 3: 10.0.0.1 [ space bar ] 10.0.0.2
Line 4: 10.0.0.110.0.0.2
Line 5: [ space bar ] 10.0.0.1 [ space bar ] 10.0.0.2
For you, all these lines are the same, but the script will count them as different lines.
The same happens if the sequence of IPs changes, like
Quote:
10.0.0.1 10.0.0.2
[ space bar ] 10.0.0.2 [ space bar ] 10.0.0.1
Again, these lines are the same for you, but the script will not remove them; it will count them as different lines.
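If spacing and IP order should not matter, each line can be normalized before any comparison. A minimal Python sketch (the helper name is illustrative, not from the thread): split on any whitespace and sort the tokens.

```python
# Normalize a line so extra spaces/tabs and IP order don't matter:
# split() collapses any run of whitespace, sorted() fixes the order.
def normalize(line):
    return " ".join(sorted(line.split()))

print(normalize("  10.0.0.2 \t 10.0.0.1 "))
# -> 10.0.0.1 10.0.0.2
```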

Last edited by eklavya; 04-22-2015 at 10:01 AM.
 
Old 04-22-2015, 10:26 AM   #10
MisterJellyBeans
LQ Newbie
 
Registered: Apr 2015
Posts: 4

Original Poster
Rep: Reputation: Disabled
In fact, I'm already doing this (sort -u input > output), but that only removes exact duplicate lines.

Also, the data is perfectly formatted, with exactly one space between IP addresses and a newline after the last IP address. I've thought about using a tree-like data structure (a trie, maybe) but this would be like using the H-bomb on a "dumb" problem ...
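As a lighter alternative to a trie, one way to cut down the pairwise work is an inverted index from each IP to the lines containing it; then a line only needs to be compared against lines sharing its rarest IP. This is a hedged sketch (the function name and the indexing idea are not from the thread), not a definitive implementation.

```python
# Hypothetical sketch: inverted index (ip -> line indices) to prune
# subset checks. A line is dropped if some other line contains all of
# its IPs; ties between identical lines keep the earlier one.
from collections import defaultdict

def remove_subset_lines_indexed(lines):
    sets = [frozenset(line.split()) for line in lines]
    index = defaultdict(set)  # ip -> indices of lines containing it
    for i, s in enumerate(sets):
        for ip in s:
            index[ip].add(i)
    kept = []
    for i, s in enumerate(sets):
        if not s:
            continue  # skip blank lines
        # Any superset of this line must contain its rarest IP.
        rarest = min(s, key=lambda ip: len(index[ip]))
        dominated = any(
            j != i and s <= sets[j] and (len(s) < len(sets[j]) or j < i)
            for j in index[rarest]
        )
        if not dominated:
            kept.append(lines[i])
    return kept

print(remove_subset_lines_indexed(["10.0.0.1 10.0.0.2",
                                   "10.0.0.5 10.0.0.1 10.0.0.2"]))
# -> ['10.0.0.5 10.0.0.1 10.0.0.2']
```

On data where most IPs appear in few lines, each candidate list stays short, so this should scale much better than comparing every pair of lines.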
 
Old 04-22-2015, 12:40 PM   #11
eklavya
Member
 
Registered: Mar 2013
Posts: 619

Rep: Reputation: 136Reputation: 136
There are 90,000 lines in the file. Can you paste at least 50 lines of the real file here, not dummy data?
 
Old 04-23-2015, 03:25 AM   #12
zhjim
Senior Member
 
Registered: Oct 2004
Distribution: Debian Squeeze x86_64
Posts: 1,748
Blog Entries: 11

Rep: Reputation: 233
Just nuke them
 
Old 04-23-2015, 03:54 AM   #13
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 840

Rep: Reputation: 380
Is the order of the lines in the output, and the order of IPs within the lines, significant?
 
Old 04-23-2015, 05:17 AM   #14
yo8rxp
Member
 
Registered: Jul 2009
Location: Romania
Distribution: Ubuntu 10.04 Gnome 2
Posts: 102

Rep: Reputation: 30
implement this

enumerate the lines in that file

: > empty   # start with an empty file
while read -r line
do
    # the shell splits each line on spaces; extract each IP and
    # compare it against the "empty" file: if it is already there,
    # skip it; if not, then populate the file with the new IP
    for ip in $line
    do
        if ! grep -qxF "$ip" empty
        then
            echo "$ip" >> empty
        fi
    done
done < file

# when done, the "empty" file will contain only the unique IPs; you
# can then overwrite your file with this new "empty" one

this is just a guideline, it is not a polished script

OR, even simpler, replace all spaces with newlines:
Quote:
sed -e 's/\s\+/\n/g' your_file > new_file
sort new_file | uniq > your_file
rm new_file
Have fun , but backup any important files before testing !

Last edited by yo8rxp; 04-23-2015 at 06:23 AM.
 
Old 04-23-2015, 07:50 AM   #15
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 840

Rep: Reputation: 380
My attempt in Perl:

Code:
#!/usr/bin/perl

use strict;
use warnings;
use Array::Utils qw(:all);

my @ln;    # refs to the kept lines' IP arrays
LBL: while (<STDIN>) {
	my @a = split /\s+/;
	foreach (@ln) {
		# skip this line if it is a subset of an already-kept line
		next LBL if not array_minus(@a, @{$_});
		# empty any kept line that is a subset of this line
		@{$_} = () if not array_minus(@{$_}, @a);
	}
	push @ln, \@a;
}

# emptied (subset) lines are not printed
foreach (@ln) { print "@{$_}\n" if @{$_}; }
 
  

