11-11-2009, 07:58 AM | #1
LQ Newbie | Registered: Oct 2009 | Posts: 7
awk, sed, gawk - the hardest Unix problem.
Hi all,
I posted a problem up on here not long ago and was surprised at the response.
I have a really difficult problem: I have a huge dataset and it has some errors. It will take 5 days to re-create if I can't fix the problem, so I'm looking at fixing the data.
I have a CSV file of data which has 21 fields. Field 1 is a date field and field 2 is a timestamp with millisecond precision, e.g. 12:00:01.360.
Field 6 is an ID number and field 8 is the important count.
The data looks as follows:
field1, field2, field6, field8
date,timestamp,id,money
The file runs through the rows in timestamp order within each ID; the money grows as it goes through the file, then starts again at 0 when it sees a new ID and counts up through the rows again.
Every so often I have a bad row where the timestamp and ID are correct, but the money is less than in the row before. When this happens I want to remove the row with the smaller money value.
I have got a separate file with the timestamp, ID and money fields in it, in case I need it.
There are millions of rows so I can't use Excel, and the database messes up the timestamps. I wondered if anyone could think of a way of removing these lines?
I wondered if the separate file with the rows I want to delete could be used with a grep -v, but you would have to save each one and pipe it to the next field match? See the sketch below.
Can anyone think of a solution?
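(A sketch of that grep idea: it only works if the separate file holds each bad row as a complete line, exactly as it appears in the big file; bad_rows.csv, data.csv and cleaned.csv are hypothetical names.)
Code:
# -F: treat each pattern as a fixed string, not a regex
# -x: match whole lines only
# -v: keep only the lines that do NOT match
# -f: read the patterns (bad rows) from a file
grep -vxFf bad_rows.csv data.csv > cleaned.csv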
Thanx
Rgds
WEBS
11-11-2009, 08:13 AM | #2
Senior Member | Registered: Aug 2006 | Posts: 2,697
Describe your problem again, this time showing an exact sample of the input file and what the final output should look like. To be frank with you, I stopped halfway through reading your problem.
11-11-2009, 08:45 AM | #3
LQ Newbie | Registered: Oct 2009 | Posts: 7 | Original Poster
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.111,2615270,5314.88 <-- suspect row to delete
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61
This data shows fields 1, 2, 6 and 8 of the data file. I want to remove any row that looks out of sync based on field 8, which is an amount of money. The timestamps are grouped by the ID (field 6).
Can you see the highlighted row is wrong: the ID is the same as the row before (2615270) but the money is less than in the row before.
The row:
2007-11-19,21:05:23.861,6613277,456.61
is OK because the ID is a new one and the money has started again.
Hope that makes it clearer
-WEBS
11-11-2009, 10:05 AM | #4
Member | Registered: Dec 2004 | Location: MA | Distribution: Various | Posts: 149
I'd do this with Perl. There may be some convoluted way to do it with other tools, but this works:
(Not necessarily efficient Perl - showing it step by step to make it easy to follow)
Code:
#!/usr/bin/perl
# Remember the previous line's ID and amount; skip any line whose
# amount drops while the ID stays the same.
$previd = "";
$prevamount = 0;
while (<>) {
    ($date, $time, $id, $amount) = split /,/;
    # numeric < (not string lt) so 10.00 compares greater than 9.00
    next if ($id eq $previd and $amount < $prevamount);
    print;
    $previd = $id;
    $prevamount = $amount;
}
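To run it (a sketch; dropbad.pl is just a hypothetical name for the script above, and data.csv for your file):
Code:
perl dropbad.pl data.csv > cleaned.csv
With the real 21-field file, the split line would pick list elements 5 and 7 (fields 6 and 8), e.g. ($id, $amount) = (split /,/)[5,7];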
11-11-2009, 10:09 AM | #5
LQ 5k Club | Registered: Dec 2008 | Location: Tamil Nadu, India | Distribution: Debian | Posts: 8,578
"out of sync" is not a sufficiently clear specification to design an algorithm or write a program.
Are you saying that for a given ID (example 2615270) that the field 8 value must increase for each increased timestamp and, if it does not, the line must be deleted? Is the data pre-sorted by ID and then by timestamp? Are all the ID+timestamp combinations unique?
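(The pre-sorting question can be tested mechanically. A sketch against the four-field extract, assuming field 3 is the ID and fields 1 and 2 are the date and timestamp; sort -c only checks the order and reports the first out-of-place line, printing nothing if the file is already sorted.)
Code:
# check sort order: ID numerically, then date, then timestamp
sort -t, -c -k3,3n -k1,1 -k2,2 sample.csv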
11-11-2009, 10:13 AM | #6
Senior Member | Registered: Aug 2006 | Posts: 2,697
Quote:
Originally Posted by webs
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.111,2615270,5314.88 <-- suspect row to delete
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61
This data shows fields 1, 2, 6 and 8 of the data file. I want to remove any row that looks out of sync based on field 8, which is an amount of money. The timestamps are grouped by the ID (field 6).
Can you see the highlighted row is wrong: the ID is the same as the row before (2615270) but the money is less than in the row before.
The row:
2007-11-19,21:05:23.861,6613277,456.61
is OK because the ID is a new one and the money has started again.
Hope that makes it clearer
-WEBS
But you said you have 21 fields, so where are they? I can only see 4 fields, comma-separated since it's CSV. Where are field 6 and field 8? Are there any other fields you haven't shown?
11-11-2009, 02:49 PM | #7
LQ Newbie | Registered: Oct 2009 | Posts: 7 | Original Poster
Hi,
There are loads of fields in the file but they don't matter; I've shown fields 1, 2, 6 and 8, which are the important ones. I want to remove a line when the figure in the last field is less than in the line before, if the 3rd field is the same ID number.
-WEBS
11-11-2009, 02:51 PM | #8
LQ Newbie | Registered: Oct 2009 | Posts: 7 | Original Poster
Catkin,
That's a better way of explaining it. If the ID in the 3rd field is the same and the 4th field's number is less than in the line before, you need to remove that line.
Not an easy thing
WEBS
11-11-2009, 02:57 PM | #9
Member | Registered: Dec 2004 | Location: MA | Distribution: Various | Posts: 149
Quote:
Originally Posted by webs
Hi,
There are loads of fields in the file but they don't matter; I've shown fields 1, 2, 6 and 8, which are the important ones. I want to remove a line when the figure in the last field is less than in the line before, if the 3rd field is the same ID number.
-WEBS
OK. The Perl script I gave you above does that. You'd adjust the split line for your actual fields, but other than that, it does the job.
11-11-2009, 02:59 PM | #10
Moderator | Registered: Apr 2002 | Location: earth | Distribution: slackware by choice, others too :} ... android. | Posts: 23,067
You could use pcunix's Perl version, or pretty much "transliterate" it into awk with little effort:
Code:
$ cat webs
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.111,2615270,5314.88
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61
$ awk -F, 'BEGIN { previ=""; preva=0 } { if ($3==previ && $4<preva) { next } else { print }; previ=$3; preva=$4 }' webs
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61
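Against the real 21-field file, the same idea would read fields 6 and 8 instead of 3 and 4 (a sketch; data.csv and cleaned.csv are hypothetical names):
Code:
# skip a row when the ID repeats but the money drops;
# otherwise print it and remember this row's ID and money
awk -F, '$6 == previ && $8 < preva { next }
         { print; previ = $6; preva = $8 }' data.csv > cleaned.csv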
Cheers,
Tink
11-11-2009, 03:12 PM | #11
Member | Registered: Dec 2004 | Location: MA | Distribution: Various | Posts: 149
I figured someone would post an awk version.
I used to use awk, but once Perl started becoming ubiquitous, I gave it up.
It's funny: some people see awk as easy and Perl as hard. Others (like me) see Perl as much easier. With awk, I was constantly referring to my Awk and Sed book; with Perl I can usually just write what I want.
I wrote http://aplawrence.com/Unixart/awk-vs.perl.html just a few days ago and got comments like "I tried Perl, and used it for a few projects, but it just never clicked with me, especially all the quirky variable sigils and their usage."
For me it's just the other way around, which shows how different we all are.
But anyway: thanks to Tinkster, you can use whatever feels best to you.
11-12-2009, 11:08 AM | #12
LQ Newbie | Registered: Oct 2009 | Posts: 7 | Original Poster
Thanx All,
I'll use the awk one as I'm more comfortable using that. Trying it now.
Thanx again.
WEBS