Old 11-11-2009, 06:58 AM   #1
webs
LQ Newbie
 
Registered: Oct 2009
Posts: 7

Rep: Reputation: 0
awk, sed, gawk - the hardest Unix problem


Hi all,

I posted a problem up on here not long ago and was surprised at the response.

I have a really difficult problem. I have a huge dataset and it has some errors. It would take 5 days to re-create, so rather than do that I'm looking at fixing the data.

I have a CSV file of data which has 21 fields. Field 1 is a date field and field 2 is a timestamp with millisecond precision, for example 12:00:01.360.

Field 6 is an ID number and field 8 is the important count.


The data looks as follows:

field1, field2, field6, field8

date,timestamp,id,money

The file counts up through the rows in timestamp, ID order: the money grows as it goes through the file, starts again at 0 when it sees a new ID, and then counts up through the rows again.


Every so often I have a bad row where the timestamp and ID are correct, but the money is less than in the row before. When this happens I want to remove the row with the smaller money value.

I have also got a separate file with the timestamp, ID and money fields in it, in case I need it.

There are millions of rows so I can't use Excel, and the database messes up the timestamps. I wondered if anyone could think of a way of removing these lines?


I wondered if the separate file with the rows I want to delete could be used with a grep -v, but wouldn't you have to save each one and pipe it to the next field match?
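Something like this, maybe, if the separate file held the exact bad lines (file names made up):

Code:
# -F: treat the bad rows as fixed strings, -x: whole-line matches only,
# -v: keep every line that does NOT match a bad row
grep -vFx -f bad_rows.txt data.csv > cleaned.csv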

Can anyone think of a solution ?



Thanx

Rgds
WEBS
 
Old 11-11-2009, 07:13 AM   #2
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Describe your problem again, this time showing an exact sample of the input file, and show what your final output should look like. To be frank with you, I stopped halfway through reading your problem.
 
Old 11-11-2009, 07:45 AM   #3
webs
LQ Newbie
 
Registered: Oct 2009
Posts: 7

Original Poster
Rep: Reputation: 0
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.111,2615270,5314.88 <-- suspect row to delete
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61


This data shows fields 1, 2, 6 and 8 of the data file. I want to remove any row that looks out of sync based on field 8, which is an amount of money. The timestamps are grouped by the ID (field 6).

You can see the highlighted row is wrong: the ID is the same as the row before (2615270) but the money is less than the row before.

The row:
2007-11-19,21:05:23.861,6613277,456.61

is OK, as the ID is a new one and the money has started again.


Hope that makes it clearer




-WEBS
 
Old 11-11-2009, 09:05 AM   #4
pcunix
Member
 
Registered: Dec 2004
Location: MA
Distribution: Various
Posts: 149

Rep: Reputation: 23
I'd do this with Perl. There may be some convoluted way to do it with other tools, but this works:

(Not necessarily efficient Perl - showing it step by step to make it easy to follow)

Code:
#!/usr/bin/perl
# Drop any row whose money went down versus the previous kept row
# for the same ID; pass everything else through unchanged.
$previd="";
$prevamount=0;
while (<>) {
 ($date,$time,$id,$amount)=split /,/;
 # Numeric comparison (<), not string comparison (lt), so that
 # e.g. 496.24 vs 5344.78 compares correctly.
 next if ($id == $previd and $amount < $prevamount);
 print;
 $previd=$id;
 $prevamount=$amount;
}
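You'd run it along these lines (script and file names are placeholders):

Code:
perl cleanrows.pl data.csv > cleaned.csv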
 
Old 11-11-2009, 09:09 AM   #5
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
"out of sync" is not a sufficiently clear specification to design an algorithm or write a program.

Are you saying that, for a given ID (example 2615270), the field 8 value must increase for each increased timestamp and, if it does not, the line must be deleted? Is the data pre-sorted by ID and then by timestamp? Are all the ID+timestamp combinations unique?
 
Old 11-11-2009, 09:13 AM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by webs View Post
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.111,2615270,5314.88 <-- suspect row to delete
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61


This data shows fields 1, 2, 6 and 8 of the data file. I want to remove any row that looks out of sync based on field 8, which is an amount of money. The timestamps are grouped by the ID (field 6).

But you said you have 21 fields, so where are they? I can only see 4 fields, comma separated since it's CSV. Where are field 6 and field 8? Are there any other files you haven't shown?
 
Old 11-11-2009, 01:49 PM   #7
webs
LQ Newbie
 
Registered: Oct 2009
Posts: 7

Original Poster
Rep: Reputation: 0
Hi,

There are loads of fields in the file, but the others don't matter. I've shown fields 1, 2, 6 and 8, which are the important ones. I want to remove a line when the figure in the last field is less than the line before, if the 3rd field is the same ID number.




-WEBS
 
Old 11-11-2009, 01:51 PM   #8
webs
LQ Newbie
 
Registered: Oct 2009
Posts: 7

Original Poster
Rep: Reputation: 0
Catkin

That's a better way of explaining it. If the ID in the 3rd field is the same and the 4th field number is less than the line before, you need to remove this line.

Not an easy thing



WEBS
 
Old 11-11-2009, 01:57 PM   #9
pcunix
Member
 
Registered: Dec 2004
Location: MA
Distribution: Various
Posts: 149

Rep: Reputation: 23
Quote:
Originally Posted by webs View Post
Hi,

There are loads of fields in the file, but the others don't matter. I've shown fields 1, 2, 6 and 8, which are the important ones. I want to remove a line when the figure in the last field is less than the line before, if the 3rd field is the same ID number.


-WEBS
OK. The Perl script I gave you above does that. You'd adjust the split line for your actual fields, but other than that, it does the job.
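For example, assuming the full file really does have the ID in field 6 and the money in field 8 (a sketch, untested against your real data):

Code:
#!/usr/bin/perl
# Same logic as before, but splitting all 21 fields and picking
# out the ID (field 6) and money (field 8), counting fields from 1.
$previd="";
$prevamount=0;
while (<>) {
 my @f = split /,/;
 my ($id, $amount) = ($f[5], $f[7]);   # 0-based: fields 6 and 8
 next if ($id == $previd and $amount < $prevamount);
 print;
 $previd = $id;
 $prevamount = $amount;
}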
 
Old 11-11-2009, 01:59 PM   #10
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
You could use pcunix's Perl version, or pretty much "transliterate" it
into awk with little effort:
Code:
cat webs
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.111,2615270,5314.88
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61
awk -F, 'BEGIN{previ="";preva=0}{if($3==previ&&$4<preva){next}else{print};previ=$3;preva=$4}' webs
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61
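That's on the 4-field extract; on your real 21-field file, where the ID and money should be $6 and $8 if I've read you right, the same logic would be something like this (untested):

Code:
awk -F, '$6==previ && $8+0<preva {next} {print; previ=$6; preva=$8+0}' fullfile.csv > cleaned.csv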

Cheers,
Tink
 
Old 11-11-2009, 02:12 PM   #11
pcunix
Member
 
Registered: Dec 2004
Location: MA
Distribution: Various
Posts: 149

Rep: Reputation: 23
I figured someone would post an awk version.

I used to use awk, but once Perl started becoming ubiquitous, I gave it up.

It's funny: some people see awk as easy and Perl as hard. Others (like me) see Perl as much easier. With awk, I was constantly referring to my Awk and Sed book; with Perl I can usually just write what I want.

I wrote http://aplawrence.com/Unixart/awk-vs.perl.html just a few days ago and got comments like "I tried Perl, and used it for a few projects, but it just never clicked with me, especially all the quirky variable sigils and their usage."

For me it's just the other way around, which shows how different we all are.

But anyway: thanks to Tinkster, you can use whatever feels best to you.
 
Old 11-12-2009, 10:08 AM   #12
webs
LQ Newbie
 
Registered: Oct 2009
Posts: 7

Original Poster
Rep: Reputation: 0
Thanx All,

I'll use the awk one as I'm more comfortable using that. Trying it now.


Thanx again.


WEBS
 
  


