awk, sed gawk - the hardest unix problem.
Hi all,
I posted a problem up on here not long ago and was surprised at the response. I havea really difficult problem, I have a huge dataset and it has some errors. It will take5 days to re-create if I can fix the problem so Im looking at fixing the data. I have a CSV file of data which has 21 fields. The file has a field 1 as a date field, field2 is a timestamp to 3 miliseconds. example 12:00:01.360 Field6 is a number and field 8 is the important count. The data looks as follows: field1, field2 , field6, field8 date,timestamp,id,money The file counts through the rows in the timestamp, id order and the money grows as it goes through the file and then starts agin at 0 when it sees a new ID and then the money counts up through the rows. Every so often I have a bad row where the timestamp and ID is correct, but the money is less than the row before. When this happens I want to remove the row which has the less money in. I have got a separate file which the timestamp, ID and money fields in them in case I needed it. There are millions of rows so I cant use Excel, the database mess up with the timestamps and I wondered if anyone could think of a way of removing these lines ? I wondered if the separate file which the rows I want to delete with could be used with a grep -v ??? but you would have to save each one and pipe it to the next field matches ? Can anyone think of a solution ? Thanx Rgds WEBS |
describe your problem again , this time showing a exact sample of input file and show how your final output will be like. To be frank with you, I stop half way reading your problem.
|
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78 2007-11-19,21:05:29.111,2615270,5314.88 <-- suspect row to delete 2007-11-19,21:05:29.884,2615270,5394.47 2007-11-19,21:05:23.861,6613277,456.61 2007-11-19,21:05:28.861,6613277,496.24 2007-11-19,21:05:29.861,6613277,536.61 This data is showing field 1,2,6,8 of a data file. I want to remove any row that looks out of sync based on field8 which is an amount of money. The timestamps grouped by the id ( field6). Can you see the highlighted row is wrong as the id is the same as the row before ( 2615270 ) but the money is less then the row before. The row : 2007-11-19,21:05:23.861,6613277,456.61 is ok as the id is a new one and the money started again ? Hope that makes it clearer -WEBS |
I'd do this with Perl. There may be some convoluted way to do it with other tools, but this works:
(Not necessarily efficient Perl - showing it step by step to make it easy to follow) Code:
#!/usr/bin/perl |
"out of sync" is not a sufficiently clear specification to design an algorithm or write a program.
Are you saying that for a given ID (example 2615270) that the field 8 value must increase for each increased timestamp and, if it does not, the line must be deleted? Is the data pre-sorted by ID and then by timestamp? Are all the ID+timestamp combinations unique? |
Quote:
|
Hi,
There are loads of fields in the file and doesnt matter about them, Ive shown field 1,2,6,8 which are the important ones. i want to remove a line when the figure in the last field is less than the line before if the 3rd field is the same ID number. -WEBS |
Catkin
Thats a better way of explaining it. If the ID in the 3rd field is the same and the 4tf Field number is less than the line before you need to remove this line. Not an easy thing WEBS |
Quote:
|
You could use pcunix perl version, or pretty much "transliterate" it
into awk w/ little effort: Code:
cat webs Cheers, Tink |
I figured someone would post an awk version.
I used to use awk, but once Perl sarted becoming ubiquitous, I gave it up. It's funny: some people see awk as easy and Perl as hard. Others (like me) see Perl as much easier. With awk, I was constantly referring to my Awk and Sed book; with Perl I can usually just write what I want. I wrote http://aplawrence.com/Unixart/awk-vs.perl.html just a few days ago and got comments like "I tried Perl, and used it for a few projects, but it just never clicked with me, especially all the quirky variable sigils and their usage." For me it's just the other way around, which shows how different we all are. But anyway: thanks to Tinkster, you can use whatever feels best to you. |
Thanx All,
I'll use the awk one as Im more comfortable in using that. Trying it now. Thanx again. WEBS |
All times are GMT -5. The time now is 03:02 PM. |