Old 11-11-2009, 06:58 AM   #1
webs
LQ Newbie
 
Registered: Oct 2009
Posts: 7

Rep: Reputation: 0
awk, sed, gawk - the hardest Unix problem


Hi all,

I posted a problem up on here not long ago and was surprised at the response.

I have a really difficult problem. I have a huge dataset and it has some errors. It would take 5 days to re-create, so rather than do that I'm looking at fixing the data.

I have a CSV file of data which has 21 fields. Field 1 is a date field and field 2 is a timestamp with millisecond precision, for example 12:00:01.360.

Field 6 is an ID number and field 8 is the important count.


The data looks as follows:

field1, field2, field6, field8

date,timestamp,id,money

The file counts up through the rows in timestamp, ID order: the money grows as it goes through the file, starts again at 0 when it sees a new ID, and then counts up through the rows again.


Every so often I have a bad row where the timestamp and ID are correct, but the money is less than in the row before. When this happens I want to remove the row with the smaller money value.

I have also got a separate file with the timestamp, ID and money fields in it, in case I need it.

There are millions of rows so I can't use Excel, and the database messes up the timestamps. I wondered if anyone could think of a way of removing these lines?


I wondered if the separate file with the rows I want to delete could be used with a grep -v, but wouldn't you have to save each one and pipe it to the next field match?
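Something like this, maybe, if the separate file held the exact bad lines (file names made up):

Code:
# -F: treat the bad rows as fixed strings, -x: whole-line matches only,
# -v: keep every line that does NOT match a bad row
grep -vFx -f bad_rows.txt data.csv > cleaned.csv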

Can anyone think of a solution ?



Thanx

Rgds
WEBS
 
Old 11-11-2009, 07:13 AM   #2
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Describe your problem again, this time showing an exact sample of the input file, and show what your final output should look like. To be frank with you, I stopped halfway through reading your problem.
 
Old 11-11-2009, 07:45 AM   #3
webs
LQ Newbie
 
Registered: Oct 2009
Posts: 7

Original Poster
Rep: Reputation: 0
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.111,2615270,5314.88 <-- suspect row to delete
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61


This data shows fields 1, 2, 6 and 8 of the data file. I want to remove any row that looks out of sync based on field 8, which is an amount of money. The timestamps are grouped by the ID (field 6).

You can see the highlighted row is wrong: the ID is the same as the row before (2615270) but the money is less than the row before.

The row:
2007-11-19,21:05:23.861,6613277,456.61

is OK, as the ID is a new one and the money has started again.


Hope that makes it clearer




-WEBS
 
Old 11-11-2009, 09:05 AM   #4
pcunix
Member
 
Registered: Dec 2004
Location: MA
Distribution: Various
Posts: 149

Rep: Reputation: 23
I'd do this with Perl. There may be some convoluted way to do it with other tools, but this works:

(Not necessarily efficient Perl - showing it step by step to make it easy to follow)

Code:
#!/usr/bin/perl
# Drop any row whose money went down versus the previous kept row
# for the same ID; pass everything else through unchanged.
$previd="";
$prevamount=0;
while (<>) {
 ($date,$time,$id,$amount)=split /,/;
 # Numeric comparison (<), not string comparison (lt), so that
 # e.g. 496.24 vs 5344.78 compares correctly.
 next if ($id == $previd and $amount < $prevamount);
 print;
 $previd=$id;
 $prevamount=$amount;
}
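You'd run it along these lines (script and file names are placeholders):

Code:
perl cleanrows.pl data.csv > cleaned.csv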
 
Old 11-11-2009, 09:09 AM   #5
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
"out of sync" is not a sufficiently clear specification to design an algorithm or write a program.

Are you saying that, for a given ID (example 2615270), the field 8 value must increase for each increased timestamp and, if it does not, the line must be deleted? Is the data pre-sorted by ID and then by timestamp? Are all the ID+timestamp combinations unique?
 
Old 11-11-2009, 09:13 AM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by webs View Post
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.111,2615270,5314.88 <-- suspect row to delete
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61


This data shows fields 1, 2, 6 and 8 of the data file. I want to remove any row that looks out of sync based on field 8, which is an amount of money. The timestamps are grouped by the ID (field 6).

But you said you have 21 fields, so where are they? I can only see 4 fields, comma separated since it's CSV. Where are field 6 and field 8? Are there any other files you haven't shown?
 
Old 11-11-2009, 01:49 PM   #7
webs
LQ Newbie
 
Registered: Oct 2009
Posts: 7

Original Poster
Rep: Reputation: 0
Hi,

There are loads of fields in the file, but the others don't matter. I've shown fields 1, 2, 6 and 8, which are the important ones. I want to remove a line when the figure in the last field is less than the line before, if the 3rd field is the same ID number.




-WEBS
 
Old 11-11-2009, 01:51 PM   #8
webs
LQ Newbie
 
Registered: Oct 2009
Posts: 7

Original Poster
Rep: Reputation: 0
Catkin

That's a better way of explaining it. If the ID in the 3rd field is the same and the 4th field number is less than the line before, you need to remove this line.

Not an easy thing



WEBS
 
Old 11-11-2009, 01:57 PM   #9
pcunix
Member
 
Registered: Dec 2004
Location: MA
Distribution: Various
Posts: 149

Rep: Reputation: 23
Quote:
Originally Posted by webs View Post
Hi,

There are loads of fields in the file, but the others don't matter. I've shown fields 1, 2, 6 and 8, which are the important ones. I want to remove a line when the figure in the last field is less than the line before, if the 3rd field is the same ID number.


-WEBS
OK. The Perl script I gave you above does that. You'd adjust the split line for your actual fields, but other than that, it does the job.
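For example, assuming the full file really does have the ID in field 6 and the money in field 8 (a sketch, untested against your real data):

Code:
#!/usr/bin/perl
# Same logic as before, but splitting all 21 fields and picking
# out the ID (field 6) and money (field 8), counting fields from 1.
$previd="";
$prevamount=0;
while (<>) {
 my @f = split /,/;
 my ($id, $amount) = ($f[5], $f[7]);   # 0-based: fields 6 and 8
 next if ($id == $previd and $amount < $prevamount);
 print;
 $previd = $id;
 $prevamount = $amount;
}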
 
Old 11-11-2009, 01:59 PM   #10
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
You could use pcunix's Perl version, or pretty much "transliterate" it
into awk with little effort:
Code:
cat webs
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.111,2615270,5314.88
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61
awk -F, 'BEGIN{previ="";preva=0}{if($3==previ&&$4<preva){next}else{print};previ=$3;preva=$4}' webs
2007-11-19,21:05:28.074,2615270,5334.48
2007-11-19,21:05:28.374,2615270,5344.78
2007-11-19,21:05:29.884,2615270,5394.47
2007-11-19,21:05:23.861,6613277,456.61
2007-11-19,21:05:28.861,6613277,496.24
2007-11-19,21:05:29.861,6613277,536.61
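That's on the 4-field extract; on your real 21-field file, where the ID and money should be $6 and $8 if I've read you right, the same logic would be something like this (untested):

Code:
awk -F, '$6==previ && $8+0<preva {next} {print; previ=$6; preva=$8+0}' fullfile.csv > cleaned.csv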

Cheers,
Tink
 
Old 11-11-2009, 02:12 PM   #11
pcunix
Member
 
Registered: Dec 2004
Location: MA
Distribution: Various
Posts: 149

Rep: Reputation: 23
I figured someone would post an awk version.

I used to use awk, but once Perl started becoming ubiquitous, I gave it up.

It's funny: some people see awk as easy and Perl as hard. Others (like me) see Perl as much easier. With awk, I was constantly referring to my Awk and Sed book; with Perl I can usually just write what I want.

I wrote http://aplawrence.com/Unixart/awk-vs.perl.html just a few days ago and got comments like "I tried Perl, and used it for a few projects, but it just never clicked with me, especially all the quirky variable sigils and their usage."

For me it's just the other way around, which shows how different we all are.

But anyway: thanks to Tinkster, you can use whatever feels best to you.
 
Old 11-12-2009, 10:08 AM   #12
webs
LQ Newbie
 
Registered: Oct 2009
Posts: 7

Original Poster
Rep: Reputation: 0
Thanx All,

I'll use the awk one as I'm more comfortable using that. Trying it now.


Thanx again.


WEBS
 
  


