Linux - Newbie
This Linux forum is for members that are new to Linux. Just starting out and have a question? If it is not in the man pages or the how-tos, this is the place!
I posted a problem on here not long ago and was surprised at the response.
I have a really difficult problem: I have a huge dataset with some errors in it. It would take 5 days to re-create, so I'm looking at fixing the data instead.
I have a CSV file of data with 21 fields. Field 1 is a date, and field 2 is a timestamp with millisecond precision (3 decimal places), for example 12:00:01.360.
Field 6 is an ID number and field 8 is the important count.
The data looks as follows:
field1, field2, field6, field8
date, timestamp, id, money
The file goes through the rows in timestamp order within each ID, and the money grows as it goes through the file, then starts again at 0 when it sees a new ID, after which the money counts up through the rows again.
Every so often I have a bad row where the timestamp and ID are correct, but the money is less than in the row before. When this happens I want to remove the row with the lower money value.
I have a separate file with the timestamp, ID and money fields in it, in case I need it.
There are millions of rows so I can't use Excel, and the database messes up the timestamps. Can anyone think of a way of removing these lines?
I wondered if the separate file with the rows I want to delete could be used with a grep -v, but wouldn't you have to save each one and pipe it to the next field match?
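A sketch of the kind of filter being described, using awk on the four-field extract (the file name and sample values below are made up for illustration): keep a row unless its ID matches the last kept row's ID and its money value is lower.

```shell
# Hypothetical sample in the date,timestamp,id,money layout described above.
cat > sample.csv <<'EOF'
2007-11-19,21:05:22.120,2615270,100.50
2007-11-19,21:05:23.050,2615270,150.75
2007-11-19,21:05:23.400,2615270,120.00
2007-11-19,21:05:23.861,6613277,456.61
EOF

# Print a row unless its ID (field 3) equals the last kept row's ID and
# its money (field 4) is numerically lower; id/money are only updated on
# kept rows, so a run of bad rows is compared against the last good value.
awk -F, '$3 == id && $4 + 0 < money + 0 { next }
         { id = $3; money = $4; print }' sample.csv
```

Here the 120.00 row is dropped because it shares ID 2615270 with the 150.75 row before it, while the 456.61 row survives because its ID is new.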
Describe your problem again, this time showing an exact sample of the input file and what your final output should look like. To be frank with you, I stopped halfway through reading your problem.
This data shows fields 1, 2, 6 and 8 of the data file. I want to remove any row that looks out of sync based on field 8, which is an amount of money. The timestamps are grouped by the ID (field 6).
Can you see that the highlighted row is wrong: the ID is the same as in the row before (2615270) but the money is less than in the row before.
The row:
2007-11-19,21:05:23.861,6613277,456.61
is OK, as the ID is a new one and the money has started again.
"out of sync" is not a sufficiently clear specification to design an algorithm or write a program.
Are you saying that for a given ID (example 2615270) that the field 8 value must increase for each increased timestamp and, if it does not, the line must be deleted? Is the data pre-sorted by ID and then by timestamp? Are all the ID+timestamp combinations unique?
Hope that makes it clearer
-WEBS
But you said you have 21 fields, so where are they? I can only see 4 fields, comma separated, since it's a CSV. Where are field 6 and field 8? Are there any other files you haven't shown?
There are loads of fields in the file, and the others don't matter; I've shown fields 1, 2, 6 and 8, which are the important ones. I want to remove a line when the figure in the last field is less than in the line before, if the 3rd field is the same ID number.
That's a better way of explaining it: if the ID in the 3rd field is the same and the number in the 4th field is less than in the line before, that line needs to be removed.
-WEBS
OK. The Perl script I gave you above does that. You'd adjust the split line for your actual fields, but other than that, it does the job.
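The Perl script mentioned here isn't quoted in the thread; as a sketch of the same rule applied to the full 21-field file, an awk version might look like this (data.csv, cleaned.csv, and the filler "x" fields are placeholders, not the real data):

```shell
# Build a hypothetical 21-field file: only fields 1, 2, 6 and 8 carry
# meaningful values here, the rest are filler.
printf '%s\n' \
  '2007-11-19,21:05:22.120,x,x,x,2615270,x,100.50,x,x,x,x,x,x,x,x,x,x,x,x,x' \
  '2007-11-19,21:05:23.400,x,x,x,2615270,x,90.00,x,x,x,x,x,x,x,x,x,x,x,x,x' \
  '2007-11-19,21:05:23.861,x,x,x,6613277,x,456.61,x,x,x,x,x,x,x,x,x,x,x,x,x' \
  > data.csv

# Drop a row when field 6 (the ID) matches the previous kept row's ID and
# field 8 (the money) is numerically lower; write everything else out.
awk -F, '$6 == id && $8 + 0 < money + 0 { next }
         { id = $6; money = $8; print }' data.csv > cleaned.csv
```

The 90.00 row is removed because it shares ID 2615270 with the 100.50 row before it; the other two rows pass through to cleaned.csv unchanged.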
I used to use awk, but once Perl started becoming ubiquitous, I gave it up.
It's funny: some people see awk as easy and Perl as hard. Others (like me) see Perl as much easier. With awk, I was constantly referring to my Awk and Sed book; with Perl I can usually just write what I want.
I wrote http://aplawrence.com/Unixart/awk-vs.perl.html just a few days ago and got comments like "I tried Perl, and used it for a few projects, but it just never clicked with me, especially all the quirky variable sigils and their usage."
For me it's just the other way around, which shows how different we all are.
But anyway: thanks to Tinkster, you can use whatever feels best to you.