[SOLVED] How do I scan several hundred files for, in each file, the first instance of an entry in a particular column and.......
I want to scan several hundred files, all CSVs in one folder, and for each file find the first instance of an entry in column number 'such and such', then output that entire line to another CSV in a different folder.
How may I do this?
For a given file, the following appears to be a start:
gawk ' { if (NF >= 4){if (S111 =="") print $0}} ' *.csv > /home/name/test/test.csv
but it outputs every line with a relevant entry.
The CSVs are fairly large: 200+ columns and up to 15,000 rows.
I have come across a grep command that does this but that searches for specific text in any column and I seek a solution for entries that do not contain specific text.
Pressing my luck, and possibly more difficult: it might also be useful if I could output the last line to contain an entry in the desired column, but that may be a question for a later thread.
The simplest fix is probably to add an exit statement as part of the second if's success.
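A hedged sketch of that idea. Note that exit stops awk entirely, which is fine for one file; across many files, resetting a flag at FNR == 1 gives first-match-per-file instead. The demo files and column number 4 are invented stand-ins (the OP would use the real column number, e.g. 111):

```shell
#!/bin/sh
# Demo data: two tiny CSVs standing in for the real flight logs.
tmp=$(mktemp -d)
printf 'a,b,c,\nd,e,f,1\ng,h,i,2\n' > "$tmp/one.csv"
printf 'j,k,l,9\nm,n,o,8\n'         > "$tmp/two.csv"

# "exit" would stop after the first match in the first file; the flag
# reset at FNR == 1 restarts the search at the top of each new file.
awk -F',' -v col=4 '
FNR == 1 { found = 0 }                    # new input file: reset the flag
!found && $col != "" { print; found = 1 }
' "$tmp"/*.csv > first_rows.csv

rm -rf "$tmp"
```

Redirecting to first_rows.csv collects all the matches in the one output file, as asked.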
Quote:
I have come across a grep command that does this but that searches for specific text in any column and I seek a solution for entries that do not contain specific text.
This sentence does not seem to match the rest of your post, but consider that the last row that matches is the first row when reading rows backwards.
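A minimal illustration of that backwards reading, assuming GNU tac is available; the sample file and column 4 are invented. The first match in the reversed stream is the last match in the original file:

```shell
#!/bin/sh
# Invented sample: column 4 is empty on the first and last rows.
printf 'a,b,c,\nd,e,f,1\ng,h,i,2\nx,y,z,\n' > demo.csv

# tac reverses the line order, so "first non-empty column 4" in the
# reversed stream is the last non-empty entry in the file.
tac demo.csv | awk -F',' -v col=4 '$col != "" { print; exit }' > last_row.txt

rm -f demo.csv
```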
In a Windows command window, the following
grep -a -m 1 -r "sample" *.csv > xyz.csv
searches for the word "sample" and, if it finds the word in a CSV, outputs the line to xyz.csv, then moves on to search the next CSV file. Yes, I am aware that it probably also searches xyz.csv itself, and that this is probably not good programming, but I can adjust that later.
Excuse the use of Windows, but these CSVs come from drone flight logs and are generated from the manufacturer's flight logs using software that I only know how to run under Windows, so using the grep version for Windows saves swapping between Windows and Linux. Besides which, grep seems a much better search tool than anything I have found for Windows. From memory, I have run that grep line under Linux and it works there too. My 'DOS' skills are even worse than my gawk skills, so it is likely I will have to get used to switching back and forth between Windows and Linux.
With regards to the NF >= 4: that's a legacy of my copying the layout of my one-liner from some old programs of mine, which worked on input files that had a variable number of fields in each line. The CSVs of the current project do have some variation in the number of fields, depending on drone software version, but all lines are well over 4 columns. As such, NF >= 4 is redundant.
With regards to "Are you trying to output the first row where the column number 'such and such' is not empty?": yes, but for each CSV, so if there were 623 CSVs and the search target were found in each, there would be 623 outputs, all sent to one file.
With regards to "Do you want each matching row to be output to a separate file? Or should they all be added to the same file?": all sent to the one file.
With regards to samples: say csv1 to csv4 are fictional flight logs and I want to search for the first line in each CSV that contains a charge count. They, and the desired output, are shown in the attached.
Ultimately, one thing I would like to be able to track is the flight-time capability of each battery, and that is where the first and last occurrences of the search target come in.
@Softsprocket: thanks, some of that rings a bell from the "old programs", many thanks. @everyone else: thanks, I will google the suggestions and see where they lead me.
Last edited by sean mckinney; 12-16-2020 at 01:47 PM.
I would go with boughtonp's suggestion of simply using an exit once the pattern has been matched. That should result in a simple one-line command if your CSV data are reasonably uniform.
A small but representative sample of the CSV data would help others to better understand the whole problem.
Here is a sample from one flight log downloaded off the web, altered (I hope) to preserve anonymity. I am not sure if it will work, as I am only able to upload a limited range of file types, so I had to convert it; it is described as tab-delimited txt.
From what I can see, if you open it with a spreadsheet program, the battery charge count is column DG, with the title "CENTER_BATTERY.loopNum". That raises one point: the presence of a title might trigger a 'success', so it might be necessary to search for the second occurrence.
Last edited by sean mckinney; 12-16-2020 at 03:37 PM.
That column, CENTER_BATTERY.loopNum, contains only zeros apart from a few empty rows. Is that OK? Just to be sure: you want the first row where CENTER_BATTERY.loopNum is zero after a gap in the data, and then the last row before the next gap?
Ok, this is probably not quite what you want, but similar: it outputs the first row where CENTER_BATTERY.loopNum is not empty, but was empty in the previous row:
Code:
mlr -t step -a shift -f CENTER_BATTERY.loopNum \
  then filter '${CENTER_BATTERY.loopNum_shift}=="" && ${CENTER_BATTERY.loopNum}!=""' \
  then cut -x -f CENTER_BATTERY.loopNum_shift sample.csv
And this will output the first and the last row of non-empty runs between empty gaps:
Code:
mlr -t put '
  @n = ${CENTER_BATTERY.loopNum};
  if (NR>1 && @p!="" && @n=="") {emit @r};
  filter @p=="" && @n!="";
  @p = @n;
  @r = $*' sample.csv
Rather than learn a new language, awk should handle it just fine. I'm guessing the OP means '$111 != ""'. It's pretty standard to use an assoc array to hold matching records: print the first before saving, print the last in an END block.
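A sketch of that assoc-array approach, with invented demo files and column 4 standing in for the real column. The first match per file is saved once; the last is overwritten on every match, and both are printed in the END block:

```shell
#!/bin/sh
# Two invented flight-log stand-ins, each with a title row.
printf 'h1,h2,h3,h4\na,b,c,1\nd,e,f,\ng,h,i,2\n' > log1.csv
printf 'h1,h2,h3,h4\nj,k,l,7\n' > log2.csv

awk -F',' '
FNR > 1 && $4 != "" {                 # FNR > 1 skips each title row
    if (!(FILENAME in first)) first[FILENAME] = $0
    last[FILENAME] = $0               # overwritten on every match; the
}                                     # final value is the last match
END {
    for (f in first)
        print f ": first=" first[f] " last=" last[f]
}
' log1.csv log2.csv > firstlast.txt

rm -f log1.csv log2.csv
```

Note that the order of a `for (f in ...)` loop is unspecified in awk; sort the output if file order matters.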
Works for commas, but not for newlines inside fields (common in addresses).
And the example there doesn't deal with escaped quotes - e.g. "Bond, James ""007"" Bond" - which is easy to do, but just one more bit of noise when it should be no more complex than specifying a "--csv" option instead.
Quote:
and there's a (yet unreleased) extension to gawk, gawk-csv. I'm using it in this post.
Which is nice, but why is it an extension and what's the deal with it being "not yet released" (despite also claiming "Version: 1.0.0")?
(Questions more for the gawk/gawkextlib teams I guess, but I don't know if I can be bothered going and asking them; I'd probably just get annoyed by the response.)
Shruggy, it was just a sample file, and choosing "CENTER_BATTERY.loopNum" as the target was semi-random. That those cells in the sample are empty is just an unlucky coincidence.
Sometimes data is missing from some of these logs, be that from the entire log or only from part of it, and that is actually part of the reason for my 'quest', to see if I can work out a pattern.
BTW I think I made a HUGE HUGE mistake with this thread, the == should, I think, have been !=, as in not empty. The sun is shining out of my window from the scarlet glow on my face. Whoops.
I may also have to adjust the "pattern" if it turns out that some entries are invisible to the human eye but visible to the search. Plus, make allowances for the fact that the first row of each CSV is full of column titles, which I realised as I started fiddling with trial searches.
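Both points can be handled inside the awk script itself. A hedged sketch with invented data: the \r carriage return is the usual invisible character in Windows-generated CSVs, and column 4 is again a stand-in for the real column:

```shell
#!/bin/sh
# Invented sample saved with CRLF line endings, plus a cell holding
# only a space -- visually empty, but non-empty to a naive test.
printf 'h1,h2,h3,h4\r\na,b,c, \r\nd,e,f,1\r\n' > crlf.csv

awk -F',' -v OFS=',' '
FNR == 1 { next }                     # skip the title row of each file
{ gsub(/[ \r]/, "", $4) }             # strip spaces and carriage returns
$4 != "" { print; exit }              # one file here; with many files
                                      # and gawk, use nextfile instead
' crlf.csv > cleaned.txt

rm -f crlf.csv
```

With this, the space-only cell is treated as empty and the stray \r never leaks into the output.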
One point though: it would, I think, be better to search columns without addressing their titles. From memory, the titles vary from model to model, which is why I was referencing what I know as either the column number or the field number.
syg00 you are correct about the == vs !=, I doubt I could have made a better mistake if I tried.
Last edited by sean mckinney; 12-17-2020 at 02:52 PM.