[SOLVED] How do I scan several hundred files for, in each file, the first instance of an entry in a particular column and.......
I want to scan several hundred files, all CSVs in one folder, and for each file find the first instance of an entry in column number 'such and such', then output that entire line to another CSV in a different folder.
How may I do this?
For a given file, the following appears to be a start:
gawk ' { if (NF >= 4){if (S111 =="") print $0}} ' *.csv > /home/name/test/test.csv
but it outputs every line with a relevant entry.
The CSVs are fairly large: 200+ columns and up to 15,000 rows.
I have come across a grep command that does this but that searches for specific text in any column and I seek a solution for entries that do not contain specific text.
Pressing my luck, and possibly more difficult: it might also be useful if I could output the last line to contain an entry in the desired column, but that may be a question for a later thread.
The simplest fix is probably to add an exit statement as part of the second if's success.
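A hedged sketch of that idea. Note that exit stops awk entirely, which is fine for one file; across many files, resetting a flag at FNR == 1 gives first-match-per-file instead. The demo files and column number 4 are invented stand-ins (the OP would use the real column number, e.g. 111):

```shell
#!/bin/sh
# Demo data: two tiny CSVs standing in for the real flight logs.
tmp=$(mktemp -d)
printf 'a,b,c,\nd,e,f,1\ng,h,i,2\n' > "$tmp/one.csv"
printf 'j,k,l,9\nm,n,o,8\n'         > "$tmp/two.csv"

# "exit" would stop after the first match in the first file; the flag
# reset at FNR == 1 restarts the search at the top of each new file.
awk -F',' -v col=4 '
FNR == 1 { found = 0 }                    # new input file: reset the flag
!found && $col != "" { print; found = 1 }
' "$tmp"/*.csv > first_rows.csv

rm -rf "$tmp"
```

Redirecting to first_rows.csv collects all the matches in the one output file, as asked.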
Quote:
I have come across a grep command that does this but that searches for specific text in any column and I seek a solution for entries that do not contain specific text.
This sentence does not seem to match the rest of your post, but consider that the last row that matches is the first row when reading rows backwards.
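A minimal illustration of that backwards reading, assuming GNU tac is available; the sample file and column 4 are invented. The first match in the reversed stream is the last match in the original file:

```shell
#!/bin/sh
# Invented sample: column 4 is empty on the first and last rows.
printf 'a,b,c,\nd,e,f,1\ng,h,i,2\nx,y,z,\n' > demo.csv

# tac reverses the line order, so "first non-empty column 4" in the
# reversed stream is the last non-empty entry in the file.
tac demo.csv | awk -F',' -v col=4 '$col != "" { print; exit }' > last_row.txt

rm -f demo.csv
```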
In a Windows command window, the following
grep -a -m 1 -r "sample" *.csv > xyz.csv
searches for the word "sample" and, if it finds the word in a CSV, outputs the line to xyz.csv, then moves on to search the next CSV file. Yes, I am aware that it probably also searches xyz.csv itself, and that this is probably not good programming, but I can adjust that later.
Excuse the use of Windows, but these CSVs come from drone flight logs and are generated from the manufacturer's flight logs using software that I only know how to run under Windows, so using the grep version for Windows saves swapping between Windows and Linux. Besides which, grep seems a much better search tool than anything I have found for Windows. From memory, I have run that grep line under Linux and it works there too. My 'DOS' skills are even worse than my gawk skills, so it is likely I will have to get used to switching back and forth between Windows and Linux.
With regards to the NF >= 4: that's a legacy of my copying the layout of my one-liner from some old programs of mine, which worked on input files that had a variable number of fields in each line. The CSVs of the current project do have some variation in the number of fields, depending on drone software version, but all lines are well over 4 columns. As such, NF >= 4 is redundant.
With regards to "Are you trying to output the first row where the column number 'such and such' is not empty?": yes, but for each CSV, so if there were 623 CSVs and the search target were found in each, there would be 623 outputs, all sent to one file.
With regards to "Do you want each matching row to be output to a separate file? Or should they all be added to the same file?": all sent to the one file.
With regards to samples: say csv1 to csv4 are fictional flight logs and I want to search for the first line in each CSV that contains a charge count. They, and the desired output, are shown in the attached.
Ultimately, one thing I would like to be able to track is the flight-time capability of each battery, and that is where the first and last occurrences of the search target come in.
@Softsprocket: thanks, some of that rings a bell from the "old programs", many thanks. @everyone else: thanks, I will google the suggestions and see where they lead me.
Last edited by sean mckinney; 12-16-2020 at 01:47 PM.
I would go with boughtonp's suggestion of simply using an exit once the pattern has been matched. That should result in a simple one-line command if your CSV data are reasonably uniform.
A small but representative sample of the CSV data would help others to better understand the whole problem.
Here is a sample from one flight log downloaded off the web, altered (I hope) to preserve anonymity. I am not sure if it will work, as I am only able to upload a limited range of file types, so I had to convert it; it is described as tab-delimited txt.
From what I can see, if you open it with a spreadsheet program, the battery charge count is column DG, with the title "CENTER_BATTERY.loopNum". That raises one point: the presence of a title might trigger a 'success', so it might be necessary to search for the second occurrence.
Last edited by sean mckinney; 12-16-2020 at 03:37 PM.
That column, CENTER_BATTERY.loopNum, contains only zeros apart from a few empty rows. Is that OK? Just to be sure: you want the first row where CENTER_BATTERY.loopNum is zero after a gap in the data, and then the last row before the next gap?
Ok, this is probably not quite what you want, but similar: it outputs the first row where CENTER_BATTERY.loopNum is not empty, but was empty in the previous row:
Code:
mlr -t step -a shift -f CENTER_BATTERY.loopNum \
  then filter '${CENTER_BATTERY.loopNum_shift}=="" && ${CENTER_BATTERY.loopNum}!=""' \
  then cut -x -f CENTER_BATTERY.loopNum_shift sample.csv
And this will output the first and the last row of non-empty runs between empty gaps:
Code:
mlr -t put '
  @n = ${CENTER_BATTERY.loopNum};
  if (NR>1 && @p!="" && @n=="") {emit @r};
  filter @p=="" && @n!="";
  @p = @n;
  @r = $*' sample.csv
Rather than learn a new language, awk should handle it just fine. I'm guessing the OP means '$111 != ""'. It's pretty standard to use an assoc array to hold matching records: print the first before saving, print the last in an END block.
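A sketch of that assoc-array approach, with invented demo files and column 4 standing in for the real column. The first match per file is saved once; the last is overwritten on every match, and both are printed in the END block:

```shell
#!/bin/sh
# Two invented flight-log stand-ins, each with a title row.
printf 'h1,h2,h3,h4\na,b,c,1\nd,e,f,\ng,h,i,2\n' > log1.csv
printf 'h1,h2,h3,h4\nj,k,l,7\n' > log2.csv

awk -F',' '
FNR > 1 && $4 != "" {                 # FNR > 1 skips each title row
    if (!(FILENAME in first)) first[FILENAME] = $0
    last[FILENAME] = $0               # overwritten on every match; the
}                                     # final value is the last match
END {
    for (f in first)
        print f ": first=" first[f] " last=" last[f]
}
' log1.csv log2.csv > firstlast.txt

rm -f log1.csv log2.csv
```

Note that the order of a `for (f in ...)` loop is unspecified in awk; sort the output if file order matters.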
Works for commas, but not for newlines inside fields (common in addresses).
And the example there doesn't deal with escaped quotes - e.g. "Bond, James ""007"" Bond" - which is easy to do, but just one more bit of noise when it should be no more complex than specifying a "--csv" option instead.
Quote:
and there's a (yet unreleased) extension to gawk, gawk-csv. I'm using it in this post.
Which is nice, but why is it an extension and what's the deal with it being "not yet released" (despite also claiming "Version: 1.0.0")?
(Questions more for the gawk/gawkextlib teams I guess, but I don't know if I can be bothered going and asking them; I'd probably just get annoyed by the response.)
Shruggy, it was just a sample file, and choosing "CENTER_BATTERY.loopNum" as the target was semi-random. That those cells in the sample are empty is just an unlucky coincidence.
Sometimes data is missing from some of these logs, be that from the entire log or only from part of it, and that is actually part of the reason for my 'quest', to see if I can work out a pattern.
BTW I think I made a HUGE HUGE mistake with this thread, the == should, I think, have been !=, as in not empty. The sun is shining out of my window from the scarlet glow on my face. Whoops.
I may also have to adjust the "pattern" if it turns out that some entries are invisible to the human eye but visible to the search. Plus, make allowances for the fact that the first row of each CSV is full of column titles, which I realised as I started fiddling with trial searches.
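Both points can be handled inside the awk script itself. A hedged sketch with invented data: the \r carriage return is the usual invisible character in Windows-generated CSVs, and column 4 is again a stand-in for the real column:

```shell
#!/bin/sh
# Invented sample saved with CRLF line endings, plus a cell holding
# only a space -- visually empty, but non-empty to a naive test.
printf 'h1,h2,h3,h4\r\na,b,c, \r\nd,e,f,1\r\n' > crlf.csv

awk -F',' -v OFS=',' '
FNR == 1 { next }                     # skip the title row of each file
{ gsub(/[ \r]/, "", $4) }             # strip spaces and carriage returns
$4 != "" { print; exit }              # one file here; with many files
                                      # and gawk, use nextfile instead
' crlf.csv > cleaned.txt

rm -f crlf.csv
```

With this, the space-only cell is treated as empty and the stray \r never leaks into the output.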
One point though: it would, I think, be better to search columns without addressing their titles. From memory, the titles vary from model to model, which is why I was referencing what I know as either the column number or the field number.
syg00 you are correct about the == vs !=, I doubt I could have made a better mistake if I tried.
Last edited by sean mckinney; 12-17-2020 at 02:52 PM.