LinuxQuestions.org
Old 12-16-2020, 07:00 AM   #1
sean mckinney
Member
 
Registered: Mar 2004
Posts: 32

Rep: Reputation: 0
How do I scan several hundred files for, in each file, the first instance of an entry in a particular column and...


I want to scan several hundred files, all csv's in one folder, and for each file output the first line that has an entry in column number 'such and such' to another csv in a different folder.
How can I do this?

For a given file the following appears to be a start:
Code:
gawk ' { if (NF >= 4){if (S111 =="") print $0}} ' *.csv > /home/name/test/test.csv
but it outputs every line with a relevant entry.

The csv's can be large: 200+ columns and up to 15,000 rows.

I have come across a grep command that does something like this, but it searches for specific text in any column, and I need a solution for entries that do not consist of specific text.

Pressing my luck, and possibly more difficult: it might also be useful if I could output the last line to contain an entry in the desired column, but that may be a question for a later thread.

Thanks for any assistance.
 
Old 12-16-2020, 07:20 AM   #2
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,597

Rep: Reputation: 2545

CSV logic should use a CSV parser - I would be looking to implement this in a language with a dedicated CSV parser, e.g. Python.


However, the solution to...
Quote:
but it outputs every line with a relevant entry.
...is probably to add an exit statement as part of the second if's success.


Quote:
I have come across a grep command that does this but that searches for specific text in any column and I seek a solution for entries that do not contain specific text.
This sentence does not seem to match the rest of your post, but consider that the last row that matches is the first row when reading rows backwards.

 
Old 12-16-2020, 07:58 AM   #3
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
Could you provide a small sample of input data together with desired output? I don't quite get what you're trying to achieve.

Judging from NF >= 4, am I right in supposing that the input CSV data may have a different number of columns per row?

Are you trying to output the first row where the column number 'such and such' is not empty?

Do you want each matching row to be output to a separate file? Or should they all be added to the same file?

And limiting the grep-like output to the first and the last match only shouldn't be a problem: |sed -n '1p;$p'.
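For example (hypothetical file name and target column 4, tab-separated):

```shell
# Keep only the first and the last of the rows whose column 4 is non-empty.
# Caveat: if there is exactly one matching row, sed prints it twice.
awk -F'\t' '$4 != ""' flightlog.csv | sed -n '1p;$p'
```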

But yes, +1 to using a dedicated CSV parser.

Last edited by shruggy; 12-16-2020 at 08:33 AM.
 
Old 12-16-2020, 08:28 AM   #4
SoftSprocket
Member
 
Registered: Nov 2014
Posts: 399

Rep: Reputation: Disabled
As I understand it the programming logic is relatively simple, i.e. (pseudocode)

Code:
files = read_from_dir
while (file = next_file (files)) {
    while (line = read_line (file)) {
        if (column_x_has_entry (line)) {
             write_to_file (line)
             break
        }
    }
}
 
Old 12-16-2020, 01:43 PM   #5
sean mckinney
Member
 
Registered: Mar 2004
Posts: 32

Original Poster
Rep: Reputation: 0
With regard to the comment about grep:

In a Windows command window the following
grep -a -m 1 -r "sample" *.csv > xyz.csv
searches for the word "sample" and, if it finds it in a csv, outputs that line to xyz.csv, then moves on to search the next csv file. Yes, I am aware that it probably searches xyz.csv itself and that this is probably not good programming, but I can adjust that later. Excuse the use of Windows, but these csv's come from drone flight logs and are generated from the manufacturer's flight logs by software that I only know how to run under Windows, so using the grep build for Windows saves swapping between Windows and Linux. Besides which, grep seems a much better search tool than anything else I have found for Windows. From memory I have run that grep 'line' under Linux and it works there too. My 'DOS' skills are even worse than my gawk skills, so it is likely I will have to get used to switching back and forth between Windows and Linux.

With regard to the NF >= 4: that's a legacy of my copying the layout of my one-liner from some old programs of mine which worked on input files that had a variable number of fields in each line. The csv's of the current project do have some variation in the number of fields, depending on drone software version, but all lines have well over 4 columns. As such NF >= 4 is redundant.

With regard to
"Are you trying to output the first row where the column number 'such and such' is not empty?" Yes, but for each csv, so if there were 623 csv's and the search target were found in each csv there would be 623 outputs, all sent to one file.

"Do you want each matching row to be output to a separate file? Or should they all be added to the same file?" All sent to the one file.

With regard to samples: say csv1 to csv4 are fictional flight logs and I want to search for the first line in each csv that contains a charge count; they and the desired output are shown in the attached.
Ultimately, one thing I would like to be able to track is the flight-time capability of each battery, and that is where the first and last occurrences of the search target come in.

@SoftSprocket, thanks, some of that rings a bell from the "old programs"; many thanks. @everyone else, thanks, I will google the suggestions and see where they lead me.
Attached Thumbnails: fictional csv's.png (41.2 KB)

Last edited by sean mckinney; 12-16-2020 at 01:47 PM.
 
Old 12-16-2020, 01:43 PM   #6
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,263
Blog Entries: 24

Rep: Reputation: 4194
I would go with boughtonp's suggestion of simply using an exit once the pattern has been matched. That should result in a simple one-line command if your CSV data are reasonably uniform.

A small but representative sample of the CSV data would help others to better understand the whole problem.
 
Old 12-16-2020, 03:30 PM   #7
sean mckinney
Member
 
Registered: Mar 2004
Posts: 32

Original Poster
Rep: Reputation: 0
Sample from one flight log downloaded off the web, altered to preserve anonymity, I hope. I am not sure if it will work, as I am only able to upload a limited range of file types, so I had to convert it; it is described as a tab-delimited txt.



From what I can see, it works if you open it with a spreadsheet program. The battery charge count is column DG, with the title "CENTER_BATTERY.loopNum". That raises one point: the presence of a title might trigger a 'success', so it might be necessary to search for the second occurrence.
Attached Files: sample.txt (tab-delimited, 77.2 KB)

Last edited by sean mckinney; 12-16-2020 at 03:37 PM.
 
Old 12-17-2020, 04:50 AM   #8
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
That column, CENTER_BATTERY.loopNum, apart from a few empty rows, contains only zeros. Is that OK? Just to be sure: you want the first row where CENTER_BATTERY.loopNum is zero after a gap in the data, and then the last row before the next gap?

OK, this is probably not quite what you want, but it's similar: it outputs the first row where CENTER_BATTERY.loopNum is not empty but was empty in the previous row:
Code:
mlr -t step -a shift -f CENTER_BATTERY.loopNum \
 then filter '${CENTER_BATTERY.loopNum_shift}==""&&${CENTER_BATTERY.loopNum}!=""' \
 then cut -x -f CENTER_BATTERY.loopNum_shift sample.csv
And this will output the first and the last row of non-empty runs between empty gaps:
Code:
mlr -t put '
@n=${CENTER_BATTERY.loopNum};
if(NR>1&&@p!=""&&@n==""){emit @r};
filter @p==""&&@n!="";@p=@n;@r=$*' sample.csv

Last edited by shruggy; 12-17-2020 at 09:18 AM.
 
Old 12-17-2020, 06:54 AM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
Rather than learn a new language, awk should handle it just fine. I'm guessing the OP means '$111 != ""' - pretty standard to use an assoc array to hold matching records - print the first before saving, print the last in an END block.
 
Old 12-17-2020, 07:22 AM   #10
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,597

Rep: Reputation: 2545
Quote:
Originally Posted by syg00 View Post
Rather than learn a new language, awk should handle it just fine.
Now that we've seen the data is simple, sure - it'll work here.

In general, parsing wild CSV needs a tool that can handle quoted separators, and afaik nobody has added that to Awk yet?

 
Old 12-17-2020, 08:12 AM   #11
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
Quote:
Originally Posted by boughtonp View Post
afaik nobody has added that to Awk yet?
Well, gawk has FPAT, and there's a (yet unreleased) extension to gawk, gawk-csv. I'm using it in this post.
 
1 member found this post helpful.
Old 12-17-2020, 08:51 AM   #12
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,597

Rep: Reputation: 2545
Quote:
Originally Posted by shruggy View Post
Well, gawk has FPAT
Works for commas, but not newlines (common in addresses).

And the example there doesn't deal with escaped quotes - e.g. "Bond, James ""007"" Bond" - which is easy to do, but just one more bit of noise when it should be no more complex than specifying a "--csv" option instead.


Quote:
and there's a (yet unreleased) extension to gawk, gawk-csv. I'm using it in this post.
Which is nice, but why is it an extension and what's the deal with it being "not yet released" (despite also claiming "Version: 1.0.0")?

(Questions more for the gawk/gawkextlib teams I guess, but I don't know if I can be bothered going and asking them; I'd probably just get annoyed by the response.)

 
Old 12-17-2020, 12:00 PM   #13
sean mckinney
Member
 
Registered: Mar 2004
Posts: 32

Original Poster
Rep: Reputation: 0
Shruggy, it was just a sample file, and choosing "CENTER_BATTERY.loopNum" as the target was semi-random. That those cells in the sample are empty is just an unlucky coincidence.
Sometimes data is missing from some of these logs, be it from the entire log or only part of it, and that is actually part of the reason for my 'quest': to see if I can work out a pattern.

BTW I think I made a HUGE HUGE mistake with this thread: the == should, I think, have been !=, as in not empty. The sun is shining in my window onto the scarlet glow on my face. Whoops.
I may also have to adjust the "pattern" if it turns out that some entries are invisible to the human eye but visible to the search. Plus make allowances for the fact that the first row of each csv is full of column titles, which I realised as I started fiddling with trial searches.

One point though: it would, I think, be better to search columns without addressing their titles; from memory those vary from model to model, which is why I was referencing what I know as either the column number or field number.

syg00, you are correct about the == vs !=; I doubt I could have made a better mistake if I tried.

Last edited by sean mckinney; 12-17-2020 at 02:52 PM.
 
Old 12-17-2020, 12:19 PM   #14
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
Quote:
Originally Posted by sean mckinney View Post
Plus make allowances for the fact that the first row of each csv is full of column titles
Well, in awk this could be taken care of with NR>1 or NR==1{next}:
Code:
awk -F'\t' 'NR==1{next}p!=""&&$4==""{print r}p==""&&$4!="";{p=$4;r=$0}' sample.csv

Last edited by shruggy; 12-17-2020 at 12:37 PM.
 
Old 12-17-2020, 12:27 PM   #15
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
Quote:
Originally Posted by sean mckinney View Post
One point though, it would, I think, be better to search columns without addressing their titles
Miller uses almost the same convention as awk here:
Code:
mlr -t put '
if(NR>1&&@p!=""&&$[[[4]]]==""){emit @r};
filter @p==""&&$[[[4]]]!="";@p=$[[[4]]];@r=$*' sample.csv

Last edited by shruggy; 12-17-2020 at 12:38 PM.
 
  


