LinuxQuestions.org - [SOLVED] AWK - issues pattern matching

AWK - issues pattern matching

Hi guys, I'd appreciate any ideas on the scenario below.

I have a file that follows a pattern like the one below.
The 5th column holds a key value.
I want to collect all the rows that share the same value in that column and save them to a file, using the 5th-column value as the filename.

Here's what I have tried, but it doesn't work as expected:

Code:

#!/bin/bash

input="/home/hrts/Documents/input.txt"
dfile="/home/hrts/Documents/Book2.csv"

while IFS= read -r line
do
    #echo "$line"
    awk -F, '{if ($5 == "$line") {print}}' $dfile
done < "$input"

input.txt - contains the values to match against column 5
Book2.csv - has the raw input for processing

Sample data and expected output:

Quote:

Book2.csv has:

aa wqwq 5456 TRFRx 012
sd ddfg 5345 FDFDx 012
qw safa 3451 WQERa 012
jw sgha 3612 WBPRa 012
jw sgha 3612 WBPRa 012
qz asad 3214 EWDAa 014
wq aswq 3414 YWDAa 014
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
as e4aa 3254 DRQQz 091
bn e5yu 3890 PLMGc 091
... pattern goes on

input.txt has:
012
014
019
091

Expected output:
First file, filename 012.txt:
aa wqwq 5456 TRFRx 012
sd ddfg 5345 FDFDx 012
qw safa 3451 WQERa 012
jw sgha 3612 WBPRa 012
jw sgha 3612 WBPRa 012

Another file, filename 014.txt:
wq aswq 3414 YWDAa 014
qz asad 3214 EWDAa 014

Another file, filename 019.txt:
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019

Thank you for any ideas.

Put some of your field separators in ...

If the data in field $5 is 100% reliable, and guaranteed to be grouped, you could use it for the name of the file:

Code:

cat book2.csv \
| sort -k5,5n \
| awk '$5 {print > $5".txt"}'

Otherwise you'll have to iterate over book2.csv as many times as there are entries in input.txt.

Edit: removed redundant line.
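
For the record, here is a minimal sketch of that iterating approach, which is also roughly what the loop in the first post was attempting. It assumes comma-separated data (the -F, in the original script suggests as much) and uses made-up sample files in a temporary directory. In the original loop, "$line" sits inside single quotes, so the shell never expands it and awk compares $5 with the literal text $line. Passing the value in with -v avoids that.

```shell
# Illustrative sample files in a scratch directory (not the poster's real data).
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'aa,wqwq,5456,TRFRx,012' 'qz,asad,3214,EWDAa,014' 'sd,ddfg,5345,FDFDx,012' > book2.csv
printf '%s\n' 012 014 > input.txt

while IFS= read -r key
do
    # -v copies the shell value into the awk variable "key".
    # ($5 "") forces a string comparison, so 012 does not also match 12.
    awk -F, -v key="$key" '($5 "") == key' book2.csv > "$key.txt"
done < input.txt
```

This reads book2.csv once per key, so it gets slow with long key lists. The single-pass variants further down the thread avoid that.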

Hm, wouldn't just this be enough

Code:

awk 'out=$5".txt"{print>out}' book2.csv

Print unconditionally into filenames constructed from $5

Code:

awk '{ print > ($5 ".txt") }' book2.csv

Print only if $5 is in input.txt

Code:

awk 'NR==FNR { f[$1]; next } ($5 in f) { print > ($5 ".txt") }' input.txt book2.csv
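
In case the NR==FNR idiom is unfamiliar, here is a commented sketch of the same command. The sample rows are space-separated as originally posted, the 777 row is made up to show a key that is not listed, and the array name "wanted" is just a more descriptive stand-in for f.

```shell
# Illustrative sample files in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'aa wqwq 5456 TRFRx 012' 'qz asad 3214 EWDAa 014' 'xx yyyy 0000 ZZZZz 777' > book2.csv
printf '%s\n' 012 014 > input.txt

awk '
    # While reading the first file, NR (lines read overall) equals FNR (lines read in this file).
    NR == FNR { wanted[$1]; next }
    # Second file: write the row only if the 5th field was listed in the first file.
    ($5 in wanted) { print > ($5 ".txt") }
' input.txt book2.csv
```

Only 012.txt and 014.txt are created; the 777 row is dropped because 777 is not in input.txt.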

Quote:

Originally Posted by syg00 (Post 6106976)

Put some of your field separators in ...

Hmm, you got it, syg00. Yes, it's a CSV file; the spaces should be commas.
The lines actually look like value1,,value3 or ,,value3,,value5; basically, two consecutive commas mark a single field with a blank value.
My bad, I forgot to include the commas in my original post. Thanks for the heads-up.
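
A quick way to check how awk sees such lines, using the two patterns above as sample input. With -F, every comma ends a field, so consecutive commas produce empty fields rather than being merged.

```shell
# Print the field count, then fields 3 and 5 in brackets, for each sample line.
out=$(printf '%s\n' 'value1,,value3' ',,value3,,value5' |
      awk -F, '{ print NF ": [" $3 "] [" $5 "]" }')
printf '%s\n' "$out"
```

The first line has 3 fields (so $5 is empty), the second has 5.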

Quote:

Originally Posted by Turbocapitalist (Post 6106979)

If the data in field $5 is 100% reliable, and guaranteed to be grouped, you could use it for the name of the file:

Code:

cat book2.csv \
| sort -k5,5n \
| awk '$5 {print > $5".txt"}'

Otherwise you'll have to iterate over book2.csv as many times as there are entries in input.txt.

Edit: removed redundant line.

Thanks Turbocapitalist.

It works fine. I ended up with this one, since I forgot to include the commas in my post.

Quote:

cat book2.csv | sort -k5,5n | awk -F',' '$5 {print > $5".txt"}'

I just noticed an error: because the CSV file has URLs, it says awk cannot open https://about.gitlab.com

How do we tell AWK not to open the URLs but just process the file?

Cheers!

No problem. Though do look at MadeInGermany's second example in #5.

Quote:

Originally Posted by JJJCR (Post 6107261)

How do we tell AWK not to open the URLs but just process the file?

The field separator can be assigned a pattern, but exactly which pattern depends on the details in your data. Can you please post a few sanitized examples of the problematic lines along with a few normal lines?

Edit: Or else $5 needs to be validated. Again see the second example in #5 above. My example came with the caveat about $5 containing only numbers.
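
One way to validate $5 inside awk itself, as a rough sketch. It assumes the keys are digits only (per the caveat above) and comma-separated data. The sample rows, the example.com URL and the file name rejected.csv are all made up for illustration.

```shell
# Illustrative sample file in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'aa,wqwq,5456,TRFRx,012' 'bb,cccc,1234,https://example.com/' 'dd,eeee,5678,FFFFf,a/b' > book2.csv

awk -F, '
    # Use $5 as a file name only when it consists of digits.
    $5 ~ /^[0-9]+$/ { print > ($5 ".txt"); next }
    # Anything else is set aside for inspection instead of becoming a file name.
    { print > "rejected.csv" }
' book2.csv
```

The rows that end up in rejected.csv are the ones worth posting here.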

Quote:

Originally Posted by MadeInGermany (Post 6107211)

Print unconditionally into filenames constructed from $5

Code:

awk '{ print > ($5 ".txt") }' book2.csv

Print only if $5 is in input.txt

Code:

awk 'NR==FNR { f[$1]; next } ($5 in f) { print > ($5 ".txt") }' input.txt book2.csv

I just modified it to this: awk -F',' '{ print > ($5 ".txt") }' book2.csv

It works fine but stops when it finds a URL in the data; it says awk cannot open http://curl.haxx.se (no such file or directory)

How do we get AWK to avoid this error?

The second command you gave doesn't output any data.

Thanks for your help.
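
One possible reason the second command printed nothing (an assumption, since the exact invocation wasn't shown) is that it was run without -F, so each comma-separated line counted as a single field and $5 was always empty. A small sketch with made-up sample rows:

```shell
# Illustrative sample files in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'aa,wqwq,5456,TRFRx,012' 'qz,asad,3214,EWDAa,014' > book2.csv
printf '%s\n' 012 > input.txt

# Default separator: no blanks in the line, so $5 is empty and nothing matches.
awk 'NR==FNR { f[$1]; next } ($5 in f) { print > ($5 ".txt") }' input.txt book2.csv
[ -e 012.txt ] || echo "without -F, : no 012.txt written"

# With -F, the 5th comma-separated field is looked up in the list from input.txt.
awk -F, 'NR==FNR { f[$1]; next } ($5 in f) { print > ($5 ".txt") }' input.txt book2.csv
ls
```

Windows line endings in either file would have a similar effect, so that is worth ruling out too.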

If the fifth field does not have numbers but URLs instead, the slashes are going to give you trouble in the file names. You'll need to think of some other naming convention. The slash is not allowed in directory or file names, and it cannot be escaped there, so the gsub() function will be needed to replace it in your AWK script.
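
A rough sketch of the gsub() idea, assuming comma-separated data. The replacement character "_" and the set of allowed characters are arbitrary choices for illustration, and the sample rows are made up around the URL from the error message.

```shell
# Illustrative sample file in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'aa,bb,cc,dd,https://curl.haxx.se/' 'ee,ff,gg,hh,012' > book2.csv

awk -F, '{
    name = $5
    # Replace every character that is not a letter, digit, dot or dash with "_".
    gsub(/[^A-Za-z0-9.-]/, "_", name)
    print > (name ".txt")
}' book2.csv
ls
```

Keep in mind that two different keys can collapse into the same sanitized name, in which case their rows land in the same file.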

Quote:

Originally Posted by Turbocapitalist (Post 6107268)

The fifth field consistently contains only alphanumerics and dashes; the URL is in the 3rd field. Thanks.

Please show one of the offending lines plus the exact script you are trying.

Quote:

Originally Posted by Turbocapitalist (Post 6107273)

Please show one of the offending lines plus the exact script you are trying.

Here's the script and error:

Script:

Quote:

awk -F',' '{print > ($5 ".txt") }' book2.csv

error:

Quote:

awk: cannot open "https://curl.haxx.se/.txt" for output (No such file or directory)

Script:

Quote:

cat book2.csv | sort -k5,5n | awk -F',' '$5 {print > $5".txt"}'

error:

Quote:

awk: cannot open "https://about.gitlab.com/.txt" for output (No such file or directory)

Thank you. I'm not sure how to tell AWK to bypass the URLs and just copy the contents to a new file.

If lines are composed of a variable number of fields, but the key is always the last field, try $NF instead of $5 (or $(NF-1) for the second-to-last field, and so on).

Code:

awk -F, '{print>($NF".txt")}' book2.csv
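
To see whether the field count really varies, a small diagnostic sketch that tallies lines by number of fields. The sample rows are invented, and the third one has two extra commas so that the URL lands in $5.

```shell
# Illustrative sample file in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'aa,bb,cc,dd,012' 'ee,ff,gg,hh,014' 'ii,j,j,j,https://curl.haxx.se/,kk,012' > book2.csv

# Count how many lines have each field count; sort -n orders the summary by that count.
summary=$(awk -F, '{ count[NF]++ } END { for (n in count) print n " fields: " count[n] " line(s)" }' book2.csv | sort -n)
printf '%s\n' "$summary"
```

If more than one field count shows up, $5 is not the same column on every line, and $NF (or a stricter parser) is the safer bet.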

In addition to shruggy's suggestion you might show this output:

Code:

grep curl.haxx.se book2.csv

CSV is not a simple format and does really need a proper parser. So you might be looking at escalating to perl if using $NF instead of $5 does not do the trick.
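
A small illustration of why plain -F, falls short on real CSV. The sample line is made up and assumes the file may contain quoted fields with embedded commas, which has not been confirmed for the poster's data.

```shell
# The quoted name contains a comma, which -F, treats as just another separator.
out=$(printf '%s\n' 'a,"Smith, John",c,d,012' | awk -F, '{ print NF " fields, $5 = " $5 }')
printf '%s\n' "$out"
```

The line is meant to have five fields, but awk counts six and $5 ends up being d rather than 012.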