LinuxQuestions.org


JJJCR 04-02-2020 06:16 AM

AWK - issues pattern matching
 
Hi guys, I'd appreciate any ideas on the scenario below.

I have a file that follows a pattern like the one below.
It has a key value in the 5th column.
I want to get all the rows that share the same value in that column and save them to a file, using the 5th-column value as the filename.

Here's what I have tried, but it doesn't work as expected:
Code:

#!/bin/bash
input="/home/hrts/Documents/input.txt"
dfile="/home/hrts/Documents/Book2.csv"

while IFS= read -r line
do
#echo "$line"
awk -F,  '{if ($5 == "$line") {print}}' $dfile

done < "$input"

input.txt - contains the values to match against column 5
book2.csv - has the raw input for processing
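
One reason the loop above prints nothing: inside the single-quoted awk program, "$line" is a literal string, so the shell never substitutes the value read from input.txt. A minimal sketch of the same loop that passes the value in with awk's -v option follows. It assumes the real data is comma-separated (as the -F, in the script suggests); the three-row Book2.csv, the scratch directory and the awk variable name `key` are made up for illustration.

```shell
#!/bin/bash
# Work in a scratch directory with small comma-separated stand-ins for the real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'aa,wqwq,5456,TRFRx,012' 'qz,asad,3214,EWDAa,014' 'sd,ddfg,5345,FDFDx,012' > Book2.csv
printf '%s\n' '012' '014' > input.txt

input="input.txt"
dfile="Book2.csv"

while IFS= read -r line
do
    # -v copies the shell value into the awk variable "key".
    # Appending "" to both sides forces a string comparison, so 012 does not match 12.
    awk -F, -v key="$line" '($5 "") == (key "")' "$dfile" > "$line.txt"
done < "$input"
```

This still reads Book2.csv once per entry in input.txt, and it creates an empty file for any value that matches no rows.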

Sample data and expected output:

Quote:


Book2.csv has:

aa wqwq 5456 TRFRx 012
sd ddfg 5345 FDFDx 012
qw safa 3451 WQERa 012
jw sgha 3612 WBPRa 012
jw sgha 3612 WBPRa 012
qz asad 3214 EWDAa 014
wq aswq 3414 YWDAa 014
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
as e4aa 3254 DRQQz 091
bn e5yu 3890 PLMGc 091
... pattern goes on

input.txt has:
012
014
019
091

Expected output:
First file, filename: 012.txt
aa wqwq 5456 TRFRx 012
sd ddfg 5345 FDFDx 012
qw safa 3451 WQERa 012
jw sgha 3612 WBPRa 012
jw sgha 3612 WBPRa 012

Another file, filename: 014.txt
wq aswq 3414 YWDAa 014
qz asad 3214 EWDAa 014


Another file, filename: 019.txt
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019
qw weew 4651 DDGFa 019

Thank you for any ideas.

syg00 04-02-2020 06:28 AM

Put some of your field separators in ...

Turbocapitalist 04-02-2020 06:45 AM

If the data in field $5 is 100% reliable, and guaranteed to be grouped, you could use it for the name of the file:

Code:

cat book2.csv \
| sort -k5,5n \
| awk '$5 {print > $5".txt"}'

Otherwise you'll have to iterate over book2.csv as many times as there are entries in input.txt.

Edit: removed redundant line.

shruggy 04-02-2020 07:49 AM

Hm, wouldn't just this be enough?
Code:

awk 'out=$5".txt"{print>out}' book2.csv

MadeInGermany 04-02-2020 06:37 PM

Print unconditionally into filenames constructed from $5
Code:

awk '{ print > ($5 ".txt") }' book2.csv
Print only if $5 is in input.txt
Code:

awk 'NR==FNR { f[$1]; next } ($5 in f) { print > ($5 ".txt") }' input.txt book2.csv

JJJCR 04-02-2020 09:42 PM

Quote:

Originally Posted by syg00 (Post 6106976)
Put some of your field separators in ...

Hmm, you got it, syg00. Yes, it's a CSV file; the spaces in my sample should be commas.
The real lines actually look like value1,,value3 or ,,value3,,value5. Two consecutive commas mean a field with a blank value.
My bad, I forgot to include the commas in my original post. Thanks for the heads up.

JJJCR 04-02-2020 09:45 PM

Quote:

Originally Posted by Turbocapitalist (Post 6106979)
If the data in field $5 is 100% reliable, and guaranteed to be grouped, you could use it for the name of the file:

Code:

cat book2.csv \
| sort -k5,5n \
| awk '$5 {print > $5".txt"}'

Otherwise you'll have to iterate over book2.csv as many times as there are entries in input.txt.

Edit: removed redundant line.

Thanks Turbocapitalist.

It works fine. I ended up with this one, since I forgot to include the commas in my post:

Quote:

cat book2.csv | sort -k5,5n | awk -F',' '$5 {print > $5".txt"}'

I just noticed an error, though: because the CSV file has URLs, awk says it cannot open https://about.gitlab.com

How do we tell AWK not to open the URLs and just process the file?

Cheers!

Turbocapitalist 04-02-2020 09:49 PM

No problem. Though do look at MadeInGermany's second example in #5.

Quote:

Originally Posted by JJJCR (Post 6107261)
How do we tell AWK not to open the URLs but just process the file?

The field separator can be assigned a pattern, but exactly which pattern depends on the details in your data. Can you please post a few sanitized examples of the problematic lines along with a few normal lines?

Edit: Or else $5 needs to be validated. Again see the second example in #5 above. My example came with the caveat about $5 containing only numbers.

JJJCR 04-02-2020 09:52 PM

Quote:

Originally Posted by MadeInGermany (Post 6107211)
Print unconditionally into filenames constructed from $5
Code:

awk '{ print > ($5 ".txt") }' book2.csv
Print only if $5 is in input.txt
Code:

awk 'NR==FNR { f[$1]; next } ($5 in f) { print > ($5 ".txt") }' input.txt book2.csv

I just modified it to this: awk -F',' '{ print > ($5 ".txt") }' book2.csv

It works fine but stopped when it found a URL in the data; it says awk cannot open http://curl.haxx.se (No such file or directory)

How do we get AWK to avoid this error?

The second command you gave doesn't output any data.

Thanks for your help.
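
One likely reason the second command printed nothing is that it was run without -F, so awk split the comma-separated lines on whitespace and $5 never matched an entry from input.txt. Stray carriage returns in input.txt would be another possibility. A minimal sketch with the separator added, using a made-up three-row book2.csv in a scratch directory:

```shell
# Scratch directory with small stand-ins for the real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'aa,wqwq,5456,TRFRx,012' 'qz,asad,3214,EWDAa,014' 'as,e4aa,3254,DRQQz,091' > book2.csv
printf '%s\n' '012' '091' > input.txt

# -F, applies to both files; input.txt has no commas, so each whole line is $1.
# First file: remember every key. Second file: print rows whose $5 is a known key.
awk -F, 'NR==FNR { f[$1]; next } ($5 in f) { print > ($5 ".txt") }' input.txt book2.csv
```

Rows whose fifth field is not listed in input.txt (014 here) are skipped, so a URL in that position would never be used as a filename.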

Turbocapitalist 04-02-2020 09:55 PM

If the fifth field does not have numbers but URLs instead, the slashes are going to give you trouble in the file names. You'll need to think of some other naming convention. The slash is not allowed inside a directory or file name, so the gsub() function will be needed to replace the slashes in your AWK script.
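
As a rough illustration of the gsub() idea, the sketch below copies $5 into a variable and replaces every character outside a conservative set with an underscore before using it as a filename. The sample rows, the variable name `name` and the choice of underscore are assumptions for illustration, not something from the thread.

```shell
# Scratch directory with a made-up file whose 5th field is a URL.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'curl,7.68,MIT,x,https://curl.haxx.se/' 'gitlab,12.9,MIT,y,https://about.gitlab.com/' > sample.csv

awk -F, '{
    name = $5                            # work on a copy so the printed line stays unchanged
    gsub(/[^A-Za-z0-9._-]/, "_", name)   # anything outside letters, digits, dot, underscore, dash becomes "_"
    print > (name ".txt")
}' sample.csv
```

With this input the first row lands in https___curl.haxx.se_.txt. Different field values can collapse to the same sanitized name, so this only helps if that is acceptable.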

JJJCR 04-02-2020 10:26 PM

Quote:

Originally Posted by Turbocapitalist (Post 6107268)
If the fifth field does not have numbers but URLs instead, the slashes are going to give you trouble in the file names. You'll need to think of some other naming convention. The slash is not allowed inside a directory or file name, so the gsub() function will be needed to replace the slashes in your AWK script.

The fifth field consistently holds alphanumeric characters and dashes; the URL is in the 3rd field. Thanks.

Turbocapitalist 04-02-2020 10:30 PM

Please show one of the offending lines plus the exact script you are trying.

JJJCR 04-02-2020 10:49 PM

Quote:

Originally Posted by Turbocapitalist (Post 6107273)
Please show one of the offending lines plus the exact script you are trying.

Here's the script and error:

Script:
Quote:

awk -F',' '{print > ($5 ".txt") }' book2.csv
error:
Quote:

awk: cannot open "https://curl.haxx.se/.txt" for output (No such file or directory)
Script:
Quote:

cat book2.csv | sort -k5,5n | awk -F',' '$5 {print > $5".txt"}'
error:
Quote:

awk: cannot open "https://about.gitlab.com/.txt" for output (No such file or directory)

Thank you. I'm not sure how to tell AWK to bypass the URLs and just copy the contents to a new file.

shruggy 04-02-2020 11:44 PM

If lines are composed of a variable number of fields, but the key number is always the last field, try $NF instead of $5 (or $(NF-1) for the second-to-last field, and so on).
Code:

awk -F, '{print>($NF".txt")}' book2.csv
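
To check whether the lines really do have a varying number of fields, a quick tally of NF per line can help. This is only a sketch on made-up data; the file field_counts.txt and the array name `count` are illustrative.

```shell
# Scratch directory with a made-up file where one line has fewer fields.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'aa,wqwq,5456,TRFRx,012' 'curl,7.68,MIT,https://curl.haxx.se/' 'qz,asad,3214,EWDAa,014' > book2.csv

# Tally how many lines have each number of comma-separated fields,
# then sort by field count: one "fields lines" pair per output row.
awk -F, '{ count[NF]++ } END { for (n in count) print n, count[n] }' book2.csv | sort -n > field_counts.txt
cat field_counts.txt
```

More than one row in the result means a fixed $5 points at different columns on different lines.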

Turbocapitalist 04-02-2020 11:51 PM

In addition to trying shruggy's suggestion, you might show the output of this:

Code:

grep curl.haxx.se book2.csv

CSV is not a simple format and really does need a proper parser. So you might be looking at escalating to perl if using $NF instead of $5 does not do the trick.


All times are GMT -5.