07-05-2012, 07:03 AM   #1
arbex5 (LQ Newbie)
Delete rows of a CSV file that don't meet a criterion


I need something very similar to an old post that already has a solution:
http://www.linuxquestions.org/questi...18#post4719618

I have two CSV files (geodata files).

One has 3 columns (let's call it file_1) and the other has 5 columns (let's call it file_2).

file_1 is like this:

startIpNum,endIpNum,locId
"16777216","16777471","17"
"16777472","16777727","24328"
"16777728","16778239","49"

file_2 is like this:

geo_id,country,region,city,postalCode
32,BR,,,
19304,BR,SP,São Paulo,
24328,BR,RJ,Rio De Janeiro,
22975,BR,SP,Salto,

What I need is to create a new file (let's call it file_3) by matching the locId field of file_1 against the geo_id field of file_2.
In other words, whenever a locId value in file_1 equals a geo_id value in file_2, write that row of file_1 to file_3.

Example:

Using the tables above and running the script, the resulting file_3 would be:

startIpNum,endIpNum,locId
"16777472","16777727","24328"

That is, the locId "24328" in file_1 matches the geo_id "24328" in file_2, so that row ends up in file_3.

Is this hard to make?
Thanks.
 
07-05-2012, 08:46 AM   #2
danielbmartin (Senior Member)
Have InFile1...
Code:
"16777216","16777471","17"
"16777472","16777727","24328"
"16777728","16778239","49"
Have InFile2...
Code:
32,BR,,,
19304,BR,SP,São Paulo,
24328,BR,RJ,Rio De Janeiro,
22975,BR,SP,Salto,
Want OutFile...
Code:
"16777472","16777727","24328"
Code:
cut -d, -f1 $InFile2               \
|sed 's/^/"/g'                     \
|sed 's/$/"/g'                     \
|sort                              \
|join -t, -1 3 -2 1 $InFile1 -     \
|awk -F"," '{printf $2","$3","$1}' \
> $OutFile
Daniel B. Martin
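For anyone trying this at home, a minimal setup sketch; the file names below are assumptions, not part of the post. With these three variables set, the pipeline above runs as-is on the sample data:
Code:
# Illustrative file names (assumptions, not from the post).
InFile1=InFile1.csv   # the "startIpNum","endIpNum","locId" sample rows
InFile2=InFile2.csv   # the geo_id,country,region,city,postalCode sample rows
OutFile=OutFile.csv
# Stage by stage:
#   cut  - keep only the geo_id column of InFile2
#   sed  - wrap each id in double quotes (24328 -> "24328") so it matches
#          the quoted locId field of InFile1
#   sort - put the ids in the order join expects on its second input
#   join - match field 3 of InFile1 against field 1 of the quoted id list
#   awk  - restore the original column order (join prints the join field first)
# Expected OutFile for the sample data: "16777472","16777727","24328"
One small note: the awk printf has no trailing newline, so if more than one row matches, the output runs together on a single line; awk's print would emit one line per match.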
 
07-05-2012, 09:42 AM   #3
arbex5 (LQ Newbie, Original Poster)
Hi Daniel, thanks for the answer.

Here is what I did:

I renamed the files to match your command and keep things simple.

Here is what I got:

join: InFile1.csv:2: is not sorted: "16777216","16777471","17"


The result was an empty file.
To simplify, InFile2.csv can be just a single geo_id column, like this:

geo_id
32
19304
24328
22975

So if the locId field in InFile1.csv has a matching value in InFile2.csv, write those rows to a new InFile3.csv.

Thanks
 
07-05-2012, 09:54 AM   #4
danielbmartin (Senior Member)
Quote:
Originally Posted by arbex5
Here is what I got:

join: InFile1.csv:2: is not sorted: "16777216","16777471","17"
I suspect you ran my code against files different from those in your original post. join wants its input files in sorted order. Here is a revised solution which ensures sorted input. This code requires a temporary work file.
Code:
cut -d, -f1 $InFile2               \
|sed 's/^/"/g'                     \
|sed 's/$/"/g'                     \
|sort                              \
> $Work1

sort -k3,3  $InFile1               \
|join -t, -1 3 -2 1 - $Work1       \
|awk -F"," '{printf $2","$3","$1}' \
> $OutFile2
Daniel B. Martin
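One detail worth noting: sort's default field separator is whitespace, so on an all-comma line -k3,3 does not select the third CSV field, which is one likely reason join still complains in the next post. Below is a sketch of the same two steps with -t, added (and LC_ALL=C so sort and join agree on byte order); the variable names are the ones from the post, and these changes are assumptions, not part of the original:
Code:
# Sketch only: same two steps, with sort told that fields are comma-separated.
export LC_ALL=C                    # byte-wise collation for both sort and join

cut -d, -f1 $InFile2               \
|sed 's/^/"/g'                     \
|sed 's/$/"/g'                     \
|sort                              \
> $Work1

sort -t, -k3,3 $InFile1            \
|join -t, -1 3 -2 1 - $Work1       \
|awk -F, '{print $2","$3","$1}'    \
> $OutFile2
Using print rather than printf also adds a newline after each matching row.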
 
07-05-2012, 11:08 AM   #5
arbex5 (LQ Newbie, Original Poster)
Hi Daniel, thank you again.

I think I'm doing something wrong.
Here is the error.

join: -:2: is not sorted: "1000005632","1000006655","104820"

Here are the commands I executed.

$ cut -d, -f1 InFile2.csv |sed 's/^/"/g' |sed 's/$/"/g' |sort > Work1
$ sort -k3,3 InFile1.csv |join -t, -1 3 -2 1 - Work1 |awk -F"," '{printf $2","$3","$1}' > OutFile2

The result is a Work1 file with the geo_id values sorted in ascending order, and an empty OutFile2.

The file InFile1.csv is not sorted.
Here is a snapshot of both (InFile1.csv has almost 2M rows and InFile2.csv has almost 5k rows):

InFile1.csv

startIpNum,endIpNum,locId
"16777216","16777471","17"
"16777472","16777727","24328"
"16777728","16778239","49"
"16778240","16779263","17"
"16779264","16781311","49"
"16781312","16785407","111"
"16785408","16793599","49"
"16793600","16809983","111"
"16809984","16842751","209"
.
.
.

InFile2.csv (isolated to just the geo_id column)

geo_id
32
19304
19319
22975
23787
23793
24758
24927
24930
.
.
.

Thanks again.
 
07-05-2012, 11:38 AM   #6
grail (LQ Guru)
How about:
Code:
awk -F, 'NR==FNR{id[$1];next}FNR==1 || gensub(/"/,"","g",$3) in id' file2 file1
This was tested with the original sample data.
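For readers following the one-liner: the same program spread out with comments (GNU awk assumed, since gensub() is a gawk extension; file2 and file1 are the names used above):
Code:
gawk -F, '
    # First argument (file2): remember every geo_id as an array key.
    NR == FNR { id[$1]; next }
    # Second argument (file1): print the header line, plus any line whose
    # third field, with its double quotes stripped, is a remembered geo_id.
    FNR == 1 || gensub(/"/, "", "g", $3) in id
' file2 file1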
 
07-05-2012, 02:01 PM   #7
danielbmartin (Senior Member)
Quote:
Originally Posted by arbex5
I think I'm doing something wrong.
I post a solution only after testing it. The code generates the expected output here.

Are you testing with the same small test files which were given in your original post? If not, please do that first. If the code works, then test with the large real-world files. If the code works with small test files but not with large real-world files you may be discovering something interesting in your data.

Daniel B. Martin
 
07-05-2012, 02:15 PM   #8
danielbmartin (Senior Member)
Quote:
Originally Posted by arbex5
Here is the error.
join: -:2: is not sorted: "1000005632","1000006655","104820"
Let's try a different line of attack. This code does not use join and does not require sorting.
Code:
cut -d, -f1 $InFile2   \
|sed 's/^/"/g'         \
|sed 's/$/"$/g'        \
|grep -f -  $InFile1   \
> $OutFile
Daniel B. Martin
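The trick here is that the sed turns each geo_id into an end-anchored regular expression; the leading quote plus the end-of-line anchor restrict the match to the last field of each InFile1 line, which is locId. A quick illustration with the sample geo_ids (the combined sed below is just the two seds from the post written as one):
Code:
printf '%s\n' 32 19304 24328 22975 \
| sed 's/^/"/g; s/$/"$/g'
# prints:
#   "32"$
#   "19304"$
#   "24328"$
#   "22975"$
# grep -f - then keeps every InFile1 line ending in one of those quoted ids.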
 
07-05-2012, 05:12 PM   #9
arbex5 (LQ Newbie, Original Poster)
Worked pretty well

Hello Daniel, it worked pretty well.
Both versions did.
The first one worked with the sample data I provided.
The last one worked with the real file itself.

Thank you so much.
Have a nice day.
 
07-06-2012, 02:18 PM   #10
danielbmartin (Senior Member)
OP says this code works on his real-world files.
Code:
cut -d, -f1 $InFile2   \
|sed 's/^/"/g'         \
|sed 's/$/"$/g'        \
|grep -f -  $InFile1   \
> $OutFile
I found a way to tighten up this code but (on the minus side) it is less readable.
Code:
sed -r 's/([^,]*).*/"\1"$/1' $InFile2 \
|grep -f - $InFile1                   \
> $OutFile8
This sed says:
- read the file $InFile2
- extract whatever character string precedes the first comma
- prefix it with a " (double quote)
- suffix it with a " (double quote) and a $ (dollar sign)
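For example, pushing one sample row from this thread through that sed (illustration only):
Code:
$ echo '24328,BR,RJ,Rio De Janeiro,' | sed -r 's/([^,]*).*/"\1"$/1'
"24328"$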

Daniel B. Martin
 
  


Tags
csv, delete, matching