Map 1 CSV's columns to matching columns in another CSV

2legit2quit · 10-26-2011, 02:02 PM

Both files have thousands of rows and hundreds of columns of demographics data.

The first row contains the field names, later rows the actual data. Each file is for one year, and from year to year the order of the columns changes and fields may be added or removed.

I want to map the columns from File 2 to the layout of File 1.

The logic would be something like:

For each column in File 1, Row 1
Find a matching column in File 2, Row 1-
If a match, move this column to the position it's at in File 1
Else move the column to column #1001 (increment each time)
--end--

Sample data:

File 1, row 1: Year;ID;Description;%White;%Black;%White_Region;%Black_Region
File 2, row 1: Year;ID;Description;Address;%White;%Black;%Hispanic

Output is File 2's data structured as:
Row 1: Year;ID;Description;%White;%Black;%White_Region;%Black_Region; (more fields through 1000) ;Address;%Hispanic

I've been using *nix command line tools but I'm not picky about a solution in Perl or something else. Or even just instruction about which command line tools to use in which order.

Thanks for any help.

crabboy · 10-26-2011, 03:39 PM

What is your join criteria between the two files, is it the row number, year, ID and do both files have exactly the same number of rows? If the same column exists in file2 that supersedes the data in file1?

2legit2quit · 10-26-2011, 05:14 PM

The files have differing numbers of rows. The content of row 74 (or any other particular row) will usually be different. The columns are data points, the rows are regions the data applies to. Regions as well as data points collected change from year to year (hence file to file)

If the content of row 1 of a column X in File 2 matches the content of row 1, column Y in File 1, the output file should have its File 2's column X data placed in column Y.

If the contents of row 1 in column X in File 2 don't match the content of row 1 of any county in File 2, the output file should have its File 2's column X data placed somewhere far to the right, where it won't column should be moved far to the right and out of the way

We could start with the premise that the output file begins as a completely empty CSV maybe 1500 fields wide and 6000 rows long, then populated as the script runs.

This is a first step of combining about 10 files into a single file. Output columns at the far right are the ones that require manual intervention.

(I hope that makes it clearer...)

crabboy · 10-26-2011, 05:48 PM

I think I get it, but there is still one thing that is not clear, the rows. how do I match up the rows between the files? do I always assume that row 1056 in file 1 will match up with the same row in file 2 and the two can be joined together? It seems to me like the fields like the id field should be consistent between them, but I may just be misinterpreting the data.

Either way, I think this is a bit difficult for a shell script, at least for my skill level. It seems fairly trivial for a Perl or Java program.

what I'd do is find the union of all the headers in all the files and mark where they came from, marking identical columns as found in the latest file. Then step through each file line by line and build a new row of data based on the column information and pull the column data from the correct file.

2legit2quit · 10-26-2011, 10:31 PM

The rows will not match up by row number. The ID field will be consistent across all files.

crabboy · 10-26-2011, 11:32 PM

Here is the easy approach. This is a quick hack and never even compiled, so not likely to work without some tweaks.

It makes a few assumptions:
1. There are no ';' characters within the data. The csv is parsed the quick and dirty way, not the correct way. A real csv library should be used to load the data file.
2. The program assumes that the entire data set can fit into memory. The alternate is a bit more complex and required to index the ID in all the files and then iterate through the ids and pull the data line by line from each file. This allows writing a complete line after each ID, and less memory usage.
3. The order of the data may be changed, both columns and ID order.

Basically the code does the following.

Reads the first file and stores the header
then reads each line and iterates over each column as it iterates over the headers again.
Stuffs each data item into a map keyed by the header.
Then stuffs the line map into an map keyed by the ID.
Repeats this for each line in the file.

Does the same for the following x files
On successive files, the old line map is pulled out and all new fields overwrite the existing items in the map and new fields are added.

Finally what is not written, but simple to do, is dump out the HashMaps to a new file.

Code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.StringTokenizer;


public class test3
{

   HashMap<String,HashMap<String,String>> data = new HashMap<String,HashMap<String,String>>();
   
   public void parse(String[] files)
   {
      for ( int i = 0; i < files.length; i++)
      {
         String filename = files[i];
         try
         {
            ArrayList<String> headers = new ArrayList<String>();
            
            // Add the header row to the array list.
            BufferedReader br = new BufferedReader( new FileReader(filename));
            String header = br.readLine();
            StringTokenizer st = new StringTokenizer( header, ";");
            
            while ( st.hasMoreTokens() )
            {
               headers.add( st.nextToken() );
            }
            
            // Now parse the data section
            String dataline;
            String ID = null;
            HashMap<String,String> lineMap = new HashMap<String,String>();
            while (( dataline = br.readLine() ) != null )
            {
               Iterator<String> headerIter = headers.iterator();
               StringTokenizer stdata = new StringTokenizer( dataline, ";");
               while ( st.hasMoreTokens() )
               {
                  String head = headerIter.next();
                  String item = stdata.nextToken();
                  if ( head.compareTo("ID") == 0)
                  {
                     ID = head;
                  }
                  lineMap.put(head,item);
               }
            }
            if ( ID != null)
            {
               HashMap<String,String> existingID = data.get(ID);
               if ( existingID == null )
               {
                  data.put( ID, lineMap );
               }
               else
               {
                  existingID.putAll(lineMap);
               }
            }
            else
            {
               System.out.println("Error, unable to find ID");
            }
            
         }
         catch (FileNotFoundException e)
         {
            // TODO Auto-generated catch block
            e.printStackTrace();
         }
         catch (IOException e)
         {
            // TODO Auto-generated catch block
            e.printStackTrace();
         }
      }
   }

   
   public static void main(String[] args)
   {
      new test3().parse( args );

   }

}

ntubski · 10-26-2011, 11:33 PM

EDIT: oops, ninja'd by crabboy!

Quote:

Sample data:

File 1, row 1: Year;ID;Description;%White;%Black;%White_Region;%Black_Region
File 2, row 1: Year;ID;Description;Address;%White;%Black;%Hispanic

Is that CSV or SSV (semicolon separated values)?

So as I understand the problem, we aren't actually joining the files, but just coercing file2 into file1's format. Here's an awk script which I think will do that:

Code:

#!/usr/bin/awk -f

BEGIN {
    FS=";";
    # collect names of file1 columns
    getline < ARGV[1]
    for (i = 1; i <= NF; i++) {
        names2col[$i] = i;
        col2names[i] = $i
        nNames++
    }
}

FILENAME == ARGV[1] { nextfile } # skip file1

# map file2 columns to file1 columns
FILENAME == ARGV[2] && FNR == 1 {
    nf = nNames+1;
    for (i = 1; i <= NF; i++) {
        if ($i in names2col) {
            colmap[(names2col[$i])] = i;
        } else {
            col2names[nf] = $i;
            colmap[nf++] = i;
        }
    }
    # print mapped column names
    for (i = 1; i < nf; i++) {
        printf("%s", col2names[i]);
        if (i < nf-1) printf("%s", FS);
        else printf("\n");
    }
}

# print file2 remapped
FILENAME == ARGV[2] && FNR > 1 {
    for (i = 1; i < nf; i++) {
        if (i in colmap) printf("%s", $colmap[i]);
        else printf("NA");

        if (i < nf-1) printf("%s", FS);
        else printf("\n");
    }
}

Call as ./map.awk file1 file2

2legit2quit · 10-27-2011, 08:53 AM

Wow those solutions are great.

Crabboy and ntubski - thanks so much for your help!