LinuxQuestions.org
Share your knowledge at the LQ Wiki.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 10-26-2011, 02:02 PM   #1
2legit2quit
LQ Newbie
 
Registered: Oct 2011
Location: St Louis, Missouri, USA
Distribution: Ubuntu 11.04
Posts: 5

Rep: Reputation: Disabled
Question Map 1 CSV's columns to matching columns in another CSV


Both files have thousands of rows and hundreds of columns of demographics data.

The first row contains the field names, later rows the actual data. Each file is for one year, and from year to year the order of the columns changes and fields may be added or removed.

I want to map the columns from File 2 to the layout of File 1.

The logic would be something like:

For each column in File 1, Row 1
Find a matching column in File 2, Row 1-
If a match, move this column to the position it's at in File 1
Else move the column to column #1001 (increment each time)
--end--


Sample data:

File 1, row 1: Year;ID;Description;%White;%Black;%White_Region;%Black_Region
File 2, row 1: Year;ID;Description;Address;%White;%Black;%Hispanic

Output is File 2's data structured as:
Row 1: Year;ID;Description;%White;%Black;%White_Region;%Black_Region; (more fields through 1000) ;Address;%Hispanic


I've been using *nix command line tools but I'm not picky about a solution in Perl or something else. Or even just instruction about which command line tools to use in which order.

Thanks for any help.
 
Old 10-26-2011, 03:39 PM   #2
crabboy
Senior Member
 
Registered: Feb 2001
Location: Atlanta, GA
Distribution: Slackware
Posts: 1,821

Rep: Reputation: 121Reputation: 121
What is your join criteria between the two files, is it the row number, year, ID and do both files have exactly the same number of rows? If the same column exists in file2 that supersedes the data in file1?

Last edited by crabboy; 10-26-2011 at 03:44 PM.
 
Old 10-26-2011, 05:14 PM   #3
2legit2quit
LQ Newbie
 
Registered: Oct 2011
Location: St Louis, Missouri, USA
Distribution: Ubuntu 11.04
Posts: 5

Original Poster
Rep: Reputation: Disabled
The files have differing numbers of rows. The content of row 74 (or any other particular row) will usually be different. The columns are data points, the rows are regions the data applies to. Regions as well as data points collected change from year to year (hence file to file)

If the content of row 1 of a column X in File 2 matches the content of row 1, column Y in File 1, the output file should have its File 2's column X data placed in column Y.

If the contents of row 1 in column X in File 2 don't match the content of row 1 of any county in File 2, the output file should have its File 2's column X data placed somewhere far to the right, where it won't column should be moved far to the right and out of the way

We could start with the premise that the output file begins as a completely empty CSV maybe 1500 fields wide and 6000 rows long, then populated as the script runs.


This is a first step of combining about 10 files into a single file. Output columns at the far right are the ones that require manual intervention.

(I hope that makes it clearer...)
 
Old 10-26-2011, 05:48 PM   #4
crabboy
Senior Member
 
Registered: Feb 2001
Location: Atlanta, GA
Distribution: Slackware
Posts: 1,821

Rep: Reputation: 121Reputation: 121
I think I get it, but there is still one thing that is not clear, the rows. how do I match up the rows between the files? do I always assume that row 1056 in file 1 will match up with the same row in file 2 and the two can be joined together? It seems to me like the fields like the id field should be consistent between them, but I may just be misinterpreting the data.


Either way, I think this is a bit difficult for a shell script, at least for my skill level. It seems fairly trivial for a Perl or Java program.

what I'd do is find the union of all the headers in all the files and mark where they came from, marking identical columns as found in the latest file. Then step through each file line by line and build a new row of data based on the column information and pull the column data from the correct file.
 
Old 10-26-2011, 10:31 PM   #5
2legit2quit
LQ Newbie
 
Registered: Oct 2011
Location: St Louis, Missouri, USA
Distribution: Ubuntu 11.04
Posts: 5

Original Poster
Rep: Reputation: Disabled
The rows will not match up by row number. The ID field will be consistent across all files.
 
Old 10-26-2011, 11:32 PM   #6
crabboy
Senior Member
 
Registered: Feb 2001
Location: Atlanta, GA
Distribution: Slackware
Posts: 1,821

Rep: Reputation: 121Reputation: 121
Here is the easy approach. This is a quick hack and never even compiled, so not likely to work without some tweaks.

It makes a few assumptions:
1. There are no ';' characters within the data. The csv is parsed the quick and dirty way, not the correct way. A real csv library should be used to load the data file.
2. The program assumes that the entire data set can fit into memory. The alternate is a bit more complex and required to index the ID in all the files and then iterate through the ids and pull the data line by line from each file. This allows writing a complete line after each ID, and less memory usage.
3. The order of the data may be changed, both columns and ID order.

Basically the code does the following.

Reads the first file and stores the header
then reads each line and iterates over each column as it iterates over the headers again.
Stuffs each data item into a map keyed by the header.
Then stuffs the line map into an map keyed by the ID.
Repeats this for each line in the file.

Does the same for the following x files
On successive files, the old line map is pulled out and all new fields overwrite the existing items in the map and new fields are added.

Finally what is not written, but simple to do, is dump out the HashMaps to a new file.

Code:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.StringTokenizer;


public class test3
{

   HashMap<String,HashMap<String,String>> data = new HashMap<String,HashMap<String,String>>();
   
   public void parse(String[] files)
   {
      for ( int i = 0; i < files.length; i++)
      {
         String filename = files[i];
         try
         {
            ArrayList<String> headers = new ArrayList<String>();
            
            // Add the header row to the array list.
            BufferedReader br = new BufferedReader( new FileReader(filename));
            String header = br.readLine();
            StringTokenizer st = new StringTokenizer( header, ";");
            
            while ( st.hasMoreTokens() )
            {
               headers.add( st.nextToken() );
            }
            
            // Now parse the data section
            String dataline;
            String ID = null;
            HashMap<String,String> lineMap = new HashMap<String,String>();
            while (( dataline = br.readLine() ) != null )
            {
               Iterator<String> headerIter = headers.iterator();
               StringTokenizer stdata = new StringTokenizer( dataline, ";");
               while ( st.hasMoreTokens() )
               {
                  String head = headerIter.next();
                  String item = stdata.nextToken();
                  if ( head.compareTo("ID") == 0)
                  {
                     ID = head;
                  }
                  lineMap.put(head,item);
               }
            }
            if ( ID != null)
            {
               HashMap<String,String> existingID = data.get(ID);
               if ( existingID == null )
               {
                  data.put( ID, lineMap );
               }
               else
               {
                  existingID.putAll(lineMap);
               }
            }
            else
            {
               System.out.println("Error, unable to find ID");
            }
            
         }
         catch (FileNotFoundException e)
         {
            // TODO Auto-generated catch block
            e.printStackTrace();
         }
         catch (IOException e)
         {
            // TODO Auto-generated catch block
            e.printStackTrace();
         }
      }
   }

   
   public static void main(String[] args)
   {
      new test3().parse( args );

   }

}

Last edited by crabboy; 10-26-2011 at 11:34 PM.
 
1 members found this post helpful.
Old 10-26-2011, 11:33 PM   #7
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Rep: Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081Reputation: 2081
EDIT: oops, ninja'd by crabboy!

Quote:
Sample data:

File 1, row 1: Year;ID;Description;%White;%Black;%White_Region;%Black_Region
File 2, row 1: Year;ID;Description;Address;%White;%Black;%Hispanic
Is that CSV or SSV (semicolon separated values)?

So as I understand the problem, we aren't actually joining the files, but just coercing file2 into file1's format. Here's an awk script which I think will do that:
Code:
#!/usr/bin/awk -f

BEGIN {
    FS=";";
    # collect names of file1 columns
    getline < ARGV[1]
    for (i = 1; i <= NF; i++) {
        names2col[$i] = i;
        col2names[i] = $i
        nNames++
    }
}

FILENAME == ARGV[1] { nextfile } # skip file1

# map file2 columns to file1 columns
FILENAME == ARGV[2] && FNR == 1 {
    nf = nNames+1;
    for (i = 1; i <= NF; i++) {
        if ($i in names2col) {
            colmap[(names2col[$i])] = i;
        } else {
            col2names[nf] = $i;
            colmap[nf++] = i;
        }
    }
    # print mapped column names
    for (i = 1; i < nf; i++) {
        printf("%s", col2names[i]);
        if (i < nf-1) printf("%s", FS);
        else printf("\n");
    }
}

# print file2 remapped
FILENAME == ARGV[2] && FNR > 1 {
    for (i = 1; i < nf; i++) {
        if (i in colmap) printf("%s", $colmap[i]);
        else printf("NA");

        if (i < nf-1) printf("%s", FS);
        else printf("\n");
    }
}
Call as ./map.awk file1 file2

Last edited by ntubski; 10-26-2011 at 11:37 PM.
 
1 members found this post helpful.
Old 10-27-2011, 08:53 AM   #8
2legit2quit
LQ Newbie
 
Registered: Oct 2011
Location: St Louis, Missouri, USA
Distribution: Ubuntu 11.04
Posts: 5

Original Poster
Rep: Reputation: Disabled
Wow those solutions are great.

Crabboy and ntubski - thanks so much for your help!
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] Bash script to read csv file with multiple length columns japena Linux - Newbie 17 07-27-2011 01:47 PM
convert columns to rows (tab separated file to csv) doug23 Programming 16 08-16-2009 09:14 PM
Batch manipulating CSV columns and files in Perl script briana.paige Linux - Newbie 1 07-14-2009 11:02 AM
CSV null columns with gawk pcock Programming 1 03-30-2009 07:35 PM
Comparing two csv files and write different record in third CSV file irfanb146 Linux - Newbie 3 06-30-2008 09:15 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 10:16 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration