convert columns to rows (tab separated file to csv)

doug23 · 08-10-2009, 06:46 PM

Hi All,

I bet this is a Perl one-liner (or very simple python script).

I have a tab separated files in which each row looks like:

Unique_Eight_Character_Sequence [3 tabs] data1~moredata1~moredata1 [3 tabs] data2~moredata2~moredata2 ... dataN~....

The output file should have each column converted into a row (with the unique character sequence copied in for the first column), and then each "~" replaced by a comma.

e.g.

Unique_Eight_Character_Sequence1,data1,moredata1,moredata1
Unique_Eight_Character_Sequence1,data2,moredata2,moredata2
Unique_Eight_Character_Sequence2,data1,moredata1,moredata1
...

Thanks for your help!
Doug

Telemachos · 08-10-2009, 07:52 PM

Can you please post a sample of the actual data (or a mocked-up version of it)? Put it in [code][/code] tags and please show where the line breaks are (if any).

I can't quite see what the data looks like from your description.

To clarify the two examples you gave don't seem to me to match up in any obvious way. I can't see how you get from 1 to 2 given your description.

Unique_Eight_Character_Sequence [3 tabs] data1~moredata1~moredata1 [3 tabs] data2~moredata2~moredata2 ... dataN~...
Unique_Eight_Character_Sequence1,data1,moredata1,moredata1
Unique_Eight_Character_Sequence1,data2,moredata2,moredata2
Unique_Eight_Character_Sequence2,data1,moredata1,moredata1

There is no Unique_Eight_Character_Sequence2 in the first example, is there? Why do you repeat Unique_Eight_Character_Sequence1 twice at the head of two different rows? Why are there a bunch of data2 columns in the second row with a Sequence1 first column, but then a bunch of data1 rows in the third line with a Sequence2 first row!?!?

doug23 · 08-10-2009, 09:48 PM

Sure, no problem (this isn't proprietary). Also, turns out it'll have to remain tab-separated because some of the fields already include commas in some of the values.

Code:

PatNum	ApplicantType~ApplicantDesignation~ApplicantLastName~ApplicantFirstName~ApplicantCity~ApplicantState~ApplicantCountry~ApplicantNationality~ApplicantResidence
D0500396			applicant-inventor~us-only~Akagi~Shigeki~Kounu-gun, Hiroshima, 729-3431~~JP~JP~JP
D0500397			applicant-inventor~us-only~Dal Magro~Volney~Farroupilha~~BR~BR~BR			applicant-inventor~us-only~Matsuo~Edson~Porto Alegre~~BR~BR~BR			applicant-inventor~us-only~Canei~Ademir~Farroupilha~~BR~BR~BR
D0500398			applicant-inventor~us-only~Dal Magro~Volney~Farroupilha~~BR~BR~BR			applicant-inventor~us-only~Matsuo~Edson~Porto Alegre~~BR~BR~BR			applicant-inventor~us-only~Canei~Ademir~Farroupilha~~BR~BR~BR
D0500399			applicant-inventor~us-only~Fuerst~Rory W.~Atherton~CA~US~omitted~US

Then, the output should look like

Code:

PatNum	ApplicantType~ApplicantDesignation~ApplicantLastName~ApplicantFirstName~ApplicantCity~ApplicantState~ApplicantCountry~ApplicantNationality~ApplicantResidence
D0500396	applicant-inventor	us-only	Akagi~Shigeki~Kounu-gun, Hiroshima, 729-3431~~JP~JP~JP
D0500397	applicant-inventor	us-only	Dal Magro~Volney~Farroupilha~~BR~BR~BR
D0500397	applicant-inventor	us-only	Matsuo~Edson~Porto Alegre~~BR~BR~BR
D0500397	applicant-inventor	us-only	Canei~Ademir~Farroupilha~~BR~BR~BR
D0500398	applicant-inventor	us-only	Dal Magro~Volney~Farroupilha~~BR~BR~BR
D00500398	applicant-inventor	us-only	Matsuo~Edson~Porto Alegre~~BR~BR~BR
D00500398	applicant-inventor	us-only	Canei~Ademir~Farroupilha~~BR~BR~BR
D0500399	applicant-inventor	us-only	Fuerst~Rory W.~Atherton~CA~US~omitted~US

and so on with all the "~" replaced with tabs

Wim Sturkenboom · 08-11-2009, 12:20 AM

Code:

sed -e 's/\t\t\t/\t/g' yourfile
sed -e 's/~/\t/g' yourfile

should be a start
The first one replaces three tabs by one tab and the second one replaces ~ by tab. You can pipe the result of the first one into the second one and redirect the output into a new file

Code:

sed -e 's/\t\t\t/\t/g' yourfile | sed -e 's/~/\t/g' > result.tab

Note
The csv spec supports commas in data fields.

Code:

wim, "myaddress, mytown"

is a record with 2 fields, the first one containing wim and the second one containing the address info (with a comma)
It will be more complicated to create this using command line tools, but a number of languages support csv and will be able to do it.

Telemachos · 08-11-2009, 06:47 AM

@Wim: take a look again at the OP's data and output in his second post. There needs to be some kind of programming logic (beyond just substitutions using sed) since some patents have multiple inventors. Those that do must be entered as a first column once for each inventor. (See D0500397 and D0500398.) Also, the OP wants "applicant-inventor" made into its own column.

If I'm reading the data->output conversions correctly, then the logic seems to be that if there is more than one "applicant-inventor" on a line, then that first column will need to be repeated, once for each extra "applicant-inventor". This would require Perl, Python, Ruby or awk, I think. (Someone might be able to do it all with sed, but I wouldn't want to maintain it.

Anyhow, here is a start of how I might parse it in Perl:

Code:

#!/usr/bin/env perl
use strict;
use warnings;

open my $fh, '<', 'data.txt'
  or die "Can't open [data.txt] for reading: $!";

my $headers = <$fh>;
my %patents_inventors;

while (my $line = <$fh>) {
  chomp $line;
  my @columns = split /\t+/, $line;
  my $key = shift @columns;
  $patents_inventors{$key} = [ @columns ];
}

use Data::Dumper;
print Dumper \%patents_inventors;

Edit - @Doug23: My answer here doesn't make one thing clear. It's trivial to reduce multiple tabs to one tab and to replace tabs with commas. But what you want is more than that. It's still relatively easy to do, but it's not a trivial one-liner anymore. You need some kind of programming logic to deal with lines that have more than one inventor (so that they get repeated in the way you want). Since you have to maintain it, do you know a scripting language? The best thing will be to work with a language you're comfortable with.

doug23 · 08-11-2009, 09:13 AM

Quote:

Originally Posted by Telemachos

@Wim: take a look again at the OP's data and output in his second post. There needs to be some kind of programming logic (beyond just substitutions using sed) since some patents have multiple inventors. Those that do must be entered as a first column once for each inventor. (See D0500397 and D0500398.) Also, the OP wants "applicant-inventor" made into its own column.

If I'm reading the data->output conversions correctly, then the logic seems to be that if there is more than one "applicant-inventor" on a line, then that first column will need to be repeated, once for each extra "applicant-inventor". This would require Perl, Python, Ruby or awk, I think. (Someone might be able to do it all with sed, but I wouldn't want to maintain it.

Anyhow, here is a start of how I might parse it in Perl:

Code:

#!/usr/bin/env perl
use strict;
use warnings;

open my $fh, '<', 'data.txt'
  or die "Can't open [data.txt] for reading: $!";

my $headers = <$fh>;
my %patents_inventors;

while (my $line = <$fh>) {
  chomp $line;
  my @columns = split /\t+/, $line;
  my $key = shift @columns;
  $patents_inventors{$key} = [ @columns ];
}

use Data::Dumper;
print Dumper \%patents_inventors;

Edit - @Doug23: My answer here doesn't make one thing clear. It's trivial to reduce multiple tabs to one tab and to replace tabs with commas. But what you want is more than that. It's still relatively easy to do, but it's not a trivial one-liner anymore. You need some kind of programming logic to deal with lines that have more than one inventor (so that they get repeated in the way you want). Since you have to maintain it, do you know a scripting language? The best thing will be to work with a language you're comfortable with.

You're right - the difficult part is the logic of copying the first column to each of the multiple rows created if the row has more than one set of entries. Then it is a trivial replacement of 3 tabs with 1 tab, and "~" with 1 tab.

I will try the Perl script very shortly... In terms of maintenance, would this be difficult to convert to Python (which I am more familiar with)?

Thanks!
Doug

Telemachos · 08-11-2009, 09:29 AM

Quote:

Originally Posted by doug23

I will try the Perl script very shortly... In terms of maintenance, would this be difficult to convert to Python (which I am more familiar with)?

It shouldn't be much trouble at all to convert: assuming you know both Perl and Python reasonably well. Unfortunately, I know Python not at all, so I'm no help.

I will repost the key part of the code with more comments to help you translate it.

Code:

# Open the file for reading
open my $fh, '<', 'data.txt'
  or die "Can't open [data.txt] for reading: $!";

# Read in the first line only to remove/capture the headers
my $headers = <$fh>;

# Create an empty hash to store the rows later
my %patents_inventors;

# Begin reading the file line by line
while (my $line = <$fh>) {

  # remove the newline from the end of the line
  chomp $line;

  # split the line on tabs and put each item into the @columns array
  my @columns = split /\t+/, $line;

  # remove the first column from the @columns array; this will be our
  # record header for all the inventors
  my $key = shift @columns;

  # use the first column as the hash key; make the remaining columns 
  # an anonymous array and put that into the hash as the value for this key
  $patents_inventors{$key} = [ @columns ];
}

use Data::Dumper;
print Dumper \%patents_inventors;

So far this just creates a structure with the data. You would then need to write further application logic to go through the %patents_inventors array and write the data out to a csv file. The logic would be this: for each key in the hash (entry in the dictionary for Python?), create a line for each item in the array that is the value of that key. If there is only one inventor, you get one line. If there are three, you get three. It should be reasonably simple to implement, but again, I don't know Python's syntax.

To give you a sense of the structure this produces, here's the output of Data:

umper from the script so far:

Code:

$VAR1 = {
          'D0500398' => [
                          'applicant-inventor~us-only~Dal Magro~Volney~Farroupilha~~BR~BR~BR',
                          'applicant-inventor~us-only~Matsuo~Edson~Porto Alegre~~BR~BR~BR',
                          'applicant-inventor~us-only~Canei~Ademir~Farroupilha~~BR~BR~BR'
                        ],
          'D0500399' => [
                          'applicant-inventor~us-only~Fuerst~Rory W.~Atherton~CA~US~omitted~US'
                        ],
          'D0500397' => [
                          'applicant-inventor~us-only~Dal Magro~Volney~Farroupilha~~BR~BR~BR',
                          'applicant-inventor~us-only~Matsuo~Edson~Porto Alegre~~BR~BR~BR',
                          'applicant-inventor~us-only~Canei~Ademir~Farroupilha~~BR~BR~BR'
                        ],
          'D0500396' => [
                          'applicant-inventor~us-only~Akagi~Shigeki~Kounu-gun, Hiroshima, 729-3431~~JP~JP~JP'
                        ]
        };

Each initial field is a key, the inventors are items (one or more) in arrays. (It's a hash of arrays.)

Wim Sturkenboom · 08-11-2009, 10:15 AM

Sorry,missed that completely

ghostdog74 · 08-11-2009, 10:31 AM

@OP, if you know Python, then why are you not using it?

Code:

#!/usr/bin/env python
f=open("file")
print f.readline().strip()
for line in f:
    line = line.strip()
    s = line.split("\t")
    for item in s[1:]:
        if "applicant" in item:
            item = item.split("~")
            print s[0],'\t'.join(item[:2]),'~'.join(item[2:])
f.close()

output

Code:

# ./python.py
PatNum  ApplicantType~ApplicantDesignation~ApplicantLastName~ApplicantFirstName~ApplicantCity~ApplicantState~ApplicantCountry~ApplicantNationality~ApplicantResidence
D0500396 applicant-inventor     us-only Akagi~Shigeki~Kounu-gun, Hiroshima, 729-3431~~JP~JP~JP
D0500397 applicant-inventor     us-only Dal Magro~Volney~Farroupilha~~BR~BR~BR
D0500397 applicant-inventor     us-only Matsuo~Edson~Porto Alegre~~BR~BR~BR
D0500397 applicant-inventor     us-only Canei~Ademir~Farroupilha~~BR~BR~BR
D0500398 applicant-inventor     us-only Dal Magro~Volney~Farroupilha~~BR~BR~BR
D0500398 applicant-inventor     us-only Matsuo~Edson~Porto Alegre~~BR~BR~BR
D0500398 applicant-inventor     us-only Canei~Ademir~Farroupilha~~BR~BR~BR
D0500399 applicant-inventor     us-only Fuerst~Rory W.~Atherton~CA~US~omitted~US

Telemachos · 08-11-2009, 10:37 AM

Quote:

Originally Posted by ghostdog74

@OP, if you know Python, then why are you not using it?

I was going to mention your name as the resident Python expert, but I figured you would show up soon enough anyhow.

You seem to sense it if people are talking about Python.

ghostdog74 · 08-11-2009, 11:36 AM

Quote:

Originally Posted by Telemachos

I was going to mention your name as the resident Python expert, but I figured you would show up soon enough anyhow.

no, i am definitely not an expert. i only know enough for my job

, mainly sysadmin.

Quote:

You seem to sense it if people are talking about Python.

that's only because OP mentioned it in his post.

doug23 · 08-11-2009, 11:38 AM

Quote:

Originally Posted by ghostdog74

@OP, if you know Python, then why are you not using it?

Code:

#!/usr/bin/env python
f=open("file")
print f.readline().strip()
for line in f:
    line = line.strip()
    s = line.split("\t")
    for item in s[1:]:
        if "applicant" in item:
            item = item.split("~")
            print s[0],'\t'.join(item[:2]),'~'.join(item[2:])
f.close()

output

Code:

# ./python.py
PatNum  ApplicantType~ApplicantDesignation~ApplicantLastName~ApplicantFirstName~ApplicantCity~ApplicantState~ApplicantCountry~ApplicantNationality~ApplicantResidence
D0500396 applicant-inventor     us-only Akagi~Shigeki~Kounu-gun, Hiroshima, 729-3431~~JP~JP~JP
D0500397 applicant-inventor     us-only Dal Magro~Volney~Farroupilha~~BR~BR~BR
D0500397 applicant-inventor     us-only Matsuo~Edson~Porto Alegre~~BR~BR~BR
D0500397 applicant-inventor     us-only Canei~Ademir~Farroupilha~~BR~BR~BR
D0500398 applicant-inventor     us-only Dal Magro~Volney~Farroupilha~~BR~BR~BR
D0500398 applicant-inventor     us-only Matsuo~Edson~Porto Alegre~~BR~BR~BR
D0500398 applicant-inventor     us-only Canei~Ademir~Farroupilha~~BR~BR~BR
D0500399 applicant-inventor     us-only Fuerst~Rory W.~Atherton~CA~US~omitted~US

Thanks!

Can you please help me make the code more generic, as not all the second columns will start with "applicant."

What are the changes I would need to simply have the first 8 characters repeated if there are more than 1 of the column entries? We can use the fact that each of these columns are proceeded by 3 tabs.

Additionally, how would I iterate over the entire row to replace every "~" with tabs?

Best,
Doug

ghostdog74 · 08-11-2009, 11:52 AM

Quote:

Originally Posted by doug23

Thanks!

Can you please help me make the code more generic, as not all the second columns will start with "applicant."

put in print statements after the split() on tabs, and see the output of the list "s". you can see some elements are nulls. you can check for items that are non nulls by using != "".

Quote:

What are the changes I would need to simply have the first 8 characters repeated if there are more than 1 of the column entries? We can use the fact that each of these columns are proceeded by 3 tabs.

don't understand. show examples

Quote:

Additionally, how would I iterate over the entire row to replace every "~" with tabs?

Best,
Doug

you can use mystring.replace().

I thought you mentioned you are familiar with Python, so why not try start coding ...?

doug23 · 08-11-2009, 01:30 PM

Quote:

don't understand. show examples

All I meant was not requiring the first word of the second column to start with "applicant" and instead to just be a word proceeded by three tabs.

Quote:

I thought you mentioned you are familiar with Python, so why not try start coding ...?

I have never tried processing text files in Python before - most have been simple user input to append to a file.

ghostdog74 · 08-11-2009, 07:51 PM

Quote:

Originally Posted by doug23

All I meant was not requiring the first word of the second column to start with "applicant" and instead to just be a word proceeded by three tabs.

since the rest of your text involves only "applicant", you can check for non null string in the list "s".