LinuxQuestions.org
Visit the LQ Articles and Editorials section
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices



Reply
 
Search this Thread
Old 08-10-2009, 07:46 PM   #1
doug23
LQ Newbie
 
Registered: Aug 2009
Posts: 18

Rep: Reputation: 0
convert columns to rows (tab separated file to csv)


Hi All,

I bet this is a Perl one-liner (or very simple python script).

I have a tab separated files in which each row looks like:

Unique_Eight_Character_Sequence [3 tabs] data1~moredata1~moredata1 [3 tabs] data2~moredata2~moredata2 ... dataN~....

The output file should have each column converted into a row (with the unique character sequence copied in for the first column), and then each "~" replaced by a comma.

e.g.

Unique_Eight_Character_Sequence1,data1,moredata1,moredata1
Unique_Eight_Character_Sequence1,data2,moredata2,moredata2
Unique_Eight_Character_Sequence2,data1,moredata1,moredata1
...


Thanks for your help!
Doug
 
Old 08-10-2009, 08:52 PM   #2
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
Can you please post a sample of the actual data (or a mocked-up version of it)? Put it in [code][/code] tags and please show where the line breaks are (if any).

I can't quite see what the data looks like from your description.

To clarify the two examples you gave don't seem to me to match up in any obvious way. I can't see how you get from 1 to 2 given your description.
  1. Unique_Eight_Character_Sequence [3 tabs] data1~moredata1~moredata1 [3 tabs] data2~moredata2~moredata2 ... dataN~...
  2. Unique_Eight_Character_Sequence1,data1,moredata1,moredata1
    Unique_Eight_Character_Sequence1,data2,moredata2,moredata2
    Unique_Eight_Character_Sequence2,data1,moredata1,moredata1
There is no Unique_Eight_Character_Sequence2 in the first example, is there? Why do you repeat Unique_Eight_Character_Sequence1 twice at the head of two different rows? Why are there a bunch of data2 columns in the second row with a Sequence1 first column, but then a bunch of data1 rows in the third line with a Sequence2 first row!?!?

Last edited by Telemachos; 08-10-2009 at 09:08 PM.
 
Old 08-10-2009, 10:48 PM   #3
doug23
LQ Newbie
 
Registered: Aug 2009
Posts: 18

Original Poster
Rep: Reputation: 0
Sure, no problem (this isn't proprietary). Also, turns out it'll have to remain tab-separated because some of the fields already include commas in some of the values.

Code:
PatNum	ApplicantType~ApplicantDesignation~ApplicantLastName~ApplicantFirstName~ApplicantCity~ApplicantState~ApplicantCountry~ApplicantNationality~ApplicantResidence
D0500396			applicant-inventor~us-only~Akagi~Shigeki~Kounu-gun, Hiroshima, 729-3431~~JP~JP~JP
D0500397			applicant-inventor~us-only~Dal Magro~Volney~Farroupilha~~BR~BR~BR			applicant-inventor~us-only~Matsuo~Edson~Porto Alegre~~BR~BR~BR			applicant-inventor~us-only~Canei~Ademir~Farroupilha~~BR~BR~BR
D0500398			applicant-inventor~us-only~Dal Magro~Volney~Farroupilha~~BR~BR~BR			applicant-inventor~us-only~Matsuo~Edson~Porto Alegre~~BR~BR~BR			applicant-inventor~us-only~Canei~Ademir~Farroupilha~~BR~BR~BR
D0500399			applicant-inventor~us-only~Fuerst~Rory W.~Atherton~CA~US~omitted~US
Then, the output should look like

Code:
PatNum	ApplicantType~ApplicantDesignation~ApplicantLastName~ApplicantFirstName~ApplicantCity~ApplicantState~ApplicantCountry~ApplicantNationality~ApplicantResidence
D0500396	applicant-inventor	us-only	Akagi~Shigeki~Kounu-gun, Hiroshima, 729-3431~~JP~JP~JP
D0500397	applicant-inventor	us-only	Dal Magro~Volney~Farroupilha~~BR~BR~BR
D0500397	applicant-inventor	us-only	Matsuo~Edson~Porto Alegre~~BR~BR~BR
D0500397	applicant-inventor	us-only	Canei~Ademir~Farroupilha~~BR~BR~BR
D0500398	applicant-inventor	us-only	Dal Magro~Volney~Farroupilha~~BR~BR~BR
D00500398	applicant-inventor	us-only	Matsuo~Edson~Porto Alegre~~BR~BR~BR
D00500398	applicant-inventor	us-only	Canei~Ademir~Farroupilha~~BR~BR~BR
D0500399	applicant-inventor	us-only	Fuerst~Rory W.~Atherton~CA~US~omitted~US
and so on with all the "~" replaced with tabs

Last edited by doug23; 08-10-2009 at 10:50 PM.
 
Old 08-11-2009, 01:20 AM   #4
Wim Sturkenboom
Senior Member
 
Registered: Jan 2005
Location: Roodepoort, South Africa
Distribution: Slackware 10.1/10.2/12, Ubuntu 12.04, Crunchbang Statler
Posts: 3,786

Rep: Reputation: 282Reputation: 282Reputation: 282
Code:
sed -e 's/\t\t\t/\t/g' yourfile
sed -e 's/~/\t/g' yourfile
should be a start
The first one replaces three tabs by one tab and the second one replaces ~ by tab. You can pipe the result of the first one into the second one and redirect the output into a new file

Code:
sed -e 's/\t\t\t/\t/g' yourfile | sed -e 's/~/\t/g' > result.tab
Note
The csv spec supports commas in data fields.
Code:
wim, "myaddress, mytown"
is a record with 2 fields, the first one containing wim and the second one containing the address info (with a comma)
It will be more complicated to create this using command line tools, but a number of languages support csv and will be able to do it.

Last edited by Wim Sturkenboom; 08-11-2009 at 01:22 AM.
 
Old 08-11-2009, 07:47 AM   #5
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
@Wim: take a look again at the OP's data and output in his second post. There needs to be some kind of programming logic (beyond just substitutions using sed) since some patents have multiple inventors. Those that do must be entered as a first column once for each inventor. (See D0500397 and D0500398.) Also, the OP wants "applicant-inventor" made into its own column.

If I'm reading the data->output conversions correctly, then the logic seems to be that if there is more than one "applicant-inventor" on a line, then that first column will need to be repeated, once for each extra "applicant-inventor". This would require Perl, Python, Ruby or awk, I think. (Someone might be able to do it all with sed, but I wouldn't want to maintain it.

Anyhow, here is a start of how I might parse it in Perl:

Code:
#!/usr/bin/env perl
use strict;
use warnings;

open my $fh, '<', 'data.txt'
  or die "Can't open [data.txt] for reading: $!";

my $headers = <$fh>;
my %patents_inventors;

while (my $line = <$fh>) {
  chomp $line;
  my @columns = split /\t+/, $line;
  my $key = shift @columns;
  $patents_inventors{$key} = [ @columns ];
}

use Data::Dumper;
print Dumper \%patents_inventors;
Edit - @Doug23: My answer here doesn't make one thing clear. It's trivial to reduce multiple tabs to one tab and to replace tabs with commas. But what you want is more than that. It's still relatively easy to do, but it's not a trivial one-liner anymore. You need some kind of programming logic to deal with lines that have more than one inventor (so that they get repeated in the way you want). Since you have to maintain it, do you know a scripting language? The best thing will be to work with a language you're comfortable with.

Last edited by Telemachos; 08-11-2009 at 09:23 AM.
 
Old 08-11-2009, 10:13 AM   #6
doug23
LQ Newbie
 
Registered: Aug 2009
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by Telemachos View Post
@Wim: take a look again at the OP's data and output in his second post. There needs to be some kind of programming logic (beyond just substitutions using sed) since some patents have multiple inventors. Those that do must be entered as a first column once for each inventor. (See D0500397 and D0500398.) Also, the OP wants "applicant-inventor" made into its own column.

If I'm reading the data->output conversions correctly, then the logic seems to be that if there is more than one "applicant-inventor" on a line, then that first column will need to be repeated, once for each extra "applicant-inventor". This would require Perl, Python, Ruby or awk, I think. (Someone might be able to do it all with sed, but I wouldn't want to maintain it.

Anyhow, here is a start of how I might parse it in Perl:

Code:
#!/usr/bin/env perl
use strict;
use warnings;

open my $fh, '<', 'data.txt'
  or die "Can't open [data.txt] for reading: $!";

my $headers = <$fh>;
my %patents_inventors;

while (my $line = <$fh>) {
  chomp $line;
  my @columns = split /\t+/, $line;
  my $key = shift @columns;
  $patents_inventors{$key} = [ @columns ];
}

use Data::Dumper;
print Dumper \%patents_inventors;
Edit - @Doug23: My answer here doesn't make one thing clear. It's trivial to reduce multiple tabs to one tab and to replace tabs with commas. But what you want is more than that. It's still relatively easy to do, but it's not a trivial one-liner anymore. You need some kind of programming logic to deal with lines that have more than one inventor (so that they get repeated in the way you want). Since you have to maintain it, do you know a scripting language? The best thing will be to work with a language you're comfortable with.
You're right - the difficult part is the logic of copying the first column to each of the multiple rows created if the row has more than one set of entries. Then it is a trivial replacement of 3 tabs with 1 tab, and "~" with 1 tab.

I will try the Perl script very shortly... In terms of maintenance, would this be difficult to convert to Python (which I am more familiar with)?

Thanks!
Doug
 
Old 08-11-2009, 10:29 AM   #7
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
Quote:
Originally Posted by doug23 View Post
I will try the Perl script very shortly... In terms of maintenance, would this be difficult to convert to Python (which I am more familiar with)?
It shouldn't be much trouble at all to convert: assuming you know both Perl and Python reasonably well. Unfortunately, I know Python not at all, so I'm no help.

I will repost the key part of the code with more comments to help you translate it.

Code:
# Open the file for reading
open my $fh, '<', 'data.txt'
  or die "Can't open [data.txt] for reading: $!";

# Read in the first line only to remove/capture the headers
my $headers = <$fh>;

# Create an empty hash to store the rows later
my %patents_inventors;

# Begin reading the file line by line
while (my $line = <$fh>) {

  # remove the newline from the end of the line
  chomp $line;

  # split the line on tabs and put each item into the @columns array
  my @columns = split /\t+/, $line;

  # remove the first column from the @columns array; this will be our
  # record header for all the inventors
  my $key = shift @columns;

  # use the first column as the hash key; make the remaining columns 
  # an anonymous array and put that into the hash as the value for this key
  $patents_inventors{$key} = [ @columns ];
}

use Data::Dumper;
print Dumper \%patents_inventors;
So far this just creates a structure with the data. You would then need to write further application logic to go through the %patents_inventors array and write the data out to a csv file. The logic would be this: for each key in the hash (entry in the dictionary for Python?), create a line for each item in the array that is the value of that key. If there is only one inventor, you get one line. If there are three, you get three. It should be reasonably simple to implement, but again, I don't know Python's syntax.

To give you a sense of the structure this produces, here's the output of Data:umper from the script so far:

Code:
$VAR1 = {
          'D0500398' => [
                          'applicant-inventor~us-only~Dal Magro~Volney~Farroupilha~~BR~BR~BR',
                          'applicant-inventor~us-only~Matsuo~Edson~Porto Alegre~~BR~BR~BR',
                          'applicant-inventor~us-only~Canei~Ademir~Farroupilha~~BR~BR~BR'
                        ],
          'D0500399' => [
                          'applicant-inventor~us-only~Fuerst~Rory W.~Atherton~CA~US~omitted~US'
                        ],
          'D0500397' => [
                          'applicant-inventor~us-only~Dal Magro~Volney~Farroupilha~~BR~BR~BR',
                          'applicant-inventor~us-only~Matsuo~Edson~Porto Alegre~~BR~BR~BR',
                          'applicant-inventor~us-only~Canei~Ademir~Farroupilha~~BR~BR~BR'
                        ],
          'D0500396' => [
                          'applicant-inventor~us-only~Akagi~Shigeki~Kounu-gun, Hiroshima, 729-3431~~JP~JP~JP'
                        ]
        };
Each initial field is a key, the inventors are items (one or more) in arrays. (It's a hash of arrays.)

Last edited by Telemachos; 08-11-2009 at 10:44 AM.
 
Old 08-11-2009, 11:15 AM   #8
Wim Sturkenboom
Senior Member
 
Registered: Jan 2005
Location: Roodepoort, South Africa
Distribution: Slackware 10.1/10.2/12, Ubuntu 12.04, Crunchbang Statler
Posts: 3,786

Rep: Reputation: 282Reputation: 282Reputation: 282
Sorry,missed that completely

Last edited by Wim Sturkenboom; 08-11-2009 at 11:18 AM.
 
Old 08-11-2009, 11:31 AM   #9
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241Reputation: 241Reputation: 241
@OP, if you know Python, then why are you not using it?

Code:
#!/usr/bin/env python
f=open("file")
print f.readline().strip()
for line in f:
    line = line.strip()
    s = line.split("\t")
    for item in s[1:]:
        if "applicant" in item:
            item = item.split("~")
            print s[0],'\t'.join(item[:2]),'~'.join(item[2:])
f.close()
output
Code:
# ./python.py
PatNum  ApplicantType~ApplicantDesignation~ApplicantLastName~ApplicantFirstName~ApplicantCity~ApplicantState~ApplicantCountry~ApplicantNationality~ApplicantResidence
D0500396 applicant-inventor     us-only Akagi~Shigeki~Kounu-gun, Hiroshima, 729-3431~~JP~JP~JP
D0500397 applicant-inventor     us-only Dal Magro~Volney~Farroupilha~~BR~BR~BR
D0500397 applicant-inventor     us-only Matsuo~Edson~Porto Alegre~~BR~BR~BR
D0500397 applicant-inventor     us-only Canei~Ademir~Farroupilha~~BR~BR~BR
D0500398 applicant-inventor     us-only Dal Magro~Volney~Farroupilha~~BR~BR~BR
D0500398 applicant-inventor     us-only Matsuo~Edson~Porto Alegre~~BR~BR~BR
D0500398 applicant-inventor     us-only Canei~Ademir~Farroupilha~~BR~BR~BR
D0500399 applicant-inventor     us-only Fuerst~Rory W.~Atherton~CA~US~omitted~US
 
Old 08-11-2009, 11:37 AM   #10
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
Quote:
Originally Posted by ghostdog74 View Post
@OP, if you know Python, then why are you not using it?
I was going to mention your name as the resident Python expert, but I figured you would show up soon enough anyhow.

You seem to sense it if people are talking about Python.
 
Old 08-11-2009, 12:36 PM   #11
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241Reputation: 241Reputation: 241
Quote:
Originally Posted by Telemachos View Post
I was going to mention your name as the resident Python expert, but I figured you would show up soon enough anyhow.
no, i am definitely not an expert. i only know enough for my job , mainly sysadmin.

Quote:
You seem to sense it if people are talking about Python.
that's only because OP mentioned it in his post.
 
Old 08-11-2009, 12:38 PM   #12
doug23
LQ Newbie
 
Registered: Aug 2009
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by ghostdog74 View Post
@OP, if you know Python, then why are you not using it?

Code:
#!/usr/bin/env python
f=open("file")
print f.readline().strip()
for line in f:
    line = line.strip()
    s = line.split("\t")
    for item in s[1:]:
        if "applicant" in item:
            item = item.split("~")
            print s[0],'\t'.join(item[:2]),'~'.join(item[2:])
f.close()
output
Code:
# ./python.py
PatNum  ApplicantType~ApplicantDesignation~ApplicantLastName~ApplicantFirstName~ApplicantCity~ApplicantState~ApplicantCountry~ApplicantNationality~ApplicantResidence
D0500396 applicant-inventor     us-only Akagi~Shigeki~Kounu-gun, Hiroshima, 729-3431~~JP~JP~JP
D0500397 applicant-inventor     us-only Dal Magro~Volney~Farroupilha~~BR~BR~BR
D0500397 applicant-inventor     us-only Matsuo~Edson~Porto Alegre~~BR~BR~BR
D0500397 applicant-inventor     us-only Canei~Ademir~Farroupilha~~BR~BR~BR
D0500398 applicant-inventor     us-only Dal Magro~Volney~Farroupilha~~BR~BR~BR
D0500398 applicant-inventor     us-only Matsuo~Edson~Porto Alegre~~BR~BR~BR
D0500398 applicant-inventor     us-only Canei~Ademir~Farroupilha~~BR~BR~BR
D0500399 applicant-inventor     us-only Fuerst~Rory W.~Atherton~CA~US~omitted~US
Thanks!

Can you please help me make the code more generic, as not all the second columns will start with "applicant."

What are the changes I would need to simply have the first 8 characters repeated if there are more than 1 of the column entries? We can use the fact that each of these columns are proceeded by 3 tabs.

Additionally, how would I iterate over the entire row to replace every "~" with tabs?

Best,
Doug
 
Old 08-11-2009, 12:52 PM   #13
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241Reputation: 241Reputation: 241
Quote:
Originally Posted by doug23 View Post
Thanks!

Can you please help me make the code more generic, as not all the second columns will start with "applicant."
put in print statements after the split() on tabs, and see the output of the list "s". you can see some elements are nulls. you can check for items that are non nulls by using != "".

Quote:
What are the changes I would need to simply have the first 8 characters repeated if there are more than 1 of the column entries? We can use the fact that each of these columns are proceeded by 3 tabs.
don't understand. show examples

Quote:
Additionally, how would I iterate over the entire row to replace every "~" with tabs?

Best,
Doug
you can use mystring.replace().

I thought you mentioned you are familiar with Python, so why not try start coding ...?
 
Old 08-11-2009, 02:30 PM   #14
doug23
LQ Newbie
 
Registered: Aug 2009
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
don't understand. show examples
All I meant was not requiring the first word of the second column to start with "applicant" and instead to just be a word proceeded by three tabs.

Quote:
I thought you mentioned you are familiar with Python, so why not try start coding ...?
I have never tried processing text files in Python before - most have been simple user input to append to a file.
 
Old 08-11-2009, 08:51 PM   #15
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241Reputation: 241Reputation: 241
Quote:
Originally Posted by doug23 View Post
All I meant was not requiring the first word of the second column to start with "applicant" and instead to just be a word proceeded by three tabs.
since the rest of your text involves only "applicant", you can check for non null string in the list "s".
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Script to convert logs columns to rows fono Linux - Software 10 05-19-2009 09:29 PM
Parsing a comma separated CSV file where fields have commas in to trickyflash Linux - General 7 03-26-2009 04:30 PM
[SOLVED] Convert CSV to Tab imkornhulio Programming 7 11-15-2008 08:20 PM
Convert into CSV file say_hi_ravi Programming 4 07-17-2008 04:25 AM
rows and columns digitalgravy Linux - General 2 03-16-2004 07:47 PM


All times are GMT -5. The time now is 06:32 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration