convert columns to rows (tab separated file to csv)
ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
The output file should have each column converted into a row (with the unique character sequence copied in for the first column), and then each "~" replaced by a comma.
Can you please post a sample of the actual data (or a mocked-up version of it)? Put it in [code][/code] tags and please show where the line breaks are (if any).
I can't quite see what the data looks like from your description.
To clarify the two examples you gave don't seem to me to match up in any obvious way. I can't see how you get from 1 to 2 given your description.
There is no Unique_Eight_Character_Sequence2 in the first example, is there? Why do you repeat Unique_Eight_Character_Sequence1 twice at the head of two different rows? Why are there a bunch of data2 columns in the second row with a Sequence1 first column, but then a bunch of data1 rows in the third line with a Sequence2 first row!?!?
Last edited by Telemachos; 08-10-2009 at 08:08 PM.
Sure, no problem (this isn't proprietary). Also, turns out it'll have to remain tab-separated because some of the fields already include commas in some of the values.
sed -e 's/\t\t\t/\t/g' yourfile
sed -e 's/~/\t/g' yourfile
should be a start
The first one replaces three tabs by one tab and the second one replaces ~ by tab. You can pipe the result of the first one into the second one and redirect the output into a new file
Code:
sed -e 's/\t\t\t/\t/g' yourfile | sed -e 's/~/\t/g' > result.tab
Note
The csv spec supports commas in data fields.
Code:
wim, "myaddress, mytown"
is a record with 2 fields, the first one containing wim and the second one containing the address info (with a comma)
It will be more complicated to create this using command line tools, but a number of languages support csv and will be able to do it.
Last edited by Wim Sturkenboom; 08-11-2009 at 12:22 AM.
@Wim: take a look again at the OP's data and output in his second post. There needs to be some kind of programming logic (beyond just substitutions using sed) since some patents have multiple inventors. Those that do must be entered as a first column once for each inventor. (See D0500397 and D0500398.) Also, the OP wants "applicant-inventor" made into its own column.
If I'm reading the data->output conversions correctly, then the logic seems to be that if there is more than one "applicant-inventor" on a line, then that first column will need to be repeated, once for each extra "applicant-inventor". This would require Perl, Python, Ruby or awk, I think. (Someone might be able to do it all with sed, but I wouldn't want to maintain it.
Anyhow, here is a start of how I might parse it in Perl:
Code:
#!/usr/bin/env perl
use strict;
use warnings;
open my $fh, '<', 'data.txt'
or die "Can't open [data.txt] for reading: $!";
my $headers = <$fh>;
my %patents_inventors;
while (my $line = <$fh>) {
chomp $line;
my @columns = split /\t+/, $line;
my $key = shift @columns;
$patents_inventors{$key} = [ @columns ];
}
use Data::Dumper;
print Dumper \%patents_inventors;
Edit - @Doug23: My answer here doesn't make one thing clear. It's trivial to reduce multiple tabs to one tab and to replace tabs with commas. But what you want is more than that. It's still relatively easy to do, but it's not a trivial one-liner anymore. You need some kind of programming logic to deal with lines that have more than one inventor (so that they get repeated in the way you want). Since you have to maintain it, do you know a scripting language? The best thing will be to work with a language you're comfortable with.
Last edited by Telemachos; 08-11-2009 at 08:23 AM.
@Wim: take a look again at the OP's data and output in his second post. There needs to be some kind of programming logic (beyond just substitutions using sed) since some patents have multiple inventors. Those that do must be entered as a first column once for each inventor. (See D0500397 and D0500398.) Also, the OP wants "applicant-inventor" made into its own column.
If I'm reading the data->output conversions correctly, then the logic seems to be that if there is more than one "applicant-inventor" on a line, then that first column will need to be repeated, once for each extra "applicant-inventor". This would require Perl, Python, Ruby or awk, I think. (Someone might be able to do it all with sed, but I wouldn't want to maintain it.
Anyhow, here is a start of how I might parse it in Perl:
Code:
#!/usr/bin/env perl
use strict;
use warnings;
open my $fh, '<', 'data.txt'
or die "Can't open [data.txt] for reading: $!";
my $headers = <$fh>;
my %patents_inventors;
while (my $line = <$fh>) {
chomp $line;
my @columns = split /\t+/, $line;
my $key = shift @columns;
$patents_inventors{$key} = [ @columns ];
}
use Data::Dumper;
print Dumper \%patents_inventors;
Edit - @Doug23: My answer here doesn't make one thing clear. It's trivial to reduce multiple tabs to one tab and to replace tabs with commas. But what you want is more than that. It's still relatively easy to do, but it's not a trivial one-liner anymore. You need some kind of programming logic to deal with lines that have more than one inventor (so that they get repeated in the way you want). Since you have to maintain it, do you know a scripting language? The best thing will be to work with a language you're comfortable with.
You're right - the difficult part is the logic of copying the first column to each of the multiple rows created if the row has more than one set of entries. Then it is a trivial replacement of 3 tabs with 1 tab, and "~" with 1 tab.
I will try the Perl script very shortly... In terms of maintenance, would this be difficult to convert to Python (which I am more familiar with)?
I will try the Perl script very shortly... In terms of maintenance, would this be difficult to convert to Python (which I am more familiar with)?
It shouldn't be much trouble at all to convert: assuming you know both Perl and Python reasonably well. Unfortunately, I know Python not at all, so I'm no help.
I will repost the key part of the code with more comments to help you translate it.
Code:
# Open the file for reading
open my $fh, '<', 'data.txt'
or die "Can't open [data.txt] for reading: $!";
# Read in the first line only to remove/capture the headers
my $headers = <$fh>;
# Create an empty hash to store the rows later
my %patents_inventors;
# Begin reading the file line by line
while (my $line = <$fh>) {
# remove the newline from the end of the line
chomp $line;
# split the line on tabs and put each item into the @columns array
my @columns = split /\t+/, $line;
# remove the first column from the @columns array; this will be our
# record header for all the inventors
my $key = shift @columns;
# use the first column as the hash key; make the remaining columns
# an anonymous array and put that into the hash as the value for this key
$patents_inventors{$key} = [ @columns ];
}
use Data::Dumper;
print Dumper \%patents_inventors;
So far this just creates a structure with the data. You would then need to write further application logic to go through the %patents_inventors array and write the data out to a csv file. The logic would be this: for each key in the hash (entry in the dictionary for Python?), create a line for each item in the array that is the value of that key. If there is only one inventor, you get one line. If there are three, you get three. It should be reasonably simple to implement, but again, I don't know Python's syntax.
To give you a sense of the structure this produces, here's the output of Data:umper from the script so far:
@OP, if you know Python, then why are you not using it?
Code:
#!/usr/bin/env python
f=open("file")
print f.readline().strip()
for line in f:
line = line.strip()
s = line.split("\t")
for item in s[1:]:
if "applicant" in item:
item = item.split("~")
print s[0],'\t'.join(item[:2]),'~'.join(item[2:])
f.close()
@OP, if you know Python, then why are you not using it?
Code:
#!/usr/bin/env python
f=open("file")
print f.readline().strip()
for line in f:
line = line.strip()
s = line.split("\t")
for item in s[1:]:
if "applicant" in item:
item = item.split("~")
print s[0],'\t'.join(item[:2]),'~'.join(item[2:])
f.close()
Can you please help me make the code more generic, as not all the second columns will start with "applicant."
What are the changes I would need to simply have the first 8 characters repeated if there are more than 1 of the column entries? We can use the fact that each of these columns are proceeded by 3 tabs.
Additionally, how would I iterate over the entire row to replace every "~" with tabs?
Can you please help me make the code more generic, as not all the second columns will start with "applicant."
put in print statements after the split() on tabs, and see the output of the list "s". you can see some elements are nulls. you can check for items that are non nulls by using != "".
Quote:
What are the changes I would need to simply have the first 8 characters repeated if there are more than 1 of the column entries? We can use the fact that each of these columns are proceeded by 3 tabs.
don't understand. show examples
Quote:
Additionally, how would I iterate over the entire row to replace every "~" with tabs?
Best,
Doug
you can use mystring.replace().
I thought you mentioned you are familiar with Python, so why not try start coding ...?
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.