LinuxQuestions.org
Old 10-28-2010, 11:35 AM   #1
zeratul111
LQ Newbie
 
Registered: Sep 2010
Posts: 19

Rep: Reputation: Disabled
Parsing text and combining the parsed text


Hi everyone,

I thought I had a handle on parsing text, but life has thrown me another scenario that I'm stuck on, and I hope one of you can help me.


EDIT1:
I should note that I was trying to do this in Perl; I'm not sure whether other alternatives are simpler.

EDIT2: I should note that text file 3 (reference) is a long list of MANY cnp_id values and their corresponding chr, start, and end values. So, the code will have to take the cnp_id from text file 1 and/or 2, search through text file 3 (reference) for the matching cnp_id, and then add the corresponding chr, start, and end values to the relevant line in the output.

EDIT3:
Sorry, I should mention that the text file entries are all tab-delimited.


I have 3 text files:

File 1:
Columns represent sample IDs (sample_id) and rows represent CNP IDs (cnp_id). Each cell holds the confidence level (confidence) for that sample and CNP.
Quote:
cnp_id P5E6_SNP6.0_JHP5_010408.CEL P5E11reh_SNP6.0_JHP5_011808.CEL P7C7_SNP6.0_JHP7_021208.CEL ... etc.
CNP10 0.004479834 0.002792951 0.00305613
CNP10010 0.058293636 0.045141521 0.021287972
CNP10026 0.026819976 0.03868063 0.055412474
...
etc.
File 2:
Columns represent sample IDs (sample_id) and rows represent CNP IDs (cnp_id). Each cell holds the copy number status (copynumber) for that sample and CNP.
Quote:
cnp_id P5E6_SNP6.0_JHP5_010408.CEL P5E11reh_SNP6.0_JHP5_011808.CEL P7C7_SNP6.0_JHP7_021208.CEL ... etc.
CNP10 2 2 2
CNP10010 2 2 2
CNP10026 2 2 2
...
etc.
File 3 (reference):
This is essentially the reference file for the CNP IDs: for each CNP ID (cnp_id), there are associated "chr", "start", and "end" values.
Quote:
cnp_id chr start end
CNP10 1 10293128 10300570
CNP2648 23 76053855 76057477
CNP2654 23 81283429 81298800
CNP2659 23 93309217 93316234
CNP2675 23 109825336 109826953
CNP10010 1 8128717 8131908
CNP10054 1 38414624 38417452
CNP10055 1 38901188 38908792
CNP10056 1 40738766 40742251
CNP10026 1 13081670 13249634
CNP10393 2 141971752 141973472
CNP10398 2 150739697 150745978
CNP10400 2 153378432 153505030
...
etc.
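The lookup this reference file supports can be sketched in Python as a dictionary keyed on cnp_id. This is only a minimal sketch, assuming tab-delimited input with a header row as described in the thread; `load_reference` is a hypothetical helper name, the inline rows stand in for the real file, and `CNP99999` is a made-up ID used to show the missing-key case.

```python
import csv
import io

# A few rows copied from the File 3 sample above, tab-delimited with a header.
REFERENCE_TEXT = (
    "cnp_id\tchr\tstart\tend\n"
    "CNP10\t1\t10293128\t10300570\n"
    "CNP2648\t23\t76053855\t76057477\n"
    "CNP10010\t1\t8128717\t8131908\n"
    "CNP10026\t1\t13081670\t13249634\n"
)


def load_reference(handle):
    """Return {cnp_id: (chr, start, end)} from a tab-delimited reference file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["cnp_id"]: (row["chr"], row["start"], row["end"]) for row in reader}


reference = load_reference(io.StringIO(REFERENCE_TEXT))

# One dictionary lookup per cnp_id replaces a scan through the whole file.
print(reference["CNP10010"])
# .get() returns None for an ID that is absent (CNP99999 is hypothetical).
print(reference.get("CNP99999"))
```

Loading the reference once and indexing it by cnp_id means the order of rows in file 3 does not matter.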
What I am trying to do is combine the three text files into one text file that looks like the following (the columns do not need to be in this exact order, but it would be helpful):
Quote:
sample_id cnp_id chr start end copynumber confidence
^So, for each line, I have the sample_id and the corresponding cnp_id (plus the associated chr, start, and end values for that cnp_id from the reference file), and the copynumber and confidence values from files 2 and 1, respectively. Please see my post #3 in this thread for an example output built from the sample data in text files 1, 2, and 3.

Note:
*"sample_id" is the long *.CEL file name, taken from the column headers of files 1 and 2.
*"cnp_id" is the row label (first column) of files 1 and 2, which maps to the chr, start, and end values in file 3 (reference).
*"confidence" is the cell value of file 1 corresponding to the respective sample and cnp_id.
*"copynumber" is the cell value of file 2 corresponding to the respective sample and cnp_id.
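The relationship in these notes (one output record per cell of the wide table) can be sketched in Python as follows. This is an illustration only, assuming tab-delimited input whose first header cell is the cnp_id label; `melt` is a hypothetical function name and the inline text reuses two rows of the File 1 sample.

```python
import io

# Two rows from the File 1 sample above (confidence), tab-delimited.
FILE1_TEXT = (
    "cnp_id\tP5E6_SNP6.0_JHP5_010408.CEL\tP5E11reh_SNP6.0_JHP5_011808.CEL"
    "\tP7C7_SNP6.0_JHP7_021208.CEL\n"
    "CNP10\t0.004479834\t0.002792951\t0.00305613\n"
    "CNP10010\t0.058293636\t0.045141521\t0.021287972\n"
)


def melt(handle):
    """Yield (sample_id, cnp_id, value) for every cell of a wide table."""
    header = handle.readline().rstrip("\n").split("\t")
    sample_ids = header[1:]  # the first header cell is the "cnp_id" label
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        cnp_id, values = fields[0], fields[1:]
        for sample_id, value in zip(sample_ids, values):
            yield sample_id, cnp_id, value


records = list(melt(io.StringIO(FILE1_TEXT)))
for record in records:
    print("\t".join(record))
```

The same function would apply unchanged to file 2, since both files share the layout.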


Thanks in advance for any pointers! I will continue to play around with what I have, which isn't much at the moment. I'm just trying to parse the files right and read them into an array, but kind of failing miserably... :/

Last edited by zeratul111; 10-28-2010 at 12:32 PM. Reason: made variable names a bit clearer in descriptions
 
Old 10-28-2010, 12:00 PM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2387
Hi,

Before I (and maybe others) start interpreting things incorrectly, could you give us an example of the desired output based on the example files given above?
 
Old 10-28-2010, 12:12 PM   #3
zeratul111
LQ Newbie
 
Registered: Sep 2010
Posts: 19

Original Poster
Rep: Reputation: Disabled
Hi druuna, below is what the output would look like for 3 cnp_id values and 3 sample_id values from the 3-by-3 "table" I have in the OP. (I changed the text file examples in my OP to be consistent with what I have below. Sorry for the cluttered output example). Thanks!

EDIT (from the OP): I should note that text file 3 (reference) is a long list of MANY cnp_id values and their corresponding chr, start, and end values. So, the code will have to take the cnp_id from text file 1 and/or 2, search through text file 3 (reference) for the matching cnp_id, and then add the corresponding chr, start, and end values to the relevant line in the output.

EDIT: Sorry, I should mention that the text file entries from the input files are all tab-delimited.


Quote:
sample_id cnp_id chr start end copynumber confidence
P5E6_SNP6.0_JHP5_010408.CEL CNP10 1 10293128 10300570 2 0.004479834
P5E6_SNP6.0_JHP5_010408.CEL CNP10010 1 8128717 8131908 2 0.058293636
P5E6_SNP6.0_JHP5_010408.CEL CNP10026 1 13081670 13249634 2 0.026819976
P5E11reh_SNP6.0_JHP5_011808.CEL CNP10 1 10293128 10300570 2 0.002792951
P5E11reh_SNP6.0_JHP5_011808.CEL CNP10010 1 8128717 8131908 2 0.045141521
P5E11reh_SNP6.0_JHP5_011808.CEL CNP10026 1 13081670 13249634 2 0.03868063
P7C7_SNP6.0_JHP7_021208.CEL CNP10 1 10293128 10300570 2 0.00305613
P7C7_SNP6.0_JHP7_021208.CEL CNP10010 1 8128717 8131908 2 0.021287972
P7C7_SNP6.0_JHP7_021208.CEL CNP10026 1 13081670 13249634 2 0.055412474
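For comparison, here is a rough Python sketch that builds rows in this layout from the sample data in the thread. It is not the method used by any poster; the names `read_rows` and `combine` are made up for illustration, and the "NA" placeholder for IDs missing from a file is an assumption, since the thread does not say how missing values should be handled.

```python
SAMPLES = [
    "P5E6_SNP6.0_JHP5_010408.CEL",
    "P5E11reh_SNP6.0_JHP5_011808.CEL",
    "P7C7_SNP6.0_JHP7_021208.CEL",
]
HEADER = "\t".join(["cnp_id"] + SAMPLES) + "\n"

# Sample data from the thread: confidence (file 1), copy number (file 2), reference (file 3).
CONFIDENCE = HEADER + (
    "CNP10\t0.004479834\t0.002792951\t0.00305613\n"
    "CNP10010\t0.058293636\t0.045141521\t0.021287972\n"
    "CNP10026\t0.026819976\t0.03868063\t0.055412474\n"
)
COPYNUMBER = HEADER + "CNP10\t2\t2\t2\nCNP10010\t2\t2\t2\nCNP10026\t2\t2\t2\n"
REFERENCE = (
    "cnp_id\tchr\tstart\tend\n"
    "CNP10\t1\t10293128\t10300570\n"
    "CNP10010\t1\t8128717\t8131908\n"
    "CNP10026\t1\t13081670\t13249634\n"
)


def read_rows(text):
    """Split tab-delimited text into a header list and a list of data rows."""
    lines = [line.split("\t") for line in text.splitlines() if line.strip()]
    return lines[0], lines[1:]


def combine(confidence_text, copynumber_text, reference_text):
    """Return rows of sample_id, cnp_id, chr, start, end, copynumber, confidence."""
    _, ref_rows = read_rows(reference_text)
    reference = {row[0]: row[1:4] for row in ref_rows}

    cn_header, cn_rows = read_rows(copynumber_text)
    copynumber = {
        (sample, row[0]): value
        for row in cn_rows
        for sample, value in zip(cn_header[1:], row[1:])
    }

    conf_header, conf_rows = read_rows(confidence_text)
    out = []
    # Outer loop over samples, inner loop over CNPs, to group rows by sample.
    for col, sample in enumerate(conf_header[1:], start=1):
        for row in conf_rows:
            cnp_id = row[0]
            # "NA" is an assumed placeholder for IDs absent from a file.
            chrom, start, end = reference.get(cnp_id, ["NA", "NA", "NA"])
            cn = copynumber.get((sample, cnp_id), "NA")
            out.append([sample, cnp_id, chrom, start, end, cn, row[col]])
    return out


rows = combine(CONFIDENCE, COPYNUMBER, REFERENCE)
print("\t".join(["sample_id", "cnp_id", "chr", "start", "end", "copynumber", "confidence"]))
for row in rows:
    print("\t".join(row))
```

With real data, the three constants would be replaced by the contents of the three input files.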

Last edited by zeratul111; 10-28-2010 at 12:33 PM. Reason: oops, fixed some errors
 
Old 10-28-2010, 12:38 PM   #4
fbobraga
Member
 
Registered: Jul 2010
Location: São Paulo - Brasil
Distribution: Debian 7 / Crunchbang 11
Posts: 229

Rep: Reputation: 41
Simply use the join command. This may help: http://articles.techrepublic.com.com...1-5031653.html
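For context, join merges two files line by line on a shared key field and expects both inputs to be sorted on that field. A rough Python equivalent of that keyed merge, using a few rows from files 3 and 1 (`inner_join` is a made-up name for this sketch), suggests that the result would still be one wide row per cnp_id, so the per-sample reshaping would remain a separate step.

```python
def inner_join(left_rows, right_rows):
    """Join two lists of rows on their first field, keeping only matching keys."""
    # A dictionary lookup is used here, so unlike join the inputs need not be sorted.
    right_by_key = {row[0]: row[1:] for row in right_rows}
    return [
        left + right_by_key[left[0]]
        for left in left_rows
        if left[0] in right_by_key
    ]


# Rows from file 3 (cnp_id, chr, start, end) and file 1 (cnp_id, confidences).
reference_rows = [
    ["CNP10", "1", "10293128", "10300570"],
    ["CNP10010", "1", "8128717", "8131908"],
    ["CNP2648", "23", "76053855", "76057477"],
]
confidence_rows = [
    ["CNP10", "0.004479834", "0.002792951", "0.00305613"],
    ["CNP10010", "0.058293636", "0.045141521", "0.021287972"],
]

joined = inner_join(reference_rows, confidence_rows)
for row in joined:
    print("\t".join(row))
```

CNP2648 drops out because it has no partner in the confidence rows, which is also how an inner join behaves.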
 
Old 10-28-2010, 12:46 PM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,256

Rep: Reputation: 2686
Well, I am assuming that you have only provided sample data, but see what you think:
Code:
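#!/usr/bin/awk -f
# Note: gensub() is a GNU awk (gawk) extension, so awk must be gawk here.
# The reference file must be named "f3" and must be passed last.

# First line of the first file: remember the sample IDs in arr[2..NF].
# This assumes f1 and f2 list the samples in the same column order.
NR==1{split($0,arr)}

# Data lines. For f1/f2, prepend each cell value to the entry keyed by
# (sample_id, cnp_id). For f3, prepend chr, start, end to every entry
# whose cnp_id (the part of the key after SUBSEP) matches.
FNR>1{
    if(FILENAME != "f3")
        for(i=2;i<=NF;i++)
            cnp_id[arr[i],$1]=$i" "cnp_id[arr[i],$1]
    else
        for(j in cnp_id)
            if($1 == gensub(".*"SUBSEP,"",1,j))
                cnp_id[j]=$2" "$3" "$4" "cnp_id[j]
}

# Print the header, then one line per (sample_id, cnp_id) pair.
END{
    print "sample_id cnp_id chr start end copynumber confidence"
    for(f in cnp_id)
        print gensub(SUBSEP," ",1,f),cnp_id[f]
}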
And you would call it like so:
Code:
./zeratul111.awk f1 f2 f3
 
Old 10-28-2010, 01:26 PM   #6
kepstin
LQ Newbie
 
Registered: Sep 2010
Posts: 2

Rep: Reputation: 2
This should do the trick, I think. Let me know if it works for you, and if you want any explanation of what's going on.

Code:
#!/usr/bin/perl
use strict;
use warnings;

# Read the reference data: cnp_id -> chr, start, end
my %reference;
open my $ref_fh, '<', 'file3.txt' or die "Cannot open file3.txt: $!";
my $header = <$ref_fh>;    # skip the header line
while (my $line = <$ref_fh>) {
	my ($cnp_id, $chr, $start, $end) = split ' ', $line;
	next unless defined $cnp_id;    # skip blank lines
	$reference{$cnp_id} = { chr => $chr, start => $start, end => $end };
}
close $ref_fh;

# Read the first input file (confidence values)
my %results;
open my $conf_fh, '<', 'file1.txt' or die "Cannot open file1.txt: $!";
my (undef, @sample_ids) = split ' ', scalar <$conf_fh>;

while (my $line = <$conf_fh>) {
	my ($cnp_id, @confidence) = split ' ', $line;
	next unless defined $cnp_id;
	for my $i (0 .. $#confidence) {
		my $key = "$sample_ids[$i]\t$cnp_id";

		# Copy in the reference data
		$results{$key}{$_} = $reference{$cnp_id}{$_} for qw(chr start end);

		# Save this confidence value
		$results{$key}{confidence} = $confidence[$i];
	}
}
close $conf_fh;

# Read the second input file (copy numbers)
open my $cn_fh, '<', 'file2.txt' or die "Cannot open file2.txt: $!";
(undef, @sample_ids) = split ' ', scalar <$cn_fh>;

while (my $line = <$cn_fh>) {
	my ($cnp_id, @copynumbers) = split ' ', $line;
	next unless defined $cnp_id;
	for my $i (0 .. $#copynumbers) {
		# Save this copynumber value
		$results{"$sample_ids[$i]\t$cnp_id"}{copynumber} = $copynumbers[$i];
	}
}
close $cn_fh;

# Save the formatted merged output
open my $out_fh, '>', 'output.txt' or die "Cannot open output.txt: $!";
print $out_fh join("\t", qw(sample_id cnp_id chr start end copynumber confidence)), "\n";
for my $key (keys %results) {
	# The key already holds "sample_id<TAB>cnp_id"
	print $out_fh join("\t", $key, @{ $results{$key} }{qw(chr start end copynumber confidence)}), "\n";
}
close $out_fh;
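One thing to be aware of: `keys %results` returns hash keys in no particular order (awk's `for (f in cnp_id)` is similar), so the output rows will not necessarily be grouped by sample. If a stable order matters, the rows can be sorted before printing. A small Python sketch of that idea, assuming rows shaped like the output and a known sample order (`sort_rows` and `sample_order` are illustrative names, not part of either script):

```python
def sort_rows(rows, sample_order):
    """Sort long-format rows by sample (in header order), then by cnp_id."""
    rank = {sample: i for i, sample in enumerate(sample_order)}
    # cnp_id is compared as a string, so the order is lexical, not numeric.
    return sorted(rows, key=lambda row: (rank[row[0]], row[1]))


sample_order = ["P5E6_SNP6.0_JHP5_010408.CEL", "P5E11reh_SNP6.0_JHP5_011808.CEL"]

# Rows as they might come out of an unordered hash.
rows = [
    ("P5E11reh_SNP6.0_JHP5_011808.CEL", "CNP10010", "1", "8128717", "8131908", "2", "0.045141521"),
    ("P5E6_SNP6.0_JHP5_010408.CEL", "CNP10026", "1", "13081670", "13249634", "2", "0.026819976"),
    ("P5E6_SNP6.0_JHP5_010408.CEL", "CNP10", "1", "10293128", "10300570", "2", "0.004479834"),
    ("P5E11reh_SNP6.0_JHP5_011808.CEL", "CNP10", "1", "10293128", "10300570", "2", "0.002792951"),
]

ordered = sort_rows(rows, sample_order)
for row in ordered:
    print("\t".join(row))
```

In the Perl script the equivalent change would be to sort the keys before the final loop.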
 
Old 10-28-2010, 01:46 PM   #7
zeratul111
LQ Newbie
 
Registered: Sep 2010
Posts: 19

Original Poster
Rep: Reputation: Disabled
Thanks grail and kepstin! Your help is much appreciated. I ran both scripts and they worked perfectly! It's nice to see different approaches; I will study them in detail now.

Thanks fbobraga for the helpful join command, I will make note of it.

Last edited by zeratul111; 10-28-2010 at 01:47 PM.
 
  


All times are GMT -5.
