Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
03-26-2014, 12:32 AM
#1
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
awk to compare and combine two files
Hi all,
I have two files with different types of information about a biological sequence. Here is what they look like.
File 1:
Code:
Alexandromenia_crassa_TRI_1_13_NORM_comp0_c0_seq1 613.00 473.47 8.00 3.21 3.17
Alexandromenia_crassa_TRI_1_13_NORM_comp100015_c0_seq1 407.00 268.45 5.00 3.53 3.50
Alexandromenia_crassa_TRI_1_13_NORM_comp10002_c0_seq1 392.00 253.59 4.00 2.99 2.96
Alexandromenia_crassa_TRI_1_13_NORM_comp10002_c0_seq2 201.00 71.53 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10003_c0_seq1 412.00 273.41 1.00 0.69 0.69
Alexandromenia_crassa_TRI_1_13_NORM_comp100088_c0_seq1 293.00 156.41 2.00 2.43 2.40
Alexandromenia_crassa_TRI_1_13_NORM_comp10009_c0_seq1 648.00 508.39 7.00 2.61 2.58
Alexandromenia_crassa_TRI_1_13_NORM_comp10012_c0_seq1 753.00 613.22 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10012_c0_seq2 706.00 566.28 14.00 4.69 4.64
Alexandromenia_crassa_TRI_1_13_NORM_comp10013_c0_seq1 344.00 206.20 1.00 0.92 0.91
Alexandromenia_crassa_TRI_1_13_NORM_comp10014_c0_seq1 445.00 306.17 3.00 1.86 1.84
Alexandromenia_crassa_TRI_1_13_NORM_comp10015_c0_seq1 294.00 157.38 2.00 2.41 2.39
Alexandromenia_crassa_TRI_1_13_NORM_comp10016_c0_seq1 614.00 474.47 8.00 3.20 3.16
Alexandromenia_crassa_TRI_1_13_NORM_comp1001_c0_seq1 316.00 178.76 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp1001_c0_seq2 519.00 379.78 5.00 2.50 2.47
Alexandromenia_crassa_TRI_1_13_NORM_comp10021_c0_seq1 289.00 152.55 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10021_c0_seq2 539.00 399.70 3.00 1.42 1.41
Alexandromenia_crassa_TRI_1_13_NORM_comp10023_c0_seq1 273.00 137.20 3.00 4.15 4.11
Alexandromenia_crassa_TRI_1_13_NORM_comp10023_c0_seq2 241.00 107.12 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp100244_c0_seq1 228.00 95.24 3.00 5.98 5.91
Alexandromenia_crassa_TRI_1_13_NORM_comp10025_c0_seq1 500.00 360.87 3.00 1.58 1.56
Alexandromenia_crassa_TRI_1_13_NORM_comp10028_c0_seq1 391.00 252.60 15.00 11.27 11.15
Alexandromenia_crassa_TRI_1_13_NORM_comp10035_c0_seq1 228.00 95.24 1.00 1.99 1.97
Alexandromenia_crassa_TRI_1_13_NORM_comp10036_c0_seq1 1188.00 1048.17 79.00 14.30 14.14
Alexandromenia_crassa_TRI_1_13_NORM_comp100382_c0_seq1 218.00 86.29 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10038_c0_seq1 307.00 169.99 7.00 7.81 7.73
Alexandromenia_crassa_TRI_1_13_NORM_comp100396_c0_seq1 237.00 103.44 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp100411_c0_seq1 295.00 158.35 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp100454_c0_seq1 649.00 509.39 8.00 2.98 2.95
Alexandromenia_crassa_TRI_1_13_NORM_comp100458_c0_seq1 401.00 262.51 1.00 0.72 0.72
Alexandromenia_crassa_TRI_1_13_NORM_comp10045_c0_seq1 484.00 344.95 6.00 3.30 3.26
Alexandromenia_crassa_TRI_1_13_NORM_comp10045_c0_seq2 373.00 234.80 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp100462_c0_seq1 246.00 111.75 1.00 1.70 1.68
Alexandromenia_crassa_TRI_1_13_NORM_comp10046_c0_seq1 281.00 144.86 11.00 14.41 14.26
Alexandromenia_crassa_TRI_1_13_NORM_comp10048_c0_seq1 496.00 356.89 6.00 3.19 3.16
Alexandromenia_crassa_TRI_1_13_NORM_comp10050_c0_seq1 744.00 604.23 9.00 2.83 2.80
Alexandromenia_crassa_TRI_1_13_NORM_comp100526_c0_seq1 304.00 167.08 3.00 3.41 3.37
Alexandromenia_crassa_TRI_1_13_NORM_comp100547_c0_seq1 342.00 204.24 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10055_c0_seq1 607.00 467.49 8.00 3.25 3.21
Alexandromenia_crassa_TRI_1_13_NORM_comp10056_c0_seq1 202.00 72.38 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp100589_c0_seq1 443.00 304.18 1.00 0.62 0.62
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq1 456.00 317.10 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 1679.00 1539.17 95.00 11.71 11.58
Alexandromenia_crassa_TRI_1_13_NORM_comp100609_c0_seq1 256.00 121.09 1.00 1.57 1.55
Alexandromenia_crassa_TRI_1_13_NORM_comp10060_c0_seq1 605.00 465.49 15.00 6.11 6.05
Alexandromenia_crassa_TRI_1_13_NORM_comp100623_c0_seq1 445.00 306.17 1.00 0.62 0.61
Alexandromenia_crassa_TRI_1_13_NORM_comp10062_c0_seq1 302.00 165.13 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10062_c0_seq2 320.00 182.67 9.00 9.35 9.25
Alexandromenia_crassa_TRI_1_13_NORM_comp100639_c0_seq1 682.00 542.32 20.00 7.00 6.92
Alexandromenia_crassa_TRI_1_13_NORM_comp100641_c0_seq1 465.00 326.05 3.00 1.75 1.73
Alexandromenia_crassa_TRI_1_13_NORM_comp10064_c0_seq1 557.00 417.64 6.00 2.73 2.70
Alexandromenia_crassa_TRI_1_13_NORM_comp10065_c0_seq1 390.00 251.61 10.00 7.54 7.46
Alexandromenia_crassa_TRI_1_13_NORM_comp10066_c0_seq1 252.00 117.34 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10066_c0_seq2 234.00 100.69 1.00 1.88 1.86
Alexandromenia_crassa_TRI_1_13_NORM_comp100688_c0_seq1 331.00 193.44 7.00 6.87 6.79
Alexandromenia_crassa_TRI_1_13_NORM_comp10069_c0_seq1 661.00 521.36 100.00 36.38 36.00
Alexandromenia_crassa_TRI_1_13_NORM_comp1006_c0_seq1 472.00 333.01 4.00 2.28 2.25
Alexandromenia_crassa_TRI_1_13_NORM_comp100709_c0_seq1 528.00 388.75 1.00 0.49 0.48
Alexandromenia_crassa_TRI_1_13_NORM_comp10070_c0_seq1 1270.00 1130.17 23.00 3.86 3.82
Alexandromenia_crassa_TRI_1_13_NORM_comp100711_c0_seq1 328.00 190.50 2.00 1.99 1.97
Alexandromenia_crassa_TRI_1_13_NORM_comp100728_c0_seq1 656.00 516.37 10.00 3.67 3.63
Alexandromenia_crassa_TRI_1_13_NORM_comp10072_c0_seq1 336.00 198.34 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10072_c0_seq2 391.00 252.60 5.00 3.76 3.72
Alexandromenia_crassa_TRI_1_13_NORM_comp100748_c0_seq1 367.00 228.87 2.00 1.66 1.64
Alexandromenia_crassa_TRI_1_13_NORM_comp10074_c0_seq1 794.00 654.19 13.00 3.77 3.73
Alexandromenia_crassa_TRI_1_13_NORM_comp10075_c0_seq1 675.00 535.33 11.00 3.90 3.86
Alexandromenia_crassa_TRI_1_13_NORM_comp10076_c0_seq1 431.00 292.27 3.00 1.95 1.93
Alexandromenia_crassa_TRI_1_13_NORM_comp10077_c0_seq1 238.00 104.36 9.00 16.37 16.19
Alexandromenia_crassa_TRI_1_13_NORM_comp100795_c0_seq1 755.00 615.22 10.00 3.08 3.05
Alexandromenia_crassa_TRI_1_13_NORM_comp1007_c0_seq1 332.00 194.42 4.00 3.90 3.86
Alexandromenia_crassa_TRI_1_13_NORM_comp100800_c0_seq1 296.00 159.32 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10080_c0_seq1 2319.00 2179.17 55.07 4.79 4.74
Alexandromenia_crassa_TRI_1_13_NORM_comp10082_c0_seq1 497.00 357.88 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10082_c0_seq2 511.00 371.82 5.00 2.55 2.52
Alexandromenia_crassa_TRI_1_13_NORM_comp10083_c0_seq1 326.00 188.54 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10083_c0_seq2 298.00 161.25 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp10085_c0_seq1 363.00 224.92 5.00 4.22 4.17
Alexandromenia_crassa_TRI_1_13_NORM_comp100875_c0_seq1 443.00 304.18 9.00 5.61 5.55
Alexandromenia_crassa_TRI_1_13_NORM_comp10087_c0_seq1 299.00 162.22 3.00 3.51 3.47
Alexandromenia_crassa_TRI_1_13_NORM_comp100900_c0_seq1 260.00 124.86 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp100913_c0_seq1 293.00 156.41 2.00 2.43 2.40
Alexandromenia_crassa_TRI_1_13_NORM_comp10091_c0_seq1 492.00 352.91 5.00 2.69 2.66
Alexandromenia_crassa_TRI_1_13_NORM_comp10092_c0_seq1 338.00 200.31 117.00 110.82 109.65
Alexandromenia_crassa_TRI_1_13_NORM_comp10095_c0_seq1 816.00 676.18 13.00 3.65 3.61
Alexandromenia_crassa_TRI_1_13_NORM_comp10097_c0_seq1 769.00 629.21 14.00 4.22 4.18
Alexandromenia_crassa_TRI_1_13_NORM_comp100987_c0_seq1 350.00 212.11 3.00 2.68 2.65
Alexandromenia_crassa_TRI_1_13_NORM_comp10099_c0_seq1 283.00 146.78 1.00 1.29 1.28
Alexandromenia_crassa_TRI_1_13_NORM_comp1009_c0_seq1 230.00 97.05 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp1009_c0_seq2 297.00 160.28 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp101004_c0_seq1 470.00 331.02 4.00 2.29 2.27
Alexandromenia_crassa_TRI_1_13_NORM_comp10102_c0_seq1 1315.00 1175.17 123.00 19.85 19.64
Alexandromenia_crassa_TRI_1_13_NORM_comp10103_c0_seq1 208.00 77.53 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp101055_c0_seq1 240.00 106.20 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp1010_c0_seq1 439.00 300.21 4.00 2.53 2.50
Alexandromenia_crassa_TRI_1_13_NORM_comp10110_c0_seq1 501.00 361.86 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp101134_c0_seq1 1363.00 1223.17 52.00 8.06 7.98
Alexandromenia_crassa_TRI_1_13_NORM_comp101136_c0_seq1 231.00 97.96 0.00 0.00 0.00
Alexandromenia_crassa_TRI_1_13_NORM_comp101149_c0_seq1 317.00 179.74 1.00 1.06 1.04
Alexandromenia_crassa_TRI_1_13_NORM_comp101152_c0_seq1 408.00 269.45 5.00 3.52 3.48
Alexandromenia_crassa_TRI_1_13_NORM_comp10115_c0_seq1 224.00 91.64 0.00 0.00 0.00
File 2:
Code:
Alexandromenia_crassa_TRI_1_13_NORM_comp10003_c0_seq1 136 0.036 14.5 1.5 1 2 0.068 3.1e+02 1.9 0.0 88 103 38 53 9 59 0.81 Domain of unknown function (DUF1911)
Alexandromenia_crassa_TRI_1_13_NORM_comp10003_c0_seq1 136 0.036 14.5 1.5 2 2 0.00012 0.54 10.7 1.0 14 74 69 129 63 133 0.85 Domain of unknown function (DUF1911)
Alexandromenia_crassa_TRI_1_13_NORM_comp10003_c0_seq1 136 0.3 11.4 9.3 1 2 0.064 2.9e+02 1.8 0.2 86 96 24 34 5 38 0.58 LTXXQ motif family protein
Alexandromenia_crassa_TRI_1_13_NORM_comp10003_c0_seq1 136 0.3 11.4 9.3 2 2 0.00021 0.97 9.8 4.5 7 93 37 113 14 120 0.74 LTXXQ motif family protein
Alexandromenia_crassa_TRI_1_13_NORM_comp10003_c0_seq1 136 0.85 9.4 3.2 1 2 0.0015 6.8 6.5 0.2 34 58 10 33 6 50 0.82 Toluene-4-monooxygenase system protein B (TmoB)
Alexandromenia_crassa_TRI_1_13_NORM_comp10003_c0_seq1 136 0.85 9.4 3.2 2 2 0.018 82 3.0 0.1 41 66 51 79 48 87 0.80 Toluene-4-monooxygenase system protein B (TmoB)
Alexandromenia_crassa_TRI_1_13_NORM_comp10012_c0_seq1 200 1.1e-36 124.8 0.2 1 1 2.1e-40 1.4e-36 124.4 0.1 3 110 30 136 28 137 0.97 WH1 domain
Alexandromenia_crassa_TRI_1_13_NORM_comp10012_c0_seq1 200 1 8.5 8.2 1 2 0.00087 5.9 6.0 0.1 366 394 33 61 23 67 0.90 Pneumovirinae attachment membrane glycoprotein G
Alexandromenia_crassa_TRI_1_13_NORM_comp10012_c0_seq1 200 1 8.5 8.2 2 2 0.0015 10 5.2 2.4 186 267 103 183 92 199 0.74 Pneumovirinae attachment membrane glycoprotein G
Alexandromenia_crassa_TRI_1_13_NORM_comp10012_c0_seq2 235 1.4e-36 124.4 0.2 1 1 3.2e-40 2.2e-36 123.8 0.1 3 110 65 171 63 172 0.97 WH1 domain
Alexandromenia_crassa_TRI_1_13_NORM_comp10012_c0_seq2 235 1.1 8.3 7.6 1 2 0.001 7.1 5.7 0.1 366 394 68 96 56 102 0.90 Pneumovirinae attachment membrane glycoprotein G
Alexandromenia_crassa_TRI_1_13_NORM_comp10012_c0_seq2 235 1.1 8.3 7.6 2 2 0.0021 15 4.7 2.4 186 267 138 218 127 234 0.73 Pneumovirinae attachment membrane glycoprotein G
Alexandromenia_crassa_TRI_1_13_NORM_comp1001_c0_seq2 136 7.1e-22 77.5 0.0 1 2 4.1e-25 5.7e-21 74.6 0.0 29 112 10 93 7 94 0.97 Sorting nexin C terminal
Alexandromenia_crassa_TRI_1_13_NORM_comp1001_c0_seq2 136 7.1e-22 77.5 0.0 2 2 0.018 2.4e+02 1.6 0.0 36 62 95 121 91 134 0.68 Sorting nexin C terminal
Alexandromenia_crassa_TRI_1_13_NORM_comp10055_c0_seq1 201 0.0037 17.0 2.2 1 2 0.0028 19 5.1 0.6 38 54 10 26 6 29 0.82 YqzH-like protein
Alexandromenia_crassa_TRI_1_13_NORM_comp10055_c0_seq1 201 0.0037 17.0 2.2 2 2 2.2e-05 0.15 11.8 0.0 12 54 24 64 22 68 0.86 YqzH-like protein
Alexandromenia_crassa_TRI_1_13_NORM_comp10055_c0_seq1 201 4.1 6.3 7.1 1 2 0.00025 1.7 7.6 3.3 128 213 16 101 5 135 0.71 Putative metallopeptidase domain
Alexandromenia_crassa_TRI_1_13_NORM_comp10055_c0_seq1 201 4.1 6.3 7.1 2 2 0.17 1.1e+03 -1.7 0.0 151 175 169 193 125 200 0.56 Putative metallopeptidase domain
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 2.8e-15 55.3 18.3 1 4 0.0067 18 4.9 0.1 20 40 249 276 241 280 0.79 Leucine Rich repeats (2 copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 2.8e-15 55.3 18.3 2 4 3.5e-09 9.6e-06 24.9 2.1 3 40 285 322 284 327 0.92 Leucine Rich repeats (2 copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 2.8e-15 55.3 18.3 3 4 2.7e-12 7.3e-09 34.9 0.8 1 40 329 368 329 371 0.95 Leucine Rich repeats (2 copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 2.8e-15 55.3 18.3 4 4 3.4 9.4e+03 -3.8 0.0 26 34 376 384 375 385 0.74 Leucine Rich repeats (2 copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 1.7e-12 46.7 13.9 1 2 2.4e-09 6.7e-06 25.6 1.4 3 61 262 318 260 318 0.91 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 1.7e-12 46.7 13.9 2 2 3.8e-12 1.1e-08 34.6 3.8 2 61 307 364 306 364 0.93 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 1.3e-11 42.7 16.7 1 8 2.4 6.4e+03 -2.1 0.0 10 20 57 78 54 79 0.61 Leucine Rich Repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 1.3e-11 42.7 16.7 2 8 5 1.4e+04 -3.1 0.0 3 17 240 254 240 255 0.75 Leucine Rich Repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 1.3e-11 42.7 16.7 3 8 9.7e-05 0.26 11.3 0.0 2 18 262 278 261 282 0.86 Leucine Rich Repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 1.3e-11 42.7 16.7 4 8 0.0002 0.54 10.3 0.2 2 21 285 304 284 305 0.89 Leucine Rich Repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 1.3e-11 42.7 16.7 5 8 0.0023 6.2 7.1 0.5 1 21 307 327 307 328 0.88 Leucine Rich Repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 1.3e-11 42.7 16.7 6 8 7.4e-05 0.2 11.6 0.1 1 22 330 351 330 351 0.91 Leucine Rich Repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 1.3e-11 42.7 16.7 7 8 0.0013 3.7 7.8 0.0 2 18 354 370 353 373 0.89 Leucine Rich Repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 1.3e-11 42.7 16.7 8 8 1.3 3.6e+03 -1.3 0.0 3 12 377 386 375 392 0.82 Leucine Rich Repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 4.7e-06 25.8 11.0 1 6 0.21 5.7e+02 1.4 0.0 3 17 262 276 260 276 0.81 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 4.7e-06 25.8 11.0 2 6 0.0029 8 7.0 0.1 3 17 285 299 283 299 0.92 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 4.7e-06 25.8 11.0 3 6 0.036 97 3.7 0.1 1 16 306 321 303 329 0.76 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 4.7e-06 25.8 11.0 4 6 0.0009 2.5 8.5 0.0 1 16 329 344 329 345 0.94 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 4.7e-06 25.8 11.0 5 6 0.0039 11 6.6 0.0 3 17 354 368 352 368 0.93 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 4.7e-06 25.8 11.0 6 6 1.6 4.3e+03 -1.2 0.0 3 12 376 385 375 390 0.79 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 0.0038 17.0 10.2 1 5 0.086 2.4e+02 2.2 0.0 3 16 261 274 259 280 0.77 Leucine Rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 0.0038 17.0 10.2 2 5 0.34 9.2e+02 0.3 0.0 4 15 285 296 282 298 0.84 Leucine Rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 0.0038 17.0 10.2 3 5 0.0027 7.3 6.8 0.4 1 17 305 321 305 327 0.88 Leucine Rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 0.0038 17.0 10.2 4 5 0.0026 7.1 6.9 0.2 1 16 328 343 328 350 0.91 Leucine Rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10058_c0_seq2 508 0.0038 17.0 10.2 5 5 0.18 5e+02 1.2 0.0 2 15 352 365 351 371 0.86 Leucine Rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp100623_c0_seq1 148 0.0026 17.7 0.9 1 1 6.6e-07 0.0045 16.9 0.6 21 89 61 124 52 130 0.65 Mak16 protein C-terminal region
Alexandromenia_crassa_TRI_1_13_NORM_comp100623_c0_seq1 148 0.16 11.9 2.0 1 2 0.19 1.3e+03 -0.8 0.0 118 138 13 33 2 47 0.52 Mediator complex subunit 29
Alexandromenia_crassa_TRI_1_13_NORM_comp100623_c0_seq1 148 0.16 11.9 2.0 2 2 2.6e-05 0.18 11.7 0.2 12 52 69 109 60 119 0.89 Mediator complex subunit 29
Alexandromenia_crassa_TRI_1_13_NORM_comp100639_c0_seq1 161 2.7 8.0 22.6 1 3 0.0045 62 3.7 1.6 1 22 32 53 32 72 0.79 Anaphylotoxin-like domain
Alexandromenia_crassa_TRI_1_13_NORM_comp100639_c0_seq1 161 2.7 8.0 22.6 2 3 0.00042 5.8 7.0 0.9 2 28 81 109 81 111 0.82 Anaphylotoxin-like domain
Alexandromenia_crassa_TRI_1_13_NORM_comp100639_c0_seq1 161 2.7 8.0 22.6 3 3 0.00029 3.9 7.5 3.0 1 36 112 143 112 143 0.93 Anaphylotoxin-like domain
Alexandromenia_crassa_TRI_1_13_NORM_comp10070_c0_seq1 236 0.074 12.6 0.2 1 1 8.7e-06 0.12 11.9 0.2 2 101 116 213 115 221 0.88 Talin, middle domain
Alexandromenia_crassa_TRI_1_13_NORM_comp10074_c0_seq1 235 1e-62 212.0 0.0 1 1 9.3e-67 1.3e-62 211.7 0.0 253 441 21 212 6 228 0.92 Cytochrome P450
Alexandromenia_crassa_TRI_1_13_NORM_comp10075_c0_seq1 156 1e-13 50.6 0.0 1 1 2.7e-17 1.8e-13 49.8 0.0 1 85 71 151 71 151 0.99 CDC6, C terminal
Alexandromenia_crassa_TRI_1_13_NORM_comp10075_c0_seq1 156 0.11 12.0 0.2 1 2 0.0024 16 4.9 0.0 115 148 40 73 25 101 0.78 Brix domain
Alexandromenia_crassa_TRI_1_13_NORM_comp10075_c0_seq1 156 0.11 12.0 0.2 2 2 0.0014 9.8 5.6 0.0 38 67 110 138 86 142 0.81 Brix domain
Alexandromenia_crassa_TRI_1_13_NORM_comp100795_c0_seq1 120 2.9e-17 62.3 0.0 1 1 2.6e-21 3.5e-17 62.1 0.0 6 119 7 119 3 120 0.95 Protein of unknown function (DUF667)
Alexandromenia_crassa_TRI_1_13_NORM_comp10080_c0_seq1 635 1.5e-59 201.3 5.0 1 1 3e-63 2e-59 200.9 3.5 1 281 159 446 159 447 0.91 Subtilase family
Alexandromenia_crassa_TRI_1_13_NORM_comp10080_c0_seq1 635 1.7e-30 104.5 0.6 1 1 4.4e-34 3e-30 103.7 0.4 1 87 497 580 497 580 0.97 Proprotein convertase P-domain
Alexandromenia_crassa_TRI_1_13_NORM_comp10083_c0_seq1 108 7.4e-17 60.7 4.3 1 1 6.1e-21 8.4e-17 60.6 3.0 236 343 7 107 5 108 0.95 Transmembrane amino acid transporter protein
Alexandromenia_crassa_TRI_1_13_NORM_comp10083_c0_seq2 99 5.4e-15 54.6 0.9 1 1 4.3e-19 5.9e-15 54.5 0.6 236 304 7 78 5 97 0.90 Transmembrane amino acid transporter protein
Alexandromenia_crassa_TRI_1_13_NORM_comp10097_c0_seq1 200 0.022 14.2 0.5 1 1 9.6e-06 0.044 13.3 0.4 14 42 19 47 17 48 0.92 Cbb3-type cytochrome oxidase component FixQ
Alexandromenia_crassa_TRI_1_13_NORM_comp10097_c0_seq1 200 0.032 13.6 0.2 1 1 1.9e-05 0.088 12.2 0.1 12 36 22 43 13 46 0.69 Adenovirus E3 region protein CR2
Alexandromenia_crassa_TRI_1_13_NORM_comp10097_c0_seq1 200 0.22 11.5 2.4 1 1 0.0001 0.47 10.4 1.7 25 57 16 50 6 125 0.75 Domain of unknown function (DUF4381)
Alexandromenia_crassa_TRI_1_13_NORM_comp101134_c0_seq1 346 7e-115 383.5 0.6 1 1 2.9e-118 1e-114 383.0 0.4 2 345 9 341 8 344 0.93 IMP dehydrogenase / GMP reductase domain
Alexandromenia_crassa_TRI_1_13_NORM_comp101134_c0_seq1 346 0.00063 18.5 1.2 1 2 1.4 4.7e+03 -4.1 0.1 283 309 153 179 142 181 0.72 FMN-dependent dehydrogenase
Alexandromenia_crassa_TRI_1_13_NORM_comp101134_c0_seq1 346 0.00063 18.5 1.2 2 2 1.8e-07 0.00063 18.5 0.9 267 312 199 245 190 255 0.82 FMN-dependent dehydrogenase
Alexandromenia_crassa_TRI_1_13_NORM_comp101134_c0_seq1 346 0.15 11.1 0.8 1 2 0.00041 1.4 7.9 0.0 43 84 135 176 122 206 0.86 KDPG and KHG aldolase
Alexandromenia_crassa_TRI_1_13_NORM_comp101134_c0_seq1 346 0.15 11.1 0.8 2 2 0.038 1.3e+02 1.5 0.1 61 90 216 245 200 276 0.85 KDPG and KHG aldolase
Alexandromenia_crassa_TRI_1_13_NORM_comp101134_c0_seq1 346 0.26 10.4 1.1 1 2 0.00088 3 6.9 0.0 154 211 123 179 107 183 0.79 PcrB family
Alexandromenia_crassa_TRI_1_13_NORM_comp101134_c0_seq1 346 0.26 10.4 1.1 2 2 0.027 94 2.0 0.2 184 216 215 247 193 258 0.84 PcrB family
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 6.9e-30 103.1 4.9 1 3 2.8e-14 3.9e-11 43.0 0.1 28 86 44 104 12 108 0.78 Ankyrin repeats (3 copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 6.9e-30 103.1 4.9 2 3 5.3e-16 7.3e-13 48.5 0.1 1 85 49 170 49 174 0.83 Ankyrin repeats (3 copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 6.9e-30 103.1 4.9 3 3 1.9e-09 2.6e-06 27.5 0.1 4 83 118 209 115 215 0.79 Ankyrin repeats (3 copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 2e-18 65.0 3.1 1 5 2.6e-06 0.0036 16.9 0.0 5 32 48 75 46 76 0.92 Ankyrin repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 2e-18 65.0 3.1 2 5 1.9e-08 2.6e-05 23.6 0.1 4 29 79 104 79 106 0.94 Ankyrin repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 2e-18 65.0 3.1 3 5 0.87 1.2e+03 -0.5 0.0 5 21 114 130 113 134 0.86 Ankyrin repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 2e-18 65.0 3.1 4 5 0.071 97 2.9 0.0 5 20 147 162 145 172 0.82 Ankyrin repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 2e-18 65.0 3.1 5 5 2.9e-06 0.0039 16.8 0.0 3 23 186 206 184 212 0.92 Ankyrin repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 5.6e-17 60.0 1.0 1 5 1.7e-06 0.0023 17.8 0.0 4 30 47 73 42 73 0.93 Ankyrin repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 5.6e-17 60.0 1.0 2 5 1.3e-05 0.018 15.1 0.0 4 26 79 101 77 105 0.94 Ankyrin repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 5.6e-17 60.0 1.0 3 5 0.56 7.7e+02 0.7 0.0 4 21 113 130 111 131 0.91 Ankyrin repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 5.6e-17 60.0 1.0 4 5 0.012 17 5.9 0.0 4 20 146 162 143 165 0.89 Ankyrin repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 5.6e-17 60.0 1.0 5 5 4e-05 0.055 13.6 0.0 3 25 186 208 184 212 0.90 Ankyrin repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 3.2e-16 58.8 1.2 1 5 5.2e-09 7.2e-06 25.9 0.0 14 56 43 84 35 84 0.92 Ankyrin repeats (many copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 3.2e-16 58.8 1.2 2 5 1.2e-05 0.016 15.2 0.0 13 44 75 105 69 109 0.84 Ankyrin repeats (many copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 3.2e-16 58.8 1.2 3 5 0.91 1.2e+03 -0.3 0.0 14 36 111 131 106 138 0.82 Ankyrin repeats (many copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 3.2e-16 58.8 1.2 4 5 0.0087 12 6.1 0.0 17 42 145 170 139 173 0.82 Ankyrin repeats (many copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 3.2e-16 58.8 1.2 5 5 9.3e-05 0.13 12.4 0.0 15 42 184 209 180 214 0.82 Ankyrin repeats (many copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 4.3e-16 58.8 0.3 1 4 5e-10 6.8e-07 29.5 0.0 3 54 47 97 45 97 0.95 Ankyrin repeats (many copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 4.3e-16 58.8 0.3 2 4 0.0001 0.14 12.5 0.0 3 53 79 130 79 131 0.81 Ankyrin repeats (many copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 4.3e-16 58.8 0.3 3 4 1.9e-07 0.00026 21.3 0.0 1 54 144 205 144 205 0.77 Ankyrin repeats (many copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 4.3e-16 58.8 0.3 4 4 10 1.4e+04 -3.8 0.0 13 33 329 349 327 352 0.70 Ankyrin repeats (many copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 7.3e-11 41.3 5.2 1 3 6e-08 8.2e-05 22.0 0.1 5 40 267 303 263 304 0.91 Leucine Rich repeats (2 copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 7.3e-11 41.3 5.2 2 3 9.9e-07 0.0013 18.1 0.1 2 31 288 317 287 319 0.84 Leucine Rich repeats (2 copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 7.3e-11 41.3 5.2 3 3 3.1e-07 0.00042 19.7 0.8 3 39 317 353 315 357 0.94 Leucine Rich repeats (2 copies)
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 1e-10 41.0 5.8 1 3 2.2 3e+03 -2.1 0.1 23 33 123 133 122 139 0.66 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 1e-10 41.0 5.8 2 3 7.7e-10 1e-06 28.2 0.1 3 57 265 318 263 319 0.90 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 1e-10 41.0 5.8 3 3 2.5e-06 0.0034 16.9 0.2 2 46 316 359 315 363 0.90 Leucine rich repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 1.1e-06 27.7 7.6 1 5 10 1.4e+04 -3.1 0.1 7 18 41 52 40 55 0.68 Leucine Rich Repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 1.1e-06 27.7 7.6 2 5 0.043 59 4.1 0.0 2 17 265 280 264 286 0.85 Leucine Rich Repeat
Alexandromenia_crassa_TRI_1_13_NORM_comp10120_c0_seq1 366 1.1e-06 27.7 7.6 3 5 5.2e-05 0.071 13.0 0.1 1 17 288 304 288 309 0.90 Leucine Rich Repeat
What I want to do is go through the $1 fields in file2 and, whenever there is a match to a $1 field in file1, append fields 19+ of file2 (the text annotation, e.g., Domain of unknown function (DUF1911)) to the end of the corresponding line in file1. Note that there are multiple instances of some sequences in file2, but I don't care if multiple identical annotations get appended to the end of a line of file1.
I know awk is the tool for the job and I've even found some other threads similar to this (e.g.,
http://stackoverflow.com/questions/1...umns-using-awk ) but I'm not sure how to do exactly what I want.
Any assistance would be greatly appreciated.
Thanks!
Kevin
Last edited by kmkocot; 03-26-2014 at 03:08 AM .
Reason: Fixed an error where I referred to file1 instead of file2.
03-26-2014, 01:45 AM
#2
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,923
See the similar threads listed at the bottom of this page, especially this one:
http://www.linuxquestions.org/questi...ed-4175412672/
04-01-2014, 01:14 AM
#3
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Original Poster
I'm trying but not really making much progress. Here's a more basic question. In the code below, what does the "a" stand for?
Code:
awk -F'\t' -v OFS='\t' 'NR==FNR{a[$2FS$3]=$1;next}$2FS$3 in a{print $0,a[$2FS$3]}' file1 file2
Thanks,
Kevin
04-01-2014, 01:15 AM
#4
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,923
a is a variable, specifically an array; you reach its elements with a[index], where the index goes between the brackets.
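As a rough illustration of how such an array behaves, here is a minimal sketch on made-up two-column input (the keys seq1, seq2 and the values are toy data, not taken from the files above). It stores column 2 under the key in column 1 and then looks values up by key:

```shell
# Toy input: a key in column 1 and a value in column 2 (made-up data).
result=$(printf 'seq1\t613\nseq2\t407\n' | awk -F'\t' '
    { a[$1] = $2 }                                    # store column 2 under the key in column 1
    END {
        if ("seq2" in a) print "seq2 -> " a["seq2"]   # look a value up by its key
        if (!("seq9" in a)) print "seq9 not stored"   # "in" tests membership without creating an entry
    }')
echo "$result"
```

awk arrays are associative, so any string can serve as the index; that is why a whole sequence ID can be used as a key.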
04-01-2014, 01:17 AM
#5
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,923
I would also try splitting it into lines:
Code:
awk -F'\t' -v OFS='\t' '
NR==FNR { a[$2FS$3]=$1; next }
$2FS$3 in a { print $0,a[$2FS$3] }
' file1 file2
04-01-2014, 02:09 AM
#6
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Original Poster
Thanks for that. From that model I posted, I came up with the following command but it just prints file2.
Code:
awk -F' ' 'NR==FNR{a[$1]=$1;next}$1 in a{print $0,a[$19]}' file1 file2
I must have a mistake in the first half?
Thanks,
Kevin
04-01-2014, 02:28 AM
#7
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,923
just split that awk into two lines:
NR==FNR will be true for the first file, therefore {a[$1]=$1;next} will be executed only for file1. Probably here you need to save the whole line: a[$1]=$0 would be better
the second part: $1 in a will be executed only for file2. Here, in { } you need to construct the printed lines.
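To see why NR==FNR singles out the first file, a small sketch with two throwaway files may help (the file contents and the temporary directory are made up for the demonstration). NR counts records across all input, while FNR restarts at 1 for every file:

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/file1"
printf 'c\nd\n' > "$tmp/file2"

# Print which file awk thinks it is in, followed by NR, FNR and the line itself.
result=$(awk '{
    which = (NR == FNR) ? "first" : "second"   # equal only while reading the first file
    print which, NR, FNR, $0
}' "$tmp/file1" "$tmp/file2")
echo "$result"
rm -r "$tmp"
```

One caveat worth knowing: if the first file is empty, NR==FNR is also true while the second file is read, so the test then picks the wrong file.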
04-01-2014, 07:16 PM
#8
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Original Poster
Does splitting the script across two lines change the behavior? Or do you just want me to do that for ease of readability? When I change it to the following, I get no output.
Code:
awk -F'\t' -v OFS='\t' '
NR==FNR { a[$1]=$0; next }
$1 in a { print $0,a[$19] } file1 file2
Thanks,
Kevin
04-01-2014, 09:03 PM
#9
LQ Veteran
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,137
Yep - it has the effect of putting a semi-colon in. And in this case highlighting a "truth test" you didn't know about (in your previous post).
That last post can't work as you're missing the closing single quote. I'm sorta surprised "$1 in a" is valid awk at all. And you have your data transposed - at the point of the print, $0 is the record just read from file2, and a[$19] will most probably be undefined. Try this
Code:
awk -F'\t' -v OFS='\t' '
NR==FNR { a[$1]=$0; next }
{if (a[$1]) { print a[$1],$19 }}' file1 file2
Note this only adds $19 - to get it all you'll need to loop from $19 to NF to get the rest of the line.
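A minimal sketch of that loop, assuming space-separated data where the annotation words occupy fields 19 through NF. The IDs seqA and seqB are shortened stand-ins, and the temporary files are only there to make the example self-contained:

```shell
tmp=$(mktemp -d)
# Toy stand-ins for file1 and file2, space separated (IDs shortened).
printf '%s\n' 'seqA 1270.00 1130.17 23.00 3.86 3.82' \
              'seqB 224.00 91.64 0.00 0.00 0.00' > "$tmp/file1"
printf '%s\n' 'seqA 236 0.074 12.6 0.2 1 1 8.7e-06 0.12 11.9 0.2 2 101 116 213 115 221 0.88 Talin, middle domain' > "$tmp/file2"

result=$(awk '
    NR == FNR { a[$1] = $0; next }                    # first file: remember each whole line by ID
    $1 in a {                                         # second file: only IDs that were seen in file1
        ann = $19
        for (i = 20; i <= NF; i++) ann = ann " " $i   # glue fields 19..NF back together
        print a[$1], ann
    }' "$tmp/file1" "$tmp/file2")
echo "$result"
rm -r "$tmp"
```

Like the script above, this prints only file1 lines that have a match in file2, so seqB produces no output here.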
I'm not a big fan of the NR==FNR construct unless I know the data. Especially when saving all of file1 into an array - big files chew up a lot of memory; and if it fails for memory you get nothing at all.
How big is file1? Is the field 1 data in file1 unique? Is the data in both files truly tab separated?
Last edited by syg00; 04-01-2014 at 09:05 PM .
Reason: typos
04-01-2014, 11:36 PM
#10
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Original Poster
Thanks for explaining the significance of spreading the command out over multiple lines.
That script worked perfectly! Thanks!
To answer your questions, the files are about 20,000 lines long (rough number of protein-coding genes in most animals). The data in field 1 of file 1 is unique whereas the data in field1 of file2 can be repeated (with multiple different annotations). Both files were tab separated but I switched them to use spaces as field separators and was able to easily join words in the annotation into one field.
Thanks to you both!
Best,
Kevin
04-02-2014, 12:05 AM
#11
LQ Veteran
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,137
If you're still getting multiple lines out per line in file1, have a look at this
Code:
awk -F'\t' -v OFS='\t' '
NR==FNR { a[$1]=$0; next }
{if (a[$1]) a[$1]=a[$1]"\t"substr($0, index($0,$19))} END{ for (i in a) print a[i] }' file1 file2
(presuming tabs - easy to fix).
04-02-2014, 02:15 AM
#12
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,008
Well, I am a little late to the party, but I actually have a question for the OP.
Quote:
whenever there is a match to a $1 field in file1, I want to append fields 19+ of file2
Note that there are multiple instances of some sequences in file2 but don't care if multiple identical annotations get appended to the end a line of file1.
Based on these 2 snippets, my query is, do you want the line from file1 repeated multiple times with new 19+ columns from file2 on each match (which is where I see the current answers going)
or did you want a single entry from file1 with multiple 19+ columns from file2?
As an example:
Code:
# Current
file1(a) 19+ from file2(1)
file1(a) 19+ from file2(2)
file1(b) 19+ from file2(1)
# or
file1(a) 19+ from file2(1) 19+ from file2(2)
file1(b) 19+ from file2(1)
Hope that shows what I meant.
04-02-2014, 02:25 AM
#13
LQ Veteran
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,137
My later code presumes the latter.
04-02-2014, 02:53 AM
#14
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,008
Assuming the data is actually tab separated, the string values at the end of the line are in the 19th field, so the substr could go back to being just this field.
Also worth noting: if the items in file2 are sorted beforehand by column 1, there is no need to store the appended file2 items; you can simply print until there is a change in $1.
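One way the sorted-input idea could look, as a sketch only. It assumes tab-separated files with file2 already grouped by column 1; merge_sorted is a hypothetical helper name, and its optional third argument (the annotation column, defaulting to 19) exists just so the toy data below can stay short:

```shell
# merge_sorted FILE1 FILE2 [COL]  (hypothetical helper; COL defaults to 19)
merge_sorted() {
    awk -F'\t' -v col="${3:-19}" '
        NR == FNR { a[$1] = $0; next }      # file1: keep each line, keyed by ID
        !($1 in a) { next }                 # file2: skip IDs that file1 lacks
        $1 != prev {                        # a new ID starts: close the old line
            if (prev != "") printf "\n"
            printf "%s", a[$1]
            prev = $1
        }
        { printf "\t%s", $col }             # append this annotation to the open line
        END { if (prev != "") printf "\n" }
    ' "$1" "$2"
}

tmp=$(mktemp -d)
# Toy data with the annotation in column 2 instead of 19.
printf 'seqA\t613.00\nseqB\t407.00\nseqC\t392.00\n' > "$tmp/file1"
printf 'seqA\tWH1 domain\nseqA\tCytochrome P450\nseqC\tBrix domain\n' > "$tmp/file2"
result=$(merge_sorted "$tmp/file1" "$tmp/file2" 2)
echo "$result"
rm -r "$tmp"
```

file1 is still held in memory here; what the sorted order saves is the growing per-ID strings built from file2, since each output line is written as soon as its annotations arrive.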
05-05-2014, 02:13 AM
#15
Member
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126
Original Poster
Sorry, I need to fix my notification settings or something.
grail, the answer to your question is yes but it turned out it didn't really matter for my application as the annotations that differed were all close enough to the same.