[SOLVED] Awk to find lines in a file with unbalanced numbers of parentheses

kmkocot · 01-17-2020, 07:03 AM

Hi all,

I'm a biologist working with files that represent phylogenetic trees. I have a couple of files from analyses that have been running for literally months, but at least one line in each file got corrupted when the hard drive filled up. The files have one tree per line. Each line should have equal numbers of "(" and ")" parentheses (see below for an example of a good one) and end with a semicolon, but at least one line got truncated or otherwise messed up somehow.

I used these commands but found no lines without terminal semicolons:

Code:

wirenia@wirenia:~/Desktop/pb_tests$ grep '[^;]$' Chain5.treelist
wirenia@wirenia:~/Desktop/pb_tests$ grep '[^;]$' Chain6.treelist
wirenia@wirenia:~/Desktop/pb_tests$ grep -v \;$ Chain5.treelist 
wirenia@wirenia:~/Desktop/pb_tests$ grep -v \;$ Chain6.treelist

So the problem seems to be a mismatch between the numbers of left and right parentheses, but I'm not sure how to figure out which lines are the offenders. I ran the commands below, but they only tell me whether the total number of parentheses on a line is even or odd. The total should be even, but a line could have 2 more left than right parentheses and still 'pass' this test.

Code:

wirenia@wirenia:~/Desktop/pb_tests$ awk -F '[()]' 'NF % 2 == 0' Chain5.treelist
wirenia@wirenia:~/Desktop/pb_tests$ awk -F '[()]' 'NF % 2 == 0' Chain6.treelist
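
To make the gap in the parity test concrete, here is a minimal sketch that counts the two kinds of parentheses separately. The helper name count_parens and the two toy input lines are made up for illustration; they are not taken from the tree files.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original commands):
# print "<left count> <right count>" for every input line.
count_parens() {
    awk '{
        line = $0
        lefts  = gsub(/[(]/, "", line)   # gsub returns how many "(" it removed
        rights = gsub(/[)]/, "", line)   # same for ")"
        print lefts, rights
    }' "$@"
}

# Two toy lines, both unbalanced. The first has 3 parentheses in total (odd),
# the second has 4 (even), so only the first is caught by the NF % 2 test.
printf '%s\n' '((a,b);' '(((a,b);' | count_parens
```

The second toy line prints "3 1": an even total, yet two more left than right parentheses, which is exactly the case the parity test lets through.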

Any suggestions would be greatly appreciated!

Thank you,
Kevin

Code:

Sample tree:
(((((Praticolella_mexicana:0.00480602,Polygyra_cereolus:0.00371952):1.47793,((Camaena_cicatricosa:0.115961,Camaena_poyuensis:0.103371):0.953901,((Aegista_aubryana:0.136351,Aegista_diversifamilia:0.192602):0.843191,Mastigeulota_kiangsinensis:0.522806):0.303342):0.100898):0.121644,((Cernuella_virgata:0.260359,Helicella_itala:0.140086):0.617919,(Cylindrus_obtusus:0.686453,(Helix_aspersa:0.516477,Cepaea_nemoralis:1.89785):0.308462):0.414249):0.137774):1.01136,(Arion_rufus:2.29168,(((Achatina_fulica:1.69866,(Rhopalocaulis_grandidieri:2.29815,((Myosotella_myosotis:1.82384,((Physella_acuta:2.23482,((Radix_balthica:0.142582,Galba_pervia:0.208535):0.50107,(Biomphalaria_glabrata:0.404376,(Planorbarius_corneus:0.00209395,Planorbella_duryi:0.00295444):0.472689):0.687874):0.168854):0.293892,(((((Onchidella_celtica:0.202836,Onchidella_borealis:0.889813):0.185783,(Peronia_peronii:0.159037,Platevindex_mortoni:0.198364):0.101483):0.170859,(((Carychium_tridentatum:0.501811,Ovatella_vulcani:0.317034):0.0582359,(Ellobium_chinense:0.0951185,Auriculinella_bidentata:0.349713):0.0828497):0.0754304,Trimusculus_reticulatus:0.297743):0.0656323):0.0868645,((Pyramidella_dolabrata:0.980992,Salinator_rhamphidia:0.349017):0.0732706,Acochlidium_fijensis:0.465269):0.0420865):0.0276939,(((((Ringicula_conformis:1.11622,((Valvata_sp:0.685025,Microdiscula_charopa:2.95017):0.224082,((((Marisa_cornuarietis:0.179614,Pomacea_canaliculata:0.30989):0.293205,((Turritella_bacillum:0.202813,Tylomelania_sarasinorum:0.189615):0.162401,(Cymatium_parthenopeum:0.155288,(Ilyanassa_obsoleta:0.134292,((Menathais_tuberosa:0.0574794,(Concholepas_concholepas:0.114143,Thais_clavigera:0.110627):0.048665):0.0853467,Conus_striatus:0.240452):0.0408924):0.0419603):0.172519):0.0449037):0.0472011,(Bellamya_quadrata:0.200306,Cipangopaludina_cathayensis:0.178853):0.57194):0.154863,((Titiscania_limacina:0.71693,(Georissa_bangueyensis:0.950074,Clithon_retropictus:0.206275):0.216874):0.105551,(Angaria_delphinus:0.350123,Phasianella_solida:0.561236):0.211091):0.202515):1.24934):2.44132):0.107756,(((Berthellina_sp:0.231649,Pleurobranchaea_novaezealandiae:0.245021):0.328029,(((Notodoris_gardineri:0.373909,Homoiodoris_japonica:0.336948):0.0108453,((Hypselodoris_festiva:0.212083,(Chromodoris_magnifica:0.0107541,Chromodoris_quadricolor:0.0190366):0.180875):0.0775515,((((Tritonia_diomedea:0.191568,Sakuraeolis_japonica:0.34007):0.0426962,Melibe_leonina:0.301892):0.378198,Roboastra_europaea:0.192852):0.125796,Nembrotha_kubaryana:0.153817):0.0875222):0.0345442):0.0680711,Phyllidia_ocellata:0.279088):0.157023):0.267243,((Micromelo_undatus:0.0508658,Hydatina_physis:0.0510341):0.21094,Pupa_strigosa:0.714736):0.307753):0.0715796):0.0425515,(((Illbia_ilbi:1.32697,Runcina_ornata:1.47963):0.262683,((Smaragdinella_calyculata:0.167591,Bulla_sp:0.608702):0.123347,(Odontoglaja_guamensis:0.832243,Sagaminopteron_nigropunctatus:0.340381):0.12399):0.268611):0.0540452,(((Aplysia_californica:0.0374974,Aplysia_dactylomela:0.0167428):0.0267237,(Aplysia_kurodai:0.0140971,Aplysia_vaccaria:0.0103373):0.0444639):0.171587,Tylodina_sp:0.415181):0.0632413):0.123332):0.063078,(Siphonaria_pectinata:0.288448,Siphonaria_gigas:0.509812):0.0896595):0.0615275,(Ascobulla_fragilis:0.395827,(Placida_sp:0.136588,((Elysia_ornata:0.135254,Elysia_chlorotica:0.120108):0.0609832,(Plakobranchus_cf_ocellatus:0.121242,Thuridilla_gracilis:0.179529):0.0317251):0.0298276):0.38227):0.144995):0.262714):0.109505):0.0445404):0.0841337,Pedipes_pedipes:1.1272):0.160818):0.377031):0.201989,((Pupilla_muscorum:0.842588,(Vertigo_pusilla:0.762949,Gastrocopta_cristata:0.47463):0.156762):0.223448,(Achatinella_sowerbyana:0.08089,Achatinella_mustelina:0.0780632):2.00741):0.498303):0.13143,(Succinea_putris:1.75508,Naesiotus_nux:1.52222):0.267404):0.256461):0.252589):1.69963,Cerion_uva:0.401049,Cerion_incanum:0.770272);

TB0ne · 01-17-2020, 07:58 AM

Since you've tried the awk route, here's one I found some time back that does what you're after.

Code:

awk -F"\t" '{for (i=1;i<=NF;i++) if (split($i,a,"(") != split($i,b,")")) {print NR": "$0; next}}' FILENAME > Unmatched-Parens-Output-File

This will leave you with a file containing the lines that don't have matching parens. Won't FIX them, though, and given the complexity of your lines, you'll still have a bear of a time manually going through things.
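
To cut down on the manual searching, something along these lines might help. It is only a rough sketch, not the one-liner above: it walks each line character by character, tracks the nesting depth, and reports either the column where a ")" has no partner or how many "(" are still open at the end of the line. The function name paren_depth_report is made up, and the character-by-character loop will be slow on very large files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report where the parenthesis nesting goes wrong.
paren_depth_report() {
    awk '{
        depth = 0; bad = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "(") {
                depth++
            } else if (c == ")") {
                depth--
                if (depth < 0) {
                    # a ")" appeared with no "(" left to match it
                    printf "%d: extra \")\" at column %d\n", NR, i
                    bad = 1
                    break
                }
            }
        }
        # anything still open at the end of the line was never closed
        if (!bad && depth > 0) printf "%d: %d unclosed \"(\"\n", NR, depth
    }' "$@"
}

# Toy input: line 1 is fine, line 2 has a stray ")", line 3 lacks a ")".
printf '%s\n' '(a,(b,c));' '(a,b));' '((a,b);' | paren_depth_report
```

On the toy input this reports a stray ")" at column 6 of line 2 and one unclosed "(" on line 3. With a real file it would be called as paren_depth_report FILENAME. A truncated tree should show up as unclosed "(", since the closing parentheses pile up at the end of the line.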

If it were me doing this, I'd open it in any IDE (like kdevelop) that does context-sensitive highlighting. If I break apart what you posted in kdevelop and use ANSYS highlighting, I get this:

Code:

(((((Praticolella_mexicana:0.00480602,Polygyra_cereolus:0.00371952):1.47793,((Camaena_cicatricosa:0.115961,Camaena_poyuensis:0.103371):0.953901,
((Aegista_aubryana:0.136351,Aegista_diversifamilia:0.192602):0.843191,Mastigeulota_kiangsinensis:0.522806):0.303342):0.100898):0.121644,((Cernuella_virgata:0.260359,Helicella_itala:0.140086):0.617919,(Cylindrus_obtusus:0.686453,(Helix_aspersa:0.516477,
Cepaea_nemoralis:1.89785):0.308462):0.414249):0.137774):1.01136,(Arion_rufus:2.29168,(((Achatina_fulica:1.69866,(Rhopalocaulis_grandidieri:2.29815,((Myosotella_myosotis:1.82384,((Physella_acuta:2.23482,((Radix_balthica:0.142582,Galba_pervia:0.208535):0.50107,
(Biomphalaria_glabrata:0.404376,(Planorbarius_corneus:0.00209395,Planorbella_duryi:0.00295444):0.472689):0.687874):0.168854):0.293892,(((((Onchidella_celtica:0.202836,Onchidella_borealis:0.889813):0.185783,(Peronia_peronii:0.159037,Platevindex_mortoni:0.198364):0.101483):0.170859,
(((Carychium_tridentatum:0.501811,Ovatella_vulcani:0.317034):0.0582359,(Ellobium_chinense:0.0951185,Auriculinella_bidentata:0.349713):0.0828497):0.0754304,Trimusculus_reticulatus:0.297743):0.0656323):0.0868645,
((Pyramidella_dolabrata:0.980992,Salinator_rhamphidia:0.349017):0.0732706,Acochlidium_fijensis:0.465269):0.0420865):0.0276939,(((((Ringicula_conformis:1.11622,((Valvata_sp:0.685025,Microdiscula_charopa:2.95017):0.224082,
((((Marisa_cornuarietis:0.179614,Pomacea_canaliculata:0.30989):0.293205,((Turritella_bacillum:0.202813,Tylomelania_sarasinorum:0.189615):0.162401,(Cymatium_parthenopeum:0.155288,(Ilyanassa_obsoleta:0.134292,
((Menathais_tuberosa:0.0574794,(Concholepas_concholepas:0.114143,Thais_clavigera:0.110627):0.048665):0.0853467,Conus_striatus:0.240452):0.0408924):0.0419603):0.172519):0.0449037):0.0472011,(Bellamya_quadrata:0.200306,Cipangopaludina_cathayensis:0.178853):0.57194):0.154863,
((Titiscania_limacina:0.71693,(Georissa_bangueyensis:0.950074,Clithon_retropictus:0.206275):0.216874):0.105551,(Angaria_delphinus:0.350123,Phasianella_solida:0.561236):0.211091):0.202515):1.24934):2.44132):0.107756,(((Berthellina_sp:0.231649,Pleurobranchaea_novaezealandiae:0.245021):0.328029,
(((Notodoris_gardineri:0.373909,Homoiodoris_japonica:0.336948):0.0108453,((Hypselodoris_festiva:0.212083,(Chromodoris_magnifica:0.0107541,Chromodoris_quadricolor:0.0190366):0.180875):0.0775515,((((Tritonia_diomedea:0.191568,Sakuraeolis_japonica:0.34007):0.0426962,
Melibe_leonina:0.301892):0.378198,Roboastra_europaea:0.192852):0.125796,Nembrotha_kubaryana:0.153817):0.0875222):0.0345442):0.0680711,Phyllidia_ocellata:0.279088):0.157023):0.267243,((Micromelo_undatus:0.0508658,
Hydatina_physis:0.0510341):0.21094,Pupa_strigosa:0.714736):0.307753):0.0715796):0.0425515,(((Illbia_ilbi:1.32697,Runcina_ornata:1.47963):0.262683,((Smaragdinella_calyculata:0.167591,Bulla_sp:0.608702):0.123347,(Odontoglaja_guamensis:0.832243,
Sagaminopteron_nigropunctatus:0.340381):0.12399):0.268611):0.0540452,(((Aplysia_californica:0.0374974,Aplysia_dactylomela:0.0167428):0.0267237,(Aplysia_kurodai:0.0140971,Aplysia_vaccaria:0.0103373):0.0444639):0.171587,Tylodina_sp:0.415181):0.0632413):0.123332):0.063078,
(Siphonaria_pectinata:0.288448,Siphonaria_gigas:0.509812):0.0896595):0.0615275,(Ascobulla_fragilis:0.395827,(Placida_sp:0.136588,((Elysia_ornata:0.135254,Elysia_chlorotica:0.120108):0.0609832,(Plakobranchus_cf_ocellatus:0.121242,
Thuridilla_gracilis:0.179529):0.0317251):0.0298276):0.38227):0.144995):0.262714):0.109505):0.0445404):0.0841337,Pedipes_pedipes:1.1272):0.160818):0.377031):0.201989,((Pupilla_muscorum:0.842588,(Vertigo_pusilla:0.762949,
Gastrocopta_cristata:0.47463):0.156762):0.223448,(Achatinella_sowerbyana:0.08089,Achatinella_mustelina:0.0780632):2.00741):0.498303):0.13143,(Succinea_putris:1.75508,Naesiotus_nux:1.52222):0.267404):0.256461):0.252589):1.69963,Cerion_uva:0.401049,Cerion_incanum:0.770272);

...which highlights things in blue (doesn't show up well on here, had to manually tag), but I'm not sure if those are correct or incorrect. There are numerous scientific highlighting settings in kdevelop that you may want to look at. It's easy to page up/down to see the color differences and adjust.

danielbmartin · 01-17-2020, 08:34 AM

With this InFile ...

Code:

This ( line has balanced ) parentheses;
This ( line (has) balanced ) parentheses;
This one (has) too (many  (open ) parentheses;
This) one has) too (many  (close ) parentheses;
This one looks good but has no trailing semicolon!
Too (many (open (parens and no trailing semicolon

... this awk ...

Code:

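awk -F ""  \
  '{n1=split($0,a,"(")   # number of pieces = number of "(" plus 1
    n2=split($0,a,")")   # number of pieces = number of ")" plus 1
    if (n1>n2) print "Line",NR,"has too many left parens."
    if (n2>n1) print "Line",NR,"has too many right parens."
    # with -F "" (gawk) every character is a field, so $NF is the last character
    if ($NF!=";") print "Line",NR,"lacks a trailing semicolon."}'  \
"$InFile" >"$OutFile"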

... produced this OutFile ...

Code:

Line 3 has too many left parens.
Line 4 has too many right parens.
Line 5 lacks a trailing semicolon.
Line 6 has too many left parens.
Line 6 lacks a trailing semicolon.

Daniel B. Martin

individual · 01-17-2020, 08:37 AM

I'm not sure it can be done, as there is no clear delimiter of groups, i.e. how are we supposed to know when a closing parenthesis is supposed to appear? Is the output a custom format, or an industry standard?

kmkocot · 01-17-2020, 12:12 PM

My hero. It was only one line (out of about 60,000) in each file that was corrupted.

Thanks!
Kevin

crts · 01-17-2020, 03:40 PM

Quote:

Originally Posted by TB0ne

Since you've tried the awk route, here's one I found some time back that does what you're after.

Code:

awk -F"\t" '{for (i=1;i<=NF;i++) if (split($i,a,"(") != split($i,b,")")) {print NR": "$0; next}}' FILENAME > Unmatched-Parens-Output-File

This awk tests whether the parentheses are balanced within every field, assuming the fields are tab-separated. As I understand the OP, they should be balanced over the entire line. The command still works if the line contains no tabs, because then there is only one field. But then again, we are talking about corrupted data here, so who knows how it might have been corrupted.
So I think the for-loop may not be well suited for this scenario, and the split should just use $0 without the loop.
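
A minimal sketch of what I mean, comparing the two split counts over the whole line. The wrapper name unbalanced_lines is made up for illustration, and I wrote the separators as bracket expressions so that they read the same in any awk:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print "<line number>: <line>" for every line whose
# "(" count differs from its ")" count, checked over the entire line ($0).
unbalanced_lines() {
    awk 'split($0, a, /[(]/) != split($0, b, /[)]/) { print NR ": " $0 }' "$@"
}

# Toy input: line 1 is balanced, line 2 is not, and line 3 is balanced over
# the whole line even though a tab separates the "(" from the ")".
printf '%s\n' '(a,b);' '((a,b);' | unbalanced_lines
printf '(a\tb);\n' | unbalanced_lines
```

Only the second toy line is printed. The tab-containing line passes here, whereas a per-field check would flag it, because neither tab-separated half is balanced on its own.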

Anyway, although this has been solved, here is a Bash alternative. It will probably run slower than the presented awk solutions in post #2 and post #3.

Code:

#!/usr/bin/env bash

declare -r filename="$1"
declare -i delta
declare tmp
declare side
declare -i num=1
declare err

if [[ ! -f "$filename" ]];then
        echo "File $filename not found." >&2
        exit 1
fi

while IFS= read -r line;do
        err=
        tmp="${line//\(/}"
        delta=${#tmp}
        tmp="${line//\)/}"
        (( delta -= ${#tmp} ))

        if (( delta != 0 ));then
                if (( delta < 0 ));then
                        side=left
                        (( delta = -delta ))
                else
                        side=right
                fi
                err="$delta too many $side parentheses."
        fi

        [[ "${line%;}" == "$line" ]] && err="${err:+$err }Missing semicolon."
        [[ -n "$err" ]] && echo "Line $num: $err"

        (( num++ ))
done < "$filename"

Tested with the sample data provided by danielbmartin.