[SOLVED] Awk to find lines in a file with unbalanced numbers of parentheses
Hi all,
I'm a biologist working with files that represent phylogenetic trees. I have a couple of files from analyses that have been running for literally months, but at least one line in each file got corrupted when the hard drive filled up. The files have one tree per line. Each line should have the same number of "(" and ")" parentheses, and hence an even total (see below for an example of a good one), and end with a semicolon, but at least one line got truncated or otherwise messed up somehow.
I used these commands but found no lines without terminal semicolons:
So the problem seems to be that the numbers of left and right parentheses don't match, but I'm not sure how to figure out which lines are the offending ones. I ran these commands, but they only tell me whether the overall number of parentheses is even or odd. The total should be even, but a line could have 2 more left than right parentheses and it would still 'pass' this test.
This will leave you with a file containing the lines that don't have matching parens. Won't FIX them, though, and given the complexity of your lines, you'll still have a bear of a time manually going through things.
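The command this reply refers to isn't reproduced in the thread. As a rough sketch only (not necessarily what was originally posted), something along these lines would write out the lines whose "(" and ")" counts differ. The function name unbalanced_lines and the file names in the usage comment are made up for illustration:

```shell
# unbalanced_lines FILE: print every line of FILE whose "(" and ")" counts differ.
# Hypothetical helper; it compares counts only and does not check nesting order.
unbalanced_lines() {
  awk '{
    s = $0
    opens  = gsub(/\(/, "", s)   # gsub returns the number of replacements made
    closes = gsub(/\)/, "", s)
    if (opens != closes) print $0
  }' "$1"
}

# Example (file names are placeholders):
#   unbalanced_lines trees.txt > bad_trees.txt
```

Printing NR alongside $0 would also give the line number in the original file, which makes the bad tree easier to find again.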
If it were me doing this, I'd open it with any IDE (like kdevelop) that does context-sensitive highlighting. If I break apart what you posted in kdevelop and use ANSYS highlighting, I get this:
...which highlights things in blue (doesn't show up well on here, had to manually tag), but I'm not sure if those are correct or incorrect. There are numerous scientific highlighting settings in kdevelop that you may want to look at. It's easy to page up/down to see the color differences and adjust.
With this InFile ...
Code:
This ( line has balanced ) parentheses;
This ( line (has) balanced ) parentheses;
This one (has) too (many (open ) parentheses;
This) one has) too (many (close ) parentheses;
This one looks good but has no trailing semicolon!
Too (many (open (parens and no trailing semicolon
... this awk ...
Code:
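# Needs gawk: -F "" makes every character its own field.
awk -F "" \
'{n1=split($0,a,"(")   # n1 = number of "(" plus 1
  n2=split($0,a,")")   # n2 = number of ")" plus 1
  if (n1>n2) print "Line",NR,"has too many left parens."
  if (n2>n1) print "Line",NR,"has too many right parens."
  # With -F "", $NF is the last character of the line.
  if ($NF!=";") print "Line",NR,"lacks a trailing semicolon."}' \
"$InFile" >"$OutFile"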
... produced this OutFile ...
Code:
Line 3 has too many left parens.
Line 4 has too many right parens.
Line 5 lacks a trailing semicolon.
Line 6 has too many left parens.
Line 6 lacks a trailing semicolon.
I'm not sure it can be done, as there is no clear delimiter of groups, i.e. how are we supposed to know when a closing parenthesis is supposed to appear? Is the output a custom format, or an industry standard?
This awk will test whether the parentheses are balanced in every field if the fields are tab separated. As I understand the OP, they should be balanced over the entire line. This works if the fields are not tab separated, because then there is only one field. But then again, we are talking about corrupted data here, so who knows how it might have been corrupted.
So I think the for-loop may not be well suited for this scenario and the split should just use $0 without the loop.
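On the earlier question of how to know where a closing parenthesis is allowed: comparing counts over the whole line can't tell, since a line like "a) (b;" has one of each and still isn't balanced. One possible refinement, offered as a sketch and not as something anyone in the thread posted, is to walk the line character by character and track the nesting depth. The function name check_nesting and the message wording are my own:

```shell
# check_nesting FILE: report lines where a ")" appears with no "(" open,
# or where the line ends with some "(" still unclosed.
# Hypothetical helper; assumes parentheses are never quoted or escaped.
check_nesting() {
  awk '{
    depth = 0; bad = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "(") depth++
      else if (c == ")") {
        depth--
        if (depth < 0) { bad = 1; break }   # closed more than was opened
      }
    }
    if (bad)            print "Line " NR ": unmatched ) at column " i
    else if (depth > 0) print "Line " NR ": " depth " unclosed ("
  }' "$1"
}
```

This treats $0 as a whole, so it doesn't matter whether the line contains tabs.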
Anyway, although this has been solved, here is a Bash alternative. It will probably run slower than the presented awk solutions in post #2 and post #3.
Code:
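#!/usr/bin/env bash
# Usage: pass the file to check as the first argument.
# Reports lines with unequal numbers of "(" and ")" or no trailing semicolon.
declare -r filename="$1"
declare -i delta
declare tmp
declare side
declare -i num=1
declare err
if [[ ! -f "$filename" ]]; then
    echo "File $filename not found." >&2
    exit 1
fi
# IFS= and -r keep whitespace and backslashes intact; the || test also
# handles a final line with no trailing newline (e.g. a truncated one).
while IFS= read -r line || [[ -n "$line" ]]; do
    err=
    tmp="${line//\(/}"          # line with every "(" removed
    delta=${#tmp}
    tmp="${line//\)/}"          # line with every ")" removed
    (( delta -= ${#tmp} ))      # delta = count of ")" minus count of "("
    if (( delta != 0 )); then
        if (( delta < 0 )); then
            side=left
            (( delta = -delta ))
        else
            side=right
        fi
        err="$delta too many $side parentheses."
    fi
    [[ "${line%;}" == "$line" ]] && err="${err:+$err }Missing semicolon."
    [[ -n "$err" ]] && echo "Line $num: $err"
    (( num++ ))
done < "$filename"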
Tested with the sample data provided by danielbmartin.