LinuxQuestions.org - [SOLVED] Complex regex (Sed/perl)


Complex regex (Sed/perl)

Hi,

I'm having a hard time with the following:

There is a Word file with questions and answers. I need to import it into Moodle (an online question site) in a particular format. Everything is black except for the right answers, which are green. The starting format is the following:

Quote:

1. Question example
a. Wrong
b. Wrong
C. Wrong
D. Right

The output should become

Quote:

:Question example
:Question example
{
~ Wrong
~ Wrong
~ Wrong
= Right
}

I open the file in Word and replace all red paragraph marks with * (I can't do a replace with groups there). After that I export the .docx file to text.
Then I open it on my Linux computer and run the following regexes on it.

Quote:

sed -i -e 's/^\r/\n/g' tmp #OS X white line replacement
sed -i -e 's/\r//g' tmp #remove remaining carriage returns
sed -i -e 's:^[a-z]\.:~:' tmp #replace leading answer letters with tilde
sed -i -e 's/\(^[0-9]*\.\ \)\(.*\)/}\n::\2\n::\2\n{/' tmp #regenerate title
sed -i -n '${p;q};N;/\n\*/{s/"\?\n//p;b};P;D' tmp #next line starts with *: append it to the current line
sed -i -e 's:^~\(.*\)\(\*.*\)$:=\1:' tmp #move * from the back to the front as =
sed -i -e 's:^\*:=:' tmp #replace any remaining * with =
sed '/^$/d' tmp #delete any remaining white lines

This isn't great, but it works well enough. The questions are hand-made and contain a lot of errors, so I still have to walk through the result by hand. The hard part is when there are multiple correct answers. The output should then look like the following:

Quote:

:Question example
:Question example
{
~%-100% Wrong
~%-100% Wrong
~%50% Right
~%50% Right
}

Ideally I would have a sed or perl regex that counts the number of = signs between { and } and replaces them with ~%50%, and replaces all the ~ signs with ~%-100%. The same code should also handle three right answers, where every right answer becomes ~%33%.

Is this doable? I have over 1000 questions and it would sure help to automate this. Multiline replacement with sed is already tricky with two lines, so I guess four or more lines will need perl? I have no experience with Perl.

Could someone help me out with this one? Please excuse my bad English; I'm a non-native speaker.

How about using awk for this? It uses regular expressions like sed, but is more of a programming language. Well suited for something like this, in my opinion.

Since the OP's true input files do not match the example, I've edited the script as follows:

The FS needs a "\\." because the value is in double quotes. Awk will interpret the \\ as a single backslash, so that the value will end up containing just \. which is what we want. (. means any character, but \. means a full stop, in regular expressions.)
Modified the logic to ignore empty files.
Everything that is not an answer is a question.
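To make the FS escaping described above concrete, here is a minimal sketch. The function name split_on_dot is made up for this example, and the whitespace class is shortened to tab and space; it is not part of the script below.

```shell
# Split each line on a literal full stop followed by whitespace.
# Inside the double-quoted awk string, "\\." becomes the regex \. (a literal dot).
split_on_dot() {
    awk 'BEGIN { FS = "\\.[\t ]+" } { print $2 }'
}

printf '1. Question example\n' | split_on_dot   # prints: Question example
```

With an unescaped dot the separator would be "any character followed by whitespace", which would also split inside ordinary words.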

Code:

awk '#
    BEGIN {
        # Set record (line) separator to any newline,
        # including any leading or trailing whitespace
        RS = "[\t\v\f ]*[\r\n][\t\n\v\f\r ]*"

        # Set field separator to . followed by whitespace
        FS = "\\.[\t\v\f ]+"

        # Question and answer counts and list.
        question = ""
        answers = split("", answer) # 1 = right, -1 = wrong
    }

    function section() {

        # Output nothing if no question and/or no answers.
        if (length(question) < 1 || answers < 1) {
            question = ""
            answers = split("", answer)
            return;
        }

        # Calculate number of right and wrong answers.
        rights = 0
        wrongs = 0
        for (i = 1; i <= answers; i++)
            if (answer[i] > 0)
                rights++
            else
            if (answer[i] < 0)
                wrongs++

        # Start the question.
        printf(":%s\n:%s\n{\n", question, question)

        if (rights == 1) {
            # Only one right answer.
            for (i = 1; i <= answers; i++)
                if (answer[i] > 0)
                    printf("=Right\n")
                else
                if (answer[i] < 0)
                    printf("~Wrong\n")
        } else {
            rightanswer = int(100 / rights)
            wronganswer = -100
            for (i = 1; i <= answers; i++)
                if (answer[i] > 0)
                    printf("~%%%.0f%% Right\n", rightanswer)
                else
                if (answer[i] < 0)
                    printf("~%%%.0f%% Wrong\n", wronganswer)
        }

        # Close the answer section.
        printf("}\n")

        # Clear this question.
        question = ""
        answers = split("", answer)
    }

    # Ignore empty lines.
    /^[\t\v\f ]*$/ {
        next
    }

    # This rule will match any wrong answers.
    ($2 ~ /[Ww][Rr][Oo][Nn][Gg]/) {
        answer[++answers] = -1
        next
    }

    # This rule will match any right answers.
    ($2 ~ /[Rr][Ii][Gg][Hh][Tt]/) {
        answer[++answers] = +1
        next
    }

    # Everything else is a question.
    {

        # Print previous question
        section()

        # Take the entire record as the question.
        question = $0
        # Remove [number] [full stop] from the start.
        sub(/^[0-9]*\.[\t\v\f ]*/, "", question)
        # Remove whitespace and colons from the start.
        sub(/^[\t\v\f :]+/, "", question)
    }

    END {
        # Last question.
        section()
    }

' input-file(s)... > output-file

Using the redirection above, you'll see if it encounters any lines it does not understand, as it prints them to standard error. If so, just edit the input files, and rerun, until no more errors.

Note that the above does not balance correct answers to 100%, but uses the same truncated value (i.e. the sum will never exceed 100%) for all correct answers. This could be fixed quite easily, though.
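As a rough sketch of the balancing mentioned above: give every correct answer the truncated share, then let the first one absorb the remainder so the total is exactly 100. The function name shares is hypothetical, and handing the remainder to the first answer is just one possible choice.

```shell
# Print the percentage for each of N correct answers, summing to exactly 100.
shares() {
    awk -v n="$1" 'BEGIN {
        if (n < 1) exit 1                  # nothing to distribute
        share = int(100 / n)               # truncated share, e.g. 33 for n = 3
        first = 100 - share * (n - 1)      # first answer absorbs the remainder
        printf("%d", first)
        for (i = 2; i <= n; i++)
            printf(" %d", share)
        printf("\n")
    }'
}

shares 3   # prints: 34 33 33
```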

Here is how the code works:

The BEGIN rule is executed once when the script starts, before any of the data files are read. Awk parses data in records (here, lines, using any newline convention), which are further split into fields (here, separated by . followed by whitespace). I keep the question string in variable question, the number of answers in the answers variable, and the answers (+1 for right and -1 for wrong) in the answer array. (answers = split("", answer) clears the answer array and sets answers to zero.)
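The split("", answer) idiom can be checked in isolation. This is only a small demonstration; clear_demo and the sample string are invented for it.

```shell
# Show that split("", arr) empties the array and returns zero.
clear_demo() {
    awk 'BEGIN {
        answers = split("a b c", answer, " ")   # three elements
        before = answers
        answers = split("", answer)             # clears the array, returns 0
        left = 0
        for (k in answer)
            left++                              # count whatever is still stored
        printf("%d %d %d\n", before, answers, left)
    }'
}

clear_demo   # prints: 3 0 0
```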

The section function uses the global variables to output the complete question saved in the above variables, then it clears them, preparing for a new question. Note that all variables are global in awk scripts.

The END rule is executed once after all data files have been read and processed. It just outputs the last question, which is probably still stored in the variables.

The rest of the rules in the middle are executed once for each input record. The next at the end of the rule means that awk will not execute the other rules for that record, but skip straight to the next record. The $1 is a reference to the first field in the record, $2 to the second, and so on. (You can even use $i where i is a variable containing an integer value.)

The right and wrong rules check if the second field on the row contains a regular expression pattern ("Right" or "Wrong" in upper- or lowercase letters). If so, they add either +1 (right) or -1 (wrong) to the answer array, using the next value of answers as the index. Thus, the first answer will always be at index 1, and answers contains the number of answers in it.
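The interplay of rules, fields and next described above can be sketched in a stripped-down form. The function name classify and the printed labels are made up for illustration, and the patterns are simplified.

```shell
# Classify each non-empty line by looking at its second field.
classify() {
    awk 'BEGIN { FS = "\\.[\t ]+" }
        /^[\t ]*$/      { next }                  # skip empty lines
        $2 ~ /[Ww]rong/ { print "wrong"; next }   # next: skip the remaining rules
        $2 ~ /[Rr]ight/ { print "right"; next }
                        { print "question: " $2 } # reached only if nothing above matched
    '
}

printf '1. Example\n\na. Wrong\nb. Right\n' | classify
```

Without the next statements, an answer line would also fall through to the last rule and be treated as a question.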

I think the scriptlet is quite straightforward, so I don't know what else to describe about it. If you could try it on your input, and if you find any deficiencies, I could try to fix it.

Awk is very useful for these kinds of tasks (data conversion, tabulation) and, in my opinion, even more so for simple numeric processing or statistics gathering. GNU awk, gawk, has some additional features (like sorting data, retaining the record separators in a separate variable RT for each record, and so on) which are often very useful, but mawk is usually a bit faster if you don't need the GNU features.

If you need clarification on any point of the script, I'd be happy to try.

Nominal Animal, thank you for your time, the script, and the elaborate explanation. Unfortunately my AWK knowledge is zero; I'm digging into it right now.

I already tried the following:

I pasted your code into emacs and made it executable.
I changed the last line to:
' tmp > tmp2

tmp is a file containing the questions.

When I run it I get the following error:
./multiline.sh
awk: cmd. line:8: warning: escape sequence `\.' treated as plain `.'
Ignored: ::Benoem drie technieken die bij social engineering worden gebruikt. [8.1.3]
Ignored: ::Benoem drie technieken die bij social engineering worden gebruikt. [8.1.3]
Ignored: {

The ::[text] is the normal question and the { is the start of the answers. I'm reading up on awk so I can understand your code better.
Once again thank you for your time.

My first guess on the errors would be that you are running the solution on a Mac, which has its own bastardisation of awk that has issues with many standard features (that has been my experience when using a relative's Mac).

My question to NA is whether or not the code will work on all forms of awk?

Also, maybe give a small example of the exact input, as it may have to do with the language, i.e. does Dutch have some unusual characters outside the ASCII group?

I'm running the AWK on Linux (Ubuntu).

An example of the input file before regex, I only used the first 5 questions:

Quote:

1. Werkstations kunnen voor de tweede maal in een week niet op de server in de server. De technicus die de eerste keer het probleem opgelost heeft kan zich niet meer herinneren welke stappen hij ondernomen heeft voor de oplossing. Welk aspect van het troubleshooting proces heeft de technicus nagelaten? [9.1.1]
a. Het probleem identificeren
b. Stellen van vragen aan de eindgebruikers
c. Het troubleshooting proces documenteren
d. Gestructureerde technieken voor de oplossing te gebruiken

2. Wat moet de netwerkadministrator doen als hij een call van een gebruiker ontvangt die geen toegang krijgt met de bedrijfswebserver? [9.1.2]
a. De webserver opnieuw starten
b. De NIC van de computer vervangen
c. De gebruiker vragen uit te loggen en weer in te loggen
d. De gebruiker vragen welke URL hij gebruikt heeft en wat de foutmelding is

3. Een klant belt met de kabelmaatschappij om te melden dat de Internetverbinding niet stabiel is. Na het uitproberen van verschillende configuratie wijzigingen, besluit de technicus de klant een nieuwe kabelmodem te sturen. Welke troubleshootingtechniek wordt hier gebruikt? [9.1.3]
a. Top-down
b. Bottom-up
c. Substitution
d. Trail-and-error
e. Divide-and-conquer

4. Slechts één werkstation in een bepaald netwerk kan het netwerk niet bereiken. Wat is de eerste troubleshooting stap als de divide-and-conquer methode gebruikt wordt? [9.1.3]
a. Controleer de NIC en daarna de bekabeling
b. Controleer de TCP/IP configuratie van het werkstation
c. Test alle kabels en test dan laag na laag omhoog in het OSI model
d. Probeer te Telnetten en test dan laag na laag omlaag in het OSI model

5. Een gebruiker kan geen email verzenden. De netwerktechnicus gebruikt, om dit probleem te troubleshooten, de webbrowser en probeert toegang tot een paar populaire websites te krijgen. Welke troubleshootingtechniek wordt gebruikt? [9.1.3]
a. Bottom-up
b. Divide-and-conquer
c. Top-down
d. Trial-and-error

The same file after regex:

Quote:

::Werkstations kunnen voor de tweede maal in een week niet op de server in de server. De technicus die de eerste keer het probleem opgelost heeft kan zich niet meer herinneren welke stappen hij ondernomen heeft voor de oplossing. Welk aspect van het troubleshooting proces heeft de technicus nagelaten? [9.1.1]
::Werkstations kunnen voor de tweede maal in een week niet op de server in de server. De technicus die de eerste keer het probleem opgelost heeft kan zich niet meer herinneren welke stappen hij ondernomen heeft voor de oplossing. Welk aspect van het troubleshooting proces heeft de technicus nagelaten? [9.1.1]
{
~ Het probleem identificeren
~ Stellen van vragen aan de eindgebruikers
= Het troubleshooting proces documenteren
~ Gestructureerde technieken voor de oplossing te gebruiken
}
::Wat moet de netwerkadministrator doen als hij een call van een gebruiker ontvangt die geen toegang krijgt met de bedrijfswebserver? [9.1.2]
::Wat moet de netwerkadministrator doen als hij een call van een gebruiker ontvangt die geen toegang krijgt met de bedrijfswebserver? [9.1.2]
{
~ De webserver opnieuw starten
~ De NIC van de computer vervangen
~ De gebruiker vragen uit te loggen en weer in te loggen
= De gebruiker vragen welke URL hij gebruikt heeft en wat de foutmelding is
}
::Een klant belt met de kabelmaatschappij om te melden dat de Internetverbinding niet stabiel is. Na het uitproberen van verschillende configuratie wijzigingen, besluit de technicus de klant een nieuwe kabelmodem te sturen. Welke troubleshootingtechniek wordt hier gebruikt? [9.1.3]
::Een klant belt met de kabelmaatschappij om te melden dat de Internetverbinding niet stabiel is. Na het uitproberen van verschillende configuratie wijzigingen, besluit de technicus de klant een nieuwe kabelmodem te sturen. Welke troubleshootingtechniek wordt hier gebruikt? [9.1.3]
{
~ Top-down
~ Bottom-up
= Substitution
~ Trail-and-error
~ Divide-and-conquer
}
::Slechts ??n werkstation in een bepaald netwerk kan het netwerk niet bereiken. Wat is de eerste troubleshooting stap als de divide-and-conquer methode gebruikt wordt? [9.1.3]
::Slechts ??n werkstation in een bepaald netwerk kan het netwerk niet bereiken. Wat is de eerste troubleshooting stap als de divide-and-conquer methode gebruikt wordt? [9.1.3]
{
~ Controleer de NIC en daarna de bekabeling
= Controleer de TCP/IP configuratie van het werkstation
~ Test alle kabels en test dan laag na laag omhoog in het OSI model
~ Probeer te Telnetten en test dan laag na laag omlaag in het OSI model
}

As you can see there are a couple of conversion errors (e.g. the ?? in the last question). But so far it works pretty well.
After running NA's script I get the following:

Quote:

$ ./multiline.sh
awk: cmd. line:8: warning: escape sequence `\.' treated as plain `.'
Ignored: ::Werkstations kunnen voor de tweede maal in een week niet op de server in de server. De technicus die de eerste keer het probleem opgelost heeft kan zich niet meer herinneren welke stappen hij ondernomen heeft voor de oplossing. Welk aspect van het troubleshooting proces heeft de technicus nagelaten? [9.1.1]
Ignored: ::Werkstations kunnen voor de tweede maal in een week niet op de server in de server. De technicus die de eerste keer het probleem opgelost heeft kan zich niet meer herinneren welke stappen hij ondernomen heeft voor de oplossing. Welk aspect van het troubleshooting proces heeft de technicus nagelaten? [9.1.1]
Ignored: {
Ignored: ~ Het probleem identificeren
Ignored: ~ Stellen van vragen aan de eindgebruikers
Ignored: = Het troubleshooting proces documenteren
Ignored: ~ Gestructureerde technieken voor de oplossing te gebruiken
Ignored: }
Ignored: ::Wat moet de netwerkadministrator doen als hij een call van een gebruiker ontvangt die geen toegang krijgt met de bedrijfswebserver? [9.1.2]
Ignored: ::Wat moet de netwerkadministrator doen als hij een call van een gebruiker ontvangt die geen toegang krijgt met de bedrijfswebserver? [9.1.2]
Ignored: {
Ignored: ~ De webserver opnieuw starten
Ignored: ~ De NIC van de computer vervangen
Ignored: ~ De gebruiker vragen uit te loggen en weer in te loggen
Ignored: = De gebruiker vragen welke URL hij gebruikt heeft en wat de foutmelding is
Ignored: }

I just copied the first two questions from the NA script output. I appreciate all the help.

I think I completely misunderstood the problem.

My original script above was intended to replace all the work you do with sed et cetera, since awk can easily do all that for you. I somehow missed the fact that you never showed the actual text file snippet you get when first exporting the text file; that is what my above awk scriptlet was supposed to work on.

Correct me if I'm wrong, but you want to supply input similar to the following to an awk script

Code:

:Question example
:Question example
{
~ Wrong
~ Wrong
~ Wrong
= Right
}

to get

Code:

:Question example
:Question example
{
~%-100% Wrong
~%-100% Wrong
~%-100% Wrong
=%100% Right
}

Let's rewrite an awk script for this. The earlier example was a command you could run, this is a script file. Supply the file names as parameters to the script.

Code:

#!/usr/bin/awk -f

BEGIN {
    # Accept any newline convention.
    RS = "(\r\n|\n\r|\r|\n)"

    # Output using Unix newlines.
    ORS = "\n"

    # Field separator does not matter, but set it to whitespace anyway.
    FS = "[\t\v\f ]+"

    # Question lines.
    questions = split("", question)

    # Answer lines.
    answers = split("", answer)

    # Correct answers are set to 1, wrong 0.
    split("", correct)
}

# Question lines begin with a colon.
/^:/ {
    # Remove any leading whitespace and colons from the record.
    line = $0
    sub(/^[\t\v\f :]+/, "", line)

    question[++questions] = line
    next
}

# Ignore lines that begin with {.
/^\{/ {
    next
}

# Add lines that begin with ~ to wrong answers.
/^~/ {
    # Remove any leading whitespace and tildes from the record.
    line = $0
    sub(/^[\t\v\f ~]+/, "", line)

    answer[++answers] = line
    correct[answers] = 0
}

# Add lines that begin with = to right answers.
/^=/ {
    # Remove any leading whitespace and equals signs from the record.
    line = $0
    sub(/^[\t\v\f =]+/, "", line)

    answer[++answers] = line
    correct[answers] = 1
}

# Lines that begin with } flush the question-answer set.
/^\}/ {
    # Calculate the number of correct answers.
    corrects = 0
    for (i = 1; i <= answers; i++)
        if (correct[i])
            corrects++

    # Calculate the percentage to set each answer.
    wrong = -100
    right = int(100 / corrects)

    # Set the values to match.
    for (i = 1; i <= answers; i++)
        if (correct[i])
            correct[i] = right
        else
            correct[i] = wrong

    # Balance the first correct answer to get full 100% total.
    if (corrects * right != 100)
        for (i = 1; i <= answers; i++)
            if (correct[i] == right) {
                correct[i] = 100 - right * (corrects - 1)
                break
            }

    # Output the question line(s).
    for (i = 1; i <= questions; i++)
        printf(":%s%s", question[i], ORS)

    # Start the answer set.
    printf("{%s", ORS)

    # Output the answers. We already have the percentages in correct[].
    for (i = 1; i <= answers; i++)
        printf("~%%%d%% %s%s", correct[i], answer[i], ORS)

    # Close the answer set.
    printf("}%s", ORS)

    # Clear the question and answer set.
    questions = split("", question)
    answers = split("", answer)
    split("", correct)
}

Does this work better? Questions?

NA, I'm sorry I'm giving you the wrong examples.

The questions with one answer should stay the same. Only the questions with multiple right answers should become like your last example.

Input example:

Quote:

:Question example
:Question example
{
~ Wrong
~ Wrong
~ Wrong
= Right
}

:Question example
:Question example
{
~ Wrong
~ Wrong
= Right
= Right
}

Output example:

Quote:

:Question example
:Question example
{
~ Wrong
~ Wrong
~ Wrong
= Right
}

:Question example
:Question example
{
~%-100% Wrong
~%-100% Wrong
~%50% Right
~%50% Right
}

I'm trying to adjust your code so it does this. It works perfectly with the multiple right answers! I only have to remove the percentages for questions with one right answer. I'm really grateful for your help. Thank you so much!

Okay, then try the following. This works like the awk script above, but only outputs the percentages when there is more than one correct answer.

Code:

#!/usr/bin/awk -f

BEGIN {
    # Accept any newline convention.
    RS = "(\r\n|\n\r|\r|\n)"

    # Output using Unix newlines.
    ORS = "\n"

    # Field separator does not matter, but set it to whitespace anyway.
    FS = "[\t\v\f ]+"

    # Question lines.
    questions = split("", question)

    # Answer lines.
    answers = split("", answer)

    # Correct answers are set to 1, wrong 0.
    split("", correct)
}

# Question lines begin with a colon.
/^:/ {
    # Remove any leading whitespace and colons from the record.
    line = $0
    sub(/^[\t\v\f :]+/, "", line)

    question[++questions] = line
    next
}

# Ignore lines that begin with {.
/^\{/ {
    next
}

# Add lines that begin with ~ to wrong answers.
/^~/ {
    # Remove any leading whitespace and tildes from the record.
    line = $0
    sub(/^[\t\v\f ~]+/, "", line)

    answer[++answers] = line
    correct[answers] = 0
}

# Add lines that begin with = to right answers.
/^=/ {
    # Remove any leading whitespace and equals signs from the record.
    line = $0
    sub(/^[\t\v\f =]+/, "", line)

    answer[++answers] = line
    correct[answers] = 1
}

# Lines that begin with } flush the question-answer set.
/^\}/ {
    # Calculate the number of correct answers.
    corrects = 0
    for (i = 1; i <= answers; i++)
        if (correct[i])
            corrects++

    # Calculate the percentage to set each answer.
    wrong = -100
    right = int(100 / corrects)

    # Set the values to match.
    for (i = 1; i <= answers; i++)
        if (correct[i])
            correct[i] = right
        else
            correct[i] = wrong

    # Balance the first correct answer to get full 100% total.
    if (corrects * right != 100)
        for (i = 1; i <= answers; i++)
            if (correct[i] == right) {
                correct[i] = 100 - right * (corrects - 1)
                break
            }

    # Output the question line(s).
    for (i = 1; i <= questions; i++)
        printf(":%s%s", question[i], ORS)

    # Start the answer set.
    printf("{%s", ORS)

    if (corrects > 1) {
        # Output the answers with percentages.
        for (i = 1; i <= answers; i++)
            printf("~%%%d%% %s%s", correct[i], answer[i], ORS)
    } else {
        # Only one correct answer. Use simpler output format.
        for (i = 1; i <= answers; i++)
            if (correct[i] == wrong)
                printf("~ %s%s", answer[i], ORS)
            else
                printf("= %s%s", answer[i], ORS)
    }

    # Close the answer set.
    printf("}%s", ORS)

    # Clear the question and answer set.
    questions = split("", question)
    answers = split("", answer)
    split("", correct)
}

Wow, you're fast! Works like a charm. Your code is much more readable than my sed code; I'm thinking about rewriting everything your way. Thank you so much!

Quote:

Originally Posted by battler (Post 4653218)

I'm thinking about rewriting everything your way.

If you wish to show some example lines from your original text data -- whatever you get when you first export it from your word processing program as a text file -- I'm sure I could edit my first example to handle that directly. I'm sure it would be pretty clear code, and it would save a few steps, too.

Nominal Animal, that's incredibly kind of you. However, I don't want to misuse your kindness. I posted this question here for help, and you've done more than that. Now it's up to me to understand and work with your code, so that when the time arises I can be as much help to someone else as you have been to me.

@battler
You've misunderstood NominalAnimal.

He can't live without writing proper awk programs.

Do him a favour:
Let him rewrite the code to the perfect one-step-solution.

@uhelp, thank you;)

Unfortunately I can't post the real files (copyright), so I changed the text.

Quote:

1. Whole bunch of text [kies er twee] [1.1.2]
a. right answer*
b. tiny bit of text
c. tiny bit of text
d. tiny bit of text
e. right answer*
f. tiny bit of text tiny bit of text

2. Whole bunch of text [ [1.1.2]
a. De tiny bit of text tiny bit of text tiny bit of text
b. right answer*
c. De tiny bit of text is tiny bit of text via tiny bit of text
d. De tiny bit of text tiny bit of text

3. Whole bunch of text [ [Kies er twee] [1.1.2]
a. tiny bit of text tiny bit of textk
b. tiny bit of text tiny bit of text tiny bit of text
c. tiny bit of text tiny bit of text tiny bit of text
d. right answer*
e. right answer
*
4. Whole bunch of text [Kies er drie] [1.1.2]
a. tiny bit of texttiny bit of texttiny bit of texttiny bit of text
b. right answer*
c. right answer*
d. tiny bit of text
e. tiny bit of text
f. right answer
*
5. Whole bunch of text [kies er drie] [1.2.1]
a. tiny bit of text tiny bit of text
b. tiny bit of text
c. right answer*
d. right answer*
e. tiny bit of text
f. right answer
*

How I want it in the end

Quote:

:: Whole bunch of text without the text in brackets []
:: Whole bunch of text [kies er twee] [1.1.2]
{
~%50% right answer
~%-100% tiny bit of text
~%-100% tiny bit of text
~%-100% tiny bit of text
~%50% right answer
~%-100% tiny bit of text tiny bit of text
}
:: Whole bunch of text
:: Whole bunch of text [1.1.2]
{
~ De tiny bit of text tiny bit of text tiny bit of text
= right answer
~ De tiny bit of text is tiny bit of text via tiny bit of text
~ De tiny bit of text tiny bit of text
}
:: Whole bunch of text
:: Whole bunch of text [ [Kies er twee] [1.1.2]
{
~%-100% tiny bit of text tiny bit of textk
~%-100% tiny bit of text tiny bit of text tiny bit of text
~%-100% tiny bit of text tiny bit of text tiny bit of text
~%50% right answer
~%50% right answer
}

And so on. There are a couple of glitches:
- Because the questions are hand-made, there is often more than one blank line between entries. In the end there should be no blank lines.
- Sometimes the * is on the next row. This only happens when the right answer is the last of the answers. When this happens the last answer is the right answer.

@battler, @uhelp: It is true, I am addicted to solving problems. But, I am also deeply satisfied when people express their wish to learn for themselves, and to express their own skill; true learning -- as opposed to rote or parroting preprocessed data -- is something that I appreciate very, very much. I was torn whether to prod battler a bit, since I knew the single-stage solution to be so near, but on the other hand, I really, really respect the wish to learn and study.

Fortunately, after uhelp's prodding, my dilemma was solved.

EDIT 1: Oops, forgot the note about the solitary asterisk. Easy fix: If a line starts with an asterisk, we mark the previous answer (if any) correct, then remove the asterisk, and continue processing that line as normal. This way it does not matter if the asterisk is at the end of the line, or at the start of the next line.

EDIT 2: In case no answers were marked correct, mark the last one correct.

EDIT 3: Output a warning if no answers were marked correct and the last one was marked correct by default (the EDIT 2 case).

Here is my suggested awk script:

Code:

#!/usr/bin/awk -f

BEGIN {
    # Accept any newline convention.
    # Also, remove leading and trailing whitespace.
    RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"

    # Output using Unix newlines.
    ORS = "\n"

    # Field separator does not matter, but set it to whitespace anyway.
    FS = "[\t\v\f ]+"

    # Question lines.
    questions = split("", question)

    # Answer lines.
    answers = split("", answer)

    # Correct answers are set to 1, wrong 0.
    split("", correct)
}

# Function to output current question structure.
function output() {
    # If there are no answers, do nothing.
    if (answers < 1)
        return

    # Calculate the number of correct answers.
    corrects = 0
    for (i = 1; i <= answers; i++)
        if (correct[i])
            corrects++

    # If no correct answers, mark the last answer correct.
    if (corrects < 1) {
        printf("Warning: Marking last answer (%s) of \047%s\047 correct.%s", answer[answers], question[1], ORS) > "/dev/stderr"
        correct[answers] = 1
        corrects++
    }

    # Calculate the percentage to set each answer.
    wrong = -100
    right = int(100 / corrects)

    # Set the values to match.
    for (i = 1; i <= answers; i++)
        if (correct[i])
            correct[i] = right
        else
            correct[i] = wrong

    # Balance the first correct answer to get full 100% total.
    if (corrects * right < 100)
        for (i = 1; i <= answers; i++)
            if (correct[i] == right) {
                correct[i] = 100 - right * (corrects - 1)
                break
            }

    # Output the question line(s).
    for (i = 1; i <= questions; i++)
        printf(":%s%s", question[i], ORS)

    # Start the answer set.
    printf("{%s", ORS)

    if (corrects > 1) {
        # Output the answers with percentages.
        for (i = 1; i <= answers; i++)
            printf("~%%%d%% %s%s", correct[i], answer[i], ORS)
    } else {
        # Only one correct answer. Use simpler output format.
        for (i = 1; i <= answers; i++)
            if (correct[i] == wrong)
                printf("~ %s%s", answer[i], ORS)
            else
                printf("= %s%s", answer[i], ORS)
    }

    # Close the answer set.
    printf("}%s", ORS)

    # Clear the question and answer set.
    questions = split("", question)
    answers = split("", answer)
    split("", correct)
}

# If a line starts with an asterisk, mark the previous answer correct.
/^\*/ {
    # Remove the asterisk
    sub(/^\*+/, "")

    # Mark the previous answer correct.
    if (answers > 0)
        correct[answers] = 1
}

# Question lines begin with number.
/^[0-9]/ {
    # Remove the leading number and full stops.
    line2 = $0
    sub(/^[0-9]*[\t\v\f ]*\.[\t\v\f ]*/, "", line2)

    # Remove the asterisk, if there happens to be one.
    sub(/[\t\v\f ]*\*$/, "", line2)

    # Remove everything in brackets for the first line.
    line1 = line2
    gsub(/[\t\v\f ]*\[[^\]]*\][\t\v\f ]*/, "", line1)

    # Add question lines.
    question[++questions] = line1
    question[++questions] = line2
    next
}

# Answer lines are those that start with a letter.
/^[A-Za-z]/ {
    # Remove the leading letter and full stops.
    line = $0
    sub(/^[A-Za-z]*[\t\v\f ]*\.[\t\v\f ]*/, "", line)

    # If the line ends with an asterisk, it is the correct one.
    if (line ~ /\*$/) {
        sub(/[\t\v\f ]*\*$/, "", line)
        answer[++answers] = line
        correct[answers] = 1
    } else {
        answer[++answers] = line
        correct[answers] = 0
    }

    next
}

# Everything else flushes the question.
{
    # Check if the line is empty (ignore trailing asterisk).
    line = $0
    sub(/[\t\v\f ]*\*$/, "", line)
    if (length(line) > 0)
        printf("Ignoring \047%s\047%s", line, ORS) > "/dev/stderr"

    output()
}

# At end, we might have the final question buffered.
END {
    output()
}

The BEGIN rule is still the same, except newlines will consume any leading and trailing whitespace from lines.

The output() function is from the last example (} rule), except it'll do nothing if there are no answers yet.

EDIT: The added asterisk rule edits the current record using sub(), removing the asterisk at the start of the line. If there are answers, the previous answer is marked correct.
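Here is a cut-down sketch of just that asterisk handling, leaving out questions, percentages and flushing. The function name mark_correct and the arrays ok and text are invented for this example; the real script above uses answer and correct.

```shell
# Mark answers correct whether the asterisk ends the answer line
# or sits alone at the start of the following line.
mark_correct() {
    awk '
        /^\*/ {
            sub(/^\*+/, "")          # drop the asterisk from the current record
            if (n > 0)
                ok[n] = 1            # it belongs to the previous answer
        }
        /^[A-Za-z]/ {
            line = $0
            n++
            ok[n] = (line ~ /\*$/) ? 1 : 0
            sub(/[\t ]*\*$/, "", line)                    # strip trailing asterisk
            sub(/^[A-Za-z]*[\t ]*\.[\t ]*/, "", line)     # strip the "a. " prefix
            text[n] = line
        }
        END {
            for (i = 1; i <= n; i++) {
                mark = ok[i] ? "=" : "~"
                print mark " " text[i]
            }
        }
    '
}

printf 'a. one*\nb. two\nc. three\n*\n' | mark_correct
```

The first rule has no next, so after the asterisk is removed the same record is still tested against the later rules, as in the full script.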

The question rule is pretty much the same as before, except now it triggers on lines that begin with a number. The number and the full stop are removed, as well as any trailing asterisk on the line. The line is duplicated, with the first copy having everything in brackets [ ] removed, including the brackets. If you do have single bracket characters in there, they should probably be filtered out too, using another gsub() command (so it'll apply it for all matches; sub() only applies once).
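The bracket removal can be tried on its own. This is only a sketch: strip_brackets is a made-up name, the whitespace class is shortened, and the character class is written as [^]] here, which I believe is the more portable spelling of the [^\]] used in the script.

```shell
# Remove every [bracketed] group, together with the whitespace around it.
strip_brackets() {
    awk '{
        line = $0
        gsub(/[\t ]*\[[^]]*\][\t ]*/, "", line)   # gsub: all matches, not just the first
        print line
    }'
}

printf 'Whole bunch of text [kies er twee] [1.1.2]\n' | strip_brackets
# prints: Whole bunch of text
```

Because the whitespace on both sides of a group is removed too, a bracketed group in the middle of a sentence would glue its neighbouring words together; for tags at the end of a question line that does not matter.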

Answer rules match on lines that begin with a letter. The letter and the full stop are removed. The expression is a bit more complex than usually needed, to take care of stray spaces. If the line ends with an asterisk, then it is a correct answer; otherwise it is a wrong answer. The asterisk is of course filtered out from the correct answer.

Everything not matching the question or answer rules causes the current question-answer set to be flushed. If the line is not empty (aside from an asterisk), it is printed to standard error.

If the last line in the input files is part of an answer, then the last question-answer set has to be output at the end of the script. The END rule takes care of that.

As you can see, the logic is very much the same as in the above awk scripts, only small changes. The output function is used in two places; using a function is better than copying the code. If there are no answers yet, the function will ignore the command; thus, an empty line between a question and related answers should not do any harm.

Does this work better? (By the way, it only took me about two minutes to modify the awk script, much, much less than writing this post.)

It's almost perfect. I get a "division by zero attempted" error on row 38 of the script. This happens when the only right answer is the last one.
It goes wrong when:

Quote:

26. Question
a. answer
b. answer
c. answer
d. answer
*

All other scenarios work perfectly:

Quote:

26. Question
a. answer
b. answer*
c. answer
d. answer
*

Quote:

26. Question
a. answer
b. answer*
c. answer
d. answer