Complex regex (Sed/perl)
Hi,
I'm having a hard time with the following: there is a Word file with questions and answers. I need to import it into Moodle (an online question site) in a particular format. Everything is black except for the right answers, which are green. The start format is the following: Quote:
Quote:
I open it on my Linux computer and run the following regex on it. Quote:
Quote:
Is this doable? I have over 1000 questions and it would sure help to automate this. Multiline replacement with sed is kind of tricky with two lines, so I guess four or more lines will need Perl? I have no experience in Perl. Could someone help me out with this one? Please excuse my bad English, I'm a non-native speaker. |
How about using awk for this? It uses regular expressions like sed, but is more of a programming language. Well suited for something like this, in my opinion.
Since the OP's true input files do not match the example, I've edited the script as follows:
Code:
awk '#
Note that the above does not balance correct answers to 100%, but uses the same truncated value (i.e. the sum will never exceed 100%) for all correct answers. This could be fixed quite easily, though.
Here is how the code works:
The BEGIN rule is executed once when the script starts, before any of the data files are read.
Awk parses data in records (here, lines, using any newline convention), which are further split into fields (here, separated by . followed by whitespace).
I keep the question string in variable question, the number of answers in the answers variable, and the answers (+1 for right and -1 for wrong) in the answer array. (answers = split("", answer) clears the answer array and sets answers to zero.)
The section function uses the global variables to output the complete question saved in the above variables, then it clears them, preparing for a new question. Note that all variables are global in awk scripts.
The END rule is executed once after all data files have been read and processed. It just outputs the last question, which is probably still stored in the variables.
The rest of the rules in the middle are executed once for each input record. The next at the end of the rule means that awk will not execute the other rules for that record, but skip straight to the next record.
The $1 is a reference to the first field in the record, $2 to the second, and so on. (You can even use $i where i is a variable containing an integer value.)
The right and wrong rules check if the second field on the row contains a regular expression pattern ("Right" or "Wrong" in upper- or lowercase letters). If so, they add either +1 (right) or -1 (wrong) to the answer array, using the next value of answers as the index. Thus, the first answer will always be at index 1, and answers contains the number of answers in it.
I think the scriptlet is quite straightforward, so I don't know what else to describe about it.
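The structure described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the script from this post: the input layout (Q. <question>, A. Right. <answer>, A. Wrong. <answer>), the wrapper name to_moodle, the text array and the whole-number percentages are made up for the example; question, answers, answer and section follow the description.

```shell
#!/bin/sh
# Hypothetical input layout (an assumption for this sketch):
#   Q. <question>          starts a new question
#   A. Right. <answer>     a correct answer
#   A. Wrong. <answer>     a wrong answer
to_moodle() {
    awk '
    # Print the stored question with its answers, then reset the state.
    function section(    i, correct, pct) {
        if (question == "") return
        correct = 0
        for (i = 1; i <= answers; i++)
            if (answer[i] > 0) correct++
        # Same truncated share for every correct answer, so the sum never exceeds 100.
        pct = (correct > 0) ? int(100 / correct) : 0
        printf "::%s::%s {\n", question, question
        for (i = 1; i <= answers; i++) {
            if (answer[i] > 0) printf "~%%%d%%%s\n", pct, text[i]
            else               printf "~%s\n", text[i]
        }
        printf "}\n\n"
        question = ""
        answers = split("", answer)   # clears the array and sets answers to 0
        split("", text)
    }

    BEGIN {
        # Fields are separated by a full stop followed by whitespace.
        # "[.]" avoids writing "\." in a string, which gawk warns about.
        FS = "[.][ \t]+"
        answers = 0
    }

    { sub(/\r$/, "") }                # tolerate CRLF line endings

    # Question rule: flush the previous question, remember the new one.
    $1 == "Q" { section(); question = $2; next }

    # Right and wrong rules: +1 or -1 stored at the next index.
    $2 ~ /^[Rr][Ii][Gg][Hh][Tt]$/ { answer[++answers] = +1; text[answers] = $3; next }
    $2 ~ /^[Ww][Rr][Oo][Nn][Gg]$/ { answer[++answers] = -1; text[answers] = $3; next }

    # The last question is still in the variables when the input ends.
    END { section() }
    '
}
```

It could be run as to_moodle < questions.txt. With three correct answers each one would get 33%, so the total stays just below 100%, as noted above.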
If you could try it on your input, and if you find any deficiencies, I could try to fix it. Awk is very useful for these kinds of tasks -- data conversion, tabulation --, and in my opinion, even more so for simple numeric processing or statistics gathering. GNU awk, gawk, has some additional features (like sorting data, retaining the record separator of each record in a separate variable RT, and so on) which are often very useful, but mawk is usually a bit faster if you don't need the GNU features. If you need clarification on any point of the script, I'd be happy to try. |
Nominal Animal, thank you for your time, script and elaborate explanation. Unfortunately my AWK knowledge is zero; I'm digging into it right now.
I already tried the following: I pasted your code into Emacs and made it executable. I changed the last line to:
' tmp > tmp2
where tmp is a file containing the questions. When I run it I get the following error:
./multiline.sh
awk: cmd. line:8: warning: escape sequence `\.' treated as plain `.'
Ignored: ::Benoem drie technieken die bij social engineering worden gebruikt. [8.1.3]
Ignored: ::Benoem drie technieken die bij social engineering worden gebruikt. [8.1.3]
Ignored: {
The ::[text] is the normal question and the { is the start of the answers. I'm reading up on awk so I can understand your code better. Once again, thank you for your time. |
My first guess on the errors would be that you are running the solution on a Mac, which has its own bastardisation of awk that has issues with many standard features (this has been my experience when using a relative's Mac). My question to NA is whether or not the code will work on all forms of awk. Also, maybe give a small example of the exact input, as it may have to do with the language, i.e. does Dutch have some unusual characters outside the ASCII group? |
I'm running the AWK on Linux (Ubuntu).
An example of the input file before regex, I only used the first 5 questions: Quote:
Quote:
After the NA script I get the following. Quote:
|
I think I completely misunderstood the problem.
My original script above was intended to replace all the work you do with sed et cetera, since awk can easily do all that for you. I somehow missed the fact that you never showed the actual text file snippet you get when first exporting the text file; that is what my above awk scriptlet was supposed to work on. Correct me if I'm wrong, but you want to supply input similar to the following to an awk script Code:
:Question example Code:
:Question example Code:
#!/usr/bin/awk -f |
NA, I'm sorry I'm giving you the wrong examples.
The questions with one answer should stay the same. Only the questions with multiple right answers should become like your last example. Input example: Quote:
Quote:
|
Okay, then try the following. This works like the awk script above, but only outputs the percentages when there is more than one correct answer.
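To illustrate just that rule (a sketch under assumptions, not the script posted here): the helper name mark_answers and the "+ " / "- " input prefixes are hypothetical, and the = and ~ markers with a whole-number percentage are assumed output notation. It handles the answers of a single question only.

```shell
#!/bin/sh
# Hypothetical helper: reads the answers of ONE question, one per line,
# each prefixed with "+ " (correct) or "- " (wrong).
mark_answers() {
    awk '
    {
        sign[NR] = substr($0, 1, 1)   # "+" or "-"
        text[NR] = substr($0, 3)      # answer text after the prefix
        if (sign[NR] == "+") correct++
    }
    END {
        for (i = 1; i <= NR; i++) {
            if (sign[i] != "+")
                print "~" text[i]                                   # wrong answer
            else if (correct > 1)
                printf "~%%%d%%%s\n", int(100 / correct), text[i]   # several correct: percentage
            else
                print "=" text[i]                                   # single correct: no percentage
        }
    }
    '
}
```

A question with one correct answer keeps the plain = marker, and only questions with two or more correct answers get percentages.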
Code:
#!/usr/bin/awk -f |
Wow, you're fast! Works like a charm. Your code is far more readable than my sed code; I'm thinking about rewriting everything your way. Thank you so much!
|
Quote:
|
Nominal Animal, that's incredibly kind of you. However, I don't want to misuse your kindness. I've posted this question here for help, and you've done more than that. Now it's up to me to understand and work with your code, so that when the time arises I can be as much help to someone else as you have been to me.
|
@battler
You've misunderstood NominalAnimal. He can't live without writing proper awk programs. Do him a favour: let him rewrite the code into the perfect one-step solution. |
@uhelp, thank you;)
Unfortunately I can't post the real files (copyright), so I changed the text. Quote:
Quote:
- Because the questions are handmade, there is often more than one blank line between them. In the end there should be no blank lines.
- Sometimes the * is on the next row. This only happens when the right answer is the last of the answers; in that case the last answer is the right answer. |
@battler, @uhelp: It is true, I am addicted to solving problems. But, I am also deeply satisfied when people express their wish to learn for themselves, and to express their own skill; true learning -- as opposed to rote or parroting preprocessed data -- is something that I appreciate very, very much. I was torn whether to prod battler a bit, since I knew the single-stage solution to be so near, but on the other hand, I really, really respect the wish to learn and study.
Fortunately, after uhelp's prodding, my dilemma was solved.
EDIT 1: Oops, forgot the note about the solitary asterisk. Easy fix: If a line starts with an asterisk, we mark the previous answer (if any) correct, then remove the asterisk, and continue processing that line as normal. This way it does not matter if the asterisk is at the end of the line, or at the start of the next line.
EDIT 2: In case no answers were marked correct, mark the last one correct.
EDIT 3: Output a warning if no answers were marked correct and the last one was marked correct by default (the EDIT 2 case).
Here is my suggested awk script: Code:
#!/usr/bin/awk -f
The output() function is from the last example (} rule), except it'll do nothing if there are no answers yet.
EDIT: The added asterisk rule edits the current record using sub(), removing the asterisk at the start of the line. If there are answers, the previous answer is marked correct.
The question rule is pretty much the same as before, except now it triggers on lines that begin with a number. The number and the full stop are removed, as well as any trailing asterisk on the line. The line is duplicated, with the second copy having everything in brackets [ ] removed, including the brackets. If you do have single bracket characters in there, they should probably be filtered out too, using another gsub() command (so it'll apply to all matches; sub() only applies once).
Answer rules match on lines that begin with a letter. The letter and the full stop are removed. The expression is a bit more complex than usually needed, to take care of stray spaces. If the line ends with an asterisk, then it is a correct answer; otherwise it is a wrong answer. The asterisk is of course filtered out from the correct answer.
Everything not matching the question or answer rules causes the current question-answer set to be flushed. If the line is not empty (aside from an asterisk), it is printed to standard error.
If the last line in the input files is part of an answer, then the last question-answer set has to be output at the end of the script. The END rule takes care of that.
As you can see, the logic is very much the same as in the above awk scripts, with only small changes. The output function is used in two places; using a function is better than copying the code. If there are no answers yet, the function will ignore the command; thus, an empty line between a question and related answers should not do any harm.
Does this work better? (By the way, it only took me about two minutes to modify the awk script, much, much less than writing this post.) |
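The rules described in the post above can be put together roughly as follows. This is a sketch under assumptions, not the script that was posted: the wrapper name questions_to_moodle, the title, text and mark variables, the wording of the messages, the whole-number percentages and the blank line written after each question are all choices made for the example. It also assumes an awk that accepts "/dev/stderr" as an output file, as gawk and mawk do.

```shell
#!/bin/sh
questions_to_moodle() {
    awk '
    # Print the stored question and answers; do nothing if there are no answers yet.
    function output(    i, correct) {
        if (answers < 1) return
        correct = 0
        for (i = 1; i <= answers; i++)
            if (answer[i] > 0) correct++
        if (correct == 0) {
            # Nothing was marked: default to the last answer and warn.
            answer[answers] = 1
            correct = 1
            printf "Warning: no answer marked correct, using the last one: %s\n", title > "/dev/stderr"
        }
        printf "::%s::%s {\n", title, question
        for (i = 1; i <= answers; i++) {
            if (answer[i] < 0)    printf "~%s\n", text[i]
            else if (correct > 1) printf "~%%%d%%%s\n", int(100 / correct), text[i]
            else                  printf "=%s\n", text[i]
        }
        printf "}\n\n"
        answers = split("", answer)
        split("", text)
        title = question = ""
    }

    { sub(/\r$/, "") }                       # tolerate CRLF line endings

    # Asterisk rule: a leading asterisk belongs to the previous answer.
    /^[ \t]*\*/ {
        if (answers > 0) answer[answers] = 1
        sub(/^[ \t]*\*[ \t]*/, "")           # then process the rest of the line as usual
    }

    # Question rule: the line starts with a number and a full stop.
    /^[ \t]*[0-9]+\.[ \t]*/ {
        output()
        sub(/^[ \t]*[0-9]+\.[ \t]*/, "")
        sub(/[ \t]*\*?[ \t]*$/, "")          # drop a trailing asterisk, if any
        title = question = $0
        gsub(/[ \t]*\[[^]]*\]/, "", question)   # second copy without [bracketed] parts
        next
    }

    # Answer rule: a letter and a full stop; a trailing asterisk marks it correct.
    /^[ \t]*[A-Za-z]\.[ \t]*/ {
        sub(/^[ \t]*[A-Za-z]\.[ \t]*/, "")
        mark = -1
        if (sub(/[ \t]*\*[ \t]*$/, "")) mark = 1
        sub(/[ \t]+$/, "")
        answer[++answers] = mark
        text[answers] = $0
        next
    }

    # Anything else flushes the current question; report non-empty leftovers.
    {
        output()
        if ($0 !~ /^[ \t]*$/) print "Ignored: " $0 > "/dev/stderr"
    }

    END { output() }
    '
}
```

It could be run as questions_to_moodle < questions.txt > questions_moodle.txt, with warnings and ignored lines appearing on standard error. Because the count of correct answers is forced to at least 1 before the percentage is computed, a question with no marked answer cannot lead to a division by zero in this sketch.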
It's almost perfect, but I get a "division by zero attempted" error on line 38 of the script. This happens when the only right answer is the last one.
It goes wrong when: Quote:
Quote:
Quote:
|