[SOLVED] scripting help/advice; use bash?

jamtat · 02-02-2012, 03:28 PM

Now that I've worked with these scripts to create and print off a real-world outline, I've come up with yet another possible enhancement. The issue I've run into is that the LaTeX processor (pdflatex in my case) that generates a nice-looking print copy from a properly marked-up (mark-upped?) file created in nano, does not recognize " at the beginning of a quotation as an opening quotation mark. Rather, it expects double back-ticks (``) as the sign of an opening quotation mark--not a very intuitive movement for the typist. If pdflatex encounters " at the beginning of a quotation, it makes it into a left-facing rather than right-facing quotation mark. The " occuring at the end of a quotation is, however, turned into the correct left-facing quotation mark. So that one does not need to be replaced.

In any case, these scripts would be enhanced if they could identify opening quotation marks--say, by looking for "[A-Za-z0-9]--and, at those points replacing " by ``. So, any letter--upper or lower case--or any number, preceded immediately by " needs to be targeted for replacement by ``. Any " immediately following any letter or numeral can be ignored.

Can anyone propose a modification of one or more of the above scripts that would search and replace those opening quotation marks with double back-ticks? Once again, many thanks for the very helpful contributions thus far.

James

Dark_Helmet · 02-02-2012, 05:16 PM

Sure, the scripts could all be re-tooled to handle it, but I see some potential problems down the road that you might want to consider before finalizing your hoped-for behavior.

1. Just as an aside, if I'm not mistaken, LaTeX/TeX uses double-backticks (``) for opening double quotes and double-single-quotes ('') for closing double quotes. While pdflatex may translate an ASCII double quote into a closing double quote, if you change to another TeX-to-PDF generator tool, the new tool may not behave the same way.

2. Can a double quote appear as part of a quote? For instance:

Quote:

The court's opinion reads, "This court concludes the Plaintiff's use of a double quote (") in paragraph 5 indicates a verbatim restatement of a prior conversation between the parties."

3. If you switch to the double-backtick and double-single-quote method in #1, would the script need to handle quotes that begin on line X of the outline and end on line Y?

I'm not saying your original solution is a bad one. I'm just trying to say that there may be other situations to think of before marrying yourself to a simple regex-replace.

jamtat · 02-02-2012, 05:43 PM

Good points for consideration, Dark_Helmet. I've actually been using what you're calling doubled single quotes for closing quotations but discovered by accident that a regular close quotation mark gets converted correctly. But as you point out, that may be a peculiar behavior of pdflatex. So I guess ideally, both opening and closing quotes should be converted: opening quotation marks to double back-ticks, closing quotation marks to doubled single-quotes (think they're sometimes called "inverted commas," no? NO: I think they're actually called apostrophes).

I cannot foresee an instance in my usage where a regular quotation mark (i.e., ") would occur within a quotation. I may well have quotations within quotations but, according to style conventions I use, those nested quotations will be surrounded by apostrophes. Anyway, these are definitely things to think over.

Thanks, James

Nominal Animal · 02-02-2012, 05:57 PM

There is a simple way my awk script will address both jamtat's and Dark_Helmet's points. Unfortunately, GNU awk has perfect pattern matching options (\< and \> match the start and the end of a word, not any specific character), so I think I'll restrict this version of the script to GNU awk (gawk).

Code:

#!/usr/bin/gawk -f
#
# -v tab=8
#       set tab stops at every eight columns (the default).
#
# -v template=template.tex
#       set the path to the LaTeX template file.
#

# Convert tabs to spaces.
function detab(detab_line) {

    if (length(tabsp) != tab) {
    }

    while ((detab_pos = index(detab_line, "\t")) > 0)
        detab_line = substr(detab_line, 1, detab_pos - 1) substr(tabsp, detab_pos % tab) substr(detab_line, detab_pos + 1)

    return detab_line
}

# Apply config-based replacements.
function fix(fix_line) {

    # Edit backticks
    backticks = config["[Bb][Aa][Cc][Kk][Tt][Ii][Cc][Kk][Ss]"]
    if (length(backticks) > 0)
        fix_line = gensub(backticks, "\\1``\\2", fix_line)

    return fix_line
}

BEGIN {
    # Set tab width to default, unless set on the command line.
    if (tab < 1)
        tab = 8

    # Set template name to default, unless set on the command line.
    if (length(template) < 1)
        template = "template.tex"

    # Record separator is a newline, including trailing whitespace.
    RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)"

    # Field separator is consecutive whitespace.
    FS = "[\t\v\f ]+"

    # Configuration -- parsed from magic comments.
    split("", config)
    config["tab"] = tab
    config["template"] = template

    # We are not working on anything yet.
    template = ""
    header = ""
    footer = ""
    split("", outline)
    outline[0] = 1
    maxspaces  = 0
    CURR = ""
}

CURR != FILENAME {

    # Empty line?
    if ($0 ~ /^[\t ]*$/)
        next        

    # Configuration comment?
    if ($0 ~ /^%[\t ]*[A-Za-z][0-9A-Za-z]*[\t ]*:/) {
        name = $0
        sub(/^%[\t ]*/, "", name)
        sub(/[\t ]*:.*$/, "", name)
        value = $0
        sub(/^[^:]*:[\t ]*/, "", value)

        # Make the name case-insensitive.
        temp = name
        name = ""
        for (i = 1; i <= length(temp); i++) {
            c = substr(temp, i, 1)
            uc = toupper(c)
            lc = tolower(c)
            if (uc != lc)
                name = name "[" uc lc "]"
            else
                name = name c
        }

        config[name] = value
        next
    }

    # Comment line (skipped)?
    if ($0 ~ /^[\t ]*%/)
        next

    # This is the first line of actual content.
    CURR = FILENAME

    # Set up tabs as currectly specified.
    tab = int(config["tab"])
    tabsp = "                "
    while (length(tabsp) < tab)
        tabsp = tabsp tabsp
    tabsp = substr(tabsp, 1, tab)

    # Have we used a template yet?
    if (length(template) < 1) {
        # No, read it.
        template = config["template"]
        if (length(template) < 1) template = "-"
        OLDRS = RS
        RS = "(\r\n|\n\r|\r|\n)"

        while ((getline line < template) > 0) {
            # Content marker line?
            if (line ~ /^[\t\v\f ]*[Cc][Oo][Nn][Tt][Ee][Nn][Tt][\t\v\f ]*$/)
                break

            # Outline level definition?
            if (line ~ /^%[\t ]*\\outl{/) {
                level = line
                sub(/^[^{]*{/, "", level)
                sub(/}.*$/, "", level)
                level = int(level)

                line = detab(line)
                sub(/\\.*$/, "", line)
                sub(/%/, "", line)
                spaces = length(line)
                outline[spaces] = level
                if (spaces > maxspaces)
                    maxspaces = spaces
                continue
            }

            # Default value definition?
            if (line ~ /^%[\t ]*[A-Z][0-9A-Za-z]*:/) {
                name = line
                sub(/^%[\t ]*/, "", name)
                sub(/[\t ]*:.*$/, "", name)
                value = line
                sub(/^[^:]*:[\t ]*/, "", value)

                # Make the name case-insensitive.
                temp = name
                name = ""
                for (i = 1; i <= length(temp); i++) {
                    c = substr(temp, i, 1)
                    uc = toupper(c)
                    lc = tolower(c)
                    if (uc != lc)
                        name = name "[" uc lc "]"
                    else
                        name = name c
                }

                # If not in config already, set.
                if (!(name in config))
                    config[name] = value
                continue
            }

            # Comment line?
            if (line ~ /^[\t ]*%/)
                continue

            # Ordinary header line. Remove comment.
            sub(/[\t ]%.*$/, "", line)
            header = header line "\n"
        }

        # The rest belongs to footer.
        while ((getline line < template) > 0)
            footer = footer line "\n"

        close(template)
        RS = OLDRS

        # Fill in the outline levels.
        level = outline[0]
        for (spaces = 1; spaces < maxspaces; spaces++)
            if (spaces in outline)
                level = outline[spaces]
            else
                outline[spaces] = level

        # Replace all known ~Name~ in the template.
        for (name in config) {
            gsub("~" name "~", config[name], header)
            gsub("~" name "~", config[name], footer)
        }

        # Replace all other ~Name~ entries in the template with empty strings.
        gsub(/~[A-Z][0-9A-Za-z]*~/, "", header)
        gsub(/~[A-Z][0-9A-Za-z]*~/, "", footer)

        # Emit the template.
        printf("%s", header)
    }
}

/^[\t ]*=/ {
    line = $0
    prefix = index(line, "=") - 1

    # Indentation size in spaces.
    spaces = length(detab(substr(line, 1, prefix)))

    # Find out the outline level for this indentation.
    if (spaces > maxspaces)
        level = outline[maxspaces]
    else
        level = outline[spaces]

    # Add outline level definition.
    line = substr(line, 1, prefix) "\\outl{" level "}" substr(line, prefix + 2)

    printf("%s\n", fix(line))
    next
}

{   printf("%s\n", fix($0))
}

END {
    printf("%s", footer)
}

If you add

Code:

% Backticks: "\<

it converts each double quote before a word into double backticks.

It is a bit more versatile than that, actually. You can specify any pattern using

Code:

% Backticks: (before)replaced(after)

and the replaced bit will be replaced by double backticks. If you use the parentheses, always use both pairs, like ()"\<() or you'll move the after bit to before the backticks. Because the "\< matches only the double quote character before a word, there is no need to specify a preceding or succeeding pattern to be kept intact in parentheses.
The parentheses themselves are only used for grouping. If you want to match an open parens, use \( instead.

Note how this is done in the new fix() function. Because I wanted the config keywords to be case-insensitive, the "Backticks" name is actually "[Bb][Aa][Cc][Kk][Tt][Ii][Cc][Kk][Ss]" in the config array. (I've had issues with IGNORECASE in different versions of GNU awk, so I'd rather do it this way, thank you.)

If you keep in mind that each fix() is applied to each input line separately (do not span lines), you should be able to add any new configurable fixes into it yourself.

Dark_Helmet · 02-03-2012, 12:28 AM

Well, here's a modified python script. It's a 3.2 compatible script... just to get that out of the way

I modified it to insert the backticks and the apostrophes according to your idea about the regular expression matching. Though, I modified the pattern for the apostrophes a little bit.

In American English, grammar rules say that a closing double quote comes after any punctuation that would end the sentence. I believe British English grammar rules say that a closing double quote comes before the same punctuation.

For instance:
American English -> He said, "Why oh why didn't I take the BLUE pill."
British English -> He said, "Why oh why didn't I take the BLUE pill".

So, assuming that you are an American English, grammar-abiding citizen, I allowed for non-alphanumeric characters to come between the end-of-a-word marker and the close quote.

Given that a quote should not open with punctuation, the same pattern modification was not applied to the open quote regular expression.

EDIT:
I changed the script to substitute throughout the string in one command rather than loop over the string repeatedly performing one substitution at a time. It took me a few minutes to read how Python handles backreferences with python's compiled regular expression substitution approach.
/EDIT

I tested it on one or two cases, but it hasn't received a thorough testing.

Code:

#!/usr/bin/python

import sys
import re

if( len( sys.argv ) != 2 ):
    print ( "{0} requires a filename to operate on.".format( sys.argv[0].split('/')[-1] ), file=sys.stderr )
    sys.exit( 1 )

try:
    rawOutline = open( sys.argv[1], 'r' )
except:
    print ( "Unable to open {0} for reading".format( sys.argv[1] ), sys.stderr )
    sys.exit( 2 )

print ( '\\documentclass{article}\n'
        '\\usepackage{cjwoutl}\n'
        '\\usepackage[top=1in,bottom=1in,left=1in,right=1in]{geometry}\n'
        '\\pagestyle{myheadings}\n'
        '\\markright{\\today{\\hfill \\Large{***Header*title*here***}\\hfill}}\n'
        '\\linespread{1.3} % gives 1.5 line spacing\n'
        '\\begin{document}\n'
        '\\begin{outline}[new]\n'
        '\\begin{Large} % gives ca. 14 pt font' )

for inputLine in rawOutline:
    tabMatches = re.match( r"(\t*)=(.*)", inputLine )
    if( tabMatches == None ):
        outputLine = inputLine
    else:
        tabCount = len( tabMatches.group(1).split('\t') )
        outputLine = "{0}\\outl{{{1:d}}}{2}".format( tabMatches.group(1),
                                                     tabCount,
                                                     tabMatches.group(2) )

    openQuoteRE = re.compile( r'\B"\b' )
    outputLine = openQuoteRE.sub( '``', outputLine )

    closeQuoteRE = re.compile( r'\b(?P<punct>[^A-Za-z0-9]*)"\B' )
    outputLine = closeQuoteRE.sub( "\g<punct>''", outputLine )

    print ( outputLine.rstrip() )

print ( '\\end{Large}\n'
        '\\end{outline}\n'
        '\\end{document}\n' )

EDIT2:
I do have to say Nominal, I am impressed with the amount of time and effort you've put into your awk script setup. You're tempting me to "keep up with the Nominals" and adding some command line options to my script

jamtat · 02-04-2012, 01:50 PM

Thanks for all the continuing help with this. It's turning into a real project. I've changed ever so slightly the mark-up that gets prepended and appended since I found a better way of increasing font size. But that just involves deleting a couple of lines and a few characters in the middle of another line as well as adding a few characters in a third line, which seems something I should be able to do myself within the perl and python scripts.

On the quotation mark issue, I've actually discovered the problem I was having with double-quotes also applies to single-quotes: they, like double-quotes, are directional in TeX/LaTeX output, and if you want, as a sort of opening quotation marker, a single-quote instead of a left-facing apostrophe, a single back-tick is required there. Rather than trying to convert all these instances using the scripts, though, I'm beginning to think I'll just have to learn to start typing back-ticks.

James

Nominal Animal · 02-04-2012, 07:49 PM

Quote:

Originally Posted by jamtat

On the quotation mark issue, I've actually discovered the problem I was having with double-quotes also applies to single-quotes

If you use my script, then modify the fix() function into

Code:

# Apply config-based replacements.
function fix(fix_line) {

    # Edit backticks
    backticks = config["[Bb][Aa][Cc][Kk][Tt][Ii][Cc][Kk][Ss]"]
    if (length(backticks) > 0)
        fix_line = gensub(backticks, "\\1``\\2", fix_line)

    # Edit backtick
    backtick = config["[Bb][Aa][Cc][Tt][Ii][Cc][Kk]"]
    if (length(backtick) > 0)
        fix_line = gensub(backtick, "\\1`\\2", fix_line)

    return fix_line
}

and use (in either input text file or template)

Code:

% Backticks: "\<
% Backtick: (^|\s)'([^'])

The Backtick pattern might need some editing, though. As I've written it above, it will replace a single apostrophe at the start of the line and following a whitespace, as long as it is not followed by another apostrophe.

@Dark_Helmet: I don't really spend that much time answering these questions, really. Why I bother, is because I really want to support anyone who is willing to tinker their work environment to suit their needs better. It is the very foundation why open source environments will always beat prepackaged closed-source environments: because you get more done, when the environment is tailored to your needs.

I feel pretty sad towards those that are used to proprietary prepackaged software and lament the lack of polish and the lack of hard guidelines in open source tools. It is like lamenting the lack of barbed wire fences across children's playgrounds. Just because they're used to them being there, does not mean they really should be there.

Dark_Helmet · 02-04-2012, 08:26 PM

Quote:

Originally Posted by Nominal Animal

I don't really spend that much time answering these questions, really.

Well, regardless of the actual amount of time, your awk solution is rather flexible (as far as I can tell--I'm not familiar with awk beyond the basics). Much more flexible than necessary to accomplish the task. I view my script and Cedrik's perl script as more or less the minimum to accomplish the task (no offense meant Cedrik). So, in my book, you've gone beyond the call of duty.

Hence, my comment about adding some command line switches--for customizing the heading, specifying an input/output filename, etc.

I certainly am not trying to tell you to do less. Though, I did want to point out that some folks, like myself, are impressed with your effort--whether you feel it is impressive or not.

Nominal Animal · 02-04-2012, 11:10 PM

Quote:

Originally Posted by Dark_Helmet

your awk solution is rather flexible (as far as I can tell--I'm not familiar with awk beyond the basics). Much more flexible than necessary to accomplish the task.

That is because I've found that adding flexibility to my tools lets me reuse them easily, without having to manage a dozen little scriptlets each doing almost the same thing. It is the economical and efficient thing to do.

I try to keep the number of script utilities I use to a reasonable count. Having a number of scripts share a lot of code is a maintenance burden; having one script with a few options to allow it to work for all similar cases is worth the extra effort. (When script use diverges too much, I'll happily split it, though. It's all in a balance.)

This approach is already so deeply ingrained in the way I write my scripts, I don't even think about it. It obviously affects the results, the scriptlets being more verbose and more adaptable than necessary. Many feel I overengineer my solutions, and prefer simpler ones, ones that solve the problem at hand and nothing else. However, my approach has served me well thus far. The scripts and programs I write crash extremely rarely, if ever, even when encountering strange or mangled input (and the "bad" cases tend to be documented in the script), and I'm not littered with half-forgotten scriptlets no-one remembers why they exist.

jamtat · 02-04-2012, 11:31 PM

Quote:

Originally Posted by Nominal Animal

That is because I've found that adding flexibility to my tools lets me reuse them easily, without having to manage a dozen little scriptlets each doing almost the same thing. It is the economical and efficient thing to do.

I try to keep the number of script utilities I use to a reasonable count. Having a number of scripts share a lot of code is a maintenance burden; having one script with a few options to allow it to work for all similar cases is worth the extra effort. (When script use diverges too much, I'll happily split it, though. It's all in a balance.)

Not to sound ungracious, but I viewed your original solution as a bit overly complex, Nominal. But when I discovered, as I mentioned above, a better way to increase font size in my outlines (believe me, I stand to learn quite a lot about LaTeX mark-up), your script was the easiest to deal with. It needed no modification whatever: I only needed to add the modification to the template file. The perl and python scripts, on the other hand, needed to be modified. I'm glad I was able to discover so quickly some of the wisdom in the approach you've taken, Nominal.

And thanks for the additional tips on single-quotes/back-ticks/apostrophes. It's very much appreciated. I'll take a closer look now.

James

Dark_Helmet · 02-04-2012, 11:52 PM

Adding flexibility is the right way to do it. I do the same for my personal stuff (from scripts to compiled programs). Sometimes I even get frustrated with myself for insisting that I add this-or-that command line switch and think "I am never gonna use this--why bother." Sometimes I end up using it... sometimes not.

I tend not to add flexibility when answering questions. My choice not to do so is not born out of any ill will toward the poster, but because I have no idea what they see as the final solution to their situation. In other words, the cost-benefit becomes murky whereas it is clear when I am the end user.

One thing's for sure, if you ever provide a solution to a problem that I post, I'll certainly know to review it closely in case my problem evolves!