[SOLVED] scripting help/advice; use bash?

jamtat · 01-28-2012, 12:41 AM

Hi. I recently did some tweaking to nano so that I could create outlines with it that look good on the screen. I also need to print out those outlines and have them look nice on paper and I've come up with a way of doing that which involves inserting--manually, for now--TeX/LaTeX mark-up, then changing the file's extension to .tex and running pdflatex on it. You can read about the project at audaciousamateur.blogspot.com for more details if you're interested.

It seems to me that, even for someone with my limited knowledge, there should be some non-manual way to add the mark-up to my outline files. Someone who knows perl or python well, for example, could probably easily cobble together some way of doing this task using one of those languages. But I know next to nothing about either language.

In another forum where I asked about this I was directed to The Advanced Bash Scripting Guide. I was kind of gravitating toward bash anyway for this since, if I can lay somewhat dubious claim to being familiar with any sort of scripting, it would be using bash (I've created some extremely rudimentary bash scripts in the past). But I have such a poor grasp of even bash that I really wasn't sure it could process these files.

Well, I actually located in the ABS a sample script that converts a text file to html--something very close to what I need to do. In short, I need to add some lines at the beginning of the file and append some at the end, as well as to insert some mark-up within the file: that's pretty much what the bash script I found does as well (see the script at http://www.tldp.org/LDP/abs/html/con...ts.html#TOHTML ).

I just want to start off this thread by asking those much better versed in bash whether I'm on the right track in considering the ABS sample script as being a good starting point for a script that could be used to process my outline files?

Thanks, James

catkin · 01-28-2012, 02:04 AM

It's hard to answer without specific information about the format of your file before and after inserting the TeX/LaTeX markup.

Bash is not great at string manipulation but it can use sed to do the complex work for it as done in the linked script.

awk might be a better choice. If you know C, awk is relatively easy to learn -- easier than bash.

Can you post an illustrative example of the input file format and the desired output?

pyroscope · 01-28-2012, 04:29 AM

If I understand this, you want simple text files with simple markup that are directly readable for a human, but also publishable. There are (wiki-like) documentation systems just for that.

You might want to take a step back and consider one of those already round wheels, e.g. http://sphinx.pocoo.org/rest.html#li...te-like-blocks, or actually a personal wiki, which is what I use for quick notes. All wikis convert to HTML obviously, good ones also to PDF.

jamtat · 01-28-2012, 09:34 AM

Thanks for the answers thus far. There are illustrations at my blog (I posted the address in the OP), but I'll repeat some of that here.

The outline text file is, obviously, an outline. Each level of the outline gets indented 0 or more tab spaces from the left margin. Unindented lines are the level one parts of the outline; lines indented one tab space from the left margin are level two parts; lines indented two tab spaces from the left margin are level three parts; and so on. I've designated a unique character--the equals sign--as a sort of pseudo-bullet for all outline levels as well. Here's a link to a screenshot of a sample outline I did that should better illustrate visually what I'm describing: http://1.bp.blogspot.com/-o2ZlVk8sLL...s1600/Scr1.png

So, here's what needs to be done to this text file so as to make it print nicely on paper. Nine lines need to be prepended to the beginning. Those lines are:

Quote:

\documentclass{article}
\usepackage{cjwoutl}
\usepackage[top=1in,bottom=1in,left=1in,right=1in]{geometry}
\pagestyle{myheadings}
\markright{\today{\hfill \Large{***Header*title*here***}\hfill}}
\linespread{1.3} % gives 1.5 line spacing
\begin{document}
\begin{outline}[new]
\begin{Large} % gives ca. 14 pt font

Another three lines need to be appended at the end. Those lines are

Quote:

\end{Large}
\end{outline}
\end{document}

Then, mark-up needs to be added within the body of the outline as follows. Every new line that starts with an equals sign (the equals sign being the pseudo-bullet I've selected to use for all outline levels in the text file) needs to have the equals sign replaced by the mark-up \outl{1}. Every new line followed by a single tab space then the equals sign should have the equals sign replaced by \outl{2}. Every new line followed by two tab spaces and the equals sign should have the equals sign replaced by \outl{3}. And so on, up to \outl{10} (I doubt my outlines will ever go to ten levels, but the cjwoutl package is capable of that so it should be possible for the script to handle it: namely any new line followed by nine tab spaces, then the equals sign, should have the equals sign replaced by \outl{10}). See http://2.bp.blogspot.com/-0ABppprz7A...s1600/Scr2.png for an example of how the file looks after I've added (manually, in that case) the mark-up.
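
To make the rule above concrete, here is a minimal Python 3 sketch of the per-line substitution. It is not one of the scripts posted in this thread; the names mark_line and MAX_LEVEL are made up for illustration, and it assumes each outline level is indented with tabs only and that the tabs should be kept in the output.

```python
import re

# Hypothetical cap, taken from the ten levels mentioned in the post above.
MAX_LEVEL = 10


def mark_line(line):
    """Replace a leading '=' pseudo-bullet with \\outl{N}, keeping the tabs.

    N is the number of leading tabs plus one. Lines that do not start with
    optional tabs followed by '=' are returned unchanged.
    """
    match = re.match(r"(\t*)=(.*)", line, re.DOTALL)
    if match is None:
        return line
    tabs, rest = match.groups()
    level = len(tabs) + 1
    if level > MAX_LEVEL:
        # Deeper than the assumed limit: leave the line alone.
        return line
    return f"{tabs}\\outl{{{level}}}{rest}"


if __name__ == "__main__":
    for sample in ["=Top level", "\t=Second level", "\t\t=Third level", "plain text"]:
        print(mark_line(sample))
```

The header and footer lines quoted above would still have to be written out before and after the marked-up lines.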

I hope this gives enough further detail to determine whether a bash script is the right tool, or even a possible tool, to use for this job. As I said, it seems to me the bash script for converting a text file to html works very similarly to what I need--though my scenario is actually a bit simpler in that mark-up only needs to be added in a certain relation to new lines. The bash script I found, so far as I can understand it, needs to do replacements within lines and paragraphs and so, it seems, calls sed.

Further input will be appreciated. And by the way, I do not know C or any other programming language. The only thing remotely resembling programming that I have any familiarity with at all is some rudimentary html and, as I said, very rudimentary bash scripting.

James

jamtat · 01-28-2012, 09:47 AM

Quote:

Originally Posted by pyroscope

If I understand this, you want simple text files with simple markup that are directly readable for a human, but also publishable. There are (wiki-like) documentation systems just for that.

You might want to take a step back and consider one of those already round wheels, e.g. http://sphinx.pocoo.org/rest.html#li...te-like-blocks, or actually a personal wiki, which is what I use for quick notes. All wikis convert to HTML obviously, good ones also to PDF.

Thanks for your input, pyroscope. I do use moinmoin and am familiar with its mark-up. So I think I understand what you're getting at and it is an interesting thought.

The reason I like the solution I'm proposing is that I can, using TeX/LaTeX mark-up, essentially create a template that will render the printed output in just the form I want it. I can, for example, control margin width, font size, line spacing, header content--even doing tricky things like having the date auto-inserted in the header. So far as I understand it I would have to get involved in a lot of additional tweaking of the file in order to get that kind of output from a wiki file. But I'll certainly be giving the matter some more thought.

James

Cedrik · 01-28-2012, 10:19 AM

I don't know about bash, but here is a way to do it with Perl:

(change $tab_limit value if you need the script to handle more than 10 tabs)

Code:

#!/usr/bin/perl

my $tab_limit = 10;

print <<END
\\documentclass{article}
\\usepackage{cjwoutl}
\\usepackage[top=1in,bottom=1in,left=1in,right=1in]{geometry}
\\pagestyle{myheadings}
\\markright{\\today{\\hfill \\Large{***Header*title*here***}\\hfill}}
\\linespread{1.3} % gives 1.5 line spacing
\\begin{document}
\\begin{outline}[new]
\\begin{Large} % gives ca. 14 pt font 
END
;

while (<>) {

        for my $i (1 .. $tab_limit) {
            my $search = '^\t{' . ($i -1). '}=';
            if (/$search/) {
                my $replace = '\\outl{' . $i . '}';
                s/$search/$replace/;
                last;
            }
        }
        print;
}

print <<END
\\end{Large}
\\end{outline}
\\end{document} 
END
;

Then save as edit_tabs.pl (or any name you want)
Make it executable (chmod +x edit_tabs.pl)
Use it like

Code:

./edit_tabs.pl yourfile.txt > newfile.txt

[edit]
I found a better version, which removes the need to limit the tab count.
It also removes the equals sign, as that was one requirement (and the previous script did not satisfy it).

Code:

#!/usr/bin/perl

print <<END
\\documentclass{article}
\\usepackage{cjwoutl}
\\usepackage[top=1in,bottom=1in,left=1in,right=1in]{geometry}
\\pagestyle{myheadings}
\\markright{\\today{\\hfill \\Large{***Header*title*here***}\\hfill}}
\\linespread{1.3} % gives 1.5 line spacing
\\begin{document}
\\begin{outline}[new]
\\begin{Large} % gives ca. 14 pt font 
END
;

while (<>) {
	s/^(\t*)=(.*)/"$1\\outl{".((length $1) + 1)."}$2"/e;
	print;
}

print <<END
\\end{Large}
\\end{outline}
\\end{document} 
END
;

jamtat · 01-28-2012, 11:02 AM

Quote:

Originally Posted by Cedrik

I don't know with bash, but with Perl a way to do it:
. . . snip

Thank you for offering that, Cedrik. I thought this might be a fairly trivial task for someone familiar with a language like perl.

Now testing . . .

Wow. That works pretty well (though I did have some anomalies at first that resulted from some weirdness introduced when I copied and pasted the code). I note that in newfile.txt your script gets rid of the tab spaces where the \outl{#} tags get inserted. Of course pdflatex doesn't care about whether or not there are tab spaces at those points and formats the file just fine for printing anyway. But for my purposes, preserving the tab spaces found in the original outline is helpful: I can make better sense of the file visually with the presence of the tab spaces at those points. So, is there a way to modify your perl script so that it preserves the tab spaces that occur in the original outline in conjunction with the equals signs?

Otherwise, this looks like it could be a great solution.

James

jamtat · 01-28-2012, 01:16 PM

Would replacing the line

Code:

my $replace = '\\outl{' . $i . '}';

with the line

Code:

my $replace = '^\t{' . ($i). '}\\outl{' . $i . '}';

cause the tab spaces to be preserved?

Thanks, James

Never mind. That doesn't work--just prepends the characters ^\t{#} to lines that begin with \outl{#}

Cedrik · 01-28-2012, 02:53 PM

If you want to preserve tabs, change $replace line:

Code:

my $replace = "\t" x ($i - 1) . '\\outl{' . $i . '}';

That should do it

jamtat · 01-28-2012, 03:47 PM

Yep, that does do it, Cedrik. Thanks again so much for helping with this! I now have a workable way of inserting the needed mark-up into my outlines!

I still may try and do this with a bash script, though. I've wanted for some time now to advance my pathetic abilities with bash, and figuring out how to do this with bash (if, as it seems to me, it will be possible with bash) would provide an opportunity to learn more about it. So if anyone has further input on whether the bash script I found that adds html mark-up to a text file could be adapted to add TeX mark-up as I'm trying to do, please weigh in.

James

Nominal Animal · 01-28-2012, 05:33 PM

I would use awk instead of Bash, because awk has all the necessary string facilities, whereas with Bash they're a bit lacking. Bash would certainly be a LOT slower.

Here is a plain awk script. You can supply it

-v tab=8
    the size of the tab stops
-v template=template.tex
    the path to the LaTeX template file
-v title="string"
    the string that replaces ~Title~ in the header; empty by default

and the input file name(s). If there are no input files, the input is read from standard input.

You can add further variables, especially ones similar to the title, very easily. I've tried to comment the code well; I want it to be an example and explanation, and not just a suggested solution.

It should work well even with mixed tabs and spaces. It uses the % minimum-indentation \outl{level} comments in the template (where minimum-indentation is either empty or the desired whitespace string). Within the input, extra spaces or tabs do not matter; extra indentation up to the next outline level is accepted. It does not require empty lines between outline levels, as it tracks the preferred outline level for each line, and only inserts the outline definition before the first non-whitespace character when the outline level changes.

Code:

#!/usr/bin/awk -f
#
# -v tab=8
#       set tab stops at every eight columns (the default).
#
# -v template=template.tex
#       set the path to the LaTeX template file.
#
# -v title=text
#       set the text that replaces ~title~ in the template.
#

# Convert tabs to spaces.
function detab(detab_line) {

    while ((detab_pos = index(detab_line, "\t")) > 0)
        # Pad to the next tab stop: tab - ((detab_pos - 1) % tab) spaces.
        detab_line = substr(detab_line, 1, detab_pos - 1) substr(tabsp, (detab_pos - 1) % tab + 1) substr(detab_line, detab_pos + 1)

    return detab_line
}

BEGIN {
    # -v tab=N sets tab width to N spaces.
    if (tab < 1) tab = 8;

    # tabsp is a tab-length string of spaces.
    tabsp = "        "
    while (length(tabsp) < tab) tabsp = tabsp tabsp
    tabsp = substr(tabsp, 1, tab)

    # -v template=path sets the default template.
    if (length(template) < 1) template = "template.tex"

    # Array mapping indentation in spaces to outline level.
    split("", outline)
    outline[0] = 1      # No indentation maps to outline level 1.
    maxspaces  = 0

    # Record separator is a newline, including trailing whitespace.
    RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)"

    # Field separator is consecutive whitespace.
    FS = "[\t\v\f ]+"

    # Read the header part of the template.
    while ((getline < template) > 0) {

        # "Content" marks the end of the header.
        if (tolower($1) == "content")
            break

        # Does this line define the indentation level?
        if ($0 ~ /^[\t\v\f ]*%[\t\v\f ]*\\outl{/) {

            # Convert tabs to spaces first.
            line = detab($0)

            # Remove everything up to and including the leading percent sign.
            sub(/^[^%]*%/, "", line)

            # Calculate the indentation in spaces.
            spaces = index(line, "\\") - 1

            # Parse the outline parameter.
            level = line
            sub(/^[^{]*{/, "", level)
            sub(/}.*$/,    "", level)
            level = int(level)

            # Add to outline array.
            if (spaces >= 0 && level > 0) {
                outline[spaces] = level
                if (spaces > maxspaces)
                    maxspaces = spaces
            }

            # Do not include this line in the template.
            continue
        }

        # This line is output as part of the header.
        line = $0

        # Replace title.
        gsub(/~[Tt]itle~/, title, line)

        # Remove comments.
        gsub(/[\t\v\f ]%[^{}\\]*$/, "", line)

        printf("%s\n", line)
    }

    # Fill in the number-of-spaces-to-outline-level mapping.
    # Each indentation in the template is the minimum;
    # extra spaces are allowed (up to the next level).
    level = outline[0]
    for (spaces = 1; spaces < maxspaces; spaces++)
        if (outline[spaces] > 0)
            level = outline[spaces]
        else
            outline[spaces] = level

    # Start without outline.
    level = 0
}

/^[\t\v\f ]*%/ {
    # Skip comment lines.
    next
}

{
    line = $0

    # Remove comments.
    sub(/[\t\v\f ]%[^{}\\]*$/, "", line)

    # Create a prefix of just the indentation.
    prefix = line
    sub(/[^\t\v\f ].*$/, "", prefix)

    # Indentation size in spaces.
    spaces = length(detab(prefix))

    # We only need the length of the prefix in the input line.
    prefix = length(prefix)

    # Find out the outline level for this indentation.
    if (spaces > maxspaces)
        newlevel = outline[maxspaces]
    else
        newlevel = outline[spaces]

    # Outline level change?
    if (level != newlevel) {
        level = newlevel
        line = substr(line, 1, prefix) "\\outl{" level "}" substr(line, prefix + 1)
    }

    printf("%s\n", line)
}

END {
    # Output template footer.
    while ((getline line < template) > 0)
        printf("%s\n", line)

    close(template)
}
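
For readers who find the awk hard to follow, the two central ideas (expanding tabs to spaces, then mapping an indentation width to an outline level) can be sketched in Python 3. This is only an illustration under stated assumptions, not a translation of the script above: the function names are invented, and the minimums dictionary stands in for the indentation-to-level mapping the script reads from the template. It is assumed to contain an entry for 0.

```python
def detab(text, tab=8):
    """Expand tabs to spaces, with tab stops every `tab` columns.

    For text without line breaks this matches str.expandtabs(tab).
    """
    out = []
    col = 0
    for ch in text:
        if ch == "\t":
            pad = tab - (col % tab)  # spaces needed to reach the next stop
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)


def indent_width(line, tab=8):
    """Width of the leading whitespace of `line`, measured in spaces."""
    stripped = line.lstrip(" \t\v\f")
    prefix = line[: len(line) - len(stripped)]
    return len(detab(prefix, tab))


def level_for(spaces, minimums):
    """Outline level for an indentation of `spaces` columns.

    `minimums` maps a minimum indentation to a level (hypothetical format);
    extra indentation up to the next entry keeps the same level.
    """
    best = max(width for width in minimums if width <= spaces)
    return minimums[best]


if __name__ == "__main__":
    mapping = {0: 1, 8: 2, 16: 3}
    for sample in ["top", "\tsecond", "    \tstill second", "\t\t  third"]:
        print(level_for(indent_width(sample), mapping), repr(sample))
```

Because the width is computed after tab expansion, a line indented with four spaces and a tab lands on the same tab stop as a line indented with a single tab, which is why mixed tabs and spaces are treated alike.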

With this, or Cedrik's Perl script, you can use a single Bash command to regenerate the LaTeX and PDF files whenever you save your file. You'll also need inotifywait from the inotify-tools package. For example, if your text files have names ending in .txt and live in the current working directory and/or its subdirectories, with the template being the default template.tex in each directory, and the above script is named text2latex, run

Code:

inotifywait -q -m -e close_write,moved_to --format '%w%f' -r . | while IFS= read -r FILE ; do
    [ "$FILE" = "${FILE%%.txt}" ] && continue
    NAME="${FILE##*/}"
    DIR="${FILE%/*}"
    [ -f "$DIR/$NAME" ] || continue
    [ -f "$DIR/template.tex" ] || continue

    clear
    echo -n "$FILE: "
    date

    TEMP="$FILE.$$"
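    # Use the template beside the input file, and write pdflatex output there too.
    ./text2latex -v template="$DIR/template.tex" -v title="${NAME%.txt}" "$FILE" > "$TEMP.tex" && \
        pdflatex -output-directory="$DIR" "$TEMP.tex" && \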
        mv -f "$TEMP.pdf" "$FILE.pdf" && \
        mv -f "$TEMP.tex" "$FILE.tex"
    rm -f "$TEMP."*
done

in another shell. Then, every time you save (or copy/rename/move) a file name ending with .txt, that will automatically create/overwrite the .txt.tex and .txt.pdf files for you.
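
The parameter expansions in that loop (${FILE%%.txt}, ${FILE##*/}, ${FILE%/*}) may look cryptic if you are new to bash. As a rough guide, here is what they compute, written out in Python 3. The function name and the dictionary keys are invented for this illustration, and nothing here runs pdflatex or touches any files.

```python
import os


def derived_names(path, pid):
    """Mirror the shell expansions used in the loop above (illustration only).

    Returns None for paths that do not end in .txt, which is the case the
    loop skips with `continue`.
    """
    if not path.endswith(".txt"):
        return None
    directory, name = os.path.split(path)  # ${FILE%/*} and ${FILE##*/}
    temp = f"{path}.{pid}"                 # TEMP="$FILE.$$"
    return {
        "dir": directory,
        "name": name,
        "title": name[: -len(".txt")],     # ${NAME%.txt}
        "temp_tex": temp + ".tex",
        "temp_pdf": temp + ".pdf",
        "final_tex": path + ".tex",
        "final_pdf": path + ".pdf",
    }


if __name__ == "__main__":
    print(derived_names("./notes/outline.txt", 1234))
```

So saving ./notes/outline.txt would lead to ./notes/outline.txt.tex and ./notes/outline.txt.pdf, with the temporary names carrying the shell's process ID in between.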

If you like to use evince to look at the PDFs, you'll soon notice it lacks the option to watch the files; that is, it will not automatically reload the PDF file when the file changes. (You need to hit Ctrl-R to see the updates.) To make life easier, you could run in yet another shell

Code:

inotifywait -q -m -e moved_to --format '%w%f' -r . | while IFS= read -r FILE ; do
    [ "${FILE}" = "${FILE%%.txt.pdf}" ] && continue
    kill -HUP $(jobs -p)
    evince "$FILE" </dev/null >/dev/null 2>/dev/null &
done

which will always reopen one instance of evince to the latest .txt.pdf file in the working directory (or any subdirectory). It will not interfere with any other evince instances, and if you happen to close the window, it'll just reopen when the next PDF file emerges.

Pretty nifty, eh?

In a different thread I tried to explain why the Unix philosophy, using small interchangeable modules to construct complex tools, is way better than large monolithic applications that direct you to work in a certain way. Above, you only use bash, awk, pdflatex, inotifywait, evince, and your favourite text editor nano (my preference too, actually!), to construct a fully automated document generation suited to your needs. Talk about powerful...

jamtat · 01-28-2012, 11:27 PM

Wow. That is some script you've put together, Nominal. I'm impressed. And even more impressed by the additional suggestions for how to keep the various forms of the files updated. That looks like a lot of work. I'm truly grateful.

That said, understanding the workings of this script is way beyond me. I've looked at it several times now to see if I can get some idea of which does what. I get lost almost immediately and have to give up.

I have tested it though, and it is, of course, quite effective. I assumed it needed to be saved with an *.awk extension and would then need to be chmod +x'd, so I did that. I wasn't sure yet whether it should be used in the same way as Cedrik's (i.e., script.awk outline.file > tex-file.tex), but some brief experimentation cleared that up. I was able this way to do a largely successful first run.

By largely successful what I mean is that the script properly identified and marked up most, though not all, outline levels. I do have some questions about that, but those will need to be prefaced by a bit of explanation. Maybe I can move to that later in this response--assuming you'll be able to devote a bit more attention to clarifying some things. But first, I have some other questions.

I'm not quite understanding about the title aspect of the script. It seems you might be allowing here for some way to sort of automate entry of the header text (i.e., what goes between curly brackets at {***Header*title*here***})--certainly a helpful addition: have I understood correctly? If so, I'm failing to gather from looking at the script from where the input for that is supposed to come. Is it reading some part of the input file for that information? Further clarification on that part of the script will be appreciated.

I haven't quite understood what you've said about your script's handling of spaces as opposed to tabs, either. That touches on an issue I was struggling to comprehend: namely, whether any script that could process these files would be able to distinguish tab spaces from single spaces. As you may be aware, the nano tweak I applied in order to get nano's color highlighting to work on my outline files does not distinguish between the two. It sounds as though your script treats them the same, true? If so, that seems like a plus.

Which brings up another issue I'm wondering about: since your script does not seem to rely on my pseudo-bullet (the equals sign) how does it distinguish an outline level from, say, a wrapped line? I never managed to understand whether, when nano does line wrapping (which I have it set to do), it inserts an end-of-line mark then a new-line mark at the beginning of the next line. If your script searches for (regular expression) new-line marks, then those must occur only when a carriage return is entered, rather than at points where nano simply wraps a line? Clarification on that will be appreciated.

On the pseudo-bullet character I've chosen, I'll just mention that I chose it for two reasons. One is that it can help me better to distinguish outline levels when I'm looking, on a screen, at one of my outline files under nano. Perhaps just as importantly though, I decided such a unique character might be needed in order for some search-and-replace script to even work. Yet your script seems to work fairly effectively even without the presence of the pseudo-bullet (something I discovered by accident, btw). Can you clarify how that happens? What I was trying to do in some initial experiments with searching and replacing regular expressions using nano's built-in search-and-replace function, was to get it to detect instances of end-of-line followed by new-line. I couldn't get it to detect such a combination and was unsure, in any case, whether line wrapping would entail that end-of-line/new-line combination as well. So I decided the pseudo-bullet would probably be needed and, in any case, would be helpful in distinguishing outline levels under nano on the screen. So I introduced it. Maybe I should be rethinking that?

I think I'll leave my questions at that for now and perhaps pose others later, if you will have any more time to devote to this thread. In conclusion, yes, what you've put together is truly nifty. Thanks again for your input on this!

James

Dark_Helmet · 01-29-2012, 02:58 AM

Hi jamtat,

I know you have a working solution, but I wanted to respond. I'm forcing myself to do Python scripting so I can learn it. So, since nobody has posted a Python solution, I'll post mine. It will work the same way that Cedrik's perl script does (i.e. "script.py inputfile.txt > outputfile.txt")

EDIT:
I modified the script based on Cedrik's point (in a later response) that tabs appearing after the equal sign would cause problems with the "\outl{}" text.
/EDIT

EDIT2:
This script runs on my 2.6.6 Python interpreter. As jamtat later discovers, it will not work for 3.2.2. The problem is a change in Python's print() syntax. An updated script for 3.2.2 is posted on the next page of this thread.
/EDIT2

I named the script "texoutline.py" but as long as you use the ".py" extension and adjust your path for python at the top (if necessary), it should work:

Code:

#!/usr/bin/python

import sys
import re

if( len( sys.argv ) != 2 ):
    print >> sys.stderr, "{0} requires one filename to process.".format( sys.argv[0].split('/')[-1] )
    sys.exit( 1)

try:
    rawOutline = open( sys.argv[1], 'r' )
except:
    print >> sys.stderr, "Unable to open {0} for reading".format( sys.argv[1] )
    sys.exit( 2 )

print ( '\\documentclass{article}\n'
        '\\usepackage{cjwoutl}\n'
        '\\usepackage[top=1in,bottom=1in,left=1in,right=1in]{geometry}\n'
        '\\pagestyle{myheadings}\n'
        '\\markright{\\today{\\hfill \\Large{***Header*title*here***}\\hfill}}\n'
        '\\linespread{1.3} % gives 1.5 line spacing\n'
        '\\begin{document}\n'
        '\\begin{outline}[new]\n'
        '\\begin{Large} % gives ca. 14 pt font' )

for inputLine in rawOutline:
    reMatches = re.match( r"(\t*)=(.*)", inputLine )
    if( reMatches == None ):
        print inputLine.rstrip()
    else:
        tabCount = len( reMatches.group(1).split('\t') )
        print "{0}\\outl{{{1:d}}}{2}".format( reMatches.group(1), tabCount, reMatches.group(2) )

print ( '\\end{Large}\n'
        '\\end{outline}\n'
        '\\end{document}\n' )

Since I'm learning, it may not be very Python-ish style-wise. Oh, and one last note: Python cares about indentation. So if you try the script yourself, make sure that the indentation is preserved. Given the problem is all about outlines, I don't think that should be a problem.
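
On the print() change mentioned in EDIT2: the snippet below is not the updated script from the next page of the thread, just a minimal sketch of how the two print forms used above are usually written for Python 3. The helper names warn and emit are made up for the example.

```python
import sys


def warn(message, stream=None):
    """Python 3 form of the Python 2 statement: print >> sys.stderr, message"""
    print(message, file=stream if stream is not None else sys.stderr)


def emit(line, stream=None):
    """Python 3 form of the Python 2 statement: print line.rstrip()"""
    print(line.rstrip(), file=stream if stream is not None else sys.stdout)


if __name__ == "__main__":
    warn("{0} requires one filename to process.".format("texoutline.py"))
    emit("\t=Second level\n")
```

In Python 3, print is an ordinary function, so the arguments go in parentheses and the destination is given with the file keyword instead of the >> form.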

Nominal Animal · 01-29-2012, 05:42 AM

@Dark_Helmet: Nice; MUCH easier to read than mine, that's for sure!

Quote: