scripting: how to change markdown links to wikitext links?

Niels Olson · 02-03-2009, 10:21 AM

Hello,

I have a personal wiki of notes, with now thousands of links in markdown format:

[link text](http://example.com)

but now that fckeditor is available for mediawiki (very beta), it has become much better to just stick with wikitext format. There are only a few conversions to do: tables, links, and bulleted lists. The lists are a fairly simple regex and fckeditor magically reformats the tables, so all I'm left with is the links. But I'm not a regex master. How do I reformat

[link text](http://example.com)

to this

[http://example.com link text]

The steps, in no particular order, are:

* delete ")"
* add a space after "example.com" ==> "example.com "
* move "link text](" to the end
* delete "("

Any suggestions would be greatly appreciated.

colucix · 02-03-2009, 10:37 AM

You can try the following sed command:

Code:

sed -i.bck 's/\[link text\](\(.*\))/\[\1 link text\]/g' file

The -i.bck option edits the file in place, after first saving a backup copy of the original with .bck as the extension.
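To see what -i.bck leaves behind, here is a small sketch run against a throwaway file. The file name notes.txt and the temporary directory are made up for the demonstration, and -i itself is an extension found in GNU and BSD sed rather than part of POSIX sed.

```shell
#!/bin/sh
# Demo of in-place editing with a backup; notes.txt is a made-up file name.
workdir=$(mktemp -d)
printf '%s\n' '[link text](http://example.com)' > "$workdir/notes.txt"

# Same substitution as above: edits notes.txt, keeps the original as notes.txt.bck
sed -i.bck 's/\[link text\](\(.*\))/[\1 link text]/g' "$workdir/notes.txt"

cat "$workdir/notes.txt"       # the converted line
cat "$workdir/notes.txt.bck"   # the untouched original
```

If the result is wrong, copying the .bck file back over the edited one restores the original.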

Niels Olson · 02-03-2009, 01:25 PM

Awesome, thanks so much. That is a really nice way to do it. Now I just need to figure out how to identify the stuff in between [] and, I guess, make that a variable in place of "link text" in that sed command. Is this best done in a particular language? I really wish I understood how to do this kind of very simple thing . . .

Code:

** [Varicella zoster](http://en.wikipedia.org/wiki/Varicella_zoster_vaccine)
** [Intranasal influenza](http://en.wikipedia.org/wiki/Intranasal_influenza_vaccine)
** [Oral polio (Sabin's)](http://en.wikipedia.org/wiki/Sabin_vaccine)
** [Yellow fever](http://en.wikipedia.org/wiki/Yellow_fever_vaccine)
** [Rotavirus](http://en.wikipedia.org/wiki/Rotavirus_vaccine)
** [Smallpox](http://en.wikipedia.org/wiki/Smallpox_vaccine)

theNbomr · 02-03-2009, 01:41 PM

This perl code should perform the conversion, even if the tagged data is embedded in the middle of a line of text, as I expect would be the norm for your application.

Code:

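#!/usr/bin/perl -w
use strict;

# Read each line from the files named on the command line (or from stdin)
while (<>) {
    # $1 = link text, $2 = URL; \s* tolerates whitespace between ] and (
    if ( $_ =~ m/\[(.+)\]\s*\((.+)\)/ ) {
        # $` and $' hold the text before and after the match
        print $` . "[" . $2 . " " . $1 . "]" . $';
    }
    else {
        print $_;
    }
}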

Run it with the name of the input file as an argument.

--- rod.

Niels Olson · 02-03-2009, 01:55 PM

You know, I thought it would come down to Perl. I have been whittling away at simple Python programs and hacking PHP and CSS on a couple of server projects, avoiding perl like the plague, but you just convinced me. I need to go buy a Perl book. Because you just saved me so much time I could literally drag myself to the bookstore with my tongue and figure out how that script works while reading upside down and backwards and that little script would still save me an order of magnitude more time. What timezones are you guys in, so that I may bow in your direction?

colucix · 02-03-2009, 01:58 PM

Maybe this:

Code:

sed -i.bck 's/\[\(.*\)\](\(.*\))/\[\2 \1\]/g' file

To record part of the matching text, enclose it in escaped parentheses in the regular expression. Then use \1 for the first recorded text, \2 for the second one, and so on. Here is a colorful version of the command above to distinguish the two recorded patterns:

Code:

s/\[\(.*\)\](\(.*\))/\[\2 \1\]/g

The other relevant parts of this sed expression are the escaped square brackets; left unescaped, sed would interpret them as a character list. Finally, the .* pattern matches any sequence of characters.
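One thing to watch for: .* is greedy, so when a line holds two links the first group can stretch from the first [ all the way to the last ](. A possible variant, offered only as a sketch, replaces .* with negated character lists. It assumes that link text never contains "]" and that URLs never contain ")", which will not hold for every URL. The function name convert_links is made up.

```shell
#!/bin/sh
# Sketch: convert every Markdown link on a line, not just one.
# convert_links is a made-up name; it reads the files given as arguments, or stdin.
convert_links() {
    # [^]]* = link text (anything but "]"), [^)]* = URL (anything but ")")
    sed 's/\[\([^]]*\)\](\([^)]*\))/[\2 \1]/g' "$@"
}

printf '%s\n' 'See [a](http://x.example) and [b](http://y.example).' | convert_links
# prints: See [http://x.example a] and [http://y.example b].
```

Passing "$@" through to sed lets the same function work on a named file or in a pipeline.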

Here is a good tutorial about sed programming, if you want to dig a bit deeper.

theNbomr · 02-03-2009, 05:12 PM

Quote:

Originally Posted by Niels Olson

You know, I thought it would come down to Perl. I have been whittling away at simple Python programs and hacking PHP and CSS on a couple of server projects, avoiding perl like the plague, but you just convinced me. I need to go buy a Perl book. Because you just saved me so much time I could literally drag myself to the bookstore with my tongue and figure out how that script works while reading upside down and backwards and that little script would still save me an order of magnitude more time. What timezones are you guys in, so that I may bow in your direction?

Ah shucks. Actually, colucix did it best with sed (assuming it actually works). Python and PHP are definitely not best for this job. Awk would also be high on the list. Perl is just like sed with a programming language built around it, and since I like to write, it feels right for me. I don't know about all those colors in the sed code, though; kinda makes me dizzy along with all of those backslashes.
--- rod.

Niels Olson · 02-03-2009, 06:09 PM

Okay, I tried sed for my grand markdown-2-wikitext translator. I actually got a good sed command for the lists, I thought, before I posted, so I thought I'd try to go all sed. (That and Larry Wall's "Programming Perl" seems to require more than an hour or two to grok). Anyway, here's my script, which fails miserably, and I'm trying to run it with this command:

niels@school$ sh md2wt.sh immunizations_md

md2wt.sh:

Code:

#! /bin/bash

# First, headlines. Every # should be replaced with surrounding = =
# that is, #Title becomes =Title= and ##Subtitle becomes ==Subtitle==
# This also needs a while loop.

sed 's/\#\(.*\)/\=\1\=/g' |

# Second, top level lists, which, in markdown, are numerical. But, and 
# here's a nasty but, sometimes, if 2. immediately follows 1., then 
# fckeditor also strips the newline, so 2. is on the same line. I need 
# to put the newline back in also. fckeditor will take out redundant 
# lines later.

sed 's/[0-9]\.\ /\n\*/g' | /bin/echo |

# Third, the deeper lists. Markdown uses a sort of pythony 4 spaces for
# each level of indention, but wikitext just uses another * for each
# level, so "    *" becomes "**" and "        *" becomes "***"
# this also needs a while loop

sed 's/\ \ \ \ \*/\*\*/g' |

# Finally, the grand poo bah, the links

sed 's/\[\(.*\)\](\(.*\))/\[\2 \1\]/g'

Where did I go wrong?

Edit: I added examples of input (immunizations_md.txt) and desired output (immunizations_perfect.txt).

Edit2: added comments that I need a while loop in here.

Edit3: went looking for how to parse the perl. Regarding =~, perl.org says "everyone knows how =~ and =! work" (http://dev.perl.org/perl6/rfc/164.html). Um, I don't know.

theNbomr · 02-03-2009, 07:00 PM

Okay, since you probably wanted a one-liner, and since I got way too wordy on my first effort:

Code:

perl -e 'while(<>){ $_ =~ s/\[(.+)\]\s*\((.+)\)/[$2 $1]/g; print;}'  filename.whatever

I still think that's more readable than the sed version (leaves out a few tilty sticks)...
--- rod.

Niels Olson · 02-03-2009, 07:18 PM

Don't worry, I have a degree in physics. I mean . . . don't take that the wrong way, but I feel comfortable with nested functions and multiple lines, and I like the general idea that things like perl leave you to use whitespace as you like. If I could grok the perl *generally* I would probably prefer to do it that way. In fact, that is really my more general goal. I'm comfortable with the sed function (right now), and I'd really rather get better at parsing a more general language, and, since I tend to do more sysadmin than anything, perl seems to be a logical choice. I've hacked other people's perl scripts, for cronjobs, rsync, etc., just haven't written my own script to solve my own problem, hence, in part, taking advantage of this real world exercise.

For instance, what is the s* for? Is that "second argument" or "new stanza" or what? and the =~, is that for "approximately equal", and if so, approximately equal to what?

And how would I introduce the other three functions in my "md2wt" translator (above) if I rewrote it in perl? Would I need to nest the functions inside your first while function (and then nest additional while functions inside)? How does one syntactically do that? What's "$_" and what's up with the parens and angle brackets for the while loop?

colucix · 02-03-2009, 07:54 PM

Well, here is a slightly modified version of your script. Does it work as expected?

Code:

#!/bin/bash
#
# Repeat the sequence of hashes at the end of the line
#
sed 's/^\(#*\)\(.*\)/\1\2\1/g' immunizations_md.txt |
#
# Substitute all the hashes with equal signs
#
sed 's/#/=/g' |
#
# Substitute list numbers such as "1. " with an asterisk (no newline is added)
#
sed 's/[0-9]\.\ /\*/g' |
#
# First take care of the inner item in the list
#
sed 's/         \*/\*\*\*/g' |
#
# Second take care of the rest
#
sed 's/    \*/\*\*/g' |
#
# Change the links
#
sed 's/\[\(.*\)\](\(.*\))/\[\2 \1\]/g'
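One side effect of the s/#/=/g step is that it also rewrites any # inside a URL fragment later on the line. A possible alternative, offered as a sketch rather than a drop-in replacement, converts only the leading hashes by looping with a sed label and the t command. It also takes the input file as an argument instead of hard-coding the name. The function name headings is made up.

```shell
#!/bin/sh
# Sketch: turn leading Markdown hashes into balanced wikitext "=" signs.
# headings is a made-up name; it reads the files given as arguments, or stdin.
headings() {
    # Each pass swaps one leading "#" for "=" and appends one "=" at the end;
    # "t h" jumps back to ":h" for as long as a substitution succeeded.
    sed -e ':h' -e 's/^\(=*\)#\(.*\)$/\1=\2=/' -e 't h' "$@"
}

printf '%s\n' '##Subtitle' 'see [x](http://example.com/page#part)' | headings
# prints: ==Subtitle==
#         see [x](http://example.com/page#part)
```

Called as headings immunizations_md.txt, it would slot in as the first stage of the pipeline above.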

Niels Olson · 02-03-2009, 10:26 PM

Works like a champ. Awesome. One note: I had a typo in the sample text. Spaces should only come in groups of 4, and I apparently had 9 in there. Sorry.

Here's what I've got right now:

Code:

#!/bin/bash
#
# Repeat the sequence of hashes at the end of the line
#
sed 's/^\(#*\)\(.*\)/\1\2\1/g' md2wt.pad |
#
# Substitute all the hashes with equal signs
#
sed 's/#/=/g' |
#
# Substitute list numbers such as "1. " with an asterisk (no newline is added)
#
sed 's/[0-9]\.\ /\*/g' |
#
# This really seems like it needs some recursion, doesn't it?
#
sed 's/                        \*/\*\*\*\*\*\*\*/g' |
sed 's/                    \*/\*\*\*\*\*\*/g' |
sed 's/                \*/\*\*\*\*\*/g' |
sed 's/            \*/\*\*\*\*/g' |
sed 's/        \*/\*\*\*/g' |
sed 's/    \*/\*\*/g' |
#
# Change the links
#
sed 's/\[\(.*\)\](\(.*\))/\[\2 \1\]/g'

One thing this has highlighted is that there really are some newlines that I need to clean up and I'm not clear yet whether they were manually put in by me, or if it's an effect of fckeditor's parsing.
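On the recursion question: sed can loop with a label and the t command, so one rule can stand in for the stack of per-depth substitutions. The following is a sketch under the assumption that nesting is always exactly 4 spaces per level and that each bullet is written as "* " with a trailing space. The function name nest_bullets is made up.

```shell
#!/bin/sh
# Sketch: one looping rule instead of one sed per nesting depth.
# nest_bullets is a made-up name; it reads the files given as arguments, or stdin.
nest_bullets() {
    # Each pass turns one group of 4 leading spaces (after any asterisks already
    # produced) into "*", provided a "* " bullet still follows; "t b" repeats.
    sed -e ':b' -e 's/^\(\**\)    \( *\* \)/\1*\2/' -e 't b' "$@"
}

printf '%s\n' '* top' '    * second' '        * third' | nest_bullets
# prints: * top
#         ** second
#         *** third
```

Requiring the bullet after the spaces keeps the rule from touching indented lines that are not list items.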

theNbomr · 02-04-2009, 10:17 AM

Quote:

Originally Posted by Niels Olson

For instance, what is the s* for? Is that "second argument" or "new stanza" or what? and the =~, is that for "approximately equal", and if so, approximately equal to what?

In perl regular expressions, '\s' is shorthand for 'whitespace'. The '*' modifier says 'zero or more of the previous token'. The net effect is that it allows zero or more whitespace characters between the two parts of the original parsed text. As a general practice, I like to put these into my regexes to make them more general. Some syntaxes allow whitespace in various places, and when allowed, whitespace is not always applied consistently. I don't think sed has the shorthand notations that perl has, and I know it doesn't have a lot of the extended regular expression syntax that perl has. I do find the perl shorthand notations handy, but I don't often use the extended regexes, although I'm sure there are times when I could have used them to good effect if I had the notation in my head instead of having to look it up. I suppose that's kind of where you're at with the whole language and regular expression thing.
--- rod.
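For comparison, POSIX sed does offer a counterpart to \s in the form of the character class [[:space:]], and GNU sed also accepts \s as an extension. A sketch of the link rule with optional whitespace between the two parts follows. As before, it assumes no "]" in the link text and no ")" in the URL, and the function name convert_links_ws is made up.

```shell
#!/bin/sh
# Sketch: tolerate optional whitespace between "]" and "(", as the Perl \s* does.
# convert_links_ws is a made-up name; it reads the files given as arguments, or stdin.
convert_links_ws() {
    sed 's/\[\([^]]*\)\][[:space:]]*(\([^)]*\))/[\2 \1]/g' "$@"
}

printf '%s\n' '[link text] (http://example.com)' | convert_links_ws
# prints: [http://example.com link text]
```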