scripting: how to change markdown links to wikitext links?
but now that fckeditor is available for mediawiki (very beta), it has become much better to just stick with wikitext format. There are only a few conversions to do: tables, links, and bulleted lists. The lists are a fairly simple regex and fckeditor magically reformats the tables, so all I'm left with is the links. But I'm not a regex master. How do I reformat
Awesome, thanks so much. That is a really nice way to do it. Now I just need to figure out how to identify the stuff in between [] and, I guess, make that a variable in place of "link text" in that sed command. Is this best done in a particular language? I really wish I understood how to do this kind of very simple thing . . .
This perl code should perform the conversion, even if the tagged data is embedded in the middle of a line of text, as I expect would be the norm for your application.
You know, I thought it would come down to Perl. I have been whittling away at simple Python programs and hacking PHP and CSS on a couple of server projects, avoiding perl like the plague, but you just convinced me. I need to go buy a Perl book. Because you just saved me so much time I could literally drag myself to the bookstore with my tongue and figure out how that script works while reading upside down and backwards and that little script would still save me an order of magnitude more time. What timezones are you guys in, so that I may bow in your direction?
sed -i.bck 's/\[\(.*\)\](\(.*\))/\[\2 \1\]/g' file
To record a part of the matching text, just embed it within escaped parentheses in the regular expression. Then use \1 for the first recorded text, \2 for the second one, and so on. Here is a colorful version of the command above to distinguish the two recorded patterns:
Code:
s/\[\(.*\)\](\(.*\))/\[\2 \1\]/g
The other relevant parts of this sed expression are the escaped square brackets; left unescaped, sed would interpret them as a character list. Finally, the .* symbol matches any sequence of characters, including an empty one.
Here is a good tutorial about sed programming, if you want to dig a bit deeper.
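One caveat about .* is that it is greedy: on a line that holds more than one link, it can swallow everything between the first [ and the last ). A minimal sketch of a stricter pattern follows, assuming link text never contains ] and URLs never contain ). The function name md_links is made up for illustration and does not appear anywhere in this thread.

```shell
#!/bin/bash
# Sketch only: md_links is a hypothetical helper name.
# [^]]* matches any run of characters except ], and [^)]* any run
# except ), so each link on a line is converted separately.
md_links() {
    sed 's/\[\([^]]*\)](\([^)]*\))/[\2 \1]/g'
}

# Two links on one line
printf 'see [a](http://x) and [b](http://y)\n' | md_links
```

The brackets in the replacement part do not need backslashes, since they have no special meaning there.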
Ah shucks. Actually, colucix did it best with sed (assuming it actually works). Python and PHP are definitely not best for this job. Awk would also be high on the list. Perl is just like sed with a programming language built around it, and since I like to write, it feels right for me. I don't know about all those colors in the sed code, though; kinda makes me dizzy along with all of those backslashes.
--- rod.
Okay, I tried sed for my grand markdown-2-wikitext translator. I actually got a good sed command for the lists, I thought, before I posted, so I thought I'd try to go all sed. (That and Larry Wall's "Programming Perl" seems to require more than an hour or two to grok). Anyway, here's my script, which fails miserably, and I'm trying to run it with this command:
niels@school$ sh md2wt.sh immunizations_md
md2wt.sh:
Code:
#! /bin/bash
# First, headlines. Every # should be replaced with surrounding = =
# that is, #Title becomes =Title= and ##Subtitle becomes ==Subtitle==
# This also needs a while loop.
sed 's/\#\(.*\)/\=\1\=/g' |
# Second, top level lists, which, in markdown, are numerical. But, and
# here's a nasty but, sometimes, if 2. immediately follows 1., then
# fckeditor also strips the newline, so 2. is on the same line. I need
# to put the newline back in also. fckeditor will take out redundant
# lines later.
sed 's/[0-9]\.\ /\n\*/g' | /bin/echo |
# Third, the deeper lists. Markdown uses a sort of pythony 4 spaces for
# each level of indention, but wikitext just uses another * for each
# level, so "    *" becomes "**" and "        *" becomes "***"
# this also needs a while loop
sed 's/\ \ \ \ \*/\*\*/g' |
# Finally, the grand poo bah, the links
sed 's/\[\(.*\)\](\(.*\))/\[\2 \1\]/g'
Where did I go wrong?
Edit: I added examples of input (immunizations_md.txt) and desired output (immunizations_perfect.txt).
Edit2: added comments that I need a while loop in here.
Edit3: went looking for how to parse the perl. Regarding =~, perl.org says "everyone knows how =~ and =! work" (http://dev.perl.org/perl6/rfc/164.html). Um, I don't know.
Last edited by Niels Olson; 02-03-2009 at 06:58 PM.
Reason: added file attachment
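Two plumbing problems in the script above look likely to explain the failure, independent of the regexes. The file name given on the command line is never used, so the first sed waits for input on the terminal. And /bin/echo does not read its standard input, so whatever is piped into it is thrown away. Here is a minimal sketch of just the two list stages with both points addressed. The function name md2wt_lists is invented for illustration, and the newline insertion is left out to keep the sketch short.

```shell
#!/bin/bash
# Sketch only: md2wt_lists is a hypothetical name, not from the thread.
md2wt_lists() {
    # "$1" is the file named on the command line; the first sed reads
    # it instead of waiting on standard input.
    sed 's/[0-9]\. /*/g' "$1" |
    # No echo stage: the output of the first sed goes straight to the
    # next one. Four spaces plus * become **.
    sed 's/    \*/**/g'
}
```

It would be called as md2wt_lists immunizations_md from inside the script, or with "$1" in place of the file name to take the name from the command line.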
Don't worry, I have a degree in physics. I mean . . . don't take that the wrong way, but I feel comfortable with nested functions and multiple lines, and I like the general idea that things like perl leave you free to use whitespace as you like. If I could grok the perl *generally* I would probably prefer to do it that way. In fact, that is really my more general goal. I'm comfortable with the sed function (right now), and I'd really rather get better at parsing a more general language, and, since I tend to do more sysadmin than anything, perl seems to be a logical choice. I've hacked other people's perl scripts, for cronjobs, rsync, etc., I just haven't written my own script to solve my own problem, hence, in part, taking advantage of this real world exercise.
For instance, what is the s* for? Is that "second argument" or "new stanza" or what? and the =~, is that for "approximately equal", and if so, approximately equal to what?
And how would I introduce the other three functions in my "md2wt" translator (above) if I rewrote it in perl? Would I need to nest the functions inside your first while function (and then nest additional while functions inside)? How does one syntactically do that? What's "$_" and what's up with the parens and angle brackets for the while loop?
Last edited by Niels Olson; 02-03-2009 at 07:42 PM.
Well, here is a slightly modified version of your script. Does it work as expected?
Code:
#!/bin/bash
#
# Repeat the sequence of hashes at the end of the line
#
sed 's/^\(#*\)\(.*\)/\1\2\1/g' immunizations_md.txt |
#
# Substitute all the hashes with equal signs
#
sed 's/#/=/g' |
#
# Substitute list numbers with an asterisk
#
sed 's/[0-9]\.\ /\*/g' |
#
# First take care of the inner item in the list
#
sed 's/ \*/\*\*\*/g' |
#
# Second take care of the rest
#
sed 's/ \*/\*\*/g' |
#
# Change the links
#
sed 's/\[\(.*\)\](\(.*\))/\[\2 \1\]/g'
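A side effect of the second stage above is that s/#/=/g rewrites every hash in the file, including one inside a URL fragment. The sketch below is one possible way to limit the heading conversion to lines that start with a hash, using a sed address and a command block. The name md_headings is hypothetical, and a hash inside the heading text itself would still be converted.

```shell
#!/bin/bash
# Sketch only: md_headings is a hypothetical helper name.
# The /^#/ address restricts both substitutions to heading lines.
md_headings() {
    sed -e '/^#/{' \
        -e 's/^\(#*\)\(.*\)/\1\2\1/' \
        -e 's/#/=/g' \
        -e '}'
}

printf '##Subtitle\nsee http://a/b#c\n' | md_headings
```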
Works like a champ. Awesome. One note, I had a typo in the sample text, so there should be only spaces in groups of 4. I apparently had 9 in there. Sorry.
here's what I've got right now
Code:
#!/bin/bash
#
# Repeat the sequence of hashes at the end of the line
#
sed 's/^\(#*\)\(.*\)/\1\2\1/g' md2wt.pad |
#
# Substitute all the hashes with equal signs
#
sed 's/#/=/g' |
#
# Substitute list numbers with an asterisk
#
sed 's/[0-9]\.\ /\*/g' |
#
# This really seems like it needs some recursion, doesn't it?
#
sed 's/                        \*/\*\*\*\*\*\*\*/g' |
sed 's/                    \*/\*\*\*\*\*\*/g' |
sed 's/                \*/\*\*\*\*\*/g' |
sed 's/            \*/\*\*\*\*/g' |
sed 's/        \*/\*\*\*/g' |
sed 's/    \*/\*\*/g' |
#
# Change the links
#
sed 's/\[\(.*\)\](\(.*\))/\[\2 \1\]/g'
One thing this has highlighted is that there really are some newlines that I need to clean up and I'm not clear yet whether they were manually put in by me, or if it's an effect of fckeditor's parsing.
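On the recursion question in the script above: sed can loop on its own with a label and the t command, which jumps back to the label whenever the last substitution succeeded. What follows is a rough sketch that handles any nesting depth in one stage, assuming indentation is always a multiple of four spaces and the bullet is an asterisk. The name md_nested_lists is made up for illustration.

```shell
#!/bin/bash
# Sketch only: md_nested_lists is a hypothetical helper name.
# Each pass turns one leading group of four spaces into an asterisk,
# but only when the indentation is followed by a * bullet.
# "ta" repeats the substitution until it no longer matches.
md_nested_lists() {
    sed -e ':a' \
        -e 's/^\(\**\) \{4\}\( *\*\)/\1*\2/' \
        -e 'ta'
}

printf '* a\n    * b\n        * c\n' | md_nested_lists
```

Lines that are indented but carry no bullet are left alone, because the pattern requires the asterisk after the spaces.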
For instance, what is the s* for? Is that "second argument" or "new stanza" or what? and the =~, is that for "approximately equal", and if so, approximately equal to what?
In perl regular expressions, '\s' is shorthand for 'whitespace'. The '*' modifier says 'zero or more of the previous token'. The net effect is that it allows zero or more whitespace characters between the two parts of the original parsed text. As a general practice, I like to put these into my regex's to make them more general. Some syntaxes allow whitespace in various places, and when allowed, whitespace is not always applied consistently. I don't think sed has the shorthand notations that perl has, and I know it doesn't have a lot of the extended regular expression syntax that perl has. I do find the perl shorthand notations handy, but the extended regex's are not often used, although I'm sure there are times when I could have used them to good effect if I had the notation in my head, instead of having to look them up. I suppose that's kind of where you're at with the whole language and regular expression thing.
--- rod.
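For anyone staying with sed, the same "zero or more whitespace" idea can be written with the POSIX character class [[:space:]], which does the job of \s here. The sketch below allows optional whitespace between the closing ] and the opening ( of a markdown link. The function name md_links_ws is hypothetical, and the pattern assumes link text without ] and URLs without ).

```shell
#!/bin/bash
# Sketch only: md_links_ws is a hypothetical helper name.
# [[:space:]]* matches zero or more whitespace characters, so both
# "[text](url)" and "[text] (url)" are converted.
md_links_ws() {
    sed 's/\[\([^]]*\)][[:space:]]*(\([^)]*\))/[\2 \1]/g'
}

printf '[LQ] (http://www.linuxquestions.org)\n' | md_links_ws
```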