[SOLVED] BBCode replacement technique in PHP

vharishankar · 07-11-2009, 05:24 AM

I am working on implementing a simple BBCode system for the comment forms in my blog software. For various reasons I want to avoid direct HTML. I am currently using a straightforward search and replace, as in:

[ b ] -> <b>
[ i ] -> <i>
[ code ] -> <code>
[ quote ] -> <blockquote>

etc.

The problem with this approach is that I cannot really do any error checking, as there is no way to determine whether tags are properly closed.

This can lead to bugs on the web page. For example, a single commenter who does not close a code block can render the rest of the page in an ugly fixed-width font.

So is there a simple yet safe way to implement BBCode using regular expressions? Since I'm using PHP and server-side scripting, I really don't want to implement a whole lexer/parser for this.

Yet I'm sure regular expressions can handle this. Can anybody help me out here? Any tips or pointers are welcome.

What I want is simply for

[ b ]sometext here[ /b ] to be replaced with <b>sometext here</b>

But I don't want to implement it if there is no end tag. All pointers and hints gratefully accepted.

(spaces used to avoid BBCode on this forum)

Wim Sturkenboom · 07-11-2009, 05:42 AM

I think it will be safe if you put the user's input in e.g. a <div> or a <p>. That way the browser should ignore tags that are still open.

On a side note:
I think it's better to replace [ b ] by a <span class="myclass"> than by <b>. Nowadays it's the preferred way to use CSS so you can separate content from formatting.

vharishankar · 07-11-2009, 05:46 AM

Actually, most browsers will let unclosed tags spill over beyond the enclosing <div> or <p> because of incorrect implementation. But even otherwise, I'd prefer a cleaner solution to this.

As for using <span class= >, I use it extensively to mark up special text which has some contextual meaning, but for normal inline markup of ordinary text, I still prefer the plain bold and italic tags.

I have searched the web for this, but I couldn't find a regexp-based BBCode parser to my liking. I prefer POSIX regexps to Perl-compatible ones, as I am more comfortable with them.

ntubski · 07-11-2009, 04:47 PM

What about the BBCode extension of PHP? It seems to me that since BBCode allows nesting this is a case where regexps really aren't appropriate.

vharishankar · 07-11-2009, 09:44 PM

Hi ntubski, unfortunately I may not be able to use PHP extensions, because I am hosting on a shared hosting provider and I have no control over which version of PHP is installed and which extensions are available.

However, I am leaning towards "growing my own" regexp for the moment. My needs are pretty simple and straightforward and too much advanced error handling is not needed. All I want to check for is whether every opening tag has a closing tag.
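The check described here, whether every opening tag has a closing tag, could be sketched roughly as below. This is only an illustration: the function name bbcode_tag_is_balanced is hypothetical and not from the thread, and it merely compares counts, so it would not notice a closing tag that appears before its opening tag.

```php
<?php
// Hypothetical helper (not from the thread): reports whether a simple,
// attribute-free BBCode tag has as many closing tags as opening tags.
// It compares counts only and does not check ordering or nesting.
function bbcode_tag_is_balanced(string $text, string $tag): bool
{
    $lower = strtolower($text);          // make the comparison case-insensitive
    $tag   = strtolower($tag);
    $open  = substr_count($lower, '[' . $tag . ']');
    $close = substr_count($lower, '[/' . $tag . ']');
    return $open === $close;
}

var_dump(bbcode_tag_is_balanced("[b]bold[/b] text", "b"));      // bool(true)
var_dump(bbcode_tag_is_balanced("[code]never closed", "code")); // bool(false)
```

A comment that fails the check could be shown unformatted instead of being converted.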

vharishankar · 07-12-2009, 10:58 AM

After thinking a lot about the pros and cons of different approaches, I've implemented a simple regexp rule that is not perfect, but at least matches opening and closing tags and prevents the formatting from overflowing. It's quite trivial and I am using it only for simple tags like bold, italic, code and quote. I am not implementing any tag that requires attributes or nested elements, like lists.

The regexp I used:

PHP Code:

$str_to_replace = eregi_replace ("\[b\](.+)\[\/b\]", "<b>\\1</b>", $str_to_replace);
$str_to_replace = eregi_replace ("\[i\](.+)\[\/i\]", "<i>\\1</i>", $str_to_replace);
// ...
// etc.

It's very trivial though. Can you see anything wrong with it? So far it seems to be reasonably OK. I can live with improperly nested tags, as I can always correct them manually. Writing full-fledged BBCode grammar rules and a parser in PHP is probably too big an overhead for a small application.

Wim Sturkenboom · 07-12-2009, 11:12 AM

Just make sure that it works with a multi-line text segment like the one shown below.

Code:

[ b ]Hi harishankar
hope it works

regards
WimS
[ /b ]

vharishankar · 07-12-2009, 11:15 AM

Yes, it does work. Thanks for the suggestion. I hadn't thought of that before, but it seems to work.

You can check my blog here for the comments section where I implemented the code:
http://harishankar.org/blog/entry.ph...ed-to-comments

Thanks again.

ntubski · 07-12-2009, 03:06 PM

I posted a case on your blog that doesn't quite work:

Code:

[b]bold[/b] [i]and[/i] [b]beautiful[/b]

Should render as
bold and beautiful

But shows up as
bold [/b] and [b]beautiful
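The failure can be reproduced with a short sketch. The ereg functions were removed in PHP 7, so preg_replace with a greedy (.+) is used here as a stand-in, on the assumption that it behaves like the eregi_replace rules above for this input. The variable names are illustrative only.

```php
<?php
// Stand-in for the eregi_replace rules: same patterns, greedy (.+).
$input = "[b]bold[/b] [i]and[/i] [b]beautiful[/b]";

// The greedy (.+) runs from the first [b] up to the LAST [/b] in the text.
$step1 = preg_replace('/\[b\](.+)\[\/b\]/i', '<b>$1</b>', $input);

// There is only one [i]...[/i] pair, so this rule works as intended.
$step2 = preg_replace('/\[i\](.+)\[\/i\]/i', '<i>$1</i>', $step1);

echo $step2, "\n";
// <b>bold[/b] <i>and</i> [b]beautiful</b>
```

Only the outermost [b] and [/b] are converted, which leaves the inner pair as literal text.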

vharishankar · 07-12-2009, 08:49 PM

Hmm... thanks for the test. I think that the regular expression requires a bit of tweaking. I am not sure why it is not matching the first [/b].

It is being greedy. Is there any way to make that expression non-greedy? Adding a question mark at the end like (.+?) gives me an error in eregi_replace.

Code:

Warning: eregi_replace() [function.eregi-replace]: REG_BADRPT in /home/hari/public_html/harishankar.org/blog/Functions.php on line 1176

vharishankar · 07-12-2009, 09:57 PM

I fixed the issue by using PCRE instead of POSIX regular expressions. There's no apparent way to prevent greedy matching in the ereg functions in PHP.

PHP Code:

$patterns = array(
    "/\[b\](.+?)\[\/b\]/i",
    "/\[i\](.+?)\[\/i\]/i",
    "/\[quote\](.+?)\[\/quote\]/i",
    "/\[code\](.+?)\[\/code\]/i"
);
$replacements = array(
    "<b>$1</b>",
    "<i>$1</i>",
    "<blockquote>$1</blockquote>",
    "<code>$1</code>"
);
$bb_str = preg_replace($patterns, $replacements, $str);
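One thing worth checking with the PCRE version: by default "." does not match a newline in PCRE, so the multi-line test from earlier in the thread needs the s modifier. A possible variant is sketched below. The wrapper name bbcode_to_html and the htmlspecialchars step are assumptions added for illustration (the latter follows from the stated wish to avoid direct HTML); they are not part of the code posted above.

```php
<?php
// Hypothetical wrapper around the rules posted above.
function bbcode_to_html(string $str): string
{
    // Assumption: escape raw HTML first so that only the tags
    // generated below reach the page.
    $str = htmlspecialchars($str, ENT_QUOTES, 'UTF-8');

    // i = case-insensitive, s = let "." match newlines as well.
    $patterns = array(
        '/\[b\](.+?)\[\/b\]/is',
        '/\[i\](.+?)\[\/i\]/is',
        '/\[quote\](.+?)\[\/quote\]/is',
        '/\[code\](.+?)\[\/code\]/is',
    );
    $replacements = array(
        '<b>$1</b>',
        '<i>$1</i>',
        '<blockquote>$1</blockquote>',
        '<code>$1</code>',
    );

    $result = preg_replace($patterns, $replacements, $str);

    // preg_replace returns null on error; fall back to the escaped text.
    return $result === null ? $str : $result;
}

echo bbcode_to_html("[b]Hi harishankar\nhope it works[/b] [i]and[/i] [b]<beautiful>[/b]"), "\n";
```

An unclosed tag such as "[code]never closed" simply stays as literal text, which matches the goal of not converting a tag that has no end tag.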

Wim Sturkenboom · 07-13-2009, 12:05 AM

Regular expressions are greedy by default and will try to match as much as possible. There is a way to change that, but maybe not in PHP. From man re_syntax:

Quote:

*? +? ?? {m}? {m,}? {m,n}?
non-greedy quantifiers, which match the same possibilities, but prefer the smallest number rather than the largest number of matches (see MATCHING)

PS
OK, re-reading your last post, I see that you found that.
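The difference between the greedy and non-greedy quantifiers quoted from re_syntax can be seen in a small sketch using PHP's preg_match. The sample string and variable names are made up for illustration.

```php
<?php
$text = "[b]one[/b] [b]two[/b]";

// Greedy: (.+) takes as much as possible, up to the last [/b].
preg_match('/\[b\](.+)\[\/b\]/', $text, $greedy);

// Non-greedy: (.+?) stops at the first [/b] that allows a match.
preg_match('/\[b\](.+?)\[\/b\]/', $text, $lazy);

echo $greedy[1], "\n"; // one[/b] [b]two
echo $lazy[1], "\n";   // one
```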

vharishankar · 07-13-2009, 12:15 AM

Quote:

Originally Posted by Wim Sturkenboom

Regular expressions are by default greedy and will try to match as much as possible. And there is a way, but maybe not in php. From man re_syntax

PS
OK re-reading your last post and I see that you found that.

Thanks. Actually, the problem was that the ereg functions don't accept the ? suffix that makes a quantifier non-greedy. I had chosen ereg because POSIX expressions tend to be simpler.

Luckily preg functions work as well or better in most cases without significant overhead. PCRE is certainly more complex than POSIX regular expressions, but I think it is more featureful.