Need to scrub UNICODE NULLS from MySQL tables

di11rod · 11-15-2012, 01:26 AM

Ok, I have searched the google and these forums and haven't found any guidance for someone in my predicament.

We have content that got sucked into our MySQL 5.5 database (UTF8) that came from flat files that were formatted in UNICODE. The impact is that our other software that is supposed to parse the content and do stuff like publish it in PDFs really barfs when it hits these UNICODE NULL characters.

I believe when you do a query in mysql, the values appear as so:

g o r g o n z o l a

Can anyone recommend a select query that would locate the values in varchar columns? Ideally, I'd like some kind of stored procedure that would scrub the UNICODE nulls.

All suggestions appreciated,

di11rod

tronayne · 11-16-2012, 10:53 AM

Oh, been there, did that (and it hurt, too!).

For some years I was receiving files from clients that were supposed to be text only, no special characters, no junk. But, as we all know, Microsoft users don't know a slash from a back slant and...

I developed a front-end filtering utility that all received files passed through to eliminate every bad character users could come up with, \, ^, tab, back space (yes, back space) and who knows what all. Kept a lot of crap out of our data bases.

I believe that doing this in SQL is going to be extremely difficult, even if you use the C-API (it's probably doable but, wow, what a pain).

I suspect that the easy way is to unload your affected tables to text files and use one of the editors, sed for example, to clean up the mess. You could write a little C filter program that deletes non-ASCII characters or perhaps make use of this sample program (a subset of the "big" front end filter):

Code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef	TRUE
#	define	TRUE	1
#endif
#ifndef	FALSE
#	define	FALSE	0
#endif

void	main	(int argc, char *argv [])
{
	int	c;
	int	error = FALSE;		/* error flag		*/
	int	line = 1;		/* number of lines	*/
	int	ropt = FALSE;		/* remove		*/
	int	sopt = FALSE;		/* replace with blank	*/
	int	wopt = FALSE;		/* warn			*/
	FILE	*infile, *outfile;

	/*	process command line options			*/
	while ((c = getopt (argc, argv, "rsw?")) != EOF) {
		switch (c) {
			case 'r':
				ropt = TRUE;
				break;
			case 's':
				sopt = TRUE;
				break;
			case 'w':
				wopt = TRUE;
				break;
			case '?':
				error = TRUE;
				break;
			default:
				error = TRUE;
				break;
		}
	}
	if (error || (ropt && sopt)) {
		(void) fprintf (stderr,
		  "usage:\t%s [-r remove] [-s sub] [-w warn] [file]\n",
		  argv [0]);
		exit (1);
	}
	if (optind == argc) {
		infile = stdin;
	} else {
		if ((infile = fopen (argv [optind], "r")) == (FILE *) NULL) {
			(void) fprintf (stderr, "can't open %s\n", argv [optind]);
			exit (2);
		}
	}
	/*	make the output file stdout		*/
	outfile = stdout;
	/*	while there are characters to get...	*/
	while ((c = getc (infile)) != EOF) {
		/*	new line is ok			*/
		if (c == '\n') {
			line++;
		/*	tab and form feed are ok	*/
		} else if (c != '\t' &&
			   c != '\f' &&
			   (c < 32 || c > 126)) {
			/*	warn about it?		*/
			if (wopt) {
				if (c < 32) {
					(void) fprintf (stderr,
					    "at line %d, Ctrl-%c (0x%03o)\n",
					    line, c+64, c);
				} else {
					(void) fprintf (stderr,
					    "at line %d, 0x%04o\n",
					    line, c);
				}
			}
			if (ropt) {		/* remove it		*/
				continue;
			} else if (sopt) {	/* substitute it	*/
				c = ' ';
			}
		}
		(void) putc (c, outfile);
	}
}

Save the above as, oh, filter.c and compile it with

Code:

cc -o filter filter.c

The usage is

Code:

usage:	filter [-r remove] [-s sub] [-w warn] infile

So you would execute it with one of those options; -r removes bad characters, -s substitutes a bad character with a blank, w= warns you where bad characters appear in your file. The output is to standard out, so

Code:

filter -r infile > outfile

Feel free to fiddle with it; it's yours to do with what you wish.

Hope this helps some.