LinuxQuestions.org
Register a domain and help support LQ
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Software
User Name
Password
Linux - Software This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.

Notices

Reply
 
Search this Thread
Old 11-15-2012, 01:26 AM   #1
di11rod
Member
 
Registered: Jan 2004
Location: Austin, TEXAS
Distribution: CentOS 6.5
Posts: 207

Rep: Reputation: 32
Need to scrub UNICODE NULLS from MySQL tables


Ok, I have searched the google and these forums and haven't found any guidance for someone in my predicament.

We have content that got sucked into our MySQL 5.5 database (UTF8) that came from flat files that were formatted in UNICODE. The impact is that our other software that is supposed to parse the content and do stuff like publish it in PDFs really barfs when it hits these UNICODE NULL characters.

I believe when you do a query in mysql, the values appear as so:

g o r g o n z o l a

Can anyone recommend a select query that would locate the values in varchar columns? Ideally, I'd like some kind of stored procedure that would scrub the UNICODE nulls.

All suggestions appreciated,

di11rod

Last edited by di11rod; 11-15-2012 at 01:42 AM.
 
Old 11-16-2012, 10:53 AM   #2
tronayne
Senior Member
 
Registered: Oct 2003
Location: Northeastern Michigan, where Carhartt is a Designer Label
Distribution: Slackware 32- & 64-bit Stable
Posts: 3,042

Rep: Reputation: 761Reputation: 761Reputation: 761Reputation: 761Reputation: 761Reputation: 761Reputation: 761
Oh, been there, did that (and it hurt, too!).

For some years I was receiving files from clients that were supposed to be text only, no special characters, no junk. But, as we all know, Microsoft users don't know a slash from a back slant and...

I developed a front-end filtering utility that all received files passed through to eliminate every bad character users could come up with, \, ^, tab, back space (yes, back space) and who knows what all. Kept a lot of crap out of our data bases.

I believe that doing this in SQL is going to be extremely difficult, even if you use the C-API (it's probably doable but, wow, what a pain).

I suspect that the easy way is to unload your affected tables to text files and use one of the editors, sed for example, to clean up the mess. You could write a little C filter program that deletes non-ASCII characters or perhaps make use of this sample program (a subset of the "big" front end filter):
Code:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef	TRUE
#	define	TRUE	1
#endif
#ifndef	FALSE
#	define	FALSE	0
#endif

void	main	(int argc, char *argv [])
{
	int	c;
	int	error = FALSE;		/* error flag		*/
	int	line = 1;		/* number of lines	*/
	int	ropt = FALSE;		/* remove		*/
	int	sopt = FALSE;		/* replace with blank	*/
	int	wopt = FALSE;		/* warn			*/
	FILE	*infile, *outfile;

	/*	process command line options			*/
	while ((c = getopt (argc, argv, "rsw?")) != EOF) {
		switch (c) {
			case 'r':
				ropt = TRUE;
				break;
			case 's':
				sopt = TRUE;
				break;
			case 'w':
				wopt = TRUE;
				break;
			case '?':
				error = TRUE;
				break;
			default:
				error = TRUE;
				break;
		}
	}
	if (error || (ropt && sopt)) {
		(void) fprintf (stderr,
		  "usage:\t%s [-r remove] [-s sub] [-w warn] [file]\n",
		  argv [0]);
		exit (1);
	}
	if (optind == argc) {
		infile = stdin;
	} else {
		if ((infile = fopen (argv [optind], "r")) == (FILE *) NULL) {
			(void) fprintf (stderr, "can't open %s\n", argv [optind]);
			exit (2);
		}
	}
	/*	make the output file stdout		*/
	outfile = stdout;
	/*	while there are characters to get...	*/
	while ((c = getc (infile)) != EOF) {
		/*	new line is ok			*/
		if (c == '\n') {
			line++;
		/*	tab and form feed are ok	*/
		} else if (c != '\t' &&
			   c != '\f' &&
			   (c < 32 || c > 126)) {
			/*	warn about it?		*/
			if (wopt) {
				if (c < 32) {
					(void) fprintf (stderr,
					    "at line %d, Ctrl-%c (0x%03o)\n",
					    line, c+64, c);
				} else {
					(void) fprintf (stderr,
					    "at line %d, 0x%04o\n",
					    line, c);
				}
			}
			if (ropt) {		/* remove it		*/
				continue;
			} else if (sopt) {	/* substitute it	*/
				c = ' ';
			}
		}
		(void) putc (c, outfile);
	}
}
Save the above as, oh, filter.c and compile it with
Code:
cc -o filter filter.c
The usage is
Code:
usage:	filter [-r remove] [-s sub] [-w warn] infile
So you would execute it with one of those options; -r removes bad characters, -s substitutes a bad character with a blank, w= warns you where bad characters appear in your file. The output is to standard out, so
Code:
filter -r infile > outfile
Feel free to fiddle with it; it's yours to do with what you wish.

Hope this helps some.

Last edited by tronayne; 11-16-2012 at 10:57 AM.
 
  


Reply

Tags
mysql5, unicode, utf8


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
MySQL tables gone diwljina Linux - Server 11 07-30-2012 11:39 AM
Does mySQL support unicode? LinLove Linux - Software 2 11-20-2007 08:17 AM
Unicode support broken on Gentoo/MySQL server streamkid Linux - Server 4 05-27-2007 02:32 AM
Gentoo / MySQL: Unicode not working streamkid Linux - Server 0 05-24-2007 09:43 AM
mysql reinstallation distorted by previous tables in /var/mysql mad4linux Linux - Software 0 10-04-2005 01:39 PM


All times are GMT -5. The time now is 04:29 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration