LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Software (http://www.linuxquestions.org/questions/linux-software-2/)
-   -   Need to scrub UNICODE NULLS from MySQL tables (http://www.linuxquestions.org/questions/linux-software-2/need-to-scrub-unicode-nulls-from-mysql-tables-4175437223/)

di11rod 11-15-2012 01:26 AM

Need to scrub UNICODE NULLS from MySQL tables
 
Ok, I have searched the google and these forums and haven't found any guidance for someone in my predicament.

We have content that got sucked into our MySQL 5.5 database (UTF8) that came from flat files that were formatted in UNICODE. The impact is that our other software that is supposed to parse the content and do stuff like publish it in PDFs really barfs when it hits these UNICODE NULL characters.

I believe when you do a query in mysql, the values appear as so:

g o r g o n z o l a

Can anyone recommend a select query that would locate the values in varchar columns? Ideally, I'd like some kind of stored procedure that would scrub the UNICODE nulls.

All suggestions appreciated,

di11rod

tronayne 11-16-2012 10:53 AM

Oh, been there, did that (and it hurt, too!).

For some years I was receiving files from clients that were supposed to be text only, no special characters, no junk. But, as we all know, Microsoft users don't know a slash from a back slant and...

I developed a front-end filtering utility that all received files passed through to eliminate every bad character users could come up with, \, ^, tab, back space (yes, back space) and who knows what all. Kept a lot of crap out of our data bases.

I believe that doing this in SQL is going to be extremely difficult, even if you use the C-API (it's probably doable but, wow, what a pain).

I suspect that the easy way is to unload your affected tables to text files and use one of the editors, sed for example, to clean up the mess. You could write a little C filter program that deletes non-ASCII characters or perhaps make use of this sample program (a subset of the "big" front end filter):
Code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef        TRUE
#        define        TRUE        1
#endif
#ifndef        FALSE
#        define        FALSE        0
#endif

void        main        (int argc, char *argv [])
{
        int        c;
        int        error = FALSE;                /* error flag                */
        int        line = 1;                /* number of lines        */
        int        ropt = FALSE;                /* remove                */
        int        sopt = FALSE;                /* replace with blank        */
        int        wopt = FALSE;                /* warn                        */
        FILE        *infile, *outfile;

        /*        process command line options                        */
        while ((c = getopt (argc, argv, "rsw?")) != EOF) {
                switch (c) {
                        case 'r':
                                ropt = TRUE;
                                break;
                        case 's':
                                sopt = TRUE;
                                break;
                        case 'w':
                                wopt = TRUE;
                                break;
                        case '?':
                                error = TRUE;
                                break;
                        default:
                                error = TRUE;
                                break;
                }
        }
        if (error || (ropt && sopt)) {
                (void) fprintf (stderr,
                  "usage:\t%s [-r remove] [-s sub] [-w warn] [file]\n",
                  argv [0]);
                exit (1);
        }
        if (optind == argc) {
                infile = stdin;
        } else {
                if ((infile = fopen (argv [optind], "r")) == (FILE *) NULL) {
                        (void) fprintf (stderr, "can't open %s\n", argv [optind]);
                        exit (2);
                }
        }
        /*        make the output file stdout                */
        outfile = stdout;
        /*        while there are characters to get...        */
        while ((c = getc (infile)) != EOF) {
                /*        new line is ok                        */
                if (c == '\n') {
                        line++;
                /*        tab and form feed are ok        */
                } else if (c != '\t' &&
                          c != '\f' &&
                          (c < 32 || c > 126)) {
                        /*        warn about it?                */
                        if (wopt) {
                                if (c < 32) {
                                        (void) fprintf (stderr,
                                            "at line %d, Ctrl-%c (0x%03o)\n",
                                            line, c+64, c);
                                } else {
                                        (void) fprintf (stderr,
                                            "at line %d, 0x%04o\n",
                                            line, c);
                                }
                        }
                        if (ropt) {                /* remove it                */
                                continue;
                        } else if (sopt) {        /* substitute it        */
                                c = ' ';
                        }
                }
                (void) putc (c, outfile);
        }
}

Save the above as, oh, filter.c and compile it with
Code:

cc -o filter filter.c
The usage is
Code:

usage:        filter [-r remove] [-s sub] [-w warn] infile
So you would execute it with one of those options; -r removes bad characters, -s substitutes a bad character with a blank, w= warns you where bad characters appear in your file. The output is to standard out, so
Code:

filter -r infile > outfile
Feel free to fiddle with it; it's yours to do with what you wish.

Hope this helps some.


All times are GMT -5. The time now is 05:39 AM.