Help from a Regex Guru?

goemon · 01-28-2007, 09:38 PM

Hi, I'm not really new to Linux but usually use GUIs and am gradually coming out of my shell (pun intended). I'm trying to use SED to insert some semicolons into a vocabulary file that's got some Japanese and English words in it. The file is in the following format:

Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;

And I need it to become:

Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;

(The first has 2 semicolons; the second has an additional one after the "Kanacharacters".)

Here's the real deal:

無料: むりょう free; no charge;
酒: さけ alcohol; sake;

Would become:
無料: むりょう; free; no charge;
酒: さけ; alcohol; sake;

I've tried numerous combinations of regexes and am getting nowhere. Anyone out there have an idea?

One possibility might be to put a semicolon before the second space in each line (after ensuring that there are no double spaces in the file). This ought to be simple but I'm just not getting it...sigh.

Any help would be great. Thank you...

homey · 01-28-2007, 10:00 PM

While I am certainly no guru, this may just work for you ...

Code:

cat file.txt
Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;

sed 's/ /; /2' file.txt
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
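
For anyone following along, here is the same substitution run on the two real lines from the question. This is only a sketch: the function name is made up for illustration, and it assumes the file uses plain ASCII spaces (a full-width ideographic space would not be matched by this pattern).

```shell
# The trailing 2 tells sed to replace only the second match on each line,
# i.e. the space that follows the kana reading.
insert_after_kana() {
    sed 's/ /; /2'
}

printf '%s\n' '無料: むりょう free; no charge;' '酒: さけ alcohol; sake;' | insert_after_kana
```

The numeric flag is what keeps the first space (after the colon) and the later spaces untouched.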

goemon · 01-28-2007, 10:18 PM

Okay, that was absurdly simple. Thank you so much...

However, having run this I now see that it's not quite as simple after all. There's an additional wrinkle to the equation, which is that some of these lines have an additional part of speech enclosed within parentheses and I'd ideally like the semicolon to come _after_ the ")" if indeed the line _has_ one. For example:

Before:
Kanjicharacters: Kanacharacters (n, vs) Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters (n) Englishdef1; Englishdef2;

And the result would end up as:
Kanjicharacters: Kanacharacters (n, vs); Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters (n); Englishdef1; Englishdef2;

I _did_ manage to write something that will put the semicolon _into_ the ones with parens in the right place, but if I then use the line above it will give the undesired result of:

Kanjicharacters: Kanacharacters; (n, vs); Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; (n); Englishdef1; Englishdef2;

What I essentially need is a rule like "only add a semicolon before the second space if the line doesn't have something in parentheses right after it" (it has to be that specific because sometimes the definitions also contain parentheses).

I greatly appreciate your help!

donv2 · 01-28-2007, 10:49 PM

You could simply follow with a search and replace on "; (" to make it " (" on the current result of your modded approach.

If you have perl available, it is very fast for this sort of purpose. You can run it from the command line - if your file name is kanji.txt the following will do it:

Code:

perl -pi -e 's/; \(/ (/g' kanji.txt
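
If perl is not an option, roughly the same cleanup can be done with sed. This is a sketch rather than a drop-in replacement: the function name is hypothetical, and it filters stdin to stdout instead of editing kanji.txt in place.

```shell
# In sed's basic regular expressions an unescaped "(" is a literal character,
# so no backslash is needed here.
strip_semicolon_before_paren() {
    sed 's/; (/ (/g'
}

printf '%s\n' 'Kanjicharacters: Kanacharacters; (n, vs); Englishdef1; Englishdef2;' |
    strip_semicolon_before_paren
```

Note that this removes the semicolon before every "(" on the line, including any that appear inside the definitions.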

goemon · 01-29-2007, 11:14 AM

Hmmm, the hardest part seems to be developing the eye for creating simpler searches. That helped a lot; however, the final wrinkle was that I needed it to remove a semicolon only if it came before an open paren with a v, n, or a right after it (for example, (adv), (n), (v), and so on).

I think I might have found a really clunky way: inserting semicolons, then tagging the ones that are incorrect with an '@' symbol, and then removing every '@;'. Here's my current complete script (I'm sure there are easier ways to do this with Perl and so on, but it took me so long to grok SED that I'm trying to stick with something I know):

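for file
do
    echo "$file"
    # Work from a temporary copy so the output can take the original name.
    mv "$file" $$.tempfile
    # Add semicolons before the 2nd and 3rd spaces, tag any "; (" that is
    # followed by v, n, or a with "@", delete the tagged semicolons, tidy up
    # doubled semicolons, and finally turn the colon into a semicolon.
    sed 's/ /; /2
s/ /; /3
s/; ([vna]/@&/g
s/@;//g
s/;;/;/g
s/; ;/; /g
s/:/;/g' $$.tempfile > "$file"
    rm $$.tempfile
done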

There's probably a cleaner, simpler way to do this, but for now I guess I'm in good shape. If you have time to clean it up or simplify it, great...but otherwise, I think I'm all set for now.
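
One possible simplification, offered as a sketch and not as a tested replacement for the script above: a single substitution that captures the first two blocks plus an optional part-of-speech tag, then writes the semicolon after whichever came last. It assumes a sed that supports -E (extended regular expressions), that the tag always starts with v, n, or a as described above, and that the input still has the original "Kanji: Kana ..." layout. The function name is made up, and the colon is left alone.

```shell
# \1 = the first two space-separated blocks (kanji and kana)
# \2 = an optional " (v...)", " (n...)" or " (a...)" tag; empty when absent
add_kana_semicolon() {
    sed -E 's/^([^ ]+ [^ ]+)( \([vna][^)]*\))? /\1\2; /'
}

printf '%s\n' \
    'Kanjicharacters: Kanacharacters (n, vs) Englishdef1; Englishdef2;' \
    'Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;' |
    add_kana_semicolon
```

Because only one semicolon is ever inserted, the '@' tagging and the ";;" cleanup steps should no longer be needed for this part. A definition that itself begins with "(v", "(n", or "(a" would still be mistaken for a tag, which is the same limitation as the [vna] test in the script.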

With SED, what would the syntax be for "the first block of text before a space"? It'd be great to identify and repeat the first block of text if, and only if, there was only one Japanese-encoded block. Some lines do not have a Kanjicharacter field, just a lone Kanacharacter field, and thus I need to manually repeat it so that I have a properly fleshed out line.

'^+ ' (without the ' marks) ???

Is there any way to, for the following example:

Kanji; Kana; Englishmeaning1; Englishmeaning2;
Kana; Englishmeaning1; Englishmeaning2;

Turn it into this:

Kanji; Kana; Englishmeaning1; Englishmeaning2;
Kana; Kana; Englishmeaning1; Englishmeaning2;

(The second line's "Kana;" string is doubled.)

As always, many thanks for any tips, help, or suggestions...
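
On the syntax question: "the first block of text before a space" can be written as ^[^ ]* followed by a space in basic sed syntax, or ^[^ ]+ followed by a space with -E. For the doubling itself, here is a rough sketch. It rests on assumptions that are not stated in the thread: fields are already separated by "; ", English definitions (or a parenthesized tag) begin with an ASCII letter or "(", and Japanese fields never do. The function name is hypothetical, and LC_ALL=C is set so that the A-Za-z ranges mean plain ASCII.

```shell
# If the second field starts with an ASCII letter or "(", the line has only
# one Japanese field, so repeat that first field (captured as \1).
double_lone_kana() {
    LC_ALL=C sed -E 's/^([^; ]+;) ([A-Za-z(])/\1 \1 \2/'
}

printf '%s\n' '無料; むりょう; free; no charge;' 'さけ; alcohol; sake;' | double_lone_kana
```

Lines whose second field is Japanese do not match and pass through unchanged. The placeholder lines in the example (literal "Kanji; Kana; ...") would not behave this way, since the placeholders are themselves ASCII.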