LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 01-28-2007, 10:38 PM   #1
goemon
LQ Newbie
 
Registered: Nov 2005
Posts: 17

Rep: Reputation: 0
Help from a Regex Guru?


Hi, I'm not really new to Linux but usually use GUIs and am gradually coming out of my shell (pun intended). I'm trying to use SED to insert some semicolons into a vocabulary file that's got some Japanese and English words in it. The file is in the following format:

Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;

And I need it to become:

Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;

(The first has two semicolons; the second has an additional one after "Kanacharacters".)

Here's the real deal:

無料: むりょう free; no charge;
酒: さけ alcohol; sake;

Would become:
無料: むりょう; free; no charge;
酒: さけ; alcohol; sake;

I've tried numerous combinations of regexes and am getting nowhere. Anyone out there have an idea?

One possibility might be to put a semicolon before the second space in each line (after ensuring that there are no double spaces in the file). This ought to be simple but I'm just not getting it...sigh.

Any help would be great. Thank you...
 
Old 01-28-2007, 11:00 PM   #2
homey
Senior Member
 
Registered: Oct 2003
Posts: 3,057

Rep: Reputation: 59
While I am certainly no guru, this may just work for you ...
Code:
cat file.txt
Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;

sed 's/ /; /2' file.txt
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
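
For anyone wondering why the trailing 2 matters: in sed's s command, a numeric flag N replaces only the Nth match on each line, no flag replaces only the first match, and g replaces every match. A minimal sketch comparing the three on the sample line (the variable names are only for illustration):

```shell
line='Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;'

# No flag: only the first space on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/ /; /')

# Flag 2: only the second space is replaced (the one after the kana).
second=$(printf '%s\n' "$line" | sed 's/ /; /2')

# Flag g: every space is replaced, which is too much here.
all=$(printf '%s\n' "$line" | sed 's/ /; /g')

printf '%s\n' "$first" "$second" "$all"
```

Only the second variant gives the requested format, and it relies on every line having exactly one space before the kana field.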
 
Old 01-28-2007, 11:18 PM   #3
goemon
LQ Newbie
 
Registered: Nov 2005
Posts: 17

Original Poster
Rep: Reputation: 0
Wow, that worked!!!

Okay, that was absurdly simple. Thank you so much...

However, having run this I now see that it's not quite as simple after all. There's an additional wrinkle to the equation, which is that some of these lines have an additional part of speech enclosed within parentheses and I'd ideally like the semicolon to come _after_ the ")" if indeed the line _has_ one. For example:

Before:
Kanjicharacters: Kanacharacters (n, vs) Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters (n) Englishdef1; Englishdef2;

And the result would end up as:
Kanjicharacters: Kanacharacters (n, vs); Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters (n); Englishdef1; Englishdef2;

I _did_ manage to write something that will put the semicolon _into_ the ones with parens in the right place, but if I then use the line above it will give the undesired result of:

Kanjicharacters: Kanacharacters; (n, vs); Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; Englishdef1; Englishdef2;
Kanjicharacters: Kanacharacters; (n); Englishdef1; Englishdef2;

Essentially, I need a rule like "only add a semicolon before the second space if that space isn't immediately followed by something in parentheses". It has to be limited to that position because the definitions sometimes contain parentheses too.

I greatly appreciate your help!
 
Old 01-28-2007, 11:49 PM   #4
donv2
Member
 
Registered: Nov 2004
Location: Upper right corner of USA
Distribution: Ubuntu/Mint, unSLUng (NSLU2), Arch/PlugApps (Dockstar)
Posts: 50

Rep: Reputation: 15
You could simply run a second search and replace on the output of your modified approach, turning every "; (" back into " (".

If you have Perl available, it is very fast for this sort of job. You can run it from the command line. If your file is named kanji.txt, the following will do it:

Code:
perl -pi -e 's/; \(/ (/g' kanji.txt
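
As an alternative to the two-pass cleanup above, the semicolon could probably be placed in a single sed substitution by making the parenthesised tag an optional part of the match. This is only a sketch: add_semicolon is a made-up helper name, it assumes each line starts with exactly two space-free fields, and it assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed do; some older ones may not).

```shell
# Hypothetical helper: reads lines on stdin, writes converted lines to stdout.
# The pattern captures: field 1, a space, field 2, and optionally a
# " (...)" group directly after field 2. "; " is then put after whatever
# was captured, replacing the single space that follows it.
add_semicolon() {
    sed -E 's/^([^ ]+ [^ ]+( \([^)]*\))?) /\1; /'
}

printf '%s\n' \
    'Kanjicharacters: Kanacharacters (n, vs) Englishdef1; Englishdef2;' \
    'Kanjicharacters: Kanacharacters Englishdef1; Englishdef2;' \
    'Kanjicharacters: Kanacharacters (n) Englishdef1; Englishdef2;' |
    add_semicolon
```

One caveat: a definition that itself begins with a parenthesis directly after the kana would be mistaken for a part-of-speech tag, so the bracket contents may need to be narrowed for real data.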
 
Old 01-29-2007, 12:14 PM   #5
goemon
LQ Newbie
 
Registered: Nov 2005
Posts: 17

Original Poster
Rep: Reputation: 0
Almost there...!!

Hmmm, the hardest part seems to be developing an eye for simpler searches. That helped a lot; however, the final wrinkle was that I needed it to remove a semicolon only when it came before an opening parenthesis followed immediately by a "v", "n", or "a" (for example (adv), (n), (v), and so on).

I think I might have found a really clunky way: inserting semicolons, then tagging the incorrect ones with an '@' symbol, and then removing every '@;'. Here's my current complete script (I'm sure there are easier ways to do this with Perl and so on, but it took me so long to grok SED that I'm trying to stick with something I know):

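Code:
#!/bin/sh
# Pass the files to convert as arguments; each one is rewritten in place.
for file
do
    echo "$file"
    mv "$file" "$$.tempfile"
    sed '
# insert "; " at the 2nd and then the 3rd space on each line
s/ /; /2
s/ /; /3
# tag a semicolon that precedes "(v", "(n" or "(a" with @, then drop "@;"
s/; ([vna]/@&/g
s/@;//g
# collapse doubled semicolons
s/;;/;/g
s/; ;/; /g
# turn every colon into a semicolon
s/:/;/g' "$$.tempfile" > "$file"
done
rm -f "$$.tempfile"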

There's probably a cleaner, simpler way to do this, but for now I guess I'm in good shape. If you have time to clean it up or simplify it, great...but otherwise, I think I'm all set for now.
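
One possible cleanup of the script above, offered as a sketch rather than a drop-in replacement: the tag-and-remove steps can probably be folded into one substitution that treats a "(v...)", "(n...)" or "(a...)" group after the kana as optional. The function names convert_stream and convert_files are made up for illustration, and the sketch assumes that sed accepts -E and that mktemp is installed. On the simple sample lines it should produce the same result as the original, and it also places the semicolon after multi-word tags such as (n, vs).

```shell
# Hypothetical helper: convert lines read from stdin.
convert_stream() {
    # 1) put "; " after the kana, or after a "(v..)", "(n..)", "(a..)" tag
    # 2) turn every colon into a semicolon, as the original script does
    sed -E \
        -e 's/^([^ ]+ [^ ]+( \([vna][^)]*\))?) /\1; /' \
        -e 's/:/;/g'
}

# Hypothetical helper: rewrite each named file in place via a temp file.
convert_files() {
    for file in "$@"; do
        tmp=$(mktemp) || return 1
        if convert_stream < "$file" > "$tmp"; then
            # Copy the contents back so the original file keeps its permissions.
            cat "$tmp" > "$file"
        fi
        rm -f "$tmp"
    done
}
```

Calling convert_files with the vocabulary files as arguments would then replace the for loop, and quoting "$file" keeps names with spaces intact.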

With SED, what would the syntax be for "the first block of text before a space"? It'd be great to identify and repeat the first block of text if, and only if, there was only one Japanese-encoded block. Some lines do not have a Kanjicharacter field, just a lone Kanacharacter field, and thus I need to manually repeat it so that I have a properly fleshed out line.

'^+ ' (without the ' marks) ???

Is there any way to, for the following example:

Kanji; Kana; Englishmeaning1; Englishmeaning2;
Kana; Englishmeaning1; Englishmeaning2;

Turn it into this:

Kanji; Kana; Englishmeaning1; Englishmeaning2;
Kana; Kana; Englishmeaning1; Englishmeaning2;

(The second line's "Kana;" string is doubled.)

As always, many thanks for any tips, help, or suggestions...
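
On the syntax question: "the first block of text before a space" is usually written ^[^ ]* in sed (start of line, then any run of non-space characters), whereas ^+ does not express that because the + has nothing to repeat. For the doubling itself, here is a rough sketch under some loud assumptions: the file is UTF-8, fields are separated by "; " as in the example, and an English field always starts with an ASCII letter or an opening parenthesis while a Japanese field never does. The name double_lone_kana is made up for illustration.

```shell
# Hypothetical helper: reads lines on stdin, writes them to stdout.
# If the second field starts with an ASCII letter or "(", the line has
# no separate kanji field, so the first field is repeated.
# LC_ALL=C makes sed compare bytes, so UTF-8 Japanese text can never
# match the ASCII-only bracket expression.
double_lone_kana() {
    LC_ALL=C sed 's/^\([^;]*;\) \([A-Za-z(]\)/\1 \1 \2/'
}

# Example (file names are placeholders):
#   double_lone_kana < vocab.txt > vocab.fixed.txt
```

Lines that already have both a kanji and a kana field are left alone, because their second field begins with a non-ASCII byte. Note that this cannot be tried out with the ASCII placeholders "Kanji" and "Kana", since those look like English to the pattern.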
 
  

