LinuxQuestions.org
Old 06-20-2008, 09:15 AM   #1
MheAd
Member
 
Registered: Jun 2007
Distribution: Ubuntu 14.04
Posts: 186

Rep: Reputation: 36
How do I extract characters from several words on a line?


Hello.
I've been trying to do something that has made me pull my hair out for the past two days, and I'm not getting anywhere. Basically, I'm writing a script to add user accounts automatically. That part isn't really a problem; I pretty much know what to do once I've solved this dilemma.

I need to extract letters from personal names in a file and use them in my script. I was hoping 'sed' was the right tool, possibly combined with 'awk', 'tr' or 'cut'.

For instance, let's say I have a line in a file:

John Williams WORD

(all words separated with a single 'space')

I need to extract the first two letters of the first two words, put them together, place the third word before them, and finally convert the new "word" to lowercase.

So a line: "John Williams FTP" should output

ftpjowi

(please note that the third WORD can be anything and should be copied in its entirety)

None of the guides I've found so far have helped much. I finally did make it work, kind of. It produces the result I need, but I feel this was more luck than understanding of what I was doing (especially the sed part). I ended up with quite a long piped command, and I'm sure it can be done in a more rational way:

Code:
sed 's/\(..\) \(..\)/\2 \1/' listfile | awk '{ print $3 $1 }' | tr '[:upper:]' '[:lower:]'
So you guys are probably laughing now, as I am myself. But this actually works (you can try it: create a file with content formatted as described above, run the command against it, and it will work), though I know it's not the "proper" way to do it. There must be a better, cleaner, more professional way. I've only just started with regular expressions, and they're not the easiest thing.

Thanks very much for your time and thanks in advance for any suggestion you might have.

M.

Last edited by MheAd; 06-20-2008 at 09:26 AM.
 
Old 06-20-2008, 09:57 AM   #2
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 655
You could alter the sed command so that you don't need to pipe the output through awk:
Code:
echo 'John Williams FTP' | sed 's/\([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-zA-Z]*\)/\L\3\1\2/'
The "\L" part converts the replacement to lower case, but it is a GNU extension. There is a way to convert to lower case in a plain sed program, but you really have to jump through a lot of hoops, and it might as well be written in Sanskrit. The "tr" command is handy and easy to understand.
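
For a sed without the GNU "\L" extension, the same rearrangement can be sketched with sed doing only the reordering and tr doing the case change. The function name make_username is just an illustrative assumption, not something from this thread, and the pattern assumes exactly three space-separated words per line.

```shell
#!/bin/sh
# make_username: hypothetical helper name, used for illustration only.
# Reads "First Last WORD" lines on stdin and prints "wordfila" in lower case.
make_username() {
    # \1 and \2 capture the first two letters of the first and second word,
    # [^ ]* skips the rest of each word, and \3 captures the whole third word.
    sed 's/^\([[:alpha:]][[:alpha:]]\)[^ ]* \([[:alpha:]][[:alpha:]]\)[^ ]* \([^ ]*\)$/\3\1\2/' |
        tr '[:upper:]' '[:lower:]'
}

echo 'John Williams FTP' | make_username    # prints ftpjowi
```

Lines that don't fit the pattern pass through sed unchanged and are only lower-cased, so malformed input stays visible instead of silently disappearing.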
 
Old 06-20-2008, 10:34 AM   #3
Uxinn
Member
 
Registered: May 2008
Location: Iceland
Distribution: Ubuntu Hardy
Posts: 47

Rep: Reputation: 16
I'm sure this is possible with a sed one-liner, but here is how I did it:


Code:
[~]: cat text.txt 
John Williams FTP
Steinar Marino SS
Silvester Stallone DIR
John Smith NOBODY

[~]: cat text.txt |tr '[A-Z]' '[a-z]'|awk '{print $3 substr($1,1,2) substr($2,1,2)}'
ftpjowi
ssstma
dirsist
nobodyjosm
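
The lower-casing can probably be folded into awk as well, since tolower() is part of POSIX awk. A minimal sketch, with awk_username as a made-up helper name and the same three-word input format assumed:

```shell
#!/bin/sh
# awk_username: hypothetical helper name for this sketch.
# With no arguments it reads stdin; otherwise it reads the named files.
awk_username() {
    # NF >= 3 skips blank or incomplete lines.
    awk 'NF >= 3 { print tolower($3 substr($1, 1, 2) substr($2, 1, 2)) }' "$@"
}

printf '%s\n' 'John Williams FTP' 'Steinar Marino SS' | awk_username
# prints:
# ftpjowi
# ssstma
```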
 
Old 06-20-2008, 10:41 AM   #4
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 655
Quote:
Originally Posted by Uxinn View Post

[~]: cat text.txt |tr '[A-Z]' '[a-z]'|awk '{print $3 substr($1,1,2) substr($2,1,2)}'
ftpjowi
ssstma
dirsist
nobodyjosm


Using cat unnecessarily is one of my pet peeves.

tr '[A-Z]' '[a-z]' <text.txt

Although one handy exception is using "cat -n" to add line numbers.
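
As a small sketch of the point about cat: tr only reads standard input, so the shell redirect feeds it the file, while awk can simply take the file name as an argument. The helper name lower_file and the throwaway file from mktemp are assumptions for the demo.

```shell
#!/bin/sh
# lower_file: hypothetical helper; lower-cases a file without using cat.
lower_file() {
    # The shell opens the file and hands it to tr on standard input.
    tr '[:upper:]' '[:lower:]' < "$1"
}

# Demo on a throwaway file created by mktemp.
tmp=$(mktemp)
printf '%s\n' 'John Williams FTP' > "$tmp"
lower_file "$tmp"            # prints john williams ftp
# awk accepts the file name directly, so it needs neither cat nor "<".
awk '{ print $3 }' "$tmp"    # prints FTP
rm -f "$tmp"
```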
 
Old 06-20-2008, 11:49 AM   #5
MheAd
Member
 
Registered: Jun 2007
Distribution: Ubuntu 14.04
Posts: 186

Original Poster
Rep: Reputation: 36
Guys, thanks a bunch!
All the suggested solutions worked like a charm!
Keep 'em coming! The more I learn about these different approaches, the better.

I do have a few questions on the currently stated solutions.
I'll get back to them soon.
 
Old 06-20-2008, 11:56 AM   #6
MheAd
Member
 
Registered: Jun 2007
Distribution: Ubuntu 14.04
Posts: 186

Original Poster
Rep: Reputation: 36
Quote:
Originally Posted by jschiwal View Post
You could alter the sed command so that you don't need to pipe the output through awk:
Code:
echo 'John Williams FTP' | sed 's/\([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-zA-Z]*\)/\L\3\1\2/'
The "\L" part converts the replacement to lower case, but it is a GNU extension. There is a way to convert to lower case in a plain sed program, but you really have to jump through a lot of hoops, and it might as well be written in Sanskrit. The "tr" command is handy and easy to understand.


Okay, here is what I'm wondering about.
Is there a way to use dots (.) instead of those a-z brackets, so that sed picks any first two characters of the first two words, not necessarily upper- or lowercase letters? The names will only contain letters, of course, but it would make the expression simpler to write.

Also, what exactly does [^ ]* mean here, after the first two words?

As for the third word, was it really necessary to say, as I read it, "one upper- or lowercase letter followed by any number of letters", or is there another way to put it? What would happen if digits were used as the first two characters of the third word?

In other words, can this expression be rewritten so that sed accepts *any* type of character, with the only strict rules being the first 2 characters of the first 2 words plus the entire 3rd word, whatever and however many characters it contains (keeping the \3\1\2 substitution, of course)?

Thanks in advance!

Last edited by MheAd; 06-20-2008 at 01:55 PM.
 
Old 06-20-2008, 03:16 PM   #7
MheAd
Member
 
Registered: Jun 2007
Distribution: Ubuntu 14.04
Posts: 186

Original Poster
Rep: Reputation: 36
Ok guys, I think I've figured out how to write just what I need.

This should do it:

Code:
echo "John Williams FTP" | sed 's/\(..\)[^ ]* \(..\)[^ ]* \(.*\)/\3\1\2/' | tr '[:upper:]' '[:lower:]'
However, I started thinking about something more advanced I could add to this command: converting non-English letters to English ones. I wouldn't go further than transliterating Swedish characters to plain Latin ones, for instance Å, Ä and Ö to A, A and O, to make the result more suitable for the /etc/passwd file. Is there any way to add another pipe stage to the command above to filter the output for any non-English characters?
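
Applied to a whole list file, the extraction step could also be sketched as a plain shell loop with read and cut instead of one regular expression. The name list_usernames is invented for this example, it only prints candidate names (it does not create any accounts), and it assumes plain ASCII names, since cut -c may count bytes rather than characters on some systems.

```shell
#!/bin/sh
# list_usernames: hypothetical helper; prints one candidate login per input line.
list_usernames() {
    # read splits each line on whitespace into three fields.
    while read -r first last word; do
        # Skip blank or incomplete lines.
        [ -n "$word" ] || continue
        f2=$(printf '%s\n' "$first" | cut -c1-2)   # first two characters
        l2=$(printf '%s\n' "$last" | cut -c1-2)
        printf '%s\n' "$word$f2$l2" | tr '[:upper:]' '[:lower:]'
    done
}

printf '%s\n' 'John Williams FTP' 'Silvester Stallone DIR' | list_usernames
# prints:
# ftpjowi
# dirsist
```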
 
Old 06-20-2008, 05:12 PM   #8
Mr. C.
Senior Member
 
Registered: Jun 2008
Posts: 2,529

Rep: Reputation: 59
There is no reason to add additional commands; tr can do what you are asking. Change tr's first argument to include the appropriate octal codes and its second argument to include the Latin replacements. E.g.:
Code:
tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO'

$ echo 'Testing the \304, \305, and \326 Chars' | tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO' 
testing the A, a, and O chars
 
Old 06-20-2008, 05:43 PM   #9
MheAd
Member
 
Registered: Jun 2007
Distribution: Ubuntu 14.04
Posts: 186

Original Poster
Rep: Reputation: 36
Quote:
Originally Posted by Mr. C. View Post
There is no reason to add additional commands; tr can do what you are asking. Change tr's first argument to include the appropriate octal codes and its second argument to include the Latin replacements. E.g.:
Code:
tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO'

$ echo 'Testing the \304, \305, and \326 Chars' | tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO' 
testing the A, a, and O chars
I'm not sure what that string would accomplish.

By writing:

Code:
$ echo 'Testing the \304, \305, and \326 Chars' | tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO'
it returns (in my case):

Code:
testing the \304, \305, and \326 chars
This seems a bit sophisticated. Could it have something to do with locale settings as well? My settings are Swedish; I don't know about yours. Maybe the codes vary depending on that. Also, when I include those codes in the original command I wrote, nothing changes when it filters a Swedish name: Å, Ä and Ö still remain, instead of the desired Aa, Aa and Oo.

Perhaps it would be easier to add another tr command as a pipe stage after the command above, since the preceding commands would already have done the extraction and substitution, so the new tr would only need to translate lowercase å, ä and ö to lowercase a, a and o. And from what I understand, lowercase and uppercase letters don't have the same octal value. Perhaps that's causing the problem?

Last edited by MheAd; 06-20-2008 at 05:52 PM.
 
Old 06-20-2008, 05:51 PM   #10
Mr. C.
Senior Member
 
Registered: Jun 2008
Posts: 2,529

Rep: Reputation: 59
You are entering the characters \, 3, 0, followed by 4. That is not the same thing as entering the single character which is represented in the shell as \304.

Try copying the character here. Then in your shell, type Ctrl-V and then immediately paste the copied character. Now backspace over it and see how it deletes the single \304 character.
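
The difference between the four typed characters and the single byte can be checked by counting bytes. This is only a sketch; count_bytes is a made-up helper name.

```shell
#!/bin/sh
# count_bytes: hypothetical helper; prints how many bytes arrive on stdin.
count_bytes() {
    # tr strips the padding that some wc implementations add.
    wc -c | tr -d ' '
}

# Passed as a %s argument, the sequence stays literal: \, 3, 0, 4.
printf '%s' '\304' | count_bytes    # prints 4
# Used as the printf format string, \304 is an octal escape for one byte.
printf '\304' | count_bytes         # prints 1
```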

Last edited by Mr. C.; 06-20-2008 at 05:57 PM.
 
Old 06-20-2008, 05:56 PM   #11
MheAd
Member
 
Registered: Jun 2007
Distribution: Ubuntu 14.04
Posts: 186

Original Poster
Rep: Reputation: 36
I'm sorry, but I don't understand what you mean.
 
Old 06-20-2008, 06:08 PM   #12
Mr. C.
Senior Member
 
Registered: Jun 2008
Posts: 2,529

Rep: Reputation: 59
Paste your paragraph(s) above that include the Swedish characters into a file. Be sure that your characters appear correctly in the file. Call the file foo.

Then, copy and paste this command into the shell and run it:

Code:
tr '[[:upper:]]\304\305\326\345\344\366' '[[:lower:]]AAOaao' < foo
You should see your characters transliterated into their Latin lookalikes.
 
Old 06-20-2008, 06:21 PM   #13
MheAd
Member
 
Registered: Jun 2007
Distribution: Ubuntu 14.04
Posts: 186

Original Poster
Rep: Reputation: 36
Nope. Nothing happens. Swedish characters remain Swedish.
 
Old 06-20-2008, 06:43 PM   #14
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 655
Look in the manpage for "regex". If I remember correctly, there are character classes.

You can also use [[:alpha:]] instead of [a-zA-Z]. The "[A-Z][a-z]*" pattern seems to match a formal name better, but I guess it could be tripped up by a name like "McDonald". There is also a non-standard GNU extension to match the beginning of a word. Using "." will match anything, including spaces.

The caret as the first character in a set, e.g. [^abcd], means any character that isn't a, b, c or d. [^ ] means any non-space character. So "[A-Z][^ ]* " will match any string that starts with a capital letter and doesn't contain a space, followed by a space. So "[^ ]* " is a way to match words or arguments separated by spaces.
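
A few one-line experiments may make these pieces easier to see in isolation. The helper names drop_first_word and two_letters are invented for this sketch.

```shell
#!/bin/sh
# "." matches any character, including a space, so three dots swallow "a b".
echo 'a b' | sed 's/.../X/'                   # prints X

# drop_first_word: hypothetical helper; "[^ ]* " is one word plus its space.
drop_first_word() {
    sed 's/[^ ]* //'
}
echo 'John Williams FTP' | drop_first_word    # prints Williams FTP

# two_letters: hypothetical helper; keeps two characters of the first word
# and drops the rest of it, because [^ ]* stops at the first space.
two_letters() {
    sed 's/\(..\)[^ ]*/\1/'
}
echo 'John Williams FTP' | two_letters        # prints Jo Williams FTP
```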

You can use an expression like [[=o=]] to match "o" and its accented versions in the same equivalence class. This may depend on the locale you use.

Look at the "tr" command or the "y" command in sed on translating characters.
sed y/[[=a=][=o=]/ao/
may translate accented a's and o's. I haven't tried this and attempting to type these characters on my keyboard might lead to physical injury!
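
As far as I can tell, sed's y command expects two literal strings of the same length and does not expand [[=a=]] classes, so the command above is likely to be rejected. A minimal sketch that sidesteps this uses one plain substitution per letter. The name to_ascii is made up, and the sketch assumes the script and the input are stored in the same encoding (for example, both UTF-8).

```shell
#!/bin/sh
# to_ascii: hypothetical helper; replaces the six Swedish letters with
# unaccented look-alikes. Assumes script and input share one encoding.
to_ascii() {
    sed -e 's/Å/A/g' -e 's/Ä/A/g' -e 's/Ö/O/g' \
        -e 's/å/a/g' -e 's/ä/a/g' -e 's/ö/o/g'
}

echo 'Åsa Öberg FTP' | to_ascii    # prints Asa Oberg FTP
```

Each s command matches the letter as a literal byte sequence, so the result should not depend on whether the locale treats it as one character or two.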
 
Old 06-20-2008, 06:45 PM   #15
Mr. C.
Senior Member
 
Registered: Jun 2008
Posts: 2,529

Rep: Reputation: 59
Hmmm. What happens with these two commands? :
Code:
printf "\304\n"
printf "\304\n" | tr '\304' A
You can copy/paste directly.
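
One possible explanation worth checking, which is an assumption here and not something confirmed in this thread: if the terminal and file use UTF-8 rather than ISO-8859-1, "Ä" is stored as two bytes, and neither of them is \304. A sketch that shows the bytes involved, with show_bytes as an invented helper name:

```shell
#!/bin/sh
# show_bytes: hypothetical helper; prints stdin as octal byte values.
show_bytes() {
    # od prints the octal values; xargs joins them on one line.
    od -An -to1 | xargs
}

# "Ä" in ISO-8859-1 is the single byte 304 (octal).
printf '\304' | show_bytes                               # prints 304
# "Ä" in UTF-8 is the two bytes 303 204.
printf '\303\204' | show_bytes                           # prints 303 204
# A byte-wise tr looking for 304 leaves the UTF-8 form untouched.
printf '\303\204' | LC_ALL=C tr '\304' A | show_bytes    # prints 303 204
```

LC_ALL=C is set only to force byte-wise behaviour for the demo; what tr does with such bytes under a UTF-8 locale may vary between implementations.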
 
  


All times are GMT -5.
