LinuxQuestions.org


MheAd 06-20-2008 08:15 AM

How do I extract characters from several words on a line?
 
Hello.
I've been pulling my hair out over this for the past two days and I'm not getting anywhere. Basically, I'm writing a script that adds user accounts automatically. That part isn't really a problem; I pretty much know what to do once I've solved this dilemma.

I need to extract letters from personal names in a file and use them in my script. I was hoping 'sed' was the right tool, possibly combined with 'awk', 'tr' or 'cut'.

For instance, let's say I have a line in a file:

John Williams WORD

(all words separated with a single 'space')

I need to extract the first two letters from each of the first two words and put them together, then place the third word (WORD) in front of them, and finally convert the resulting "word" to lowercase.

So a line: "John Williams FTP" should output

ftpjowi

(please note that the third WORD can be anything and should be copied in its entirety)

None of the guides I've found so far have helped much. I did finally make it work, kind of. It produces the result I need, but I feel this was more luck than understanding of what I was doing (especially the sed part). I ended up with quite a long piped command, and I'm fairly sure it can be done in a more rational way:

Code:

sed 's/\(..\) \(..\)/\2 \1/' listfile | awk '{ print $3 $1 }' | tr '[:upper:]' '[:lower:]'
So, you guys are probably laughing now, as am I. This thing actually works, though (you can try it: create a file named listfile with content formatted as described above, run the command against it, and it will work). Still, I know it's not the "proper" way to do it. There must be a better, cleaner, more professional way. I've just started with regular expressions, and they're not the easiest thing.

Thanks very much for your time and thanks in advance for any suggestion you might have.

M.

jschiwal 06-20-2008 08:57 AM

You could alter the sed command so that you don't need to pipe the output through awk:
Code:

echo 'John Williams FTP' | sed 's/\([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-zA-Z]*\)/\L\3\1\2/'
The "\L" part converts the rest of the replacement to lower case, but it is a GNU extension. There is a way to convert to lower case in a sed program, but you really have to jump through a lot of hoops, and it might as well be written in Sanskrit. The "tr" command is handy and easy to understand.
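
For readers without GNU sed, one possible way to lowercase inside sed itself is the standard "y" command, which maps each character of its first list to the character at the same position in its second list. This is only a sketch under that assumption, not necessarily the approach the post above had in mind; the function name login_posix is made up for illustration.

```shell
# Hypothetical helper: reads "First Last WORD" lines on stdin.
# First expression: same substitution as above, minus the GNU-only \L.
# Second expression: y/// maps each upper-case ASCII letter to lower case.
login_posix() {
    sed -e 's/\([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-zA-Z]*\)/\3\1\2/' \
        -e 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/'
}

echo 'John Williams FTP' | login_posix   # ftpjowi
```

The two lists given to "y" must have the same length, which is why the alphabet is written out in full.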

Uxinn 06-20-2008 09:34 AM

I'm sure this is possible with a sed one-liner, but here is how I did it:


Code:

[~]: cat text.txt
John Williams FTP
Steinar Marino SS
Silvester Stallone DIR
John Smith NOBODY

[~]: cat text.txt |tr '[A-Z]' '[a-z]'|awk '{print $3 substr($1,1,2) substr($2,1,2)}'
ftpjowi
ssstma
dirsist
nobodyjosm


jschiwal 06-20-2008 09:41 AM

Quote:

Originally Posted by Uxinn (Post 3190255)

[~]: cat text.txt |tr '[A-Z]' '[a-z]'|awk '{print $3 substr($1,1,2) substr($2,1,2)}'
ftpjowi
ssstma
dirsist
nobodyjosm


Using cat unnecessarily is one of my pet peeves.

tr '[A-Z]' '[a-z]' <text.txt

Although one handy exception is using "cat -n" to add line numbers.
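
Taking that one step further, the whole job can probably be done in a single awk process, since awk has its own tolower() and substr() functions. The following is a minimal sketch; the function name make_logins is invented for the example, and it assumes every input line has exactly three space-separated words.

```shell
# Hypothetical helper: pass file names as arguments, or feed lines on stdin.
# $3 is the third word; substr(x, 1, 2) takes the first two characters.
make_logins() {
    awk '{ print tolower($3 substr($1, 1, 2) substr($2, 1, 2)) }' "$@"
}

printf '%s\n' 'John Williams FTP' 'Steinar Marino SS' | make_logins
# ftpjowi
# ssstma
```

With a file, the call would simply be: make_logins text.txt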

MheAd 06-20-2008 10:49 AM

Guys, thanks a bunch!
All suggested solutions worked like a charm!
Keep 'em coming: the more I learn about these different approaches, the better!

I do have a few questions on the currently stated solutions.
I'll get back to them soon.

MheAd 06-20-2008 10:56 AM

Quote:

Originally Posted by jschiwal (Post 3190226)
You could alter the sed command so that you don't need pipe the output through awk:
Code:

echo 'John Williams FTP' | sed 's/\([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-zA-Z]*\)/\L\3\1\2/'
The "\L" part converts the rest of the replacement to lower case, but it is a GNU extension. There is a way to convert to lower case in a sed program, but you really have to jump through a lot of hoops, and it might as well be written in Sanskrit. The "tr" command is handy and easy to understand.



Okay, here is what I'm wondering about.
Wasn't there a way to use dots (.) instead of those a-z ranges, so that sed picks *any* first two characters of the first two words, not necessarily upper- or lowercase letters? Of course, only letters will actually be used, but I'm asking in order to simplify the expression.

Also, what exactly does [^ ]* mean here, for the first two words?

As for the third word, was it really necessary to state, from what I'm reading here, "one upper- or lowercase letter followed by zero or more letters", or is there another way to put it? I mean, what would happen if digits were used as the first two characters of the third word?

In other words, is there a way to rewrite this expression so that sed accepts *any* type of character, where the only strict rules are: the first 2 characters from the first 2 words plus the entire 3rd word, no matter how many or what characters it contains (and, of course, keeping the \3\1\2 substitution)?

Thanks in advance!

MheAd 06-20-2008 02:16 PM

Ok guys, I think I've figured out how to write just what I need.

This should do it:

Code:

echo "John Williams FTP" | sed 's/\(..\)[^ ]* \(..\)[^ ]* \(.*\)/\3\1\2/' | tr '[:upper:]' '[:lower:]'
However, I started thinking about something more advanced I could add to this command: converting non-English letters to English ones. I wouldn't go further than converting Swedish characters to plain Latin ones, for instance Ä, Å and Ö to A, A and O, to make the result more suitable for the /etc/passwd file. Is there a way to add another pipe stage to the command above that filters the output for non-English characters?

Mr. C. 06-20-2008 04:12 PM

There is no reason to add additional commands; tr can do what you are asking. Change tr's first argument to include the appropriate octal codes and its second argument to include the Latin replacements. E.g.:
Code:

tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO'

$ echo 'Testing the \304, \305, and \326 Chars' | tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO'
testing the A, a, and O chars


MheAd 06-20-2008 04:43 PM

Quote:

Originally Posted by Mr. C. (Post 3190600)
There is no reason to add additional commands; tr can do what you are asking. Change tr's first argument to include the appropriate octal codes and its second argument to include the Latin replacements. E.g.:
Code:

tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO'

$ echo 'Testing the \304, \305, and \326 Chars' | tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO'
testing the A, a, and O chars


I'm not sure what that string would accomplish.

By writing:

Code:

$ echo 'Testing the \304, \305, and \326 Chars' | tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO'
it returns (in my case):

Code:

testing the \304, \305, and \326 chars
This seems a bit sophisticated. Could it have something to do with locale settings as well? I mean, my settings are Swedish; I don't know about your country. Maybe the codes vary depending on that. Also, when I include those codes in the original command I wrote, nothing changes when it filters a Swedish name. Åå, Ää and Öö still remain, instead of the desired Aa, Aa and Oo.

Perhaps it would be easier to pipe the output of the command above into an additional tr command. The preceding commands would already have done the extraction and substitution, so this new tr command would only need to translate lowercase äåö to lowercase aao. And from what I understand, lowercase and uppercase letters do not have the same octal value. Perhaps that's causing the problem?

Mr. C. 06-20-2008 04:51 PM

You are entering the characters \, 3, 0, followed by 4. That is not the same thing as entering the single character which is represented in the shell as \304.

Try copying the Å character here. Then in your shell, type Ctrl-V and then immediately paste the copied character. Now backspace over it and see how it deletes the single \304 character.

MheAd 06-20-2008 04:56 PM

I'm sorry, but I don't understand what you mean.

Mr. C. 06-20-2008 05:08 PM

Paste your paragraph(s) above that include the Swedish characters into a file. Be sure that your characters appear correctly in the file. Call the file foo.

Then, copy and paste this command into the shell and run it:

Code:

tr '[[:upper:]]\304\305\326\345\344\366' '[[:lower:]]AAOaao' < foo
You should see your characters are transliterated into their latin lookalikes.

MheAd 06-20-2008 05:21 PM

Nope. Nothing happens. Swedish characters remain Swedish.
:(

jschiwal 06-20-2008 05:43 PM

Look in the manpage for "regex". If I remember correctly, there are character classes.

You can also use [[:alpha:]] instead of [a-z]. The "[A-Z][a-z]*" pattern seems to match a formal name better, but I guess it could be tripped up by a name like "McDonald". There is also a non-standard GNU extension to match the beginning of a word. Using "." will match anything, including spaces.

The caret as the first character in a set, e.g. [^abcd], means any character that isn't a or b or c or d. [^ ] means any non-space character. So "[A-Z][^ ]* " will match a capital letter, any run of non-space characters after it, and the space that follows. So "[^ ]* " is a way to match words or arguments separated by spaces.

You can use an expression like [[=o=]] to match equivalent "o" or accented versions in the same equivalence class. This may depend on the locale you use.

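Look at the "tr" command or the "y" command in sed for translating characters. Note that "y" takes two literal lists of equal length and does not understand bracket classes such as [=a=], so each accented character has to be listed explicitly, with its plain replacement at the same position in the second list. Something of the form
sed y/XYZ/aao/
(with the accented letters in place of XYZ) may translate accented a's and o's. I haven't tried this and attempting to type these characters on my keyboard might lead to physical injury!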
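
To make the difference between "." and "[^ ]" concrete, here is a small sketch using made-up helper names. The first function shows "[^ ]* " swallowing the rest of a word plus the space after it; the second shows that a dot will happily match a space when the first word is shorter than two characters.

```shell
# Hypothetical helpers for illustration only.

# Keep the first two characters, drop the rest of the first word and the
# space after it, and mark the spot with "|".
skip_rest_of_word() {
    sed 's/^\(..\)[^ ]* /\1|/'
}

# Capture "any two characters" at the start of the line and show them in
# brackets; the second dot is free to match a space.
first_two_any() {
    sed 's/^\(..\).*/[\1]/'
}

echo 'John Williams FTP' | skip_rest_of_word   # Jo|Williams FTP
echo 'J Williams FTP' | first_two_any          # [J ]
```

So replacing the letter ranges with dots is fine as long as every name has at least two characters; with shorter names a space can end up in the captured text.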

Mr. C. 06-20-2008 05:45 PM

Hmmm. What happens with these two commands? :
Code:

printf "\304\n"
printf "\304\n" | tr '\304' A

You can copy/paste directly.
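
The thread ends without a confirmed cause. One possible explanation, which is an assumption and not something established above, is that the terminal and files use UTF-8. In UTF-8, Ä is stored as the two bytes \303\204 rather than the single Latin-1 byte \304, so a tr that works byte by byte never sees \304. The sketch below sidesteps that by using sed substitutions, which replace whole byte sequences; it assumes the script and the input share the same encoding, and the function name strip_swedish is invented for the example.

```shell
# Hypothetical helper: replace Swedish letters with plain Latin ones.
# Each s///g replaces the complete (possibly multi-byte) character.
strip_swedish() {
    sed -e 's/Ä/A/g' -e 's/Å/A/g' -e 's/Ö/O/g' \
        -e 's/ä/a/g' -e 's/å/a/g' -e 's/ö/o/g'
}

# Inspect how the character is stored: when this script is saved as UTF-8,
# od shows the two octal bytes 303 204 instead of a single 304.
printf 'Ä' | od -An -to1

# Full pipeline: transliterate first, then build the login name.
echo 'Åsa Öberg FTP' | strip_swedish \
    | sed 's/\(..\)[^ ]* \(..\)[^ ]* \(.*\)/\3\1\2/' \
    | tr '[:upper:]' '[:lower:]'       # ftpasob
```

Transliterating before extracting the first two characters matters here, because ".." counts bytes rather than letters in some locales.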


All times are GMT -5.