How do I extract characters from several words on a line?
Hello.
I've been pulling my hair out over this for the past two days and I'm not getting anywhere. Basically, I'm writing a script to add user accounts automatically. That part isn't really a problem; I pretty much know what to do once I've solved this dilemma.
I need to extract letters from personal names in a file and use them in my script. I was hoping 'sed' was the right tool, possibly combined with 'awk', 'tr' or 'cut'. For instance, let's say I have a line in a file:
John Williams WORD
(all words separated by a single space). I need to extract the first two letters from each of the first two words, put them together, place the third word before them, and finally convert that new "word" to lowercase. So the line "John Williams FTP" should output
ftpjowi
(please note that the third WORD can be anything and should be copied in its entirety).
None of the guides I've found so far have helped much. I did finally make it work, kind of. It produces the result I need, but I feel that was more luck than understanding of what I was doing (especially the sed part). I ended up with quite a long piped command, and I'm sure it can be done in a more rational way:
Code:
sed 's/\(..\) \(..\)/\2 \1/' <listfile> | awk '{ print $3 $1 }' | tr [[:upper:]] [[:lower:]]
Thanks very much for your time, and thanks in advance for any suggestions you might have. M. |
You could alter the sed command so that you don't need to pipe the output through awk:
Code:
echo 'John Williams FTP' | sed 's/\([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-zA-Z]*\)/\L\3\1\2/' |
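A note on the command above: the \L in the replacement is a GNU sed extension, so it may not work with other sed implementations. As one more approach, here is a minimal sketch that does the whole job in awk alone, using only the standard substr() and tolower() functions. The function name login_awk is made up for illustration, and the sketch assumes the fields are separated by whitespace as described in the question.

```shell
# login_awk is a hypothetical helper name, not something from this thread.
# Reads lines of the form "First Last WORD" on stdin and prints
# WORD + first two letters of First + first two letters of Last, lowercased.
login_awk() {
    awk '{ print tolower($3 substr($1, 1, 2) substr($2, 1, 2)) }'
}

printf '%s\n' 'John Williams FTP' | login_awk
# ftpjowi
```

Because awk splits on whitespace by itself, no regular expression is needed here at all.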
I'm sure this is possible with a sed one-liner, but here is how I did it:
Code:
[~]: cat text.txt |
Quote:
Using cat unnecessarily is one of my pet peeves.
Code:
tr '[A-Z]' '[a-z]' <text.txt
Although one handy exception is using "cat -n" to add line numbers. |
Guys, thanks a bunch!
All the suggested solutions worked like a charm! Keep 'em coming: the more I learn about these different approaches, the better! I do have a few questions about the solutions posted so far. I'll get back to them soon. |
Quote:
Okay, here is what I'm wondering about. Couldn't you use dots (.) instead of those a-z ranges, so that sed picks *any* first two characters of the first two words, not necessarily upper- or lowercase letters? Of course, only letters will actually be used, but I'm asking because it would simplify the expression.
Also, what exactly does [^ ]* mean here, for the first two words?
As for the third word, was it really necessary to say, from what I'm reading here, "at least one upper- or lowercase letter", or is there another way to put it? I mean, what would happen if digits were the first two characters of the third word?
In other words, is there a way to rewrite this expression so that sed accepts *any* type of character, with the only strict rules being the first 2 characters of the first 2 words plus the entire 3rd word, no matter how many or what characters it contains (and, of course, keeping the \3\1\2 substitution)? Thanks in advance! |
Ok guys, I think I've figured out how to write just what I need.
This should do it: Code:
echo "John Williams FTP" | sed 's/\(..\)[^ ]* \(..\)[^ ]* \(.*\)/\3\1\2/' | tr [[:upper:]] [[:lower:]] |
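For reuse in a script, the pipeline above can be wrapped in a function. The sketch below is only one way to do it: make_login is a made-up name, and the input is assumed to be lines with single spaces between the words, as in the original question. The tr classes are written as quoted '[:upper:]' and '[:lower:]' because an unquoted [[:upper:]] is also a shell glob pattern and could be expanded if the current directory happens to contain a file with a single uppercase letter as its name.

```shell
# make_login is a hypothetical helper name, not something from this thread.
# Reads "First Last WORD" lines on stdin; prints WORD + 2 chars + 2 chars, lowercased.
make_login() {
    sed 's/\(..\)[^ ]* \(..\)[^ ]* \(.*\)/\3\1\2/' | tr '[:upper:]' '[:lower:]'
}

printf '%s\n' 'John Williams FTP' 'Mary Smith 42x' | make_login
# ftpjowi
# 42xmasm
```

The second input line shows that the third word is copied whole even when it starts with digits, since \(.*\) accepts any characters.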
There is no reason to add additional commands; tr can do what you are asking. Change tr's first argument to include the appropriate octal codes and its second argument to include the Latin replacements. E.g.:
Code:
tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO' |
Quote:
By writing:
Code:
$ echo 'Testing the \304, \305, and \326 Chars' | tr '[[:upper:]]\304\305\326' '[[:lower:]]AaO'
I get:
Code:
testing the \304, \305, and \326 chars
Perhaps it would be easier to add another tr command, piped after the string above, because the preceding commands would already have done the extraction and substitution, so this new tr command would only need to translate lowercase äåö to lowercase aao. And from what I understand, lowercase and uppercase letters do not have the same octal value. Perhaps that's causing the problem? |
You are entering the characters \, 3, 0, followed by 4. That is not the same thing as entering the single character which is represented in the shell as \304.
Try copying the Å character here. Then in your shell, type Ctrl-V and then immediately paste the copied character. Now backspace over it and see how it deletes the single \304 character. |
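The difference described above can be checked from the command line. This is a small sketch, assuming a POSIX printf and tr; LC_ALL=C is set on tr so that it works on raw bytes regardless of the terminal's locale.

```shell
# Four ordinary characters: backslash, 3, 0, 4
printf '%s' '\304' | wc -c
# prints 4 (some systems pad the number with spaces)

# A single byte with octal value 304; printf interprets the escape in its format string
printf '\304' | wc -c
# prints 1

# tr translates the real byte ...
printf '\304\n' | LC_ALL=C tr '\304' 'A'
# A

# ... but leaves the four literal characters alone
printf '%s\n' '\304' | LC_ALL=C tr '\304' 'A'
# \304
```

The last command reproduces what happened in the quoted test: the text contained the escape sequence spelled out, not the byte it stands for.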
I'm sorry, but I don't understand what you mean.
|
Paste your paragraph(s) above that include the Swedish characters into a file. Be sure that your characters appear correctly in the file. Call the file foo.
Then, copy and paste this command into the shell and run it: Code:
tr '[[:upper:]]\304\305\326\345\344\366' '[[:lower:]]AAOaao' < foo |
Nope. Nothing happens. Swedish characters remain Swedish.
:( |
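One possible explanation, which the thread does not confirm, is that the file is encoded as UTF-8. The octal codes \304, \305, \326, \344, \345 and \366 are the single-byte ISO-8859-1 values of Ä, Å, Ö, ä, å and ö; in UTF-8 each of these letters is stored as two bytes (Å is \303\205, for example), so a byte-for-byte tr never sees a match. The sketch below assumes UTF-8 input and replaces the two-byte sequences with sed instead. The name fold_swedish is made up for illustration.

```shell
# Å in UTF-8 is two bytes, not one
printf '\303\205' | wc -c
# prints 2

# fold_swedish is a hypothetical helper name, not something from this thread.
# Assumes UTF-8 input. LC_ALL=C makes sed treat the text as raw bytes,
# so each two-byte sequence is matched literally.
# Sequences, in order: Å, Ä, Ö, å, ä, ö
fold_swedish() {
    LC_ALL=C sed \
        -e "s/$(printf '\303\205')/A/g" \
        -e "s/$(printf '\303\204')/A/g" \
        -e "s/$(printf '\303\226')/O/g" \
        -e "s/$(printf '\303\245')/a/g" \
        -e "s/$(printf '\303\244')/a/g" \
        -e "s/$(printf '\303\266')/o/g"
}

# "björn Åsa" written as explicit UTF-8 bytes
printf 'bj\303\266rn \303\205sa\n' | fold_swedish
# bjorn Asa
```

Running od -c on the file is a quick way to see which encoding it really uses before choosing between this and the tr approach.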
Look in the manpage for "regex". If I remember correctly, there are character classes.
You can also use [[:alpha:]] instead of [a-z]. The "[A-Z][a-z]*" pattern seems to match a formal name better, but I guess it could be tripped up by a name like "McDonald". There is also a non-standard GNU extension to match the beginning of a word.
Using "." will match anything, including spaces. The caret as the first character in a set, e.g. [^abcd], means any character that isn't a or b or c or d. [^ ] means any non-space character. So "[A-Z][^ ]* " will match a word that starts with a capital letter, up to and including the space after it. So "[^ ]* " is a way to match words or arguments separated by spaces.
You can use an expression like [[=o=]] to match "o" or accented versions in the same equivalence class. This may depend on the locale you use. Look at the "tr" command or the "y" command in sed for translating characters. sed y/[[=a=][=o=]/ao/ may translate accented a's and o's. I haven't tried this, and attempting to type these characters on my keyboard might lead to physical injury! |
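The pattern pieces explained above can be tried one at a time. This is just an illustrative sketch on the sample line from the question; the X replacement is arbitrary. As far as I know, sed's y command maps literal characters one-to-one and does not expand bracket expressions such as [=a=], so the untested y suggestion above is unlikely to work as written.

```shell
line='John Williams FTP'

# [^ ]* stops at the first space, so only the first word is replaced
printf '%s\n' "$line" | sed 's/[^ ]*/X/'
# X Williams FTP

# .* also matches spaces, so it swallows the whole line
printf '%s\n' "$line" | sed 's/.*/X/'
# X

# [[:alpha:]] matches letters only, like [a-zA-Z] in the C locale
printf '%s\n' "$line" | sed 's/[[:alpha:]][[:alpha:]]*/X/'
# X Williams FTP

# A letters-only third group does not match a word starting with digits,
# so sed leaves the line unchanged
printf '%s\n' 'John Williams 42x' |
    sed 's/\([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-z]\)[^ ]* \([a-zA-Z][a-zA-Z]*\)/\3\1\2/'
# John Williams 42x
```

The last case answers the earlier question about digits in the third word: with the letters-only pattern nothing is substituted, which is why \(.*\) is the more forgiving choice there.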
Hmmm. What happens with these two commands? :
Code:
printf "\304\n" |