extract text from files
Hi,
I have many files in a folder from which I need to extract some contents. These are basically text files which have individual lines such as:
name: john
address: whatever
phone: 123456
Some caveats:
1. Sometimes a line might be missing:
name: johnn
phone: 123456
2. The lines are not at the same line numbers across the files.
I did try some things with awk based on Google searches, but I couldn't extract the data of each file into a single line (this is the ultimate goal):
john,whatever,123456
I don't have knowledge other than having put some bash scripts together for backup jobs, so I am open to installing anything that could help to pull this off. Any help will be greatly appreciated. Regards. |
Greetingz!
Sounds like you might want to use "egrep". I would suggest first reading the man page for the command; however, the following command may help:
egrep -i "name:|phone:|address:" /path/to/files/*
Each pattern you want to find is separated by the pipe ("|") symbol, and the entire set of patterns must be quoted so that the shell does not interpret the pipes itself. |
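A small sketch of the same search, in case it helps. Two things worth knowing: `egrep` is equivalent to `grep -E` (recent GNU grep releases print a warning when `egrep` is used), and when grep is given several files it prefixes every match with the file name unless `-h` is passed (supported by GNU and BSD grep). The sample files `form_1` and `form_2` are made up for the demo, and anchoring the labels at the start of the line is an assumption about the data.

```shell
#!/bin/sh
# Hypothetical sample data so the sketch can be run as-is.
tmpdir=$(mktemp -d)
printf 'userid: 1\nname: john\naddress: whatever\nphone: 123456\n' > "$tmpdir/form_1"
printf 'Name: jane\nphone: 654321\n' > "$tmpdir/form_2"

# -E: extended regex, -i: ignore case, -h: omit the "file:" prefix
result=$(grep -Eih '^(name|address|phone):' "$tmpdir"/form_*)
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

This still prints one matching line per field, so the lines have to be joined per file in a later step. |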
wow, that was a quick reply !! :D
It does extract the data, but every item is still printed on its own line, and I need to combine/chain them into one line per original file (actually a CSV file). That is the part where I got stuck :S TIA |
Ah!
Okay, well then I would pipe the output from the "egrep" command I mentioned earlier to awk (and maybe sort, too):
egrep -i "name:|phone:|address:" /path/to/files/* |\
awk 'NR == 1 { line = sq $0 sq } { line = line "," sq $0 sq } END { print line }'
Now, for some darn reason, this will repeat the first field twice. I have *no* idea why, so if anyone else chimes in I'd really appreciate it. |
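The repeated first field in the awk command above comes from how awk applies rules: every rule whose pattern matches the current record is run, so for the first line both the `NR == 1` rule and the pattern-less rule fire, and the line is appended twice. (`sq` is never assigned there either, so it expands to an empty string.) Below is a minimal sketch of one way to get one CSV line per file instead. It assumes each label starts its line, that the wanted column order is name, address, phone, and that values contain no commas. The sample files and the `extract_csv` function name are made up for the demo.

```shell
#!/bin/sh
# Hypothetical sample data so the sketch can be run as-is.
tmpdir=$(mktemp -d)
printf 'userid: 1\nname: john\naddress: whatever\nphone: 123456\n' > "$tmpdir/form_1"
printf 'phone: 654321\nname: jane\n' > "$tmpdir/form_2"

# Print one CSV line per input file, columns in the order name,address,phone.
# A field that is missing from a file becomes an empty column.
extract_csv() {
    awk '
        function emit() { if (seen) print name "," address "," phone }
        FNR == 1 { emit(); name = address = phone = ""; seen = 1 }  # new file starts
        { sub(/\r$/, "") }                       # tolerate DOS line endings
        /^name:/    { sub(/^name:[ \t]*/, "");    name = $0;    next }
        /^address:/ { sub(/^address:[ \t]*/, ""); address = $0; next }
        /^phone:/   { sub(/^phone:[ \t]*/, "");   phone = $0;   next }
        END { emit() }                           # flush the last file
    ' "$@"
}

result=$(extract_csv "$tmpdir"/form_*)
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because the fields are stored by label rather than by position, the order of the lines inside each file does not matter, and unrelated lines are simply ignored. |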
Hi,
Try this:
sed -n ':mark /phone:/ ! {N;b mark}; /phone:/ {s/name:[ ]*//;s/phone:[ ]*/,/;s/address:[ ]*/,/;s/\n//g;p}' infile > outfile
This assumes that there is no 'junk' in between the records. If there is only one record per file then the command can be simplified. Hope this helps |
@crts
Thanks for the suggestion. However, that code is giving me individual lines too and, more importantly, my source files do have garbage before the contents I really need, and that is getting dumped too.
Code:
nombre_apellido: John
Code:
,ciudad: TheCitytreet 123
TIA |
... and that is exactly why you should always give a representative example of your data. This was not obvious from your initial post. In fact, your description implied that there is nothing in between.
Code:
sed -n '/nombre_apellido:/ {:mark /ciudad:/ ! {N;b mark}; /ciudad:/ {s/nombre_apellido:[ ]*//;s/ciudad:[ ]*/,/;s/direccion:[ ]*/,/;s/\n//g;p};}' infile > outfile |
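In case the one-liner is hard to follow, here is the same idea spread over several lines with comments. It is only a sketch against a made-up sample file (the `userid` and `extra` lines and the field values are assumptions), and it relies on the wanted fields sitting between `nombre_apellido:` and `ciudad:` in the file.

```shell
#!/bin/sh
# Hypothetical sample record with unrelated lines before and after it.
infile=$(mktemp)
printf 'userid: 1\nnombre_apellido: John\ndireccion: TheStreet 123\nciudad: TheCity\nextra: x\n' > "$infile"

result=$(sed -n '
/nombre_apellido:/{
    # start of a record: keep appending lines until ciudad: shows up
    :mark
    /ciudad:/!{
        N
        b mark
    }
    # strip the first label and turn the later ones into commas
    s/nombre_apellido:[ ]*//
    s/ciudad:[ ]*/,/
    s/direccion:[ ]*/,/
    # join the collected lines and print the record
    s/\n//g
    p
}' "$infile")
printf '%s\n' "$result"
rm -f "$infile"
```

The fields come out in the order they appear in the file, not in a fixed column order, and a carriage return left at the end of a line would be kept inside the joined record. |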
My apologies, it didn't seem it could cause any harm at the time...
Lesson learned. The results of your new version are much closer:
Code:
John
It probably doesn't matter now as you resolved that, but the data above what I need to extract is:
userid: 123456
userstatus: 1
usergroup: somegroup
And there are more items at the end which I don't need. Whether the fields are present or not varies from one file to another. Thank you very much for the help. |
I just realized the empty lines are caused by the files having Windows line breaks...
so I am using this to convert them to a single Unix file:
Code:
awk '{ sub("\r$", ""); print }' form_* > unix/merged.txt
Thanks a ton! Best regards. |
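If the one-file-per-record boundary should survive the conversion (an assumption; the merged file works fine with the marker-based sed command), a per-file loop is one option. This sketch uses `tr` and made-up sample files in a temporary directory standing in for the real folder.

```shell
#!/bin/sh
srcdir=$(mktemp -d)      # hypothetical stand-in for the folder holding form_*
mkdir "$srcdir/unix"
printf 'name: john\r\nphone: 123456\r\n' > "$srcdir/form_1"
printf 'name: jane\r\n' > "$srcdir/form_2"

for f in "$srcdir"/form_*; do
    # tr deletes every carriage return; the converted copy keeps its file name
    tr -d '\r' < "$f" > "$srcdir/unix/$(basename "$f")"
done

converted=$(ls "$srcdir/unix" | wc -l | tr -d ' ')
cr_left=$(cat "$srcdir"/unix/form_* | tr -cd '\r' | wc -c | tr -d ' ')
first=$(cat "$srcdir/unix/form_1")
echo "files: $converted, carriage returns left: $cr_left"
rm -rf "$srcdir"
```

Note that `tr -d '\r'` removes carriage returns anywhere in the file, not only at the end of a line, which is normally harmless for plain text forms. |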
what's it gonna be?
|
The original data is in Spanish; I tried to translate it to English so that the foreign language would be out of the way when asking for help in an English-speaking forum.
I intended to illustrate what can be found in the lines prior to the ones I need. Maybe I chose the wrong words...
If I made a mistake about the contents at the beginning of the documents (and I already apologized), it was due to the fact that I don't know about this, which is why I need help in the first place. Thanks again for the help; as I said, it is working now. |
Hi,
I did not mean to be harsh. I just wanted to point out how I perceived the development of the initial problem.
So when you said that the command did not work, I assumed that was due to the arrangement of your data. At this point I had already double-checked the command. Since I did not see a Windows logo on the left side of your posts, the possibility of a DOS-formatted file (good work on catching that, by the way) did not cross my mind. Anyway, glad I could help.
P.S.: A slightly shorter way to convert DOS to UNIX files:
Code:
sed 's/.$//' dos.file > unix.file |
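One caveat with `sed 's/.$//'`: it removes the last character of every line, whatever that character is, so running it on a file that is already in Unix format eats real data. A sketch of a variant that only touches a trailing carriage return follows; the sample strings are made up, and the `cr` variable is just a portable way of getting a literal carriage return into the sed script.

```shell
#!/bin/sh
cr=$(printf '\r')        # a literal carriage return, works beyond GNU sed

# Only a trailing carriage return is removed, so Unix input passes through unchanged.
dos_out=$(printf 'name: john\r\nphone: 123456\r\n' | sed "s/$cr\$//")
unix_out=$(printf 'name: john\nphone: 123456\n' | sed "s/$cr\$//")

# For comparison: the shorter form also strips the last real character.
blunt_out=$(printf 'name: john\nphone: 123456\n' | sed 's/.$//')

printf '%s\n' "$dos_out" "$unix_out" "$blunt_out"
```

With GNU sed, `sed 's/\r$//'` should do the same without the helper variable. |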
Sir, I have to say you are correct :D . I didn't use 'before' as I thought and my wording wasn't all that good.
Thanks for the help and the extra tip, I really appreciate it. Cheers :) |
Well, assuming the start and end points will always be there:
Code:
awk -F: '/start/,/end/{printf $2","}/end/' file |
Hi,
I tried
Code:
awk -F: '/name/,/phone/{printf $2","}/phone/' file # gawk 3.1.6
Code:
something
Code:
JohnA, TheStreet 123A, TheCityA,phone: TheCityA |
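Two details of the range command may explain stray output. The trailing `/phone/` rule has no action, so awk prints the whole matching line, label included. And `printf $2","` uses the data itself as the format string, which misbehaves if a value contains `%`. Here is a hedged sketch of a tidier variant; the sample file is made up, and it assumes labels start their lines and are followed by a colon and optional spaces.

```shell
#!/bin/sh
# Hypothetical sample file with unrelated lines around the wanted range.
tmpfile=$(mktemp)
printf 'userid: 1\nname: JohnA\naddress: TheStreet 123A\nphone: 555\nnotes: x\n' > "$tmpfile"

result=$(awk -F': *' '
    /^name:/, /^phone:/ {
        printf "%s%s", sep, $2      # fixed format string, data passed as argument
        sep = ","
    }
    /^phone:/ { print ""; sep = "" }   # end the CSV line explicitly instead of printing $0
' "$tmpfile")
printf '%s\n' "$result"
rm -f "$tmpfile"
```

Like the original, this still depends on both the start and the end line being present in every file, and a value that itself contains a colon followed by text would be cut at that colon. |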