Old 08-28-2010, 06:06 PM   #1
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Rep: Reputation: 2
extract text from files


Hi,

I have many files in a folder from which I need to extract some content. These are basically text files which have individual lines like (e.g.):
name: john
address: whatever
phone: 123456

Some caveats

1. Sometimes a line might be missing.
name: johnn
phone: 123456

2. Lines are not at the same line numbers across the files.


I did try some things with awk based on Google searches, but I couldn't extract the data from each file into a single line (this is the ultimate goal):
john,whatever,123456

I don't have much knowledge beyond having put some bash scripts together for backup jobs, so I am open to installing anything that could help pull this off.

Any help will be greatly appreciated.
Regards.
 
Old 08-28-2010, 06:24 PM   #2
xeleema
Member
 
Registered: Aug 2005
Location: D.i.t.h.o, Texas
Distribution: Slackware 13.x, rhel3/5, Solaris 8-10(sparc), HP-UX 11.x (pa-risc)
Posts: 988
Blog Entries: 4

Rep: Reputation: 254
Greetingz!

Sounds like you might want to use "egrep". I would suggest first reading the man page for the command; however, the following may help:

egrep -i "name:|phone:|address:" /path/to/files/*

Each pattern you want to find is separated by the pipe ("|") symbol, and the entire set of patterns must be wrapped in the double-quote character.
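
When run against more than one file, grep will also prefix each match with the file name, so the output would look something like this (a made-up sample, assuming files named form_1, form_2, and so on):
Code:
/path/to/files/form_1:name: john
/path/to/files/form_1:address: whatever
/path/to/files/form_1:phone: 123456
/path/to/files/form_2:name: jane
/path/to/files/form_2:phone: 654321
That file-name prefix comes in handy if you later want to group the matches back into one line per file.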
 
Old 08-28-2010, 06:35 PM   #3
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
Wow, that was a quick reply!!

It does extract the data, but every item is still dumped on its own line, and I need to combine/chain them into one line per original file (a CSV file, actually).

That is the part where I got stuck :S

TIA

Last edited by Frakk; 08-28-2010 at 06:38 PM. Reason: typos
 
Old 08-28-2010, 06:49 PM   #4
xeleema
Member
 
Registered: Aug 2005
Location: D.i.t.h.o, Texas
Distribution: Slackware 13.x, rhel3/5, Solaris 8-10(sparc), HP-UX 11.x (pa-risc)
Posts: 988
Blog Entries: 4

Rep: Reputation: 254

Ah!

Okay, well then I would pipe the output from the "egrep" command I mentioned earlier to awk (and maybe sort, too):

egrep -i "name:|phone:|address:" /path/to/files/* |\
awk 'NR == 1 { line = sq $0 sq } { line = line "," sq $0 sq } END { print line }'


Now, for some darn reason, this will repeat the first field twice. I have *no* idea why, so if anyone else chimes in I'd really appreciate it.
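
On second thought, a guess at the culprit: on the first input line both the NR == 1 rule and the unconditional rule run, so $0 gets appended to itself, and the sq variable is never defined, so it just expands to an empty string. A sketch that avoids both (same piped input assumed, not tested against your real files):
Code:
egrep -i "name:|phone:|address:" /path/to/files/* |\
awk 'NR == 1 { line = $0; next } { line = line "," $0 } END { print line }'
Note that this still joins every matching line from every file into one long record; it does not start a new line per file.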
 
Old 08-28-2010, 06:49 PM   #5
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Hi,

Try this
sed -n ':mark /phone:/ ! {N;b mark}; /phone:/ {s/name:[ ]*//;s/phone:[ ]*/,/;s/address:[ ]*/,/;s/\n//g;p}' infile > outfile

This assumes that there is no 'junk' in between the records. If there is only one record per file then the command can be simplified.
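
In case the one-liner looks opaque: it keeps appending input lines to the pattern space until "phone:" shows up, then strips the "name:" label, turns the other two labels into commas, deletes the embedded newlines and prints the joined record. The same thing spread over several lines (GNU sed assumed):
Code:
sed -n '
  :mark
  /phone:/ ! {
    N
    b mark
  }
  /phone:/ {
    s/name:[ ]*//
    s/phone:[ ]*/,/
    s/address:[ ]*/,/
    s/\n//g
    p
  }
' infile > outfile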

Hope this helps
 
Old 08-28-2010, 07:48 PM   #6
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
@crts

Thanks for the suggestion; however, that code is giving me individual lines too and, more importantly, my source files do have garbage before the content I really need, and that is getting dumped too.

Quote:
Originally Posted by xeleema

egrep -i "name:|phone:|address:" /path/to/files/* |\
awk 'NR == 1 { line = sq $0 sq } { line = line "," sq $0 sq } END { print line }'
Not sure if I am doing something wrong, but for this data
Code:
nombre_apellido: John
direccion: TheStreet 123
ciudad: TheCity
I am getting these results (I matched the "fields"; my original example had dummy field names, sorry):
Code:
,ciudad: TheCitytreet 123
Seems like everything is getting piled up. That is when processing a single file; it gets worse when it goes through all of them.

TIA
 
Old 08-28-2010, 08:05 PM   #7
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
... and that is exactly why you should always give a representative example of your data. This was not obvious from your initial post. In fact, your description implied that there is nothing in between.
Code:
sed -n '/nombre_apellido:/ {:mark /ciudad:/ ! {N;b mark}; /ciudad:/ {s/nombre_apellido:[ ]*//;s/ciudad:[ ]*/,/;s/direccion:[ ]*/,/;s/\n//g;p};}' infile > outfile
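The only real change from the earlier version is the outer /nombre_apellido:/ guard: the append loop now only starts on a line that begins a record, and with -n everything before that is simply never printed. Spread out for readability (GNU sed assumed):
Code:
sed -n '
  /nombre_apellido:/ {
    :mark
    /ciudad:/ ! {
      N
      b mark
    }
    s/nombre_apellido:[ ]*//
    s/direccion:[ ]*/,/
    s/ciudad:[ ]*/,/
    s/\n//g
    p
  }
' infile > outfile
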
If this still does not match then provide some representative sample data. I am not going to *guess* what your file might look like.
 
Old 08-28-2010, 08:34 PM   #8
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
My apologies; it didn't seem like it could cause any harm at the time...
Lesson learned.

The results of your new version are much closer:
Code:
John

,TheStreet 123

,TheCity
The empty lines are part of the results.

It probably doesn't matter now, as you resolved that, but the data above what I need to extract is:
userid: 123456
userstatus: 1
usergroup: somegroup

And there are more items at the end which I don't need.
Whether the fields are present or not varies from one file to another.

Thank you very much for the help.
 
Old 08-28-2010, 08:47 PM   #9
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
I just realized the empty lines are caused by the files having Windows line breaks...

So I am using this to convert them into a single Unix-format file:

Code:
awk '{ sub("\r$", ""); print }' form_* > unix/merged.txt
And it does work perfectly with that file.

Thanks a Ton!
Best regards.
 
Old 08-28-2010, 08:48 PM   #10
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
what's it gonna be?

Quote:
Originally Posted by Frakk
... these are basically text files which have individual lines like (e.g.):
name: john
address: whatever
phone: 123456

Some caveats

1. Sometimes a line might be missing.
name: johnn
phone: 123456
Quote:
Originally Posted by Frakk
Not sure if I am doing something wrong, but for this data
Code:
nombre_apellido: John
direccion: TheStreet 123
ciudad: TheCity
Quote:
Originally Posted by Frakk
what I need to extract is
userid: 123456
userstatus: 1
usergroup: somegroup
Please make up your mind first and
Quote:
If this still does not match then provide some representative sample data. I am not going to *guess* what your file might look like.
So far you have provided three different scenarios. I provided two solutions, both of which I tested, and they did work based on your sample data. Your last post suggests that your data is arranged as in your initial post. That is not representative data. We are going in circles right now.
 
Old 08-28-2010, 09:24 PM   #11
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
The original data is in Spanish; I try to translate it to English so the foreign language is out of the way when asking for help in an English-speaking forum.

Quote:
1. Sometimes a line might be missing.
name: johnn
phone: 123456
I meant that sometimes a line might not be present; in that example, address is missing and phone is right after name, just in case someone might think of using line numbers as a reference to identify the data.

Quote:
what I need to extract is
userid: 123456
userstatus: 1
usergroup: somegroup
I didn't say "what I need to extract is" I said "the data before what I need to extract is".
I intended to illustrate what can be found in the lines prior to the ones I need. Maybe I chose the wrong words...

Quote:
So far you have provided three different scenarios. I provided two solution that I both tested and they did work based on your sample data. Your last post suggests that your data is arranged as in your initial post. That is not representative data. We are going in circles right now.
Maybe that's unnecessarily harsh? Whether the item is called "name" or "nombre_apellido" doesn't really change anything.
If I made a mistake about the contents at the beginning of the documents (and I already apologized), it was due to the fact that I don't know about this, which is why I need help in the first place.

Thanks again for the help, as I said, it is working now.
 
Old 08-28-2010, 10:39 PM   #12
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Hi,

I did not mean to be harsh. I just wanted to point out how I perceived the development of the initial problem.
Quote:
The original data is in spanish ...
Yes, but you also stated in that post that there are lines that are to be excluded from the output. And that does qualify as an altered scenario. The translation alone, of course, does not.
Quote:
I said "the data before what I need to extract is"
That I do understand now. But your exact words were:
Quote:
the data above...
I must admit I couldn't make heads or tails of it. I thought that by 'above' you were referring to the data you presented in a post 'above'.
So when you said that the command did not work, I assumed that it was due to the arrangement of your data. At that point I had already double-checked the command. Since I did not see a Windows logo on the left side of your posts, the possibility of a DOS-formatted file (good work on catching that, by the way) did not cross my mind.
Anyway, glad I could help.

P.S.: A slightly shorter way to convert DOS files to UNIX format:
Code:
sed 's/.$//' dos.file > unix.file
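
One caveat: that strips the last character of every line, so it assumes every line really does end in a carriage return. A variant that only removes a CR when one is present (GNU sed understands \r; other seds may need a literal Ctrl-V Ctrl-M typed in its place):
Code:
sed 's/\r$//' dos.file > unix.file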
 
Old 08-29-2010, 12:00 AM   #13
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
Sir, I have to say you are correct. I didn't use 'before' as I thought, and my wording wasn't all that good.

Thanks for the help and the extra tip, I really appreciate it.
Cheers
 
Old 08-29-2010, 12:14 AM   #14
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
Well, assuming the start and end points will always be there:
Code:
awk -F: '/start/,/end/{printf $2","}/end/' file
 
Old 08-29-2010, 12:42 AM   #15
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Hi,

I tried
Code:
awk -F: '/name/,/phone/{printf $2","}/phone/' file   # gawk 3.1.6
with this data
Code:
something
at the 
start

name: JohnA
address: TheStreet 123A
phone: TheCityA

...
in the middle
of something
...
name: JohnB
address: TheStreet 123B
phone: TheCityB

name: JohnC
phone: TheCityC

... and to the
end
The output was:
Code:
JohnA, TheStreet 123A, TheCityA,phone: TheCityA
JohnB, TheStreet 123B, TheCityB,phone: TheCityB
JohnC, TheCityC,phone: TheCityC
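
The trailing "phone: ..." on each line comes from the bare /phone/ pattern at the end: with no action attached it prints the whole matching line (it seems to be there mainly to terminate the record with a newline). A variant that prints only the newline instead, with the caveat that the leading spaces and the trailing comma are still there (a sketch, not tested against the real files):
Code:
awk -F: '/name/,/phone/{printf "%s,",$2} /phone/{print ""}' file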
 
  

