LinuxQuestions.org - [SOLVED] extract text from files

Page 1 of 2

Frakk

08-28-2010 06:06 PM

extract text from files

Hi,

I have many files in a folder from which I need to extract some contents. These are basically text files which have individual lines like these:
name: john
address: whatever
phone: 123456

Some caveats

1. Sometimes a line might be missing.
name: johnn
phone: 123456

2. Lines are not in the same line-numbers across the files

I did try some things with awk based on Google searches, but I couldn't extract the data of each file into a single line (this is the ultimate goal):
john,whatever,123456

I don't have any knowledge other than having put some bash scripts together for backup jobs, so I am open to installing anything that could help to pull this off.

Any help will be greatly appreciated.
Regards.

xeleema

08-28-2010 06:24 PM

Greetingz!

Sounds like you might want to use "egrep". I would suggest reading the man page for the command first; however, the following command may help:

egrep -i "name:|phone:|address:" /path/to/files/*

Each pattern you want to find is separated by the pipe ("|") symbol, and the entire set of patterns must be wrapped in double quotes.
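
For what it's worth, "egrep" behaves like "grep -E", and when several files are given, grep prefixes every match with the file name. A minimal sketch, assuming a grep that supports -h (GNU and BSD grep do) and two made-up sample files standing in for the real ones:

```shell
# Two hypothetical sample files standing in for /path/to/files/*.
dir=$(mktemp -d)
printf 'userid: 1\nname: john\nphone: 123456\n' > "$dir/a.txt"
printf 'name: jane\naddress: somewhere\n' > "$dir/b.txt"

# Default with multiple files: each match is prefixed with "filename:".
prefixed=$(grep -E -i "name:|phone:|address:" "$dir"/*)

# -h drops the file name prefix, leaving only the matching lines.
plain=$(grep -E -i -h "name:|phone:|address:" "$dir"/*)

printf '%s\n' "$prefixed" "$plain"
```

The prefix matters as soon as the output is piped into something else, because it becomes part of every line.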

Frakk

08-28-2010 06:35 PM

wow, that was a quick reply !! :D

It does extract the data, but every item is still dumped on its own line, and I need to combine/chain them into one line per original file (actually a CSV file).

That is the part where I got stuck :S

TIA

xeleema

08-28-2010 06:49 PM

Ah!

Okay, well then I would pipe that output from the "egrep" command I mentioned earlier to awk (and maybe sort, too)

egrep -i "name:|phone:|address:" /path/to/files/* |\
awk 'NR == 1 { line = sq $0 sq } { line = line "," sq $0 sq } END { print line }'

Now, for some darn reason, this will repeat the first field twice. I have *no* idea why, so if anyone else chimes in I'd really appreciate it.
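
The repeat most likely comes from the awk program itself: on the first input line both the "NR == 1" block and the unconditional block run, so the first line gets appended to itself. In addition, "sq" is never assigned, so awk treats it as an empty string. A minimal sketch of the effect and one possible fix, using made-up input in place of the egrep output and assuming "sq" was meant to be a single quote:

```shell
# Sample input is made up for illustration; it stands in for the egrep output.
sample='name: john
address: whatever
phone: 123456'

# Original program: on line 1 BOTH blocks run, so line 1 is appended to itself.
# sq is never assigned, so it expands to an empty string.
buggy=$(printf '%s\n' "$sample" |
  awk 'NR == 1 { line = sq $0 sq } { line = line "," sq $0 sq } END { print line }')

# One way to avoid the repeat: "next" skips the remaining blocks for line 1.
# sq is passed in with -v, on the assumption that single quotes were intended.
fixed=$(printf '%s\n' "$sample" |
  awk -v sq="'" 'NR == 1 { line = sq $0 sq; next } { line = line "," sq $0 sq } END { print line }')

printf '%s\n' "$buggy" "$fixed"
```

Note that either version joins all input into one single line, so it does not yet give one line per file.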

crts

08-28-2010 06:49 PM

Hi,

Try this
sed -n ':mark /phone:/ ! {N;b mark}; /phone:/ {s/name:[ ]*//;s/phone:[ ]*/,/;s/address:[ ]*/,/;s/\n//g;p}' infile > outfile

This assumes that there is no 'junk' in between the records. If there is only one record per file then the command can be simplified.
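
For the one-record-per-file case, an awk version is one alternative (this is not the sed command above, just a sketch). It assumes the field names "name", "address" and "phone" from the first post, files named form_* in a temporary directory, and that a missing field should become an empty CSV column; all file names and contents here are made up:

```shell
# Hypothetical sample files, one record each; the second has no address line.
dir=$(mktemp -d)
printf 'userid: 1\nname: john\naddress: whatever\nphone: 123456\n' > "$dir/form_1"
printf 'name: jane\nphone: 654321\n' > "$dir/form_2"

# Collect the three fields per file and print one CSV line per file.
# Lines that match none of the fields (junk) are ignored.
extract() {
  awk '
    function emit() { if (seen) print name "," address "," phone }
    FNR == 1 { emit(); name = address = phone = ""; seen = 1 }  # new file starts
    { sub(/\r$/, "") }                                          # tolerate DOS line endings
    /^name:/    { sub(/^name:[ \t]*/, "");    name = $0;    next }
    /^address:/ { sub(/^address:[ \t]*/, ""); address = $0; next }
    /^phone:/   { sub(/^phone:[ \t]*/, "");   phone = $0;   next }
    END { emit() }
  ' "$@"
}

extract "$dir"/form_*
```

Keeping an empty column for a missing field means every output line has the same number of commas, which is usually easier to import as CSV.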

Hope this helps

Frakk

08-28-2010 07:48 PM

@crts

Thanks for the suggestion. However, that code is giving me individual lines too and, more importantly, my source files do have garbage before the contents I really need, and that is getting dumped too.

Quote:

Originally Posted by xeleema (Post 4081038)

egrep -i "name:|phone:|address:" /path/to/files/* |\
awk 'NR == 1 { line = sq $0 sq } { line = line "," sq $0 sq } END { print line }'

Not sure if I am doing something wrong, but for this data

Code:

nombre_apellido: John
direccion: TheStreet 123
ciudad: TheCity

I am getting these results (I matched the "fields" to the real ones, as my original example had dummy fields, sorry):

Code:

,ciudad: TheCitytreet 123

Seems like everything is getting piled up. That is from processing a single file; it gets worse when it goes through all of them.

TIA

crts

08-28-2010 08:05 PM

... and that is exactly why you should always give a representative example of your data. This was not obvious from your initial post. In fact, your description implied that there is nothing in between.

Code:

sed -n '/nombre_apellido:/ {:mark /ciudad:/ ! {N;b mark}; /ciudad:/ {s/nombre_apellido:[ ]*//;s/ciudad:[ ]*/,/;s/direccion:[ ]*/,/;s/\n//g;p};}' infile > outfile

If this still does not match then provide some representative sample data. I am not going to *guess* what your file might look like.

Frakk

08-28-2010 08:34 PM

My apologies, it didn't seem it could cause any harm at the time...
Lesson learned.

The results of your new version are much closer:

Code:

John



,TheStreet 123



,TheCity

The empty lines are part of the results.

It probably doesn't matter now as you resolved that, but the data above what I need to extract is
userid: 123456
userstatus: 1
usergroup: somegroup

And there are more items at the end which I don't need.
Whether the fields are present or not varies from one file to another.

Thank you very much for the help.

Frakk

08-28-2010 08:47 PM

I just realized the empty lines are caused by the files having Windows line breaks...

So I am using this to convert them and merge them into a single Unix file:

Code:

awk '{ sub("\r$", ""); print }' form_* > unix/merged.txt

And it does work perfectly with that file

Thanks a Ton!
Best regards.

crts

08-28-2010 08:48 PM

what's it gonna be?

Quote:

Originally Posted by Frakk (Post 4081013)

... these are basically text files which have individual lines with (i.e)
name: john
address: whatever
phone: 123456

Some caveats

1. Sometimes a line might be missing.
name: johnn
phone: 123456

Quote:

Originally Posted by Frakk (Post 4081065)

Not sure if I am doing something wrong, but for this data

Code:

nombre_apellido: John
direccion: TheStreet 123
ciudad: TheCity

Quote:

Originally Posted by Frakk (Post 4081089)

what I need to extract is
userid: 123456
userstatus: 1
usergroup: somegroup

Please make up your mind first and

Quote:

If this still does not match then provide some representative sample data. I am not going to *guess* what your file might look like.

So far you have provided three different scenarios. I provided two solutions, both of which I tested, and they did work based on your sample data. Your last post suggests that your data is arranged as in your initial post. That is not representative data. We are going in circles right now.

Frakk

08-28-2010 09:24 PM

The original data is in Spanish. I try to translate it to English so that the foreign language is out of the way when asking for help in an English-speaking forum.

Quote:

1. Sometimes a line might be missing.
name: johnn
phone: 123456

I meant sometimes a line might not be present, in that example address is missing and phone is next to name, just in case someone might think of using line numbers as a reference to identify the data.

Quote:

what I need to extract is
userid: 123456
userstatus: 1
usergroup: somegroup

I didn't say "what I need to extract is" I said "the data before what I need to extract is".
I intended to illustrate what can be found in the lines prior to the ones I need. Maybe I chose the wrong words...

Quote:

Maybe that's unnecessarily harsh? Whether the item is called "name" or "nombre_apellido" doesn't really change anything.
If I made a mistake about the contents at the beginning of the documents (and I already apologized), it was due to the fact that I don't know about this, which is why I need help in the first place.

Thanks again for the help, as I said, it is working now.

crts

08-28-2010 10:39 PM

Hi,

I did not mean to be harsh. I just wanted to point out how I perceived the development of the initial problem.

Quote:

The original data is in spanish ...

Yes, but you also stated in that post that there are lines that are to be excluded from the output. And that does qualify as an altered scenario. The translation alone, of course, does not.

Quote:

I said "the data before what I need to extract is"

Now that I do understand. But your exact words were:

Quote:

the data above...

I must admit I couldn't make heads or tails of it. I thought that by 'above' you were referring to the data you presented in a post 'above'.
So when you said that the command did not work I assumed that was due to the arrangement of your data. At this point I had already double-checked the command. Since I did not see a Windows logo on the left side of your posts, the possibility of a DOS-formatted file (good work on catching that, by the way) did not cross my mind.
Anyway, glad I could help.

P.S.: A slightly shorter way to convert DOS to UNIX files

Code:

sed 's/.$//' dos.file > unix.file
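
One caveat with this shortcut: 's/.$//' removes the last character of every line whether or not it is a carriage return, so it is only safe when every line really ends in CR LF. A minimal sketch of the difference, using a made-up file with mixed line endings and "tr -d" as one possible alternative:

```shell
# Made-up two-line sample: the first line ends in CRLF, the second in plain LF.
tmp=$(mktemp -d)
printf 'name: john\r\nphone: 123456\n' > "$tmp/mixed.txt"

# 's/.$//' drops the last character of every line, CR or not,
# so the second line loses its final digit.
blind=$(sed 's/.$//' "$tmp/mixed.txt")

# Deleting only carriage returns leaves lines that are already Unix untouched.
# Note: tr removes every CR in the file, not just those at line ends.
safe=$(tr -d '\r' < "$tmp/mixed.txt")

printf '%s\n' "$blind" "$safe"
```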

Frakk

08-29-2010 12:00 AM

Sir, I have to say you are correct :D . I didn't use 'before' as I thought and my wording wasn't all that good.

Thanks for the help and the extra tip, I really appreciate it.
Cheers :)

grail

08-29-2010 12:14 AM

Well assuming the start and endpoints will always be there:

Code:

awk -F: '/start/,/end/{printf $2","}/end/' file

crts

08-29-2010 12:42 AM

Hi,

I tried

Code:

awk -F: '/name/,/phone/{printf $2","}/phone/' file # gawk 3.1.6

with this data

Code:

something
at the 
start

name: JohnA
address: TheStreet 123A
phone: TheCityA

...
in the middle
of something
...
name: JohnB
address: TheStreet 123B
phone: TheCityB

name: JohnC
phone: TheCityC

... and to the
end

The output was:

Code:

JohnA, TheStreet 123A, TheCityA,phone: TheCityA
JohnB, TheStreet 123B, TheCityB,phone: TheCityB
JohnC, TheCityC,phone: TheCityC
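
The trailing "phone: ..." on each line comes from the bare /phone/ pattern: a pattern without an action prints the whole line, and that print is also what supplies the newline. A minimal sketch of one way around it, assuming every record starts with a "name:" line and ends with a "phone:" line with no junk in between (the sample file is a shortened, made-up version of the data above):

```shell
# Shortened, made-up version of the test data.
tmp=$(mktemp -d)
printf '%s\n' 'something' 'name: JohnA' 'address: TheStreet 123A' 'phone: TheCityA' \
  '...' 'name: JohnC' 'phone: TheCityC' 'end' > "$tmp/file"

# Split on the colon plus any following spaces, so $2 has no leading space.
# Inside the name..phone range, end the record with a newline on the phone
# line and with a comma on every other line.
result=$(awk -F': *' '/^name:/,/^phone:/ { printf "%s%s", $2, (/^phone:/ ? "\n" : ",") }' "$tmp/file")

printf '%s\n' "$result"
```

As in the original command, a record with a missing address simply has fewer columns, and a record without a "phone:" line would run on into the next one.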

All times are GMT -5. The time now is 01:24 PM.
