Old 08-28-2010, 06:06 PM   #1
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Rep: Reputation: 2
extract text from files


Hi,

I have many files in a folder from which I need to extract some content. These are basically text files which have individual lines like (e.g.):
name: john
address: whatever
phone: 123456

Some caveats

1. Sometimes a line might be missing.
name: johnn
phone: 123456

2. Lines are not at the same line numbers across the files.


I did try some things with awk based on Google searches, but I couldn't extract the data from each file into a single line (this is the ultimate goal):
john,whatever,123456

I don't have much knowledge beyond having put some bash scripts together for backup jobs, so I am open to installing anything that could help pull this off.

Any help will be greatly appreciated.
Regards.
 
Old 08-28-2010, 06:24 PM   #2
xeleema
Member
 
Registered: Aug 2005
Location: D.i.t.h.o, Texas
Distribution: Slackware 13.x, rhel3/5, Solaris 8-10(sparc), HP-UX 11.x (pa-risc)
Posts: 988
Blog Entries: 4

Rep: Reputation: 254
Greetingz!

Sounds like you might want to use "egrep". I would suggest first reading the man page for the command; however, the following may help:

egrep -i "name:|phone:|address:" /path/to/files/*

Each pattern you want to find is separated by the pipe ("|") symbol, and the entire set of patterns must be wrapped in the double-quote character.
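
When run against more than one file, grep will also prefix each match with the file name, so the output would look something like this (a made-up sample, assuming files named form_1, form_2, and so on):
Code:
/path/to/files/form_1:name: john
/path/to/files/form_1:address: whatever
/path/to/files/form_1:phone: 123456
/path/to/files/form_2:name: jane
/path/to/files/form_2:phone: 654321
That file-name prefix comes in handy if you later want to group the matches back into one line per file.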
 
Old 08-28-2010, 06:35 PM   #3
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
Wow, that was a quick reply!!

It does extract the data, but every item is still dumped on its own line, and I need to combine/chain them into one line per original file (a CSV file, actually).

That is the part where I got stuck :S

TIA

Last edited by Frakk; 08-28-2010 at 06:38 PM. Reason: typos
 
Old 08-28-2010, 06:49 PM   #4
xeleema
Member
 
Registered: Aug 2005
Location: D.i.t.h.o, Texas
Distribution: Slackware 13.x, rhel3/5, Solaris 8-10(sparc), HP-UX 11.x (pa-risc)
Posts: 988
Blog Entries: 4

Rep: Reputation: 254

Ah!

Okay, well then I would pipe the output from the "egrep" command I mentioned earlier to awk (and maybe sort, too):

egrep -i "name:|phone:|address:" /path/to/files/* |\
awk 'NR == 1 { line = sq $0 sq } { line = line "," sq $0 sq } END { print line }'


Now, for some darn reason, this will repeat the first field twice. I have *no* idea why, so if anyone else chimes in I'd really appreciate it.
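
On second thought, a guess at the culprit: on the first input line both the NR == 1 rule and the unconditional rule run, so $0 gets appended to itself, and the sq variable is never defined, so it just expands to an empty string. A sketch that avoids both (same piped input assumed, not tested against your real files):
Code:
egrep -i "name:|phone:|address:" /path/to/files/* |\
awk 'NR == 1 { line = $0; next } { line = line "," $0 } END { print line }'
Note that this still joins every matching line from every file into one long record; it does not start a new line per file.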
 
Old 08-28-2010, 06:49 PM   #5
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Hi,

Try this
sed -n ':mark /phone:/ ! {N;b mark}; /phone:/ {s/name:[ ]*//;s/phone:[ ]*/,/;s/address:[ ]*/,/;s/\n//g;p}' infile > outfile

This assumes that there is no 'junk' in between the records. If there is only one record per file then the command can be simplified.
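
In case the one-liner looks opaque: it keeps appending input lines to the pattern space until "phone:" shows up, then strips the "name:" label, turns the other two labels into commas, deletes the embedded newlines and prints the joined record. The same thing spread over several lines (GNU sed assumed):
Code:
sed -n '
  :mark
  /phone:/ ! {
    N
    b mark
  }
  /phone:/ {
    s/name:[ ]*//
    s/phone:[ ]*/,/
    s/address:[ ]*/,/
    s/\n//g
    p
  }
' infile > outfile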

Hope this helps
 
Old 08-28-2010, 07:48 PM   #6
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
@crts

Thanks for the suggestion; however, that code is giving me individual lines too and, more importantly, my source files do have garbage before the content I really need, and that is getting dumped too.

Quote:
Originally Posted by xeleema

egrep -i "name:|phone:|address:" /path/to/files/* |\
awk 'NR == 1 { line = sq $0 sq } { line = line "," sq $0 sq } END { print line }'
Not sure if I am doing something wrong, but for this data
Code:
nombre_apellido: John
direccion: TheStreet 123
ciudad: TheCity
I am getting these results (I matched the "fields"; my original example had dummy field names, sorry):
Code:
,ciudad: TheCitytreet 123
Seems like everything is getting piled up. That is when processing a single file; it gets worse when it goes through all of them.

TIA
 
Old 08-28-2010, 08:05 PM   #7
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
... and that is exactly why you should always give a representative example of your data. This was not obvious from your initial post. In fact, your description implied that there is nothing in between.
Code:
sed -n '/nombre_apellido:/ {:mark /ciudad:/ ! {N;b mark}; /ciudad:/ {s/nombre_apellido:[ ]*//;s/ciudad:[ ]*/,/;s/direccion:[ ]*/,/;s/\n//g;p};}' infile > outfile
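The only real change from the earlier version is the outer /nombre_apellido:/ guard: the append loop now only starts on a line that begins a record, and with -n everything before that is simply never printed. Spread out for readability (GNU sed assumed):
Code:
sed -n '
  /nombre_apellido:/ {
    :mark
    /ciudad:/ ! {
      N
      b mark
    }
    s/nombre_apellido:[ ]*//
    s/direccion:[ ]*/,/
    s/ciudad:[ ]*/,/
    s/\n//g
    p
  }
' infile > outfile
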
If this still does not match then provide some representative sample data. I am not going to *guess* what your file might look like.
 
Old 08-28-2010, 08:34 PM   #8
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
My apologies; it didn't seem like it could cause any harm at the time...
Lesson learned.

The results of your new version are much closer:
Code:
John

,TheStreet 123

,TheCity
The empty lines are part of the results.

It probably doesn't matter now, as you resolved that, but the data above what I need to extract is:
userid: 123456
userstatus: 1
usergroup: somegroup

And there are more items at the end which I don't need.
Whether the fields are present or not varies from one file to another.

Thank you very much for the help.
 
Old 08-28-2010, 08:47 PM   #9
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
I just realized the empty lines are caused by the files having Windows line breaks...

So I am using this to convert them into a single Unix-format file:

Code:
awk '{ sub("\r$", ""); print }' form_* > unix/merged.txt
And it does work perfectly with that file.

Thanks a Ton!
Best regards.
 
Old 08-28-2010, 08:48 PM   #10
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
what's it gonna be?

Quote:
Originally Posted by Frakk
... these are basically text files which have individual lines like (e.g.):
name: john
address: whatever
phone: 123456

Some caveats

1. Sometimes a line might be missing.
name: johnn
phone: 123456
Quote:
Originally Posted by Frakk
Not sure if I am doing something wrong, but for this data
Code:
nombre_apellido: John
direccion: TheStreet 123
ciudad: TheCity
Quote:
Originally Posted by Frakk
what I need to extract is
userid: 123456
userstatus: 1
usergroup: somegroup
Please make up your mind first and
Quote:
If this still does not match then provide some representative sample data. I am not going to *guess* what your file might look like.
So far you have provided three different scenarios. I provided two solutions, both of which I tested, and they did work based on your sample data. Your last post suggests that your data is arranged as in your initial post. That is not representative data. We are going in circles right now.
 
Old 08-28-2010, 09:24 PM   #11
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
The original data is in Spanish; I try to translate it to English so the foreign language is out of the way when asking for help in an English-speaking forum.

Quote:
1. Sometimes a line might be missing.
name: johnn
phone: 123456
I meant that sometimes a line might not be present; in that example, address is missing and phone is right after name, just in case someone might think of using line numbers as a reference to identify the data.

Quote:
what I need to extract is
userid: 123456
userstatus: 1
usergroup: somegroup
I didn't say "what I need to extract is" I said "the data before what I need to extract is".
I intended to illustrate what can be found in the lines prior to the ones I need. Maybe I chose the wrong words...

Quote:
So far you have provided three different scenarios. I provided two solution that I both tested and they did work based on your sample data. Your last post suggests that your data is arranged as in your initial post. That is not representative data. We are going in circles right now.
Maybe that's unnecessarily harsh? Whether the item is called "name" or "nombre_apellido" doesn't really change anything.
If I made a mistake about the contents at the beginning of the documents (and I already apologized), it was due to the fact that I don't know about this, which is why I need help in the first place.

Thanks again for the help, as I said, it is working now.
 
Old 08-28-2010, 10:39 PM   #12
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Hi,

I did not mean to be harsh. I just wanted to point out how I perceived the development of the initial problem.
Quote:
The original data is in spanish ...
Yes, but you also stated in that post that there are lines that are to be excluded from the output. And that does qualify as an altered scenario. The translation alone, of course, does not.
Quote:
I said "the data before what I need to extract is"
That I do understand now. But your exact words were:
Quote:
the data above...
I must admit I couldn't make heads or tails of it. I thought that by 'above' you were referring to the data you presented in a post 'above'.
So when you said that the command did not work, I assumed that it was due to the arrangement of your data. At that point I had already double-checked the command. Since I did not see a Windows logo on the left side of your posts, the possibility of a DOS-formatted file (good work on catching that, by the way) did not cross my mind.
Anyway, glad I could help.

P.S.: A slightly shorter way to convert DOS files to UNIX format:
Code:
sed 's/.$//' dos.file > unix.file
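
One caveat: that strips the last character of every line, so it assumes every line really does end in a carriage return. A variant that only removes a CR when one is present (GNU sed understands \r; other seds may need a literal Ctrl-V Ctrl-M typed in its place):
Code:
sed 's/\r$//' dos.file > unix.file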
 
Old 08-29-2010, 12:00 AM   #13
Frakk
Member
 
Registered: Oct 2007
Posts: 33

Original Poster
Rep: Reputation: 2
Sir, I have to say you are correct. I didn't use 'before' as I thought, and my wording wasn't all that good.

Thanks for the help and the extra tip, I really appreciate it.
Cheers
 
Old 08-29-2010, 12:14 AM   #14
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
Well, assuming the start and end points will always be there:
Code:
awk -F: '/start/,/end/{printf $2","}/end/' file
 
Old 08-29-2010, 12:42 AM   #15
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Hi,

I tried
Code:
awk -F: '/name/,/phone/{printf $2","}/phone/' file   # gawk 3.1.6
with this data
Code:
something
at the 
start

name: JohnA
address: TheStreet 123A
phone: TheCityA

...
in the middle
of something
...
name: JohnB
address: TheStreet 123B
phone: TheCityB

name: JohnC
phone: TheCityC

... and to the
end
The output was:
Code:
JohnA, TheStreet 123A, TheCityA,phone: TheCityA
JohnB, TheStreet 123B, TheCityB,phone: TheCityB
JohnC, TheCityC,phone: TheCityC
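
The trailing "phone: ..." on each line comes from the bare /phone/ pattern at the end: with no action attached it prints the whole matching line (it seems to be there mainly to terminate the record with a newline). A variant that prints only the newline instead, with the caveat that the leading spaces and the trailing comma are still there (a sketch, not tested against the real files):
Code:
awk -F: '/name/,/phone/{printf "%s,",$2} /phone/{print ""}' file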
 
  

