LinuxQuestions.org
Old 07-22-2008, 01:37 AM   #1
maverick_cat
LQ Newbie
 
Registered: Jul 2008
Posts: 1

Rep: Reputation: 0
How to extract particular text in a text file


Hello all,
I am new to shell scripting. I have a file from which I need to extract only a certain section.
Here is my file:

------------------------------------------------------------------
# args=-index BRACHYPODIUM-FASTA_9.fsa -seed 30 -minlenltr 100 -maxlenltr 2000 -mindistltr 2000 -maxdistltr 31000 -xdrop 5 -mat 2 -mis -2 -ins -3 -del -3 -similar 70.0 -overlaps all -mintsd 5 -maxtsd 10 -motif tgca -motifmis 0 -vic 60 -longoutput -v -out BRACHYPODIUM-FASTA_9-LTR.fsa
# user defined options and values:
# verbosemode: On
# indexname: BRACHYPODIUM-FASTA_9.fsa
# outputfile: BRACHYPODIUM-FASTA_9-LTR.fsa
# xdropbelowscore: 5
# similaritythreshold: 70.00
# minseedlength: 30
# matchscore: 2
# mismatchscore: -2
# insertionscore: -3
# s(ret) e(ret) l(ret) s(lLTR) e(lLTR) l(lLTR) TSD l(TSD) m(lLTR) s(rLTR) e(rLTR) l(rLTR) TSD l(TSD) m(rLTR) sim(LTRs) seq-nr
# where:
# s = starting position
# e = ending position
# l = length
# m = motif
# ret = LTR-retrotransposon
# lLTR = left LTR
# rLTR = right LTR
# TSD = target site duplication
# sim = similarity
# seq-nr = sequence number
6698717 6726836 28120 6698717 6700002 1286 gtattt 6 tg..ca 6725526 6726836 1311 gtattt 6 tg..ca 92.91 0
11976957 11982399 5443 11976957 11977421 465 tcgat 5 tg..ca 11981926 11982399 474 tcgat 5 tg..ca 95.57 0
10519667 10531374 11708 10519667 10519811 145 tgtaat 6 tg..ca 10531262 10531374 113 tgtaat 6 tg..ca 73.10 0
6217747 6240561 22815 6217747 6218145 399 tgtca 5 tg..ca 6240095 6240561 467 tgtca 5 tg..ca 80.51 0
4433924 4439787 5864 4433924 4434036 113 ctaaaa 6 tg..ca 4439674 4439787 114 ctaaaa 6 tg..ca 81.58 0
14572614 14600318 27705 14572614 14573462 849 ctagga 6 tg..ca 14599416 14600318 903 ctagga 6 tg..ca 86.38 0

-------------------------------------------------------------------

I want to extract all the lines of text after the line
# seq-nr = sequence number

I tried scanning each line and writing a regular expression to extract only the lines that start with a number into a file...
I was wondering if someone could give me some direction on the extraction.

Regards,
Maverick
 
Old 07-22-2008, 03:03 AM   #2
ZAMO
Member
 
Registered: Mar 2007
Distribution: Redhat &CentOS
Posts: 579

Rep: Reputation: 30
Maverick,


All you need is grep and awk:

Code:
grep eq-nr yourfile  |awk -F\= '{print $NF}' > newfile
Or, if the file really contains the words "sequence number" and you want only what comes after them, use this:

Code:
grep eq-nr yourfile  |awk -F"sequence number" '{print $NF}' > newfile
 
Old 07-22-2008, 03:33 AM   #3
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1957
Why not simply...
Code:
awk '/^[0-9]/' infile > outfile
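For anyone who wants to try the idea on a small file first, here is a minimal sketch of the same "keep lines that start with a digit" filter, written with grep instead of awk. The sample lines and the variable names (infile, data_lines) are made up for illustration and are shortened from the data in post #1.

```shell
# Hypothetical sample: two comment lines followed by two data lines.
infile=$(mktemp)
printf '%s\n' \
  '# sim = similarity' \
  '# seq-nr = sequence number' \
  '6698717 6726836 28120' \
  '11976957 11982399 5443' > "$infile"

# Keep only the lines whose first character is a digit.
data_lines=$(grep '^[0-9]' "$infile")
printf '%s\n' "$data_lines"

rm -f "$infile"
```

This relies on every data line starting with a digit and every header line starting with #, which is true of the file shown above but is still an assumption about the rest of the output.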
 
Old 07-22-2008, 03:44 AM   #4
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 655
It would be easier for us to read your message if you include the data in [ code ] blocks. Besides being easier to read, you will preserve extra spaces, etc. which may be important in finding matches.
Also, you should have included information after the block of data you are interested in. That can be important in matching the end of the range.

Code:
# rLTR = right LTR
# TSD = target site duplication
# sim = similarity
# seq-nr = sequence number
6698717 6726836 28120 6698717 6700002 1286 gtattt 6 tg..ca 6725526 6726836 1311 gtattt 6 tg..ca 92.91 0
11976957 11982399 5443 11976957 11977421 465 tcgat 5 tg..ca 11981926 11982399 474 tcgat 5 tg..ca 95.57 0
10519667 10531374 11708 10519667 10519811 145 tgtaat 6 tg..ca 10531262 10531374 113 tgtaat 6 tg..ca 73.10 0
6217747 6240561 22815 6217747 6218145 399 tgtca 5 tg..ca 6240095 6240561 467 tgtca 5 tg..ca 80.51 0
4433924 4439787 5864 4433924 4434036 113 ctaaaa 6 tg..ca 4439674 4439787 114 ctaaaa 6 tg..ca 81.58 0
14572614 14600318 27705 14572614 14573462 849 ctagga 6 tg..ca 14599416 14600318 903 ctagga 6 tg..ca 86.38 0
# dummy line after
Code:
sed -n '/seq-nr/,/^#/{
                  /^#/!p }' temp
6698717 6726836 28120 6698717 6700002 1286 gtattt 6 tg..ca 6725526 6726836 1311 gtattt 6 tg..ca 92.91 0
11976957 11982399 5443 11976957 11977421 465 tcgat 5 tg..ca 11981926 11982399 474 tcgat 5 tg..ca 95.57 0
10519667 10531374 11708 10519667 10519811 145 tgtaat 6 tg..ca 10531262 10531374 113 tgtaat 6 tg..ca 73.10 0
6217747 6240561 22815 6217747 6218145 399 tgtca 5 tg..ca 6240095 6240561 467 tgtca 5 tg..ca 80.51 0
4433924 4439787 5864 4433924 4434036 113 ctaaaa 6 tg..ca 4439674 4439787 114 ctaaaa 6 tg..ca 81.58 0
14572614 14600318 27705 14572614 14573462 849 ctagga 6 tg..ca 14599416 14600318 903 ctagga 6 tg..ca 86.38 0
I assumed that the line after the data would start with a #. So I selected the range "/seq-nr/,/^#/" and then printed a subrange of that for lines that don't start with a #.
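If you would rather not depend on what follows the data, one possible alternative is an awk flag that switches on once the marker line has been seen. This is only a sketch: the sample file is a shortened, made-up version of the data above, and the names sample, found and after_marker are mine.

```shell
# Hypothetical sample file: header comments, the marker line, then data.
sample=$(mktemp)
printf '%s\n' \
  '# sim(LTRs) seq-nr' \
  '# where:' \
  '# seq-nr = sequence number' \
  '6698717 6726836 28120' \
  '11976957 11982399 5443' > "$sample"

# The bare pattern "found" prints the current line once the flag is set.
# The flag is set only after the marker line has been tested against it,
# so the marker line itself is not printed.
after_marker=$(awk 'found; /^# seq-nr = sequence number/ { found = 1 }' "$sample")
printf '%s\n' "$after_marker"

rm -f "$sample"
```

Unlike the sed range, this prints everything after the marker up to the end of the file, including any later lines that start with a #, so pick whichever behaviour suits the real output.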

Compare this to "/sbin/lspci -v | sed -n '/Ethernet/,/^$/p'", which will extract the information on your Ethernet devices. This takes advantage of the fact that the records are separated by blank lines. Too bad your program doesn't do that.
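To see the blank-line range without needing lspci, here is a small sketch that runs the same sed expression on a made-up file. The device lines are invented to look roughly like "lspci -v" records; they are not real output.

```shell
# Hypothetical sample: three records separated by blank lines.
records=$(mktemp)
printf '%s\n' \
  '00:1f.2 SATA controller: Example SATA' \
  '    Flags: bus master' \
  '' \
  '02:00.0 Ethernet controller: Example NIC' \
  '    Flags: bus master, fast devsel' \
  '' \
  '03:00.0 VGA compatible controller: Example GPU' > "$records"

# Print from the first line matching "Ethernet" through the next empty line.
# The command substitution drops the trailing blank line from the variable.
ethernet_record=$(sed -n '/Ethernet/,/^$/p' "$records")
printf '%s\n' "$ethernet_record"

rm -f "$records"
```

The range ends at the first empty line after the match, so only the Ethernet record is printed and the other two are skipped.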
 
  

