LinuxQuestions.org - How to extract particular text in a text file

How to extract particular text in a text file

Hello all,
I am new to shell scripting. I have a file from which I need to extract only a certain section.
The file looks like this:

------------------------------------------------------------------
# args=-index BRACHYPODIUM-FASTA_9.fsa -seed 30 -minlenltr 100 -maxlenltr 2000 -mindistltr 2000 -maxdistltr 31000 -xdrop 5 -mat 2 -mis -2 -ins -3 -del -3 -similar 70.0 -overlaps all -mintsd 5 -maxtsd 10 -motif tgca -motifmis 0 -vic 60 -longoutput -v -out BRACHYPODIUM-FASTA_9-LTR.fsa
# user defined options and values:
# verbosemode: On
# indexname: BRACHYPODIUM-FASTA_9.fsa
# outputfile: BRACHYPODIUM-FASTA_9-LTR.fsa
# xdropbelowscore: 5
# similaritythreshold: 70.00
# minseedlength: 30
# matchscore: 2
# mismatchscore: -2
# insertionscore: -3
# s(ret) e(ret) l(ret) s(lLTR) e(lLTR) l(lLTR) TSD l(TSD) m(lLTR) s(rLTR) e(rLTR) l(rLTR) TSD l(TSD) m(rLTR) sim(LTRs) seq-nr
# where:
# s = starting position
# e = ending position
# l = length
# m = motif
# ret = LTR-retrotransposon
# lLTR = left LTR
# rLTR = right LTR
# TSD = target site duplication
# sim = similarity
# seq-nr = sequence number
6698717 6726836 28120 6698717 6700002 1286 gtattt 6 tg..ca 6725526 6726836 1311 gtattt 6 tg..ca 92.91 0
11976957 11982399 5443 11976957 11977421 465 tcgat 5 tg..ca 11981926 11982399 474 tcgat 5 tg..ca 95.57 0
10519667 10531374 11708 10519667 10519811 145 tgtaat 6 tg..ca 10531262 10531374 113 tgtaat 6 tg..ca 73.10 0
6217747 6240561 22815 6217747 6218145 399 tgtca 5 tg..ca 6240095 6240561 467 tgtca 5 tg..ca 80.51 0
4433924 4439787 5864 4433924 4434036 113 ctaaaa 6 tg..ca 4439674 4439787 114 ctaaaa 6 tg..ca 81.58 0
14572614 14600318 27705 14572614 14573462 849 ctagga 6 tg..ca 14599416 14600318 903 ctagga 6 tg..ca 86.38 0

-------------------------------------------------------------------

I want to extract all the lines of text after the line
# seq-nr = sequence number

I tried scanning each line and writing a regular expression to extract only the lines that start with a number and write them to a file ...
I was wondering if someone could give me some direction on the extraction.

Regards,
Maverick

Maverick,

All you need is grep and awk:

Code:

grep eq-nr yourfile |awk -F\= '{print $NF}' > newfile

Or, if the file really contains the words "sequence number" and you want only the numbers that follow them, use this:

Code:

grep eq-nr yourfile |awk -F"sequence number" '{print $NF}' > newfile

Why not simply...

Code:

awk '/^[0-9]/' infile > outfile
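
If the goal is literally "everything after the `# seq-nr = sequence number` line" rather than "every line that starts with a digit", a flag-based awk one-liner is one way to sketch it. The temporary file and the shortened rows below are stand-ins made up for the illustration, not the real program output.

```shell
# Small stand-in for the real file (rows shortened; contents are illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
# s(ret) e(ret) l(ret) sim(LTRs) seq-nr
# sim = similarity
# seq-nr = sequence number
6698717 6726836 28120 92.91 0
11976957 11982399 5443 95.57 0
EOF

# "found" is unset (false) until the marker line has been seen.
# The bare pattern "found" is tested before the flag is set on the marker line,
# so the marker itself is not printed, only the lines that follow it.
after_marker=$(awk 'found; /^# seq-nr = sequence number/ { found = 1 }' "$sample")
printf '%s\n' "$after_marker"
rm -f "$sample"
```

Anchoring the pattern with `^# seq-nr = ` keeps it from matching the column-header line, which also ends in "seq-nr".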

It would be easier for us to read your message if you included the data in [ code ] blocks. Besides being easier to read, that preserves extra spaces, etc., which may be important in finding matches.
Also, you should have included whatever follows the block of data you are interested in. That can be important in matching the end of the range.

Code:

# rLTR = right LTR
# TSD = target site duplication
# sim = similarity
# seq-nr = sequence number
6698717 6726836 28120 6698717 6700002 1286 gtattt 6 tg..ca 6725526 6726836 1311 gtattt 6 tg..ca 92.91 0
11976957 11982399 5443 11976957 11977421 465 tcgat 5 tg..ca 11981926 11982399 474 tcgat 5 tg..ca 95.57 0
10519667 10531374 11708 10519667 10519811 145 tgtaat 6 tg..ca 10531262 10531374 113 tgtaat 6 tg..ca 73.10 0
6217747 6240561 22815 6217747 6218145 399 tgtca 5 tg..ca 6240095 6240561 467 tgtca 5 tg..ca 80.51 0
4433924 4439787 5864 4433924 4434036 113 ctaaaa 6 tg..ca 4439674 4439787 114 ctaaaa 6 tg..ca 81.58 0
14572614 14600318 27705 14572614 14573462 849 ctagga 6 tg..ca 14599416 14600318 903 ctagga 6 tg..ca 86.38 0
# dummy line after

Code:

sed -n '/seq-nr/,/^#/{
                  /^#/!p }' temp

6698717 6726836 28120 6698717 6700002 1286 gtattt 6 tg..ca 6725526 6726836 1311 gtattt 6 tg..ca 92.91 0
11976957 11982399 5443 11976957 11977421 465 tcgat 5 tg..ca 11981926 11982399 474 tcgat 5 tg..ca 95.57 0
10519667 10531374 11708 10519667 10519811 145 tgtaat 6 tg..ca 10531262 10531374 113 tgtaat 6 tg..ca 73.10 0
6217747 6240561 22815 6217747 6218145 399 tgtca 5 tg..ca 6240095 6240561 467 tgtca 5 tg..ca 80.51 0
4433924 4439787 5864 4433924 4434036 113 ctaaaa 6 tg..ca 4439674 4439787 114 ctaaaa 6 tg..ca 81.58 0
14572614 14600318 27705 14572614 14573462 849 ctagga 6 tg..ca 14599416 14600318 903 ctagga 6 tg..ca 86.38 0

I assumed that the line after the data would start with a #. So I selected the range "/seq-nr/,/^#/" and then printed a subrange of that for lines that don't start with a #.
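
If only the start marker can be trusted and nothing is known about what follows the data, one alternative sketch is to delete everything from line 1 through the marker and let sed print the rest. The sample rows are shortened stand-ins, and the sketch assumes the marker is not the very first line of the file.

```shell
# Shortened stand-in for the real file (illustrative contents only).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# s(ret) e(ret) l(ret) sim(LTRs) seq-nr
# where:
# seq-nr = sequence number
6698717 6726836 28120 92.91 0
4433924 4439787 5864 81.58 0
EOF

# Delete lines 1 through the first line matching the marker;
# without -n, sed prints every remaining line by default.
rest=$(sed '1,/^# seq-nr = /d' "$tmp")
printf '%s\n' "$rest"
rm -f "$tmp"
```

Unlike the range-with-subrange version, this keeps any later lines that start with `#`, so which one fits depends on what actually comes after the data.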

Compare this to "/sbin/lspci -v | sed -n '/Ethernet/,/^$/p'", which extracts the information on your Ethernet devices. It takes advantage of the fact that the records are separated by blank lines. Too bad your program doesn't do that.
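
To see the blank-line range at work without depending on real hardware, here is a small sketch that runs the same sed expression over made-up, lspci-like text. The device names and fields are invented for the illustration and are not real lspci output.

```shell
# Invented records separated by blank lines, loosely shaped like "lspci -v" output.
devices=$(mktemp)
cat > "$devices" <<'EOF'
00:01.0 VGA compatible controller: Example Graphics
  Flags: bus master

00:02.0 Ethernet controller: Example NIC
  Flags: bus master
  Kernel driver in use: example

00:03.0 Audio device: Example Audio
  Flags: fast devsel
EOF

# Print from the first line containing "Ethernet" through the next empty line.
# Command substitution drops the trailing blank line from the captured text.
ethernet=$(sed -n '/Ethernet/,/^$/p' "$devices")
printf '%s\n' "$ethernet"
rm -f "$devices"
```

The empty line works as an end-of-record marker here, which is exactly what the LTR output lacks.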