[SOLVED] commands to select range of information

kumar23kan · 10-11-2015, 09:40 PM

Hi,
I have a text file with several numbers and characters separated by both spaces and tabs. I have to select a line starting with one character; in that line I have to look for numbers below, say, 3000, beyond which the entire line has to be deleted until it reaches the next line. Can somebody help, please?

berndbausch · 10-11-2015, 10:07 PM

awk, grep, cut, sed are typically the commands you would use. Your description doesn't help me understand your line structure. Can you provide a few sample lines?

syg00 · 10-11-2015, 10:14 PM

Sounds like homework.

kumar23kan · 10-11-2015, 10:41 PM

Dear syg00, I am a biologist and this is not my homework. I am trying to organize information.

Dear berndbausch,
The content of the text file is as follows:

rd15256 1.1361 0.3236 0.0692 81.3499 28.2168 5.3160 1006 TG 1006 RG2 rim55501 (3,-1) Ringer sample
rd15256 DOS= 2.255E+00 litl(DOS)= -2.855E+01 liter123= 1.407E+01 -2.105E+01 -2.157E+01 square.= 0.02
rd15256 gen_vecs_uvz(segmental): -0.0047 -0.0009 0.0112 0.0112 -0.0064 0.0050 0.0071 0.0095 0.0045
! Specimen rd: sample hills ABR ABR_orth DOS litl(DOS)
ring+ 1.1359 0.3236 0.0696 81.3296 28.2136 5.3503 1 2.2673 -3.04321E+01
hills+ 1.1358 0.3235 0.0698 81.3211 28.2122 5.3645 2 2.2801 -3.20847E+01
hills+ 1.1358 0.3235 0.0700 81.3127 28.2109 5.3787 3 2.2981 -3.43627E+01
hills+ 1.1357 0.3235 0.0702 81.3043 28.2096 5.3928 4 2.3214 -3.73644E+01
hills+ 1.1356 0.3235 0.0704 81.2959 28.2083 5.4069 5 2.3503 -4.12124E+01
hills+ 1.1355 0.3235 0.0706 81.2875 28.2070 5.4209 6 2.3850 -4.60633E+01
hills+ 1.1355 0.3235 0.0707 81.2792 28.2056 5.4349 7 2.4256 -5.21206E+01
hills+ 1.1354 0.3234 0.0709 81.2709 28.2043 5.4487 8 2.4725 -5.96533E+01
hills+ 1.1353 0.3234 0.0711 81.2627 28.2030 5.4625 9 2.5256 -6.90238E+01
hills+ 1.1352 0.3234 0.0713 81.2545 28.2017 5.4762 10 2.5851 -8.07300E+01
hills+ 1.1352 0.3234 0.0715 81.2464 28.2005 5.4897 11 2.6510 -9.54735E+01
hills+ 1.1351 0.3234 0.0716 81.2384 28.1992 5.5031 12 2.7235 -1.14272E+02
hills+ 1.1350 0.3234 0.0718 81.2304 28.1979 5.5164 13 2.8027 -1.38659E+02
hills+ 1.1349 0.3234 0.0720 81.2225 28.1967 5.5296 14 2.8884 -1.71049E+02
hills+ 1.1349 0.3233 0.0721 81.2147 28.1954 5.5426 15 2.9808 -2.15472E+02
hills+ 1.1348 0.3233 0.0723 81.2070 28.1942 5.5554 16 3.0799 -2.79198E+02

rd15257 1.1398 0.3159 0.0582 81.7857 27.5442 4.4724 1006 TG 1006 SD rim55501 (3,-1) Ringer sample
rd15257 DOS= 1.273E+00 litl(DOS)= -4.041E+00 liter123= 9.115E+00 -6.256E+00 -6.900E+00 square.= 0.10
rd15257 gen_vecs_uvz(segmental): -0.0009 0.0104 0.0052 -0.0074 -0.0044 0.0084 0.0119 -0.0020 0.0085
! Specimen rd: sample hills ABR ABR_orth DOS litl(DOS)
ring+ 1.1398 0.3163 0.0584 81.7800 27.5805 4.4883 1 1.2806 -4.10694E+00
hills+ 1.1398 0.3165 0.0585 81.7772 27.5985 4.4962 2 1.2899 -4.32799E+00
hills+ 1.1398 0.3167 0.0586 81.7744 27.6164 4.5041 3 1.3028 -4.68175E+00
hills+ 1.1398 0.3169 0.0587 81.7715 27.6343 4.5120 4 1.3195 -5.17392E+00
hills+ 1.1397 0.3171 0.0588 81.7687 27.6522 4.5198 5 1.3396 -5.80978E+00
hills+ 1.1397 0.3173 0.0589 81.7659 27.6700 4.5276 6 1.3632 -6.59338E+00
hills+ 1.1397 0.3175 0.0590 81.7631 27.6876 4.5354 7 1.3900 -7.52640E+00
hills+ 1.1397 0.3177 0.0591 81.7602 27.7052 4.5431 8 1.4197 -8.60637E+00
hills+ 1.1397 0.3179 0.0592 81.7574 27.7227 4.5508 9 1.4520 -9.82422E+00
hills+ 1.1396 0.3181 0.0593 81.7546 27.7400 4.5584 10 1.4867 -1.11608E+01
hills+ 1.1396 0.3183 0.0594 81.7518 27.7572 4.5660 11 1.5233 -1.25820E+01
hills+ 1.1396 0.3185 0.0595 81.7490 27.7743 4.5735 12 1.5615 -1.40319E+01
hills+ 1.1396 0.3187 0.0596 81.7461 27.7913 4.5810 13 1.6008 -1.54239E+01
hills+ 1.1396 0.3189 0.0597 81.7433 27.8081 4.5884 14 1.6409 -1.66272E+01
hills+ 1.1396 0.3191 0.0598 81.7405 27.8248 4.5957 15 1.6815 -1.74496E+01
hills+ 1.1395 0.3193 0.0599 81.7376 27.8414 4.6030 16 1.7225 -1.76129E+01
hills+ 1.1395 0.3195 0.0600 81.7348 27.8578 4.6102 17 1.7639 -1.67202E+01

This goes on for several thousand lines. I can curate it manually, but I think using Linux commands might save a lot of time.

syg00 · 10-11-2015, 10:56 PM

Given that input, what is the expected output? As stated, your initial post is so generic as to be meaningless.

kumar23kan · 10-11-2015, 11:10 PM

The first line contains 1006 as the eighth character; I want to remove all data which exceed 3000.

Beryllos · 10-11-2015, 11:58 PM

Quote:

Originally Posted by kumar23kan

The first line contains 1006 as the eighth character; I want to remove all data which exceed 3000.

You still haven't clearly described the problem. It looks like there are blocks of 20 or so lines, and your remark suggests that you want to filter blocks based on the eighth item of the first line of each block.

This could be done by a bash script with commands like read (to read lines), cut (to select one item from the line), a conditional statement like if [ $item -gt 3000 ], and echo (to write the lines you need to save). Put the appropriate code in a while loop to process blocks until you reach the end of the file, and inside that loop you could use another while loop to copy the desired lines to the output file until the next block is detected.
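
As a rough sketch of the loop described above, and not a tested solution for the real file: the function name filter_blocks is made up, the blocks are assumed to be separated by blank lines, lines are assumed to have no leading whitespace, and the eighth item is assumed to be a plain integer such as 1006. Because the file mixes spaces and tabs, the sketch runs tr before cut, and it uses printf rather than echo to copy lines so that backslashes pass through unchanged.

```shell
#!/bin/bash
# filter_blocks is a hypothetical name, not something from the thread.
# Copies blank-line-separated blocks from stdin to stdout, skipping any block
# whose first line has an 8th item greater than the limit given as $1.
filter_blocks() (
    limit=$1
    keep=1     # 1 while the current block is being copied
    first=1    # 1 when the next non-blank line starts a new block
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -z "$line" ]; then
            # A blank line ends the block; copy it only if the block was kept.
            if [ "$keep" -eq 1 ]; then echo; fi
            first=1
            continue
        fi
        if [ "$first" -eq 1 ]; then
            first=0
            keep=1
            # Turn tabs into spaces, squeeze runs of spaces, then cut item 8.
            item=$(printf '%s\n' "$line" | tr '\t' ' ' | tr -s ' ' | cut -d ' ' -f 8)
            case $item in
                ''|*[!0-9]*) ;;   # missing or not an integer: keep the block
                *) if [ "$item" -gt "$limit" ]; then keep=0; fi ;;
            esac
        fi
        if [ "$keep" -eq 1 ]; then printf '%s\n' "$line"; fi
    done
)
```

It would be called along the lines of filter_blocks 3000 < nameofyourdatafile > reduceddatafile. For several thousand lines this starts three small processes per block, so it is likely to be noticeably slower than a single awk command.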

Sardog · 10-12-2015, 12:43 AM

You might have better luck parsing this file if you wrote a python script. It is worth the effort.

syg00 · 10-12-2015, 01:31 AM

Quote:

Originally Posted by kumar23kan

The first line contains 1006 as the eighth character; I want to remove all data which exceed 3000.

No, the eighth field contains 1006.
If it is greater than 3000, do you want to delete that entire line and all lines down to the next blank line? Your terminology is just not logical.

kumar23kan · 10-12-2015, 01:42 AM

Yes, if the eighth field exceeds 3000, I would like to delete that line and the following lines until it reaches the second group. I am not experienced in writing Python scripts; I am just a beginner.

pan64 · 10-12-2015, 01:47 AM

If not Python, you can use awk, Perl or another language. Which do you prefer? (I mean, which one have you already tried, and which can you handle most easily?)

berndbausch · 10-12-2015, 01:52 AM

To remove all lines whose eighth field is greater than 3000, the following awk program would be a solution:

Code:

# $8 + 0 forces a numeric comparison, so lines whose 8th field is text (e.g. DOS) are kept
awk '$8 + 0 <= 3000 { print }' nameofyourdatafile

awk programs are a series of condition-action pairs. Here, the condition is "field 8 is up to 3000", and the action is obviously to print such a line.
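
A small sketch of the condition-action idea on made-up input (the three sample lines are invented for illustration, loosely modelled on the data posted earlier). It also shows one caveat that seems to apply to that data: in the header line the 8th field is the word DOS, which is not a number, so awk compares it as a string and it does not count as "up to 3000". Adding 0 to the field forces a numeric comparison.

```shell
# Made-up sample; the 8th field is 1006, 4000 and the word DOS respectively.
sample='rd1 1 2 3 4 5 6 1006 TG
rd2 1 2 3 4 5 6 4000 TG
! Specimen rd: sample hills ABR ABR_orth DOS litl(DOS)'

# Condition "$8 <= 3000", action "print". DOS is compared as a string here
# and sorts after "3000", so the header line is dropped along with rd2.
plain=$(printf '%s\n' "$sample" | awk '$8 <= 3000 { print }')

# Adding 0 forces a numeric comparison; text counts as 0, so the header survives.
numeric=$(printf '%s\n' "$sample" | awk '$8 + 0 <= 3000 { print }')

printf 'plain:\n%s\n\nnumeric:\n%s\n' "$plain" "$numeric"
```

Whether the header lines should be kept depends on what the output is meant to look like, which the thread has not settled at this point.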

If you only want to delete the lines that start with "rd", for example, and whose 8th field is greater than 3000:

Code:

awk '/^rd/ && $8 > 3000 { next  }
                        { print }' nameofyourdatafile

The first condition is "line starts with rd and has an 8th field greater than 3000". The corresponding action is to skip to the next line, in other words, do nothing for the current line.
The second condition is empty and matches all lines.
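
To illustrate how the two rules interact, here is a sketch on invented input. The second awk call is an assumed equivalent written as a single negated condition, relying on awk's default action of printing the line.

```shell
# Made-up sample: two rd lines and one hills+ line with a large 8th field.
sample='rd1 1 2 3 4 5 6 1006
rd2 1 2 3 4 5 6 4000
hills+ 1 2 3 4 5 6 9999'

# Rule 1 skips rd lines above 3000; rule 2 has no condition and prints the rest.
two_rules=$(printf '%s\n' "$sample" | awk '/^rd/ && $8 > 3000 { next } { print }')

# The same filter as one negated condition with the default print action.
one_rule=$(printf '%s\n' "$sample" | awk '!(/^rd/ && $8 > 3000)')

printf '%s\n' "$two_rules"
```

Only the rd2 line is removed. The hills+ line stays even though its 8th field is 9999, because it does not start with rd.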

I have to say though, I still don't understand what you want to achieve.

syg00 · 10-12-2015, 02:46 AM

Slight modification to delete the block to the next null line, which is maybe what the OP wants.

Code:

awk '$8 <= 3000 { print $0"\n" }' RS='' nameofyourdatafile > reduceddatafile
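
Some background on why this works, with a sketch on invented data: setting RS to the empty string puts awk in "paragraph mode", where each blank-line-separated block is one record and newlines also separate fields. $8 is therefore the 8th field of the block's first line, and $0 is the whole block. The temporary file and its three blocks below are made up for illustration.

```shell
# Made-up data: three blank-line-separated blocks written to a temporary file.
datafile=$(mktemp)
cat > "$datafile" <<'EOF'
rd1 1 2 3 4 5 6 1006 TG
hills+ 1 2 3

rd2 1 2 3 4 5 6 4000 TG
hills+ 4 5 6

rd3 1 2 3 4 5 6 2999 TG
hills+ 7 8 9
EOF

# With RS='' each block is one record, so NR counts blocks, not lines.
nblocks=$(awk 'END { print NR }' RS='' "$datafile")

# $8 is the 8th field of the block's first line; a block is kept or dropped whole.
reduced=$(awk '$8 <= 3000 { print $0 "\n" }' RS='' "$datafile")

rm -f "$datafile"
printf 'blocks: %s\n%s\n' "$nblocks" "$reduced"
```

The rd2 block disappears completely, while the rd1 and rd3 blocks are printed with a blank line between them. This relies on every block starting with the line that carries the value to test, as in the posted sample.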

kumar23kan · 10-12-2015, 04:39 AM

syg00 and berndbausch thanks for the help...

grail · 10-12-2015, 06:02 AM

Or if you like:

Code:

awk '$8 <= 3000' RS='' ORS='\n\n' nameofyourdatafile > reduceddatafile
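
For anyone comparing the two variants: a condition without an action prints the record followed by ORS, the output record separator, which is set to two newlines here. The sketch below, on invented data, checks that this gives the same bytes as the explicit print $0"\n" form; od -c is used only so that trailing newlines are included in the comparison.

```shell
# Made-up data: three blank-line-separated blocks in a temporary file.
datafile=$(mktemp)
printf 'rd1 1 2 3 4 5 6 1006\nhills+ 1\n\nrd2 1 2 3 4 5 6 4000\nhills+ 2\n\nrd3 1 2 3 4 5 6 2999\nhills+ 3\n' > "$datafile"

# Explicit action: the record, an extra newline, then the default ORS (one newline).
with_action=$(awk '$8 <= 3000 { print $0 "\n" }' RS='' "$datafile" | od -c)

# No action: awk prints the record followed by ORS, here set to two newlines.
with_ors=$(awk '$8 <= 3000' RS='' ORS='\n\n' "$datafile" | od -c)

rm -f "$datafile"
if [ "$with_action" = "$with_ors" ]; then same=yes; else same=no; fi
printf 'identical output: %s\n' "$same"
```

Both forms leave a blank line after the last block as well, which is usually harmless if the reduced file is processed again in paragraph mode.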