Please help me with these basic commands

pizzarist · 02-10-2017, 04:21 PM

Hello,

I'm struggling with this homework. I tried everything and my final output is always empty. Here are the questions and my commands :

1- Download file http://x.vcf.gz
- $ wget http://x.vcf.gz

2- Uncompress the downloaded file
- $ gunzip x.vcf.gz

3- Extract the lines 40-60 from the uncompressed file and generate an output file
- $ sed -n 40,61p x.vcf > output-file-1

4- Based on the output from step 3, extract the first 4 columns and generate an output file
- $ cut -c 1-5 output-file-1 > output-file-2

5- Based on the output from step 4, remove the lines whose ID (in the 2nd column) starts with the string DEL, and generate an output file
- $ sed -i '/DEL/2d' output-file-2 > output-file-3

6- Show the content of the final output
- $ cat output-file-3

and the result is empty. In fact, when I try to search for the string DEL in output-file-3 or 2 I don't see anything.

Your help would be very much appreciated.

hydrurga · 02-11-2017, 04:58 AM

Step 4. cut -c extracts based on character count, not field (column) count. Look at cut -f instead, with -d if necessary. Alternatively, you could use awk.
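To make the difference concrete, here is a minimal sketch on two made-up, tab-separated lines (the sample data is invented for the demonstration and is not taken from the real file). It assumes the columns are separated by tabs, which is also cut's default field delimiter.

```shell
# Two made-up, tab-separated lines standing in for output-file-1.
sample=$(printf 'chr1\t100\trs1\tA\tG\nchr1\t200\trs2\tC\tT\n')

# -c counts characters: this keeps the first 5 characters of each line.
by_chars=$(printf '%s\n' "$sample" | cut -c 1-5)

# -f counts fields: this keeps the first 4 tab-separated columns.
by_fields=$(printf '%s\n' "$sample" | cut -f 1-4)

printf '%s\n' "$by_chars"
printf '%s\n' "$by_fields"
```

If the columns in your file turn out to be separated by something other than a tab, that is where -d comes in.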

Turbocapitalist · 02-11-2017, 05:29 AM

Also, in step #5, you have added the -i option to sed, and it gets in the way. It makes sed edit output-file-2 in place, so sed writes nothing to standard output. The redirection then has nothing to capture, and output-file-3 ends up empty.

Code:

man sed

Myself, I see the -i option on sed as more of a misfeature and would do it the way you are trying to, by redirecting sed's output to a new file.
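To see the effect in isolation, here is a small sketch on a made-up three-line file. The file names and contents are invented for the demonstration, and it runs in a scratch directory so nothing real is touched.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up input; the second line contains DEL.
printf 'keep1\nDEL_x\nkeep2\n' > in.txt
cp in.txt in-copy.txt

# With -i, sed rewrites in-copy.txt itself and prints nothing,
# so the redirect creates an empty file.
# (-i.bak is used here because that spelling is accepted by both GNU and BSD sed.)
sed -i.bak '/DEL/d' in-copy.txt > empty.txt

# Without -i, the input is left alone and the result goes to stdout.
sed '/DEL/d' in.txt > filtered.txt

wc -c < empty.txt    # size of the redirected output when -i was used
cat filtered.txt
```

The expression here is '/DEL/d'. The '/DEL/2d' form from step 5 is a separate problem: sed rejects it with an "unknown command" error.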

Jjanel · 02-11-2017, 05:37 AM

Hi again! #1 doesn't look like a valid URL! Look for the file with: ls -l
Maybe: https://github.com/vgteam/vg/raw/mas...order/x.vcf.gz

Code:

user@trisquel:~$ ls -l
total 0
user@trisquel:~$ wget https://github.com/vgteam/vg/raw/master/test/order/x.vcf.gz
--2017-[...ton of msgs...] saved [2906/2906]
user@trisquel:~$ ls -l
total 4
-rw-rw-r-- 1 user user 2906 Feb 10 14:40 x.vcf.gz
user@trisquel:~$ gunzip x.vcf.gz
user@trisquel:~$ ls -l
total 16
-rw-rw-r-- 1 user user 16167 Feb 10 14:40 x.vcf
user@trisquel:~$ sed -n 40,61p x.vcf > output-file-1
user@trisquel:~$ ls -l
total 20
-rw-rw-r-- 1 user user  1137 Feb 10 14:43 output-file-1
-rw-rw-r-- 1 user user 16167 Feb 10 14:40 x.vcf
user@trisquel:~$ wc output-file-1
  22   22 1137 output-file-1
user@trisquel:~$ cut -c 1-5 output-file-1 > output-file-2
user@trisquel:~$ ls -l
total 24
-rw-rw-r-- 1 user user  1137 Feb 10 14:43 output-file-1
-rw-rw-r-- 1 user user   132 Feb 10 14:43 output-file-2
-rw-rw-r-- 1 user user 16167 Feb 10 14:40 x.vcf
user@trisquel:~$ head -1 output-file-2; tail -1 output-file-2 #sed -n '1p;$p' #awk 'NR==1;END{print}'
##con
##con
user@trisquel:~$ sed -i '/DEL/2d' output-file-2 > output-file-3
sed: -e expression #1, char 6: unknown command: `2'

cat -n x.vcf
will show line numbers; also (in the vcf I guessed at):
grep -n DEL x.vcf
93:##ALT=<ID=DEL,Description="Deletion">
So, my step #3 would have -different- line numbers.
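A side note on the range in step #3, as a quick sketch: a sed range is inclusive at both ends, so 40,61p prints 22 lines (which matches the wc output above), while lines 40-60 would be 40,60p. Here seq stands in for a real file, so that line N simply contains the number N.

```shell
# seq 100 produces 100 lines; count how many each range prints.
count_61=$(seq 100 | sed -n '40,61p' | wc -l | tr -d ' ')
count_60=$(seq 100 | sed -n '40,60p' | wc -l | tr -d ' ')

echo "40,61p prints $count_61 lines; 40,60p prints $count_60 lines"
```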

[thoughts added later:]
A key concept here is debugging=troubleshooting what's happening.
Another way is: make a tiny/simple test case, and dig-deeply thru it,
to get each step working and understood.
(a bit like a new car, with buttons/switches that [may] do something 'good')
-I- actually didn't know what the sed -i switch does, so I
do a 1-minute scan of the [100page] manual, section sed, switch -i
IF that doesn't clarify it, I try a web-search (including 'goal'), like:
sed examples remove lines that have|match a string
Yea! 1st 'hit': http://stackoverflow.com/questions/5...pecific-string
(I thought this would be a 10+minute 'project' but it turned out <2!)
Deeper search: sed "delete 2 lines": seq 5|sed /2/,+2d COOL!

Anyway, best wishes! You'll be a Linux Master in no time at all!

hydrurga · 02-11-2017, 05:45 AM

Quote:

Originally Posted by Jjanel

Hi again! #1 doesn't look like a valid URL! Look for the file with: ls -l

Yeah, that was the first thing that struck me too, Jjanel, but I reckoned that the OP would probably have mentioned that there was an error at that stage. I assumed therefore that the OP had judiciously edited this portion of the original post to delete the actual URL used.

pizzarist · 02-11-2017, 08:45 AM

I'm terribly sorry, I just made that link up as an example. This is the actual link :
http://ftp.1000genomes.ebi.ac.uk/vol...notypes.vcf.gz

pizzarist · 02-11-2017, 09:16 AM

Quote:

Originally Posted by hydrurga

Step 4. cut -c extracts based on character count, not field (column) count. Look at cut -f instead, with -d if necessary. Alternatively, you could use awk.

Do you mean like this :

$ cut -d ' ' -f 1-5 output-file-1 > output-file-2

hydrurga · 02-11-2017, 09:23 AM

Quote:

Originally Posted by pizzarist

Do you mean like this :

$ cut -d ' ' -f 1-5 output-file-1 > output-file-2

You need to look at each step, and the output from each step, separately and check that the output matches the format and content that you want it to be in. If it doesn't match your requirements then try changing the command and/or options, read the man files for the command in question, search the internet for answers. Then, if you still can't get the output to match what you want, come back on here and post the command and output in question, along with an example of how you want the output to be.

So, starting from the first step, and working downwards step by step, where do you hit your first problem?
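As an illustration of that kind of check, here is a minimal sketch for step 4 on a made-up two-line file. The data is invented and assumes tab-separated columns; the temporary file stands in for output-file-1.

```shell
# Made-up, tab-separated stand-in for output-file-1.
tmp=$(mktemp)
printf 'chr1\t100\trs1\tA\tG\nchr1\t200\trs2\tC\tT\n' > "$tmp"

# sed's 'l' command prints each line unambiguously:
# a tab shows up as \t and the end of the line as $.
visible=$(sed -n 'l' "$tmp" | head -n 1)

# With a space as delimiter, no line contains the delimiter,
# so cut passes every line through unchanged.
with_space=$(cut -d ' ' -f 1-4 "$tmp" | head -n 1)

# With the default tab delimiter, only the first four fields survive.
with_tab=$(cut -f 1-4 "$tmp" | head -n 1)

printf '%s\n' "$visible" "$with_space" "$with_tab"
rm -f "$tmp"
```

Looking at the first line of each intermediate file this way shows quickly whether a step did what was intended before moving on to the next one.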

pizzarist · 02-11-2017, 10:45 AM

Quote:

Originally Posted by hydrurga

You need to look at each step, and the output from each step, separately and check that the output matches the format and content that you want it to be in. If it doesn't match your requirements then try changing the command and/or options, read the man files for the command in question, search the internet for answers. Then, if you still can't get the output to match what you want, come back on here and post the command and output in question, along with an example of how you want the output to be.

So, starting from the first step, and working downwards step by step, where do you hit your first problem?

I suspect that my problem starts with step 4 as I'm not sure what I'm extracting, columns or characters. The final output should be a gene sequencing file, something like : CCGTCGAACCA.

hydrurga · 02-11-2017, 10:55 AM

Quote:

Originally Posted by pizzarist

I suspect that my problem starts with step 4 as I'm not sure what I'm extracting, columns or characters. The final output should be a gene sequencing file, something like : CCGTCGAACCA.

An internet search for "vcf format" produced the following as the first result:

http://www.internationalgenome.org/w...alysis/vcf4.0/

Have a read of the section entitled "Data lines".

Turbocapitalist · 02-11-2017, 10:56 AM

Quote:

Originally Posted by pizzarist

I suspect that my problem starts with step 4 as I'm not sure what I'm extracting, columns or characters. The final output should be a gene sequencing file, something like : CCGTCGAACCA.

You can't get there from here. You'll have to back up and replace step #3 and onwards. I'd recommend awk to extract the 4th column, if the 4th column consists of a larger number of A, C, G, and T.
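A rough sketch of the awk approach, on a made-up miniature file with two header lines and two data lines (the data is invented, and whether column 4 is really what the assignment wants is an assumption):

```shell
# Made-up miniature VCF: two '#' header lines followed by two data lines.
vcf=$(printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\trs1\tA\tG\nchr1\t200\trs2\tCCGT\tC\n')

# Skip every line that starts with '#', then print field 4 of the rest.
col4=$(printf '%s\n' "$vcf" | awk -F'\t' '!/^#/ { print $4 }')

printf '%s\n' "$col4"
```

Changing $4 to another field number picks a different column.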

What are you really supposed to extract from the file?

Jjanel · 02-11-2017, 06:14 PM

I added a bit to my #4post. Have a peek at this web-search: Genomes bio-linux
Also (| is OR here [not 'pipe'!]): "internationalgenome" linux|awk|perl
Oh, and (I'm addicted to web-searching!): book|.pdf linux for bioinformatics
(I'm curious as to what computer and Linux 'distro' you use ... just curious)
You'll 'get there', tho maybe in a different awk-mobile

Linux is INFINITE!

pizzarist · 02-11-2017, 06:33 PM

Quote:

Originally Posted by Jjanel

I added a bit to my #4post. Have a peek at this web-search: Genomes bio-linux
Also (| is OR): "internationalgenome" linux|awk|perl
(I'm curious as to what computer and Linux 'distro' you use ... just curious)
You'll 'get there', tho maybe in a different awk-mobile

Linux is INFINITE!

I use a super computer at our campus ( Karst), remotely from my macbook. I have no idea about the linux. I'm new to all of this.

r3sistance · 02-11-2017, 06:54 PM

Quote:

Originally Posted by pizzarist

I use a super computer at our campus ( Karst), remotely from my macbook. I have no idea about the linux. I'm new to all of this.

A pipe is a fairly simple way to chain commands: it takes the stdout of the command on the left and feeds it to the stdin of the command on the right. That might sound a bit strange, so here is an example.

Code:

echo "hello" | sed 's/hello/hello world/' | cat
hello world

So the output of echo is hello, which is passed to sed. sed replaces hello with hello world.
The output of sed is then passed to the cat on the right, which prints "hello world".

You seem to be using a lot of files unnecessarily where pipes probably would have been a better choice.
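For instance, steps 3 to 5 could be chained roughly like this. It is only a sketch on made-up, tab-separated data: the line range is shortened to 2-4 to fit the tiny sample, and column 2 plays the role of the ID column as described in the question.

```shell
# Made-up input; column 2 holds the ID.
input=$(printf 'a\tDEL_1\tx\ty\tz\nb\trs7\tx\ty\tz\nc\tDEL_2\tx\ty\tz\nd\trs9\tx\ty\tz\n')

# Select lines 2-4, keep the first 4 columns,
# then drop rows whose 2nd column starts with DEL.
result=$(printf '%s\n' "$input" \
  | sed -n '2,4p' \
  | cut -f 1-4 \
  | awk -F'\t' '$2 !~ /^DEL/')

printf '%s\n' "$result"
```

The intermediate files are still handy while debugging, since they let you inspect each step; the pipeline is the tidier form once every step is known to work.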

ntubski · 02-11-2017, 07:09 PM

Quote:

Originally Posted by r3sistance

The output of sed is then passed to the echo on the right which outputs "hello world"

echo ignores its input, doesn't it? Maybe you meant to put cat there instead?