Help with extracting entries from multi-entry columns in a file
Hi All,
I have a tab-separated file where, from the 2nd column onwards, each column holds multiple entries separated by a ;. I would like to get only the 1st entry of each column. I have 8000+ columns, so I am not giving them all here. Here is a sample:
rs1 AG;0.79780;0.132;0.204;487;923 GG;0.79780;0.115;0.161;213;457 AG;0.79780;0.095;0.152;375;835
I would like to have
rs1 AG GG AG
Appreciate any help.
Thank you,
Davina |
Trivial in awk - nominate ";" as field separator.
|
That should be pretty easy with awk or perl, and there are several ways to approach the problem in either. Which one are you trying and can you show how far you have gotten?
|
Quote:
So I'd keep tabs as the separator, but use gsub() to zap everything starting with the first semicolon in each field. But is there a way to do that or otherwise get the same result without needing a loop to go through the fields in each row? |
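For reference, the loop-based idea above can be sketched roughly like this. It assumes the columns really are tab separated and that everything from the first semicolon onward in a field is unwanted. The function name first_entries is made up for illustration. Since each field only needs one substitution, sub() is enough in place of gsub().

```shell
# first_entries: hypothetical helper name, not from the thread.
# Assumes tab-separated columns; reads the named files, or stdin if none given.
first_entries() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             # From column 2 onward, delete the first ";" and everything after it.
             for (i = 2; i <= NF; i++) sub(/;.*/, "", $i)
             print
         }' "$@"
}

# Demo on one row shaped like the sample in the first post (tabs written as \t).
# Prints the four columns rs1, AG, GG, AG separated by tabs.
printf 'rs1\tAG;0.79780;0.132\tGG;0.79780;0.115\tAG;0.79780;0.095\n' | first_entries
```

The loop starts at field 2 because the first column (rs1) has nothing to strip. Changing a field makes awk rebuild the line using OFS, which is why OFS is set to a tab as well.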
Quote:
I tried tr -s '; ' '\t' < "file name". However, that splits each column into multiple columns at the points where ; occurs. I just want the first entry of each column (remember, I have thousands of columns). |
Oops - didn't read that too well did I. Sorry about that. I'll be back.
|
Ok. Try escalating to awk then. Be sure to see the manual page.
Code:
man awk
http://www.grymoire.com/Unix/Awk.html It is a very thorough introduction. |
Quote:
Personally I'd use sed - same/similar regex. |
How's your regex fu, Davina?
Your data (for this discussion) can be defined as "a bunch of non-semicolon characters (that you want to keep), followed by a bunch of non-whitespace characters (that you want to remove)". Define that in regex, and make the substitution global. |
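One way to write that description down, as a sketch only: match a semicolon followed by any run of non-whitespace characters and replace it with nothing, globally. This assumes the entries themselves never contain spaces or tabs. The function name strip_after_semicolon is invented for the example.

```shell
# strip_after_semicolon: hypothetical helper name, not from the thread.
# Assumes no entry contains whitespace; reads the named files, or stdin if none given.
strip_after_semicolon() {
    # ";" followed by any run of non-whitespace characters is deleted,
    # so each column keeps only the text before its first semicolon.
    sed 's/;[^[:space:]]*//g' "$@"
}

# Demo on a shortened row shaped like the sample (tabs written as \t).
printf 'rs1\tAG;0.79780;0.132\tGG;0.79780;0.115\n' | strip_after_semicolon
```

Because the match stops at the first whitespace character, the tabs between columns are left alone and no per-field loop is needed.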
Quote:
DavinaP, the substitution command in sed is what to look at:
Code:
sed -e 's/old/new/g;' < oldfile.txt > newfile.txt |
I do not use awk but try this,
Code:
userx%slackwhere ⚡ testDIR ⚡> awk -F\; '{print $1}' fileDirLit
or to keep it handy
Code:
userx%slackwhere ⚡ testDIR ⚡> awk -F\; '{print $1}' fileDirLit > results
Code:
userx%slackwhere ⚡ testDIR ⚡> awk -F\; 'NR > 1 {print $1}' fileDirLit |
You can use:
Code:
$ tr ';' '\t' < file > file2 |
My feeling is that the example given has been confusing.
If the data format is tab-separated columns with semicolon delimiters within columns, such as
Code:
AG;0.79780;0.132;0.204;487;923 AG;0.79780;0.132;0.204;487;923 AG;0.79780;0.132;0.204;487;923
then try
Code:
awk -F ";[^\t]+" '{for (i=1;i<NF;i++){printf"%s", $i}; printf"\n"}' <inputfile> |
In the OP's post he says all he wants is the very first column, which is all of the AG GG AG etc.,
which this actually gives him
Code:
awk -F\; '{print $1}' fileToLooKAt > results |
But the OP doesn't show any tabs, so $1 is the whole row.
Sorry, didn't see the "-F". |