
DavinaP 06-30-2017 04:29 AM

help with extracting entries from multiple-entry columns in a file
 
Hi All,
I have a tab-separated file where each column from the 2nd one onwards contains multiple entries separated by a ;.
I would like to get only the 1st entry of each column. I have 8000+ columns, so I'm not showing them all here.
Here is a sample:

rs1
AG;0.79780;0.132;0.204;487;923
GG;0.79780;0.115;0.161;213;457
AG;0.79780;0.095;0.152;375;835

I would like to have

rs1
AG
GG
AG

Appreciate any help.
Thank you,
Davina

syg00 06-30-2017 04:39 AM

Trivial in awk - nominate ";" as field separator.

Turbocapitalist 06-30-2017 04:40 AM

That should be pretty easy with awk or perl, and there are several ways to approach the problem in either. Which one are you trying and can you show how far you have gotten?

Turbocapitalist 06-30-2017 04:43 AM

Quote:

Originally Posted by syg00 (Post 5729067)
Trivial in awk - nominate ";" as field separator.

The sample above shows only one column. The other columns are apparently separated by tabs.

So I'd keep tabs as the separator, but use gsub() to zap everything starting with the first semicolon in each field. But is there a way to do that or otherwise get the same result without needing a loop to go through the fields in each row?
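
With the loop it would look something like this (just an untested sketch, assuming the real file is tab-separated and "file" stands in for its actual name):

Code:

awk 'BEGIN { FS = OFS = "\t" } { for (i = 1; i <= NF; i++) gsub(/;.*/, "", $i); print }' file
The gsub() deletes everything from the first semicolon to the end of each field, and FS/OFS keep the tabs between columns intact.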

DavinaP 06-30-2017 04:45 AM

Quote:

Originally Posted by Turbocapitalist (Post 5729068)
That should be pretty easy with awk or perl, and there are several ways to approach the problem in either. Which one are you trying and can you show how far you have gotten?

Thanks, I have not gotten very far, except for trying this command:
tr -s '; ' '\t' < "file name"
However, that splits each column into multiple columns wherever the ; occurs.
I just want the first entry of each column (remember, I have thousands of columns).

syg00 06-30-2017 04:51 AM

Oops - didn't read that too well, did I. Sorry about that. I'll be back.

Turbocapitalist 06-30-2017 04:52 AM

Ok. Try escalating to awk then. Be sure to see the manual page.

Code:

man awk
But that is a reference (actually the reference) only, and though you should use it a lot, it might not be the best place to start with awk. So also see this site:

http://www.grymoire.com/Unix/Awk.html

It is a very thorough introduction.

syg00 06-30-2017 05:45 AM

Quote:

Originally Posted by Turbocapitalist (Post 5729069)
So I'd keep tabs as the separator, but use gsub() to zap everything starting with the first semicolon in each field. But is there a way to do that or otherwise get the same result without needing a loop to go through the fields in each row?

gensub maybe - that way you can use back-references.
Personally I'd use sed - same/similar regex.
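
For gensub() it might look something like this (untested, gawk only, and assuming tabs between the columns):

Code:

gawk '{ print gensub(/([^;\t]+);[^\t]*/, "\\1", "g") }' file
The \1 back-reference keeps the leading non-semicolon part of each column and drops the rest.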

syg00 06-30-2017 06:31 AM

How's your regex fu, Davina?
Your data (for this discussion) can be defined as "a bunch of non-semicolon characters (that you want to keep), followed by a bunch of non-whitespace characters (that you want to remove)". Define that in regex, and make the substitution global.
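
If you get stuck, it comes to something like this (untested, and assuming the values themselves never contain spaces):

Code:

sed 's/;[^[:space:]]*//g' file
The ; plus the run of non-whitespace after it is what gets removed, and /g makes it happen in every column on the line.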

Turbocapitalist 07-01-2017 12:41 AM

Quote:

Originally Posted by syg00 (Post 5729089)
Personally I'd use sed - same/similar regex.

Yes. If one thinks of each line as a single unit, then sed is a good idea. I had been thinking of the line as a record with fields and thus gravitated to awk. Either will work. sed is a little terse, while awk is a little more complicated, though.

DavinaP, the substitution command in sed is what to look at:

Code:

sed -e 's/old/new/g;' < oldfile.txt > newfile.txt
The greater than > and less than < signs are IO redirects in the shell.
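
For instance, with syg00's pattern dropped in for "old" and nothing for "new", one of your sample values would come out like this (untested sketch):

Code:

$ printf 'AG;0.79780;0.132;0.204;487;923\n' | sed -e 's/;[^[:space:]]*//g;'
AG
Run the same command with < oldfile.txt > newfile.txt instead of the printf pipe to process the whole file.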

BW-userx 07-01-2017 07:40 AM

I do not use awk, but try this:
Code:

userx%slackwhere ⚡ testDIR ⚡> awk -F\;  '{print $1}' fileDirLit
rs1
AG
GG
AG

http://cs.canisius.edu/ONLINESTUFF/P...K/awk.examples

or to keep it handy
Code:

userx%slackwhere ⚡ testDIR ⚡> awk -F\;  '{print $1}' fileDirLit > results     
userx%slackwhere ⚡ testDIR ⚡> cat results
rs1
AG
GG
AG

to skip that first line
Code:

userx%slackwhere ⚡ testDIR ⚡> awk -F\; 'NR > 1 {print $1}' fileDirLit
AG
GG
AG
GG
AG


AwesomeMachine 07-01-2017 01:02 PM

You can use:
Code:

$ tr ';' '\t' < file > file2
cat file2 | awk '{print $1}' > file3

That is untested, but I think it will work. What you're doing is changing the semicolons to tabs, which are white space, and then selecting the column before the first white space.

allend 07-03-2017 08:05 AM

My feeling is that the example given has been confusing.
If the data format is tab-separated columns with semicolon delimiters within columns, such as
Code:

AG;0.79780;0.132;0.204;487;923        AG;0.79780;0.132;0.204;487;923        AG;0.79780;0.132;0.204;487;923
GG;0.79780;0.115;0.161;213;457        GG;0.79780;0.115;0.161;213;457        GG;0.79780;0.115;0.161;213;457
AG;0.79780;0.095;0.152;375;835        AG;0.79780;0.095;0.152;375;835        AG;0.79780;0.095;0.152;375;835

then I suggest using awk
Code:

awk -F ";[^\t]+" '{for (i=1;i<NF;i++){printf"%s", $i}; printf"\n"}' <inputfile>

BW-userx 07-03-2017 08:11 AM

In the OP's post, they say that all they want is the very first column, which is all of the AG, GG, AG, etc.
This command actually gives them that:
Code:

awk -F\;  '{print $1}' fileToLooKAt > results

AwesomeMachine 07-03-2017 10:32 AM

But the OP doesn't show any tabs, so $1 is the whole row.

Sorry, didn't see the "-F".

