How to find unique characters within each column in a .txt file in Linux?

Rozak · 09-08-2015, 11:47 AM

Quote:

Originally Posted by grail

and this one:

Code:

001 100 020
002 111 025
001 100 001
003 100 111

and if you had left the first column in for the previous example, I would expect quite a different result.

Code:

 
001 100 020
002 111 025
003     001
        111

And for the previous example, I would have had only one row with 080 in the first column.

NevemTeve · 09-08-2015, 12:25 PM

> please ignore the first column
Not possible. Let's assume this:

Code:

080 001 100 020
    002 111 025
    003

Well, pick up a programming language (script or otherwise, but 64-bit capable), and do the following:

1. read everything into memory; arrange data column-wise
2. for every column: sort and drop duplicates
3. create the output
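
The three steps above can be sketched in awk roughly as follows. This is only an illustration under stated assumptions: the file names `sample.txt` and `sample_unique.txt` are made up, the input is whitespace-separated, and `-` is an arbitrary placeholder for cells of columns that have run out of unique values (so the output keeps its column count).

```shell
# Sample input taken from the thread (file names are assumptions).
cat > sample.txt <<'EOF'
001 100 020
002 111 025
001 100 001
003 100 111
EOF

awk '
{
    if (NF > ncols) ncols = NF              # step 1: remember the widest row
    for (c = 1; c <= NF; c++)
        if (!seen[c, $c]++)                 # first time this value shows up in column c
            val[c, ++count[c]] = $c
}
END {
    # step 2: insertion sort of each column, comparing values as strings
    for (c = 1; c <= ncols; c++)
        for (i = 2; i <= count[c]; i++) {
            v = val[c, i]
            for (k = i - 1; k >= 1 && (val[c, k] "") > (v ""); k--)
                val[c, k + 1] = val[c, k]
            val[c, k + 1] = v
        }
    # step 3: print row by row, padding exhausted columns with "-"
    for (c = 1; c <= ncols; c++)
        if (count[c] > rows) rows = count[c]
    for (r = 1; r <= rows; r++) {
        line = ""
        for (c = 1; c <= ncols; c++) {
            cell = (r <= count[c]) ? val[c, r] : "-"
            line = line (c > 1 ? " " : "") cell
        }
        print line
    }
}' sample.txt > sample_unique.txt

cat sample_unique.txt
```

Because the values are sorted, the third column comes out as 001, 020, 025, 111 rather than in the order of first appearance.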

Rozak · 09-08-2015, 12:31 PM

Thank you, but how? I am just a beginner and do not know how. I used this command in Linux:
awk < input.txt '{print $1}' | sort | uniq > output.txt
but it gave me the answer only for the first column. I am wondering how to change this command to get the answers for all columns at the same time.
Can you guide me, please?
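
One way to stretch that single-column pipeline over every column is a small shell loop. The sketch below is a hedged illustration, not a finished tool: the sample contents of `input.txt` and the output names `col_1.txt`, `col_2.txt`, ... are assumptions, and `sort -u` is used as shorthand for `sort | uniq`.

```shell
# Sample input (contents are an assumption for illustration).
cat > input.txt <<'EOF'
123 000 111
232 123 123
123 123 123
123 000 123
EOF

# Number of columns, taken from the widest row.
ncols=$(awk 'NF > n { n = NF } END { print n + 0 }' input.txt)

# The same pipeline as for column 1, repeated once per column.
c=1
while [ "$c" -le "$ncols" ]; do
    awk -v c="$c" 'c <= NF { print $c }' input.txt | sort -u > "col_$c.txt"
    c=$((c + 1))
done
```

Each `col_N.txt` then holds the sorted unique values of column N; `paste col_1.txt col_2.txt col_3.txt` would place them side by side again.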

NevemTeve · 09-08-2015, 12:47 PM

Sorry, but this is a task for a programmer. Nobody can turn you into a programmer via forum-posts.

grail · 09-08-2015, 01:03 PM

Quote:

Originally Posted by Rozak

Code:

 
001 100 020
002 111 025
003     001
        111

And for the previous example, I would have had only one row with 080 in the first column.

Would you be able to explain how you got this output?

You have moved 003 from the 4th row up to the 3rd
Left a gap in column 2 / row 3 which you have not done previously
And left 2 gaps in the 4th row

You have previously mentioned that not all rows may have the same number of columns, but you're now saying you also do not wish to lose any columns ... is this correct?
Or, put another way, if a row starts with 4 columns it will always have 4 columns but some may now be empty?

You say that you would have only one row with 080 ... so are you now saying that the first column will never be repeated?
Are there other rows that this may also occur with?

As you can see, this is not an easy problem, and it is only made harder the more information you omit.

Rozak · 09-08-2015, 01:14 PM

I just want to extract the unique values within each column, which means that if a value is repeated more than once in a column, I want to see it only once in my new data.

danielbmartin · 09-08-2015, 02:46 PM

With this InFile ...

Code:

123 000 111
232 123 123
123 123 123
123 000 123

... this awk ...

Code:

awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]"\n"$j:0;}
  END{for (j=1;j<=NF;j++) print "\nIn column",j,
 "The unique values are:"b[j]}' $InFile >$OutFile

... produced this OutFile ...

Code:

In column 1 The unique values are:
123
232

In column 2 The unique values are:
000
123

In column 3 The unique values are:
111
123

Daniel B. Martin
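
For readers new to awk, the compact `!a[j","$j]++ ? ... : 0` expression in the code above can be unrolled into an ordinary `if`. The following is a sketch of the same idea, not the poster's exact program: the file names `InFile.txt` and `OutFile.txt` and the variable `ncols` are assumptions added for illustration.

```shell
# Same sample data as in the post (file names are assumptions).
cat > InFile.txt <<'EOF'
123 000 111
232 123 123
123 123 123
123 000 123
EOF

awk '
{
    if (NF > ncols) ncols = NF
    for (j = 1; j <= NF; j++) {
        key = j "," $j              # e.g. "2,000" means value 000 in column 2
        if (a[key] == 0)            # not seen before in this column
            b[j] = b[j] "\n" $j     # append it to the list kept for column j
        a[key]++                    # remember that it has now been seen
    }
}
END {
    for (j = 1; j <= ncols; j++)
        print "\nIn column", j, "The unique values are:" b[j]
}' InFile.txt > OutFile.txt

cat OutFile.txt
```

The values are listed in order of first appearance, since nothing is sorted.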

Rozak · 09-08-2015, 03:23 PM

Thank you! But I want my new data file to have the same number of columns as the original one (exactly the same structure, but without duplication). I mean I want it to look like this:
new file:

Code:

123 000 111
232 123 123

I can also send you a part of my original data if you would like to see.

danielbmartin · 09-08-2015, 04:49 PM

With this InFile ...

Code:

123 000 111
232 123 123
123 123 123
123 000 123

... this awk code ...

Code:

 awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;}
   END{for (j=1;j<=NF;j++) print b[j]}' $InFile   \
|awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j} 
  END {j=1; while (j in a) {print a[j];j++}}' >$OutFile

... produced this OutFile ...

Code:

 123 000 111
 232 123 123

Daniel B. Martin

Rozak · 09-08-2015, 05:07 PM

For this code I got this error when I tried to run it in Linux:
awk: fatal: cannot open file ` ' for reading (No such file or directory)
although I am in the correct directory and I changed my input file name to hap.txt.
Could you help me solve the problem?

danielbmartin · 09-08-2015, 05:23 PM

Quote:

Originally Posted by Rozak

could you help me by solving the problem?

$InFile is the symbolic name for the input file.
$OutFile is the symbolic name for the output file.

This is the way my code reads ...

Code:

# File identification
   Path=${0%.*}
 InFile=$Path"inp.txt"
OutFile=$Path"out.txt"

... but that won't work on your computer because I don't know the names of your input and output files.

My preference is to have the program and data files in the same directory. Many people follow a different convention. Regardless of this distinction the awk code should work if you correctly identify the files.

On my machine the program is named dbm1484.bin; the InFile is dbm1484inp.txt; the OutFile is dbm1484out.txt.

Suggestion: get help from someone at your location.

Daniel B. Martin
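
As a hedged sketch of what "correctly identify the files" could look like in practice: the snippet below assumes the data is in `hap.txt` in the current directory and that the result should go to `uniqhap.txt` (both names are assumptions here), and it creates a small sample `hap.txt` only so that it can be run as-is.

```shell
# Sample data so the sketch runs as-is; use the real hap.txt instead.
cat > hap.txt <<'EOF'
123 000 111
232 123 123
123 123 123
123 000 123
EOF

InFile=hap.txt          # assumed input file name
OutFile=uniqhap.txt     # assumed output file name

# Stop early with a readable message if the input cannot be read.
[ -r "$InFile" ] || { echo "cannot read $InFile" >&2; exit 1; }

# The two-stage awk pipeline from the thread. The variables are quoted so
# that an empty name or a name with spaces does not change the command.
awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;}
  END{for (j=1;j<=NF;j++) print b[j]}' "$InFile" \
| awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j}
  END {j=1; while (j in a) {print a[j];j++}}' > "$OutFile"

cat "$OutFile"
```

The backslash at the end of the third awk line only works as a line continuation when it is the very last character on that line.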

danielbmartin · 09-08-2015, 05:29 PM

For what it's worth, this is my program in its entirety.

Code:

#!/bin/bash
# Daniel B. Martin   Sep15
#
# To execute this program, launch a terminal session and enter:
#  bash /home/daniel/Desktop/LQfiles/dbm1484.bin

# This program inspired by ...
#  http://www.linuxquestions.org/questions/programming-9/
#    how-to-find-unique-characters-within-each-column-in-a-txt-file-in-linux-4175552929/

# Keywords: unique within column; ternary operator; transpose matrix

# File identification
   Path=${0%%.*}
 InFile=$Path"inp.txt"
OutFile=$Path"out.txt"

echo; echo "Method #1 of LQ Member danielbmartin."
awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]"\n"$j:0;}
  END{for (j=1;j<=NF;j++) print "\nIn column",j,
 "the unique values are:"b[j]}' $InFile >$OutFile
echo "InFile ...";  cat $InFile;  echo "End Of File ("$(wc -l <$InFile)"  lines)"
echo "OutFile ..."; cat $OutFile; echo "End Of File ("$(wc -l <$OutFile)" lines)"

echo; echo "Method #2 of LQ Member danielbmartin."
 awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;}
   END{for (j=1;j<=NF;j++) print b[j]}' $InFile   \
|awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j} 
  END {j=1; while (j in a) {print a[j];j++}}' >$OutFile
echo "InFile ...";  cat $InFile;  echo "End Of File ("$(wc -l <$InFile)"  lines)"
echo "OutFile ..."; cat $OutFile; echo "End Of File ("$(wc -l <$OutFile)" lines)"

echo; echo "Normal end of job."; echo; exit

Daniel B. Martin

Rozak · 09-08-2015, 06:29 PM

Quote:

Originally Posted by danielbmartin

$InFile is the symbolic name for the input file.
$OutFile is the symbolic name for the output file.

This is the way my code reads ...

Code:

# File identification
   Path=${0%.*}
 InFile=$Path"inp.txt"
OutFile=$Path"out.txt"

... but that won't work on your computer because I don't know the names of your input and output files.

My preference is to have the program and data files in the same directory. Many people follow a different convention. Regardless of this distinction the awk code should work if you correctly identify the files.

On my machine the program is named dbm1484.bin; the InFile is dbm1484inp.txt; the OutFile is dbm1484out.txt.

Suggestion: get help from someone at your location.

Daniel B. Martin

My input file is hap.txt and my output file is uniqhap.txt. I changed the names to these, but I get this error:
skarimi@signal[19:20][~]$ cd mkhap
skarimi@signal[19:26][~/mkhap]$ awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;} END{for (j=1;j<=NF;j++) print b[j]}' hap.txt \ |awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j} END {j=1; while (j in a) {print a[j];j++}}' > uniqhap.txt
awk: cmd. line:1: fatal: cannot open file ` ' for reading (No such file or directory)
skarimi@signal[19:27][~/mkhap]$
Should I add a program to my file? Is this the problem? Can you please guide me?

danielbmartin · 09-08-2015, 06:35 PM

Quote:

Originally Posted by Rozak

awk: cmd. line:1: fatal: cannot open file ` ' for reading (No such file or directory)

This means you have not correctly identified the InFile, so the awk says "there is no InFile so I can't execute."

Again, you need help from someone at your location.

Daniel B. Martin

Rozak · 09-08-2015, 06:49 PM

But the previous command that you wrote works and does not give any error. Look:
skarimi@signal[19:41][~/mkhap]$ awk '{for (j=1;j<=NF;j++) !b[j","$j]++?b[j]=b[j]"\n"$j:0;} END{for (j=1;j<=NF;j++) print "\nIn column",j,
> "The unique values are:"b[j]}' hap.txt > uniq.txt
skarimi@signal[19:46][~/mkhap]$
It runs without any error and I can see the result, but the last command you sent me does not work. Are you sure nothing is wrong with the command? I really appreciate your help!
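
Judging only from the terminal line pasted earlier, one likely cause is the `\ |awk` that appeared when the two-line command was joined into a single line. In the shell, a backslash followed by a space is an escaped space, so the first awk receives a second file argument whose name is a single space, which would match the "cannot open file ` '" message. The sketch below illustrates this with `printf` and then runs the one-line form without the backslash; the sample contents of `hap.txt` are an assumption.

```shell
# Sample data (contents are an assumption for illustration).
cat > hap.txt <<'EOF'
123 000 111
232 123 123
123 123 123
123 000 123
EOF

# "\ " before the pipe is passed along as an extra argument: a single space.
printf '[%s]\n' hap.txt \ |cat

# The same pipeline on one line, with the backslash removed.
awk '{for (j=1;j<=NF;j++) !a[j","$j]++?b[j]=b[j]" "$j:0;} END{for (j=1;j<=NF;j++) print b[j]}' hap.txt | awk '{for (j=1;j<=NF;j++) a[j]=a[j]" "$j} END {j=1; while (j in a) {print a[j];j++}}' > uniqhap.txt

cat uniqhap.txt
```

The `printf` line prints `[hap.txt]` and `[ ]`, showing the stray second argument. The backslash is only needed when the command is split across lines, where it must be the last character on the line.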