LinuxQuestions.org - How to order data based on missing values?

- Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)

- - How to order data based on missing values? (https://www.linuxquestions.org/questions/linux-newbie-8/how-to-order-data-based-on-missing-values-4175446647/)

How to order data based on missing values?

I have a file which looks like this.

Code:

con1        BC1a:25        BC1b:25        BC2a:2        BC2b:20        BC3a:5 BC3b:56

con2        BC1a:25        BC2a:2                                

con3        BC2a:2        BC3a:5        BC3b:6                        

con4        BC1b:20        BC2a:12        BC2b:20        BC3a:50        BC3b:5

All the rows doesn't have the equal number of columns. I would like to order the data in the columns based on BCtag and insert missing values for columns that are empty.

My result should look something like this

Code:

con1        BC1a:25        BC1b:25        BC2a:2        BC2b:20        BC3a:5 BC3b:56

con2        BC1a:25        -        BC2a:2        -        -        -

con3        -        -        BC2a:2        -        BC3a:5        BC3b:6

con4        -        BC1b:20        BC2a:12        BC2b:20        BC3a:50 BC3b:5

My actual file has many thousands of rows and 150 columns. Any ideas of how to do this?

Thanks in advance.

not easy... but doable.

I'd do it in perl - the associative hash tables would do it (the BC1a labels used as subscript).

You start with an array - containing all the labels you will be using.

Then create a reference hash table that has values for each of the labels (the "-").

in a loop for each line,
initialize a new hash table with the reference table. for each token on the line (you do have to split the token at the ":") replace the new hash table entry with the string based on the BCxx label that is in the token list.

Last, output the new hash table based on the array (in the proper order) containing the labels.

You could do it with python, but I'm not that familiar with python. Can't give you a sample code right now - have to deal with some cats...
Maybe tomorrow though.

My attempt at a bash solution.

Code:

!/bin/bash



# Read the data file to get a sorted list of prefixes

while read -a aline ; do

  for (( i=1; i<"${#aline[@]}"; i+=1 )); do

    echo "${aline[i]%:*}" >> prefixes.txt;

  done

  sort -u prefixes.txt > temp.txt;

  mv temp.txt prefixes.txt;

done < data.txt



# Use the sorted list to produce the output file

readarray -t prefixes < prefixes.txt

while read -a aline ; do

  j=1;

  echo -n "${aline[0]}" >> output.txt;

  for (( i=0; i<"${#prefixes[@]}"; i+=1 )); do

    if [[ "${aline[j]%:*}" == "${prefixes[i]}" ]] ; then 

      echo -n " ${aline[j]}" >> output.txt;

      (( j+=1 ));

    else  

      echo -n " -" >> output.txt;

    fi

  done

  echo "" >> output.txt;

done < data.txt



# Cleanup  

rm prefixes.txt

Using the input you gave in a file named data.txt, this script produces a file output.txt containing

Code:

con1 BC1a:25 BC1b:25 BC2a:2 BC2b:20 BC3a:5 BC3b:56

con2 BC1a:25 - BC2a:2 - - -

con3 - - BC2a:2 - BC3a:5 BC3b:6

con4 - BC1b:20 BC2a:12 BC2b:20 BC3a:50 BC3b:5

I should point out that the script assumes the fields in a line are in lexicographic order.