LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   How to order data based on missing values? (https://www.linuxquestions.org/questions/linux-newbie-8/how-to-order-data-based-on-missing-values-4175446647/)

jv61 01-21-2013 04:33 PM

How to order data based on missing values?
 
I have a file which looks like this.

Code:

con1        BC1a:25        BC1b:25        BC2a:2        BC2b:20        BC3a:5 BC3b:56
con2        BC1a:25        BC2a:2                               
con3        BC2a:2        BC3a:5        BC3b:6                       
con4        BC1b:20        BC2a:12        BC2b:20        BC3a:50        BC3b:5

All the rows doesn't have the equal number of columns. I would like to order the data in the columns based on BCtag and insert missing values for columns that are empty.

My result should look something like this

Code:

con1        BC1a:25        BC1b:25        BC2a:2        BC2b:20        BC3a:5 BC3b:56
con2        BC1a:25        -        BC2a:2        -        -        -
con3        -        -        BC2a:2        -        BC3a:5        BC3b:6
con4        -        BC1b:20        BC2a:12        BC2b:20        BC3a:50 BC3b:5


My actual file has many thousands of rows and 150 columns. Any ideas of how to do this?

Thanks in advance.

jpollard 01-21-2013 08:55 PM

not easy... but doable.

I'd do it in perl - the associative hash tables would do it (the BC1a labels used as subscript).

You start with an array - containing all the labels you will be using.

Then create a reference hash table that has values for each of the labels (the "-").

in a loop for each line,
initialize a new hash table with the reference table. for each token on the line (you do have to split the token at the ":") replace the new hash table entry with the string based on the BCxx label that is in the token list.

Last, output the new hash table based on the array (in the proper order) containing the labels.

You could do it with python, but I'm not that familiar with python. Can't give you a sample code right now - have to deal with some cats...
Maybe tomorrow though.

allend 01-22-2013 09:00 AM

My attempt at a bash solution.
Code:

!/bin/bash

# Read the data file to get a sorted list of prefixes
while read -a aline ; do
  for (( i=1; i<"${#aline[@]}"; i+=1 )); do
    echo "${aline[i]%:*}" >> prefixes.txt;
  done
  sort -u prefixes.txt > temp.txt;
  mv temp.txt prefixes.txt;
done < data.txt

# Use the sorted list to produce the output file
readarray -t prefixes < prefixes.txt
while read -a aline ; do
  j=1;
  echo -n "${aline[0]}" >> output.txt;
  for (( i=0; i<"${#prefixes[@]}"; i+=1 )); do
    if [[ "${aline[j]%:*}" == "${prefixes[i]}" ]] ; then
      echo -n " ${aline[j]}" >> output.txt;
      (( j+=1 ));
    else 
      echo -n " -" >> output.txt;
    fi
  done
  echo "" >> output.txt;
done < data.txt

# Cleanup 
rm prefixes.txt

Using the input you gave in a file named data.txt, this script produces a file output.txt containing
Code:

con1 BC1a:25 BC1b:25 BC2a:2 BC2b:20 BC3a:5 BC3b:56
con2 BC1a:25 - BC2a:2 - - -
con3 - - BC2a:2 - BC3a:5 BC3b:6
con4 - BC1b:20 BC2a:12 BC2b:20 BC3a:50 BC3b:5

I should point out that the script assumes the fields in a line are in lexicographic order.


All times are GMT -5. The time now is 12:23 PM.