[SOLVED] Build an array/matrix from a concatenated list in bash

eamesj · 05-21-2013, 09:26 AM

Hi all

I have a concatenated list in a file that I would like to split up and build an array/matrix from.

i.e. for the list

Code:

901  0.0001618 #sub-list 1
901 -0.0083606
901 -0.0060424
902 -0.0006518 #sub-list 2
902 -0.0006474
902 -0.0006474
907  0.0001615 #sub-list 3
907 -0.0093895
907 -0.0090656

I would like to read line by line and build a new array/matrix for every change in column 1 (901, 902 and 907 in this case, but not necessarily these values) so that I can read each sub-list independently.

How can I go about doing this, and how do I call on each sub-list?

Thanks

grail · 05-21-2013, 10:28 AM

What is to be stored in each element of the array?

What language are you reading the file with?

What have you done so far to help solve this issue?

PTrenholme · 05-21-2013, 01:24 PM

When you say "#sub-list," to what are they subordinate?

If they are just independent lists, then the problem is quite simple. For example, in gawk all you'd need is something like this:

Code:

{
  list[$1]=(list[$1]=="")?$0:list[$1] SUBSEP $0
}

In bash, this might work:

Code:

#!/bin/bash
declare -A Value
declare -a Label
declare Count=0
function add_element()
{
  local element
  element="${1}"
  shift
  if [ -z "${Value[${element}]}" ]
  then
    ((++Count))
    Label[${Count}]="${element}"
    Value[${element}]="${@}"
  else
    Value[${element}]="${Value[${element}]}"$'\n'"${@}"
  fi
}
# Test code
while read -a line
do
  add_element "${line[@]}"
done < eamesi.data
for ((i=1;i<=Count;++i))
do
  echo
  echo "${Label[${i}]}:"
  echo "${Value[${Label[${i}]}]}"
done

Running that last code for your sample yields:

Code:

$ bash eamesi

901:
0.0001618 #sub-list 1
-0.0083606
-0.0060424

902:
-0.0006518 #sub-list 2
-0.0006474
-0.0006474

907:
0.0001615 #sub-list 3
-0.0093895
-0.0090656
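
To "call on" one sub-list afterwards, the newline-separated string stored under a label can be split back into an ordinary indexed array. The following is only a minimal sketch of that idea, assuming bash 4 or later (for `declare -A` and `mapfile`); the temporary data file and the names `data`, `key`, `val` and `sub` are illustrative and not part of the script above.

```bash
#!/bin/bash
# Sketch only: the sample data goes into a temporary file (a hypothetical
# stand-in for the real input file).
data=$(mktemp)
printf '%s\n' '901  0.0001618' '901 -0.0083606' '901 -0.0060424' \
              '902 -0.0006518' '902 -0.0006474' '902 -0.0006474' \
              '907  0.0001615' '907 -0.0093895' '907 -0.0090656' > "$data"

declare -A Value   # label -> newline-separated values
declare -a Label   # labels in order of first appearance

while read -r key val _; do
  [[ -z $key ]] && continue            # skip blank lines
  if [[ -z ${Value[$key]+set} ]]; then
    Label+=("$key")                    # first time this label is seen
    Value[$key]=$val
  else
    Value[$key]+=$'\n'$val             # append to the existing sub-list
  fi
done < "$data"
rm -f "$data"

# "Call on" one sub-list: split it back into an indexed array.
mapfile -t sub <<< "${Value[902]}"
echo "902 has ${#sub[@]} entries; the first is ${sub[0]}"
```

Each element of `sub` can then be used individually, e.g. `${sub[1]}` for the second value recorded under 902.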

eamesj · 05-22-2013, 05:25 AM

ok, so, this is an evolving project ...

What I have is a list in blocks of 200 lines, each line containing a file and a value:

Code:

901  0.0001618
901 -0.0083606
901 -0.0065674
...
901 -0.0006485 # 200th row
902 -0.0060424
902 -0.0006518
..
902 -0.0006474 # 400th row
903 -0.0006518
903  0.0006518
903 -0.0006518
etc..

what i am generating

Code:

901  0.0001618 902 -0.0006485 903 -0.0006518
901 -0.0083606 901 -0.0060424 903 -0.0006518
901 -0.0065674 902 -0.0006518 903 -0.0006518
...
901 -0.0006485 902 -0.0006474 903 -0.0065674 # 200th row

For this I'm using a series of sed and awk commands to format the rows into columns

Code:

listcount=`wc tempfile.txt | awk '{print $1}'`
while [ $listcount -ge 200 ] ; do
	sed -n '1,200 p' tempfile.txt > 1.txt
	sed -i '1,200 d' tempfile.txt
	listcount=`wc tempfile.txt | awk '{print $1}'`
	echo $listcount
	paste -d" " 1.txt tempfile.txt > tempfile2.txt
	sed 's/  / /g' tempfile2.txt > tempfile.txt
done

and then assigning columns with an array with the following:

Code:

for i in $(eval echo {3..$maxfile}); do
	if [[ $((i % 2)) != 0 ]]; then
	    array+=($i)
	fi
done

This is so that I can perform functions on the columns. Unfortunately, the sed/awk/paste formatting takes a very long time on larger files. Are there any speedier options?

chrism01 · 05-22-2013, 05:40 AM

If you're going to do a lot of data processing, I'd suggest moving up to a language that can do what you want without recourse to the shell, e.g. Perl.
(You could use C, but it's not that much quicker than Perl and it's a lot more fiddly to program.)

konsolebox · 05-22-2013, 06:13 AM

Your matrix is confusing, or perhaps I'm not really able to think well because of a cold. Could you kindly add colors to those numbers so that we can see how they were relocated, please? Thanks.

eamesj · 05-22-2013, 07:22 AM

Sorry, think a typo has made it confusing...

Code:


901  0.1111111
901 -0.1111111
901 -0.1111111
...
901 -0.1111111 # 200th row
902 -0.2222222 # 201st row
902 -0.2222222
..
902 -0.2222222  # 400th row
903 -0.3333333 
903  0.3333333
..
903 -0.3333333 # 600th row
etc..

becomes
901 0.1111111 902 -0.2222222 903 -0.3333333
901 -0.1111111 902 -0.2222222 903 0.3333333
901 -0.1111111
etc ...
901 -0.1111111 902 -0.2222222 903 -0.3333333 # 200th row

konsolebox · 05-22-2013, 08:13 AM

There are many ways but the simplest I think is by continuous concatenating of strings:

Code:

#!/bin/bash

INPUT=/path/to/input_file.ext
OUTPUT=/path/to/output_file.ext
SEP=' '
LINES=()

{
    for (( I = 0; I < 200; ++I )); do
        read LINE || break
        LINES[I]=$LINE
    done

    I=0

    while read LINE; do
        LINES[I]=${LINES[I]}${SEP}${LINE}
        (( I = (I + 1) % 200 ))
    done
} < "$INPUT"

{
    for I in "${!LINES[@]}"; do  ## or IFS=$'\n' eval "echo \"\${LINES[*]}\""
        echo "${LINES[I]}"
    done
} > "$OUTPUT"

grail · 05-22-2013, 09:30 AM

Ok ... so far it all looks like a nightmare to me (and I do not have a cold (yet)).

If I understand correctly:

1. You have a single file 1000's of lines long and every 200 lines (or maybe some arbitrary value) the value in the first column changes

2. Take said file and reformat it so every line is a concatenation of the lines at the same position within each grouping (here every 200)

If correct the above is what provides the output shown in post #4 (correct?)

The part I do not understand, assuming above is correct, is:

Quote:

and then assigning columns with an array

Would you please elaborate on what is meant by this line? What exactly is being placed in an array? (ie what data)

danielbmartin · 05-22-2013, 09:50 AM

Quote:

Originally Posted by eamesj

Sorry, think a typo has made it confusing...

It's still confusing. I simplified the input file to this shorter version ...

Code:

901  0.1111111
901 -0.1111112
901 -0.1111113
902 -0.2222221
902 -0.2222222
902 -0.2222223
903 -0.3333331
903  0.3333332
903 -0.3333333
904 -0.4444441
904  0.4444442
904 -0.4444443

... this code ...

Code:

# File identification
 Path=$(cut -d'.' -f1 <<< ${0})
 InFile=$Path"inp.txt"
OutFile=$Path"out.txt"
   Work=$Path"w.txt"

n=4       # n = number of output files
l=12      # l = total number of lines to write
let r=l/n # r = ratio
split -d -l $r $InFile $Work
paste -d" " $Work* >$OutFile

... produced this OutFile ...

Code:

901  0.1111111 902 -0.2222221 903 -0.3333331 904 -0.4444441
901 -0.1111112 902 -0.2222222 903  0.3333332 904  0.4444442
901 -0.1111113 902 -0.2222223 903 -0.3333333 904 -0.4444443

Daniel B. Martin
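
The `n` and `l` values above are hard-coded for the 12-line sample. As a rough sketch, the rows per group could instead be derived from the file itself, assuming every group has the same number of rows. The temporary directory and the names `work`, `total`, `groups`, `rows` and `result` are illustrative only.

```bash
#!/bin/bash
# Sketch under assumptions: every group has the same number of rows, and the
# files below are hypothetical temporary files.
work=$(mktemp -d)
printf '%s\n' '901 0.1' '901 0.2' '902 0.3' '902 0.4' '903 0.5' '903 0.6' > "$work/inp.txt"

total=$(wc -l < "$work/inp.txt")                              # total number of lines
groups=$(awk '{print $1}' "$work/inp.txt" | sort -u | wc -l)  # distinct first-column values
rows=$(( total / groups ))                                    # lines per group

split -l "$rows" "$work/inp.txt" "$work/part."   # writes part.aa, part.ab, ...
result=$(paste -d' ' "$work"/part.*)             # the glob expands in sorted order
rm -rf "$work"
echo "$result"
```

If the groups are not all the same length, the pieces will no longer line up with the first-column values, so this shortcut only fits input shaped like the sample.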

PTrenholme · 05-22-2013, 11:58 AM

Here's another gawk program:

Code:

#!/bin/gawk -f
function Max(a,b)
{
  if (a > b) return a
  return b
}
BEGIN {
  if (out=="") out="/dev/stdout"
  label=0
  len=0
}
{
  data[$1][++count[$1]]=$2
  label=Max(label,length($1))
  len=Max(len,length($2))
}
END {
  
  max=0
  columns=asorti(count,ordered)
  for (i=1;i<=columns;++i) {
    max=Max(max,count[ordered[i]])
  }
  error=0
  for (i=1;i<=columns;++i) {
    if (count[ordered[i]] < max) {
      print "Warning: " ordered[i] " contained only " count[ordered[i]] " entries." > "/dev/stderr"
      error=1
    }
  }
  if (error) print "Short data set values will be reported as \"NaN\"" > "/dev/stderr"
  fmt=" %-" label+1 "s%" len+1 "s"
  for (j=1; j<= max; ++j) {
    for (i=1; i<=columns; ++i) {
      v = (j <= count[ordered[i]]) ? data[ordered[i]][j] : "NaN" 
      printf(fmt, ordered[i], v) > out
    }
    print "" > out
  }
}

Using Mr. Martin's data, that produces:

Code:

$ ./eameri eameri.data 
 901   0.1111111 902  -0.2222221 903  -0.3333331 904  -0.4444441
 901  -0.1111112 902  -0.2222222 903   0.3333332 904   0.4444442
 901  -0.1111113 902  -0.2222223 903  -0.3333333 904  -0.4444443

Note that the code does not require that the data be in any fixed order, and that it reports a non-fatal error if each set is not of the same length.

grail · 05-22-2013, 01:02 PM

Well assuming the order is present, the initial part is relatively easy:

Code:

awk '$1 != x{i=1;x=$1}{line[i] = line[i] (line[i]?FS:"") $0;i++}END{for(j in line)print line[j]}' file

But I would still need more details on the array part??
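
One caveat with the one-liner: awk does not guarantee the order in which `for (j in line)` visits the elements, so with many rows the output lines may come out shuffled. A possible variant that prints by numeric index is sketched below; the wrapper function `join_groups`, the variable `max` and the sample input are hypothetical additions, not part of the command above.

```bash
#!/bin/bash
# Sketch: same grouping idea, but rows are printed by numeric index so the
# order does not depend on how awk traverses the array.
join_groups() {
  awk '
    $1 != x { i = 1; x = $1 }          # first column changed: restart the row counter
    {
      line[i] = line[i] (line[i] ? FS : "") $0
      if (i > max) max = i             # remember the longest group
      i++
    }
    END { for (j = 1; j <= max; j++) print line[j] }
  ' "$@"
}

# Hypothetical sample input: three groups of two rows each.
out=$(printf '%s\n' '901 0.1' '901 0.2' '902 0.3' '902 0.4' '903 0.5' '903 0.6' | join_groups)
echo "$out"
```

Called as `join_groups file` it reads the named file; with no argument it reads standard input, as in the sample call.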

eamesj · 05-23-2013, 03:23 AM

Thanks guys,

Went with grail's awk - much quicker than the sed/paste/awk looping

For the array, I'm doing a count on the file to get the maximum number of columns ($maxfile) and building an array of the odd-numbered columns from 3 to $maxfile (to be the y values in a graph; the x value is column 1). The only way I've seen in bash is to use eval with {A..Z}

Code:

# build array for alternate columns
for i in $(eval echo {3..$maxfile}); do
	if [[ $((i % 2)) != 0 ]]; then
	    oddcol=($i)
		printf "%4d: %s\n" $i ${oddcol[$i]}
	fi
done

The problem with this is that the output array is unpopulated.

How can I populate this array with the index, so that the output looks like this?

Code:

   3:3
   5:5
   7:7
   9:9
  11:11
etc..

grail · 05-23-2013, 04:15 AM

IF I am following (big IF as really not sure), then I have 2 suggestions:

1. Don't use eval as it really is not needed, simply create a standard for loop (see below)

2. Each time you add to the array the index starts at zero and is incremented by one; however, you are calling the array at the same index as the column position. That is, at 3 we add the first value to the array, but it will be at index 0, hence we would need to call ${oddcol[0]} and NOT ${oddcol[3]} (which is the value of 'i' at this point).

So if above is correct:

Code:

j=0
for (( i = 3; i <= maxfile; i+=2 ))
do
    oddcol+=($i)
    printf "%4d: %s\n" $i ${oddcol[j++]}
done
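
If the aim really is for the array index to equal the column number, bash indexed arrays can be sparse, so another option is to assign to `oddcol[i]` directly instead of appending. A minimal sketch, with `maxfile=11` as a made-up value standing in for the count taken from the file:

```bash
#!/bin/bash
maxfile=11          # hypothetical value; in the thread it is derived from the file
declare -a oddcol   # indexed arrays may be sparse, so indices 3, 5, 7, ... are fine

for (( i = 3; i <= maxfile; i += 2 )); do
    oddcol[i]=$i    # index and value are both the column number
done

# ${!oddcol[@]} expands to the indices that are actually set.
for i in "${!oddcol[@]}"; do
    printf '%4d:%s\n' "$i" "${oddcol[i]}"
done
```

With this layout `${oddcol[3]}` is populated, while `${oddcol[0]}` is empty, which is the reverse of the appended version above.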

konsolebox · 05-23-2013, 04:35 AM

Quote:

Originally Posted by grail

Well assuming the order is present, the initial part is relatively easy:

Code:

awk '$1 != x{i=1;x=$1}{line[i] = line[i] (line[i]?FS:"") $0;i++}END{for(j in line)print line[j]}' file

But I would still need more details on the array part??

grail excuse me but i don't see a part where it cycles back in the array after 200 lines?

Quote:

Originally Posted by eamesj

the only way ive seen in bash is to use eval with {A..Z}

I would have a big guess that you haven't seen or didn't bother to check my post at all?