bash 3.0+ script: array_unique(): can this function still be further optimized?

konsolebox · 07-20-2010, 07:59 AM

I'm currently making a set of functions that can be used for common manipulation of arrays. All of them are packaged as array_*().. One of them, array_unique(), was what I tried to optimize yesterday. Here's the code:

Code:

# array/unique.sh
#
# provides array_unique()
#
# author: konsolebox
# copyright free, 2010
#


# bool array_unique (avar <input>, [avar <result>])
#
# Removes duplicate entries from array.  If a result variable is
# specified, changes are saved to that variable, or else the input
# variable gets modified instead.  Indices are preserved.
#
# Please don't specify variables that have same names as the local
# variables below.  All of the local variables are prefixed with __ so
# it wont' really be a namespace problem.
#
# Note: At first I thought that a method like the one presented by the
#       second version will yield faster result since it doesn't
#       reallocate the indices everytime a repeating line is removed.
#       That could have been the case but as tests shows, the first
#       version yielded faster performance in most cases.  The second
#       version is only faster when there are -very- many similarities
#       in many lines.
#
#       It appears that the overhead of having manier instructions is
#       greater than the overhead of recollecting indices.  The second
#       version could be just faster if its lines can be reduced.  It
#       can perhaps also be faster if implemented in languages other
#       than bash.
#
if true; then
	function array_unique {
		case "$#" in
		1)
			eval "
				local -a __T=(\"\${!$1[@]}\") || return
				local -i __I=0 __J __C=\${#__T[@]} __D=0
				for (( ; __I < __C; ++__I )); do
					for (( __J = __I + 1; __J < __C; ++__J )); do
						[[ \${$1[__T[__I]]} = \"\${$1[__T[__J]]}\" ]] && {
							unset $1\\[__T\\[__J\\]\\] __T\\[__J\\]
							(( ++__D ))
						}
					done
					[[ __D -gt 0 ]] && {
						__T=(\"\${__T[@]:__I + 1}\")
						(( __C -= __D + __I + 1, __I = -1, __D = 0 ))
					}
				done
				return 0
			"
			;;
		2)
			eval "
				local -a __T=(\"\${!$1[@]}\") || return
				local -i __I=0 __J __C=\${#__T[@]} __D=0
				$2=() || return
				for (( ; __I < __C; ++__I )); do
					for (( __J = __I + 1; __J < __C; ++__J )); do
						[[ \${$1[__T[__I]]} = \"\${$1[__T[__J]]}\" ]] && {
							unset __T\\[__J\\]
							(( ++__D ))
						}
					done
					__J=__T[__I]
					$2[__J]=\${$1[__J]} || return
					[[ __D -gt 0 ]] && {
						__T=(\"\${__T[@]:__I + 1}\")
						(( __C -= __D + __I + 1, __I = -1, __D = 0 ))
					}
				done
				return 0
			"
			;;
#		*)
#			echo "invalid number of arguments."
#			exit 1
#			;;
		esac
		return 1
	}
else
	function array_unique {
		case "$#" in
		1)
			eval "
				local -i -a __T=(\"\${!$1[@]}\") __U=() || return
				local -i __I=0 __J __C=\${#__T[@]} __P
				for (( ; __I < __C; __I += __U[__I] + 1 )); do
					for (( __P = __I, __J = __I + __U[__I] + 1; __J < __C; __J += __U[__J] + 1 )); do
						[[ \${$1[__T[__I]]} = \"\${$1[__T[__J]]}\" ]] && {
							(( __U[__P] += __U[__J] + 1 ))
							unset $1\\[__T\\[__J\\]\\]
							continue
						}
						(( __P = __J ))
					done
				done
				return 0
			"
			;;
		2)
			eval "
				local -i -a __T=(\"\${!$1[@]}\") __U=() || return
				local -i __I=0 __J __C=\${#__T[@]} __P
				$2=() || return
				for (( ; __I < __C; __I += __U[__I] + 1 )); do
					__J=__T[__I]
					$2[__J]=\${$1[__J]} || return
					for (( __P = __I, __J = __I + __U[__I] + 1; __J < __C; __J += __U[__J] + 1 )); do
						[[ \${$1[\${__T[__I]}]} = \"\${$1[__T[__J]]}\" ]] && {
							(( __U[__P] += __U[__J] + 1 ))
							continue
						}
						(( __P = __J ))
					done
				done
				return 0
			"
			;;
#		*)
#			echo "invalid number of arguments."
#			exit 1
#			;;
		esac
		return 1
	}
fi

Sample test script:

Code:

source unique.sh

N=10  # The heaviest value I tried was probably N=1000 Z=1000
Z=10

declare -a -i A=()
for (( I = 0; I <= N; ++I )); do
	(( A[I] = RANDOM % Z ))
done
unset "A[$(( RANDOM % N ))]" "A[$(( RANDOM % N ))]" "A[$(( RANDOM % N ))]" "A[$(( RANDOM % N ))]"
echo "array_unique A"
echo "before: $(set | grep ^A=)"
time array_unique A
echo "after:  $(set | grep ^A=)"
echo

declare -a -i A=()
for (( I = 0; I <= N; ++I )); do
	(( A[I] = RANDOM % Z ))
done
unset "A[$(( RANDOM % N ))]" "A[$(( RANDOM % N ))]" "A[$(( RANDOM % N ))]" "A[$(( RANDOM % N ))]"
echo "array_unique A B"
echo "before: $(set | grep ^A=)"
time array_unique A B
echo "after:  $(set | grep ^B=)"

Can anyone further optimize this function? Please don't tell me that I'm just wasting my time trying to optimize this again

. Thanks.

grail · 07-20-2010, 08:55 AM

Well I am not sure about optimization because I stopped after the first statement:

Quote:

if true; then

Would not the first part of optimization be to throw away the useless 'else' clause??

ntubski · 07-20-2010, 11:16 AM

Find the best algorithm first, you're using quadratic time solutions.

Code:

    function array_unique {
        local -A __U
        eval "for n in \"\${$1[@]}\" ; do
            __U[\$n]=1
        done"
        case "$#" in
            1) eval "$1=( \${!__U[@]} )";;
            2) eval "$2=( \${!__U[@]} )";;
            *) return 1;;
        esac
    }

EDIT:
If you allow external programs, this one is faster

Code:

    function array_unique {
        local __OIFS="$IFS"
        local __s='eval "echo \"\${$1[*]}\"" | sort -u'
        case "$#" in
            1) IFS=$'\n'; eval "$1=( \$($__s) )";;
            2) IFS=$'\n'; eval "$2=( \$($__s) )";;
            *) return 1;;
        esac
        IFS="$__OIFS"
    }

konsolebox · 07-20-2010, 09:57 PM

Quote:

Originally Posted by grail

Well I am not sure about optimization because I stopped after the first statement:

Would not the first part of optimization be to throw away the useless 'else' clause??

Thanks for the rep.. Actually that's only for load time; after-all functions are only loaded and parsed once. They're converted to a more optimized form afterwards. What's important is really in the loop; it's what I'm trying to change.

Quote:

Originally Posted by ntubski

Find the best algorithm first, you're using quadratic time solutions.

Code:

    function array_unique {
        local -A __U
        eval "for n in \"\${$1[@]}\" ; do
            __U[\$n]=1
        done"
        case "$#" in
            1) eval "$1=( \${!__U[@]} )";;
            2) eval "$2=( \${!__U[@]} )";;
            *) return 1;;
        esac
    }

Looks like it's only for associative arrays? But thanks. Maybe using that idea, a faster version can be implement in bash 4.0+. I can create another version that will be automatically used if bash is 4.0+.

.... The only possible overhead I can see with that is the reallocation of values to keys in flag associative arrays; but still it can be better.

Quote:

EDIT:
If you allow external programs, this one is faster

Code:

    function array_unique {
        local __OIFS="$IFS"
        local __s='eval "echo \"\${$1[*]}\"" | sort -u'
        case "$#" in
            1) IFS=$'\n'; eval "$1=( \$($__s) )";;
            2) IFS=$'\n'; eval "$2=( \$($__s) )";;
            *) return 1;;
        esac
        IFS="$__OIFS"
    }

Could be but there's the dependency.. and should it really be faster since it makes many subprocess calls? .. parent->(subshell echo, parent->(exec(sort)))
in both steps (echo and $() like call), a large memory is allocated for the storage of string/output.

Thanks again. I'll try to apply the first.

konsolebox · 07-21-2010, 03:08 AM

As expected the version that uses associative arrays is the fastest. Thanks again ntubski. You did it again

.

I also found something that makes the functions really slow: unset. It appears that builtin calls are really slow compared to function-embedded methods. It also appears that unsetting an element in an array also takes much time. So to make me not use unset, I guess I'll also have to try a variation whereby a temporary array variable will be used for storing unique elements before copying them to the input array. After running tests I hope on my next post it will already be the final form of the script. I'll wait some days just to make sure that the script gets 100.00001% stable. Of course new ideas are always welcome until then. Thanks.

Code:

function array_unique_a {
	case "$#" in
	1)
		[[ $1 =~ ^[[:alpha:]_]+[[:alnum:]_]*$ ]] || return
		local -A __F=(); local -i __I; local __A
		eval "
			for __I in \"\${!$1[@]}\"; do
				__A=\${$1[__I]}
				[[ -n \${__F[\$__A]} ]] && {
					unset $1\\[__I\\]
					continue
				}
				__F[\$__A]=.
			done
		"
		;;
	2)
		[[ $1 =~ ^[[:alpha:]_]+[[:alnum:]_]*$ && $2 =~ ^[[:alpha:]_]+[[:alnum:]_]*$ ]] || return
		local -A __F=(); local -i __I; local __A
		eval "
			$2=() || return
			for __I in \"\${!$1[@]}\"; do
				__A=\${$1[__I]}
				[[ -n \${__F[\$__A]} ]] && continue
				$2[__I]=\$__A
				__F[\$__A]=.
			done
		"
		;;
#	*)
#		echo "invalid number of arguments."
#		exit 1
#		;;
	esac
}
function array_unique_b {
	case "$#" in
	1)
		eval "
			local -a __T=(\"\${!$1[@]}\") || return
			local -i __I=0 __J __C=\${#__T[@]} __D=0
			for (( ; __I < __C; ++__I )); do
				for (( __J = __I + 1; __J < __C; ++__J )); do
					[[ \${$1[__T[__I]]} = \"\${$1[__T[__J]]}\" ]] && {
						unset $1\\[__T\\[__J\\]\\] __T\\[__J\\]
						(( ++__D ))
					}
				done
				[[ __D -gt 0 ]] && {
					__T=(\"\${__T[@]:__I + 1}\")
					(( __C -= __D + __I + 1, __I = -1, __D = 0 ))
				}
			done
			return 0
		"
		;;
	2)
		eval "
			local -a __T=(\"\${!$1[@]}\") || return
			local -i __I=0 __J __C=\${#__T[@]} __D=0
			$2=() || return
			for (( ; __I < __C; ++__I )); do
				for (( __J = __I + 1; __J < __C; ++__J )); do
					[[ \${$1[__T[__I]]} = \"\${$1[__T[__J]]}\" ]] && {
						unset __T\\[__J\\]
						(( ++__D ))
					}
				done
				__J=__T[__I]
				$2[__J]=\${$1[__J]} || return
				[[ __D -gt 0 ]] && {
					__T=(\"\${__T[@]:__I + 1}\")
					(( __C -= __D + __I + 1, __I = -1, __D = 0 ))
				}
			done
			return 0
		"
		;;
#	*)
#		echo "invalid number of arguments."
#		exit 1
#		;;
	esac
	return 1
}
function array_unique_c {
	case "$#" in
	1)
		eval "
			local -i -a __T=(\"\${!$1[@]}\") __U=() || return
			local -i __I=0 __J __C=\${#__T[@]} __P
			for (( ; __I < __C; __I += __U[__I] + 1 )); do
				for (( __P = __I, __J = __I + __U[__I] + 1; __J < __C; __J += __U[__J] + 1 )); do
					[[ \${$1[__T[__I]]} = \"\${$1[__T[__J]]}\" ]] && {
						(( __U[__P] += __U[__J] + 1 ))
						unset $1\\[__T\\[__J\\]\\]
						continue
					}
					(( __P = __J ))
				done
			done
			return 0
		"
		;;
	2)
		eval "
			local -i -a __T=(\"\${!$1[@]}\") __U=() || return
			local -i __I=0 __J __C=\${#__T[@]} __P
			$2=() || return
			for (( ; __I < __C; __I += __U[__I] + 1 )); do
				__J=__T[__I]
				$2[__J]=\${$1[__J]} || return
				for (( __P = __I, __J = __I + __U[__I] + 1; __J < __C; __J += __U[__J] + 1 )); do
					[[ \${$1[\${__T[__I]}]} = \"\${$1[__T[__J]]}\" ]] && {
						(( __U[__P] += __U[__J] + 1 ))
						continue
					}
					(( __P = __J ))
				done
			done
			return 0
		"
		;;
#	*)
#		echo "invalid number of arguments."
#		exit 1
#		;;
	esac
	return 1
}

test script

Code:

source unique.sh

function array_copy {
	local -i __I
	eval "$2=() && for __I in \${!$1[@]}; do $2[__I]=\${$1[__I]}; done"
}

[[ -z $N ]] && N=10000
[[ -z $Z ]] && Z=100
[[ -z $U ]] && U=500

declare -a -i A=() B=()

for (( I = 0; I <= N; ++I )); do
	(( A[I] = RANDOM % Z ))
done

for (( I = U; I; --I )); do
	unset "A[$(( RANDOM % N ))]"
done

array_copy A C

echo "N = $N, Z =$Z, U = $U"
echo
echo "orignal data count: ${#A[@]}"
echo

echo '----- array_unique_a -----'
echo
echo "array_unique_a A"
echo
time array_unique_a A
echo
echo "after:  $(set | grep ^A=)"
echo
echo "array_unique_a A B"
echo
time array_unique_a A B
echo
echo "after:  $(set | grep ^B=)"
echo

array_copy C A
B=()
echo '----- array_unique_b -----'
echo
echo "array_unique_b A"
echo
time array_unique_b A
echo
echo "after:  $(set | grep ^A=)"
echo
echo "array_unique_b A B"
echo
time array_unique_b A B
echo
echo "after:  $(set | grep ^B=)"
echo

array_copy C A
B=()
echo '----- array_unique_c -----'
echo
echo "array_unique_c A"
echo
time array_unique_c A
echo
echo "after:  $(set | grep ^A=)"
echo
echo "array_unique_c A B"
echo
time array_unique_c A B
echo
echo "after:  $(set | grep ^B=)"
echo

result for first test

Code:

N = 10000, Z =100, U = 500

orignal data count: 9517

----- array_unique_a -----

array_unique_a A


real	0m0.597s
user	0m0.587s
sys	0m0.010s

after:  A=([0]="87" [1]="73" [2]="47" ...) (same)

array_unique_a A B


real	0m0.006s
user	0m0.004s
sys	0m0.002s

after:  B=([0]="87" [1]="73" [2]="47" ...) (same)

----- array_unique_b -----

array_unique_b A


real	2m27.084s
user	0m2.221s
sys	2m24.828s

after:  A=([0]="87" [1]="73" [2]="47" ...) (same)

array_unique_b A B


real	0m0.295s
user	0m0.572s
sys	0m0.000s

after:  B=([0]="87" [1]="73" [2]="47" ...) (same)

----- array_unique_c -----

array_unique_c A


real	3m22.593s
user	0m3.387s
sys	3m18.185s

after:  A=([0]="87" [1]="73" [2]="47" ...) (same)

array_unique_c A B


real	0m0.458s
user	0m0.890s
sys	0m0.000s

after:  B=([0]="87" [1]="73" [2]="47" ...) (same)

2nd test

Code:

N = 10000, Z =5, U = 500

orignal data count: 9515

----- array_unique_a -----

array_unique_a A


real	0m0.585s
user	0m0.582s
sys	0m0.004s

after:  A=([0]="0" [2]="2" [3]="1" [4]="3" [5]="4")

array_unique_a A B


real	0m0.001s
user	0m0.001s
sys	0m0.000s

after:  B=([0]="0" [2]="2" [3]="1" [4]="3" [5]="4")

----- array_unique_b -----

array_unique_b A


real	0m9.102s
user	0m9.011s
sys	0m0.086s

after:  A=([0]="0" [2]="2" [3]="1" [4]="3" [5]="4")

array_unique_b A B


real	0m0.005s
user	0m0.004s
sys	0m0.000s

after:  B=([0]="0" [2]="2" [3]="1" [4]="3" [5]="4")

----- array_unique_c -----

array_unique_c A


real	0m14.186s
user	0m14.102s
sys	0m0.081s

after:  A=([0]="0" [2]="2" [3]="1" [4]="3" [5]="4")

array_unique_c A B


real	0m0.006s
user	0m0.006s
sys	0m0.000s

after:  B=([0]="0" [2]="2" [3]="1" [4]="3" [5]="4")

ntubski · 07-21-2010, 08:48 AM

Quote:

Originally Posted by konsolebox

Could be but there's the dependency.. and should it really be faster since it makes many subprocess calls? .. parent->(subshell echo, parent->(exec(sort)))
in both steps (echo and $() like call), a large memory is allocated for the storage of string/output.

It tested fastest for me. If you want pure bash, it's probably worth writing sort in bash: that still gives an O(n log n) algorithm vs. O(n^2), so it should be faster for large N.

EDIT:

Code:

real	3m22.593s
user	0m3.387s
sys	3m18.185s

It's strange that there is so much sys time used, maybe you started swapping?

konsolebox · 07-21-2010, 10:06 AM

Quote:

Originally Posted by ntubski

It tested fastest for me. If you want pure bash, it's probably worth writing sort in bash: that still gives an O(n log n) algorithm vs. O(n^2), so it should be faster for large N.

I already have array_sort() but it's implemented differently with array_unique(). How can that make things faster? Also does it save the indices?

Quote:

EDIT:

Code:

real	3m22.593s
user	0m3.387s
sys	3m18.185s

It's strange that there is so much sys time used, maybe you started swapping?

Not sure but why not (array_unique A B)?. Could it be that unsetting an element costs too much memory?.. I'll check it later.

ntubski · 07-21-2010, 10:59 AM

Quote:

Originally Posted by konsolebox

I already have array_sort() but it's implemented differently with array_unique(). How can that make things faster? Also does it save the indices?

Your solutions with the nested loops go through each element N times, so commands within the nested loop are run N*N=N^2 times. There are sorting algorithms that take time proportional to NlogN which is much less than N^2 for large N.

Do you mean array_sort() uses array_unique()? I don't understand what you mean by "save the indices".

Quote:

Not sure but why not (array_unique A B)?. Could it be that unsetting an element costs too much memory?.. I'll check it later.

Only the user time represents CPU cycles, so it seems strange that the majority of the time it takes is sys time. Even if unsetting costs a lot of memory, I don't think it should cost that much.

konsolebox · 07-21-2010, 11:57 PM

Quote:

Do you mean array_sort() uses array_unique()? I don't understand what you mean by "save the indices".

I'm sorry that I put it the wrong way. I use a different algorithm for array_sort() and it doesn't run with array_unique().

Here's the previous code that's not yet array_sort(). I didn't bring the new one so I just took it from playshell in sf.

Code:

include log.sh
include patterns.sh
include utils/copyarray.sh


#
# utils/sortarray.sh
#
# portable utility that provides a function that sorts an array
#
# author: konsolebox
# copyright free
# created and updated 2008-2010
#


# Credits have got to be given to the authors of the book "A Book on C 4th Ed."
# for this great algorithm.  The algorithm was originally described by someone
# and was completely explained in the book with an implimentation that's written
# in C.
#
# I knew C from many sources but most of what I learned came from this book and
# I therefore recommend it to starters for a good start and also to experienced
# programmers for a good reference and new methods that they may discover from
# it.


# void utils_sortarray
# (
# 	["from=<__ARRAY>"],
# 	["type=<string|integer>"],
# 	["to=<__ARRAY>"],
# 	["as=<values|indices>"],
# 	["--" [ SOURCEVALUES[@] ]]
# )
#
function utils_sortarray {
	log_fcall "$@"

	[[ $# -eq 0 ]] && return

	local FROM TYPE TO AS
	local -a __ARRAY
	local -a -i POINTERS

	while [[ $# -gt 0 ]]; do
		case "$1" in
		from=*)
			FROM=${1#from=}
			;;
		type=*)
			TYPE=${1#type=}
			;;
		to=*)
			TO=${1#to=}
			;;
		as=*)
			AS=${1#as=}
			;;
		--)
			shift
			break
			;;
		#beginsyntaxcheckblock
		*)
			log_criticalerror "unknown parameter: $1"
			;;
		#endsyntaxcheckblock
		esac
		shift
	done

	#beginsyntaxcheckblock
	[[ -n $FROM && $FROM != $PATTERNS_ARRAYVARNAME ]] && \
		log_criticalerror "variable name not valid for the source array: $FROM"
	[[ -n $TYPE && $TYPE != @(string|integer) ]] && \
		log_criticalerror "argument is not valid for type: $TYPE"
	[[ -n $TO && $TO != $PATTERNS_ARRAYVARNAME ]] && \
		log_criticalerror "variable name not valid for the target array: $TO"
	[[ -n $AS && $AS != @(values|indices) ]] && \
		log_criticalerror "argument is not valid for as: $AS"
	[[ -z $FROM && $# -eq 0 ]] && \
		log_criticalerror "a source should be specified either by 'from=<__ARRAY>' or '-- CONTENTS[@]'"
	#endsyntaxcheckblock

	if [[ $# -gt 0 ]]; then
		__ARRAY=("$@")
	elif [[ -n $FROM ]]; then
		utils_copyarray "$FROM" __ARRAY
	fi

	[[ -z $TYPE ]] && TYPE='string'
	[[ -z $TO ]] && TO='__'
	[[ -z $AS ]] && AS='values'

	POINTERS=(${!__ARRAY[@]})

	if [[ ${#POINTERS[@]} -gt 1 ]]; then
		case "$TYPE" in
		string)
			utils_sortarray_quicksortforstrings 0 $(( ${#POINTERS[@]} - 1 ))
			;;
		integer)
			utils_sortarray_quicksortforintegers 0 $(( ${#POINTERS[@]} - 1 ))
			;;
		esac
	fi

	case "$AS" in
	values)
		local -i I J=0
		unset "${TO}[*]"
		eval "for I in \"\${POINTERS[@]}\"; do ${TO}[J++]=\${__ARRAY[I]}; done"
		;;
	indices)
		eval "$TO=(\"\${POINTERS[@]}\")"
		;;
	esac
}


# void utils_sortarray_quicksortforstrings (uint LEFT, uint RIGHT)
#
function utils_sortarray_quicksortforstrings {
	[[ $1 -lt $2 ]] || return

	local -i LEFT=$1 RIGHT=$2 PIVOT PARTITION

	if utils_sortarray_quicksortforstrings_findpivot; then
		utils_sortarray_quicksortforstrings_partition
		utils_sortarray_quicksortforstrings $LEFT $(( PARTITION - 1 ))
		utils_sortarray_quicksortforstrings $PARTITION $RIGHT
	fi
}


# bool utils_sortarray_quicksortforstrings_findpivot (void)
#
function utils_sortarray_quicksortforstrings_findpivot {
	local -i A B C P MIDDLE

	(( MIDDLE = LEFT + (RIGHT - LEFT) / 2 ))

	(( A = POINTERS[LEFT], B = POINTERS[MIDDLE], C = POINTERS[RIGHT] ))

	[[ ${__ARRAY[A]} > ${__ARRAY[B]} ]] && (( A = $B, B = $A ))
	[[ ${__ARRAY[A]} > ${__ARRAY[C]} ]] && (( A = $C, C = $A ))
	[[ ${__ARRAY[B]} > ${__ARRAY[C]} ]] && (( B = $C, C = $B ))

	if [[ ${__ARRAY[A]} < ${__ARRAY[B]} ]]; then
		PIVOT=$B
		return 0
	fi

	if [[ ${__ARRAY[B]} < ${__ARRAY[C]} ]]; then
		PIVOT=$C
		return 0
	fi

	for (( P = LEFT + 1; P < MIDDLE; ++P )); do
		if [[ ${__ARRAY[P]} > ${__ARRAY[A]} ]]; then
			PIVOT=$P
			return 0
		fi

		if [[ ${__ARRAY[P]} < ${__ARRAY[A]} ]]; then
			PIVOT=$A
			return 0
		fi
	done

	for (( P = MIDDLE + 1; P < RIGHT; ++P )); do
		if [[ ${__ARRAY[P]} > ${__ARRAY[A]} ]]; then
			PIVOT=$P
			return 0
		fi

		if [[ ${__ARRAY[P]} < ${__ARRAY[A]} ]]; then
			PIVOT=$A
			return 0
		fi
	done

	return 1
}


# void utils_sortarray_quicksortforstrings_partition (void)
#
function utils_sortarray_quicksortforstrings_partition {
	local -i L R T
	local P=${__ARRAY[PIVOT]}

	for (( L = LEFT, R = RIGHT; L <= R; )); do
		while [[ ${__ARRAY[POINTERS[L]]} < $P ]]; do
			(( ++L ))
		done

		until [[ ${__ARRAY[POINTERS[R]]} < $P ]]; do
			(( --R ))
		done

		[[ L -lt R ]] && (( T = POINTERS[L], POINTERS[L] = POINTERS[R], POINTERS[R] = T, ++L, --R ))
	done

	(( PARTITION = L ))
}


# void utils_sortarray_quicksortforintegers (uint LEFT, uint RIGHT)
#
function utils_sortarray_quicksortforintegers {
	[[ $1 -lt $2 ]] || return

	local -i LEFT=$1 RIGHT=$2 PIVOT PARTITION

	if utils_sortarray_quicksortforintegers_findpivot; then
		utils_sortarray_quicksortforintegers_partition
		utils_sortarray_quicksortforintegers $LEFT $(( PARTITION - 1 ))
		utils_sortarray_quicksortforintegers $PARTITION $RIGHT
	fi
}


# bool utils_sortarray_quicksortforintegers_findpivot (void)
#
function utils_sortarray_quicksortforintegers_findpivot {
	local -i A B C P MIDDLE

	(( MIDDLE = LEFT + (RIGHT - LEFT) / 2 ))

	(( A = POINTERS[LEFT], B = POINTERS[MIDDLE], C = POINTERS[RIGHT] ))

	[[ __ARRAY[A] -gt __ARRAY[B] ]] && (( A = $B, B = $A ))
	[[ __ARRAY[A] -gt __ARRAY[C] ]] && (( A = $C, C = $A ))
	[[ __ARRAY[B] -gt __ARRAY[C] ]] && (( B = $C, C = $B ))

	if [[ __ARRAY[A] -lt __ARRAY[B] ]]; then
		PIVOT=$B
		return 0
	fi

	if [[ __ARRAY[B] -lt __ARRAY[C] ]]; then
		PIVOT=$C
		return 0
	fi

	for (( P = LEFT + 1; P < MIDDLE; ++P )); do
		if [[ __ARRAY[P] -gt __ARRAY[A] ]]; then
			PIVOT=$P
			return 0
		fi

		if [[ __ARRAY[P] -lt __ARRAY[A] ]]; then
			PIVOT=$A
			return 0
		fi
	done

	for (( P = MIDDLE + 1; P < RIGHT; ++P )); do
		if [[ __ARRAY[P] -gt __ARRAY[A] ]]; then
			PIVOT=$P
			return 0
		fi

		if [[ __ARRAY[P] -lt __ARRAY[A] ]]; then
			PIVOT=$A
			return 0
		fi
	done

	return 1
}


# void utils_sortarray_quicksortforintegers_partition (void)
#
function utils_sortarray_quicksortforintegers_partition {
	local -i L R T P

	for (( L = LEFT, R = RIGHT, P = __ARRAY[PIVOT]; L <= R; )); do
		for (( ; __ARRAY[POINTERS[L]] < P; ++L )); do
			continue
		done

		for (( ; __ARRAY[POINTERS[R]] >= P; --R )); do
			continue
		done

		[[ L -lt R ]] && (( T = POINTERS[L], POINTERS[L] = POINTERS[R], POINTERS[R] = T, ++L, --R ))
	done

	(( PARTITION = L ))
}

Quote:

Originally Posted by ntubski

Your solutions with the nested loops go through each element N times, so commands within the nested loop are run N*N=N^2 times. There are sorting algorithms that take time proportional to NlogN which is much less than N^2 for large N.

Perhaps I can also consider this but I actually planned to make the functions not change the indices or save the element's current positions.

For example if we have an array like

Code:

A=([1]="A" [2]="B" [3]="A" [4]="B")

We can have them processed into two outputs.

One is

Code:

A=([1]="A" [2]="B")

and the other is

Code:

A=([0]="A" [1]="B")

And what I'm after is just the first.
If we do something like:

Code:

B=("${A[@]}"])

The indices of B will always be continuous from 0 no matter what the index/key structure of A is.

Quote:

Only the user time represents CPU cycles, so it seems strange that the majority of the time it takes is sys time. Even if unsetting costs a lot of memory, I don't think it should cost that much.

I just found out that it's naturally slow since there are many elements from the original but why it takes too much time on sys and not on user, that I don't know. Perhaps bash uses a dynamic method for handling arrays like everytime an element is removed, the whole map of the array is updated or changed. This might be for the sake of saving memory but if it's the case, it could be just the reason why removing an element is slow. For the reason why it's system time that's used, maybe because there are calls to malloc() and free(); that's casted not only for the element that's to be removed but on other related objects as well..? If it's already internal then I guess I can no longer do anything much about it.

I already made variations of array_unique_a() where I try to make a temporary copy first but the result was just a slower function. There was also a version were I just save the indices that will be deleted and then cast unset with them in one shot but it's also slower.

Anyhow I guess the last version I posted is already the fastest though I'm still hoping that there's a better way to do it. If only indices are mapped in a different way in bash (just like associative array), but perhaps it's really not that easy to make such an implementation knowing that nothingness is allowed in between elements in indexed arrays. Or maybe associative array like manipulation for indexed arrays will just be expensive.. don't know.