LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   python list.sort() buggy? (https://www.linuxquestions.org/questions/programming-9/python-list-sort-buggy-642935/)

notnor 05-17-2008 06:30 PM

python list.sort() buggy?
 
this is an excerpt from a sorted list in python:

'504620', '507603', '508727', '51334', '514495', '515436', '519684', '520386'


so apparently it is a bit buggy? it is taken from a big dataset(10K+ elements).

i have converted it into a set and then back into a list to make all elements unique.
could that have something to do with it?

Hko 05-17-2008 08:26 PM

Quote:

Originally Posted by notnor (Post 3156524)
i have converted it into a set and then back into a list to make all elements unique.
could that have something to do with it?

Yes. set's use the hash of the elements. Using hashes behind the scenes makes it fast. On the downside hashing implies messing up the order..

Call sort() after converting to a list. Or, if this will slow your program down too much because you would need to sort much more often, you could keep a set alongside to the list which you use only for checking uniqueness.

Note that dict's also use hashes, and also mess up the order of the dict-keys.

I tried to make a class for such a unique string list. I hardly used it myself and it may need improvement, but it works. It uses a dict to keep uniqueness to make ik compatible with earlier python's, but the idea is the same as using a set alongside the list. Here's the code:
Code:

## public domain ##

class StringItem(str):
    def __init__(self, s):
        str.__init__(self, s)
        self.index = None
        self.refcount = 1       

class UniqStringList:
    '''A indexed list of strings in which no duplicates are stored.
   
    When a string value is added that already exists in the list, it will be
    stored at the index of the first string that had this value.
   
    A reference count is maintained for the string values, so if a certain
    string value is added 3 times, it will be deleted from the list when it is
    removed 3 times.
    '''
    def __init__(self, sequence=None):
        '''Create new UniqStringList object
        UniqStringList -> new empty UniqStringList.
        UniqStringList(sequence) -> new UniqStringList initialized from
        sequence's items.
        '''
        self._dict = {}
        self._list = []
        if sequence is not None:
            for item in sequence:
                self.add(item)

    def __delete(self, item):
        item.refcount -= 1
        idx = item.index
        if item.refcount <= 0:
            del self._dict[item]
            del self._list[idx]
            for s in self._list[idx:]:
                s.index -= 1

    def  __str__(self):
        result = ''
        for s in self:
            result = '%s%s\n' % (result, s)
        return result[:-1]    # remove last newline
   
    def refcount(self, s):
        try:
            if isinstance(s, str): return self._dict[s].refcount
            elif isinstance(s, int): return self._list[s].refcount
            else: raise TypeError
        except  (KeyError, IndexError): return 0
   
    def __getitem__(self, s):
        '''Returns:
        - The index where s is stored if s is a string.
        - The string with index s if s is an integer.
        '''
        if isinstance(s, str): return self._dict[s]
        elif isinstance(s, int): return self._list[s]
        else: raise TypeError
   
    def __delitem__(self, s):
        self.remove(s)       
           
    def __len__(self):
        return len(self._list)
   
    def __iter__(self):
        return self._list.__iter__()
   
    def add(self, s):
        '''Adds a string value to the UniqStringList.       
        Returns: index of the string value where it was stored in the list.
        '''
        item = self._dict.get(s)
        if item == None:  # not in list yet
            item = StringItem(s)
            item.index = len(self._list)
            self._list.append(item)
            self._dict[s] = item
            return item.index
        else:  # value already in list
            item.refcount += 1
            return item.index
           
    def remove(self, s):
        '''Removes a string for the list by decreasing the reference count.
        Only if the reference count reaches zero the string is really deleted.
        s can be:
            a integer -> the string with index s is deleted.
            a string -> the string s is deleted.
        '''
        if isinstance(s, str): self.__delete(self._dict[s])
        elif isinstance(s, int): self.__delete(self._list[s])
        else: raise TypeError
           
    def show(self):
        '''returns string containing the list with index and reference counts'''
        result = ''
        for s in self:
            item = self._dict[s]
            result = '%s%d %s %d\n' % (result, item.index, s, item.refcount)
        return result[:-1]    # remove last newline


notnor 05-17-2008 08:50 PM

the algorithm is:
1. convert to set
2. convert back to list
3. sort

so it is not sorted and then to set and back to list.

so that should sort it perfectly no? is there any other wy to make a list unique?

angrybanana 05-18-2008 12:29 AM

It's not buggy that's sorted in correct ASCII order. If you want a numeric sort you can either convert the strings to integers then sort them, or use "int" as the key when sorting.
Code:

>>> mylist=['504620', '507603', '508727', '51334', '514495', '515436', '519684', '520386']
>>> sorted(mylist)
['504620', '507603', '508727', '51334', '514495', '515436', '519684', '520386']
>>> sorted(mylist, key=int)
['51334', '504620', '507603', '508727', '514495', '515436', '519684', '520386']

There are a few approaches to making a list unique. If you want to preserve the original sort order.

http://aspn.activestate.com/ASPN/Coo...n/Recipe/52560
That link has a few of them.
If you want a simple, easy to understand one.. maybe something like this?
Code:

def deldup(it):
    seen = set()
    out = []
    for x in it:
            if x not in seen:
                    out.append(x)
                    seen.add(x)
    return out

This isn't the most efficient way of doing this but might be the easiest to understand.


All times are GMT -5. The time now is 04:42 PM.