ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
I'm trying to implement the Huffman algorithm on python. It's like counting the characters of a file, finding their probability, and after that the characters with higher probability get a shorter binary code, and the characters with lower probability - a bigger one.
I've written the functions that form the code for each character, but I cannot write the code in the file.
Python 2 supports writing binary only as hex or 8-base digits, but I need to write binary with variable length.
I tried to experiment with the third python. Here are the results I got:
As much as I understand, every character of the string is like transformed to binary, and after that is written into the file. But anyway, as you can see, after writing, the file remains empty.
So, how should I write the code into the file?
Thank you very much for your replies, but anyway I didn't get the results I wanted. I've opened the file with a hexeditor and I get that the characters simply were recoded into binary.
Here is the file written with python in hexedit:
Probably more efficient, and better (arguably more the way python want you to write binary data):
Code:
#!/usr/bin/env python
import array
# An python array is like a restricted python list
# for storing binary data.
#
data = array.array('B') # create array of bytes.
data.append(int('1000001', 2)) # binary for ascii 'A'
data.append(int('1000010', 2)) # binary for ascii 'B'
data.append(int('1000011', 2)) # binary for ascii 'C'
data.append(int('11111111', 2)) # 255, 0xFF
print data
# Write the array at once to a file
#
f = file('/tmp/data.bin', 'wb')
data.tofile(f)
f.close()
Distribution: slackware64 13.37 and -current, Dragonfly BSD
Posts: 1,810
Rep:
Quote:
Hko, thank you very much. But what if the number of bits are not enough for to complect an entire byte??
This is why Huffman encoding using this method is nonsense. All symbols will occupy a minimum of one byte and so no gains are had. To program various sybmbol lengths and their prefix codes requires a more complex algorithm than just representing all codes in a fixed number of bits.
Hko,yes
Thank you once again, you've helped me very much
bgeddy, not really. The point is that if I would write this file with binary, I can lose maximum 1 byte and that's not a really big loss. It's not like writing every character as a number. It's concatenating the binary string for all the characters and after that writing it in the file. The most I fear that these spare bits could ruin my decoding algorithm, because I cannot find out how many 0s do I have to "clear" in order to decode correctly the first letter.
True, so you obviously need fill them all, possbily except the last byte.
Then only the last byte may contain 7 (worst case) bits. But there's no way around that anyway.
Creating a string of '0' and '1' characters first, and then converting those to (real binary) bytes is obviously a unneeded detour using more memory an CPU than necessary. But IMO it's OK to start like that, easier to debug and see what is going on. You can alway later optimize the character strings away.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.