ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
i'm helping someone with a program here. it's a pretty simple spellcheck program that just uses a hash table for the dictionary and checks a document. we got all that stuff working already. but the point of the program is to use as little ram as possible. the dictionary is huge at over 500 000 words. currently we're just hashing the strings and it works great but it uses a lot of space. a lot more than the profs demo used. he gave a little hint here:
"There is no way you will get close to my ram usage unless you do not use chars or strings. You must combine chars into integral types. For instance, a 32 bit integer will hold 6 lowercase letters with two bits left over. Sean."
right... then someone replied:
"Just out of curiousity, how come you can only fit 6 characters in a 32 bit integer? You only need 4 bits to represent 26 characters. Couldn't you fit 8 characters in an integer?"
no chars are 8 bits ...
"No, you cannot represent 26 characters in 4 bits. Four bits can only hold 16 states (2^4 = 16), which is not even enough enough for one letter. Storing A thru Z in lowercase needs 26 states and the closest thing is 5 bits (2^5 = 32). A char is 1 byte or 8 bits. Thus that leaves 3 bits which are basically unused. Sean's hint refers to getting rid of those 3 wasted bits and compacting them. By using only 5 bits per character, you are able to store 6 characters in a 32 bit integer, wasting only 2 bits per 6 characters instead of wasting 3 bits per each character."
so that all makes sense right?
i understand the concept, but i don't know how to actually do that in c++. i need to take a character, drop the three unused bits, put the 5 into an integer, do this four more times and ignore the lower (or upper?) two bits. then hash this integer into the table. a problem i can see coming up later is words that are not 5 characters long and manging the hashing / looking up. but we'll worry about that after we figure out how to squish them together.
I can tell you how to put 4 chars in an int a relatively simple way:
Code:
void
CharsIntoInt4(char *cChars, int *iInt)
{
for (int I = 0; I < 4; I++)
reinterpret_cast <char*> (iInt) [I] = cChars[I];
}
void
IntIntoChars4(char *cChars, int *iInt)
{
for (int I = 0; I < 4; I++)
cChars[I] = reinterpret_cast <char*> (iInt) [I];
}
Putting 5 chars into an int is a lot more complex though. You will need 6 bits each:
Code:
unsigned char
Shorten(char cChar)
//Moves letters to 1st 6 bits of character
{
if (cChar >= 'A' && cChar <= 'Z') return cChar - 'A';
if (cChar >= 'a' && cChar <= 'z') return cChar - 'a' + 26;
return 0;
}
char
Lengthen(unsigned char cChar)
//Moves 1st 6 bits of character back to letters
{
return (cChar >= 26)? (cChar + 'a') : (cChar + 'A');
}
void
CharsIntoInt5(char *cChars, int *iInt)
{
*iInt = 0;
for (int I = 0; I < 5; I++)
*iInt |= ( (int) Shorten(cChars[I]) ) << (I * 6);
}
void
IntIntoChars5(char *cChars, int *iInt)
{
for (int I = 0; I < 5; I++)
cChars[I] = Lengthen( (unsigned char) ( (*iInt >> (I * 6)) & 63) );
//Take the next 5 bits and move it to the 1st 5 bits of the character
}
I'm sorry if the second one doesn't work perfectly; I haven't tested it. I wrote it just now and compiled it with Comeau online (http://www.comeaucomputing.com/tryitout/). Hopefully this gives you some idea of what to do.
ta0kira
I have an idea, but it will hog processor space (i'm pretty new in c++, so my programming still need a lot of trimming).
If you ask for the return value of a char x[] then it is returned as a short integer casted into an integer. so in theory you can ask for the integer and convert it ro a binary. Next you only need to manipulate the binaries, represented by strings. When you have assembled all the 5 characters into a binary you could reconvert the binary to an integer. Heres what I mean:
get the character as an integer (say 62 if you have an a) -> subtract 62 to mach it with you binary protocol (since you only want 26 characters) -> turn the number 0 into a binary of "00000" (b would thus be "00001" and so on)
get the next character (say c) and turn it into a binary ("00011") -> add the two binaries ("00000"+"00001"="000000001")
Repeat till you got all 5 charcters (result for "chris" would be "000110100010010010011001", I think) -> convert into an integer
The principal method is easy and elemental, we all started our programming days with programs of the "turn an integer into a binary and print it" - type. However that will take up a lot of time and if you are aiming for speed, you are better off with another method.
invariant: if (left) valid(leaf.right) else valid(leaf.c);
also to investigate: would there be any benefit from huffman-coding blocks of four chars (which would match nicely into the struct huff) instead of just one?
I can tell you how to put 4 chars in an int a relatively simple way:
Simpler way:
Code:
union stuffer {
char c[4];
int i;
}
and write the data to c[0], c[x] ... and read it from i.
All this assumes your ints are 32 bits, of course, which they usually are, but it's not guaranteed by the language; ISTR that the only guarantee C gives (gave?) is that a long will be no shorter than a short. ints are generally supposed to take the "natural" length of the architecture but all the 64 but systems I've worked on had 32 bit ints.
You've also got to consider the endian-ness of the target systems.
Last edited by eddiebaby1023; 05-25-2005 at 07:23 AM.
Yes, something I forgot is that in C++ int is defined by the implementation. We've all been assuming int is a long, but for certain systems, such as a 16 bit system, int will be a short. OP: is it specifically int, and do you know if it will always be 32 bit? That makes a small or big difference. You could always use templates to correctly create a union based on the size of int, which I can show you if you want. Also, is the sign for signed chars always bit 7?
ta0kira
freegianghu: okay, my code was broken. I'm truly, utterly sorry for any inconvenience it may have caused
... and why not just use isalpha(str[i])? tolower(str[i]) - 'a': 31; in that way, it's (if only extremely marginally) more portable (*cough* union-hack *cough*), plus the '#' (decimal 35 > 31) won't flow into the next char
Originally posted by ta0kira Yes, something I forgot is that in C++ int is defined by the implementation. We've all been assuming int is a long, but for certain systems, such as a 16 bit system, int will be a short. OP: is it specifically int, and do you know if it will always be 32 bit? That makes a small or big difference. You could always use templates to correctly create a union based on the size of int, which I can show you if you want. Also, is the sign for signed chars always bit 7?
ta0kira
i think it's a reasonable assumption for this program to assume that ints are 32 and bit 7 is the signed bit. it's just an x86 machine running GNU/linux. so can i go ahead and use the code jonas posted?
Just another thought about this: Why throw away 2 bits every each 6 lowercase letters? After storing 5 times 6 (= 30) of these 5-bit chars, you have thrown away 5 * 2-bits, which could store another 2 of these 5-bit chars.
So to save even more space, leave the idea of storing 6 chars in a 32-bit integer. Instead store 32 chars in a 160-bits block.
also i have no idea how to actually use this union thing. what part goes in the header file and what goes in the cpp file? how do i create one / use it etc etc???
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.