storing 5 chars into an int in C++?

jwn7 · 05-25-2005, 02:08 AM

i'm helping someone with a program here. it's a pretty simple spellcheck program that just uses a hash table for the dictionary and checks a document. we got all that stuff working already. but the point of the program is to use as little ram as possible. the dictionary is huge at over 500 000 words. currently we're just hashing the strings and it works great but it uses a lot of space. a lot more than the profs demo used. he gave a little hint here:

"There is no way you will get close to my ram usage unless you do not use chars or strings. You must combine chars into integral types. For instance, a 32 bit integer will hold 6 lowercase letters with two bits left over. Sean."

right... then someone replied:

"Just out of curiousity, how come you can only fit 6 characters in a 32 bit integer? You only need 4 bits to represent 26 characters. Couldn't you fit 8 characters in an integer?"

no chars are 8 bits ...

"No, you cannot represent 26 characters in 4 bits. Four bits can only hold 16 states (2^4 = 16), which is not even enough enough for one letter. Storing A thru Z in lowercase needs 26 states and the closest thing is 5 bits (2^5 = 32). A char is 1 byte or 8 bits. Thus that leaves 3 bits which are basically unused. Sean's hint refers to getting rid of those 3 wasted bits and compacting them. By using only 5 bits per character, you are able to store 6 characters in a 32 bit integer, wasting only 2 bits per 6 characters instead of wasting 3 bits per each character."

so that all makes sense right?

i understand the concept, but i don't know how to actually do that in c++. i need to take a character, drop the three unused bits, put the 5 into an integer, do this four more times and ignore the lower (or upper?) two bits. then hash this integer into the table. a problem i can see coming up later is words that are not 5 characters long and manging the hashing / looking up. but we'll worry about that after we figure out how to squish them together.

thanks!

ta0kira · 05-25-2005, 03:09 AM

I can tell you how to put 4 chars in an int a relatively simple way:

Code:

	void
CharsIntoInt4(char *cChars, int *iInt)
{
	for (int I = 0; I < 4; I++)
	reinterpret_cast <char*> (iInt) [I] = cChars[I];
}

	void
IntIntoChars4(char *cChars, int *iInt)
{
	for (int I = 0; I < 4; I++)
	cChars[I] = reinterpret_cast <char*> (iInt) [I];
}

Putting 5 chars into an int is a lot more complex though. You will need 6 bits each:

Code:

	unsigned char
Shorten(char cChar)
//Moves letters to 1st 6 bits of character
{
	if (cChar >= 'A' && cChar <= 'Z') return cChar - 'A';
	if (cChar >= 'a' && cChar <= 'z') return cChar - 'a' + 26;
	return 0;
}


	char
Lengthen(unsigned char cChar)
//Moves 1st 6 bits of character back to letters
{
	return (cChar >= 26)? (cChar + 'a') : (cChar + 'A');
}


	void
CharsIntoInt5(char *cChars, int *iInt)
{
	*iInt = 0;

	for (int I = 0; I < 5; I++)
	*iInt |= ( (int) Shorten(cChars[I]) ) << (I * 6);
}

	void
IntIntoChars5(char *cChars, int *iInt)
{
	for (int I = 0; I < 5; I++)
	cChars[I] = Lengthen( (unsigned char) ( (*iInt >> (I * 6)) & 63) );
	//Take the next 5 bits and move it to the 1st 5 bits of the character
}

I'm sorry if the second one doesn't work perfectly; I haven't tested it. I wrote it just now and compiled it with Comeau online (http://www.comeaucomputing.com/tryitout/). Hopefully this gives you some idea of what to do.
ta0kira

jwn7 · 05-25-2005, 03:14 AM

wow

i owe you

thank you so much

even if we can only fit 4 in there it will still be a lot better

ichrispa · 05-25-2005, 03:35 AM

I have an idea, but it will hog processor space (i'm pretty new in c++, so my programming still need a lot of trimming).
If you ask for the return value of a char x[] then it is returned as a short integer casted into an integer. so in theory you can ask for the integer and convert it ro a binary. Next you only need to manipulate the binaries, represented by strings. When you have assembled all the 5 characters into a binary you could reconvert the binary to an integer. Heres what I mean:

get the character as an integer (say 62 if you have an a) -> subtract 62 to mach it with you binary protocol (since you only want 26 characters) -> turn the number 0 into a binary of "00000" (b would thus be "00001" and so on)
get the next character (say c) and turn it into a binary ("00011") -> add the two binaries ("00000"+"00001"="000000001")
Repeat till you got all 5 charcters (result for "chris" would be "000110100010010010011001", I think) -> convert into an integer

The principal method is easy and elemental, we all started our programming days with programs of the "turn an integer into a binary and print it" - type. However that will take up a lot of time and if you are aiming for speed, you are better off with another method.

wi++ · 05-25-2005, 04:17 AM

Have you thought about using Huffman coding for your character data? You have then an extra lookup for a character, but maybe a big spacegain.

jonaskoelker · 05-25-2005, 05:07 AM

good idea, wi++.

however, for stuffing chars into uints, I can't believe no one has suggested this:

Code:

union char32 {
  uint32_t uint;
  struct {
    unsigned c1: 5;
    unsigned c2: 5;
    unsigned c3: 5;
    unsigned c4: 5;
    unsigned c5: 5;
    unsigned unused: 2;
  } chars;
};

union char32 c;
c.chars.c1 = str[0];
c.chars.c2 = str[1];
c.chars.c3 = str[2];
c.chars.c4 = str[3];
c.chars.c5 = str[4];
c.unused = 0; /* or some other consistent scheme. Maybe an xor? */

also, as a compact representation of the huffman tree (though that shouldn't be necessary), consider using a

Code:

struct huff {
  struct huff* left;
  union {
    struct huff* right;
    char c;
  } leaf;
};

invariant: if (left) valid(leaf.right) else valid(leaf.c);

also to investigate: would there be any benefit from huffman-coding blocks of four chars (which would match nicely into the struct huff) instead of just one?

freegianghu · 05-25-2005, 05:56 AM

Quote:

Originally posted by jonaskoelker
good idea, wi++.

however, for stuffing chars into uints, I can't believe no one has suggested this:

Code:

union char32 {
  uint32_t uint;
  struct {
    unsigned c1: 5;
    unsigned c2: 5;
    unsigned c3: 5;
    unsigned c4: 5;
    unsigned c5: 5;
    unsigned unused: 2;
  } chars;
};

Where is c0?

)

Code:

c.chars.c1 = str[0]; 
/*should it be?
(str[0]>='A'&&str[0]<='Z'?str[0]-'A':str[0]>='a'&&str[0]<='z'?str[0]-'a':'#') 
*/

eddiebaby1023 · 05-25-2005, 07:21 AM

Quote:

I can tell you how to put 4 chars in an int a relatively simple way:

Simpler way:

Code:

union stuffer {
    char c[4];
    int i;
}

and write the data to c[0], c[x] ... and read it from i.

All this assumes your ints are 32 bits, of course, which they usually are, but it's not guaranteed by the language; ISTR that the only guarantee C gives (gave?) is that a long will be no shorter than a short. ints are generally supposed to take the "natural" length of the architecture but all the 64 but systems I've worked on had 32 bit ints.

You've also got to consider the endian-ness of the target systems.

ta0kira · 05-25-2005, 07:52 AM

Yes, something I forgot is that in C++ int is defined by the implementation. We've all been assuming int is a long, but for certain systems, such as a 16 bit system, int will be a short. OP: is it specifically int, and do you know if it will always be 32 bit? That makes a small or big difference. You could always use templates to correctly create a union based on the size of int, which I can show you if you want. Also, is the sign for signed chars always bit 7?
ta0kira

jonaskoelker · 05-25-2005, 09:20 AM

freegianghu: okay, my code was broken. I'm truly, utterly sorry for any inconvenience it may have caused

... and why not just use isalpha(str[i])? tolower(str[i]) - 'a': 31; in that way, it's (if only extremely marginally) more portable (*cough* union-hack *cough*), plus the '#' (decimal 35 > 31) won't flow into the next char

anyways, "we kno wad we meen".

--Jonas

jwn7 · 05-25-2005, 12:41 PM

lots of good responses. thanks guys

Quote:

Originally posted by ta0kira
Yes, something I forgot is that in C++ int is defined by the implementation. We've all been assuming int is a long, but for certain systems, such as a 16 bit system, int will be a short. OP: is it specifically int, and do you know if it will always be 32 bit? That makes a small or big difference. You could always use templates to correctly create a union based on the size of int, which I can show you if you want. Also, is the sign for signed chars always bit 7?
ta0kira

i think it's a reasonable assumption for this program to assume that ints are 32 and bit 7 is the signed bit. it's just an x86 machine running GNU/linux. so can i go ahead and use the code jonas posted?

Hko · 05-25-2005, 12:46 PM

Just another thought about this: Why throw away 2 bits every each 6 lowercase letters? After storing 5 times 6 (= 30) of these 5-bit chars, you have thrown away 5 * 2-bits, which could store another 2 of these 5-bit chars.

So to save even more space, leave the idea of storing 6 chars in a 32-bit integer. Instead store 32 chars in a 160-bits block.

Using jonaskoelker's struct-with-bit-fields idea:

Code:

typedef struct {
    unsigned c0: 5;
    unsigned c1: 5;
    unsigned c2: 5;
    unsigned c3: 5;
    unsigned c4: 5;
    unsigned c5: 5;
    unsigned c6: 5;
    unsigned c7: 5;
    unsigned c8: 5;
    unsigned c9: 5;
    unsigned c10: 5;
    unsigned c11: 5;
    unsigned c12: 5;
    unsigned c13: 5;
    unsigned c14: 5;
    unsigned c15: 5;
    unsigned c16: 5;
    unsigned c17: 5;
    unsigned c18: 5;
    unsigned c19: 5;
    unsigned c20: 5;
    unsigned c21: 5;
    unsigned c22: 5;
    unsigned c23: 5;
    unsigned c24: 5;
    unsigned c25: 5;
    unsigned c26: 5;
    unsigned c27: 5;
    unsigned c28: 5;
    unsigned c29: 5;
    unsigned c30: 5;
    unsigned c31: 5;
} charblock;

I wouldn't implement it this way though, but use the bit-wise operators (<< & | ) instead.

jwn7 · 05-25-2005, 12:50 PM

also i have no idea how to actually use this union thing. what part goes in the header file and what goes in the cpp file? how do i create one / use it etc etc???

thanks

Hko · 05-25-2005, 12:55 PM

You don't really need a union to do the compressed storing of chars.
The union makes it possible to refer to the block of 6 chars as an integer.

I'm assuming here the goal was not "storing chars in an integer", but rather "storing as many chars as possible in as little memory as possible".

jwn7 · 05-25-2005, 12:59 PM

yes that's exactly right.

sooo what should i do??