LinuxQuestions.org
Visit Jeremy's Blog.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 05-25-2005, 02:08 AM   #1
jwn7
Member
 
Registered: Aug 2004
Location: pittsburgh, pa
Distribution: gentoo
Posts: 81

Rep: Reputation: 15
storing 5 chars into an int in C++?


i'm helping someone with a program here. it's a pretty simple spellcheck program that just uses a hash table for the dictionary and checks a document. we got all that stuff working already. but the point of the program is to use as little ram as possible. the dictionary is huge at over 500 000 words. currently we're just hashing the strings and it works great but it uses a lot of space. a lot more than the profs demo used. he gave a little hint here:

"There is no way you will get close to my ram usage unless you do not use chars or strings. You must combine chars into integral types. For instance, a 32 bit integer will hold 6 lowercase letters with two bits left over. Sean."

right... then someone replied:

"Just out of curiousity, how come you can only fit 6 characters in a 32 bit integer? You only need 4 bits to represent 26 characters. Couldn't you fit 8 characters in an integer?"

no chars are 8 bits ...

"No, you cannot represent 26 characters in 4 bits. Four bits can only hold 16 states (2^4 = 16), which is not even enough enough for one letter. Storing A thru Z in lowercase needs 26 states and the closest thing is 5 bits (2^5 = 32). A char is 1 byte or 8 bits. Thus that leaves 3 bits which are basically unused. Sean's hint refers to getting rid of those 3 wasted bits and compacting them. By using only 5 bits per character, you are able to store 6 characters in a 32 bit integer, wasting only 2 bits per 6 characters instead of wasting 3 bits per each character."

so that all makes sense right?

i understand the concept, but i don't know how to actually do that in c++. i need to take a character, drop the three unused bits, put the 5 into an integer, do this four more times and ignore the lower (or upper?) two bits. then hash this integer into the table. a problem i can see coming up later is words that are not 5 characters long and manging the hashing / looking up. but we'll worry about that after we figure out how to squish them together.

thanks!
 
Old 05-25-2005, 03:09 AM   #2
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
I can tell you how to put 4 chars in an int a relatively simple way:
Code:
	void
CharsIntoInt4(char *cChars, int *iInt)
{
	for (int I = 0; I < 4; I++)
	reinterpret_cast <char*> (iInt) [I] = cChars[I];
}

	void
IntIntoChars4(char *cChars, int *iInt)
{
	for (int I = 0; I < 4; I++)
	cChars[I] = reinterpret_cast <char*> (iInt) [I];
}
Putting 5 chars into an int is a lot more complex though. You will need 6 bits each:
Code:
	unsigned char
Shorten(char cChar)
//Moves letters to 1st 6 bits of character
{
	if (cChar >= 'A' && cChar <= 'Z') return cChar - 'A';
	if (cChar >= 'a' && cChar <= 'z') return cChar - 'a' + 26;
	return 0;
}


	char
Lengthen(unsigned char cChar)
//Moves 1st 6 bits of character back to letters
{
	return (cChar >= 26)? (cChar + 'a') : (cChar + 'A');
}


	void
CharsIntoInt5(char *cChars, int *iInt)
{
	*iInt = 0;

	for (int I = 0; I < 5; I++)
	*iInt |= ( (int) Shorten(cChars[I]) ) << (I * 6);
}

	void
IntIntoChars5(char *cChars, int *iInt)
{
	for (int I = 0; I < 5; I++)
	cChars[I] = Lengthen( (unsigned char) ( (*iInt >> (I * 6)) & 63) );
	//Take the next 5 bits and move it to the 1st 5 bits of the character
}
I'm sorry if the second one doesn't work perfectly; I haven't tested it. I wrote it just now and compiled it with Comeau online (http://www.comeaucomputing.com/tryitout/). Hopefully this gives you some idea of what to do.
ta0kira

Last edited by ta0kira; 05-25-2005 at 07:42 AM.
 
Old 05-25-2005, 03:14 AM   #3
jwn7
Member
 
Registered: Aug 2004
Location: pittsburgh, pa
Distribution: gentoo
Posts: 81

Original Poster
Rep: Reputation: 15
wow

i owe you

thank you so much even if we can only fit 4 in there it will still be a lot better
 
Old 05-25-2005, 03:35 AM   #4
ichrispa
Member
 
Registered: Mar 2005
Location: Dresden, Germany
Distribution: OpenSuse 11.2/3, Debian 5.0 , Debian 1.3.1, OpenBSD
Posts: 277

Rep: Reputation: 32
I have an idea, but it will hog processor space (i'm pretty new in c++, so my programming still need a lot of trimming).
If you ask for the return value of a char x[] then it is returned as a short integer casted into an integer. so in theory you can ask for the integer and convert it ro a binary. Next you only need to manipulate the binaries, represented by strings. When you have assembled all the 5 characters into a binary you could reconvert the binary to an integer. Heres what I mean:

get the character as an integer (say 62 if you have an a) -> subtract 62 to mach it with you binary protocol (since you only want 26 characters) -> turn the number 0 into a binary of "00000" (b would thus be "00001" and so on)
get the next character (say c) and turn it into a binary ("00011") -> add the two binaries ("00000"+"00001"="000000001")
Repeat till you got all 5 charcters (result for "chris" would be "000110100010010010011001", I think) -> convert into an integer

The principal method is easy and elemental, we all started our programming days with programs of the "turn an integer into a binary and print it" - type. However that will take up a lot of time and if you are aiming for speed, you are better off with another method.
 
Old 05-25-2005, 04:17 AM   #5
wi++
LQ Newbie
 
Registered: May 2005
Posts: 3

Rep: Reputation: 0
Have you thought about using Huffman coding for your character data? You have then an extra lookup for a character, but maybe a big spacegain.
 
Old 05-25-2005, 05:07 AM   #6
jonaskoelker
Senior Member
 
Registered: Jul 2004
Location: Denmark
Distribution: Ubuntu, Debian
Posts: 1,524

Rep: Reputation: 47
good idea, wi++.

however, for stuffing chars into uints, I can't believe no one has suggested this:
Code:
union char32 {
  uint32_t uint;
  struct {
    unsigned c1: 5;
    unsigned c2: 5;
    unsigned c3: 5;
    unsigned c4: 5;
    unsigned c5: 5;
    unsigned unused: 2;
  } chars;
};

union char32 c;
c.chars.c1 = str[0];
c.chars.c2 = str[1];
c.chars.c3 = str[2];
c.chars.c4 = str[3];
c.chars.c5 = str[4];
c.unused = 0; /* or some other consistent scheme. Maybe an xor? */
also, as a compact representation of the huffman tree (though that shouldn't be necessary), consider using a
Code:
struct huff {
  struct huff* left;
  union {
    struct huff* right;
    char c;
  } leaf;
};
invariant: if (left) valid(leaf.right) else valid(leaf.c);

also to investigate: would there be any benefit from huffman-coding blocks of four chars (which would match nicely into the struct huff) instead of just one?
 
Old 05-25-2005, 05:56 AM   #7
freegianghu
Member
 
Registered: Oct 2004
Location: somewhere in the street
Distribution: Window$
Posts: 192

Rep: Reputation: 30
Quote:
Originally posted by jonaskoelker
good idea, wi++.

however, for stuffing chars into uints, I can't believe no one has suggested this:
Code:
union char32 {
  uint32_t uint;
  struct {
    unsigned c1: 5;
    unsigned c2: 5;
    unsigned c3: 5;
    unsigned c4: 5;
    unsigned c5: 5;
    unsigned unused: 2;
  } chars;
};
Where is c0? )

Code:
c.chars.c1 = str[0]; 
/*should it be?
(str[0]>='A'&&str[0]<='Z'?str[0]-'A':str[0]>='a'&&str[0]<='z'?str[0]-'a':'#') 
*/
 
Old 05-25-2005, 07:21 AM   #8
eddiebaby1023
Member
 
Registered: May 2005
Posts: 378

Rep: Reputation: 33
Quote:
I can tell you how to put 4 chars in an int a relatively simple way:
Simpler way:
Code:
union stuffer {
    char c[4];
    int i;
}
and write the data to c[0], c[x] ... and read it from i.

All this assumes your ints are 32 bits, of course, which they usually are, but it's not guaranteed by the language; ISTR that the only guarantee C gives (gave?) is that a long will be no shorter than a short. ints are generally supposed to take the "natural" length of the architecture but all the 64 but systems I've worked on had 32 bit ints.

You've also got to consider the endian-ness of the target systems.

Last edited by eddiebaby1023; 05-25-2005 at 07:23 AM.
 
Old 05-25-2005, 07:52 AM   #9
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Yes, something I forgot is that in C++ int is defined by the implementation. We've all been assuming int is a long, but for certain systems, such as a 16 bit system, int will be a short. OP: is it specifically int, and do you know if it will always be 32 bit? That makes a small or big difference. You could always use templates to correctly create a union based on the size of int, which I can show you if you want. Also, is the sign for signed chars always bit 7?
ta0kira
 
Old 05-25-2005, 09:20 AM   #10
jonaskoelker
Senior Member
 
Registered: Jul 2004
Location: Denmark
Distribution: Ubuntu, Debian
Posts: 1,524

Rep: Reputation: 47
freegianghu: okay, my code was broken. I'm truly, utterly sorry for any inconvenience it may have caused

... and why not just use isalpha(str[i])? tolower(str[i]) - 'a': 31; in that way, it's (if only extremely marginally) more portable (*cough* union-hack *cough*), plus the '#' (decimal 35 > 31) won't flow into the next char

anyways, "we kno wad we meen".

--Jonas
 
Old 05-25-2005, 12:41 PM   #11
jwn7
Member
 
Registered: Aug 2004
Location: pittsburgh, pa
Distribution: gentoo
Posts: 81

Original Poster
Rep: Reputation: 15
lots of good responses. thanks guys

Quote:
Originally posted by ta0kira
Yes, something I forgot is that in C++ int is defined by the implementation. We've all been assuming int is a long, but for certain systems, such as a 16 bit system, int will be a short. OP: is it specifically int, and do you know if it will always be 32 bit? That makes a small or big difference. You could always use templates to correctly create a union based on the size of int, which I can show you if you want. Also, is the sign for signed chars always bit 7?
ta0kira
i think it's a reasonable assumption for this program to assume that ints are 32 and bit 7 is the signed bit. it's just an x86 machine running GNU/linux. so can i go ahead and use the code jonas posted?
 
Old 05-25-2005, 12:46 PM   #12
Hko
Senior Member
 
Registered: Aug 2002
Location: Groningen, The Netherlands
Distribution: Debian
Posts: 2,536

Rep: Reputation: 111Reputation: 111
Just another thought about this: Why throw away 2 bits every each 6 lowercase letters? After storing 5 times 6 (= 30) of these 5-bit chars, you have thrown away 5 * 2-bits, which could store another 2 of these 5-bit chars.

So to save even more space, leave the idea of storing 6 chars in a 32-bit integer. Instead store 32 chars in a 160-bits block.

Using jonaskoelker's struct-with-bit-fields idea:
Code:
typedef struct {
    unsigned c0: 5;
    unsigned c1: 5;
    unsigned c2: 5;
    unsigned c3: 5;
    unsigned c4: 5;
    unsigned c5: 5;
    unsigned c6: 5;
    unsigned c7: 5;
    unsigned c8: 5;
    unsigned c9: 5;
    unsigned c10: 5;
    unsigned c11: 5;
    unsigned c12: 5;
    unsigned c13: 5;
    unsigned c14: 5;
    unsigned c15: 5;
    unsigned c16: 5;
    unsigned c17: 5;
    unsigned c18: 5;
    unsigned c19: 5;
    unsigned c20: 5;
    unsigned c21: 5;
    unsigned c22: 5;
    unsigned c23: 5;
    unsigned c24: 5;
    unsigned c25: 5;
    unsigned c26: 5;
    unsigned c27: 5;
    unsigned c28: 5;
    unsigned c29: 5;
    unsigned c30: 5;
    unsigned c31: 5;
} charblock;
I wouldn't implement it this way though, but use the bit-wise operators (<< & | ) instead.
 
Old 05-25-2005, 12:50 PM   #13
jwn7
Member
 
Registered: Aug 2004
Location: pittsburgh, pa
Distribution: gentoo
Posts: 81

Original Poster
Rep: Reputation: 15
also i have no idea how to actually use this union thing. what part goes in the header file and what goes in the cpp file? how do i create one / use it etc etc???

thanks
 
Old 05-25-2005, 12:55 PM   #14
Hko
Senior Member
 
Registered: Aug 2002
Location: Groningen, The Netherlands
Distribution: Debian
Posts: 2,536

Rep: Reputation: 111Reputation: 111
You don't really need a union to do the compressed storing of chars.
The union makes it possible to refer to the block of 6 chars as an integer.

I'm assuming here the goal was not "storing chars in an integer", but rather "storing as many chars as possible in as little memory as possible".

Last edited by Hko; 05-25-2005 at 12:59 PM.
 
Old 05-25-2005, 12:59 PM   #15
jwn7
Member
 
Registered: Aug 2004
Location: pittsburgh, pa
Distribution: gentoo
Posts: 81

Original Poster
Rep: Reputation: 15
yes that's exactly right.

sooo what should i do??
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
international chars awl Slackware 8 09-01-2005 01:17 PM
invalid types int[int] for array subscript scuzzman Programming 2 11-16-2004 09:34 PM
Storing Strings ToothlessRebel Programming 4 09-05-2004 03:16 AM
chars } and $ won't work.. among others.. apax Linux - Newbie 8 11-26-2003 09:20 AM
telnet and special chars csDraco_ Slackware 7 05-21-2003 09:57 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 09:21 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration