triple-channel memory architecture

cuizehan · 05-29-2011, 09:02 PM

Quote:

Originally Posted by Skaperen

I thought I'd toss out possible interleaving schemes just so you'd have an idea what kinds of things I was thinking about. In these descriptions I will assume the access word size is 64 bits. That means the low-order 3 bits of the byte address are not used. The address bits above those would be used in the descriptions below.

Any interleave involves some kind of address transform from the address the CPU accesses, to end up with the addresses requested over the memory channels, and which memory channel is used for that request.

A basic 2-way interleave would use the lowest address bit to choose which channel to access. The address bits above that would be passed to the memory devices to select the word to access. This ensures that a sequential access of memory alternates between the channels in a two channel system. If both channels are actually fetched in parallel when either word is accessed (and when there is a cache miss), then both channels are loaded into cache. So the 2nd word of that sequential memory READ will see a cache hit. Writes would just do a write-through.
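The bit-routing idea described above can be sketched in a few lines of Python. This is only an illustration under the stated 64-bit word assumption; the names `WORD_SHIFT` and `interleave_pow2` are made up for this example and do not come from any real memory controller.

```python
WORD_SHIFT = 3  # assumed 64-bit access words: the low 3 byte-address bits are ignored


def interleave_pow2(byte_addr, channel_bits=1):
    """Split a byte address into (channel, device_word_address).

    channel_bits=1 gives the basic 2-way interleave: the lowest word-address
    bit picks the channel and the remaining bits go to the memory devices.
    """
    word = byte_addr >> WORD_SHIFT                # drop the unused byte-offset bits
    channel = word & ((1 << channel_bits) - 1)    # low bit(s) select the channel
    device_addr = word >> channel_bits            # the rest addresses the device
    return channel, device_addr


if __name__ == "__main__":
    # Sequential 64-bit words alternate between channel 0 and channel 1.
    for addr in range(0, 64, 8):
        print(addr, interleave_pow2(addr))
```

Because the split is just a mask and a shift, in hardware it amounts to routing address lines, with no arithmetic in the path.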

A 3-way is where it gets complicated.

The equivalent of the basic interleave is to take the address and divide by 3. The quotient is passed to the memory devices as the address, and the remainder chooses which channel. The trouble with this is that there is some delay through all those decision gates before the quotient and remainder are available. Unless someone can somehow come up with a zero-delay divide-by-three gate matrix, this is not a practical solution.
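As a software sketch of that full divide-by-3 mapping (again assuming 64-bit words; `interleave_mod3` is a hypothetical name used only here):

```python
WORD_SHIFT = 3  # assumed 64-bit access words


def interleave_mod3(byte_addr):
    """Map a byte address to (channel, device_word_address) by dividing by 3.

    The remainder chooses the channel and the quotient is the address
    passed to the memory devices.
    """
    word = byte_addr >> WORD_SHIFT
    device_addr, channel = divmod(word, 3)
    return channel, device_addr


if __name__ == "__main__":
    for addr in range(0, 72, 8):
        print(addr, interleave_mod3(addr))
```

In software `divmod` is trivial. The concern in the post is the gate delay of producing the same quotient and remainder in hardware for every address.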

An alternative is to take just a few bits (3 bits for an 8/9 interleave) and apply the divide by 3 to those. Since no power of 2 is exactly divisible by 3, this spreads accesses unevenly. That is, 8 sequential addresses would alternate through the three channels in the order 0, 1, 2, 0, 1, 2, 0, 1 ... and stop there (the next 8 addresses start over at 0 again, not at 2). That means 1/9th of memory is lost (every 3rd position in the 3rd channel's address space would be skipped over). This would be a poor solution. Doing this with a larger number of bits might be doable because less memory would be lost. A 32/33 interleave would lose 1/33rd of memory (every 11th position in the 3rd channel would be skipped over), which might be more acceptable. But there is one problem as we keep going up with this ... we get longer delays in doing that division by 3 with more bits. But maybe there is a good tradeoff point somewhere.
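The lost-memory arithmetic for this partial scheme can be checked with a small sketch. The layout of device addresses here (each group of 2^bits words occupying a fixed number of rows per channel) is my assumption for illustration; the function names are hypothetical.

```python
from fractions import Fraction


def rows_per_group(bits):
    """Rows each channel must reserve per group of 2**bits words: ceil(2**bits / 3)."""
    return -(-(1 << bits) // 3)


def interleave_partial_mod3(word, bits=3):
    """Apply the divide-by-3 only to the low `bits` bits of the word address."""
    low = word & ((1 << bits) - 1)    # bits that go through the small divider
    high = word >> bits               # bits that are simply routed through
    row, channel = divmod(low, 3)
    return channel, high * rows_per_group(bits) + row


def wasted_fraction(bits):
    """Fraction of the reserved device positions that no address ever maps to."""
    used = 1 << bits
    reserved = 3 * rows_per_group(bits)
    return Fraction(reserved - used, reserved)


if __name__ == "__main__":
    print([interleave_partial_mod3(w)[0] for w in range(10)])
    print(wasted_fraction(3), wasted_fraction(5))
```

With `bits=3` the third channel never receives row 2 of any group, which is the skipped position in the 8/9 case; `bits=5` gives the 32/33 case.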

There are a couple other ideas I'm thinking of, but I haven't figured them out completely. I'm not even sure they'd work (I'd have to work out the design all the way, probably, to determine that). One of them involves scattering the accessed addresses around the three channels in a non-linear order (using an XOR matrix). One disadvantage with that is you cannot always parallel load the cache (although it might be possible to selectively do so where the address transforms favor it).

A 4-way interleave is basically as simple as a 2-way, but you have 2 bits selecting the channel instead of just 1 bit. Basically, whenever the interleave is a power of two, you can do the needed modular arithmetic by just routing address bit lines.

Hey Intel and AMD ... just move on up to quadruple channel memory and simplify life.

I don't know whether you have figured out this question.

I can provide some information here.

The interleaving scheme in triple-channel mode is exactly the MOD3 operation. I measured this using our HMTT hardware. In addition, the Intel datasheet (i7-900 datasheet, volume 2, section 2.9) has related information: it lists three schemes like the ones you described above. One uses 3 bits, one uses 3 bits and XOR, and one is MOD3, which is our situation.

Recently, I've been confused about how interleaving is performed when the requirement for triple-channel operation is not satisfied, i.e., when the channels do not all have the same capacity. If you have related information, please share it with me.

Skaperen · 06-01-2011, 08:07 AM

Quote:

Originally Posted by cuizehan

I don't know whether you have figured out this question.

I can provide some information here.

The interleaving scheme in triple-channel mode is exactly the MOD3 operation. I measured this using our HMTT hardware. In addition, the Intel datasheet (i7-900 datasheet, volume 2, section 2.9) has related information: it lists three schemes like the ones you described above. One uses 3 bits, one uses 3 bits and XOR, and one is MOD3, which is our situation.

Recently, I've been confused about how interleaving is performed when the requirement for triple-channel operation is not satisfied, i.e., when the channels do not all have the same capacity. If you have related information, please share it with me.

The mod3 scheme would seem to require extra circuitry to do that calculation, and a lot of gates that would delay the address propagation. Is it uniformly mod3 over the entire address space, or is it mod3 across some subset of the address bits? If the latter, and if done over the lower bits, then that would leave the mapping out of balance, since no power of 2 is 0 mod 3.

Searching on developer.intel.com finds nothing about the details of memory interleaving. At least one document I found said it did interleave but gave no further details (besides where to plug in DIMMs to gain triple-channel speed on certain Intel boards). There's more than one way to interleave, so I make no assumptions about this from what they say.

cuizehan · 06-01-2011, 08:22 AM

Quote:

Originally Posted by Skaperen

The mod3 scheme would seem to require extra circuitry to do that calculation, and a lot of gates that would delay the address propagation. Is it uniformly mod3 over the entire address space, or is it mod3 across some subset of the address bits? If the latter, and if done over the lower bits, then that would leave the mapping out of balance, since no power of 2 is 0 mod 3.

Yeah, it is MOD3 over the entire address space. I think the MOD3 may not cost too much compared to the complicated memory controller and the long memory access latency.
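One reason a mod-3 remainder can be fairly cheap: since 4 leaves remainder 1 when divided by 3, a number and the sum of its base-4 digits (its 2-bit groups) have the same remainder mod 3. The sketch below uses that identity with only shifts, masks, and small additions. It is a generic arithmetic trick, not a description of what the Intel controller actually does, and it covers only the remainder (channel select), not the quotient. The function name is hypothetical.

```python
def mod3_by_digit_sum(n):
    """Compute n % 3 for n >= 0 using only shifts, masks, and additions.

    Relies on 4 % 3 == 1, so summing the 2-bit groups of n preserves n % 3.
    """
    while n > 3:
        total = 0
        while n:
            total += n & 0b11   # add the lowest 2-bit group
            n >>= 2             # move on to the next group
        n = total               # the sum is much smaller; repeat until <= 3
    return 0 if n == 3 else n


if __name__ == "__main__":
    print([mod3_by_digit_sum(n) for n in range(12)])
```

In hardware the inner loop would correspond to a tree of small adders rather than a sequential loop, so the depth grows slowly with the number of address bits.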

Skaperen · 06-01-2011, 11:05 AM

They could still get some of the speedup of triple-channel memory if they also supported having one of the slots populated with memory twice as large as the other two. In this case it would be mod4 interleaved like 0,1,0,2 (where 0 has a double-sized DIMM, or is interconnected with 2 DIMMs). Then you could have true powers of 2 like 4G, 8G, 16G, or even 32G, while still being somewhat faster than plain double-channel.
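A rough sketch of that 0,1,0,2 mapping on word addresses follows. How the double-sized channel lays out its own addresses is my assumption for illustration, and `PATTERN` and `interleave_uneven` are hypothetical names.

```python
PATTERN = (0, 1, 0, 2)  # channel for each of the 4 slots; channel 0 is double-sized


def interleave_uneven(word):
    """Map a word address to (channel, device_word_address) with a mod-4 pattern.

    The split into group and slot is pure bit routing (low 2 bits vs. the rest).
    """
    group, slot = word >> 2, word & 0b11
    channel = PATTERN[slot]
    if channel == 0:
        # Channel 0 receives two words per group, so it needs one extra address bit.
        device_addr = (group << 1) | (slot >> 1)
    else:
        device_addr = group
    return channel, device_addr


if __name__ == "__main__":
    for w in range(8):
        print(w, interleave_uneven(w))
```

Every step is a mask, shift, or table lookup on 2 bits, so no division by 3 is needed, at the cost of channel 0 handling every other word.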

Maybe we'll eventually see quad-channel memory.