Efficient usage of the Dreamcast RAM

From dreamcast.wiki

Written by bogglez

Introduction

SDRAMs are essentially DRAMs on steroids, with a minor interface on top of them that allows some degree of pipelining.

SDRAMs are designed to transfer data in fixed-size bursts. Initiating a transfer takes some cycles, but after that the burst of data is transferred very quickly (one data item per cycle). Subsequent accesses may also be faster if they are in the vicinity of each other.

Getting good performance out of an SDRAM-based memory subsystem requires the programmer to pay attention to the size and location of data items. Keeping data objects as small as possible, and storing related objects either close to each other or really far from each other (see Chapter 3.3), is essential for high-bandwidth operation.

All cycle values referred to in this document are bus cycles, not CPU core cycles.

SDRAM operation

Overview

The Dreamcast has 16MB of main memory, supplied in the form of two 8MB SDRAM chips. These chips are clocked at the SH4's full bus frequency, 100MHz.

Each of the two main memory SDRAM chips in the Dreamcast is 512k x 32 bits x 4 banks large. The two chips are connected in tandem to the CPU's 64bit bus - one chip handles the upper 32 bits of each access, the other the lower 32 bits - so the configuration is (from the CPU's point of view) identical to having a single 512k x 64 bits x 4 banks SDRAM chip.


                     Address/control bus
                             |||
 +---------------------------|||--------------------------------------------+
 |                           |||                                            |
 |    +++--------------------+++---+++--------------------- ... ---+++      |
 |    |||                          |||                             |||      |
 |    |||         Bank 0           |||        Bank 1               ...      |
 |    |||      +-----------+       |||      +-----------+                   |
 |  +------+   |   Row 0   |     +------+   |   Row 0   |                   |
 |  | Bank |---|   Row 1   |     | Bank |---|   Row 1   |   ...   Bank k-1  |
 |  | ctrl |   |    ...    |     | ctrl |   |    ...    |                   |
 |  +------+   |  Row n-1  |     +------+   |  Row n-1  |                   |
 |             +-----------+                +-----------+                   |
 |               |||||||||                    |||||||||                     |
 |             +-----------+                +-----------+                   |
 |             |Sense Amps |                |Sense amps |                   |
 |             +-----------+                +-----------+        .........  |
 |               |||||||||    +--------+      |||||||||          |||||||||  |
 |               +++++++++----|Bus ctrl|------+++++++++---- ... -+++++++++  |
 |                            +--------+                                    |
 +-------------------------------||||---------------------------------------+
                                 ||||
                               Data bus
        Figure 2.1.1: Model of an SDRAM chip: n rows x m bits/row x k banks

An SDRAM chip has one set of address/control pins, and one set of data pins. These make up the SDRAM's interface to the outside world. (See fig. 2.1.1) Inside the chip, there are several (usually two or four) identical submodules called 'banks'. Each bank contains three parts: Bank control circuitry, a memory array, and a row of sense amplifiers. The memory array, in turn, is made up of a set of rows of bits. A row can be subdivided into a set of words, each of which is the size of the data bus. The length of a row is usually equal to (or a bit smaller than) the number of rows in the memory array.

Each bank covers one segment of the addressable space; that is, the highest 1 or 2 bits of the address are used to select the bank. The row address is given by the following address bits, and finally the column index is given by the lowest-order address bits.

Before a given memory cell can be accessed, the appropriate row must be activated in the right bank. Activation connects it to the sense amplifiers, which will translate the small (0.5v or less) charges in the memory array to acceptable CMOS levels. Once the row is active, its contents can be read/written over the data bus. One row can be active in each bank, independent of the other banks.

Activation takes a few cycles. If another row already is active in that bank, then that row must first be deactivated. That procedure also takes a few cycles.

When reading/writing to the SDRAM, the bus control module will transfer data between the appropriate portions of the active row in the bank in question and the data bus. The bus control module allows the rows to be many times longer than the width of the data bus.

In addition to this, the memory cells leak charge; inactive rows leak slowly, active rows leak quickly. Therefore, the rows periodically need to be rewritten (typically once every 4ms). A row must not be kept open too long either (typically max 100us). The normal refresh handles both these issues.

Commands

A few different commands can be sent to the SDRAM on the control/address bus. Here is a subset:

Command Description
REF Deactivates all active rows, and refreshes one row in each bank (the SDRAM keeps an internal counter which specifies which row is next to be refreshed).
ACTIV(bank, row) Activates the given row in the given bank. No other row must be currently active in that bank.
PRE(bank) 'Precharges', that is, deactivates the currently active row in the given bank. If no row is active in the bank, this is a no-op.
PALL Precharges all banks.
READ(bank, column) Reads a burst of data from the currently active row in the given bank, starting at the specified column. The data will be written to the databus.
WRIT(bank, column) Writes a burst of data to the currently active row in the given bank, starting at the specified column. The data will be read from the databus.
READA(bank, column) Reads a burst of data, then precharges the bank.
WRITA(bank, column) Writes a burst of data, then precharges the bank.

Read/write operation

READ/READA commands initiate a series ('burst') of consecutive reads from the active row. There is no need for more communication after the command has been sent; all reads will be carried out without any handshaking. Each read ('beat') will deliver one full set of data to the data bus (8/16/32/64 bits, depending on the data bus width of the SDRAM), and then update the reading address within the row in anticipation of the next beat.

The number of beats in a burst has to be set by the CPU when the SDRAM is initialized. Normal values are 1, 2, 4, 8 beats, or enough beats to deliver the whole row.

The reading address must be aligned to an integer word boundary. As the read progresses, the next address is computed in one out of a few different ways. The most commonly used scheme is the sequential burst sequence: The access address is increased to the next word address modulo (burst length) -- that is, if the current burst length is 4, and (access address) modulo 4 == 3, then the next access address will be 3 words back, rather than 1 word ahead. This mode allows CPUs to begin fetching a cacheline at an arbitrary position within the line - if the word at offset #2 into a 4-word cacheline is immediately needed by the CPU, it can fetch the words in order 2, 3, 0, 1 by starting a fetch at word 2.
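The wrap-around rule can be sketched in C. This is only an illustration of the sequential burst sequence described above; the function name `burst_order` and its parameters are hypothetical, and addresses are counted in bus-width words.

```c
#include <assert.h>

/* Fill `order` with the word addresses visited by one sequential burst.
 * `start` is the word address of the first beat, `burst_len` the number
 * of beats per burst (a power of two, e.g. 4). The burst never leaves
 * the burst-aligned block that contains `start`. */
static void burst_order(unsigned start, unsigned burst_len, unsigned *order)
{
    /* The masking below only works for power-of-two burst lengths. */
    assert(burst_len != 0 && (burst_len & (burst_len - 1)) == 0);

    unsigned base = start & ~(burst_len - 1);      /* first word of the block */
    for (unsigned beat = 0; beat < burst_len; beat++) {
        /* Advance by one word per beat, wrapping inside the block. */
        order[beat] = base + ((start + beat) & (burst_len - 1));
    }
}
```

Starting a 4-beat burst at word 2 visits the words in the order 2, 3, 0, 1, which matches the cacheline example above.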

Write operation is identical to read operation, except that the SDRAM will read from the data pins, rather than write to them. Also, as there is no write operation that writes only a subset of a word, there is one disable/enable signal available per byte; the bus controller can assert only some of those signals, thereby telling the SDRAM to only update some of the bytes of the next word to be written.


Aborting and pipelining bus transactions

A burst read/write sequence can be prematurely aborted by issuing another read/write command to the SDRAM which will result in new bus access before the first burst sequence has finished.

When issuing a read/write, the bank number and column address given on the address bus will be latched into the SDRAM. Once the first data word is read/written via the data bus, the contents of the address and control buses are no longer needed. Another command can then be given while one of the banks is still reading/writing via the data bus, as long as that command will not directly interfere with the ongoing bus transaction.

Suitable commands can be row activation/deactivation in another bank, or a read/write command for one of the banks. (The read/write command must not be sent too soon however, or the ongoing bus transaction will be aborted before completion.) By overlapping data bus transfers and command issuing, the SDRAM can reach throughput rates closer to its theoretical maximum: 1 data word per cycle.

SDRAM in the Dreamcast

As the Dreamcast has two 32bit SDRAMs in parallel, and the bus to the SDRAMs is 64bit wide, this chapter will assume the simplified view that there is a single 64bit SDRAM chip in the system. The fictitious '64bit SDRAM' has 2kB/row, 2048 rows per bank, and 4 banks. This means that each bank spans 4MB of memory.

An address can be split into bank-, row- and column-bits in the following manner:

 bbrr rrrr rrrr rccc cccc c000             <-- 24bit memory address
  |         |        |      | 
 bank      row    column   sub-8byte position
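The bit layout above can be turned into a small decoder. This is a minimal sketch: the struct and function names are hypothetical, and it assumes that only the low 24 bits of the address matter (that is, that the 16MB of main memory sits on a 16MB-aligned base in the address map).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: one address split into its SDRAM fields. */
struct sdram_addr {
    unsigned bank;    /* 2 bits:  bits 22-23 */
    unsigned row;     /* 11 bits: bits 11-21 */
    unsigned column;  /* 8 bits:  bits 3-10, one 64bit word per column */
    unsigned byte;    /* 3 bits:  position inside the 8-byte word */
};

static struct sdram_addr sdram_decode(uint32_t addr)
{
    uint32_t offset = addr & 0x00FFFFFFu;  /* assumption: 16MB-aligned RAM base */
    struct sdram_addr a;

    a.byte   =  offset        & 0x7u;
    a.column = (offset >> 3)  & 0xFFu;
    a.row    = (offset >> 11) & 0x7FFu;
    a.bank   = (offset >> 22) & 0x3u;

    /* The four fields together cover exactly 24 bits. */
    assert(a.bank < 4 && a.row < 2048 && a.column < 256 && a.byte < 8);
    return a;
}
```

Two addresses share a row only if both `bank` and `row` match, which is the case exactly when they lie in the same 2kB-aligned block.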

The SH4 has a bus state controller, which operates in parallel with the rest of the SH4 core. The SH4 bus runs at 100MHz, which is half the SH4 core frequency.


Access philosophy

The SH4 will always access the SDRAM in 32-byte chunks (4-beat bursts) - even when doing non-cached reads/writes. [When doing non-cached writes, the SH4 will tell the SDRAM what data to ignore using some bus control signals.] Because all DMA uses the SH4's on-board DMA controller to generate the addresses, DMA also accesses the SDRAM in 32-byte chunks. Thus there are only two operations which are of interest to the programmer: the 32-byte read, and the 32-byte write.

The bus state controller keeps track of which banks have open rows, and omits unnecessary PRE/ACTIV commands whenever possible.

The bus state controller can be in two modes.

In the first mode ('RASDown', which is set up by the BootROM code), read/write commands are issued as READs/WRITs, and thus leave the accessed row active after the operation. The bus state controller keeps track of which rows are open in which banks, and omits unnecessary PRE/ACTIV commands whenever possible (when there is a 'row hit'). The only way that rows become deactivated is via 'row misses' and the periodically issued REF commands. This mode gives better performance, unless the row hit rate is exceptionally low.

In the second mode, read/write commands are issued as READAs/WRITAs, and thus deactivate the row after the operation. This means that one ACTIV command must be issued before each READA/WRITA. This mode has lower maximum throughput, but may be faster when executing algorithms that have a poor row hit/miss ratio.

The bus state controller will not pipeline CPU accesses, only DMA accesses. This means that when the CPU requests a memory access, the bus state controller will wait until the data bus is idle before issuing any commands to the SDRAM.

Normally, the bus state controller will run in RASDown mode. The BootROM code sets up the bus state controller to operate in this mode.


Access timing in RASDown mode

ACTIV(bank, row) takes 3 cycles. PRE(bank) takes 2 cycles. READ(column) takes 3 cycles for setup, and then 4 cycles during which the data arrives (one beat per bus cycle). WRIT(column) begins writing data immediately: 4 cycles to receive the data. However, the WRIT command must come at least 2 cycles into the operation (some control bus signals may be delayed that much since the previous bus access).

The setup time of one CPU access cannot be overlapped with the data access time of a previous CPU access. However, DMA accesses will pipeline in this fashion, and since they are massively sequential, there will be a lot of row hits, resulting in a transfer speed close to 8 bytes/cycle.

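The row-hit values in Table 3.2.1 indicate that the maximum read speed for the CPU would be about 450MB/s (32 bytes every 7 cycles), the maximum write speed about 530MB/s (32 bytes every 6 cycles), and the maximum DMA speed (both read and write) 800MB/s (32 bytes every 4 cycles).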
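The arithmetic behind such peak figures can be sketched as follows. The function name is hypothetical; the sketch assumes a 100MHz bus, 32 bytes per burst, 1MB = 10^6 bytes, and that every access is a row hit.

```c
#include <assert.h>

enum {
    BUS_MHZ     = 100,  /* bus clock, from the text */
    BURST_BYTES = 32    /* one 4-beat burst on the 64bit bus */
};

/* Peak transfer rate in MB/s when each 32-byte burst occupies the bus
 * for `cycles_per_burst` bus cycles. Integer division truncates. */
static unsigned peak_mb_per_s(unsigned cycles_per_burst)
{
    assert(cycles_per_burst > 0);
    return BURST_BYTES * BUS_MHZ / cycles_per_burst;
}
```

With the row-hit cycle counts from Table 3.2.1 (7 for a CPU read, 6 for a CPU write, 4 for pipelined DMA), the function returns 457, 533 and 800 respectively.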

When CPU or DMA requests a memory access to a given address, the SH4 will issue different commands to the SDRAM:

  • If the correct row is active in the bank in question ('row hit'), the READ/WRIT is sent directly to the SDRAM.
  • If no row is active in the bank in question ('no row active'), an ACTIV command is sent, followed by the READ/WRIT. This case occurs very rarely in the Dreamcast.
  • If another row is active in the bank in question ('row miss'), a PRE command is first issued to close that row. This is followed by an ACTIV to activate the appropriate row, and finally the READ/WRIT is given.

See table 3.2.1 for cycle timings of the different cases.
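The three cases can be modelled with a few lines of C, which is handy for estimating the cost of an access pattern on paper. This is a rough sketch, not a description of the actual bus state controller: all names are hypothetical, the cycle counts are the CPU values from Table 3.2.1, and the model ignores REF commands, DMA pipelining and the extra idle cycle mentioned in the table note.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_BANKS = 4, NO_ROW = -1 };

/* Which row is currently open in each bank (NO_ROW if none). */
struct bsc_model {
    int open_row[NUM_BANKS];
};

static void bsc_reset(struct bsc_model *m)
{
    assert(m != NULL);
    for (int bank = 0; bank < NUM_BANKS; bank++) {
        m->open_row[bank] = NO_ROW;
    }
}

/* Return the bus cycles one 32-byte CPU access costs in RASDown mode,
 * and leave the accessed row open afterwards. */
static unsigned bsc_cpu_access(struct bsc_model *m, uint32_t addr, bool is_write)
{
    assert(m != NULL);
    unsigned bank = (addr >> 22) & 0x3u;           /* bank bits 22-23 */
    int row = (int)((addr >> 11) & 0x7FFu);        /* row bits 11-21 */
    unsigned cycles;

    if (m->open_row[bank] == row) {
        cycles = is_write ? 6 : 7;                 /* row hit */
    } else if (m->open_row[bank] == NO_ROW) {
        cycles = is_write ? 7 : 10;                /* no row active: ACTIV first */
    } else {
        cycles = is_write ? 9 : 12;                /* row miss: PRE, then ACTIV */
    }
    m->open_row[bank] = row;
    return cycles;
}
```

Feeding the model two interleaved streams that live in the same bank but in different rows yields a row miss on every access, while placing the streams in different banks turns nearly all of them into row hits.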

Figure 3.2.5 deserves a comment: According to the SDRAM specification, one of the control signals is delayed by two cycles, so the bus state controller must assert the signal two cycles before the bus transaction begins. Since the CPU is unable to predict what that signal should be set to in advance of the bus transaction, the bus state controller has to idle for two cycles while the control signal in question is propagating through the SDRAM. (DMA accesses, on the other hand, are long sequences of increasing addresses. The control signal can then be predicted ahead of time, and the gap between bus transactions eliminated.)

Table 3.2.1: Common access timings

Operation                        Cycles           Figure
CPU burst read, no row active    10 cycles        Fig. 3.2.1
CPU burst read, row hit          7 cycles         Fig. 3.2.2
CPU burst read, row miss         12 cycles        Fig. 3.2.3
CPU burst write, no row active   7 cycles         Fig. 3.2.4
CPU burst write, row hit         6 cycles         Fig. 3.2.5
CPU burst write, row miss        9 cycles         Fig. 3.2.6
DMA burst reads, row hit         4 cycles each    Fig. 3.2.7
DMA burst writes, row hit        4 cycles each    Fig. 3.2.8
 Note: If a row miss happens during the first cycle after a write, the bus state controller will idle for a cycle before sending the PRE command.
   Chart coding:
   **** marks the time when a row/column address, or PRE command
        is being sent
   ---- and .... mark when burst beats 1,2,3,4 transfer the data
   ++++ marks when address/data of other memory accesses are
        being performed (only used in the DMA figures)


   Cycle             0   1   2   3   4   5   6   7   8   9  10  11  12
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Row address     ************  |   |   |   |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address    |   |   | ****************************  |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Data arrives      |   |   |   |   |   |  ----....----.... |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |

Figure 3.2.1: CPU burst read, no row active


   Cycle             0   1   2   3   4   5   6   7   8   9  10  11  12
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address  ****************************  |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Data arrives      |   |   |  ----....----.... |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |

Figure 3.2.2: CPU burst read, row hit



   Cycle             0   1   2   3   4   5   6   7   8   9  10  11  12
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Precharge       ********  |   |   |   |   |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Row address       |   | ************  |   |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address    |   |   |   |   | ****************************  |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Data arrives      |   |   |   |   |   |   |   |  ----....----.... |
                     |   |   |   |   |   |   |   |   |   |   |   |   |

Figure 3.2.3: CPU burst read, row miss


   Cycle             0   1   2   3   4   5   6   7   8   9  10  11  12
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Row address     ************  |   |   |   |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address    |   |   | ****************  |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Data departs      |   |   |  --- ... --- ...  |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |

Figure 3.2.4: CPU burst write, no row active


   Cycle             0   1   2   3   4   5   6   7   8   9  10  11  12
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Idle cycles     ********  |   |   |   |   |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address    |   | ****************  |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Data departs      |   |  --- ... --- ...  |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |

Figure 3.2.5: CPU burst write, row hit


   Cycle             0   1   2   3   4   5   6   7   8   9  10  11  12
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Precharge       ********  |   |   |   |   |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Row address       |   | ************  |   |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address    |   |   |   |   | ****************  |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Data departs      |   |   |   |   |  --- ... --- ...  |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |

Figure 3.2.6: CPU burst write, row miss


                    -1   0   1   2   3   4   5   6   7   8   9  10  11
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address 1  | ****************  |   |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address 2  |   |   |   |   | ++++++++++++++++  |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address 3  |   |   |   |   |   |   |   |   | ++++++++++++++++
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Read data 0      ++++++++++++++++ |   |   |   |   |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Read data 1       |   |   |   |  ----....----.... |   |   |   |   |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
   Read data 2       |   |   |   |   |   |   |   |  ++++++++++++++++ |
                     |   |   |   |   |   |   |   |   |   |   |   |   |
       Figure 3.2.7: DMA burst reads, row hit


                     -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
                      |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address 0 ++++  |   |   |   |   |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address 1   |   |   |   | ****  |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
   Column address 2   |   |   |   |   |   |   |   | ++++  |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
   Data departs 0    ++++++++++++++++ |   |   |   |   |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
   Data departs 1     |   |   |   |  ----....----.... |   |   |   |   |
                      |   |   |   |   |   |   |   |   |   |   |   |   |
   Data departs 2     |   |   |   |   |   |   |   |  ++++++++++++++++ |
                      |   |   |   |   |   |   |   |   |   |   |   |   |

Figure 3.2.8: DMA burst writes, row hit


Optimizing memory access patterns

  • Align your data structures such that they span as few cache lines as possible (16-byte align vectors, 32-byte align matrices).
  • If you have a group of related data items, which are going to be accessed at roughly the same time, and there will not be many accesses elsewhere in that bank, then put them into the same row by 2kB-aligning the group of items; this avoids some row activations/deactivations.
  • If performing some kind of streaming operation (wading through lots of data and performing some simple operation on it), put each data stream into a separate memory bank; this avoids lots of unnecessary row activations/deactivations.
  • Remember that the cache is direct-mapped. If you are interleaving accesses to two arrays that both are aligned to at least 16kB, there is a risk of cache thrashing. Offset one array some bytes to the side to solve that problem.
  • Use PREF to fetch data some cycles before you'll be accessing it. If you don't prefetch appropriately, a cachemiss later on will stall the entire SH4 pipeline until the data is available (8+ CPU-cycles from SDRAM).
  • If you are creating a data stream from scratch -- not modifying existing data at those locations -- then use MOVCA to allocate cache lines without causing memory reads. This avoids reading in dummy data (due to cache misses & line fills) which is soon going to be overwritten anyway. If you're not going to access the data shortly either, use the Store Queues to write out the data.
  • When you are writing out a data stream, and you will not overwrite it in the near future, use the OCBWB instruction to force the data to be written back to memory. The two main reasons for triggering cache write-backs manually are that you can avoid memory contention to some degree, and that spurious cache writebacks later on may cause lots of SDRAM active row switching. Again, the Store Queues are an alternative.
  • Keep in mind that Store Queues bypass the cache: If you have previously read from a memory area, and subsequently write to it using Store Queues, use the OCBI instruction to invalidate the corresponding cache lines. Otherwise the cache might contain stale data.
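Some of the points above can be checked or applied with a few small helpers. This is a sketch under stated assumptions: all names are hypothetical, the cache is modelled as 16kB direct-mapped with 32-byte lines, the bank bits are taken from the address layout given earlier, and `__builtin_prefetch` is a GCC/Clang builtin that is expected (not guaranteed) to compile to PREF on an SH4 target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    CACHE_LINE_BYTES = 32,         /* one 4-beat burst */
    CACHE_BYTES      = 16 * 1024,  /* assumed direct-mapped cache size */
    BANK_SHIFT       = 22          /* bank bits are address bits 22-23 */
};

/* True if two addresses map to the same line of a direct-mapped cache
 * while belonging to different memory lines, so that alternating between
 * them evicts the other each time (cache thrashing). */
static int cache_conflict(uint32_t a, uint32_t b)
{
    uint32_t line_a = a / CACHE_LINE_BYTES;
    uint32_t line_b = b / CACHE_LINE_BYTES;
    uint32_t lines  = CACHE_BYTES / CACHE_LINE_BYTES;
    return line_a != line_b && (line_a % lines) == (line_b % lines);
}

/* True if two addresses fall into the same SDRAM bank. */
static int same_bank(uint32_t a, uint32_t b)
{
    return ((a >> BANK_SHIFT) & 3u) == ((b >> BANK_SHIFT) & 3u);
}

/* Sum a stream, requesting the following 32 bytes once per 32 bytes
 * consumed so that the line fill overlaps with the additions. */
static uint32_t sum_with_prefetch(const uint32_t *data, size_t count)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof *data;
    uint32_t sum = 0;

    assert(data != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (i % per_line == 0 && i + per_line < count) {
            __builtin_prefetch(&data[i + per_line]);  /* hint only */
        }
        sum += data[i];
    }
    return sum;
}
```

Two arrays that start exactly 16kB apart give `cache_conflict` a true result for every pair of matching elements; shifting one of them by a single cache line removes the conflict. `same_bank` can be used in the same way to verify that two streams really ended up in different banks.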


Accessing other memory areas

When switching between writing to the SDRAM, and writing to other memory areas (either to other memory, or to memory-mapped devices), idle cycles may be inserted.

A 32-byte burst to the Tile Accelerator seems to usually take roughly 9 cycles, give or take max 2. The Tile Accelerator might be busy (this happens mainly when starting/ending a new list type, or when submitting degenerate/invalid primitives); then the access will stall for a while (delays up to 500 cycles have been observed).


References

  • The bus state controller and SDRAM interface of the SH4 are well described in the SH7750 Hardware Manual.
  • A detailed description of how to set up an SH4-SDRAM system is found in Application Note #92, named "SH-4 Interface to SDRAM".
  • The KM432S2030CT-G8 SDRAM (which is used for main memory in some Euro DCs, at least) documentation is available from Samsung.