Altera Flex10K CPUs
Usenet Postings
Subject: Re: FPGA based CPU ideas, and novel extensions => distributed RAM and Altera CPUs
Date: 14 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga

David Atkins wrote in message ...
>Any of these kicking around for Altera, if not for a good reason, ?
>Somehting of an interest but not in aposition to find the time for the
>money to get into, we use 10k10's at present and the techniques would be
>intersting, any pointer greatfully recieved.

(Disclaimer: I have studied but never used Altera devices.)

FPGA RISC CPUs, e.g. CPUs with adequate register files, can certainly be implemented in the Altera FLEX 10K family, which has many nice features. However, in my opinion, the Xilinx XC4000 architecture seems a better platform (higher performance) for this application because of its distributed RAM feature.

In particular, a simple RISC datapath benefits from a 2-read, 1-write port register file. In an XC4000, these can (in theory) be built and run at up to about 10 ns/cycle using two banks of dual port mode distributed RAM. [tWCTS=9.0, 8.4, 7.7 ns in XC4000XL-3, -2, -1]. Of course to take advantage of this 66-100 MHz operation you need the deeply pipelined even/odd ALUs I described in another recent posting.

In contrast, in a FLEX 10K device, you would use EABs (the 256x8 embedded RAM blocks). A 32x32 2-read 1-write register file would then require 3 cycles using 4 EABs, or 2 cycles using 8 EABs (two copies of the register file), at (in theory) 10+ ns/cycle. [tEAWRCREG and tEARCREG=11.6, 9.5 ns in EPF10K50V-4, -3]. (Perhaps an Altera expert will provide more correct and up-to-date information.)

Of course, an accumulator or stack oriented instruction set architecture (with TOS in a register) could reduce the average number of EAB accesses per cycle.

EABs could certainly excel at building LARGE register files (e.g. for vector registers or multiple thread contexts or register windows), on-chip RAM, ROM, caches, TLBs, cache tag RAMs for off-chip caches, etc.
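
The EAB register-file figures above (3 cycles with 4 EABs, 2 cycles with 8 EABs) can be reproduced by simple counting. The sketch below is only an illustration: it assumes each 256x8 EAB performs one read or one write per cycle, and that every write must update all copies of the register file. The function name and parameters are made up for this example and are not taken from any Altera tool.

```python
import math

EAB_DEPTH = 256  # words per EAB, as stated in the posting
EAB_WIDTH = 8    # bits per word, as stated in the posting


def eab_regfile(depth, width_bits, reads, writes, copies):
    """Return (EABs used, cycles per instruction) for a register file.

    Assumption: an EAB does a single access (read OR write) per cycle,
    and a write goes to every copy in the same cycle.
    """
    if depth > EAB_DEPTH:
        raise ValueError("register file is deeper than one EAB")
    eabs_per_copy = math.ceil(width_bits / EAB_WIDTH)
    # Reads can be spread across the copies, one read per copy per cycle.
    read_cycles = math.ceil(reads / copies)
    # Each write occupies all copies for one cycle.
    write_cycles = writes
    return copies * eabs_per_copy, read_cycles + write_cycles


if __name__ == "__main__":
    for copies in (1, 2):
        eabs, cycles = eab_regfile(32, 32, reads=2, writes=1, copies=copies)
        print(f"{copies} copy/copies: {eabs} EABs, {cycles} cycles")
```

Under these assumptions, one copy gives 4 EABs and 3 cycles, and two copies give 8 EABs and 2 cycles, matching the numbers quoted in the posting.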
Indeed an AMD 29000 style variable-sized register window implementation might avoid enough memory traffic to outperform a simpler 32-register RISC with half the cycle time. Might not.

Alas, compared to distributed RAM, EABs are often too narrow (256x8 instead of 128x16) and coarse. Take a simple I-cache design. A (256 byte) 16-entry by 4-word line by 32-bit I-cache in an XC4000 is one column of 16 CLBs for a 16x24 cache tag RAM, one column for a tag comparator and other control logic, and four columns for a 4x16x32 cache data RAM. Total approximately 6x16 CLBs, 10% of a 4025E, 3% of an XC4085XL. A (512 byte) 2-way set-associative, 32-entry cache would be about 200 CLBs, still a small percentage of a large device.

By contrast, the smallest such 32-bit cache you can build from EABs is 4 EABs (both tags and data in the same EABs) with a two-cycle cache access. 4 EABs is 33% of the EAB resources in a 10K100.

Another feature XC4000 has but which FLEX10K lacks is TBUFs (3-state drivers). These are very handy for sharing one wide bus across the chip.
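
As a rough check on the XC4000 I-cache sizing earlier in this posting, the arithmetic can be written out. This is a sketch under stated assumptions, none of which come from the posting itself: each CLB is taken to provide two 16x1 RAMs, so a column of 16 CLBs holds a 16-deep RAM up to 32 bits wide, and the device sizes (1024 CLBs for the XC4025E, 3136 for the XC4085XL) are recalled from the Xilinx data sheets and should be verified there.

```python
import math

COLUMN_CLBS = 16      # CLBs per column, as in the posting's floorplan
RAM_BITS_PER_CLB = 2  # assumption: two 16x1 RAMs per XC4000 CLB

# Assumption: CLB counts recalled from data sheets, not from the posting.
DEVICE_CLBS = {"XC4025E": 1024, "XC4085XL": 3136}


def columns_for_ram(width_bits):
    """Columns needed for a 16-deep RAM of the given width."""
    return math.ceil(width_bits / (COLUMN_CLBS * RAM_BITS_PER_CLB))


def icache_estimate(entries=16, words_per_line=4, word_bits=32,
                    tag_bits=24, control_columns=1):
    """Return (cache size in bytes, estimated CLBs) for a direct-mapped I-cache."""
    if entries != 16:
        raise ValueError("this sketch only models 16-deep distributed RAM")
    size_bytes = entries * words_per_line * word_bits // 8
    tag_columns = columns_for_ram(tag_bits)
    data_columns = columns_for_ram(words_per_line * word_bits)
    clbs = (tag_columns + control_columns + data_columns) * COLUMN_CLBS
    return size_bytes, clbs


if __name__ == "__main__":
    size_bytes, clbs = icache_estimate()
    print(f"{size_bytes} bytes, about {clbs} CLBs")
    for name, total in DEVICE_CLBS.items():
        print(f"{name}: {100 * clbs / total:.1f}% of CLBs")
```

With the defaults this yields 256 bytes in 6 columns of 16 CLBs, which is consistent with the "approximately 6x16 CLBs" estimate and with the rough 10% and 3% device fractions.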
In the old J32 design, the processor half of the XC4010 uses almost every available TBUF to drive many different results onto the "result bus", destined for write-back into the register file:

* adder/subtractor
* logic unit
* operand A << 1, << 2, << 4, >> 1, >> 2, >> 4
* data-in (byte, halfword, word)
* sign extension of word/byte data-in for lbu/lbs/lhu/lhs
* next-PC (for jal (jump-and-link)) to save the next-PC into a register
* data-out during the first cycle of store instructions (not written back)

and the 32-bit on-chip data bus half of the XC4010 uses TBUFs for:

* various peripherals and boot ROM to return read data
* driving off-chip data-in onto the on-chip bus
* bus byte-lane shifting -- for instance for "lbu r1,3(r0)" (load byte unsigned from address 3), we move data on mem.d[31:24] down to mem.d[7:0]

On the other hand, even the 10K10 provides an astonishing 3x144 FastTrack row channels, so it seems straightforward to deliver even eight or ten 32-bit possible results to multiplexors implemented in LABs. Assuming each EAB/row is responsible for 8 bits of the processor, a 10K10 might implement a splendid 16- or 24-bit RISC. Furthermore you can always implement a 32-bit processor with an 8- or 16-bit datapath, if you perform several execute cycles per instruction.

Jan Gray
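
The bus byte-lane shifting mentioned in the posting (for "lbu r1,3(r0)", data on mem.d[31:24] moves down to mem.d[7:0]) can be modelled in a few lines. This is a behavioural sketch only, not the J32 logic: it assumes the byte at address offset k sits on lane mem.d[8k+7:8k], which generalizes from the single example given, and the function names are hypothetical.

```python
def load_byte_unsigned(bus_word, addr):
    """Model of lbu: pick the addressed byte lane and move it to bits [7:0].

    Assumption: address offset k maps to bus bits [8k+7:8k], as in the
    posting's example where address 3 uses mem.d[31:24].
    """
    lane = addr & 3
    return (bus_word >> (8 * lane)) & 0xFF


def load_byte_signed(bus_word, addr):
    """Model of lbs: as lbu, then sign-extend bit 7 to a 32-bit result."""
    byte = load_byte_unsigned(bus_word, addr)
    return byte | 0xFFFFFF00 if byte & 0x80 else byte


if __name__ == "__main__":
    word = 0xAABBCCDD
    for addr in range(4):
        print(addr, hex(load_byte_unsigned(word, addr)),
              hex(load_byte_signed(word, addr)))
```

In hardware each lane would be a set of 3-state drivers (or a multiplexor input in a FLEX 10K LAB) with only one lane enabled at a time; the shift and mask here just stand in for that selection.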
Copyright © 2000, Gray Research LLC. All rights reserved.