Superscalar FPGA CPU Design
Usenet Postings
Subject: Re: small superscalar design ?
Date: 01 Nov 1995 00:00:00 GMT
Newsgroups: comp.arch.fpga

In <HAMMAMI.95Oct29205005@pross113.u-aizu.ac.jp> hammami-@u-aizu.ac.jp (Omar Hammami) writes:

>Dear Netters:
>
>Does anyone have ever tried to implement a small superscalar
>processor design using FPGAs ? I would like to have an idea of
>the gate complexity level for say a 16 bits superscalar with
>roughly 3 FP units and 2 Int units and a limited number of
>instructions.

Great question! It is surely possible, but it might not be the best fit of architecture to implementation technology...

First, let's talk functional units. Integer ALUs are easy to implement: you simply need an adder, a logic unit, and a multiplexor of some sort. Ripple carry adders will do, given vendors' dedicated carry chain hardware.

Decent performance FPUs would be more difficult. An FP add/sub will require renormalization, which requires a barrel shifter (bad: lots of wires!) or several iterative cycles of fewer-bit shifts. A multiplier of manageable size will also take several cycles, although at 16-bit FP (1 bit sign + 5 bit exp + 10 bit mantissa?) you might only need a 10x10->20 bit multiplier. (This is an invitation for you FP in FPGA veterans to chime in with your experiences.)

Moving on, the register file could well be your critical path. If you hope to sustain an average of even one and a half integer instructions per clock, you will have to fetch three or four operands per clock and retire two results. A 3-read-2-write register file, in which the two write back values are retired before you read the three new operands, will take at least 2 SRAM write cycles on embedded dual ported SRAMs and up to 4 cycles on embedded non-dual ported synchronous SRAMs, depending upon how many copies of the register file you keep. (See Xilinx 4000E, Altera FLEX 10K, Actel 3200DX?.) For instance, on the new Xilinx 40xxE-3 parts, you are talking at least 2, 3, or 4 sync-SRAM cycles of ~15 ns each, best case.
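To put rough numbers on those cycle counts, here is a back-of-envelope sketch in Python. It assumes the machine clock is set entirely by the register file (2, 3, or 4 back-to-back ~15 ns SRAM cycles) and ignores every other delay. The single-issue baseline (one instruction per clock at twice the clock rate) and all function names are hypothetical, chosen only for illustration.

```python
# Back-of-envelope throughput model; every figure is an assumption, not a measurement.
SRAM_CYCLE_NS = 15.0     # ~15 ns sync-SRAM cycle quoted for the 40xxE-3 parts
SUPERSCALAR_IPC = 1.5    # target of one and a half integer instructions per clock


def mips(ipc, clock_period_ns):
    """Millions of instructions per second at a given IPC and clock period."""
    return ipc * 1000.0 / clock_period_ns


def superscalar_mips(sram_cycles_per_clock):
    """Throughput if one machine clock spans this many back-to-back SRAM cycles."""
    return mips(SUPERSCALAR_IPC, sram_cycles_per_clock * SRAM_CYCLE_NS)


def single_issue_mips(sram_cycles_per_clock):
    """Hypothetical baseline: 1 instruction per clock at twice the clock rate."""
    return mips(1.0, sram_cycles_per_clock * SRAM_CYCLE_NS / 2.0)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"{n} SRAM cycles/clock: "
              f"superscalar {superscalar_mips(n):.1f} MIPS, "
              f"single-issue {single_issue_mips(n):.1f} MIPS")
```

Under these assumptions the superscalar machine lands at 50, about 33, and 25 MIPS for the three cases, and the single-issue baseline stays ahead by a factor of 2/1.5 each time.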
A superscalar design with such a register file would hardly be competitive with a single-issue design that could sustain one instruction per clock at twice the clock rate. Once again the "speed demons" whip the "brainiacs"! And if you hope to retire 3 or 4 results (peak) per clock, your basic cycle time is even worse. Your only hope might be to lobby the FPGA architects for multi-multi-ported SRAMs (2-write, 2-read quad ported synchronous SRAMs, anyone? :-)

A VLIW-like architecture could be a better architectural choice, because much of the register file, and its multiple write back liability, could be spread about the functional units with limited degrees of communication between units, achieving only one register file write back operation each clock per unit. (I have some "paper" VLIW datapath floorplans that satisfactorily fit in large existing FPGAs.) For instance, you could easily distribute 64 16-bit registers and four ALUs as 4 units, each a 16x16-bit register file plus ALU, assuming instruction operand constraints such that a given instruction segment for one unit could only read operands from that unit or (in a limited way) from adjacent units, and could only write back results to that unit. Then you could keep the register file result write back cycle time down to one or two sync-SRAM or dual port SRAM write and readback cycles no matter how wide your machine grows. But I'd hate to have to write your compiler's code generator.

Another comment. Superscalar microarchitectures typically demand many operand busses to route lots of operands and results to lots of functional units. Unfortunately, wires are relatively much more expensive in FPGAs than in custom designs (where they are already plenty expensive). The datapath of my (now working!) 32-bit pipelined RISC (half an XC4010) has just barely enough wiring resources to implement a single-issue microarchitecture.
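To make the VLIW partitioning concrete, here is a minimal Python sketch of the operand constraints for 4 units of 16 registers each. The global register numbering (register r lives in unit r // 16), the linear, non-wrapping notion of "adjacent", and all function names are assumptions made for illustration. The "limited way" in which adjacent-unit reads are restricted is not modeled: any read from a neighbouring unit is allowed here.

```python
NUM_UNITS = 4        # four register file + ALU units, as in the example above
REGS_PER_UNIT = 16   # 16 x 16-bit registers per unit, 64 registers in total


def unit_of(reg):
    """Unit holding global register number reg (assumed mapping: reg // 16)."""
    if not 0 <= reg < NUM_UNITS * REGS_PER_UNIT:
        raise ValueError(f"register {reg} out of range")
    return reg // REGS_PER_UNIT


def slot_is_legal(unit, dest, sources):
    """Check one instruction segment issued on `unit`."""
    if unit_of(dest) != unit:
        return False  # results may only be written back to the issuing unit
    # operands may come from the issuing unit or a directly adjacent unit
    return all(abs(unit_of(src) - unit) <= 1 for src in sources)


def word_is_legal(slots):
    """Check a whole VLIW word: at most one slot (one write back) per unit."""
    units = [unit for unit, _, _ in slots]
    return len(set(units)) == len(units) and all(
        slot_is_legal(unit, dest, sources) for unit, dest, sources in slots
    )
```

A code generator for such a machine would have to allocate registers and schedule instruction segments so that every word passes a check of this kind, which is where much of the compiler pain comes from.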
Unless you are a wizard at time multiplexing different operands and results on the same wires, say at 10 ns intervals, without incurring killer delays, I think you would find today's FPGAs unacceptably wire limiting.

But by all means give it a try! Sounds like a great push-the-envelope project.

>Any pointers on books, TRs and projects descriptions for
>undergraduates and graduates will be appreciated.

Very highly recommended reference: Mike Johnson's book, Superscalar Microprocessor Design, Prentice Hall, 1991, ISBN 0-13-875634-1.

"Tired: superscalar; Wired: VLIW and multithread,"
Jan Gray
Copyright © 2000, Gray Research LLC. All rights reserved.