Life in an FPGA |
|||
Home Maximum element >> << Regexps in FPGAs
Usenet Postings |
Subject: Re: [++] Fast Life code (Was:Re: FPGA-based CPUs (was Re: Minimal ALU instruction set)) Date: 25 May 1998 00:00:00 GMT Newsgroups: comp.arch,comp.arch.fpga,comp.arch.embedded Tim Olson wrote >This algorithm looks like the one described in the Smalltalk "blue book", >where a version of Life was implemented using BitBlt operations to >implement the cell counting in parallel. Another reference to the bitwise parallel approach is "Life Algorithms", Mark Niemiec, Byte, Jan. 1979, pp 70-79. If I recall correctly, Mark, David Buckingham, and friends, used Waterloo's Honeywell 66/60's EIS "move mountain" instructions to animate 64K 36-bit words per iteration. Inspired by Buckingham and the Blue Book, I wrote a bitblt version that did 800,000 cells in 34? bitblts on a Perq in 1983? and one that did 400,000 cells/s on an 8 MHz (1 M 2-operand 32-bit op/s) 68000 in 1985. As Messrs. Mathisen and Mogensen describe, Life should run very fast on modern processors (superscalar and multimedia enhanced and large caches). 64-bits, in 40 insns, in perhaps 15-20 clocks, at 3 ns/clock, e.g. 1 bit/ns. FPGA Implementation: It is straightforward to run at full memory bandwidth. For example, given an XCS20 and a 32Kx32 PBSRAM (32-bits in or out per 15 ns clock) we can approach 32 bits/(2*15) ns, e.g. 1 bit/ns. Since a given line is read three times (as "below", "current", and "above"), we buffer 2 lines of cells in RAM in the FPGA. A 1024 x n playfield requires 2 x 1024 bits = 64 CLBs of single port RAM, and preferably 3 x 1024 bits for 3 banks since each clock you must read from up to two lines and write to a third. Detailed design/floor plan. One bit requires approx. 9 CLBs. Assuming the cell neighbours are (a,b,c,d,e,f,g,h), we need :- 3 CLBs RAM -- 3 32x1 RAMs (3 banks of line buffer) 6 CLBs logic -- 1 s0a=a^b^c^d; s0e=e^f^g^h 2 s1a="a+b+c+d == 2 or 3"; s1e="e+f+g+h == 2 or 3" 3 s2a=a&b&c&d; s2e=e&f&g&h 4,5 (s3,s2,s1,s0)=(s2a,s1a,s0a) + (s2e,s1e,s0e) (uses dedicated carry logic) 6 new = ~s3&~s2&s1&(s0|old) and so in a 20x20 CLB XCS20, we explicitly place 16 rows of 1x9 CLB tiles in the left half, another 16 in the right half, leaving plenty of room to spare for control and address generation. At the 1997 FPGAs for Custom Computing Machines conference., the paper "The RAW Benchmark Suite" by Babb et al proposed a set of benchmarks for comparing reconfigurable computing systems. One of the 12 benchmarks was Life, for which they reported speedups of several hundred times over a SparcStation20+software approach, but in fairness, they write "we are not currently implementing known improvements to the software to take advantage of the bit-level parallelism available in a microprocessor". Summary. Hypotheticially... Fast microprocessor + cache: ~1 bit/ns Single FPGA + SRAM custom machine: ~1 bit/ns Jan Gray
Copyright © 2000, Gray Research LLC. All rights reserved. |