Finding Max Element in an FPGA |
|||
Home Floorplanning >> << Life in an FPGA
Usenet Postings |
Subject: Re: Efficient max-function architecture? Date: 22 Sep 1998 00:00:00 GMT Newsgroups: comp.arch.fpga Some comments. Algorithm analysis 101 tells us that to determine the largest item of n items requires at least n-1 2-input "greater than" comparisons. You might build other circuit structures, e.g. using 3-input or 4-input max-functions, of course, but I doubt they are as area efficient -- although I can't prove that off the top of my head. In the worst case, each comparison must consider each bit of its two arguments, although there is a possibility (a probability) of "early out" if some MSBs are different. If you wish to take advantage of early-out, you will have to design for the worst case (no early-outs occurred) or for failure in the improbable case. As you suggest, you can do the n-1 comparisons sequentially, in parallel, pipelined, etc., depending upon your performance requirements. Consider using dedicated carry chains (if available) to make your comparators small and fast. For example, an n-bit comparator requires approximately n/2 XC4000 CLBs + approx 1/2 CLB to establish carry-in[0] as 0. If the data arrives a byte at a time you could perform 1 load and n-1 comparison cycles using a single comparator and load-enabled register (e.g. 4.5 XC4000 CLBs, 1+15 cycles at ~10 ns/cycle). If the 16 bytes of data arrive all at once, the pipelined tree of comparators and muxes should do nicely (15*(4.5+4) = ~130 CLBs, 10 ns/cycle). Hybrids: e.g. if you have 50 ns you can reduce the problem to a first step where you iteratively find 4 largest of 4 subsets of 4, followed by 3 comparators+muxes, in area 4*4.5 +3*(4.5+4) = ~45 CLBs). Jan Gray Subject: Re: Efficient max-function architecture? -- "parallel bitwise max" Date: 22 Sep 1998 00:00:00 GMT Newsgroups: comp.arch.fpga I wrote in message <6u8s8g$l5s$1@news-1.news.gte.net>... >Algorithm analysis 101 tells us that to determine the largest item of n >items requires at least n-1 2-input "greater than" comparisons. ... SEGUE Algorithm analysis 101 also teaches us to carefully choose a model (e.g. elements and a 2-element comparison function) and a goal (e.g. minimize no. of comparisons to determine the largest element) to represent the problem. Only then can we compare algorithms or study optimal lower bounds on algorithms. For example, using a comparison based model, it can be shown that any sorting algorithm on n elements must perform at least ceil lg n! comparisons. But there are other sorting algorithms which do not perform any elementwise comparisons. They assume a different model with different operations on the elements. Consider radix sort, in which we distribute the n elements into r buckets, according to increasingly significant digits in their base r representation, and then regather them and repeat. (Remember the old IBM punch card sorters?) So, as radix sort is to a comparison-based sort, is there an analogous "radix max" to our comparison-based max? "PARALLEL BITWISE MAX" Yes. Here's a 'digit'al method for finding the max of n k-bit words. We scan the n inputs as n serial bit streams, all clocked together, msbs first. One bit stream is the maximum if it is never observed to be less than some other bit stream. For any two bit streams A and B, "A < B" iff ~msb(A)&msb(B) | (msb(A)==msb(B) & "lsbs(A) < lsbs(B)"). Design: we keep n candidate-for-max state bits, one for each bit stream. All are initialized to 1, as each stream is initially a candidate to be the maximum. A candidate stream is still a candidate if it has a current 1 input bit. A stream with a current 0 input bit loses its candidacy if there is some other remaining candidate stream that has a current 1 bit. After all input bits have been clocked past, the remaining candidate is the maximum bit stream. If duplicate input bit streams are possible, there can be more than one remaining candidate, each of which corresponds to the same maximum value. // pidgin netlist generator source: input reset; // 1 during the reset cycle input stream[n]; // current bit of each of the n input streams reg cand[n]; // 1 if the i'th stream is still a candidate for maximum. net some1; // 1 if some remaining candidate stream input bit is currently 1 output answer; // result, the index of max input stream some1 = cand[0]&stream[0] | cand[1]&stream[1] | ... | cand[n-1]&stream[n-1]; for (i = 0; i < n; i++) cand[i] := reset | cand[i]&(stream[i] | ~stream[i]&~some1); If we know the input words are never two alike, then bitcount(cand)=1 and we can use answer = encode(cand); otherwise, answer = priority_encode(cand); Implemented in an XC4000 FPGA, for n=16, requires approximately 4.5 CLBs for some1 8 CLBs for cand[16] 12.5 CLBs for a 16-to-4 priority encoder ---- ~25 CLBs, result every 9 clocks for 8-bit input data, at perhaps 10 ns/clock. Radix 4 (2 bits/clock) is also possible but probably unwieldly and slow. The nice thing about this approach is it is readily scalable to many inputs; the slowest part of the design would be the 'some1' or-reduction circuit, and even with 36 input streams this is only three CLB delays and mostly local interconnect. Jan Gray Subject: Re: Efficient max-function architecture? -- "parallel bitwise max" Date: 30 Sep 1998 00:00:00 GMT Newsgroups: comp.arch.fpga Brad Taylor wrote in message <3611D95A.527FC010@emf.net>... > /=====REG==\ > | [MAX] | > [MAX] \==>|a o|==/ > [hi_byte ]===>|a o|==REG=>|b__| > [lo_byte ]===>|b__| I like it! Certainly nicer than the 4-way hybrid in my first posting. To recap the thread and add yet another approach, if you have n inputs each m bits long, some choices are :- 1. simple m/2+1 CLB max accumulator in ~n clocks 2. ~nm/2 CLB max-mux tree in 1 clock 3. hybrid which takes ~nm/2k CLBs in k clocks 4. "parallel bit serial max" in ~n CLBs in ~m clocks 5. "serial bit serial max" in ~lg m + m/16 CLBs in m*n clocks About 5: use a bit serial max state machine together with an m-bit FIFO implemented using dual port RAM, to store the "current max". Operate by streaming in all the input words, serially, one bit at a time, into the max machine. Each m clocks it bit serially emits the max of all inputs so far. Jan Gray Subject: Re: Efficient max-function architecture? -- "parallel bitwise max" Date: 30 Sep 1998 00:00:00 GMT Newsgroups: comp.arch.fpga Jan Gray wrote in message <6usptm$d1$1@news-2.news.gte.net>... >5. "serial bit serial max" in ~lg m + m/16 CLBs in m*n clocks > >About 5: use a bit serial max state machine together with an m-bit FIFO >implemented using dual port RAM, to store the "current max"... Oops, we don't need an m-bit FIFO, but rather a simple m-bit shift register. Looking to http://www.xilinx.com/xapp/xapp052.pdf for inspiration, the m-bit SR can be implemented in just 2 + m/32 CLBs so I should rather have written: 5. "serial bit serial max" in ~4 + m/32 CLBs in m*n clocks Jan Gray
Copyright © 2000, Gray Research LLC. All rights reserved. |