Theoretical performance?
The number of operations in the matrix multiply is:
- 4 Loads (read [1x4] input matrix)
- 4 Stores (write [1x4] output matrix)
- 16 Mults (matrix products)
- 12 Adds (column sums)
Total ops is 36. The theoretical minimum number of instruction packets is 36/EUMAX, rounded up, assuming every packet carries EUMAX operations.
- If EUMAX = 4, then it should take only 9 instruction packets!
You will be graded on how close you get to this limit.
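
The arithmetic behind this limit can be checked with a short sketch. It assumes the [1x4] input is multiplied by a 4x4 coefficient matrix (which is what 16 mults and 12 adds imply) and that every packet can be filled completely, with no stalls from data dependencies. The names op_counts and min_packets are made up for this illustration and are not part of any toolchain.

```python
import math

def op_counts(n=4):
    """Count operations for a [1 x n] row vector times an [n x n] matrix.

    Assumption: only the input vector is loaded and the output vector
    stored, matching the counts listed above.
    """
    return {
        "loads": n,           # read the n input elements
        "stores": n,          # write the n output elements
        "mults": n * n,       # one product per matrix coefficient
        "adds": n * (n - 1),  # n - 1 additions per output column
    }

def min_packets(total_ops, eumax):
    """Lower bound on packets: every packet carries eumax operations."""
    return math.ceil(total_ops / eumax)

counts = op_counts(4)
total = sum(counts.values())
print("total ops:", total)  # total ops: 36

for eumax in (1, 2, 4, 8):
    print("EUMAX =", eumax, "->", min_packets(total, eumax), "packets")
```

With EUMAX = 4 this gives 9 packets, as above. With EUMAX = 8 the division is not exact (36/8 = 4.5), so the bound rounds up to 5. Real code will usually land above the bound, since an add cannot issue before the products it sums are available.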