MicroUnity's First Generation Technology
In the late 1980s, MicroUnity began designing and implementing a general-purpose microprocessor that could process digital media (audio, video, graphics, and RF data streams) in real time using software rather than special-purpose hardware.
One key advance was dynamic partitioning of a microprocessor's datapaths, registers, and execution units to perform operations in parallel on several sizes and types of digital media data. For example, a microprocessor dynamically partitions 128 bits of internal datapaths and registers into 16 concatenated (8-bit) bytes of data, or 8 (16-bit) doublets, or 4 (32-bit) quadlets. The concatenated data is of various types—e.g. integer (fixed point) and floating point, as shown below:
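As a rough illustration of partitioning, the sketch below models one 128-bit register in portable C and reinterprets the same bits as 8-, 16-, 32- or 64-bit lanes. The names `reg128`, `lane_get`, `lane_set` and `group_add`, and the byte ordering, are assumptions made for this sketch and are not MicroUnity's instruction names or register layout.

```c
#include <stdint.h>

/* A 128-bit register modeled as 16 bytes. Byte 0 holds the least
   significant bits (an ordering assumed for this sketch). */
typedef struct { uint8_t b[16]; } reg128;

/* Read lane i when the register is partitioned into lanes of `bits`
   bits each (bits is 8, 16, 32 or 64). */
static uint64_t lane_get(const reg128 *r, unsigned bits, unsigned i) {
    unsigned bytes = bits / 8;
    uint64_t v = 0;
    for (unsigned k = 0; k < bytes; k++)
        v |= (uint64_t)r->b[i * bytes + k] << (8 * k);
    return v;
}

/* Write lane i; only the low `bits` bits of v are stored. */
static void lane_set(reg128 *r, unsigned bits, unsigned i, uint64_t v) {
    unsigned bytes = bits / 8;
    for (unsigned k = 0; k < bytes; k++)
        r->b[i * bytes + k] = (uint8_t)(v >> (8 * k));
}

/* Group add: lane-wise wraparound addition. The same two registers give
   different results for different lane widths because carries never
   cross a lane boundary. */
static reg128 group_add(reg128 x, reg128 y, unsigned bits) {
    reg128 out;
    unsigned lanes = 128 / bits;
    for (unsigned i = 0; i < lanes; i++)
        lane_set(&out, bits, i, lane_get(&x, bits, i) + lane_get(&y, bits, i));
    return out;
}
```

Adding 0xFF and 0x01 in lane 0 wraps to 0x00 when the register is treated as bytes, but carries into the next byte when it is treated as doublets. That difference is the whole effect of changing the partition.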
The microprocessor also dynamically partitions its execution units to operate on these "groups" of concatenated data. For example, below is a Group Floating-point Multiply and Add instruction, which operates in parallel on partitioned groups of four (32-bit) floating-point numbers. Similar instructions operate in parallel on fixed-point data of the sizes above and perform other operations such as subtract, compare, and square root. The combination of compare-and-set operations with group mux enables conditional processing in parallel while eliminating the overhead of branches and condition code registers.
Group Floating Point Multiply-Add
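The group multiply-add and the compare-set plus mux combination can be sketched in scalar C as follows. This is a behavioral model only: the function names are hypothetical, and whether the hardware rounds once (fused) or twice in the multiply-add is not stated in the text, so the sketch simply uses the C expression `a * b + c`.

```c
#include <stdint.h>
#include <string.h>

typedef struct { float f[4]; } group_f32;     /* four 32-bit floats = 128 bits */
typedef struct { uint32_t u[4]; } group_u32;  /* the same 128 bits as lane masks */

/* d[i] = a[i] * b[i] + c[i] in all four lanes. */
static group_f32 group_fmuladd(group_f32 a, group_f32 b, group_f32 c) {
    group_f32 d;
    for (int i = 0; i < 4; i++) d.f[i] = a.f[i] * b.f[i] + c.f[i];
    return d;
}

/* Compare-set: each lane becomes all ones if a[i] < b[i], else zero. */
static group_u32 group_set_less(group_f32 a, group_f32 b) {
    group_u32 m;
    for (int i = 0; i < 4; i++) m.u[i] = (a.f[i] < b.f[i]) ? 0xFFFFFFFFu : 0u;
    return m;
}

/* Group mux: bitwise select of a where the mask is one, b where it is zero. */
static group_f32 group_mux(group_u32 m, group_f32 a, group_f32 b) {
    group_f32 r;
    for (int i = 0; i < 4; i++) {
        uint32_t ua, ub, ur;
        memcpy(&ua, &a.f[i], sizeof ua);
        memcpy(&ub, &b.f[i], sizeof ub);
        ur = (ua & m.u[i]) | (ub & ~m.u[i]);
        memcpy(&r.f[i], &ur, sizeof ur);
    }
    return r;
}

/* Lane-wise minimum with no per-lane branch and no condition codes:
   the comparison result is data (a mask) that steers the mux. */
static group_f32 group_min(group_f32 a, group_f32 b) {
    return group_mux(group_set_less(a, b), a, b);
}
```

The loops here stand in for hardware lanes that would operate simultaneously. The point of `group_min` is that each lane makes its own decision without any control-flow divergence.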
A processor with a 128-bit data path can perform 16 operations per cycle on 8-bit integer video data in response to one instruction (or block of instructions), followed by 8 operations on 16-bit fixed-point audio data in response to a second, followed by 4 operations on 32-bit floating-point graphics data in response to a third, etc. In other words, the partitioning of the data and execution units varies dynamically, adapting to the varying sizes and types of data as necessary in a given set of multimedia applications.
Media processing also requires parallel rearrangement and rescaling of data, for example as shown in the illustrations below:
Other examples of MicroUnity's first-generation innovations in media processing included: configurable cache- and buffer-type access methods; multiple-priority memory systems for real-time multiprocessing; interfaces and control for media peripherals; handling of exceptions (such as overflow and underflow) in parallel group operations; and a variety of methods for hyperthreading and multi-core media processing. These inventions and others are described in MicroUnity's patents. The net effect of these first-generation innovations was to improve cost and power dissipation by more than an order of magnitude in the processing of digital video, graphics, and communications data using general-purpose microprocessors.
BroadMX: the Next Generation of Software-Upgradeable Broadband
BroadMX is MicroUnity's answer to the question: How will the next order of magnitude of cost/power improvement come about?
BroadMX is a programming model and synthesizable hardware library that makes the most advanced broadband algorithms—including 4G wireless, HD video, and photorealistic 3D graphics—practical to implement and upgrade in software.
BroadMX reduces the instruction count of real-time applications by an order of magnitude with a simple, classical programming model based on new compound operations called BroadOps that extend group operations to 2-D arrays and other compound data structures. By reducing instruction and intermediate result overhead, BroadMX results in a smaller code footprint and lower power consumption over a wide range of performance levels. A typical BroadOp may replace 20 or more first-generation multimedia instructions, while using the bandwidth resources of a single instruction.
For example, the BroadMX convolve operation shown below uses the same number of 128-bit operands used in a single instruction (e.g. the group multiply-add shown above) of a state-of-the-art multimedia CPU. Convolve computes eight 16-bit elements, each extracted from the full 35-bit sum of eight 16-bit products. A multimedia CPU requires 16 multiply-adds, 12 additions and 30 data rearrangements, for a total of 58 instructions to perform the same function with 32 bits of internal precision. Full 35-bit precision would require even more instructions.
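A scalar model of such a convolve might look like the sketch below. The operand layout (16 samples from two registers, 8 coefficients from a third), the indexing direction, and the extraction rule (arithmetic right shift by `shift`, then saturation to 16 bits) are assumptions made for illustration; the text only states that eight 16-bit results are extracted from full-precision sums. The worst-case magnitude of a sum of eight 16x16-bit products is 8 x 2^30 = 2^33, which is consistent with the 35-bit signed sum mentioned above.

```c
#include <stdint.h>

/* Floor division by 2^shift that does not rely on the
   implementation-defined right shift of negative values. */
static int64_t asr64(int64_t v, unsigned shift) {
    return v >= 0 ? v >> shift : ~((~v) >> shift);
}

/* Saturate a wide value to the signed 16-bit range. */
static int16_t sat16(int64_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* x: 16 samples (two concatenated 128-bit registers)
   c: 8 coefficients (one 128-bit register)
   y: 8 results (one 128-bit register)
   Each result is an 8-term sum of products kept at full precision in a
   64-bit accumulator, then reduced to 16 bits only once at the end. */
static void convolve8x16(const int16_t x[16], const int16_t c[8],
                         unsigned shift, int16_t y[8]) {
    for (int i = 0; i < 8; i++) {
        int64_t acc = 0;
        for (int j = 0; j < 8; j++)
            acc += (int64_t)c[j] * x[i + 7 - j];   /* window slides by one per result */
        y[i] = sat16(asr64(acc, shift));
    }
}
```

The 64 multiplies and 56 additions in the loop nest are what a single convolve is described as performing; a CPU limited to element-wise multiply-adds has to spend extra instructions lining up the shifted windows.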
BroadMX includes four classes of BroadOps:
- ENSEMBLE OPS multiply, add, and convolve groups of 8-, 16-, 32-, 64- and 128-bit data types, including integer, polynomial, Galois field, and (optional) floating point. These accelerate filters, correlators, resamplers, error-correction, etc.
- CROSSBAR OPS switch up to 256 bits from two concatenated 128-bit input registers into 128 output bits. These accelerate bit manipulation, scrambling, interleaving, encryption, etc.
- GROUP OPS add, subtract and perform logic on 128-bit "groups" of 8-, 16-, 32-, 64- or 128-bit data, including compound ops such as compare-set, shift-add and 3-input add/subtract and boolean ops. These accelerate basic integer arithmetic and logic decisions.
- WIDE OPS access compound data structures such as matrices, switch arrays and translation tables from memory and retain them in "WideCaches" embedded in the datapath. These accelerate bandwidth-intensive compound ops such as matrix multiplies, bit-granular switching and vector translates.
WideOps and WideCache enable software access to the bandwidth required for the most general BroadMX functions while preserving operand interfaces of existing datapaths. For example, the BroadMX crossbar function has an array of 128 muxes, each needing 8 bits to address 256 bits from two register inputs. As shown below, a "wide switch" instruction uses a third register input as a pointer to memory containing the 128-byte switch control array needed for complete bit-level control of the muxes.
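The data movement performed by the 128 muxes can be modeled bit by bit in C. The function name `wide_switch` and the bit numbering (bit k of the input is bit k % 8 of byte k / 8) are assumptions of this sketch; the sizes (256 input bits, 128 control bytes, 128 output bits) come from the description above.

```c
#include <stdint.h>

/* in:  32 bytes, the two 128-bit register inputs concatenated (256 bits)
   ctl: 128 bytes; ctl[i] is the index (0..255) of the input bit that
        drives output bit i, i.e. one 8-bit address per mux
   out: 16 bytes, the 128-bit result */
static void wide_switch(const uint8_t in[32], const uint8_t ctl[128],
                        uint8_t out[16]) {
    for (int i = 0; i < 16; i++) out[i] = 0;
    for (int i = 0; i < 128; i++) {
        unsigned src = ctl[i];
        unsigned bit = (in[src / 8] >> (src % 8)) & 1u;
        out[i / 8] |= (uint8_t)(bit << (i % 8));
    }
}
```

Because each output bit has its own independent 8-bit address, any selection, duplication or reordering of the 256 input bits can be expressed with one control array.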
Upon first use, wide switch is effectively a CISC instruction that fetches the 128-byte control array using the existing memory bus in multiple cycles. However, a BroadMX WideCache retains this data in a 1024-bit memory accessible to the crossbar matrix. Subsequently, wide switch can reuse the switch address array from the local WideCache and execute as a RISC instruction in a single cycle.
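The first-use versus reuse behavior can be illustrated with a toy single-entry cache model. Everything specific here is an assumption for illustration: the `wide_cache` structure, tagging by pointer value, the 16-byte (128-bit) memory bus width, and the resulting cycle counts. The model also ignores what happens if memory is modified after the array has been cached.

```c
#include <stdint.h>
#include <string.h>

enum { CTL_BYTES = 128,   /* size of the switch control array (from the text) */
       BUS_BYTES = 16 };  /* bytes moved per memory cycle (assumed) */

typedef struct {
    const uint8_t *tag;          /* address the cached array was loaded from */
    uint8_t        data[CTL_BYTES];
    int            valid;
} wide_cache;

/* Make the control array at ptr available to the crossbar and return a
   modeled cycle count: one cycle on a hit, or one cycle plus the bus
   transfers needed to fetch the array on a miss. */
static unsigned wide_cache_access(wide_cache *wc, const uint8_t *ptr) {
    if (wc->valid && wc->tag == ptr)
        return 1;                               /* reuse: single cycle */
    memcpy(wc->data, ptr, CTL_BYTES);           /* first use: multi-cycle fetch */
    wc->tag = ptr;
    wc->valid = 1;
    return 1 + CTL_BYTES / BUS_BYTES;           /* 1 + 8 = 9 cycles in this model */
}
```

In a loop that applies the same switch pattern to many data blocks, only the first iteration pays the fetch cost in this model, which is the behavior the paragraph above describes.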
Without WideCache, a 256x16 crossbar op SELECT16BITS as shown above (*) would be the most powerful operation to enhance a CPU for bit switching within the 3x128-bit input operand limit. Eight of these instructions, along with eight LOADs and seven ROTMERGE, would assemble the 128-bit result of BroadMX wide switch 16 bits at a time in 23 instructions. In fact, current processors lack SELECT16BITS and would have to assemble the output one bit at a time, resulting in a much higher instruction count.
Basic crossbar ops perform rearrangements with high regularity—such as byte permute, shuffle, swizzle, extract, shift, rotate, compress, expand, deposit and withdraw. In all these cases a strip of "expansion logic" generates the 128 8-bit addresses for the crossbar matrix from a single 128-bit input register or immediate data without using a WideCache entry.
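To make the idea of expansion logic concrete, the sketch below generates the 128 mux addresses for two regular rearrangements from compact inputs. The function names are hypothetical, and the convention that entry i names the source bit (0..255) of output bit i is an assumption of the sketch.

```c
#include <stdint.h>

/* Rotate the first 128-bit input left by n bits (n is immediate data):
   output bit i takes input bit (i - n) mod 128. */
static void expand_rotate_left(unsigned n, uint8_t ctl[128]) {
    for (unsigned i = 0; i < 128; i++)
        ctl[i] = (uint8_t)((i + 128 - (n % 128)) % 128);
}

/* Byte permute: sel is one 128-bit register holding 16 byte selectors.
   Output byte j copies input byte sel[j] (0..31) from the two
   concatenated inputs, so all 8 bits of a byte share one selector. */
static void expand_byte_permute(const uint8_t sel[16], uint8_t ctl[128]) {
    for (unsigned i = 0; i < 128; i++)
        ctl[i] = (uint8_t)((sel[i / 8] % 32) * 8 + i % 8);
}
```

In both cases a small input (a shift count or 16 selector bytes) is expanded into the full 128-entry address array, so no 128-byte control array has to be fetched from memory.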
A wide matrix instruction extends ensemble ops much as wide switch extends the crossbar. It produces a 128-bit vector result from an input vector multiplied by a 1024-bit (8x8x16 in the 16-bit case) matrix array accessed indirectly from memory and retained in WideCache accessible to the multipliers within the ensemble unit. The convolve operation shown earlier essentially constructs a matrix in expansion logic that slides an 8-element window one step per row across the 16-element time series formed by concatenating two 128-bit inputs—without using a WideCache entry.
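The relationship between wide matrix and convolve can be sketched as an ordinary matrix-vector product. In the model below, `matvec16` and `build_window_matrix` are hypothetical names, the matrix is stored row-major, the window direction is an assumption, and the final extraction of 16-bit results from the wide sums is omitted.

```c
#include <stdint.h>

/* y[r] = sum over c of m[r * cols + c] * v[c], with full-precision sums.
   With rows = cols = 8 and 16-bit elements, m is the 1024-bit matrix
   described in the text. */
static void matvec16(const int16_t *m, int rows, int cols,
                     const int16_t *v, int64_t *y) {
    for (int r = 0; r < rows; r++) {
        int64_t acc = 0;
        for (int c = 0; c < cols; c++)
            acc += (int64_t)m[r * cols + c] * v[c];
        y[r] = acc;
    }
}

/* Build the 8x16 matrix that convolve implies: row i holds the eight
   coefficients placed so that the window advances one column per row
   across the 16-element series. */
static void build_window_matrix(const int16_t coef[8], int16_t m[8 * 16]) {
    for (int k = 0; k < 8 * 16; k++) m[k] = 0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            m[i * 16 + (i + 7 - j)] = coef[j];
}
```

Calling `matvec16` with an arbitrary 8x8 matrix models wide matrix, where the matrix must come from memory. Calling it with the output of `build_window_matrix` models convolve, where the matrix is fully determined by eight coefficients and therefore can be generated on the fly.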
A wide translate instruction translates each of sixteen bytes of input into the corresponding byte of output through sixteen independent tables, each 8 bits wide by 256 words deep, as shown below. The bank of tables is specified by a pointer and loaded from memory on first use with the same WideCache mechanisms discussed above in the ensemble and crossbar units. An input address mux can pair indices together for 16-, 32- and 64-bit wide tables and mask indices to pack multiple smaller tables (with depths any power of two between four and 256 words) into one WideCache line. A conventional multimedia CPU would require up to 47 instructions to perform the same function.
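A behavioral sketch of the byte-wide case follows. The flat table layout (table i occupies bytes i * 256 to i * 256 + 255), the function names, and the way the masked variant combines a base offset with a masked index are assumptions for illustration, not a description of the actual address mux.

```c
#include <stdint.h>

/* Sixteen independent lookups: byte i of the input indexes table i.
   tables holds 16 tables of 256 bytes each (4096 bytes in total). */
static void wide_translate(const uint8_t in[16], const uint8_t tables[16 * 256],
                           uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = tables[i * 256 + in[i]];
}

/* Masked variant: only the low bits of each index are used, so several
   smaller tables can share the storage. depth is a power of two from
   4 to 256, and base selects which small table within each 256-entry
   slot is addressed (base + depth must not exceed 256). */
static void wide_translate_masked(const uint8_t in[16],
                                  const uint8_t tables[16 * 256],
                                  unsigned base, unsigned depth,
                                  uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = tables[i * 256 + base + (in[i] & (depth - 1))];
}
```

Since every lane has its own table, sixteen different byte-to-byte functions can be applied at once, or the same table can be replicated sixteen times when one function is applied to all lanes.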
BroadMX hardware blocks implemented as synthesizable Verilog are easier to port and integrate with existing SOC design flows than hard or firm core designs. These blocks can be assembled into BroadMX cores configured to meet a wide range of performance, die area and power requirements. Configuration variables include memory sizes, interface widths, issue rates and thread count.
For high-performance applications, instructions per cycle can be scaled from single to four-way issue, and the number of parallel threads can be scaled from one to four threads. For low-power and low-cost applications, the individual units can be configured for a lower issue rate.
For example, the Ensemble Unit configuration options include 'full' 128x128-bit or 'quarter-size' 64x64-bit. The quarter-size E unit can execute all vector ops (up to 32-bit integer and 16-bit complex data sizes) in a single cycle, and basic convolve and matrix ops in four cycles. The quarter-size E unit matches applications that are dominated by vector ops, such as wireless baseband processing.
In a typical four-threaded configuration, the light-weight ALU units can be replicated with the thread state. Each G-ALU is integrated with its private register file, as shown below, so that operand wires for the most frequent operations are as short as possible. The heavy-weight units (multiplier ensemble, switch, table) share the operand bypass network. This structure saves area, reduces contention, and simplifies the software effort required to keep the heavy-weight units well utilized.
BroadMX software is cleanly implemented on a base of C/C++ intrinsic functions covering the mathematical range summarized in the table below. This direct mapping results in extremely efficient compiled code. BroadMX does not require mastering an unconventional programming paradigm.
BroadOps intrinsic functions are powerful blocks of computation. The C code below shows the four block transformations that comprise the innermost loop of the Advanced Encryption Standard (AES). SubBytes is a single wide translate table lookup instruction, ShiftRows and MixColumns merge into a single wide multiply matrix Galois instruction and AddRoundKey is a single group xor instruction. This loop executes Nrnd times (typically ten) to encrypt 128 bits of data. The entire AES inner loop can be unrolled into just three BroadOps per round; one Load per round will also be needed if the keys are fetched from memory. The same function on a multimedia CPU requires about 70 instructions per round.
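As a portable reference for what one full AES round computes, the sketch below is written in plain C rather than with BroadMX intrinsics, whose names are not given here. It is grouped into the same three steps described above: a table lookup (SubBytes), a matrix multiply over GF(2^8) that absorbs the row rotation (ShiftRows plus MixColumns), and an xor (AddRoundKey). The S-box is computed on the fly to keep the listing short, which is far slower than a table and is a choice made only for this illustration. The state is stored column by column, as in FIPS 197.

```c
#include <stdint.h>

/* Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* General multiplication in GF(2^8) by shift-and-add. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}

/* AES S-box: multiplicative inverse (0 maps to 0) followed by the
   affine transform. The inverse is found by brute-force search. */
static uint8_t sbox(uint8_t a) {
    uint8_t inv = 0;
    for (unsigned c = 1; a && c < 256; c++)
        if (gf_mul(a, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
    uint8_t s = inv, r = inv;
    for (int i = 0; i < 4; i++) {
        r = (uint8_t)((r << 1) | (r >> 7));   /* rotate left by one bit */
        s ^= r;
    }
    return s ^ 0x63;
}

/* One full round on a 16-byte state s (s[4*c + r] is row r, column c)
   with round key rk. The final AES round, which omits MixColumns, is
   not covered by this function. */
static void aes_round(uint8_t s[16], const uint8_t rk[16]) {
    uint8_t t[16];
    for (int i = 0; i < 16; i++) t[i] = sbox(s[i]);          /* SubBytes */
    for (int c = 0; c < 4; c++) {                            /* ShiftRows + MixColumns */
        uint8_t a[4];
        for (int r = 0; r < 4; r++) a[r] = t[4 * ((c + r) % 4) + r];
        for (int r = 0; r < 4; r++)
            s[4 * c + r] = (uint8_t)(gf_mul(2, a[r]) ^ gf_mul(3, a[(r + 1) % 4])
                                     ^ a[(r + 2) % 4] ^ a[(r + 3) % 4]);
    }
    for (int i = 0; i < 16; i++) s[i] ^= rk[i];              /* AddRoundKey */
}
```

Each of the three commented steps corresponds to one of the three operations per round mentioned in the text; the scalar loops here expand to many more instructions on a conventional CPU.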
BroadMX continues to evolve—adding BroadOps for streamlining seemingly more intractable portions of communications algorithms—for example, Reed-Solomon error correction. Some of these advances are explained in MicroUnity's patent portfolio.