Programmable Logic In MCUs Can Provide Big Performance Boost
Mark Hastings, Cypress Semiconductor
To choose what will work best for your project, you need to understand how different MCU vendors implement internal programmable logic.
However, even with these feature-rich, inexpensive microcontrollers, there will always be project-specific hardware interfaces or new third-party interfaces that the microcontroller vendors cannot quickly support. Often, this requires designers to use external hardware or to implement the interface in firmware by "bit-banging." The bit-banging approach uses firmware to toggle an input/output (I/O) port, usually to implement a serial interface. I also use the term for firmware that monitors a port to decode incoming serial data. Using either external hardware or bit-banging to implement an interface adds cost to a design. The cost of additional external hardware is obvious, but using software to implement a serial interface may also require a faster, and therefore more expensive, CPU.
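To make the cost of bit-banging concrete, here is a minimal sketch in C of a firmware-driven clock/data write. The `soft_port_t` pin model and all function names are hypothetical stand-ins for real GPIO register writes; the point is only that the CPU executes every edge itself.

```c
#include <stdint.h>

/* Hypothetical pin model: a real port would write a GPIO register here. */
typedef struct {
    uint8_t clk;        /* current clock level */
    uint8_t data;       /* current data level */
    uint8_t captured;   /* what a receiver sampling on rising CLK would see */
    unsigned edges;     /* number of rising clock edges produced */
} soft_port_t;

static void set_data(soft_port_t *p, uint8_t level)
{
    p->data = level & 1u;
}

static void set_clk(soft_port_t *p, uint8_t level)
{
    level &= 1u;
    if (level && !p->clk) {           /* rising edge: receiver samples data */
        p->captured = (uint8_t)((p->captured << 1) | p->data);
        p->edges++;
    }
    p->clk = level;
}

/* Bit-bang one byte, MSB first. The CPU is busy for the whole transfer. */
void bitbang_write_byte(soft_port_t *p, uint8_t value)
{
    for (int bit = 7; bit >= 0; --bit) {
        set_data(p, (uint8_t)((value >> bit) & 1u));
        set_clk(p, 1);
        set_clk(p, 0);
    }
}
```

Every bit costs the CPU several instructions and nothing else can run in between without disturbing the timing, which is why a faster CPU may be needed.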
Most common MCUs today support SPI, UART, and I2C interfaces, but there are still many times when some internal user-programmable logic can be very useful. Several companies – including Atmel, Cypress, Microchip, and NXP – have added some user-definable logic to their parts to address some of these problems. These devices are mainly MCUs with added logic. The CPU is still the main workhorse and the added logic is used to make the CPU more efficient. These device types are usually found in cost-sensitive products, but are also used as small co-processors that take low-level tasks off the main processor for greater efficiency.
FPGAs, on the other hand, have been moving toward a similar target but from the other direction. Xilinx and Altera have been adding hard and soft core processors for many years to create System-on-Chip (SoC) devices. The FPGA approach is usually more expensive, but when a project requires significant custom logic, it can be cost-effective. These devices are invaluable for prototyping ASICs, implementing low-volume products where time to market is key, and creating larger products that require ongoing hardware flexibility.
Both of these device types – MCUs with logic and FPGAs with CPUs – provide hardware flexibility in the field. Once Flash-based devices became common, field upgrades became the norm. At first the designer could upgrade just the firmware, but now both hardware (logic) and firmware can easily be upgraded in the field. Devices from computer mice to high-speed network routers can be field-upgradable, with the ability to reprogram both the firmware and hardware in a single device.
To make the best decision for your project, it is important to understand how each of these vendors has implemented the internal programmable logic. Just a simple internal AND or OR gate may be sufficient to eliminate an external component or to improve CPU performance. All four approaches allow input and output signals to be gated by the custom logic. This can be used to gate an input with a clock so that a counter can be used to measure an external clock frequency. One simple example supported by each of the four types of logic blocks is modulating the output of a UART for IR communication. What matters is not just that there is an internal AND gate, as shown in Figure 1, but that the signals from a clock or counter and the UART TX output can be routed to that AND gate.
Atmel XMEGA Custom Logic (XCL)
The Atmel XCL block contains two LUT (lookup table) blocks along with two 8-bit Timer/Counter blocks (Figure 2). The LUTs do not have to be connected to the timer/counter blocks, but can instead be connected to the UART, EVENT blocks, or I/O pins. The two LUTs can be used as two independent 2-input units or as a single 3-input unit. The logic blocks can be configured as combinational logic, such as AND, NAND, OR, NOR, XOR, XNOR, NOT, and MUX functions. Alternatively, they can be configured for sequential logic functions such as a D-type flip-flop, D-type latch, or RS latch. Because they can connect to other logic blocks and I/O pins, these simple LUT blocks can easily save the designer from using some external logic or save CPU cycles. For example, the Atmel XCL Module Application Note (AT01084) explains how the XCL module can be configured to provide Manchester coding to the output of a UART without extra CPU overhead or external logic. The AVR XMEGA E devices contain one XCL block.
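A 2-input LUT is just a four-entry truth table, which is why one block can serve as any of the gates listed above. The following is a software model in C, not the XCL register interface; the type, constants, and function names are all hypothetical. It also shows the IR modulation case mentioned earlier, where a carrier is gated by the UART TX level.

```c
#include <stdint.h>

/* A 2-input LUT is a 4-entry truth table: bit n of `table` is the
 * output for inputs (in1, in0) read as the 2-bit number n. */
typedef struct {
    uint8_t table;
} lut2_t;

#define LUT2_AND  0x8u  /* 1000b: high only when both inputs are 1 */
#define LUT2_OR   0xEu  /* 1110b: low only when both inputs are 0 */
#define LUT2_XOR  0x6u  /* 0110b: high when the inputs differ */
#define LUT2_NAND 0x7u  /* 0111b: inverse of AND */

uint8_t lut2_eval(lut2_t lut, uint8_t in1, uint8_t in0)
{
    uint8_t index = (uint8_t)(((in1 & 1u) << 1) | (in0 & 1u));
    return (uint8_t)((lut.table >> index) & 1u);
}

/* Gate a carrier clock with a UART TX level using the AND table. */
uint8_t ir_modulate(uint8_t carrier, uint8_t uart_tx)
{
    lut2_t and_gate = { LUT2_AND };
    return lut2_eval(and_gate, carrier, uart_tx);
}
```

In hardware the same lookup happens continuously with no CPU involvement; changing the gate type means changing only the table value.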
Microchip Configurable Logic Cell (CLC)
The Microchip CLC allows the user to select up to four signals from eight inputs. These inputs can be a combination of two I/O pins, internal clocks, peripherals, or register bits. These four signals are then routed into a programmable logic block. The logic block can be programmed as one of eight functions: AND-OR, OR-XOR, 4-input AND, S-R latch, or one of four flip-flop configurations. Microchip provides some examples in their Configurable Logic Cell Tips 'n Tricks application note. One of these examples shows how these blocks can be configured to decode quadrature signals. Without these extra logic blocks, the CPU would need to sample the signals at a much higher rate to eliminate false rotations when one of the signals is between transitions; with these logic blocks, the CPU is only interrupted on a valid transition. The Microchip PIC10(L)F320/322 devices each contain one CLC block, which contains four of the blocks shown in Figure 3.
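To show what "a valid transition" means for quadrature signals, here is a firmware sketch in C of the usual transition-table decoder. It is not the CLC configuration from the application note; the names are hypothetical, and the choice of which direction counts as positive is an assumption.

```c
#include <stdint.h>

/* Index = (previous AB << 2) | current AB. Entries are +1 or -1 for a
 * valid single step and 0 for no change or an invalid transition where
 * both signals appear to change at once. */
static const int8_t QUAD_STEP[16] = {
     0, +1, -1,  0,
    -1,  0,  0, +1,
    +1,  0,  0, -1,
     0, -1, +1,  0
};

typedef struct {
    uint8_t prev;       /* last sampled AB state, A in bit 1 */
    int32_t position;   /* accumulated step count */
} quad_t;

/* Feed one sample of the A and B signals; returns the step applied. */
int8_t quad_update(quad_t *q, uint8_t a, uint8_t b)
{
    uint8_t cur = (uint8_t)(((a & 1u) << 1) | (b & 1u));
    int8_t step = QUAD_STEP[(q->prev << 2) | cur];
    q->prev = cur;
    q->position += step;
    return step;
}
```

Done in firmware, `quad_update` has to be called fast enough to catch every edge. Logic in hardware can make the same decision and interrupt the CPU only when the step is nonzero.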
NXP Pattern Match Engine
NXP has user configurable logic called a Pattern Match Engine (Figure 4). Up to eight GPIOs (general-purpose input/outputs) can be selected as inputs to this logic block. These eight inputs can be used to generate product terms for a complex Boolean expression. The outputs can be used to trigger an interrupt, drive a special I/O pin, or be routed to the next logic block or slice. Other features allow edge detection for rising or falling signals and inverting signals. This extra hardware can be configured to interrupt the CPU only when a complex (or simple) combination of events occurs. The NXP LPC81x devices contain one Pattern Match Engine.
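One way to picture a Boolean expression built from product terms is as a list of mask/value pairs that are ORed together. The sketch below is a generic sum-of-products model in C and does not reflect the LPC81x register layout; `product_term_t` and `pattern_match` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One product term over eight inputs: `mask` selects the inputs that
 * matter and `value` gives the level each selected input must have. */
typedef struct {
    uint8_t mask;
    uint8_t value;
} product_term_t;

/* True when any product term matches the current input levels. */
bool pattern_match(const product_term_t *terms, size_t count, uint8_t inputs)
{
    for (size_t i = 0; i < count; ++i) {
        if ((inputs & terms[i].mask) == terms[i].value) {
            return true;
        }
    }
    return false;
}
```

For example, the terms `{0x03, 0x03}` and `{0x84, 0x80}` together express (IN0 AND IN1) OR (IN7 AND NOT IN2). Evaluated in hardware, an expression like this reaches the CPU only as a single interrupt.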
Cypress PSoC Universal Digital Block (UDB)
Cypress has taken a more advanced approach using Universal Digital Blocks or UDBs (Figure 5). This approach puts Cypress somewhere between the three vendor approaches discussed above and FPGAs. A UDB consists of two 12C4 PLDs and a DataPath. The PLDs can be used to control the operation and the data flow of the DataPath, or to implement general-purpose logic and state machines. The DataPath is a custom block that provides 8-bit functions (AND, OR, ADD, SUB, INC, DEC, XOR), shift functions (Left and Right), and two 4-byte FIFOs to queue data. These blocks can also be combined to provide 16-, 24-, or 32-bit functions.
The inputs and outputs of the UDBs can be routed to any of the GPIOs, peripheral/component I/Os, interrupts, etc. The interconnection between the UDBs, fixed function analog and digital blocks, and GPIOs is handled with the Digital Signal Interface (DSI). The UDBs can be programmed to implement anything from a simple logic gate, to counters and PWMs, to common serial interfaces like UARTs and I2C peripherals.
Users are given several options for implementing a design using UDBs. Primitive logic gates can simply be placed in a schematic, or the design may be implemented in Verilog. A custom graphical interface is also provided to implement a design without the use of Verilog. The PSoC 3, PSoC 4, and PSoC 5 families of microcontrollers may contain between 4 and 24 of these UDB logic blocks.
Example custom interface
Recently, I noticed some interesting RGB LED drivers made by Worldsemi. The part numbers are WS2811, WS2812, and WS2812B, but they all work the same way. Using a single GPIO pin, you can drive 1,000 or more of these parts. The idea that I could control so many RGB LEDs with a single GPIO pin was intriguing. The interface to these parts is not SPI, UART, or I2C, but a custom interface. The communication signal is asynchronous, with each bit starting with a rising edge. A data 1 or 0 is determined by the length of the high pulse, as shown in Figure 6.
Each RGB LED requires 24 bits of data: 8 bits for each color (red, green, blue) in the format shown in Figure 7.
The WS2811/12 parts have DIN (Data In) and DOUT (Data Out) signals so they can simply be connected in series ("daisy-chained"). Each part keeps the first 24 bits that it sees on its DIN pin and retransmits any additional data out on its DOUT pin, as illustrated in Figure 8.
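The framing and daisy-chain behavior can be modeled in a few lines of C. This is a sketch with hypothetical function names. The green-red-blue byte order is an assumption taken from the WS2812 datasheet, since the order is only given in Figure 7, so check it against the part you are using.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack one LED's color into a 24-bit word that is sent MSB first.
 * ASSUMPTION: green-red-blue order, as in the WS2812 datasheet. */
uint32_t ws_pack(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)g << 16) | ((uint32_t)r << 8) | (uint32_t)b;
}

/* Model of the chain: device 0 keeps the first word that arrives on its
 * DIN and forwards everything after it on DOUT, device 1 does the same
 * with what it receives, and so on.
 * Returns the number of devices that received data. */
size_t ws_chain(const uint32_t *stream, size_t n_words,
                uint32_t *latched, size_t n_devices)
{
    size_t dev = 0;
    while (dev < n_devices && n_words > 0) {
        latched[dev++] = stream[0];  /* keep the first 24 bits seen */
        stream++;                    /* the rest is retransmitted */
        n_words--;
    }
    return dev;
}
```

The net effect is that the n-th 24-bit word in the stream ends up in the n-th device, so the firmware only has to send one flat buffer.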
Once the data stream starts, each part expects to see a continuous stream of data. If the data signal does not have a rising edge for more than 50 µs, the part latches the data, and starts to look for the next 24 bits of data as its own, and retransmits the rest (Figure 9).
As you can see, this is not very complicated, but it doesn't adapt easily to standard microcontroller hardware. I did find instances where people used an SPI interface to maintain the timing, but it required at least three SPI data bits per actual data bit and was not a big advantage over bit-banging for CPU performance. Since the timing is so tight, if you were to implement the interface in firmware (bit-banging), the CPU would be dedicated to doing nothing but toggling the drive signal until all the LEDs are updated. This means that you would have to disable all interrupts and not respond to any other inputs for the duration.
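For reference, the SPI workaround can be sketched as follows. The specific patterns used here (110 for a 1 and 100 for a 0) are an assumption based on the common three-bits-per-bit approach, and the function name is hypothetical. The SPI clock would have to be chosen so that three SPI bits span one LED bit time.

```c
#include <stdint.h>

/* Expand one data byte into 24 SPI bits (3 bytes), MSB first.
 * ASSUMPTION: a data 1 is sent as 110 and a data 0 as 100, so every
 * bit starts with a rising edge and differs only in high-pulse width. */
void ws_spi_encode_byte(uint8_t value, uint8_t out[3])
{
    uint32_t bits = 0;
    for (int i = 7; i >= 0; --i) {
        bits <<= 3;
        bits |= ((value >> i) & 1u) ? 0x6u : 0x4u;
    }
    out[0] = (uint8_t)(bits >> 16);
    out[1] = (uint8_t)(bits >> 8);
    out[2] = (uint8_t)bits;
}
```

The buffer triples in size (9 SPI bytes per LED), and the CPU still has to keep the SPI transmitter fed without gaps, which is why the gain over bit-banging is small.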
For example, if you have 1,000 LEDs in the chain, the update time would be (0.40 µs + 0.85 µs) × 24 bits × 1,000 LEDs = 30,000 µs or 30 ms. This may not be a problem for an occasional update, but if you wanted to update the LEDs at a 30 Hz rate, about 30 ms of every 33 ms frame period would be spent sending data, so you would be using roughly 90% of the CPU, which is almost all of it!
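The same arithmetic can be written as a small helper for trying other chain lengths and refresh rates. This is a sketch in C with hypothetical names; it uses the 1.25 µs bit time from the calculation above and ignores the idle gap between frames.

```c
#include <stdint.h>

#define WS_BIT_TIME_NS   1250u  /* 0.40 us + 0.85 us per bit */
#define WS_BITS_PER_LED  24u

/* Time to send one full frame to `leds` devices, in nanoseconds. */
uint64_t ws_frame_ns(uint32_t leds)
{
    return (uint64_t)WS_BIT_TIME_NS * WS_BITS_PER_LED * leds;
}

/* Percentage of each second spent sending data when the whole chain
 * is refreshed `refresh_hz` times per second by a bit-banging CPU. */
unsigned ws_busy_percent(uint32_t leds, uint32_t refresh_hz)
{
    /* frame_ns * hz / 1e9 seconds busy per second, times 100 */
    return (unsigned)(ws_frame_ns(leds) * refresh_hz / 10000000u);
}
```

With 1,000 LEDs, `ws_frame_ns` gives 30,000,000 ns (30 ms), and at 30 Hz `ws_busy_percent` gives 90.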
When writing code, I like to stay away from blocking code or conditions where I have to disable interrupts for an extended length of time. This is especially true if I have a user interface or communication with other processors in my system. Thus, my goal was to offload the CPU as much as possible and allow the hardware to do most of the work, just as common internal UART or SPI blocks do today.
The Cypress PSoC UDBs have a couple of nice features that make this rather easy. Each UDB has two 4-byte FIFOs and a shifter in the DataPath. When implementing a UART, one FIFO can be used as a Tx buffer and the other as an Rx buffer. In the case of my WS2811/12 driver, I would only need one output FIFO and the shifter. I decided to configure the hardware to generate an interrupt once for each device. The interrupt handler would load the 24 bits (3 bytes) into the FIFO and return, and would not run again until the FIFO is empty. This way I would get an interrupt every 30 µs (24 bits × 1.25 µs) instead of having to disable all interrupts and bit-bang with a timing resolution of 150 ns or better. The microcontroller I decided to use was the Cypress PSoC CY8C4245AXI. This part costs only about $1 in quantity, has four UDBs, and runs at 48 MHz, which means it has more than enough speed and hardware to do the trick.
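The interrupt scheme can be sketched in plain C as follows. This is a software model only: the FIFO structure, driver state, and function names are hypothetical and are not the PSoC Creator API. `ws_hw_drain` stands in for the shifter emptying the FIFO.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4u  /* each UDB FIFO holds 4 bytes */

/* Hypothetical software model of one DataPath FIFO. */
typedef struct {
    uint8_t data[FIFO_DEPTH];
    uint8_t count;
} fifo4_t;

static bool fifo_push(fifo4_t *f, uint8_t byte)
{
    if (f->count >= FIFO_DEPTH) {
        return false;               /* full: nothing more fits */
    }
    f->data[f->count++] = byte;
    return true;
}

/* Hypothetical driver state: the frame buffer and a read position. */
typedef struct {
    const uint8_t *frame;   /* 3 bytes per LED */
    size_t length;          /* total bytes in the frame */
    size_t pos;             /* next byte to send */
    fifo4_t fifo;
} ws_driver_t;

/* Interrupt handler body: runs when the FIFO has drained, queues the
 * next LED's 3 bytes, and returns. Returns true while data remains. */
bool ws_on_fifo_empty(ws_driver_t *d)
{
    for (int i = 0; i < 3 && d->pos < d->length; ++i) {
        if (!fifo_push(&d->fifo, d->frame[d->pos])) {
            break;
        }
        d->pos++;
    }
    return d->pos < d->length;
}

/* Stand-in for the hardware shifting out the queued bytes. */
void ws_hw_drain(ws_driver_t *d)
{
    d->fifo.count = 0;
}
```

The handler does a few byte copies per LED and then returns, so the CPU is free for roughly 30 µs between interrupts instead of being locked in a timing loop.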
My design used two of the four UDBs in the PSoC. One UDB was used to buffer the data with the FIFO and shift the data so that all 24 bits (3 bytes) per device could be written at once. The second UDB was used to create a PWM (pulse width modulator) with two compare outputs. One created a logic zero and the other created a logic one. The serial data was used to control the digital multiplexer to select the one or zero waveform. The block diagram is shown in Figure 10.
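The PWM-plus-multiplexer idea can be modeled tick by tick. In this C sketch the names are hypothetical and the compare values are assumptions: 60 ticks gives the 1.25 µs bit time at 48 MHz, while the two high times (about 0.40 µs and about 0.8 µs) should be taken from the datasheet timing for the actual part.

```c
#include <stdint.h>

#define PWM_PERIOD_TICKS 60u  /* 1.25 us bit time at 48 MHz */
#define CMP_ZERO_TICKS   19u  /* ASSUMED high time for a 0, about 0.40 us */
#define CMP_ONE_TICKS    38u  /* ASSUMED high time for a 1, about 0.8 us */

/* Output level at a given tick within the bit period. Both compare
 * outputs run all the time; the data bit only selects between them. */
uint8_t ws_output(uint8_t tick, uint8_t data_bit)
{
    uint8_t wave_zero = (uint8_t)(tick < CMP_ZERO_TICKS);
    uint8_t wave_one  = (uint8_t)(tick < CMP_ONE_TICKS);
    return data_bit ? wave_one : wave_zero;   /* 2:1 multiplexer */
}

/* Count how many ticks of one bit period the output is high. */
unsigned ws_high_ticks(uint8_t data_bit)
{
    unsigned high = 0;
    for (uint8_t tick = 0; tick < PWM_PERIOD_TICKS; ++tick) {
        high += ws_output(tick, data_bit);
    }
    return high;
}
```

Both waveforms start high at tick 0, so every bit begins with a rising edge regardless of the data, and the shifter output simply steers the multiplexer.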
Most of the product terms of the four PLDs (two per UDB) in the UDBs were used to control the DataPaths, generate the interrupts, and to provide status and control functions, but this still only used half of the UDB resources in this inexpensive microcontroller.
The next thing to do is to find out just how much CPU overhead this extra hardware has saved the design. For example, consider an array of 1,000 LEDs that needs to be refreshed at 30 Hz. This would consume almost 100% of the CPU's resources if the implementation employed firmware to bit-bang the interface. Using the programmable hardware in the PSoC device still causes an interrupt to occur every 30 µs, which is significant, but not bad for an ARM Cortex-M0 running at 48 MHz. To test the overhead, I created a simple loop where the display was refreshed at about 30 Hz. In the main loop, I toggled a pin and used a scope to count toggles over a 40 ms period. Then I disabled interrupts, ran the project again, and compared the results. The continuous display refresh only used about 12% of the CPU, as compared to almost 100% using bit-banging in firmware. This leaves 88% of the CPU cycles for external communication and user interface. If DMA (direct memory access) were added to the design, the overhead would probably be reduced from 12% down to 2% or less. I used the least expensive PSoC (~$1) that contained UDBs but did not have DMA; however, several of the larger parts do include DMA.
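The toggle-count comparison reduces to a simple ratio. The helper below is a sketch in C with a hypothetical name, and the counts in the usage note are made-up illustrative numbers, not measurements from the project.

```c
#include <stdint.h>

/* Estimate interrupt overhead from two main-loop toggle counts taken
 * over the same time window: one with the refresh interrupts running
 * and one with them disabled. Returns a rounded percentage. */
unsigned cpu_overhead_percent(uint32_t toggles_with_isr,
                              uint32_t toggles_without_isr)
{
    if (toggles_without_isr == 0 || toggles_with_isr > toggles_without_isr) {
        return 0;   /* no baseline, or no measurable slowdown */
    }
    uint64_t lost = (uint64_t)(toggles_without_isr - toggles_with_isr);
    return (unsigned)((lost * 100u + toggles_without_isr / 2u)
                      / toggles_without_isr);
}
```

For instance, if the loop managed 10,000 toggles with interrupts disabled and 8,800 with the refresh running, `cpu_overhead_percent(8800, 10000)` gives 12.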
A real RGB LED "signboard" panel with a grid of 60×16 = 960 LEDs was actually created and used to test out the component (Figure 11). It worked as expected and can be used as a graphics interface with rudimentary line, rectangle, circle, and text support.
Whether you have a large LED panel or a simple custom interface, some internal programmable hardware can make a huge impact on your design performance. Not every custom interface will require a multi-byte FIFO or a full hardware state machine, but having this flexibility can give you more design options, the ability to improve performance, and the ability to quickly adapt an existing design to address the requirements of a new product.