I will forever be grateful to Bunnie: he pointed me in the direction of murmurhash when I needed something to help with the integrity of a section of memory in a microcontroller. Legend.
Emulating the RPI PIOs instead of the TI PRUs is really a miss.
The PRUs really get a bunch right. Very specifically, the ability to broadside dump the ENTIRE register file in a single cycle from one PRU to the other is gigantic. It's the single thing that allows you to transition the data from a hard real-time domain to a soft real-time domain and enables things like the industrial Ethernet protocols or the BeagleLogic, for example.
Tooling for the RPI PIO design is probably a bit more accessible than the TI PRU situation. I'd say it's not really a miss - more of a necessity given Bunnie's proclivity towards open/available tools. Getting access to architecture details of the TI PRU would necessitate an NDA, would it not?
> Getting access to architecture details of the TI PRU would necessitate an NDA, would it not?
Nope. All the information is right in the publicly available architecture manuals. However, you don't need to copy the PRUs, per se. All this can be done with RISC-V.
The important parts are deterministic execution, the register file sideload between paired processors, and, possibly, single cycle instruction execution. None of these are precluded by using RISC-V.
And, given how large his PIO stuff is, I'd argue it would be better to do this with RISC-V.
What are your thoughts on efficiency? BIO vs PIO implementing, say, a 68k 16-bit-wide bus slave. I know I can support a 66MHz 68K bus clock with PIO at 300MHz. How much clock speed would BIO need?
It depends a lot upon where the processing is happening. For example, you could do something where all the data is pre-processed and you're just blasting bits into a GPIO register with a pair of move instructions. In which case you could get north of 60MHz, but I think that's sort of cheating - you'll run out of pre-processed data pretty quickly, and then you have to take a delay to generate more data.
The 25MHz number I cite as the performance expectation is "relaxed": I don't want to set unrealistic expectations on the core's performance, because I want everyone to have fun and be happy coding for it - even relatively new programmers.
However, with a combination of overclocking and optimization, higher speeds are definitely on the horizon. Someone on the Baochip Discord thought up a clever trick I hadn't considered that could potentially get toggle rates into the hundreds of MHz's. So, there's likely a lot to be discovered about the core that I don't even know about, once it gets into the hands of more people.
I specified slave specifically because slave is a LOT harder. Master is always easy. Waiting for someone else’s clock and then capturing and replying asap is the hard part. Especially if as a slave you need to simulate a read.
On rp2350 it is pio (wait for clock) -> pio (read address bus) -> dma (addr into lower bits of dma source for next channel) -> dma (Data from SRAM to PIO) -> pio (write data to data bus) chain and it barely keeps up.
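For a rough sense of why that chain barely keeps up, here is a back-of-the-envelope budget in Python using only the two figures quoted above (300MHz PIO clock, 66MHz bus clock). The function name is made up for illustration, and the sketch ignores DMA latency and input synchronizer delays, which would only shrink the budget.

```python
def cycles_per_bus_clock(core_hz: float, bus_hz: float) -> float:
    """Core clock cycles available within one period of the external bus clock."""
    return core_hz / bus_hz


budget = cycles_per_bus_clock(300e6, 66e6)
print(f"{budget:.2f} PIO cycles per 68K bus clock")
```

That is roughly four and a half PIO cycles per bus clock period. How many bus clocks a given transaction gets depends on the bus protocol, which is not spelled out in this thread.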
If there's a single rising edge on the bus that you can use as a quantum trigger, then the reads turn into a series of moves into a FIFO, and the response can be quite fast. The quantum-trigger-on-GPIO was provided to solve exactly the problem you described.
Hey, glad to see you here. I'm a huge fan of your projects, and the Baochip was one I didn't see coming. Very nice surprise!
I ordered a few, thinking it would make a good logic analyzer (before the details of the BIO were published). Obviously, it's going to be a stretch with multiple cycles per instruction, and a reduced instruction set. I'll see how far I can push it if I rely on multiple BIOs, perhaps with some tricks such as relying on an external clock signal.
At first glance, they seemed to be perfect for doing some basic RLE or Huffman compression on-the-fly, but I am less sure now, I will have to play with it. Bit-packing may be somewhat expensive to perform, too.
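For reference, the run-length encoding being considered is simple when written out at a high level; the open question is whether a BIO core can afford it per sample. A minimal sketch in Python (a model of the idea, not BIO code; `rle_encode`, `rle_decode`, and the (value, count) output format are assumptions for illustration):

```python
def rle_encode(samples):
    """Collapse consecutive identical samples into (value, run_length) pairs."""
    runs = []
    for s in samples:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1          # same value as the previous sample: extend the run
        else:
            runs.append([s, 1])       # value changed: start a new run
    return [tuple(r) for r in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sample list."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


capture = [0b0001, 0b0001, 0b0001, 0b0011, 0b0011, 0b0001]
print(rle_encode(capture))
```

The per-sample compare-and-branch in the loop is the part that would have to fit in the time between samples.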
One thing stood out to me in this design: that liberal use of the 16 extra registers. It's a very clever trick, but wouldn't some of these be better exposed as memory addresses? Or do you foresee applications where they are in the hot path (where the inability to write immediate values may matter). Stuff like core ID, debug, or even GPIO direction could be hard-wired to memory addresses, leaving space for some extra features (not sure which? General purpose registers? More queues? More GPIOs? A special purpose HW block?).
I really like the "snap to quantum" mechanism: as you wrote, it is good for portability, though there should be a way to query frequency, if portability is really a goal.
Anyway, it's plenty for a v1, plenty of exciting things to play with, including the MMU of the main core!
The core ID definitely didn't need to be in a register, but the elapsed clocks since reset is actually really handy. Having this in the hot path allows me to build a captouch sensor using the BIO, because the clock increment is 1.42ns and even though the rise time of the pad is microseconds you get plenty of resolution at that counting rate.
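To put numbers on that resolution claim, a small worked example in Python. The 1.42ns tick is from the comment above; the two pad rise times are made-up values for illustration, not measurements.

```python
TICK_S = 1.42e-9  # counter increment quoted above


def ticks_elapsed(duration_s: float, tick_s: float = TICK_S) -> int:
    """Whole counter ticks that elapse during duration_s."""
    return int(duration_s / tick_s)


# Hypothetical rise times: an untouched pad vs. one with extra finger capacitance.
untouched = ticks_elapsed(2.0e-6)
touched = ticks_elapsed(3.0e-6)
print(untouched, touched, touched - untouched)
```

Under these assumed values, a one microsecond change in rise time shows up as a difference of about 700 counts, which is why a microsecond-scale signal still gets plenty of resolution.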
I think it will be interesting to see what people end up doing with it and what are the pain points. As you say, it's a v1 - with any luck there will be a v2, so we could consider the time starting now as a deliberation period for what goes into v2.
The good news is that it also all compiles into an FPGA, so proposed patches can be tested & vetted in hardware, albeit at a much slower clock rate.
Ah, thank you for the example, I understand how a linearly-increasing counter can be useful, if you use it that way. It would obviously be more versatile with write access & configurable clock dividers, pre-setters, counting direction, etc. The current design probably allows re-using the counter across cores & minimizes space, so it makes sense to me. I should dig into the RTL when I have a bit of time… Maybe I'll make it my bedside reading?
You could also say it's up to the user to implement a fully-fledged timer/counter in a BIO coprocessor if they need one, though ideally there would be a shared register (or a way to configure the FIFOs depth + make them non-blocking) to communicate the result.
Small cores like these are really fun to play with: the constraints easily fit in your head, and finding some clever way to use the existing HW is very rewarding. Who needs Zachtronics games when you have a BIO or PIO?
I'm currently elbow deep in making a PIO+DMA sprite and tile display renderer.
Losing the high maximum data rate is quite a cost, but in my use case BIO would be the clear winner. Indexed pixel format conversion on PIO means shifting out the high bits of the palette address, then the index, then some zeros. That goes to a FIFO which is read by a DMA simply to write it to the readaddr+trigger of another DMA, which feeds into another FIFO (which is the program doing the transparency).
That, I suspect, becomes a much simpler task with BIO.
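The address being assembled by those shifts can be written as plain bit arithmetic. A sketch in Python, assuming an 8-bit pixel index and 4-byte palette entries (neither is stated above), which means the palette base has to be 1024-byte aligned; `palette_entry_addr` is a hypothetical name:

```python
INDEX_BITS = 8    # assumed: 256-entry palette
ENTRY_SHIFT = 2   # assumed: 4-byte palette entries, so two low zero bits


def palette_entry_addr(palette_base: int, index: int) -> int:
    """High bits of the palette base, then the index, then zero bits."""
    low_bits = INDEX_BITS + ENTRY_SHIFT
    if palette_base & ((1 << low_bits) - 1):
        raise ValueError("palette base must be aligned to the table size")
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index out of range")
    return palette_base | (index << ENTRY_SHIFT)


print(hex(palette_entry_addr(0x2000_0400, 5)))
```

On a general-purpose core this is one shift and one OR, which is the sense in which the task gets simpler than chaining FIFOs and DMA channels to do the same concatenation.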
It is an interesting case, where just knowing that the higher potential rate of the PIO is there is a kind of comfort even when you don't currently need it.
Although for those higher rates it is very rarely reactive and most often just wiggling wires in a predetermined fashion.
I wonder if having a register that can be DMA'd to could perform the equivalent function of side-set to play a fixed sequence to some pins at full clock speed. Like playing macros.
I guess another approach: a 32-bit register could shift out 4 bits of side-set per clock cycle. Then you could pre-program the next 8 cycles in a single 32-bit write. It would give you breathing space to drive the main data while the side-set does fixed pattern signaling.
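The packing in that idea is easy to sketch. A Python model, assuming the hypothetical register shifts out its least significant nibble first (the bit order is an assumption; nothing above specifies it):

```python
def pack_sideset(nibbles):
    """Pack eight 4-bit pin patterns into one 32-bit word, first pattern in bits 3:0."""
    if len(nibbles) != 8:
        raise ValueError("expected exactly 8 patterns")
    word = 0
    for cycle, n in enumerate(nibbles):
        if not 0 <= n <= 0xF:
            raise ValueError("each pattern must fit in 4 bits")
        word |= n << (4 * cycle)
    return word


def play_sideset(word):
    """Yield the 4-bit pattern driven on each of the next 8 clock cycles."""
    for _ in range(8):
        yield word & 0xF
        word >>= 4


pattern = [0x1, 0x3, 0x2, 0x0, 0x1, 0x3, 0x2, 0x0]
word = pack_sideset(pattern)
print(hex(word), list(play_sideset(word)) == pattern)
```

One 32-bit write then covers 8 clock cycles of fixed signaling, so the core only has to refill the register once every 8 cycles.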
I suspect there are tricks to get higher rates, for sure. And hopefully once we see a library of applications forming, we can make informed decisions about what extensions and features would be necessary to enable the next level of I/O performance.
I loved this article and had wanted to play with PIO for a long time (or at least, learn from it through playing!).
One thing jumped out here - I assumed CISC inside PIO had a mental model of "one instruction per cycle" and thus it was pretty easy to reason about the underlying machine (including any delay slots etc...).
For this RISC model using C, we are now reasoning about compiled code which has a somewhat variable instruction timing (1-3 cycles) and that introduces an uncertainty - the compiler and understanding its implementation.
I think this means that the PIO is timing-first, as timing == waveform, whereas BIO is clarity-first with C as the expression and then explicit hardware synchronization.
I like both models! I am wondering, however, about the quantum delays that are being used to set the deadlines - here, human-derived wait delays use knowledge of the compiled instructions to set the timing.
Might there not be a model of 'preparing the next hardware transaction' and then 'waiting for an external synchronization' such as an external signal or internal clock, so we don't need to count the instruction cycles so precisely? On the external signal side, I guess the instruction is 'wait for GPIO change' or something, so the value is immediately ready (int i = GPIO_read_wait_high(23) or something) and the external one is doing the same, but synchronizing (GPIO_write_wait_clock( 24, CLOCK_DEF)) as an alternative to the explicit quantum delays.
This might be a shadow register / latch model in more generic terms - prep the work in shadow, latch/commit on trigger.
The idea of the wait-to-quantum register is that it gets you out of cycle-counting hell at the expense of sacrificing a few cycles as rounding errors. But yes, for maximum performance you would be back to cycle counting.
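One way to picture the rounding cost: if a code path finishes somewhere inside a quantum, the wait pads it out to the next boundary. A toy model in Python, assuming quanta fall on fixed multiples of a cycle count (the real register semantics are not described in this thread, so treat this as an illustration only; both function names are invented):

```python
def snap_to_quantum(cycle: int, quantum: int) -> int:
    """Cycle number of the next quantum boundary at or after `cycle`."""
    return -(-cycle // quantum) * quantum  # ceiling division, then scale back up


def wasted_cycles(cycle: int, quantum: int) -> int:
    """Idle cycles spent waiting for the boundary."""
    return snap_to_quantum(cycle, quantum) - cycle


for path_len in (17, 20, 23):
    print(path_len, snap_to_quantum(path_len, 8), wasted_cycles(path_len, 8))
```

All three path lengths land on the same boundary, so the output timing no longer depends on the exact instruction count, at the cost of up to `quantum - 1` idle cycles.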
That being said - one nice thing about the BIO being open source is you can run the verilog design in Verilator. The simulation shows exactly how many cycles are being used, and for what. So for very tight situations, the open source RTL nature of the design opens up a new set of tools that were previously unavailable to coders. You can see an example of what it looks like here: https://baochip.github.io/baochip-1x/ch00-00-rtl-overview.ht...
Of course, there's a learning curve to all new tools, and Verilator has a pretty steep curve in particular. But, I hope people give the Verilator simulations a try. It's kind of neat just to be able to poke around inside a CPU and see what it's thinking!
Correct, actually most programs I've written for the BIO are in assembly.
The C compiler support is a relatively recent addition, mostly to showcase the possibilities of doing high-level protocol offloading into the BIO, and the tooling benefits of sticking with a "standard" instruction set.
Very much looking forward to playing with the BIO functionality on the Baochips that I have ordered. Thanks for the nice write-up!
It is fascinating to see how widely applicable the "just throw a RISC-V core or 4 in there" design pattern is. The wide range of CPU designs that are standardized, the number of mature open source implementations, the lack of royalty fees, and the ready-to-run programming toolchains really drive this to a new level. And CPUs are small in die area anyway compared to SRAM! Was cool to see on the RP2350 how they just threw in another two RISC-V cores next to the ARMs.
For these reasons specified above, I think that this trend will continue. For example, in my specialization of edge machine learning, we are seeing MEMS sensors that integrate user programmable DSP+ML+CPU right there on the sensor chip.
This is actually super cool, you can use those as both math accelerators and as io, and them being in lockstep you can kind of use them as int only shader units. I don't know how this is useful yet.
Btw I am curious about edge cases. Maybe I missed it in the article, but what is the size of the FIFO?
Or the more dangerous part: timing is now complex to determine for complex cases. Each read from the FIFO is like an ISR, and you only have as many instructions as fit before the next read from the FIFO; otherwise you stall the system, and that looks too hard to debug to me.
FIFO is 8-deep. I did fail to mention that explicitly in the article, I think. The depth is so automatic to me that I forget other people don't know it.
The deadlock possibilities with the FIFO are real. It is possible to check the "fullness" of a FIFO using the built-in event subsystem, which allows some amount of non-blocking backpressure to be had, but it does incur more instruction overhead.
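The trade-off can be modelled in a few lines. A Python sketch of an 8-deep FIFO with a check-then-push that refuses instead of stalling; `BioFifoModel` and its methods are invented for illustration and do not correspond to the actual event subsystem interface:

```python
from collections import deque


class BioFifoModel:
    DEPTH = 8  # depth stated above

    def __init__(self):
        self._q = deque()

    def is_full(self) -> bool:
        return len(self._q) >= self.DEPTH

    def try_push(self, value) -> bool:
        """Non-blocking push: returns False instead of stalling when full."""
        if self.is_full():
            return False
        self._q.append(value)
        return True

    def pop(self):
        """Remove and return the oldest entry."""
        return self._q.popleft()


fifo = BioFifoModel()
accepted = [fifo.try_push(i) for i in range(10)]
print(accepted.count(True), accepted.count(False))
```

The extra `is_full` check on every push is the model's stand-in for the instruction overhead mentioned above: the producer never stalls, but it pays for the test each time and has to decide what to do with rejected data.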
I appreciate the intro, motivation and comparison to the PIO of the RP2040/2350. How would this compare to the (considerably older, slower, but more flexible) Parallax P8X32A ("Propeller")?
IIRC the Propeller is an eight thread barrel CPU with the same number of pipeline stages. So it "retires" just one instruction per cycle. All PIO state machines can run every cycle so they should be considered very small CPU cores. You can think of them as channel I/O co-processors for a microcontroller instead of a mainframe.
> Above is the logic path isolated as one of the longest combination paths in the design, and below is a detailed report of what the cells are.
which is an argument that "fpga_pio" is badly implemented or that PIO is unsuitable for FPGA impls. Real silicon does not need to use a shitton of LUT4s to implement this logic and it can be done much more efficiently and closes timing at higher clocks (as we know since PIO will run near a GHz)
The large area usage was a surprise. But is the real PIO also this huge?
My point is, maybe this is one of those designs that blow up in FPGA. Or maybe the open source version of the PIO is simply not as area efficient as the rpi version?
> The build script compiles C code down to a clang intermediate assembly, which is then handed off to a Python script that translates it into a Rust macro which is checked into Xous as a buildable artifact using its pure-Rust toolchain.
Ah yes, the good ol “we solved the C problem by turning it into four other problems” pipeline
Anyway, great work Bunnie!
Have some on the way! Can't wait!