Digital Audio Synthesis for Dummies: Part 3
Efficiently streaming audio to speakers on embedded systems (with examples in STM32).
Slap a timer, DMA, and DAC together, and BAM—non-blocking audio output!
— TrebledJ, 2022
Ah… embedded systems—the intersection of robust hardware and versatile software.
This is the third (and culminating) post in a series on digital audio synthesis; but the first (and only?) post touching embedded hardware. In the first post, we introduced basic concepts on audio processing. In the second post, we dived into audio synthesis and generation. In this post, we’ll discover how to effectively implement an audio synthesiser and player on an embedded device by marrying hardware (timers, DACs, DMA) and software (double buffering plus other optimisations).1
To understand these concepts even better, we’ll look at examples on an STM32. These examples are inspired by a MIDI keyboard project I previously worked on. I'll be using an STM32F405RGT board in the examples. If you plan to follow along with your own board, make sure it's capable of timer-triggered DMA and DAC. An oscilloscope would also be handy for checking DAC output.
This post is much longer than I expected. My suggested approach to reading it is to first gain a high-level understanding (possibly skipping the nitty-gritty examples), then dig into the examples for details.
Timers ⏰
It turns out kitchens and embedded systems aren’t that different after all! Both perform input and output, and both have timers! Who knew?
Tick Tock
Timers in embedded systems are similar to those in the kitchen: they tick for a period of time and signal an event when finished. However, embedded timers are much fancier than kitchen timers, making them immensely useful in various applications. They can trigger repeatedly (via auto-reload), count up/down, and be used to generate rectangular (PWM) waves.
So what makes timers tick?
The MCU clock!
The MCU clock is the backbone of a controller. It controls the processing speed and pretty much everything!—timers, ADC, DAC, communication protocols, and whatnot.
The clock runs at a fixed frequency (168MHz on our board). By dividing it down, we can achieve lower frequencies.
The following diagram illustrates how the clock signal is divided on an STM32. There are two divisors: the prescaler and auto-reload (aka counter period).
Here, the clock signal is first divided by a prescaler of 2, then further "divided" by an auto-reload of 6. On every overflow (arrow shooting up), the timer triggers an interrupt. In this case, the timer runs at $\frac{1}{12}$ the speed of the clock.
These interrupts can trigger functionality such as DMA transfers (explored later) and ADC conversions.4 Heck, they can even be used to trigger other timers!
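To make the divide-by-12 example concrete, here's a small host-side sketch (plain C, nothing STM32-specific) that counts overflows tick by tick. The function name `count_overflows` is made up for illustration, and the divisors are the "human" values 2 and 6 rather than register values.

```c
#include <assert.h>
#include <stdint.h>

/* Count how many overflow events a timer produces over `clock_ticks`
 * clock ticks, given a prescaler divisor and an auto-reload divisor. */
uint32_t count_overflows(uint32_t clock_ticks,
                         uint32_t prescaler_div,
                         uint32_t reload_div)
{
    assert(prescaler_div > 0 && reload_div > 0);

    uint32_t prescaler_count = 0; /* clock ticks since the last counter step */
    uint32_t counter = 0;         /* the timer's counter */
    uint32_t overflows = 0;

    for (uint32_t t = 0; t < clock_ticks; ++t) {
        if (++prescaler_count == prescaler_div) {
            prescaler_count = 0;          /* prescaler emits one counter step */
            if (++counter == reload_div) {
                counter = 0;              /* auto-reload: wrap and signal */
                ++overflows;
            }
        }
    }
    return overflows;
}
```

With a prescaler of 2 and an auto-reload of 6, 120 clock ticks produce 10 overflows: one for every 12 ticks.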
Further Reading:
- How do microcontroller timers work?
- Getting Started with STM32: Timers and Timer Interrupts
- There are more prescalers behind the scenes! (APB2)
Example: Initialising the Timer
Suppose we want to send a stream of audio output. We can use a timer with a frequency set to our desired sample rate.
We can derive the prescaler (PSC) and auto-reload (ARR) by finding integer factors that satisfy the following relationship.
$$ \text{freq}_\text{timer} = \frac{\text{freq}_\text{clock}}{(\text{PSC} + 1) \times (\text{ARR} + 1)} $$
where $\text{freq}_\text{timer}$ is the timer frequency (or specifically in our case, the sample rate) and $\text{freq}_\text{clock}$ is the clock frequency.
On our STM32F405, we configured $\text{freq}_\text{clock}$ to the maximum possible speed: 168MHz. If we’re aiming for an output sample rate of 42,000Hz, we’d need to divide our clock signal by 4,000, so that we correctly get $\frac{168,000,000}{4,000} = 42,\!000$. For now, we’ll choose register values of PSC = 0 and ARR = 3999.
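The relationship above is easy to sanity-check in code. Below is a minimal sketch in plain C; `timer_frequency_hz` is a hypothetical helper name, not something from ST's libraries.

```c
#include <assert.h>
#include <stdint.h>

/* freq_timer = freq_clock / ((PSC + 1) * (ARR + 1))
 * psc and arr are register values, hence the +1 on each.
 * Integer division: the result is truncated to whole Hz. */
uint32_t timer_frequency_hz(uint32_t clock_hz, uint16_t psc, uint16_t arr)
{
    assert(clock_hz > 0);
    const uint64_t divisor = ((uint64_t)psc + 1u) * ((uint64_t)arr + 1u);
    return (uint32_t)(clock_hz / divisor);
}
```

Calling `timer_frequency_hz(168000000u, 0, 3999)` gives 42000, matching the sample rate we were aiming for.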
Why do we add $+1$ to the PSC and ARR in the relationship above?
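On the STM32F4, PSC and ARR are 16-bit registers, meaning they range from 0 to 65,535.5 A divisor of 0 would be meaningless, so a register value of $n$ is treated as a divisor of $n + 1$. This way, no register value is wasted and an invalid divisor can't be configured by accident.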
So in this page, when we say PSC = 0, we actually mean a prescaler divisor of 1.
Why 0 and 3999 specifically?
Other pairs of PSC and ARR can also work. We can choose any PSC and ARR which get us to our desired timer frequency. Play around and try different pairs of PSC and ARR!
Exercises for the reader:
- What is the difference between different pairs, such as PSC = 0, ARR = 3999 vs. PSC = 1, ARR = 1999? (Hint: counter.)
- Is there a PSC/ARR pair that is "better"?6
We can use STM32 CubeMX to initialise timer parameters. CubeMX allows us to generate code from these options, handling the conundrum of modifying the appropriate registers.
Remember to generate code once done.9 CubeMX should generate the following code in main.c:
After initialisation, it's possible to change the timer frequency by setting the prescaler and auto-reload registers like so:
This is useful for applications where the frequency is dynamic (e.g. playing music with a piezoelectric buzzer), but it's also useful when we're too lazy to modify the .ioc file.
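For the dynamic-frequency case, something has to pick the register values at runtime. Here's a rough host-side sketch of one way to do that; `timer_regs_t` and `timer_regs_for` are hypothetical names, and the policy (smallest prescaler that lets ARR fit in 16 bits) is my assumption rather than the only sensible choice.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two register values. */
typedef struct {
    uint16_t psc;
    uint16_t arr;
} timer_regs_t;

/* Pick PSC/ARR for a target frequency, preferring the smallest prescaler
 * so that ARR (and thus the counter resolution) stays as large as possible. */
timer_regs_t timer_regs_for(uint32_t clock_hz, uint32_t target_hz)
{
    assert(target_hz > 0 && target_hz <= clock_hz);

    /* Total divisor, rounded to the nearest integer. */
    const uint32_t divisor =
        (uint32_t)(((uint64_t)clock_hz + target_hz / 2u) / target_hz);

    /* Smallest prescaler register value that keeps ARR within 16 bits. */
    const uint32_t psc = (divisor - 1u) / 65536u;
    const uint32_t arr = divisor / (psc + 1u) - 1u;
    assert(arr <= 65535u);

    /* On the board, these would presumably be written with ST's
     * __HAL_TIM_SET_PRESCALER() and __HAL_TIM_SET_AUTORELOAD() macros,
     * using whichever timer handle your project defines. */
    const timer_regs_t regs = { (uint16_t)psc, (uint16_t)arr };
    return regs;
}
```

For a 168MHz clock, a 42,000Hz target yields PSC = 0 and ARR = 3999, while a 440Hz buzzer tone needs a non-zero prescaler (PSC = 5, ARR = 63635) because the total divisor no longer fits in 16 bits.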
Example: Playing with Timers
STM’s HAL library provides ready-made functions to interface with hardware.
These functions are used to start/stop timers for basic timing and counting applications. Functions for more specialised modes (e.g. PWM) are available in stm32f4xx_hal_tim.h.
Digital-to-Analogue Converters (DACs) 🌉
Let's delve into our second topic today: digital-to-analogue converters (DACs).
Audio comes in several forms: sound waves, electrical voltages, and binary data.
Since these representations differ vastly, we need interfaces to bridge the worlds. Between the digital and analogue realms, we have DACs and ADCs as mediators. Generally, DACs are used for output while ADCs are for input.
A Closer Look at DACs
Remember sampling? We took a continuous analogue signal and selected discrete points at regular intervals. An ADC is like a glorified sampler.
While ADCs take us from continuous to discrete, DACs (try to) take us from discrete to continuous. The shape of the resulting analogue waveform depends on the DAC implementation. Simple DACs will stagger the output at discrete levels. More complex DACs may interpolate between two discrete samples to “guess” the intermediate values. Some of these guesses will be off, but at least the signal is smoother.
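The difference between a "staggered" output and an interpolated one can be sketched in a few lines of plain C. This is only a software model of the idea, not how any particular DAC is built; both function names are made up, and `t` is measured in sample periods.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero-order hold: the output stays at the most recent sample,
 * giving a staircase-shaped waveform. */
uint16_t hold_output(const uint16_t *samples, size_t n, double t)
{
    assert(n > 0 && t >= 0.0);
    size_t i = (size_t)t;
    if (i >= n) {
        i = n - 1;
    }
    return samples[i];
}

/* Linear interpolation: "guess" intermediate values on the straight
 * line between two neighbouring samples. */
double lerp_output(const uint16_t *samples, size_t n, double t)
{
    assert(n > 0 && t >= 0.0);
    const size_t i = (size_t)t;
    if (i >= n - 1) {
        return (double)samples[n - 1];
    }
    const double frac = t - (double)i;
    return (double)samples[i]
         + frac * ((double)samples[i + 1] - (double)samples[i]);
}
```

Halfway between samples of 0 and 100, the hold version still outputs 0 while the interpolated version outputs 50.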
Example: Initialising the DAC
Let’s return to CubeMX to set up our DAC.
Again, remember to generate code when finished.9 The MX_DAC_Init() function should contain the generated DAC setup code and should already be called in main().
Example: Using the DAC
On our STM32, the DAC accepts samples quantised to 8 bits or 12 bits.10 We’ll go with the superior resolution: 12 bits!
For simplicity, let’s start with sending 1 DAC sample. This can be done like so:
This should output a voltage level of $\frac{1024}{2^{12}} = 25\%$ of the reference voltage $V_{\text{REF}}$. Once it starts, the DAC will continue sending that voltage out until we change the DAC value or call HAL_DAC_Stop().
We use DAC_CHANNEL_1 to select the first channel, and use DAC_ALIGN_12B_R to accept 12-bit right-aligned samples.
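The 25% figure can be turned into volts with a one-liner. The sketch below follows the same $\frac{\text{code}}{2^{\text{bits}}}$ approximation used above; `dac_output_volts` is a hypothetical helper, and the 3.3V reference in the usage note is an assumption (check your own board).

```c
#include <assert.h>
#include <stdint.h>

/* Approximate DAC output voltage for a right-aligned sample:
 * V_out = V_REF * code / 2^bits */
double dac_output_volts(uint16_t code, unsigned bits, double vref)
{
    assert(bits == 8 || bits == 12);
    assert(code < (1u << bits));
    return vref * (double)code / (double)(1u << bits);
}
```

With an assumed reference of 3.3V, a 12-bit value of 1024 comes out at about 0.825V.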
To fire a continuous stream of samples, we could use a loop and call HAL_DAC_SetValue() repeatedly. Let’s use this method to generate a simple square wave.
An aside. The default HAL_Delay() provided by STM will add 1ms to the delay time (well, at least in my version). I overrode it using a separate definition so that it sleeps the given number of ms.
This generates a square wave with a period of 10ms, for a frequency of 100Hz.
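The timing arithmetic behind that loop can be modelled on a host machine. The sketch below assumes two equal 5ms delays and a full-scale 12-bit high level (both are my assumptions); `square_level` and `square_frequency_hz` are made-up names.

```c
#include <assert.h>
#include <stdint.h>

#define DAC_MAX_12BIT 4095u

/* Level the blocking loop would be holding `elapsed_ms` after it starts:
 * high for one half-period, low for the next, and so on. */
uint16_t square_level(uint32_t elapsed_ms, uint32_t half_period_ms)
{
    assert(half_period_ms > 0);
    return ((elapsed_ms / half_period_ms) % 2u == 0u) ? DAC_MAX_12BIT : 0u;
}

/* Frequency (Hz) of a square wave built from two equal millisecond delays. */
double square_frequency_hz(uint32_t half_period_ms)
{
    assert(half_period_ms > 0);
    return 1000.0 / (2.0 * (double)half_period_ms);
}
```

Two 5ms delays give a 10ms period, i.e. 100Hz. Since the delay is a whole number of milliseconds, the highest frequency this approach can reach is 500Hz (two 1ms delays).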
But there are two issues with this looping method:
- Using a while loop blocks the thread, meaning we block the processor from doing other things while outputting the square wave. We may wish to poll for input or send out other forms of output (TFT/LCD, Bluetooth, etc.).
- Since HAL_Delay() delays in milliseconds, it becomes impossible to generate complex waveforms at high frequencies, since that requires us to send samples at microsecond intervals.
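To put a number on that second point: at the 42,000Hz sample rate we chose for the timer, consecutive samples are spaced

$$ T_\text{sample} = \frac{1}{42,\!000\,\text{Hz}} \approx 23.8\,\mu\text{s} $$

apart, so roughly 42 samples would have to go out within the shortest possible 1ms delay.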
In the next section, we’ll address these issues by combining DAC with timers and DMA.
Direct Memory Access (DMA) 💉🧠
The final item on our agenda today! Direct Memory Access (DMA) may seem like three random words strung together, but it’s quite a powerful tool in the embedded programmer’s arsenal. How, you ask?
DMA enables data transfer without consuming processor resources. (Well, it consumes some resources, but mainly for setup.) This frees up the processor to do other things while DMA takes care of moving data. We could use this saved time to prepare the next set of buffers, render the GUI, etc.
DMA can be used to transfer data from memory-to-peripheral (e.g. DAC, UART TX, SPI TX), from peripheral-to-memory (e.g. ADC, UART RX), across peripherals, or across memory. In this post, we're concerned with one particular memory-to-peripheral transfer: DAC.
Example: DMA with Single Buffering
We'll now try using DMA with a single buffer, see why this is problematic, and motivate the need for double buffering. If you’ve read this far, I presume you’ve followed the previous section by initialising DMA and generating code with CubeMX.
DMA introduces syncing issues. After preparing a second round of buffers, how do we know if the first round has already finished?
As with all processes which depend on a separate event, there are two approaches: polling and interrupts. In this context:
- Polling: Block and wait until the first round is finished, then send.
- Interrupts: Trigger an interrupt signal when transfer finishes, and start the next round inside the interrupt handler.
Which approach to choose depends on your application.
In our examples, we’ll poll to check if DMA is finished:
With DMA, we’ll first need to buffer an array of samples. Our loop will run like this:
- Buffer samples.
- Wait for DMA to be ready.
- Start the DMA.
Do you notice a flaw in this approach? After starting DMA, we start buffering samples on the next iteration. We risk overwriting the buffer while it’s being sent.
Let’s try to implement it anyway and play a simple 440Hz sine wave.
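The "buffer samples" step for a 440Hz sine might look roughly like the host-side sketch below. Two caveats: `fill_sine` and `approx_sin` are hypothetical names, and I've swapped in Bhaskara I's sine approximation so the sketch has no dependency on the maths library; on a real build, `sin()` from `<math.h>` would be the usual choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PI 3.14159265358979323846
#define TWO_PI (2.0 * PI)

/* Bhaskara I's approximation of sin(x) for x in [0, 2*pi).
 * Maximum absolute error is roughly 0.0016. */
static double approx_sin(double x)
{
    double sign = 1.0;
    if (x >= PI) {          /* second half-period mirrors the first */
        x -= PI;
        sign = -1.0;
    }
    const double p = x * (PI - x);
    return sign * 16.0 * p / (5.0 * PI * PI - 4.0 * p);
}

/* Fill `buf` with unsigned 12-bit sine samples centred on 2048.
 * `*phase` (radians) carries over between calls so that consecutive
 * buffers join up without a jump. */
void fill_sine(uint16_t *buf, size_t n, double freq_hz, double sample_rate_hz,
               double *phase)
{
    assert(sample_rate_hz > 0.0);
    assert(freq_hz >= 0.0 && freq_hz < sample_rate_hz / 2.0);
    assert(*phase >= 0.0 && *phase < TWO_PI);

    const double step = TWO_PI * freq_hz / sample_rate_hz;
    for (size_t i = 0; i < n; ++i) {
        /* +0.5 rounds to the nearest integer (the value is always positive). */
        buf[i] = (uint16_t)(2048.0 + 2047.0 * approx_sin(*phase) + 0.5);
        *phase += step;
        if (*phase >= TWO_PI) {
            *phase -= TWO_PI;
        }
    }
}
```

Keeping the phase outside the function matters: if every buffer restarted at phase 0, we'd introduce a click at each buffer boundary even without any DMA trouble.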
The results? As expected, artefacts (nefarious little glitches) invade our signal, since our buffer is updated during DMA transfer. This may result in unpleasant clicks from our speaker.
But what if we prep, then start, then wait? This way, the buffer won't be overwritten; but this causes the signal to stall while prepping.11
To resolve these issues, we'll unleash the final weapon in our arsenal.
Example: DMA with Double Buffering
We saw previously how a single buffer spectacularly fails to deal with "concurrent" buffering. With double buffering, we introduce an additional buffer. While one buffer is being displayed/streamed, the other buffer is updated. This ensures our audio can be delivered in one continuous stream.
In code, we’ll add another buffer by declaring uint16_t[2][BUFFER_SIZE] instead of uint16_t[BUFFER_SIZE]. We’ll also declare a variable curr (0 or 1) to index which buffer is currently available.
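Here's a host-side model of the buffer swap, just to show the bookkeeping. The `mock_dma_*` functions are stand-ins I made up for the real DMA calls (on the board, starting a transfer would presumably go through `HAL_DAC_Start_DMA()`); the mock only copies a buffer out when the transfer "completes", so any overwrite of an in-flight buffer would show up as corrupted output. `stream_ramp` and `streamed` are hypothetical names too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUFFER_SIZE 4
#define ROUNDS 3

static uint16_t buffers[2][BUFFER_SIZE]; /* two buffers instead of one */
static int curr = 0;                     /* buffer the CPU may write to */

/* Mock DMA: remembers the in-flight buffer and only reads it on completion. */
static const uint16_t *in_flight = NULL;
static uint16_t out[ROUNDS * BUFFER_SIZE];
static size_t out_len = 0;

static void mock_dma_start(const uint16_t *buf) { in_flight = buf; }

static void mock_dma_wait(void)
{
    if (in_flight != NULL) {
        memcpy(&out[out_len], in_flight, BUFFER_SIZE * sizeof(uint16_t));
        out_len += BUFFER_SIZE;
        in_flight = NULL;
    }
}

/* Stream the ramp 0, 1, 2, ... and return how many samples were sent. */
size_t stream_ramp(void)
{
    uint16_t next = 0;
    for (int round = 0; round < ROUNDS; ++round) {
        /* 1. Buffer samples into the free buffer; the other may be in flight. */
        for (size_t i = 0; i < BUFFER_SIZE; ++i) {
            buffers[curr][i] = next++;
        }
        /* 2. Wait for the previous transfer to finish. */
        mock_dma_wait();
        /* 3. Start sending the buffer we just filled, then swap. */
        mock_dma_start(buffers[curr]);
        curr = !curr;
    }
    mock_dma_wait(); /* flush the last buffer */
    return out_len;
}

const uint16_t *streamed(void) { return out; }
```

The streamed output is the unbroken ramp 0 to 11. With a single buffer, step 1 would overwrite samples that the mock DMA hasn't read yet, and the ramp would come out with gaps.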
Now our 440Hz sine wave is unblemished!
Double buffering is also used for video and displays, where each buffer stores a 2D frame instead of a 1D signal.
Example: Playing Multiple Notes with DMA and Double Buffering 🎶
With some minor changes, we can make our device generate audio for multiple notes. Let’s go ahead and play an A major chord!
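The mixing step on its own can be sketched on a host machine like this. It's only one possible approach (averaging the voices so the sum can't clip), and `mix_to_dac` is a made-up name. For an A major chord, the three voices would be sine generators at roughly 440Hz (A), 554.37Hz (C♯), and 659.26Hz (E), assuming equal temperament.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mix `n` signed voice samples (each in [-2047, 2047]) into one unsigned
 * 12-bit DAC sample: average them so the result can't clip, then shift
 * the result up to mid-scale (2048). */
uint16_t mix_to_dac(const int16_t *voices, size_t n)
{
    assert(n > 0);
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(voices[i] >= -2047 && voices[i] <= 2047);
        sum += voices[i];
    }
    return (uint16_t)(2048 + sum / (int32_t)n);
}
```

On the board, each voice would come from its own generator inside the buffering loop, which is exactly where the extra cost comes from: three sine evaluations per sample instead of one.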
If you flash the above code and feed the output to an oscilloscope, you may find it doesn’t really work. Our signal stalls, for similar reasons as before.
Even with DMA, stalls may occur. This is usually a sign that buffering (and other processes) consume too much time. In this case, breaks in the data occur—the stream is no longer continuous, because the buffer doesn't finish prepping on time.
Optimisations 🏎
So our code is slow. How do we speed it up?
Here are a few common tricks:
Precompute constants.
Instead of computing 2 * M_PI * FREQUENCY / SAMPLE_RATE every iteration, we can precompute it before the loop, saving many arithmetic instructions.
Use a lookup table.
Math functions such as sin can be computationally expensive, especially when used a lot. By caching the waveform in a lookup table, we can speed up the process of computing samples.
Increase the buffer size.
By increasing the buffer size, we spend less overhead switching between tasks.
Decrease the sample rate.
If all else fails, we can decrease the load by compromising the sample rate, say from 42000Hz to 21000Hz. With a buffer size of 1024, that means we’ve gone from a constraint of $\frac{1,024}{42,000} = 24.4$ms to $\frac{1,024}{21,000} = 48.8$ms per buffer.
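The first two tricks combine nicely: precompute the phase step once, then index a cached waveform with it. Below is a minimal fixed-point sketch under a few assumptions of mine: a 32-bit phase accumulator, integer frequencies, and a deliberately tiny 16-entry table with hard-coded values (a real table would be much larger and filled once at start-up). All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BITS 4
#define TABLE_SIZE (1u << TABLE_BITS)

/* One sine period as 12-bit samples: round(2048 + 2047 * sin(2*pi*k/16)). */
static const uint16_t SINE_TABLE[TABLE_SIZE] = {
    2048, 2831, 3495, 3939, 4095, 3939, 3495, 2831,
    2048, 1265,  601,  157,    1,  157,  601, 1265,
};

/* Precomputed constant: phase advance per sample, where the full 32-bit
 * range represents one period (step = freq / sample_rate * 2^32). */
uint32_t phase_step(uint32_t freq_hz, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0 && freq_hz < sample_rate_hz);
    return (uint32_t)(((uint64_t)freq_hz << 32) / sample_rate_hz);
}

/* Table lookup: the top TABLE_BITS bits of the phase select the entry. */
uint16_t table_sine(uint32_t phase)
{
    return SINE_TABLE[phase >> (32 - TABLE_BITS)];
}

/* Inner loop: one lookup and one addition per sample, no sin() calls. */
void fill_from_table(uint16_t *buf, uint32_t n, uint32_t step, uint32_t *phase)
{
    for (uint32_t i = 0; i < n; ++i) {
        buf[i] = table_sine(*phase);
        *phase += step; /* unsigned overflow wraps at exactly one period */
    }
}
```

The trade-off is memory and accuracy for speed: a bigger table (or interpolating between entries) gets closer to a true sine at the cost of more flash or RAM.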
To avoid complicating things, I lowered the sample rate to 21000Hz. This means changing the auto-reload register to 7999, so that our timer frequency is $$\frac{168,000,000}{(0 + 1) \times (7,999 + 1)} = 21,\!000\text{Hz.}$$
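The per-buffer deadline is worth having as a helper when experimenting with buffer sizes and sample rates. A tiny sketch (the name `buffer_deadline_ms` is made up):

```c
#include <assert.h>
#include <stdint.h>

/* How long (in ms) one buffer lasts at the given sample rate. This is the
 * time available to prepare the next buffer before the stream runs dry. */
double buffer_deadline_ms(uint32_t buffer_size, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return 1000.0 * (double)buffer_size / (double)sample_rate_hz;
}
```

For a buffer of 1,024 samples, this gives about 24.4ms at 42,000Hz and about 48.8ms at 21,000Hz, so halving the sample rate doubles the time we have to fill each buffer.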
After all this hassle, we get a beautiful chord.
Recap 🔁
By utilising both hardware and software, we reap the benefits of parallel processing while implementing an efficient, robust audio application. On the hardware side, we explored:
- Timers, which are a useful and inexpensive way to trigger actions at regular intervals.
- DACs, which enable us to communicate with a speaker by translating digital samples into analogue signals.
- DMA, which enables data transfer with minimal processor resources. This way, we can process other things while streaming audio.
In software, we explored:
- Double buffering, a software technique for buffering data to achieve continuous or faster output.
- Various optimisations, which enable us to squeeze more processing into our tiny board.
Combined, these techniques save processing resources, which can then be spent on additional features.
In case you want to go further, here are some other things to explore:
- Generating stereo audio. We’ve generated audio for Channel 1. What about stereo audio for Channel 2? If you’re using reverb effects and wish for a fuller stereo sound, you’ll need an extra pair of buffers (and more processing!).
- Streaming via UART (+ DMA).
- Using SIMD instructions to buffer two (or more?) samples at a time.
- Other assembly-level bit-hacking tricks.
- RTOS for multitasking.
- Other boards or hardware with specialised audio features.
Hope you enjoyed this series of posts! Leave a comment if you'd like to see more or have any feedback!
Full Code
The complete code for DMA with double buffering has been uploaded as a GitHub Gist. It hasn't been fully optimised yet. I'll leave that as an exercise for the reader.
Footnotes
Each of these components (especially hardware) deserves its own post to be properly introduced; but for the sake of keeping this post short, I’ll only introduce them briefly and link to other resources for further perusal. ↩︎
How do microcontroller timers work? – A decent article on timers. Diagrams are in French though. ↩︎
The extent of timer events depends on hardware support. Timers can do a lot on ST boards. For other brands, you may need to check the datasheet or reference manual. ↩︎
Some other timers have 32-bit ARR registers. But eh, we can achieve a lot with just 16-bit ones. ↩︎
What is the difference between pairs of prescaler/auto-reload, such as PSC = 0, ARR = 3999 vs. PSC = 1, ARR = 1999?
Indeed, given a fixed clock frequency, the same timer frequency will be generated (since the divisor is the same: 4000). However, the difference lies in the counter. Recall each step of auto-reload equals a step of the counter.
The counter is used in calculating the on-time (or duty cycle). By using a higher ARR, we gain a higher resolution in the counter, which allows us to, say, control servos with finer granularity. Thus, a lower prescaler is often preferred.
Of course, different vendors may implement timers differently or have different features attached to timer peripherals. Other considerations may come into play, depending on the vendor and your application. ↩︎
We chose Timer 8 (with Channel 4) because it's an advanced control timer (a beefy boi!), capable of a lot, though probably overkill for our simple examples. The timer and channel you use depend on your STM board and model. If you’re following along with this post, make sure to choose a timer which has DMA generation. When in doubt, refer to the reference manual.8 ↩︎
STM's Official Reference Manual for F405/F415, F407/F417, F427/F437, F429/F439 boards. Definitely something to refer to if you’re working on one of those boards. ↩︎ ↩︎
In CubeMX, you can generate code by choosing the Project > Generate Code menu option. Keep in mind that only code between USER CODE BEGIN and USER CODE END comments will be preserved by ST's code generator. ↩︎ ↩︎
There are pros to using 8-bit or 12-bit DAC. 8-bit conversion is faster, whereas 12-bit offers higher resolution. To slightly complicate things, the 12-bit DAC option on our STM32 can be aligned either left or right. That is, we can choose whether our data takes up the first 12 bits or last 12 bits on a 16-bit (2-byte) space. Alignment exists to save you a shift operation, which depends on your application. ↩︎
Not sure if stall is the right word. Let me know if there's a better one. ↩︎