name: inverse
class: center, middle, inverse
layout: true
.header[.floatleft[.teal[Kit Biggs] — Open Voice].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2025]]

---
layout: true
name: callout
class: center, middle, italic, bulletul
.header[.floatleft[.teal[Kit Biggs] — Open Voice].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2025]]

---
layout: true
name: toply
class: center, toply, italic, bulletul
.header[.floatleft[.teal[Christopher Biggs] — Open Voice].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2025]]

---
layout: true
template: callout
.header[.floatleft[.teal[Kit Biggs] — Open Voice].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2025]]

---
template: inverse

# Open Source Voice in 2025
## Is it the future, yet?
### Everything Open 2025 - Tarntanya (Adelaide)

.bottom.right[
Kit Biggs, .logo[Accelerando Lab]
@unixbigot@aus.social .logo[accelerando.com.au]
]

???

---
layout: true
template: toply
.crumb[ # Intro ]

---
# About Me
## Kit Biggs — .teal[@unixbigot@aus.social] — .logo[accelerando.com.au]

* Meanjin (Brisbane), Australia
* Founder, .logo[Accelerando - Innovation Space and IoT consultants]
* Convenor, Brisbane Internet of Things interest group
* 30 years in IT as developer, architect, manager
* HIRE ME: Available from Feb 2025

???
Welcome, friends sanguine and synthetic. My name is Kit, my pronouns are they/them, I come from Queensland and I'm here to help.

I make a wildly erratic living by helping other organisations get to grips with the potential and pitfalls of the Internet of Things.

---
# Prelude

.fig100[ ![](notarobot.gif)]

???
There will be no "I am not a robot" checkbox to access this presentation.

When I pitched this talk it was something of an afterthought to some other proposals. But I figured, why not: it would be simple to update the presentation I gave in Gladstone last year to cover what's new in the interim.

If you missed my presentation from last year, you're in luck, because most of what I said then is now hilariously out of date.

I often repeat that artificial intelligence is doomed to failure. If it doesn't work, we have nothing. If it does work, then we have less than nothing, because now we have a whole argument about where to draw the line between software and slavery.

---
# This is not about AI

## ".st[Humans > Computers]"
## "Humans ≥ Computers"
## "Humans ≠ Computers"?
## "Humans ⩭ Computers"?

???
When I first gave this talk, it was with a disclaimer that I wasn't going to spout off about AI. I lied even then.

A year ago a number of people were claiming that we soon will, or even already do, share the planet with Artificial General Intelligences smarter than humans. I found that amusing, because the only person I personally knew who bandied the term AGI around was a raving charlatan, and I couldn't tell the difference between that person and these other luminaries.

I've figured it out now. AI blew past the peak of inflated expectations and here we are in the trough of disillusionment. The charlatans have cashed out already; if you're still at the table, consider that you're the designated sucker. I'm pleased that I sat the whole overexuberant shemozzle out.

---
# Alexa, add privacy to my shopping list

.fig80c[ ![](voiceassistants.jpeg)]

???
The last year has been a bit of a downward spiral for the concept of technology as magic. Household voice assistants have run out of steam, supermarkets without checkouts turned out to rely on offshore wage-slaves, the metaverse and Apple Vision sank without a trace, and large language models turned out to need a continent's worth of underpaid engineers to restrain their tendency to spout utter rubbish.

So maybe the whole concept of conversational interfaces is over; maybe it'll never work. But I do wanna check the bathwater for any trace of baby before we throw it all out.

---
# The plan

.fig50c[ ![](blueprint.jpeg)]

???
Here's my evil plan. We've just reached the point where last year's phones can do speech recognition on the device, rather than having to transmit audio to the cloud. And right now we can buy small form factor computers and even microcontrollers with the AI acceleration hardware to do voice recognition on the device, without needing any cloud service.

So finally we have the potential to roll our own conversational interfaces without the cost, latency and creep factor of cloud services.
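To make that concrete, here's the shape of the loop we're aiming to build, sketched in Python. Every function here is a hypothetical placeholder standing in for a real component, and a real software project, covered over the next few slides.

```python
# The shape of the whole pipeline. Every function is a placeholder standing in
# for a component (and an open source project) discussed in the next section.

def wait_for_wake_word() -> None:
    input("(pretend a wake word was heard; press Enter) ")  # e.g. openWakeWord, Porcupine, ESP-SR

def record_until_silence() -> bytes:
    return b""  # microphone capture plus voice activity detection; 16 kHz mono PCM

def speech_to_text(audio: bytes) -> str:
    return "turn on the workbench light"  # e.g. Whisper or Moonshine

def handle_intent(text: str) -> str:
    return f"Okay: {text}"  # template matching, Home Assistant, an MQTT broker...

def speak(reply: str) -> None:
    print(reply)  # e.g. Piper text to speech

while True:
    wait_for_wake_word()
    audio = record_until_silence()
    reply = handle_intent(speech_to_text(audio))
    speak(reply)
```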
What I want to do next is go through all the components of a conversational user interface and see where the state of the art stands.

---
layout: true
template: toply
.crumb[ # Intro # Concepts ]

---
class: center, middle
template: inverse

# Principles of voice interfaces
## WAAAH IT'S SO COMPLICATED

???
BREATHE

All right, in order to understand how to do ALL THIS STUFF we're going to have to break it up into steps. And this makes total sense in the open source world, because many of these steps resolve to single software projects, and in most cases there are a number of projects that you can choose from to perform that step.

Sometimes all this software runs on one system; sometimes it might be spread across systems. And that's what gives us flexibility, because we can arrange the software to fit the kinds of tasks we want to do.

---
# Acquiring audio

* Analog microphone
* I2S digital microphones
* Multi-microphone arrays
* Bluetooth headsets
** Hands-Free Audio Gateway Profile (HF-AG)
** Advanced Audio Distribution Profile (A2DP)

???
The first thing we want to do is get some sound. This used to be hard, but then the future happened. A five dollar microcontroller is today fast enough to run analog-to-digital conversion on the output of a microphone and process the audio in real time.

But really, don't do that. Digital microphones are cheaper than chips, and it's especially worthwhile getting a multiple microphone array or other hardware that can do some of the work of isolating background noise. Bluetooth earbuds are now dirt cheap, and the software to talk to them is built into your embedded operating system.

---
# Did you hear something?

* Voice activity detection
* Done in some microphone hardware
** https://wiki.seeedstudio.com/xiao_respeaker/
* For everything else, there's Fourier Transforms
** https://github.com/TheZeroHz/VADCoreESP32

???
The next thing to care about in speech recognition is whether someone is actually speaking. Sometimes the hardware can do this for you, and if not there are off-the-shelf open source libraries that analyse your raw audio to determine whether it's worth thinking about the bits.

Here's a really simple library with just one key function: it tells you whether there is silence or speech. You can twiddle thresholds and priorities and other stuff, but this is a lovely example of an open source component that just does one simple thing, in a pluggable way.

---
# PAY ATTENTION

.fig40c[ ![](wakeup.jpeg)]

???
I contrast this with the big three commercial voice interfaces, where the same steps exist but there are no options for the user about how the pieces fit together.

Speech recognition is expensive, and when your wearable computer needs a heatsink this can affect your choice of which body parts you want to strap it to, so most of the time we use what's called a WAKE WORD to activate the speech recognition subsystem.

This is a privacy and efficiency tradeoff to some extent. You don't want to accidentally trigger effects due to snippets of normal conversation, and with the older systems you don't want them uploading everything you say. So this was the first piece of the puzzle that got implemented on device.

Alexa, Okay Google, Hey Siri. Those are the phrases that the big three default to. The dirty secret was that these phrases were chosen to be easy to recognise: three or four regular syllables, mostly ending with a vowel. False triggers are really common because the models are quite loose.
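Before we move on, here's what those first two steps (capturing audio and gating on voice activity) can look like on a Linux box. This is a sketch, assuming the `sounddevice` and `webrtcvad` Python packages rather than the ESP32 library linked a couple of slides back.

```python
# Capture audio and gate it on voice activity, desktop-style.
# Assumes the sounddevice and webrtcvad PyPI packages; this is the Linux
# equivalent of what the on-device libraries above are doing.
import sounddevice as sd
import webrtcvad

SAMPLE_RATE = 16000      # webrtcvad supports 8, 16, 32 or 48 kHz
FRAME_SAMPLES = 480      # 30 ms frames, one of the three sizes webrtcvad accepts

vad = webrtcvad.Vad(2)   # aggressiveness 0 (lenient) to 3 (strict)

with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16",
                       blocksize=FRAME_SAMPLES) as stream:
    while True:
        frame, _overflowed = stream.read(FRAME_SAMPLES)
        if vad.is_speech(bytes(frame), SAMPLE_RATE):
            print("speech")   # worth handing to the wake word and STT stages
        else:
            print("silence")  # cheap to discard before any heavy processing
```

Everything up to this point can happen on the device, before a single byte leaves the room.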
This is the part that squicks most people out about the cloud. For example, there was a big story in the news a few years ago about a couple who were having a screaming domestic argument, and whose voice appliance somehow hallucinated an instruction to transcribe the argument into a text message to a coworker.

---
# Say WHAT?

.fig80c[ ![](stt.png)]

???
When I started considering this subject, wakeword recognisers were expensive and difficult to produce. Over the last eighteen months this has largely become a solved problem. Software that shipped with a choice of three wakewords a year ago has dozens of options this year. The cost of generating a new wakeword recogniser has fallen from two weeks of supercomputer time and five hundred hours of recordings to an hour on your own PC.

Once the system knows to start listening, it has to convert the waveform produced by humans squirting air through their meat into symbols. This is what we call CONTINUOUS SPEECH RECOGNITION. Pretty much every system out there breaks this out into a speech to text step. That is, it converts audio to ASCII.

A particular model only works for a particular human language, but once again I'm seeing the offerings shift from "English and maybe Chinese" to a laundry list of supported languages.

---
# "Word Recognition"

.fig50c[ ![](voicemenu.jpeg)]

???
Now, you don't strictly HAVE to do full speech recognition. The alternative is called WORD RECOGNITION. Instead of converting the speech to text and then having a separate step to understand the text, simpler systems have a limited number of phrases that they expect to hear, and they perform the much simpler task of determining that the audio sample matches phrase seven, or phrase twelve, or none of the above.

This is obviously less flexible, but it's also something that can be done on a very small microcontroller, and once again the state of the art has shifted this year. Last year I was working with software that could discriminate about a hundred phrases, which left me reloading my phrasebook with different phrases in different contexts. Those limits have doubled in the last year, which may be enough to avoid having to juggle phrasebooks.

---
# State your intent

.fig50c[ ![](intents.png)]

???
Back to speech recognition: we have a phrase converted to text. Now we want to know what to do about it. This is called intent recognition, and it is the big task that the commercial voice assistants consist of.

Google, Apple and Amazon all have complex ecosystems that direct voice commands to their various sub-products: their music players, their home automation, the apps on your phone, or the marketplace of voice services that they expected vendors to create but which never really took off. I'd be really wary of building into these ecosystems, because the vendors have quite clearly lost interest in them. In the open source world there are a handful of projects that do this, which I'll touch on later.

---
# Conversations

.fig60c[ ![](fries.jpeg)]

???
But let's finish the picture first. Processing an intent may be a fire-and-forget operation. Something like "Turn on the light" is a complete conversation in itself. More complex operations like composing a message or scheduling an appointment may involve back and forth, so some systems require a conversation component to handle these complex interactions. Apple is particularly good at this.

When an intent resolves to an action, there needs to be some communication with the outside world.
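Stripped right down, the intent step can be as small as a table of phrase templates. This is a toy sketch in plain Python with made-up intent names, not any particular project's API; real engines such as Rhasspy or Home Assistant's conversation integration use much richer sentence templates.

```python
# A toy intent recogniser: transcribed text in, a structured command out.
# The intent names and phrase templates here are made up for illustration.
import re

INTENTS = [
    (re.compile(r"turn (?P<state>on|off) the (?P<device>.+)"), "set_power"),
    (re.compile(r"what time is it"), "get_time"),
]

def recognise(text: str):
    """Return (intent_name, slots) or (None, {}) if nothing matched."""
    text = text.lower().strip(" .!?")
    for pattern, intent in INTENTS:
        match = pattern.fullmatch(text)
        if match:
            return intent, match.groupdict()
    return None, {}

print(recognise("Turn on the workbench light."))
# ('set_power', {'state': 'on', 'device': 'workbench light'})
```

Whatever the recogniser returns, the intent and its slots are then yours to dispatch however you like.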
A message goes to a light switch, or information passes to an app on your phone, or it might go into a message broker where a number of consumers could act on it.

---
# Feedback

.fig70c[ ![](tuba.jpeg)]

???
Finally there's feedback: answering "Okay, I've opened the pod bay doors" or "I'm sorry Dave, I can't do that". Here we have text-to-speech conversion systems. This is something that has always been possible on device, even back in the 1980s, but with considerable variation in quality.

Speech synthesis is such a solved problem that I'm sick to death of hearing AI voices. For me we're at the top of an uncanny valley: I'd rather have an obviously imperfect voice generated on device than be hearing the same four AI voices everywhere I go.

---
layout: true
template: toply
.crumb[ # Intro # Concepts # Unix ]

---
class: center, middle
template: inverse

# It's a Unix system!
## I know this

.fig50[ ![](unix.png)]

???
BREATHE

All right, now we're ready to start looking at some actual software projects that you can run on a small Linux device or similar. I'm going to go bottom up: first the low-level operations that deal with audio input and output, then the intent and conversation handling later.

---
# Wakewords

* Lots of dead players
* Picovoice's Porcupine is mostly open source

???
Let's start with wakewords. I'm not even going to waste your time with anything that hands audio off to the cloud here. All this stuff runs on Linux, but most of it could be run on BSD or macOS too. The field has shaken out a lot over the last year, so I'm only going to talk about two systems.

Picovoice is a very portable system that runs on a bunch of processors and supports something like sixteen languages. Their open source library is called Porcupine and sits on GitHub under an Apache license; I think a license fee is required for commercial use. If you want to train your own wakeword they have a website that will walk you through that process, and they ship a few default words.

This project also has integrations for a ton of programming languages, so if you are keen to work in your pet language it is probably worth a look. However, the training engine is NOT open source, so that may affect your decision. Or you might want to applaud them for having a business model.

---
# Wakewords

* Lots of dead players
* Picovoice's Porcupine is mostly open source
* OpenWakeWord - the new hotness

???
The new kid on the block is openWakeWord. Even if you don't end up using this, its README has a good explanation of its goals, and of where other projects might do better in some areas, which I love love love.

OpenWakeWord trains its models on synthetic speech, so you don't even have to record human speakers. This sounds like black magic and I'd be very suspicious, except for that README, which is very open about all the other options in the space and how they compare.

This software hasn't changed much in the last year, and I think that reflects that it's gotten good enough for most use cases.

---
# Speech to text

* Cloud offerings (yawn) from Amazon and Google
* Accumulating pile of dead codebases
* OpenAI Whisper - Last year's champion
* Moonshine - https://github.com/usefulsensors/moonshine
* Table of options: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

???
So that's wakewords; the next step is speech to text. Again, there are so many cloud-based speech to text engines that I'm not even going to talk about them.
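To give you an idea of how little code the openWakeWord loop from the previous slide needs, here's a minimal sketch. It assumes the `openwakeword` and `sounddevice` Python packages; the model download step and model naming vary a little between releases, so treat the details as approximate.

```python
# Wake word spotting with openWakeWord. Treat this as a sketch: the download
# step and the model names differ a little between openwakeword releases.
import sounddevice as sd
from openwakeword import utils as oww_utils
from openwakeword.model import Model

oww_utils.download_models()   # one-time fetch of the stock pre-trained models
model = Model()               # loads them all; pass wakeword_models=[...] to narrow down

CHUNK = 1280                  # 80 ms at 16 kHz, the frame size the models expect

with sd.InputStream(samplerate=16000, channels=1, dtype="int16",
                    blocksize=CHUNK) as stream:
    while True:
        audio, _overflowed = stream.read(CHUNK)   # numpy array of shape (CHUNK, 1)
        scores = model.predict(audio[:, 0])       # {model name: score between 0 and 1}
        for name, score in scores.items():
            if score > 0.5:                       # the threshold is yours to tune
                print("heard:", name)
```

Once a score crosses your threshold, you start shipping the audio that follows to the speech-to-text engine.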
The offline speech recognition engine that was leading the pack last year was OpenAI Whisper. It's MIT licensed, but OpenAI itself is deep into the general AI kool-aid, so that might sour you on having anything to do with them. Fortunately there's been a Cambrian explosion in speech engines, sparked by an arms race between Nvidia and OpenAI. There's a URL up there that will list thirty-odd options with various pros and cons.

The other big win is that a number of speech engines can run on a Raspberry Pi and other single board computers - particularly a number of CPUs from Rockchip that have AI accelerators built in.

---
# Text to speech

* Cloud for uncanny valley quality
* Big list https://rhasspy.readthedocs.io/en/latest/text-to-speech/
* Piper is the stand-out, actively developed
* Emotional synthesis

???
All right, let's skip over intents and conversations and look at text to speech. This is another arena where the bleeding edge of quality is cloud-based, but the on-device options are good enough for me. I'm just going to refer you to a webpage here to run down the pros and cons of various offerings. The TL;DR is that if you're generating English you have lots of good choices, and if you want to generate speech in lots of languages, you have a couple of projects that are capable all-rounders.

The one that stands out to me is Piper, which is what Home Assistant uses. It supports a lot of languages and has a choice of voices in many cases. I'm very pleased with this one, and it's being actively developed.

What's new for this year is emotional synthesis: the ability to make your synthesised speech sound happy or sad or homicidal as you desire. Unfortunately, as you might have noticed, we're now bombarded with perpetually angry YouTube videos thanks to this technology.

---
# Hardware
## Let's start with Pi. Most fruitaholics do.

.fig30[ ![](coral.jpeg)]

* Raspberry Pi 4 or 5
* NVIDIA Jetson
* Any SFF PC

???
Let's talk briefly about hardware. Pretty much everybody starts out with the Raspberry Pi single board computer. I have more of them lying about than I know what to do with, so that's what I started out with too. A Raspberry Pi 5 with a Google Coral accelerator stuck into its PCIe slot is certainly an option.

---
# Rockchip Arm64 with neural coprocessor

.fig50l[ ![](radxa_zero_3w.jpeg)]
.fig50r[ ![](rock_5c.webp)]

???
But this year I'm gonna withdraw my recommendation to go with Raspberry Pi. I just got this new third-generation Radxa Zero, which has a one trillion operations per second neural accelerator. If you spend fifty bucks more and can live with the Raspberry Pi form factor, you can get another model from Radxa that is six times faster.

---
# AI In A Box (Radxa Rock 5A)
## https://github.com/usefulsensors/ai_in_a_box

.fig60c[ ![](ai_in_a_box.jpeg)]

???
Now, my apologies to your wallet, but here's one of those eight-core Radxa boards with a neural accelerator, embedded in a product called AI In A Box. It can do continuous speech transcription. It can run large language model chatbots, if you enjoy lying as a service. And it's about three hundred and fifty US bucks. But don't click buy yet, I've got some other options for you later.

---
# Microphones

.fig50l[ ![](respeaker.jpeg)]
.fig50r[ ![](respeaker2.jpeg)]

???
The good thing about the Raspberry Pi playing-card form factor, or the smaller Pi Zero form factor, is that you can get multi-microphone HATs which do a good job of audio input.
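Before we shrink things down to microcontrollers, here's how short the transcribe-and-reply path is on a Pi-class board. A sketch assuming the `openai-whisper` Python package and the `piper` command-line tool from the piper-tts project, with a voice file you've already downloaded; the model names are just the ones I'd reach for.

```python
# Transcribe a clip with Whisper, then speak a reply with Piper.
# Assumes the openai-whisper package (which shells out to ffmpeg to read audio)
# and the piper CLI plus a downloaded voice file; model names are just examples.
import subprocess
import whisper

stt = whisper.load_model("base")          # small enough for a Pi 5 or a Rockchip board
result = stt.transcribe("command.wav")    # audio file in, text out
text = result["text"].strip()
print("Heard:", text)

reply = f"You said: {text}"
subprocess.run(
    ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "reply.wav"],
    input=reply.encode("utf-8"),          # piper reads the text to speak from stdin
    check=True,
)
# ...then play reply.wav with aplay, mpv, or whatever the board has handy.
```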
---
layout: true
template: toply
.crumb[ # Intro # Concepts # Unix # Embedded ]

---
class: center, middle
template: inverse

# Voice on microcontrollers
## Impossible, right?

???
BREATHE

The concept of doing audio processing on a microcontroller sounded absurd five years ago, but that's all changed and I'm super excited. The arrival of AI model acceleration on microcontrollers was the catalyst that got me re-interested in voice interfaces in 2023, and led to this talk.

The architecture that pretty much everybody uses here is the Espressif ESP32. We're talking a couple of CPU cores at 240 megahertz each, and a handful of megabytes each of flash and RAM. For about ten bucks.

---
# All in one

.fig50l[ ![](lyra.png)]
.fig50r[ ![](espbox.png)]

???
For a bare-bones job all you need is an S3 development board and a microphone module, which could come in under twenty bucks. But there are a couple of development boards that combine CPU and microphone and even a screen, which are just divine as long as size isn't an issue.

I've done most of my work on Espressif's own devkit, which has a three-microphone array, a bunch of multicolour lights, and the capacity to run on battery power. M5Stack make a lovely little board that many people are raving about, which also has an LCD screen.

Your hardware choice is going to depend on what your application is. If you're doing home automation or conversational interfaces, then you probably want one of the big boards with lots of microphones. In that case your ESP device is going to do the wake word processing, and then stream the audio to your conversational engine running on a Raspberry Pi or PC.

---
# How small can you go?

.fig50l[ ![](nanos3.jpeg)]
.fig50r[ ![](watchs3.webp)]

???
The other use case, and this is the one that I'm interested in, is building voice directly into an appliance, where the microcontroller is running the whole appliance and just happens to also do voice processing.

If you think about off-the-shelf Internet of Things gadgets like light bulbs, motorised blinds and what have you, they just take orders from the intent processor, which might be your voice assistant base station or might be in the cloud. But this means that they only work as part of a system. If you take them somewhere else, they stop working. On the other hand, who takes their lightbulb or their curtains out for lunch?

My interest is in tools that I can take with me, either body-worn or in a toolbox, and be able to give them instructions without needing a network connection.

---
# Even smaller

.fig100[ ![](wemos-s3-pro.jpeg)]

???
Here are some boards that I've been playing with in the last couple of weeks. The one on the left is an ESP32-S3 with a colour display; I've been testing this as a monocle, about which more soon. The one on the right is an ESP32-C3, which is the RISC-V chip from Espressif. Because RISC-V is an open instruction set architecture with no licensing fees, it costs as little as two dollars. Both of these have Bluetooth and WiFi.

---
# Even sleeker

.fig60c[ ![](ha_voice_pe.jpeg)]

???
Now this is the gadget that everyone's talking about. This is the official HOME ASSISTANT VOICE module. I got this developer preview so recently that I haven't even powered it on yet, but I can tell you that it's simultaneously exciting and disappointing. It's got microphones and lights and a click dial, it's built on the ESP32-S3 microcontroller, and it is fully open source. But it has to shunt your voice to the cloud, or to your home PC, for processing.
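That streaming arrangement is less work than it sounds. As a rough illustration, here's a MicroPython sketch of an ESP32 reading an I2S microphone and pushing raw PCM frames to a base station; the pins, address and port are placeholders, Wi-Fi setup is assumed to have happened in boot.py, and a production build would more likely use ESP-SR or ESPHome than hand-rolled MicroPython.

```python
# MicroPython on an ESP32: read an I2S microphone and stream raw 16 kHz PCM to
# a base station over TCP. Pins, address and port are placeholders for whatever
# your board and network actually use; Wi-Fi is assumed to be up already.
import socket
from machine import I2S, Pin

mic = I2S(0,
          sck=Pin(5), ws=Pin(6), sd=Pin(7),   # wherever your mic breakout is wired
          mode=I2S.RX, bits=16, format=I2S.MONO,
          rate=16000, ibuf=20000)

link = socket.socket()
link.connect(("192.168.1.50", 10700))          # the machine running your speech engine

buf = bytearray(1280)                          # 40 ms of audio per chunk
while True:
    n = mic.readinto(buf)                      # blocks until the DMA buffer has data
    if n:
        link.send(buf[:n])                     # ship it; wake word gating omitted for brevity
```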
If you're using Home Assistant for household automation, this Voice module should drop right in. Or you can reprogram it to do whatever you like.

---
# Phrase recognition - ESP-sr (on github)

.fig80c[ ![](skainet.png)]

???
As far as I know, we don't yet have a general speech to text system that runs unassisted on microcontrollers - but I'm hoping to become wrong about that this year. However, there are a number of options for less ambitious phrase recognition on device.

There are a couple of systems that work with Google's TensorFlow Lite and will run on embedded ARM processors and many others. But for ESP32, Espressif ship their own voice toolkit that does wakeword detection and quite flexible phrase recognition, and this is what I've been using. If you're working in English or Chinese, stick with the factory toolkit. It can recognise up to 300 phrases, which you can specify as English text or as phonetic voodoo symbols. If it thinks it has recognised one of its phrases it will give you a confidence level. I've found it to be very accurate.

---
# This year's goal
## Integrated magnifying, illumination and readouts

.fig50c[ ![](lilygo-t-glass.jpeg)]

???
Now, I spoke about a number of projects utilising all these tools last year and I won't repeat myself - you can find the slides for that on my website.

My project for this year is an assistive headset. I want to be able to deploy magnifying lenses, illumination and a head-up display under voice control. BUT, I've evaluated a number of display monocles and found them all unsuitable for me.

The one on screen is the LilyGO T-Glass. It's about sixty bucks, with an ESP32-S3 processor and open hardware. I would be prepared to love it to bits, but unfortunately I cannot get the screen in focus with my spectacles. So of course I'm building my own combination monocle and voice processor.

---
# Micro-LED displays

.fig100[ ![](g1-specs.png)]

???
Now, these things are the Even Realities G1. They cost about 500 bucks and support prescription lenses. They have a micro-LED projector in the frame, and some compute hardware up behind your ears. No camera. I'm loath to spend that much money on something without trying it first, but I do know of someone on the net who has a pair.

There's an even cooler product that's just been announced from another vendor, but dammit, I lost the press release. I think it's going to be on Kickstarter later this year.

---
class: vtight

.fig25[ ![](keep-calm.jpg) ]

# Peroration*

.footnote[Yes, that's a word. Look it up.]

.nolm[
* what's wrong with the status quo
* the pieces of the puzzle
* offline voice software
* embedded voice software
* voice activated tools
]

???
There is SO much more I wanted to tell you; it's going to be a super exciting year, presuming the world doesn't end.

Today we've taken a whirlwind tour of the architecture of voice interfaces, and looked at two broad approaches: a single board ARM computer, or a microcontroller. We've touched on some very shiny products coming down the pipeline for home automation, and I've shared my plans for this year's vanity project.

So I hope you found some of that useful. I have samples of many of these devices with me, and I'll be around for the rest of the conference if you want to see them.
---
# Resources, Questions

## Related talks - [http://christopher.biggs.id.au/#talks](http://christopher.biggs.id.au/#talks)

## HIRE ME - Available from Feb 2025

- Mastodon: .blue[@unixbigot@aus.social]
- Email: .blue[christopher@biggs.id.au]
- Accelerando - Innovation Space and IoT Consultants - https://accelerando.com.au/

???
* Thanks for coming
* I'm for hire
* Over to you