name: inverse
class: center, middle, inverse
layout: true
.header[.floatleft[.teal[Kit Biggs] — Open Voice].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2025]]

---
layout: true
name: callout
class: center, middle, italic, bulletul
.header[.floatleft[.teal[Kit Biggs] — Open Voice].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2025]]

---
layout: true
name: toply
class: center, toply, italic, bulletul
.header[.floatleft[.teal[Christopher Biggs] — Open Voice].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2025]]

---
layout: true
template: callout
.header[.floatleft[.teal[Kit Biggs] — Open Voice].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2025]]

---
template: inverse

# Open Source Voice in 2025
## Is it the future, yet?
### Everything Open 2025 - Tarntanya (Adelaide)

.bottom.right[
Kit Biggs, .logo[Accelerando Lab]
@unixbigot@aus.social .logo[accelerando.com.au]
]

???

---
layout: true
template: toply
.crumb[ # Intro ]

---
# About Me
## Kit Biggs — .teal[@unixbigot@aus.social] — .logo[accelerando.com.au]

* Meanjin (Brisbane), Australia
* Founder, .logo[Accelerando - Innovation Space and IoT consultants]
* Convenor, Brisbane Internet of Things interest group
* 30 years in IT as developer, architect, manager
* HIRE ME: Available from Feb 2025

???
Welcome, friends sanguine and synthetic. My name is Kit, my pronouns are they/them, I come from Queensland and I'm here to help.

I make a wildly erratic living by helping other organisations get to grips with the potential and pitfalls of the Internet of Things.

---
# Prelude

.fig100[ ![](notarobot.gif)]

???
There will be no "I am not a robot" checkbox to access this presentation.

When I pitched this talk it was something of an afterthought to some other proposals. But I figured, why not: it would be simple to update the presentation I gave in Gladstone last year to cover what's new in the interim.

If you missed my presentation from last year, you're in luck, because most of what I said then is now hilariously out of date.

I often repeat that artificial intelligence is doomed to failure. If it doesn't work, we have nothing. If it does work, then we have less than nothing, because now we have a whole argument about where to draw the line between software and slavery.

---
# This is not about AI

## ".st[Humans > Computers]"
## "Humans ≥ Computers"
## "Humans ≠ Computers"?
## "Humans ⩭ Computers"?

???
When I first gave this talk, it was with a disclaimer that I wasn't going to spout off about AI. I lied even then.

A year ago a number of people were claiming that we soon will, or even already do, share the planet with Artificial General Intelligences smarter than humans. I found that amusing, because the only person I personally knew who bandied the term AGI around was a raving charlatan, and I couldn't tell the difference between that person and these other luminaries.

I've figured it out now. AI blew past the peak of inflated expectations and here we are in the trough of disillusionment. The charlatans have cashed out already; if you're still at the table, consider that you're the designated sucker. I'm pleased that I sat the whole overexuberant shemozzle out.

---
# Alexa, add privacy to my shopping list

.fig80c[ ![](voiceassistants.jpeg)]

???
The last year has been a bit of a downward spiral for the concept of technology as magic. Household voice assistants have run out of steam, supermarkets without checkouts turned out to rely on offshore wage-slaves, the metaverse and Apple Vision sank without a trace, and large language models turned out to need a continent's worth of underpaid engineers to restrain their tendency to spout utter rubbish.

So maybe the whole concept of conversational interfaces is over; maybe it'll never work. But I do wanna check the bathwater for any trace of baby before we throw it all out.

---
# The plan

.fig50c[ ![](blueprint.jpeg)]

???
Here's my evil plan. We've just reached the point where last year's phones can do speech recognition on the device, rather than having to transmit audio to the cloud. And right now we can buy small form factor computers and even microcontrollers with the AI acceleration hardware to do voice recognition on the device, without needing any cloud service.

So finally we have the potential to roll our own conversational interfaces without the cost, latency and creep factor of cloud services.
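To make that concrete, here's the shape of the loop we're aiming to build, sketched in Python. Every function here is a hypothetical placeholder standing in for a real component, and a real software project, covered over the next few slides.

```python
# The shape of the whole pipeline. Every function is a placeholder standing in
# for a component (and an open source project) discussed in the next section.

def wait_for_wake_word() -> None:
    input("(pretend a wake word was heard; press Enter) ")  # e.g. openWakeWord, Porcupine, ESP-SR

def record_until_silence() -> bytes:
    return b""  # microphone capture plus voice activity detection; 16 kHz mono PCM

def speech_to_text(audio: bytes) -> str:
    return "turn on the workbench light"  # e.g. Whisper or Moonshine

def handle_intent(text: str) -> str:
    return f"Okay: {text}"  # template matching, Home Assistant, an MQTT broker...

def speak(reply: str) -> None:
    print(reply)  # e.g. Piper text to speech

while True:
    wait_for_wake_word()
    audio = record_until_silence()
    reply = handle_intent(speech_to_text(audio))
    speak(reply)
```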
What I want to do next is go through all the components of a conversational user interface and see where the state of the art stands.

---
layout: true
template: toply
.crumb[ # Intro # Concepts ]

---
class: center, middle
template: inverse

# Principles of voice interfaces
## WAAAH IT'S SO COMPLICATED

???
BREATHE

All right, in order to understand how to do ALL THIS STUFF we're going to have to break it up into steps. And this makes total sense in the open source world, because many of these steps resolve to single software projects, and in most cases there are a number of projects that you can choose from to perform that step.

Sometimes all this software runs on one system; sometimes it might be spread across systems. And that's what gives us flexibility, because we can arrange the software to fit the kinds of tasks we want to do.

---
# Acquiring audio

* Analog microphone
* I2S digital microphones
* Multi-microphone arrays
* Bluetooth headsets
** Hands-Free Audio Gateway Profile (HF-AG)
** Advanced Audio Distribution Profile (A2DP)

???
The first thing we want to do is get some sound. This used to be hard, but then the future happened. A five dollar microcontroller is today fast enough to run analog-to-digital conversion on the output of a microphone and process the audio in real time.

But really, don't do that. Digital microphones are cheaper than chips, and it's especially worthwhile getting a multiple microphone array or other hardware that can do some of the work of isolating background noise. Bluetooth earbuds are now dirt cheap, and the software to talk to them is built into your embedded operating system.

---
# Did you hear something?

* Voice activity detection
* Done in some microphone hardware
** https://wiki.seeedstudio.com/xiao_respeaker/
* For everything else, there's Fourier Transforms
** https://github.com/TheZeroHz/VADCoreESP32

???
The next thing to care about in speech recognition is whether someone is actually speaking. Sometimes the hardware can do this for you, and if not there are off-the-shelf open source libraries that analyse your raw audio to determine whether it's worth thinking about the bits.

Here's a really simple library with just one key function: it tells you whether there is silence or speech. You can twiddle thresholds and priorities and other stuff, but this is a lovely example of an open source component that just does one simple thing, in a pluggable way.

---
# PAY ATTENTION

.fig40c[ ![](wakeup.jpeg)]

???
I contrast this with the big three commercial voice interfaces, where the same steps exist but there are no options for the user about how the pieces fit together.

Speech recognition is expensive, and when your wearable computer needs a heatsink this can affect your choice of which body parts you want to strap it to, so most of the time we use what's called a WAKE WORD to activate the speech recognition subsystem.

This is a privacy and efficiency tradeoff to some extent. You don't want to accidentally trigger effects due to snippets of normal conversation, and with the older systems you don't want them uploading everything you say. So this was the first piece of the puzzle that got implemented on device.

Alexa, Okay Google, Hey Siri. Those are the phrases that the big three default to. The dirty secret was that these phrases were chosen to be easy to recognise: three or four regular syllables, mostly ending with a vowel. False triggers are really common because the models are quite loose.
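Before we move on, here's what those first two steps (capturing audio and gating on voice activity) can look like on a Linux box. This is a sketch, assuming the `sounddevice` and `webrtcvad` Python packages rather than the ESP32 library linked a couple of slides back.

```python
# Capture audio and gate it on voice activity, desktop-style.
# Assumes the sounddevice and webrtcvad PyPI packages; this is the Linux
# equivalent of what the on-device libraries above are doing.
import sounddevice as sd
import webrtcvad

SAMPLE_RATE = 16000      # webrtcvad supports 8, 16, 32 or 48 kHz
FRAME_SAMPLES = 480      # 30 ms frames, one of the three sizes webrtcvad accepts

vad = webrtcvad.Vad(2)   # aggressiveness 0 (lenient) to 3 (strict)

with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16",
                       blocksize=FRAME_SAMPLES) as stream:
    while True:
        frame, _overflowed = stream.read(FRAME_SAMPLES)
        if vad.is_speech(bytes(frame), SAMPLE_RATE):
            print("speech")   # worth handing to the wake word and STT stages
        else:
            print("silence")  # cheap to discard before any heavy processing
```

Everything up to this point can happen on the device, before a single byte leaves the room.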
This is the part that squicks most people out about the cloud. For example, there was a big story in the news a few years ago about a couple who were having a screaming domestic argument, and whose voice appliance somehow hallucinated an instruction to transcribe the argument into a text message to a coworker.

---
# Say WHAT?

.fig80c[ ![](stt.png)]

???
When I started considering this subject, wakeword recognisers were expensive and difficult to produce. Over the last eighteen months this has largely become a solved problem. Software that shipped with a choice of three wakewords a year ago has dozens of options this year. The cost of generating a new wakeword recogniser has fallen from two weeks of supercomputer time and five hundred hours of recordings to an hour on your own PC.

Once the system knows to start listening, it has to convert the waveform produced by humans squirting air through their meat into symbols. This is what we call CONTINUOUS SPEECH RECOGNITION. Pretty much every system out there breaks this out into a speech to text step. That is, it converts audio to ASCII.

A particular model only works for a particular human language, but once again I'm seeing the offerings shift from "English and maybe Chinese" to a laundry list of supported languages.

---
# "Word Recognition"

.fig50c[ ![](voicemenu.jpeg)]

???
Now, you don't strictly HAVE to do full speech recognition. The alternative is called WORD RECOGNITION. Instead of converting the speech to text and then having a separate step to understand the text, simpler systems have a limited number of phrases that they expect to hear, and they perform the much simpler task of determining that the audio sample matches phrase seven, or phrase twelve, or none of the above.

This is obviously less flexible, but it's also something that can be done on a very small microcontroller, and once again the state of the art has shifted this year. Last year I was working with software that could discriminate about a hundred phrases, which left me reloading my phrasebook with different phrases in different contexts. Those limits have doubled in the last year, which may be enough to avoid having to juggle phrasebooks.

---
# State your intent

.fig50c[ ![](intents.png)]

???
Back to speech recognition: we have a phrase converted to text. Now we want to know what to do about it. This is called intent recognition, and it is the big task that the commercial voice assistants consist of.

Google, Apple and Amazon all have complex ecosystems that direct voice commands to their various sub-products: their music players, their home automation, the apps on your phone, or the marketplace of voice services that they expected vendors to create but which never really took off. I'd be really wary of building into these ecosystems, because the vendors have quite clearly lost interest in them. In the open source world there are a handful of projects that do this, which I'll touch on later.

---
# Conversations

.fig60c[ ![](fries.jpeg)]

???
But let's finish the picture first. Processing an intent may be a fire-and-forget operation. Something like "Turn on the light" is a complete conversation in itself. More complex operations like composing a message or scheduling an appointment may involve back and forth, so some systems require a conversation component to handle these complex interactions. Apple is particularly good at this.

When an intent resolves to an action, there needs to be some communication with the outside world.
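Stripped right down, the intent step can be as small as a table of phrase templates. This is a toy sketch in plain Python with made-up intent names, not any particular project's API; real engines such as Rhasspy or Home Assistant's conversation integration use much richer sentence templates.

```python
# A toy intent recogniser: transcribed text in, a structured command out.
# The intent names and phrase templates here are made up for illustration.
import re

INTENTS = [
    (re.compile(r"turn (?P<state>on|off) the (?P<device>.+)"), "set_power"),
    (re.compile(r"what time is it"), "get_time"),
]

def recognise(text: str):
    """Return (intent_name, slots) or (None, {}) if nothing matched."""
    text = text.lower().strip(" .!?")
    for pattern, intent in INTENTS:
        match = pattern.fullmatch(text)
        if match:
            return intent, match.groupdict()
    return None, {}

print(recognise("Turn on the workbench light."))
# ('set_power', {'state': 'on', 'device': 'workbench light'})
```

Whatever the recogniser returns, the intent and its slots are then yours to dispatch however you like.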
A message goes to a light switch, or information passes to an app on your phone, or it might go into a message broker where a number of consumers could act on it.

---
# Feedback

.fig70c[ ![](tuba.jpeg)]

???
Finally there's feedback: answering "Okay, I've opened the pod bay doors" or "I'm sorry Dave, I can't do that". Here we have text-to-speech conversion systems. This is something that has always been possible on device, even back in the 1980s, but with considerable variation in quality.

Speech synthesis is such a solved problem that I'm sick to death of hearing AI voices. For me we're at the top of an uncanny valley: I'd rather have an obviously imperfect voice generated on device than be hearing the same four AI voices everywhere I go.

---
layout: true
template: toply
.crumb[ # Intro # Concepts # Unix ]

---
class: center, middle
template: inverse

# It's a Unix system!
## I know this

.fig50[ ![](unix.png)]

???
BREATHE

All right, now we're ready to start looking at some actual software projects that you can run on a small Linux device or similar. I'm going to go bottom up: first the low-level operations that deal with audio input and output, then the intent and conversation handling later.

---
# Wakewords

* Lots of dead players
* Picovoice's Porcupine is mostly open source

???
Let's start with wakewords. I'm not even going to waste your time with anything that hands audio off to the cloud here. All this stuff runs on Linux, but most of it could be run on BSD or macOS too. The field has shaken out a lot over the last year, so I'm only going to talk about two systems.

Picovoice is a very portable system that runs on a bunch of processors and supports something like sixteen languages. Their open source library is called Porcupine and sits on GitHub under an Apache license; I think a license fee is required for commercial use. If you want to train your own wakeword they have a website that will walk you through that process, and they ship a few default words.

This project also has integrations for a ton of programming languages, so if you are keen to work in your pet language it is probably worth a look. However, the training engine is NOT open source, so that may affect your decision. Or you might want to applaud them for having a business model.

---
# Wakewords

* Lots of dead players
* Picovoice's Porcupine is mostly open source
* OpenWakeWord - the new hotness

???
The new kid on the block is openWakeWord. Even if you don't end up using this, its README has a good explanation of its goals, and of where other projects might do better in some areas, which I love love love.

OpenWakeWord trains its models on synthetic speech, so you don't even have to record human speakers. This sounds like black magic and I'd be very suspicious, except for that README, which is very open about all the other options in the space and how they compare.

This software hasn't changed much in the last year, and I think that reflects that it's gotten good enough for most use cases.

---
# Speech to text

* Cloud offerings (yawn) from Amazon and Google
* Accumulating pile of dead codebases
* OpenAI Whisper - Last year's champion
* Moonshine - https://github.com/usefulsensors/moonshine
* Table of options: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

???
So that's wakewords; the next step is speech to text. Again, there are so many cloud-based speech to text engines that I'm not even going to talk about them.
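To give you an idea of how little code the openWakeWord loop from the previous slide needs, here's a minimal sketch. It assumes the `openwakeword` and `sounddevice` Python packages; the model download step and model naming vary a little between releases, so treat the details as approximate.

```python
# Wake word spotting with openWakeWord. Treat this as a sketch: the download
# step and the model names differ a little between openwakeword releases.
import sounddevice as sd
from openwakeword import utils as oww_utils
from openwakeword.model import Model

oww_utils.download_models()   # one-time fetch of the stock pre-trained models
model = Model()               # loads them all; pass wakeword_models=[...] to narrow down

CHUNK = 1280                  # 80 ms at 16 kHz, the frame size the models expect

with sd.InputStream(samplerate=16000, channels=1, dtype="int16",
                    blocksize=CHUNK) as stream:
    while True:
        audio, _overflowed = stream.read(CHUNK)   # numpy array of shape (CHUNK, 1)
        scores = model.predict(audio[:, 0])       # {model name: score between 0 and 1}
        for name, score in scores.items():
            if score > 0.5:                       # the threshold is yours to tune
                print("heard:", name)
```

Once a score crosses your threshold, you start shipping the audio that follows to the speech-to-text engine.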
The offline speech recognition engine that was leading the pack last year was OpenAI Whisper. It's MIT licensed, but OpenAI itself is deep into the general AI kool-aid, so that might sour you on having anything to do with them. Fortunately there's been a Cambrian explosion in speech engines, sparked by an arms race between Nvidia and OpenAI. There's a URL up there that will list thirty-odd options with various pros and cons.

The other big win is that a number of speech engines can run on a Raspberry Pi and other single board computers - particularly a number of CPUs from Rockchip that have AI accelerators built in.

---
# Text to speech

* Cloud for uncanny valley quality
* Big list https://rhasspy.readthedocs.io/en/latest/text-to-speech/
* Piper is the stand-out, actively developed
* Emotional synthesis

???
All right, let's skip over intents and conversations and look at text to speech. This is another arena where the bleeding edge of quality is cloud-based, but the on-device options are good enough for me. I'm just going to refer you to a webpage here to run down the pros and cons of various offerings. The TL;DR is that if you're generating English you have lots of good choices, and if you want to generate speech in lots of languages, you have a couple of projects that are capable all-rounders.

The one that stands out to me is Piper, which is what Home Assistant uses. It supports a lot of languages and has a choice of voices in many cases. I'm very pleased with this one, and it's being actively developed.

What's new for this year is emotional synthesis: the ability to make your synthesised speech sound happy or sad or homicidal as you desire. Unfortunately, as you might have noticed, we're now bombarded with perpetually angry YouTube videos thanks to this technology.

---
# Hardware
## Let's start with Pi. Most fruitaholics do.

.fig30[ ![](coral.jpeg)]

* Raspberry Pi 4 or 5
* NVIDIA Jetson
* Any SFF PC

???
Let's talk briefly about hardware. Pretty much everybody starts out with the Raspberry Pi single board computer. I have more of them lying about than I know what to do with, so that's what I started out with too. A Raspberry Pi 5 with a Google Coral accelerator stuck into its PCIe slot is certainly an option.

---
# Rockchip Arm64 with neural coprocessor

.fig50l[ ![](radxa_zero_3w.jpeg)]
.fig50r[ ![](rock_5c.webp)]

???
But this year I'm gonna withdraw my recommendation to go with Raspberry Pi. I just got this new third-generation Radxa Zero, which has a one trillion operations per second neural accelerator. If you spend fifty bucks more and can live with the Raspberry Pi form factor, you can get another model from Radxa that is six times faster.

---
# AI In A Box (Radxa Rock 5A)
## https://github.com/usefulsensors/ai_in_a_box

.fig60c[ ![](ai_in_a_box.jpeg)]

???
Now, my apologies to your wallet, but here's one of those eight-core Radxa boards with a neural accelerator, embedded in a product called AI In A Box. It can do continuous speech transcription. It can run large language model chatbots, if you enjoy lying as a service. And it's about three hundred and fifty US bucks. But don't click buy yet, I've got some other options for you later.

---
# Microphones

.fig50l[ ![](respeaker.jpeg)]
.fig50r[ ![](respeaker2.jpeg)]

???
The good thing about the Raspberry Pi playing-card form factor, or the smaller Pi Zero form factor, is that you can get multi-microphone HATs which do a good job of audio input.
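Before we shrink things down to microcontrollers, here's how short the transcribe-and-reply path is on a Pi-class board. A sketch assuming the `openai-whisper` Python package and the `piper` command-line tool from the piper-tts project, with a voice file you've already downloaded; the model names are just the ones I'd reach for.

```python
# Transcribe a clip with Whisper, then speak a reply with Piper.
# Assumes the openai-whisper package (which shells out to ffmpeg to read audio)
# and the piper CLI plus a downloaded voice file; model names are just examples.
import subprocess
import whisper

stt = whisper.load_model("base")          # small enough for a Pi 5 or a Rockchip board
result = stt.transcribe("command.wav")    # audio file in, text out
text = result["text"].strip()
print("Heard:", text)

reply = f"You said: {text}"
subprocess.run(
    ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "reply.wav"],
    input=reply.encode("utf-8"),          # piper reads the text to speak from stdin
    check=True,
)
# ...then play reply.wav with aplay, mpv, or whatever the board has handy.
```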
---
layout: true
template: toply
.crumb[ # Intro # Concepts # Unix # Embedded ]

---
class: center, middle
template: inverse

# Voice on microcontrollers
## Impossible, right?

???
BREATHE

The concept of doing audio processing on a microcontroller sounded absurd five years ago, but that's all changed and I'm super excited. The arrival of AI model acceleration on microcontrollers was the catalyst that got me re-interested in voice interfaces in 2023, and led to this talk.

The architecture that pretty much everybody uses here is the Espressif ESP32. We're talking a couple of CPU cores at 240 megahertz each, and a handful of megabytes each of flash and RAM. For about ten bucks.

---
# All in one

.fig50l[ ![](lyra.png)]
.fig50r[ ![](espbox.png)]

???
For a bare-bones job all you need is an S3 development board and a microphone module, which could come in under twenty bucks. But there are a couple of development boards that combine CPU and microphone and even a screen, which are just divine as long as size isn't an issue.

I've done most of my work on Espressif's own devkit, which has a three-microphone array, a bunch of multicolour lights, and the capacity to run on battery power. M5Stack make a lovely little board that many people are raving about, which also has an LCD screen.

Your hardware choice is going to depend on what your application is. If you're doing home automation or conversational interfaces, then you probably want one of the big boards with lots of microphones. In that case your ESP device is going to do the wake word processing, and then stream the audio to your conversational engine running on a Raspberry Pi or PC.

---
# How small can you go?

.fig50l[ ![](nanos3.jpeg)]
.fig50r[ ![](watchs3.webp)]

???
The other use case, and this is the one that I'm interested in, is building voice directly into an appliance, where the microcontroller is running the whole appliance and just happens to also do voice processing.

If you think about off-the-shelf Internet of Things gadgets like light bulbs, motorised blinds and what have you, they just take orders from the intent processor, which might be your voice assistant base station or might be in the cloud. But this means that they only work as part of a system. If you take them somewhere else, they stop working. On the other hand, who takes their lightbulb or their curtains out for lunch?

My interest is in tools that I can take with me, either body-worn or in a toolbox, and be able to give them instructions without needing a network connection.

---
# Even smaller

.fig100[ ![](wemos-s3-pro.jpeg)]

???
Here are some boards that I've been playing with in the last couple of weeks. The one on the left is an ESP32-S3 with a colour display; I've been testing this as a monocle, about which more soon. The one on the right is an ESP32-C3, which is the RISC-V chip from Espressif. Because RISC-V is an open instruction set architecture with no licensing fees, it costs as little as two dollars. Both of these have Bluetooth and WiFi.

---
# Even sleeker

.fig60c[ ![](ha_voice_pe.jpeg)]

???
Now this is the gadget that everyone's talking about. This is the official HOME ASSISTANT VOICE module. I got this developer preview so recently that I haven't even powered it on yet, but I can tell you that it's simultaneously exciting and disappointing. It's got microphones and lights and a click dial, it's built on the ESP32-S3 microcontroller, and it is fully open source. But it has to shunt your voice to the cloud, or to your home PC, for processing.
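That streaming arrangement is less work than it sounds. As a rough illustration, here's a MicroPython sketch of an ESP32 reading an I2S microphone and pushing raw PCM frames to a base station; the pins, address and port are placeholders, Wi-Fi setup is assumed to have happened in boot.py, and a production build would more likely use ESP-SR or ESPHome than hand-rolled MicroPython.

```python
# MicroPython on an ESP32: read an I2S microphone and stream raw 16 kHz PCM to
# a base station over TCP. Pins, address and port are placeholders for whatever
# your board and network actually use; Wi-Fi is assumed to be up already.
import socket
from machine import I2S, Pin

mic = I2S(0,
          sck=Pin(5), ws=Pin(6), sd=Pin(7),   # wherever your mic breakout is wired
          mode=I2S.RX, bits=16, format=I2S.MONO,
          rate=16000, ibuf=20000)

link = socket.socket()
link.connect(("192.168.1.50", 10700))          # the machine running your speech engine

buf = bytearray(1280)                          # 40 ms of audio per chunk
while True:
    n = mic.readinto(buf)                      # blocks until the DMA buffer has data
    if n:
        link.send(buf[:n])                     # ship it; wake word gating omitted for brevity
```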
If you're using Home Assistant for household automation, this Voice module should drop right in. Or you can reprogram it to do whatever you like.

---
# Phrase recognition - ESP-sr (on github)

.fig80c[ ![](skainet.png)]

???
As far as I know, we don't yet have a general speech to text system that runs unassisted on microcontrollers - but I'm hoping to become wrong about that this year. However, there are a number of options for less ambitious phrase recognition on device.

There are a couple of systems that work with Google's TensorFlow Lite and will run on embedded ARM processors and many others. But for ESP32, Espressif ship their own voice toolkit that does wakeword detection and quite flexible phrase recognition, and this is what I've been using. If you're working in English or Chinese, stick with the factory toolkit. It can recognise up to 300 phrases, which you can specify as English text or as phonetic voodoo symbols. If it thinks it has recognised one of its phrases it will give you a confidence level. I've found it to be very accurate.

---
# This year's goal
## Integrated magnifying, illumination and readouts

.fig50c[ ![](lilygo-t-glass.jpeg)]

???
Now, I spoke about a number of projects utilising all these tools last year and I won't repeat myself - you can find the slides for that on my website.

My project for this year is an assistive headset. I want to be able to deploy magnifying lenses, illumination and a head-up display under voice control. BUT, I've evaluated a number of display monocles and found them all unsuitable for me.

The one on screen is the LilyGO T-Glass. It's about sixty bucks, with an ESP32-S3 processor and open hardware. I would be prepared to love it to bits, but unfortunately I cannot get the screen in focus with my spectacles. So of course I'm building my own combination monocle and voice processor.

---
# Micro-LED displays

.fig100[ ![](g1-specs.png)]

???
Now, these things are the Even Realities G1. They cost about 500 bucks and support prescription lenses. They have a micro-LED projector in the frame, and some compute hardware up behind your ears. No camera. I'm loath to spend that much money on something without trying it first, but I do know of someone on the net who has a pair.

There's an even cooler product that's just been announced from another vendor, but dammit, I lost the press release. I think it's going to be on Kickstarter later this year.

---
class: vtight

.fig25[ ![](keep-calm.jpg) ]

# Peroration*

.footnote[Yes, that's a word. Look it up.]

.nolm[
* what's wrong with the status quo
* the pieces of the puzzle
* offline voice software
* embedded voice software
* voice activated tools
]

???
There is SO much more I wanted to tell you; it's going to be a super exciting year, presuming the world doesn't end.

Today we've taken a whirlwind tour of the architecture of voice interfaces, and looked at two broad approaches: a single board ARM computer, or a microcontroller. We've touched on some very shiny products coming down the pipeline for home automation, and I've shared my plans for this year's vanity project.

So I hope you found some of that useful. I have samples of many of these devices with me, and I'll be around for the rest of the conference if you want to see them.
---
# Resources, Questions

## Related talks - [http://christopher.biggs.id.au/#talks](http://christopher.biggs.id.au/#talks)

## HIRE ME - Available from Feb 2025

- Mastodon: .blue[@unixbigot@aus.social]
- Email: .blue[christopher@biggs.id.au]
- Accelerando - Innovation Space and IoT Consultants - https://accelerando.com.au/

???
* Thanks for coming
* I'm for hire
* Over to you