name: inverse
class: center, middle, inverse
layout: true

.header[.floatleft[.teal[Christopher Biggs] — Spellcasting].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2024]]

---
layout: true
name: callout
class: center, middle, italic, bulletul

.header[.floatleft[.teal[Christopher Biggs] — Spellcasting].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2024]]

---
layout: true
name: toply
class: center, toply, italic, bulletul

.header[.floatleft[.teal[Christopher Biggs] — Spellcasting].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2024]]

---
layout: true
template: callout

.header[.floatleft[.teal[Christopher Biggs] — Spellcasting].floatright[.teal[@unixbigot@aus.social] .logo[accelerando.com.au]]]
.footer[.floatleft[.hashtag[EverythingOpen] Apr 2024]]

---
template: inverse

# Spellcasting At Home
## Voice interfaces without the cloud
### Everything Open 2024 - Yallarm (Gladstone)

.bottom.right[
Christopher Biggs, .logo[Accelerando Lab]
@unixbigot@aus.social .logo[accelerando.com.au]
]

???

---
layout: true
template: toply

.crumb[
# Intro
]

---

# About Me
## Christopher Biggs — .teal[@unixbigot@aus.social] — .logo[accelerando.com.au]

* Meanjin (Brisbane), Australia
* Founder, .logo[Accelerando - Innovation Space and IoT consultants]
* Convenor, Brisbane Internet of Things interest group
* 30 years in IT as developer, architect, manager
* COVID has been hard, can has job, pls?

---

# Prelude

.fig100[ ]

???

Welcome, friends sanguine and synthetic. There will be no "I am not a robot" checkbox to access this presentation.

A long time ago in a conference far far away - I've grepped, and it was in Sydney town during the year of our lady gaga two thousand and seventeen, which feels today like a different century in a different galaxy - I first speculated about where our yearning for artificial intelligence, which predates information technology itself, would end up.

If it fails, we have to get by with non-intelligent machines, but if it succeeds to any degree, where do we draw the line between software and slavery? How long until our successors look back at those "I am not a robot" checkboxes as if they were segregated water fountains?

---

# This is not about AI

.fig70c[ ]

???

Looking back, 2017 was about nine months before the first of the OpenAI Generative Pretrained Transformers, GPT-1, landed, but a few months after the first papers about the concept. Those first few falling pebbles entirely escaped my notice at the time.

I remain only mildly informed about the state of so-called AI - the only thing I can tell you for sure is that it is blue. So as a non-expert I am going to refrain from broadcasting my opinions about it. I've heard that's a thing you can do.

What I do have some expertise about is how much computing you can cram into a tuna can, after you remove the tuna. So what I want to explore today is what kinds of things we can do in the realm of self-contained, non-intelligent appliances.

---

# The magical universe that shall not be named

.fig50c[ ]

???

There was a series of books and movies almost a generation ago about a secret world of magic, which we all know is just sufficiently advanced technology. This technology was mediated by wands and pocket maps and magic cloaks. Wearable computing.

This fantasy world did not appear to ever explain how a two-word invocation could result in repairing a broken pair of spectacles or washing, drying and storing an entire sink-load of dishes. Surely you'd need to supply the lens prescription, or specify that the good silverware goes in the sideboard, not the kitchen drawer.

It turned out that whole society ran on slavery, and the author was practically a nazi. This is why we can't have nice things.

---

# All in the wrist

.fig70c[ ]

???

That other cryptofascist magical franchise, Star Wars, seems to pay a bit more attention to credible user interfaces. In order to individually levitate a hundred chunks of rock and rescue your buried friends, you have to adopt a determined expression and arch your hand just so. Gesture-based computing, in fact.

I gather that apple is right on top of refining that at present, and I'm sure Aaron Quigley and his gang of co-conspirators would love to get some force chips into their prototypes.

Of course, as soon as you look at interactive technology in Star Wars, you see once again that the whole thing runs on slaves. There is one thing Star Wars got right: there are always two, the master and the apprentice.

---

# Always two there are

.fig50c[ ]

???
The apprentice model is found in fields such as motor mechanics, plumbing, pair programming, brain surgery, and yes, galactic intrigue. The apprentice is there to pass tools and learn how to anticipate which tool to pass, on the way to becoming a master.

As someone who works alone at the moment, having had a series of interns back in the before times, and being the possessor of a factory-second brain and a collection of increasingly creaky joints, I would really appreciate having an apprentice again. I'd settle for a mystic energy field that binds the universe together - yes, even between me and that screwdriver.

A magic wand would be great, but as I've said, I reckon they are -- being careful to remain PG here -- a load of horse pucky.

---

# Pass me that wrench

.fig80c[ ]

???

Switching franchises again, Iron Man has his skutters and Jarvis to locate and supply the right tool, or put out that fire. And props to Iron Man for freeing his slave eventually.

Amazon promised the same magic infrastructure with their supermarkets, where you shovel items into your shopping cart and the store computer adds it up. If you didn't hear, a week or two ago it emerged that that whole chain of stores actually ran on offshore wage-slaves glued to video feeds.

---

# Alexa, add privacy to my shopping list

.fig80c[ ]

???

The other big attempt at household magic of course was the home voice assistants - amazon alexa, google home, apple siri, and the rest. I'm so disappointed in these. They're utterly reliant on fast access to the cloud, and their business model is basically privacy violation.

It's pretty clear that google and amazon have worked out that there is no money to be made here. Apple has had some great ideas but does not appear to be putting in the effort.

So maybe the whole concept of voice interfaces is over, it'll never work. In one universe this presentation ends here, thanks for coming.

---
layout: true
template: toply

.crumb[
# Intro
# Why
]

---
class: center, middle
template: inverse

# Philosowaffling
## note to self, breathe

???

BREATHE

So what is it that /I/ want anyway?

---

.fig50c[ ]

???

Right now every company and their dog is adding generative ai chatbots to their products. Microsoft is going all-in at the operating system level and also in their development platforms like vs code and github.

Google is betting the farm on being the evil empire, and a big part of that is something that Geoff Huston mentioned in his keynote. He called out Doubleclick in his history of where the internet went wrong, the panopticon advertising company that wants to know everything about you. Whatever happened to Doubleclick? Google purchased them in 2008, and then Doubleclick hollowed out the soul of google and now wears the corpse around like a false moustache.

---

# I don't want what you're selling

.fig50c[ ]

???

Amazon's approach with the Alexa ecosystem was skills. The idea was that developers could write voice control modules alongside their product firmware. Amazon would give you free hosting and incentives to develop skills - until last week, when they pulled the plug on that. It seems that amazon is planning to pivot to generative ai instead.

You can ask apple's Siri things like "where is Jane" and "remind me to book the car in for service when I get to work" or "dim the lights in the tv room". This is exactly right, but it's limited. I think it could be great if the missing pieces get built, but that doesn't seem to be the plan. Instead the rumours say that the next update of the iphone os is focusing on generative ai.
It'd be great if these things worked, but like the border between software and slave, there's an extremely fuzzy and hard-to-detect border between a useful expert system and a pack of lies.

In another universe just one more revision to ChatGPT solves the hallucination problem, we all get a pony, and this presentation ends here. Thanks for coming.

---

# The AI mania again
## sorry

.fig50c[ ]

???

Despite the fact that venture capital is showering the field with money, OpenAI is promising to spend a trillion dollars on development, and every major vendor is going berserk stenciling AI onto their product bases, I think it's a con, and I think the vendors and the investors also think it's a con.

I know I said I wouldn't spout my uninformed opinions about AI, but it turns out there's a real fuzzy boundary between my promises and lies too.

In one universe, maybe this one, the AI mania runs out of money, the servers consuming vast amounts of energy and water all go dark, and conversational interfaces acquire a stink that lasts a generation. Presentation over, thanks for coming.

---

# Doom Doom

.fig80c[ ]

???

But it's not all doom. No, wait a minute, it is all doom. Doom the game, from 1993. Not the first 3D game, but the one everyone remembers.

Within a few years, if you wanted to enjoy 3D games you bought a graphics accelerator. A few years after that you needed a graphics processing unit to play new games. And the feedback loop between games and GPUs led to these things becoming massively powerful vector processing supercomputers that enabled clever applications like face recognition and protein folding, and stupid applications like blockchain and generative ai models. Whoops, opinion, sorry.

---

# End of the road?

.fig60c[ ]

???

Now that we've proved these novel vector applications are possible if you dream big, people began to dream small, and we're starting to get vector acceleration hardware in phones and smartwatches and microcontrollers.

And that's what I think the generative ai boosters are betting on: that before they run out of money, commodity hardware will catch up and they can run their scams on your computer instead of theirs. That's what I'm gonna do too.

---

# The plan

.fig50c[ ]

???

So here's my evil plan. We've just reached a point where this year's phones are going to be doing speech recognition on the device, rather than having to transmit audio to the cloud. We might get one more generation of home voice assistants that can do this too, before the money runs out.

And right now we can buy small form factor computers and even microcontrollers that have the so-called ai acceleration hardware that enables some form of voice recognition on the device, without needing any cloud service. So finally we have the potential to roll our own conversational interfaces without the cost, latency and creep factor of cloud services.

Like all my evil plans, unfortunately I'm running late and a whole lot of folks have gotten partway to world domination already, and we'll go through that first. But before then I want to step back and look at some key concepts that you're going to need to recognise.

---
layout: true
template: toply

.crumb[
# Intro
# Why
# What
]

---
class: center, middle
template: inverse

# Principles of voice interfaces
## WAAAH IT'S SO COMPLICATED

???

BREATHE

All right, so in order to understand how to do ALL THIS STUFF we're going to have to break it up into steps.
And this makes total sense in the open source world, because many of these steps resolve to single software projects, and in most cases there are a number of projects that you can choose from to perform that step.

Sometimes all this software runs on one system; sometimes it might be spread across systems. And that's what gives us flexibility, because we can arrange the software to fit the kinds of tasks we want to do.

---

# PAY ATTENTION

.fig40c[ ]

???

I contrast this with the big three voice interfaces, where the same steps exist but there's no choice about how the pieces fit together.

The first thing that needs to happen is that we need to get the system's attention. This could involve pressing a button or moving a device - an apple watch starts listening when you lift it to your mouth. Most of the time we use a wake word to activate the system.

This is a privacy and efficiency tradeoff to some extent. You don't want to accidentally trigger effects due to snippets of normal conversation, and in the older systems you don't want them uploading everything you say. So this was the first piece of the puzzle that got implemented on device.

Alexa, Okay Google, Hey Siri. Those are the phrases that the big three default to. The dirty secret was that these phrases were chosen to be easy to recognise: three or four regular syllables, mostly ending with a vowel. False triggers are really common because the models are quite loose.

This is the part that squicks most people out about the cloud - for example, the big story in the news a couple of years ago about a couple who were having a screaming domestic argument and whose voice appliance somehow hallucinated hearing an instruction to compose a text message to a coworker.

---

# Say WHAT?

.fig80c[ ]

???

When I started considering this subject, wakeword recognisers were expensive and difficult to produce. Hundreds of hours of audio were needed, and weeks of compute time to train the model. I actually considered picking some words and asking the everything open attendees to help crowdsource model training by each contributing a few pronunciations. In the last few months the state of the art has moved to where an open source wakeword model can be trained in about an hour.

Once the system knows to start listening, it has to convert the waveform produced by humans squirting air through their meat into symbols. That is called speech recognition. Pretty much every system out there breaks this out into a speech to text step. That is, it converts audio to ascii. A particular model only works for a particular human language, but there are some recent advances in detecting what language the user is speaking.

---

# "Word Recognition"

.fig50c[ ]

???

Now, you don't strictly HAVE to do full speech recognition. The alternative is called word recognition. Instead of converting the speech to text and then having a separate step to understand the text, simpler systems just have a limited number of phrases that they are expecting to hear, and they perform the much simpler task of determining that the audio sample matches phrase seven, or phrase twelve, or none of the above.

This is obviously less flexible, but it's also something that can be done on a very small microcontroller, and I'm going to come back to that in more detail near the end.
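
To make the idea concrete, here's a minimal sketch of what "phrase seven, phrase twelve, or none of the above" looks like downstream. The `recognise_phrase()` function is hypothetical - a stand-in for whichever engine actually scores the audio:

```c
/* Minimal sketch of the "word recognition" idea: a fixed menu of
 * expected phrases, and a recogniser that returns the index of the
 * best match, or -1 for "none of the above".
 */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

static const char *phrases[] = {
    "turn on the light",   /* 0 */
    "turn off the light",  /* 1 */
    "a bit brighter",      /* 2 */
    "a bit dimmer",        /* 3 */
};

/* hypothetical: provided by whatever recognition engine you use */
extern int recognise_phrase(const int16_t *pcm, size_t samples);

void handle_audio(const int16_t *pcm, size_t samples)
{
    int match = recognise_phrase(pcm, samples);
    if (match < 0 || match >= (int)(sizeof(phrases) / sizeof(phrases[0]))) {
        return;   /* none of the above: stay quiet rather than guess */
    }
    printf("heard: %s\n", phrases[match]);
    /* ...dispatch to the action for that phrase... */
}
```
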
---

# State your intent

.fig50c[ ]

???

Back to voice recognition: we have a phrase converted to text. Now we want to know what to do about it.

This is called intent recognition, and this is the big task that the commercial voice assistants mostly consist of. Google, apple and amazon all have big ecosystems that direct voice commands to their various sub-products - their music players, their home automation, the apps on your phone, or the marketplace of voice services that they expected vendors to create but which didn't really take off.

In the open source world there are a handful of projects that do this, and that's what I want to talk about next.

---

# Conversations

.fig60c[ ]

???

But let's finish the picture first. Processing an intent may be a fire-and-forget operation. Something like "Turn on the light" is a complete conversation in itself. More complex operations like composing a message or scheduling an appointment may involve back and forth, so some systems provide a conversation component to handle these complex interactions. Apple is particularly good at this.

When an intent resolves to an action, then there needs to be some communication with the outside world. A message goes to a light switch, or information passes to an app on your phone, or it might go into a message broker where a number of consumers could act on it.

---

# Feedback

.fig70c[ ]

???

Finally there's feedback, answering "Okay, I've opened the pod bay doors" or "I'm sorry, Dave, I can't do that".

Here we have text to speech conversion systems. This is something that has always been possible on device -- even back in the 1980s -- but with considerable variation in quality. Quite a number of current systems still do the text to speech in the cloud, but on the other hand on-device TTS is finally, I think, good enough.

---
layout: true
template: toply

.crumb[
# Intro
# Why
# What
# Unix
]

---
class: center, middle
template: inverse

# It's a Unix system!
## I know this

.fig50[ ]

???

BREATHE

All right, now we're ready to start looking at some actual software projects that you can run on a small linux device or similar. I'm going to go bottom up: that is, we'll talk first about the low level operations that deal with audio input and output, then we'll consider intent and conversation handling later.

---

# Wakewords

* Lots of dead players
* Picovoice's Porcupine is mostly open source

???

Let's start with wakewords. I'm not even going to waste your time with anything that hands audio off to the cloud here. All this stuff runs on linux, but most of it could be run on bsd or mac too.

Snips.ai used to be very popular, but it got bought out and shut down.

Picovoice is a very portable system that runs on a bunch of processors and supports like sixteen languages. Their open source library is called porcupine and sits on github under an apache license. I think a license fee is required for commercial use. If you want to train your own wakeword they have a website that will walk you through that process, and they ship a few default words. This project has also got bindings for a ton of programming languages, so if you are keen to work in your pet programming language this is probably worth a look. However, the training engine is NOT open source, so that may affect your decision.

---

# Wakewords

* Lots of dead players
* Picovoice's Porcupine is mostly open source
* PocketSphinx - very free, rather stale

???

I first got interested in voice interfaces when I saw Kathy Reid, a regular at these events, talk about them around 2020.
At the time she was working for mycroft dot ai, which is no more, but yay open source, because the code is still out there, the community still exists, and in this case the training engine is open source too. So it's an older codebase, but it checks out.

The other old-time, totally free option is CMU's pocketsphinx. It's still being developed, but even the authors admit that it's far behind the state of the art.

---

# Wakewords

* Lots of dead players
* Picovoice's Porcupine is mostly open source
* PocketSphinx - very free, rather stale
* OpenWakeWord - the new hotness

???

The new kid on the block is openwakeword. Even if you don't end up using it, its readme has a good explanation of its goals, and of where other projects might do better in some areas, which I love love love.

openwakeword trains its models on synthetic speech, so you don't even have to record human speakers. This sounds like black magic and I'd be very suspicious, except for that readme, which is very open about all the other options in the space and how they compare.

---

# Speech to text

* Cloud offerings (yawn) from Amazon and Google
* OpenAI Whisper - open source (but cult-adjacent)

???

So that's wakewords; the next step is speech to text. There are so many cloud based speech to text engines that I'm not even going to talk about them.

The offline speech recognition engine that's leading the pack appears to be openai whisper. It's MIT licensed, but the openai company itself is deep into the general ai kool-aid, so that might sour you on having anything to do with them.

---

# Speech to text

* Cloud offerings (yawn) from Amazon and Google
* OpenAI Whisper - open source (but cult-adjacent)
* PocketSphinx still hanging on
* Kaldi - out of academia and with the stylesheet to show for it

???

Pocketsphinx is there in the STT space too. Very free, lots of human languages, rather rusty.

Kaldi is a project that comes from academia. Still getting developed, really good doco. https://github.com/kaldi-asr/kaldi

---

# Speech to text

* Cloud offerings (yawn) from Amazon and Google
* OpenAI Whisper - open source (but cult-adjacent)
* PocketSphinx still hanging on
* Kaldi - out of academia and with the stylesheet to show for it
* DeepSpeech - will smoke your GPUs if you've got 'em

???

DeepSpeech is from Mozilla. Mozilla public license. Well documented. Mozilla seems to be having focus problems, and at one point went really big on home automation, and then shut it all down again. I think this is a leftover from that era, and it hasn't had a release in four years. But it'll use gpu acceleration if present, which maybe gives it a performance and accuracy boost.

---

# Text to speech

* Cloud for uncanny valley quality
* Big list https://rhasspy.readthedocs.io/en/latest/text-to-speech/
* Piper is the stand-out, actively developed

???

All right, let's skip over intents and conversations and look at text to speech. This is another arena where the bleeding edge of quality is cloud based, but the on-device options are good enough for me.

I'm just going to refer you to a webpage here to run down the pros and cons of various offerings. The tldr is that if you're generating english you have lots of good choices, and if you want to generate speech in lots of languages, then you have a couple of projects that sound pretty ordinary. https://rhasspy.readthedocs.io/en/latest/text-to-speech/

The one that stands out to me is piper, which is what home assistant uses - it supports a lot of languages and has a choice of voices in many cases.
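
To give a feel for how easy it is to wire in, here's a rough sketch that shells out to piper from C. The assumptions: piper is installed on the PATH, you've already downloaded a voice model (the filename in the usage comment is just an example), and the --model and --output_file flags match your version - check `piper --help`.

```c
/* Rough sketch: bolt piper onto a C program by shelling out.
 * Assumes piper is on the PATH and a voice model is downloaded;
 * verify the flags against your piper version.
 */
#include <stdio.h>

int say(const char *text, const char *model, const char *wav_out)
{
    char cmd[512];
    snprintf(cmd, sizeof(cmd),
             "piper --model %s --output_file %s", model, wav_out);

    FILE *p = popen(cmd, "w");   /* piper reads the text from stdin */
    if (!p) return -1;
    fprintf(p, "%s\n", text);
    return pclose(p);            /* non-zero means piper complained */
}

/* example (hypothetical model filename):
 *   say("I'm sorry Dave, I can't do that",
 *       "en_GB-example-medium.onnx", "reply.wav");
 */
```
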
I'm very pleased with this one, and it's being actively developed.

---

# Hardware
## Let's start with Pi. Most fruitaholics do.

.fig30[ ]

* raspberry pi 4 or 5
* google coral
* nVidia Jetson
* Any SFF PC

???

Let's talk briefly about hardware. Pretty much everybody starts out with the raspberry pi single board computer. I have more of them lying about than I know what to do with, so that's what I'm using too.

If you are starting from scratch, then I would say get a raspberry pi 4 or 5 with as much ram as you can. The v3 isn't too bad, but it's starved for ram, so only use one if you've got it already. One interesting thing about the pi 5 is that it has a pcie slot into which you can put a processor board like the coral - which is only about 30 bucks extra.

On the other hand, if you have a spare laptop or an intel based small form factor PC, you could use that. I swore a long time ago I would never buy another desktop PC; I've been surviving on dumpsters and hand-me-downs.

---

# Microphones

.fig50l[ ]
.fig50r[ ]

???

Finally, there's a number of off-brand raspberry pi workalikes -- I've been using a radxa zero, which has 4 gig of ram, 128 gig of onboard flash and an sd card socket to boot, and I think you can now get up to 16 gig of ram in this form factor.

The good thing about the raspberry pi playing card form factor, or the smaller pi zero form factor, is that you can get multi-microphone hats which do a good job of audio input. I've brought along a number of examples that you can have a poke at later if you like.

---
layout: true
template: toply

.crumb[
# Intro
# Why
# What
# Unix
# Embedded
]

---
class: center, middle
template: inverse

# Voice on microcontrollers
## Impossible, right?

???

BREATHE

The concept of doing audio processing on a microcontroller sounded absurd five years ago, but that's all changed and I'm super excited. The arrival of ai model acceleration on microcontrollers was the catalyst that got me re-interested in voice interfaces last year and led to this talk.

The architecture that pretty much everybody uses here is the Espressif ESP32. We're talking a couple of CPU cores at 240 megahertz each, and a handful of megabytes each of flash and ram. For about five bucks.

Espressif are in the middle of migrating from their S series Tensilica cpu cores to their own C series RISC-V cores, but I'm gonna stick to the legacy cores here, even though I'm a big fan of the C series chips, because I don't think any of the C-series ones have vector accelerators yet.

The original ESP32, nowadays sometimes called the ESP32S or the S1, is no slouch, but if you don't already have a board lying about, look for the ESP32-S3, which is the latest model in the S series and the one with all the hardware acceleration goodness.

---

# All in one

.fig50l[ ]
.fig50r[ ]

???

For a bare bones job all you need is an S3 development board and a microphone module, which could come in under twenty bucks. But there's a couple of development boards that combine cpu and microphone and even a screen, which are just divine as long as size isn't an issue.

I've done most of my work on Espressif's own devkit, which has a three-microphone array, a bunch of multicolour lights, and the capacity to run on battery power. M5 make a lovely little board that many people are raving about that also has an LCD screen.

Your hardware choice is going to depend on what your application is. If you're doing home automation or conversational interfaces, then you probably want one of the big boards with lots of microphones.
In that case your ESP device is going to do the wake word processing, and then stream the audio to your conversational engine running on a raspberry pi or pc.

---

# How small can you go?

.fig50l[ ]
.fig50r[ ]

???

The other use case, and this is the one that I'm interested in, is building direct voice into an appliance, where the microcontroller is running the whole appliance and just happens to also do voice processing.

If you think about the off-the-shelf internet of things gadgets like light bulbs, motorized blinds, and what have you, they just take orders from the intent processor, which might be your voice assistant base station or it might be in the cloud. But this means that they only work as part of a system. If you take them somewhere else, they stop working. On the other hand, who takes their lightbulb for a walk?

My application is for tools that I can take with me, either body-worn or in a toolbox, and be able to shout at them without needing a support network.

---

# Phrase recognition - ESP-skAInet (on github)

.fig80c[ ]

???

As far as I know, we don't have a general speech-to-text system that runs on microcontrollers - but I'm willing to be surprised. There's a number of options for phrase recognition. There are a couple of systems that work with google's tensorflow lite and will run on embedded arm processors and many others.

But espressif ship their own voice toolkit that does wakeword detection and quite flexible phrase recognition, and this is what I've been using. If you're working in english or chinese, stick with the factory toolkit. It can recognise up to 200 phrases, which you can specify as english text or as phonetic voodoo symbols. If it thinks it has recognised one of its phrases it will give you a confidence level. I've found it to be very accurate.

---

# Contextual vocabulary

.hugecode[
```c
static char *global_commands[] = {
  "start over", "all done now",
  "list commands", "what was I doing",
  NULL
};

static char *solder_commands[] = {
  "more feed", "less feed",
  "feed one", "feed two", "feed three",
  "feed four", "feed five",
  "small wire", "mid wire", "big wire",
  "hut", "retract", "hit me",
  NULL
};
```
]

???

Now the cool thing is that you can load up different phrase tables at runtime, so you are only trying to match the phrases that make sense in context. I'm going to give some detailed examples of what I mean later on.

In the case of dedicated devices I don't even bother with wakewords; I just have them listen all the time, or use a motion sensor to initiate listening. If you do want to have a wakeword, you have the choice of espressif's own library or the microwakeword project. The factory toolkit gives you some default words, but requires you to pay to generate new wakeword models, whereas microwakeword uses that synthetic speech hack to generate new models.
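
Going back to the phrase tables for a moment: here's roughly what loading a different vocabulary per context looks like in firmware. This is only a sketch - the `register_phrases()` hook is hypothetical, a stand-in for however your engine takes a command list (ESP-SR has a speech-commands API for this) - and it reuses the tables from the slide above:

```c
/* Sketch of contextual vocabulary switching: only the table for the
 * current mode is handed to the recogniser, so "feed five" can't be
 * matched unless we're actually in soldering mode.
 */
#include <stddef.h>

/* hypothetical: however your recogniser loads a NULL-terminated list */
extern void register_phrases(char **phrases);

/* the tables from the slide above */
extern char *global_commands[];
extern char *solder_commands[];

enum mode { MODE_GLOBAL, MODE_SOLDER };

static char **tables[] = {
    [MODE_GLOBAL] = global_commands,
    [MODE_SOLDER] = solder_commands,
};

static enum mode current_mode = MODE_GLOBAL;

void set_mode(enum mode m)
{
    current_mode = m;
    register_phrases(tables[m]);    /* reload the active phrase table */
}

/* The recogniser reports a match as an index into whichever table was
 * active at the time. */
void on_match(int index)
{
    char *phrase = tables[current_mode][index];
    (void)phrase;
    /* ...dispatch on the phrase for the current mode... */
}
```
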
---
layout: true
template: toply

.crumb[
# Intro
# Why
# What
# Unix
# Embedded
# Homes
]

---
class: center, middle
template: inverse

# Intents and home automation

???

BREATHE

Okay, so before I get to my own favourite application, which is voice-activated appliances, I want to talk about home automation - or building automation, because I use this in the office too.

Right now the big commercial systems - amazon, google, apple - are all sending audio recordings to the cloud. Maybe this works in silicon valley, but here in the antipodes it sucks, uh, nevermind.

BUT I predict this is going to change, with voice recognition on device for apple and google - and look, I'll be surprised if amazon bother; siphoning your data is the whole point for them.

---

# Open source homes

.fig50l[ ]
.fig50r[ ]

???

In the open source world there are a number of complete home automation systems. I built my own, which I pretty comprehensively regret, but at the time there wasn't a clear winner across the board. That's changed.

Home Assistant is the one to beat, and for them 2023 was the year of voice: they focused on this and built an intent system and an ecosystem of interfaces to devices and to other platforms.

There are some other systems that you still might consider if you have special needs. Rhasspy is one, and it kind of melds into homeassistant around the edges because they share a lot of components. The thing about rhasspy is it goes all-in on MQTT, with all the interactions taking place via a message broker. This means it's particularly powerful where you want to glue it to your own components and systems.
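
For instance, here's a rough sketch of what that glue can look like from C using libmosquitto: subscribe to the recognised-intent topics and act on whatever arrives. The `hermes/intent/#` topic is an assumption based on the Hermes convention Rhasspy follows - check your own setup.

```c
/* Minimal sketch: listen for recognised intents over MQTT with
 * libmosquitto and print them. Build with: gcc glue.c -lmosquitto
 * The hermes/intent/# topic is an assumption - verify against your
 * broker and Rhasspy configuration.
 */
#include <stdio.h>
#include <stdbool.h>
#include <mosquitto.h>

static void on_message(struct mosquitto *mosq, void *userdata,
                       const struct mosquitto_message *msg)
{
    (void)mosq; (void)userdata;
    /* Payload is JSON describing the intent and its slots */
    printf("intent on %s: %.*s\n", msg->topic,
           msg->payloadlen, (char *)msg->payload);
}

int main(void)
{
    mosquitto_lib_init();
    struct mosquitto *m = mosquitto_new("my-glue", true, NULL);
    mosquitto_message_callback_set(m, on_message);

    if (mosquitto_connect(m, "localhost", 1883, 60) != MOSQ_ERR_SUCCESS) {
        fprintf(stderr, "can't reach the broker\n");
        return 1;
    }
    mosquitto_subscribe(m, NULL, "hermes/intent/#", 0);
    mosquitto_loop_forever(m, -1, 1);   /* never returns in this sketch */
    return 0;
}
```
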
---

# Open source homes

* Home-Assistant
* Rhasspy
* Kalliope
* Use a central PC/Pi and satellite voice terminals

???

We've also got Kalliope, which has a very impressive website, and clearly set out to prioritise building an ecosystem of intents, triggers and effects, but then the wind went out of its sails and it's been pretty idle for the last two years.

The other thing that makes homeassistant compelling is that you can use low-cost esp32 or raspberry pi satellite microphones, say one in each room, and the whole system understands context: so if you're in the kitchen and you say turn on the light, it knows you mean the kitchen light.

If you were doing something that is a little off the wall, you still might look at rhasspy because it's so flexible, even though development looks to have gone quiet.

---

## Integration

.fig50c[ ]

???

For a while there everybody was building an open source home automation system - mozilla and mycroft and many more - but I think it's clear that homeassistant is going to be the crocodile to all those dinosaurs.

Oh yeah, home assistant also plays well with your gear from google and apple and amazon, so that's good. There's a new standard called matter which is supposed to allow devices in the googlapplezon silos to work together, but it's a little early to say if it's the holy grail or just competing standard number fifteen.

---
layout: true
template: toply

.crumb[
# Intro
# Why
# What
# Unix
# Embedded
# Homes
# Enhance
]

---
class: center, middle
template: inverse

# Zoom and Enhance

???

BREATHE

Okay, now I told you all that stuff to tell you this stuff.

---

# go right. stop.

.fig100[ ]

???

There's a famous scene in the 1982 science fiction classic blade runner, where the detective is improbably enhancing a photograph using a voice interface. The character deckard says things like "enhance thirty four to forty six", "go right", "stop", "track 45 left", and ends with "gimme a hardcopy right there".

I've spent years thinking about this scene. It's an utterly rubbish user interface compared to pinch zooming on a touch screen, or even clicking and dragging with a mouse. But.

---

# Full hands interfaces

.fig50c[ ]

???

There are moments where my hands are full and I need exactly this interface. When I'm working with a microscope, I might be holding a pair of probes or a soldering iron and solder, and at high magnification trying to move the object I'm looking at is a non-starter.

I even built an x-y stage to move things around, and just touching those knobs by hand can end up losing the job. So being able to voice-command a microscope stage to move a fraction of a millimetre back and forth is something that I'm in the middle of building right now.

It's not uncommon for a video microscope to lay a grid on the screen to help you locate items. The particular microscope I'm using has an hdmi output, so when I plug that into a monitor I get the grid. But it also has USB, and I can plug it into a raspberry pi. This is the way I use it most of the time, because I can put the microscope view on half the screen and pull up the documentation or schematic for the object that I'm looking at on the other side.

---

## Microscope stage

.fig50l[ ]
.fig50r[ ]

???

Now, my microscope has variable magnification, but it's not a zoom. It's what's called varifocal. You don't see those much in consumer items, but the difference is that a varifocal lens can change magnification, but then it loses focus and you have to refocus it. A zoom lens, which stays in focus as you zoom, is clearly better, but also much more complicated and expensive. You see varifocals in movie cameras, security cameras and microscopes, where quality trumps convenience.

In this case it's a right pain, because I actually need four motors: two to move the stage, one for the magnification and another one for the focus. And then if I want to zoom in I have to write an autofocus algorithm.

So like any good hacker, I cheated. This camera is quite high resolution, and I normally scale it down. So we can start out with a wide view, and when we want to zoom in we can just crop the image. If we overlay a grid, and then number each grid intersection, then that rather confusing scene from blade runner suddenly makes sense. When deckard says "enhance thirty four to forty six" he's naming the corners of a rectangle that the system is going to reframe the image on.

---

# Two Hundred Word Limit

.fig80c[ ]

???

Let's turn that around and look at it from the perspective of the esp32 microcontroller that's running those four or five motors. It's limited to 200 phrases. Oh, and it can't recognise numbers, only the words for the numbers. So what are our options?

We've got 200 phrases, so we can start out with go left, go right, go up, go down, stop, gimme a hardcopy right there - just kidding, who has a printer any more? Screenshot, maybe. Then we've got maybe 180 phrases for enhancing.

For every pair of numbers we would have to burn a slot - one slot for "enhance thirty four to forty six", another slot for "enhance thirty four to forty seven", and so on. Except that some pairs don't make sense because they're collinear. In fact, it probably only makes sense to listen for pairs that are more or less the shape of the screen.

So if we lay down a five by five grid and number the intersections along the first row, then the second row, and so on, we will have 25 intersections, and we can listen for "enhance one to seven" and "enhance one to thirteen" and "enhance one to twenty" and ignore all the other combinations that start from one.

---

# What now boss?

.hugecode[
```c
static char *scope_commands[] = {
  "stop", "wait a minute", "go back",
  "track left", "track right",
  "tilt up", "tilt down",
  "enhance", "pull back",
  "give me a hardcopy",
  "thirteen", "fourteen", "fifteen", "sixteen",
  "seventeen", "eighteen", "nineteen", "twenty",
  "twenty one", "twenty two", "twenty three",
  "twenty four", "twenty five", "twenty six",
  "twenty seven",
```
]

???
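
For the record, turning a pair of spoken grid intersections into a crop rectangle is only a little arithmetic. A sketch, assuming a five-by-five grid of intersections numbered one to twenty five, row by row, laid over a frame w pixels wide and h pixels high:

```c
/* Sketch: convert two spoken grid intersections into a crop rectangle.
 * Assumes a 5x5 grid of intersections numbered 1..25 row by row, so
 * the grid lines sit at quarter intervals of the frame. "Enhance one
 * to seven" crops to the rectangle whose opposite corners are
 * intersections 1 and 7.
 */
struct rect { int x, y, w, h; };

struct rect enhance(int a, int b, int frame_w, int frame_h)
{
    /* intersection n -> pixel position of its grid point */
    int ax = ((a - 1) % 5) * frame_w / 4, ay = ((a - 1) / 5) * frame_h / 4;
    int bx = ((b - 1) % 5) * frame_w / 4, by = ((b - 1) / 5) * frame_h / 4;

    struct rect r;
    r.x = ax < bx ? ax : bx;
    r.y = ay < by ? ay : by;
    r.w = ax < bx ? bx - ax : ax - bx;
    r.h = ay < by ? by - ay : ay - by;
    return r;    /* feed this to the cropping and scaling step */
}
```
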
Another way to do it would be to make it conversational. We just listen for "enhance", and then we load up a phrase table that just contains the numbers one to eighty, and we listen for two matches. We can ignore the word 'to' in between - oh hell no we can't, because the numeral t-w-o sounds just like the preposition t-o.

So maybe we start the numbering from thirteen, because another thing I forgot to tell you is that short phrases are less accurate - you really want three syllables at least, which is why alexa is called alexa, but siri and google needed a hey up front to pad the syllable count.

So either I'm reading far too much into an old movie, or ridley scott accidentally nailed the design of voice interfaces.

---
layout: true
template: toply

.crumb[
# Intro
# Why
# What
# Unix
# Embedded
# Homes
# Enhance
# Tools
]

---
class: center, middle
template: inverse

# Hands free tools

???

BREATHE

Okay, home stretch. So what other tools benefit from voice interfaces?

---

# Multimeters

.fig40l[ ]
.fig40r[ ]

???

Well, multimeters for one. You have two probes that you hold in each hand, and usually a dial that you turn to change what the meter measures - it can measure voltage and current and resistance and continuity and sometimes half a dozen other quantities. So being able to say "select resistance", "select voltage", "hold reading", "clear hold" and so forth is really handy.

Now, I already own a multimeter that has a bluetooth control api, and our ESP32 has bluetooth also, so we can glom a microcontroller module the size of a postage stamp onto our multimeter and retrofit a voice interface to it. If your multimeter has a USB port, then the ESP32-S3 can do USB too.

Same goes for oscilloscopes: there's already an open source project to control scopes over ethernet, so dropping a voice interface on my scope is gonna be on my todo list.

---

# Soldering iron

.fig50l[ ]
.fig30r[ ]

???

There's several other jobs where there just aren't enough hands. Soldering is one: you might have a wire and a soldering iron and some solder, and you're already well and truly out of hands.

You can get soldering irons with foot pedals that feed the solder, and the same for solder paste dispensers and battery terminal spot welders, but personally I find foot pedals hard to work with. The little blighters always go missing, and if they don't, you start getting a cramp eventually. In any case, you often need to tweak the temperature or the feed rate, and having to put down all the tools to twiddle knobs is a real pain.

So I stole another idea from another sci-fi author. In this case it's the novel The Diamond Age by Neal Stephenson. In this story of a world transformed by nanotechnology we see a street thug get a head-mounted gun that has a voice interface. This is exactly the interface I need for my hands-free soldering iron. One hand for the plug, one hand for the wire; tell the soldering iron "dispense five, hut" and it feeds solder and then lowers the iron in one action.

These devices really exist, except they all use boring knobs and foot pedals for the controls.
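
Wiring the voice version up is mostly a lookup from matched phrase to motor action. A sketch, using the solder_commands table from earlier - `feed_mm()` and `lower_iron()` are hypothetical stand-ins for the real motor drivers:

```c
/* Sketch: map matched phrases from the solder_commands table onto
 * feeder and iron actions. feed_mm() and lower_iron() are hypothetical
 * motor-driver calls; feed_amount remembers the most recently spoken
 * quantity so "hut" can act on it.
 */
#include <string.h>

extern void feed_mm(int mm);      /* hypothetical: run the solder feeder */
extern void lower_iron(void);     /* hypothetical: drop the iron */

static int feed_amount = 1;       /* millimetres per "hut" */

void on_solder_command(const char *phrase)
{
    if (!strncmp(phrase, "feed ", 5)) {
        /* "feed one" .. "feed five": remember the quantity */
        static const char *amounts[] = { "one", "two", "three", "four", "five" };
        for (int i = 0; i < 5; i++)
            if (!strcmp(phrase + 5, amounts[i]))
                feed_amount = i + 1;
    } else if (!strcmp(phrase, "hut")) {
        feed_mm(feed_amount);     /* feed, then lower, in one action */
        lower_iron();
    } else if (!strcmp(phrase, "retract")) {
        feed_mm(-feed_amount);
    }
    /* ...and so on for "more feed", "less feed", wire sizes, etc. */
}
```
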
---

.fig70c[ ]

???

One last example -- the magic wand. I started out talking about the nazi wizard books where everybody has a wand that lights up on command. Our own Unexpected Maker from Melbourne sells an esp32 s3 module that's only 11 millimetres wide. I reckon I can fit that in a wand. Or an illuminated magnifying visor, or even just a flashlight.

I spend a lot of time looking into dark places, and a flashlight that I can turn on and off hands-free, but which goes to sleep when it hasn't moved or been spoken to in a while, is actually gonna be really handy. But I'll have to think up some spells that don't come from the nazi wizards.

What spells would you cast?

---
class: vtight

.fig25[ ]

# Peroration*

.footnote[Yes, that's a word. Look it up.]

.nolm[
* what's wrong with the status quo
* the pieces of the puzzle
* offline voice software
* embedded voice tools
* conversational microscope
* voice-activated tools
]

???

* what's wrong with the status quo
* the pieces of the puzzle
* offline voice software
* embedded voice tools
* conversational microscope
* voice-activated tools

---

# Resources, Questions

## Related talks

- [http://christopher.biggs.id.au/#talks](http://christopher.biggs.id.au/#talks)

## HIRE ME

- Mastodon: .blue[@unixbigot@aus.social]
- Email: .blue[christopher@biggs.id.au]
- Accelerando - Innovation Space and IoT Consultants
- https://accelerando.com.au/

???

* Thanks for coming
* I'm for hire, no job too small, fee too large
* Over to you