name: inverse class: center, middle, inverse layout: true .header[.floatleft[.teal[Christopher Biggs] — DevOps for Dishwashers].floatright[.teal[@unixbigot] .logo[@accelerando_au]]] .footer[.floatleft[.hashtag[NDCSydney] Aug 2017]] --- name: callout class: center, middle, italic layout: true .header[.floatleft[.teal[Christopher Biggs] — DevOps for Dishwashers].floatright[.teal[@unixbigot] .logo[@accelerando_au]]] .footer[.floatleft[.hashtag[NDCSydney] Aug 2017]] --- layout: true .header[.floatleft[.teal[Christopher Biggs] — DevOps for Dishwashers].floatright[.teal[@unixbigot] .logo[@accelerando_au]]] .footer[.floatleft[.hashtag[NDCSydney] Jun 2017]] --- template: inverse # DevOps for Dishwashers ## Bringing grown-up practices to the Internet of Things .bottom.right[ Christopher Biggs, .logo[Accelerando Consulting]
@unixbigot .logo[@accelerando_au] ] --- class: bulletsh4 # Who am I? ## Christopher Biggs — .teal[@unixbigot] — .logo[@accelerando_au] .leftish[ #### Brisbane, Australia #### Former developer, architect, development manager #### Founder, .logo[Accelerando Consulting] #### Full service consultancy - chips to cloud #### ***IoT, DevOps, Big Data*** ] ??? G'day folks, I'm Christopher Biggs. I've been involved off and on with embedded systems since my first professional role over 20 years ago, and nowadays I run a consultancy specialising in the Internet of Things, which is the new name for embedded systems that don't work properly. We work with companies developing IoT devices to help them choose the right technologies and practices to build test and deploy their products. Now I've spoken in the past about the state of security in IoT, and while I agree that it's pretty awful, I'm not pessimistic. I made some predictions about how bad things could get, and what kinds of action would be needed, and the bad and good news is that these are coming true. I'm seeing the space rapidly mature and all the right things are being done to address the challenges. --- layout: true template: callout .crumb[ # Agenda ## Problems ] --- # Why Devops? ??? Last month here in sydney I talked about Internet of Scary Things, at IoT Sydney. That presentation is about how IoT is like the Wild West right now, and why DevOps is a must-do for IoT. Traditionally hardware has been developed carefully and conservatively, and had long lifecycles. IoT brings the contradictory requirement of products that are set in stone, sometimes quite literally, but which connect to the wild Internet where sometimes a day can be an age. So *this* talk is going to be about the *how*. How *you* can apply best practices from grown up computers to the challenges of building and managing internet connected autonomous devices. --- # Why Dishwashers? ??? I've called this talk DevOps for Dishwashers because recently the tale of an internet connected dishwasher made the news. A lot of people laughed because why would a dishwasher need to be on the internet. It took about 10 years for us to go from personal computers being networked only at need, to us considering any computer that isn't networked all the time to be broken. I Expect the same for every other electric appliance. In the long term, I think the very word computer will become archaic, because everything will be a computer. --- # *"Software is eating the world"* ## .right[-- Mark Andreesen] ### Wall St Journal, Six years ago this week. ??? Software, as Mark Andreesen predicted, really truly will have eaten the world. This will have a lot of effects, but the one I want to focus on today is that we must rethink our concept of quality in the embedded space. The price of an amazing future everything can interact with everything else, is that everything might interact with everything else. --- layout: true template: callout .crumb[ # Agenda ## Problems ## DevOps? ] --- template: inverse # Interlude: What do I mean by "DevOps"? ??? Actually, before I go too far, I should define what I mean by DevOps. This is one of those technical terms that has been misused so much that it's dangerous to use it without confirming that the listener hears what you mean. --- .fig40[ ] .spacedown[ ## DevOps is not a ***thing you do***, ## it's the ***way you do things***. ] ??? DevOps is not a thing you do. DevOps engineer is not a job title. There's no such thing as a DevOps team. 
It's an easy mistake to make, in fact I once held the job title Head of Devops. What DevOps **is**, is accepting that development and operations are part of a spectrum, and that distinguishing between them is counterproductive. --- .fig40[ ] .spacedown[ ## *Empower* ***everyone*** ## *to maximise* ***value.*** ] ??? My definition of DevOps is: Evolving a culture and a toolset that empowers everyone to effectively maximise value through radical transparency and extreme agility. --- layout: true template: callout .crumb[ # Agenda ## Problems ## DevOps? ## Solutions ] --- template: inverse # *"When every Thing is connected, everything is connected"* ## .right[-- Me] ??? So today I'm going to look at how we pay the price of a connected future. The body of this presentation is about the practical things you can do to build quality products for the internet of things. I'll look at the lifecycle of an IoT product from inception to retirement, and what you can be doing to get the best outcomes at every stage. The three areas I'm going to look at are firstly choosing platforms that support rather than hinder quality outcomes, secondly shaping your development and quality practices to foster agility while preserving reliability, and thirdly working with your data in ways that promote interoperability and reusability. --- layout: true template: callout .crumb[ # Agenda # Landscape ] --- # Welcome to the Internet of Things ## pop. 10 Trillion ??? Before we dive into all that, I want to talk about the universe of discourse. In just three score years and ten, the span of a single human lifetime, we saw the ratio of computers to people increase by around one order of magnitude per decade. --- # 1936 (Information Pandemic Year Zero) ## 10
<sup>-7</sup> device/person ??? If you were a fresh-faced undergrad when Konrad Zuse was building his Z1 in 1936, you might still be around to celebrate your 100th birthday next year. (Aside: a university washout living in his parents' apartment.) --- # Mainframe era ## 10<sup>-4</sup> device/person ??? Fast-forward thirty years. --- # Minicomputer era ## 10<sup>-2</sup> device/person ??? A decade later, another two orders of magnitude: the inflationary era. Computers have become indispensable, but this is not widely understood. --- # Desktop era ## 10<sup>0</sup> device/person ??? Another decade, another two orders. Spreadsheets create Gordon Gekko. --- # Mobile era ## 10<sup>0.5</sup> devices/person ??? We move out of the inflationary era into diffusion: computing eats into the fabric of our lives. Life pre-smartphone is unimaginable. --- # Cloud era .red[[YOU ARE HERE]] ## ~10<sup>1</sup> devices/person ??? Starting to lose count. Soon we won't even bother. --- # Internet of things ## 10<sup>3</sup>
devices/person ??? That trend is not over. In the next decade I expect another tenfold increase, and at some point after that we're going to stop counting. Everything will be a computer. Every. Thing. Hence, the Internet of Things: a world where humans are a zero point one percent impurity in a living network of machines. If we're wise we'll make friends with the machines when they're babies, because I don't think we can beat them in a fair fight. --- layout: true template: callout .crumb[ # Agenda # Landscape # Challenges ] --- .fig30[ ] .spacedown[ # Solve the next problem, not the last one] ??? During the early twentieth century, one analyst predicted an oncoming apocalypse where civilisation would grind to a halt, because all available single women would be employed as telephone exchange operators, and no further rollout of the telephone network would be possible. --- # Beware of false analogies and straight line trends ??? Of course, we now have robots running the switchboards, and even making and answering some of the calls. Instead of having a sizeable fraction of the workforce operating the communications network, we have millions of people in less developed countries trying to sell those people life insurance over the telephone, because the cost of communicating is now practically zero. --- # Always be Questioning ??? If you wonder how we're going to curate and maintain a thousand devices per person, you're making the same kind of category error. --- # Observe, Orient, Decide, Act ??? Strategy is about adapting your behaviour to circumstances. When a computer cost three months' wages, you shepherded it carefully. When computers cost less than a cup of coffee, they become consumables. They're sheep, not sheepdogs. --- class: bulletsh4 .crumb[ # Landscape # Challenges ## Risks ] # *"Bad people will break your stuff"* .bottom.right[ Do you want to know **more**?
"The Internet of Scary Things" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? I could go on for hours about the security landscape in the internet of things, and in fact I've spoken on the subject in the past. The one-slide precis is that that bad people want to either steal your stuff, or break your stuff. It doesn't have to be many bad people, another side effect of that zero cost of communications is that if you're a terrible person, the 21st century is now a target-rich environment. --- class: bulletsh4 .crumb[ # Landscape # Challenges ## Risks ] # Everything is awful ??? I don't want to go off on a tangent about security in particular, but I find that the underlying causes security problems in IoT are an instructive microcosm of the wider challenges. Last October, the Mirai worm spread around the globe because of well known default passwords that never get changed. This month we learned that CIA has been able to insert spyware wherever they like because some software developers found it convenient to put back-doors into communications hardware. Choosing expedient but fragile programming languages brings conseuqences. The pattern here is a basic lack of professionalism and forethought. --- class: bulletsh4 .crumb[ # Landscape # Challenges ## Risks ] .fig30[ ] .spacedown[ # Everything is awful ## and the awful is on fire] ??? Moving on, in April we got the endlessly entertaining story of the compromised internet dishwasher, which demonstrates how many developers aren't trained well enough to avoid common and completely avoidable pitfalls. The downside of ubiquitous communications is that there are no safe spaces. June's WannaCry malware outbreak showed us that when everything is software, everything has to be maintainable. Last month's second coming of a much more damaging successor hammered the point home. If you can't fix it, its no longer safe to own it. Firewalls and even Air Gaps are insufficient. --- # It's not rocket science ## No really, I mean actual rockets. ??? And a final anti pattern, if you sit in the middle of the road, you'll get run over. Quality procedures that worked well last year might be a liability next. It turns out that the government software quality rules that were applied to election security in 2016 delivered little benefit and some potential detriment becaused they largely focused on concepts that aren't relevant any more. Guidelines for safety critical software prohibit things like exceptions and recursion. This is more about preventing your space probe from overflowing memory and crashing into a planet than it is about software quality on modern embedded platforms. --- .crumb[ # Landscape # Challenges ## Risks ## Desiderata ] .fig50[  ] .spacedown[ # Desiderata] ??? So lets summarise what I think a response to these challenges should look like, then we'll go into detail on each point. --- # Select appropriate tools and platforms ??? First, choose the right platform and tool. --- # Comprehensive identity management ??? Next, make your identity and security part of your framework so that you only build it once. --- # Automate for developer and user convenience ??? Automate everything you possibly can. --- # Testing and testability kept front-of-mind ??? Design for testability. --- # Train, and Audit, and keep doing both ??? Train your teams. Share skills, watch youtube videos, get consultants, go to conferences. Whatever your budget there's a way to improve. --- # Monitor and react (automatically) ??? 
And lastly maintain awareness, and use that awareness to respond rapidly. --- layout: true template: callout .crumb[ # Agenda # Landscape # Challenges # Solutions ## Platforms ] --- template: inverse # Platforms ??? So lets talk about platforms. Jez Humble who literally wrote the book on Continuous Delivery gave a keynote right here in this buidling at Agile Australia in June where he told a story about HP's printer teams. The cost of engineering was spiralling, yet quality was awful, and part of the reason was they were spending over a quarter of their time just rewriting existing software because pretty much every printer model used a different processor. They moved to a common architecture and massively reduced their reengineering cost. That meant that some printers had a more expensive processor than they needed, but the savings far outweighed this. --- # People are more expensive than circuits ## .right[(Sorry, robots)] ??? My point here is that every hardware business has a person whose job is look at every component and ask "can we delete that to save half a cent per unit". This is a person who's empowered to remove customer value. The correct question when designing your hardware is will this hardware platform add value, in terms of resilience, longevity and supporting software quality. --- # Hardware is DevOps too ## .right[(Robots, I hope this makes it up to you :)] ??? That's right, your platform is part of your devops team. Your goal should be to select a platform that supports your devops way of doing things So while selecting and building the hardware is by far the easiest part of the IoT lifecycle, it's the foundation that facilitates the rest. --- # Open and well supported ??? This means that openness and friendliness are the first things you should look for. Select a processor and platform that has good development tools and will be around for a long time. --- # Case study: .blue[**ARM v7**] and .red[**Debian Linux**] ??? Thanks to the drivers of Android phones and TV media boxes, there's a huge variety of ARM systems out there which are similar enough that hardware differences aren't really important. You can develop to the one architecture and select an appropriate board from five bucks up. --- .fig40[ ] .spacedown[ # Meet the #3 top-selling computer of all time ] ??? In my projects we work with Raspberry Pi as the development platform. This brings the convenience of a huge selection of platform tools and a great developer community. We can use these in the lab, for CI workers, and even for the first hundred or so product units. There's a wide selection of compatible lower cost designs to buy or build for mass market production. Here's one that starts at seven bucks. --- # Artisanal free-range small-batch Linux? ## No. ??? The same goes in software, think about what happens if news a nasty bug breaks on the Internet. The top tier vendors probably have a patch out in a day or two. If you picked some hipster linux distribution that has fifty users and one maintainer, you may never see a patch. You want your tools and systems to be a convenience not a source of pain. If you're using a mainstream OS variant, and for IoT that means Redhat or Debian Linux then there's a critical mass of users that acts as a quality filter. There's not likely to be some hidden problem that you can't solve. If you're the only company in the world using some niche distribution, when you run into trouble you're on your own. --- # Without the Internet, it's just a Thing. ??? 
I also think it's important for devices to be online as much as possible. Power and network constraints may mean that devices can't afford to be online continuously, and this is OK, but you do need to think about how to support DevOps processes within these constraints. I'll come back to this later. --- # *"Is there anybody out there?"* .right[-- Pink Floyd
(also, my lighting controller)] ??? Plan to have two way communication with your devices. Forget about one-way communication channels, they may be cheaper but you'll regret it. --- # *"Put the robot back in the ocean, kid."* .right[-- Oceanographers, everwhere] ??? Make sure you have a way to identify and locate every device, even if you never think you need it. I'm talking about putting some kind of display on a device, even if it's only a single LED, and letting devices report their location either by GPS or more coarsely via cellular or wifi estimation. I've heard more than one story from the world of oceanography where probes go missing because people "rescued" them. --- # Management is not a dirty word ??? Use a remote management platform. At the least you need a secure way to upgrade and shutdown a device. I'm recommending using an agent-oriented datacentre automation tool like saltstack, I'll give some examples of what I've done with this tool later on. The right tool will kill two birds with one stone, you can use it construct your master configurations in the lab, and then to manage updates in the field. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ] --- template: inverse # Development ??? OK, so we chose some hardware and an operating system, and installed at least one blinky light. Now we have to make it blink. --- # Aside: Go Serverless ??? If I could give you one piece of information today it would be use serverless architecture. Serverless doesn't mean there aren't servers, it means you don't have to worry about them. I'm putting together a global building management system that doesn't rely on a single fixed server, including the whole CI pipeline. --- # Nice languages are portable, memory-safe and asynchronous. ??? I want you to take a hard look at what programming platforms you use. Programming has entered an era where the ability to be rapidly up and running thanks to an ecosystem of open source parts is revolutionising the craft. One of the most important things to look for is a healthy ecosystem of actively maintained components. Javascript and Go are particularly notable for this in that they have component search engines that help you understand who else is using a component. --- # Case study: .green[Javascript] ??? Javascript rules the world. Its not the language anyone would have chosen to be number one but its there, and its getting better every year. You can literally run javascript from chips to cloud, on tiny two dollar microcontrollers, to cloud services. It's the only language that has common support across Azure, AWS and Google cloud. Javascript is also a great way to write user interfaces, I'm using Elm which is a strongly typed meta language to produce highly reliable and quite pretty management interfaces. Please don't write your own user interface components, theres so many great options out there. I'm looking at you every network router ever. --- .fig40[ ] .spacedown[ # Case Study - .blue[Go]] ??? I'm using Go for my largest project. This is a great fit for container environments because your executables are self-contained. There's built in cross compilation support, so you can work on Windows, Mac or Linux, wherever you're most productive. My CI pipeline can run on stock cloud systems, and spit out ARM native containers. 
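As a rough illustration of how little ceremony that cross-compilation needs, here is a sketch of a GitLab CI job; the job name, binary name and image tag are invented for this example and are not taken from the real pipeline shown later.

```yaml
# Hypothetical GitLab CI job: cross-compile a Go program for ARMv7
# on an ordinary x86 cloud runner, with no emulator or ARM hardware.
crossbuild arm:
  stage: build
  image: golang:1.8        # any recent Go toolchain image will do
  variables:
    GOOS: linux            # target operating system
    GOARCH: arm            # target CPU architecture
    GOARM: "7"             # ARMv7, e.g. Raspberry Pi 2/3
  script:
    - go build -o myapp .  # emits an ARM binary from an x86 worker
  artifacts:
    paths:
      - myapp              # hand the binary to later pipeline stages
```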
Go is a very pragmatic language, although it looks very much like C you'll find working in it more like Javascript, you don't need to worry about memory or pointers hardly ever, and the type system actively helps you to write more reliable code. --- # Naughty languages are like a tightrope over a pit of spikes. ??? I coded in C and suchlike for a very long time, and I don't think there's a good case for widespread use of these tools any more. Its simply far too easy to write brittle or insecure software. Modern languages have learned from these drawbacks and provide a really effective safety net. --- # *'The wifi password is "abc123';cat /etc/passwd#" '* ## Say no to shell scripts. ??? Shell scripts are so tempting, but when you're dealing with malicious users the attention to detail needed takes all the convenience out of it. --- # Use, and reuse, a framework. ## Yours, mine, Google, Amazon, Microsoft, Apple, whatever. ??? Frameworks are important for IoT because there are a lot of foundation behaviours in common between devices. Given the time and budget pressure on low cost devices, you don't want to waste resources on adddressing the same problems over and over, or worse still not address them at all. --- # AWS IoT ??? I'm using Amazon's ecosystem in a number of projects. They've really thought about what IoT means for device management. Major challenges like security, intermittent connections and data collection are solved for you. When scale becomes an issue, you have amazon greengrass, which lets you move from a centralised system where everything talks to the cloud, to a more decentralised architecture you have a buildling level or floor level element which is under the control of your AWS infrastructure. --- class: bulletsh4, tight # Choose your own framework adventure .leftish[ #### Amazon IoT and Greengrass #### Google IoT #### Azure IoT #### Open Connectivity Foundation IoTivity #### Resin.io #### Mongoose-OS ] ??? If you look at the awful usability of many home automation products, you'll see that vendors have implemented basic behaviours badly when they could have gotten that work done for them by adopting a framework. These are some frameworks I like. The great news for us is there's a lot of vendors going hard at this problem. There is a learning curve. You're initially going to tell yourself "I don't need all this complexity, can I just skip to the end", but particularly in the maintenance part of the lifecycle, a good framework will pay your investment. --- # Containment is complexity management ??? I've gone on record as saying that you shouldn't just screw a linux server to the wall, nor ship software that you don't need. So here I am advocating using a common mainstream hardware and software platform, which breaks both those rules. Containment is how I reconcile those constraints. Software containment causes me to limit the interactions between my parts which helps with security and complexity concerns, and the consequent loose coupling between parts lets me reuse components more effectively. --- # You *can* run Docker on a $6.95 linux computer ??? Docker is the containment engine that everyone knows, but there are three or four others now in the Linux space. To be honest I haven't done much more than kick the tyres on some of the others, because Docker is the one that has all the major cloud vendors on board. --- # Use your CI to produce docker images as artifacts ??? 
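One minimal way this might look in GitLab CI, sketched with made-up image and file names and assuming a Docker-in-Docker service is available on your runners (this is not the exact pipeline shown later):

```yaml
# Hypothetical job: build the Docker image in CI and keep the exact
# image bytes as a pipeline artifact, so every later stage and every
# reviewer works with the same thing that will eventually ship.
package image:
  stage: package
  image: docker:latest
  services:
    - docker:dind                      # Docker-in-Docker build service
  script:
    - docker build -t myapp:$CI_COMMIT_SHA .
    - docker save myapp:$CI_COMMIT_SHA | gzip > myapp_image.tar.gz
  artifacts:
    paths:
      - myapp_image.tar.gz             # the reviewable, deployable artifact
```

The payoff: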
* Code reviewers and testers can test the exact artifact * Orchestration tools can push "gold" images to devices --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ] --- template: inverse # Testing ??? Testing is probably the single most critical point for DevOps. The goal we are striving for is more frequent releases, and in order to meet that goal you *have* to reduce the human burden of testing. --- # Total Infrastructure Awareness ## replicate your whole ecosystem ??? The zeroth law here is to actually do testing. To structure your enterprise in a way that provides fast and easy access to test environments. Ideally anyone who wants to test can click a button and have a fresh environment in minutes. --- # No snowflake servers. ## A dev team should have access to disposable instances of everything ??? If it takes days of work to provision an environment, nobody will ever do it, which means some testing will never get done, or your test system will be out of sync with production. --- # DevOps is a disaster, every day ## and that's good ??? * Fast spin up and replacement of services Worse still, if you're not deploying new environments every day you might get that moment where you need to in a hurry and it turns out nobody knows how. I've seen it happen. --- # Painful testing practices beget painfully bad testing ## provide easy test data ??? * Plenty of test-data generators * Making it painful to test will result in painfully bad testing --- # Quick fixes are **good** ## and cheap ??? Actual research data shows that that the sooner you find a bug, the less it costs to fix. If a developer never makes a mistake, it costs nothing to fix, so thats the first place a good toolset pays off. --- # Listen to that annoying hipster tech blogger ??? If you find the problem during coding, that's where pair programming and code review pays off. --- # Test-before-merge ## Never* "break the build" .footnote[well, almost never] ??? If automated testing tools find a problem, then at least you haven't wasted anyone else's time. --- # Fail fast ## Unit tests first, followed by slower end-to-end tests ??? Now you can tell developers, run the tests every time before you push, but in my experience pixies will fly out of my ear before that happens. So whatever central CI you have needs to fail cheap. Do the fast tests first, and the slow tests later. --- # Every pair of eyeballs costs $$$ ## (and no, you can't save $-½ by poking one out) ??? Once a change merges into your code repository the costs multiply. You've consumed the time of the person who found it, and then the developer who fixes it, and then the person who confirms its fixed. If you make it to integration or release testing then you have to go back and redo work and redo more testing, and possibly reschedule your release. --- # Do not poke customers with sticks (either) .bottom.right[ (Wherein Christopher does math to "prove" a point)] ??? And if a problem makes it into the hands of customers we're talking about hundreds of times more cost to fix than if you'd found that problem on day one. It's true comprehensive automated testing does take time and money. Studies show it costs around thirty percent more. But it brings a ninety percent reduction in defects, and so if your automated tests catch only one bug a week that takes the developer an hour to fix, which otherwise would have reached customers, then those tests paid for themselves more than twice over. 
--- # A modern embedded system is faster than a Cray 1 ## But that's still **reaaally slow** ??? Now when you're working in IoT you're probably not running an x86 CPU, so how do you compile and test for various embedded CPUs in your Continuous Integration rig? --- # Cross-platform CI pipelines ## Option zero: cross-platform languages are a win ??? Well, good language choice and containers to the rescue here. If you're working in Go or NodeJS you will find that you can run your code on any platform, and build your distribution artifacts on any cloud system. --- # Cross-platform CI pipelines ## Option one: Emulate ??? Emulators have gotten really good, too. You can spin up an ARM emulator on a PC and boot it up just like a virtual machine. It's remarkably easy. --- # Cross-platform CI pipelines ## Option two: Enrol embedded systems in your CI ??? You do want some real devices, of course. Where you do need to compile or test on your target system, choosing ARM and Linux means that your systems can integrate right into whatever testing systems you're using. I'm using GitLab as it was one of the first systems to build the automated testing tools right into the code repository, but now GitHub and Bitbucket have all that too. So go with whatever you know. --- # Case Study: My Pipelines .fig100[ ] .bottom.right[Minimal requirement: one x86 and one ARM server (e.g. Raspberry Pi)] ??? * Tag jobs that MUST run on ARM as 'pi', then configure ARM workers to only run pi jobs * Only one pi job here --- # Case Study: My Pipelines .fig100[ ] .bottom.right[Stage 1: common policy checks] ??? There's a pre-compilation test phase that runs code quality and static analysis. --- # Case Study: My Pipelines .fig100[ ] .bottom.right[Stage 2: compile and package] ??? * Pipeline builds Docker images for x86 and ARM --- # Case Study: My Pipelines .fig100[ ] .bottom.right[Stage 3: Testing (on target arch)] ??? Here's where our ARM CI worker gets used. All but one of those jobs run on cloud-hosted x86 servers; the exception is the testing that needs to run on actual ARM hardware. For that I have some Raspberry Pi systems which act as workers in the CI farm. --- # Case Study: My Pipelines .fig100[ ] .bottom.right[Stage 4: Deploy to container registry] ??? * Side branches (issue and feature branches) push to dev registry * Master branch pushes to main registry, tagged as staging or live * Auto push to staging registry * Manual push to production registry --- # Case Study: My Pipelines .fig100[ ] ??? You can build all this with as little as two machines, or hundreds if you need them. --- # Case Study: My Pipelines .smallcode[
```yaml
# note the pi build job can be run on x86 because Go is awesome
build for pi:
  stage: build
  script:
    - make installdeps image contents ARCH=pi
  artifacts:
    paths:
      - GPIOpower
      - GPIOpower_docker_pi.tar.gz
      - GPIOpower_contents_pi.tar.gz

#
# Run tests on RasPi
#
test on pi:
  stage: test
  tags:
    - pi
  script:
    - make test

deploy to staging:
  stage: deploy
  environment:
    name: staging
  only:
    - master
  script:
    - make push ARCH=x86 ENVIRONMENT=staging
    - make push ARCH=pi ENVIRONMENT=staging
```
] ??? This is almost the entirety of the configuration for the pipelines. I use worker tags to force certain jobs to run on ARM. I don't like to put too much configuration into my CI system because I think it's important to be able to run tests outside CI too. So most of the testing logic is in the makefile and test scripts. --- # Package separate lab and field artifacts ???
Even with the best of intentions, temporary developer backdoors stick around. So make a sandbox that's safe to put backdoors in. --- # Defeat laziness by making it easier to do the right thing ??? You can't out-lazy programmers. Set up your system so that every CI build results in both a locked-down and a developer-friendly version. You don't ever want to hear "I'll just hack this into the production build and then take it out later". Edith Harbaugh mentioned this morning multiple instances where lab code left in production cost hundreds of millions of dollars, including flags named "do not enable this flag". --- # Regression tests, longitudinal tests ??? * Nightly or better, run a full-coverage regression suite * "Release-ly" or better, run a performance comparison test --- .fig50[] # Dashboards as "live tests" ## Containerise your BI stack and write dashboards alongside code ??? Okay, so you made it to release. Now there's ten thousand instances of your code out there, and they're up a pole, or buried under asphalt, or fired into space. So don't stop testing. When we run our tests in production we call them dashboards. Bond villains are big on dashboards. You know the funny thing is, I set out to find a Bond movie image for this slide, and they all looked rubbish compared to reality. It seems that a generation of nerds has created a future straight out of fiction. That image up there is from Hootsuite's operations centre, and it only shows about a third of the room. But you don't need all that rubbish. Who's looking at those screens? There are ten or more for every person in the room. We need machines to watch our machines. --- # Dashboards learn "green" state and alert on red ## Obligatory reference to Machine Learning .bottom.right[ Do you want to know **more**?
"Continuous Dashboarding" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? Simples approach - red line gauges. Artificial stupidity - ignore normal, what's left? Recent ELK stack release (in June) has ML anomaly detection. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ## Deployment ] --- template: inverse # Deployment ??? All right, so you have some tested code that's been encapsulated in container images, and now you need a device on which to run it. --- .fig50[ ] .spacedown[ # Orchestrate: Never do anything by hand. ] ??? Do not fall into the trap of turning yourself into a robot. Some people say that as a programmer if you ever have to do something more than twice, you should automate it. Well let me save you those two times. If you learn how to use orchestration systems to configure servers, you'll find that pretty soon its quicker to use them even for things that you only have to do once. Also you were wrong about only ever needing to do it once. --- # Build a provisioning workflow ## Customise a clean OS (via ethernet or emulation) ??? Here's something that I've seen over and over. A project starts with developers putting together a quick prototype by starting from a blank OS installation, installing some tools and editing the configuration. Then it's six months later and there's a new version of the OS and nobody remembers how to go from a clean OS to an app ready platform. And thats how many embedded devices end up running eight year old kernels that are riddled with bugs. --- # Robo-configure the target system from a provisioning system ## Then save a filesystem image ??? So, what's a better way. Remote admin of a target machine (saltstack over ssh) Target machine in a VM - eg vagrant Target machine in a container (dockerfile) --- # How do you create a provisioning system? ## Turtles all the way down! .bottom.right[ Do you want to know **more**?
.blue[github.com/unixbigot/kevin] ] ??? You pull yourself up by your bootstraps: SaltStack in masterless ("serverless") mode creates the first master server. --- # Case study - My orchestration scripts .leftish[ 1. Create a read-only recovery partition 1. Install SaltStack orchestration minion
(now switch protocols) 1. Set timezone, locale, etc. 1. Change default passwords 1. Configure network 1. Provision message bus clients 1. Install language runtimes (Node.js, Java, etc.) if needed 1. Configure VPN client 1. Fetch initial application containers ] ??? The first part of my process is converting a vendor image to a product base image. This runs through a bunch of repetitive configuration that nobody wants to do by hand. At the end of that process we join up to the output of the CI pipeline. CI takes source code and turns it into tested Docker images. The orchestration system fetches the tested Docker images and drops them onto a prepared operating system. So we've closed the circle: we have continuous deployment. --- # Hey, that sounds a bit like PaaS ## Yeah, it does. ??? At this point you might be thinking this sounds a lot like what platform-as-a-service offerings, such as OpenShift or Elastic Beanstalk, do. Those systems provide a pre-prepared operating system which you don't have to care about, and you just drop your code into place. Well, you'd be right. There are options for IoT where you get something very similar. --- # Resin.io ## IoT PaaS with Linux and Docker .bottom.right[ Do you want to know **more**?
[https://resin.io/](https://resin.io)
"The Internet of Scary Things" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? One of them is resin.io, which provides a base linux image which connects back to their cloud systems. You push your code to their git servers, and their system compiles it and pushes it down to one or more registered devices. They support a number of devices, and they take requests for which ones to add next. I'm beta testing a new device at the moment. --- # Amazon AWS Greengrass ## IoT PaaS built on AWS IOT + AWS Lambda ??? Amazon Greengrass is another system - if you're familiar with lambda, greengrass is basically on-premises lambda, you use the same mechanisms but instead of deploying to an anonymous container host that you never see, you're deploying to an embedded system that you nominate. --- # Mongoose-OS ## Multiplatform embedded OS with cloud integration and remote upgrade .bottom.right[ Do you want to know **more**?
[http://mongoose-os.com/](http://mongoose-os.com)
"IoT in two Minutes" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)
"Javascript Rules My Life (CampJS 2017)" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? In the microcontroller space I really like another project called Mongoose OS. It offers the same sort of facility without the central cloud repository, you push a zipfile to a device and it verifes and installs the contents. --- # Automate PKI enrolment ## IoT Makes PKI Easy. .right[-er ] ??? So lets zoom in on that "verifies" part. How do you permit secure remote access to your devices. The obvious way is with X.509 certificates as used in the world wide web. Only the public key infrastructure for the world wide web is basically a global protection racket. The hard part of public key cryptography is authenticating the remote party. That's what certification authorities are for in the web. The alternative is two secret agents exchanging briefcases in the park. Well, it turns out that's exactly what we do. --- # Your "secure key distribution channel" is a cardboard box .leftish[ 1. Build and sign a root certificate 1. Upload root cert to the SaltStack master 1. Create minion certificate 1. Install minion certificate on minion 1. Upload a copy to the master 1. All automatically ] ??? We need a tamper-proof channel to get a unique individual secret key from the factory out to our device. Only our device was made in the factory. So we put the key inside on the assembly line, then ship the device. They way I achieve this in my project is again using SaltStack, during final assembly test we assign an identity to a device and provide it with its security certificate. Now we have trustworthy secure communications with a remote device, that allows us to send digitally signed software updates that the device will trust because they're signed with a certificate that it has carried around since birth. --- # Containerised version control ## Use your docker registry the way it was intended ??? Now we have a secure way to get updates out to devices, our next task is to keep track of what versions each device has, and which are available. Well it turns out our container system does this for us. Docker containers have one or more tags, which are intended for version identification. The tag you might be familiar with is 'latest' which is the default tag that's used if you don't supply one. Well we can define a tag for staging, and demo and debug and whatever else we need, and permanent tags for each previous version. Now we know what version every system has, and we have a way to request a prticular version. --- # Use Salt "grains" to define which container to use .right[(i.e. live, staging, dev or other)] .left[ ```yaml fleetvalid-rfid-image: docker_image.present: - name: ((image)) - force: True fleetvalid-rfid: docker_container.running: - image: ((image)) - links: - fleetvalid-aggregator:aggregator - fleetvalid-rfidpower:power - binds: - ((rfid_device)):((rfid_device)) - environment: - RCAGGREGATOR: aggregator:9091 - POWER_API: power:80 ``` ] ??? We are *this* close to continuous delivery right now. The last thing we need is to control which devices get which versions. Well, our orchestration system is also a configuration management system, which has a pair of key value stores, one maintained at the server, and one on the devices. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ## Deployment ## Maintenance ] --- template: inverse # Maintenance ??? OK we made it. We're deployed. But we're not out the woods yet. 
I'm working on devices that are going to be literally set in concrete and expected to keep working for a decade. When they break down you'll need a jackhammer and some earmuffs. --- # Self-care ## FLASH memory longevity tweaks ??? So lets think about flash memory. Each sector can only sustain about 100k write cycles before it becomes an ex-sector. So you don't want to be indiscriminately writing to your disk. Some things you should configure are turning off filesystem access timestamps, and logging to a local ram disk prior to shipping any logs you want to keep to the cloud. --- # Boot to ramdisk ## Sanity check, then proceed to target environment ??? Linux systems have had the ability to boot with an initial ramdisk for a long time, for example when you're running a RAID system, you need somewhere to keep your raid device drivers. You can turn this feature on even in your teensy embedded system so that if your filesystem goes bad due to too many writes you can repair it using the repair utilities kept on the ram disk image. --- # Recovery mode ## (ab)use DHCP ??? But what your concrete encase device goes into a crash loop. Well, I've prepared for this. My installations don't typically use DHCP, but the system does send out an address request packet at boot time, which is usually unanswered. If it does get an answer commanding it to go to recovery mode, it starts a login server and waits for input. But first it uses that conveniently shipped public key certificate to be certain it can trust the command. --- # Liveness monitoring ## If a device goes silent, notify the site custodian ??? So that's some things a device can do to look after itself, what else do we have in our bag of tricks. Well our orchestration system includes a beacon module that transmits some basic performance data, so our mangement system can provide a report of any devices that have gone silent. --- # Sickness monitoring ## Use an audible or visual attention signal (think smoke-alarms) ??? In the event where a device is still alive but has a component failure, we can be a little more helpful and have it call attention to itself. When every lightbulb in a building is self aware, or every smoke alarm or every lamp post, a little bit of help working out just which one is on the fritz comes in very handy. --- # While you were sleeping ## Intermittent connections ??? In the case where devices are not online all the time, how do we manage updates. Well amazon's IoT does a great job of this, where right after a device comes on line it receives an update of all the state changes that were initiated while it was off line. But we can do the same thing with an orchestration system. In saltstack there's a component called the reactor, which responds to defined conditions. So we can trigger actions to occur when a device connects, such as transmitting any pending updates. --- # Kill or Cure ## Feature/component disable ??? Next I want to consider what if there's a problem we can't fix with software. Maybe the the fire alarm is stuck on at 3am. You want to be able to shut down components that require physical repair, rather than have them be a nuisance or a danger. So if you can build feature switches into your software and hardware, you should do so. --- # Kill or Cure ## The Cassini Solution ??? Finally what if a device is out there and you're done supporting it. There's a space probe orbiting saturn right now that has about 20 kilograms of plutonium on board. 
It's running out of fuel, which means it will eventually smack into a moon. Since we think there could be life on some of Saturn's moons, it wouldn't be neighbourly to give them all tentacle cancer, so in a few days it will use the last of its fuel to kamikaze dive into Saturn itself. If you're building devices that could outlive their software support, I encourage you to think about this situation. The recent problems with factories and businesses still using abandoned Windows XP systems show us that if you can't patch it, you should unplug it. One simple way to do this might be to revoke or stop updating their security certificates. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ## Deployment ## Maintenance ## Monitoring ] --- template: inverse # Monitoring (platform data) ??? All right, we're coming to the end. The last thing I want to talk about is the data being emitted from your things. --- # Heartbeats ## SaltStack's presence monitor ??? I mentioned that the orchestration system I use sends a heartbeat to the server. --- # Flatliners ## Detect missing devices .right[(i.e. known to SaltStack but not connected)] ??? I talked about flatliners earlier; we can also collect longitudinal information on availability, which might be of interest where service level agreements are in place. --- # Health stats .leftish[ * SaltStack beacons - CPU/memory, often * Full process list, less often * Network stats ] ??? I also configure my devices to transmit CPU and memory usage stats every 15 minutes, as part of that post-release testing regime. Hopefully, if you release a bad update that leaks memory, you can notice and respond. --- .fig60[ ] # Pour it all into a data lake ## And pretend to be a Bond Villain .bottom.right[ Do you want to know **more**?
"Continuous Dashboarding" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? Right now I'm just dumping all this data into an elastic search database, where we eyeball it from time to time, but there's many other things you can do with this information. --- # Case study: Log pooling for a building safety startup .leftish[ * Ram disk on local ARM devices * Streaming to cloud with Filebeat * Processing with Logstash * Set a storage budget and expire to meet the budget ] ??? One of my clients puts mesh networks into aged care facilities. The sensors talk to a local gateway which has a cellular uplink. The logs from the whole kaboodle funnel back to the cloud, the device keeps the last hour or so in memory in case technicians need it. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ## Deployment ## Maintenance ## Monitoring ## Measurement ] --- template: inverse # Measurement (application data) ??? So that's the data from the system software. We can use all the same mechanisms for the application software, or we can use something bespoke. --- # Use orchestration message bus ## SaltStack message bus is the fast, lightweight ZeroMQ ??? First option is to use the message bus fabric that is maintained by the orchestration software. --- # Can you use your orchestration bus for application events? ## Yes, with care ??? Is this a good idea. I think so, I've done it as a proof of concept but not deployed it yet. * CLI tools to inject messages * Python library --- # Extend orchestration system with custom modules ??? More fancy: * Beacon plugin interface * Engine plugin interface on the server --- # Record as much as you can, digest later ## Shove all your client data in ElasticSearch .bottom.right[Purge oldest indexes until CFO stops whinging] ??? My approach to instrumentation is to overshare. I think its better to have logs and not need them than the reverse. --- # Case study: Saltstack plus ELK .leftish[ * Bridge orchestration bus to application message bus * Engine module at top level master (or intermediate) * Gateway messages to elasticsearch, via logstash * Want MQTT? You already built a PKI to deploy it in 2 minutes ] ??? The project with the subterranean sensors is fairly low volume, so I send the data up the orchestration bus. At the master there's a rule that sends it across to a logstash cluster. --- # Case study: MQTT plus ELK ## "Rapids Rivers Ponds" .leftish[ * MQTT brokers at each site * Broker in the cloud federates with on-site brokers * Logstash MQTT plugin subscribes to all events ] .bottom.right[ Do you want to know **more**?
"Implementing Microservice Architectures" [Fred George, YOW 2014](http://yowconference.com.au/slides/yow2014/George-ImplementingMicroserviceArchitectures.pdf)] ??? For another project, the data volume is so high I wanted a separate channel. The field devices run an MQTT broker, and the cloud hub connects to each of these and receives a firehose of data. Now we have a river of data which we pour into our data lake. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ## Deployment ## Maintenance ## Monitoring ## Measurement ## Visualisation ] --- template: inverse # Visualisation ??? Okay I'm going to be real brief about visualisation. --- # Real time status ## Liveness, resources, environment ??? We've already looked at system performance data. I won't say any more about that. System administrators already know all about this stuff. --- # Measure your KPIs ## Whatever makes you money, count it ??? What people measure less often is the bottom line. Would you know if whatever makes you money stopped working? Remember the telco called OneTel. Their billing system was not very good about issuing bills. How long would you stay in business with no income? --- # Measure your KPIs ## Set high and low water marks, alert on them ??? Broken is not always as simple as on or off. Often you know that normal is within a certain range. So if things are not normal, you probably want to know. --- # Measure your KPIs ## Pay: Elastic and other vendors have commercial alert engines ??? Data analysis is big business and there's a ton of tools to choose from. --- # Measure your KPIs ## Free: Node-RED makes a good FOSS alerting engine .bottom.right[ Do you want to know **more**?
"Continuous Dashboarding" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? If you're interested in rolling your own, have a look at Node Red which is a data flow processing system that's targeted at IoT. I've put a link to more info in these slides. --- # Longitudinal comparisons ## View long-term trends in KPIs ??? Two last things to consider, first the long term trend in your KPIs, are you going bust or boom. --- # Longitudinal comparisons ## Pay attention to device longevity, wear, etc. ??? Finally, are there parts in your devices that wear out. Storage and battery are the two that spring to mind, so you might want to consider tracking some metrics about these. Batteries are a particular area where there's lies, damned lies, and manufacturer capacity claims. --- layout: true template: callout class: bulletsh4 .crumb[ # Landscape # Challenges # Solutions # Coda ## Summary ] --- .fig40[  ] # Summary .left[ #### Lots of devices, too many to administer by hand #### Swimming in a soup of malware and bad actors #### Choose tools that support quality #### Pipelines for automated build/test/stage #### (Ab)use traditional cloud management tools for IoT Fleet #### Message bus all the things #### Big data now, play later ] ??? All right, let's summarise what I've talked about. We've looked at the threat landscape. We've worked through the produt lifecycle from development and testing, to deployment management and retirement. And we've covered some things you can do with the data that comes out the end. If you want to hear more about that last point, come along to YOW Data here in sydney next month, were myself and also a bunch of really smart people will talk about data. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions # Coda ## Summary ## Resources ] --- # Resources, Questions .left[ #### My SaltStack rules for IoT - [github.com/unixbigot/kevin](https://github.com/unixbigot/kevin/) #### Related talks - [http://christopher.biggs.id.au/#talks](http://christopher.biggs.id.au/#talks) #### Me - Christopher Biggs - Twitter: .blue[@unixbigot] - Email: .blue[christopher@biggs.id.au] - Slides, and getting my advice: http://christopher.biggs.id.au/ - Accelerando Consulting - IoT, DevOps, Big Data - https://accelerando.com.au/ ] ??? Thanks for your time today, I'm happy to take questions in the few moments remaining and I'm here all week if you want to have a longer chat. Over to you.