name: inverse class: center, middle, inverse layout: true .header[.floatleft[.teal[Christopher Biggs] — DevOps for Dishwashers].floatright[.teal[@unixbigot] .logo[@accelerando_au]]] .footer[.floatleft[.hashtag[NDCSydney] Aug 2017]] --- name: callout class: center, middle, italic layout: true .header[.floatleft[.teal[Christopher Biggs] — DevOps for Dishwashers].floatright[.teal[@unixbigot] .logo[@accelerando_au]]] .footer[.floatleft[.hashtag[NDCSydney] Aug 2017]] --- layout: true .header[.floatleft[.teal[Christopher Biggs] — DevOps for Dishwashers].floatright[.teal[@unixbigot] .logo[@accelerando_au]]] .footer[.floatleft[.hashtag[NDCSydney] Jun 2017]] --- template: inverse # DevOps for Dishwashers ## Bringing grown-up practices to the Internet of Things .bottom.right[ Christopher Biggs, .logo[Accelerando Consulting]
@unixbigot .logo[@accelerando_au] ] --- class: bulletsh4 # Who am I? ## Christopher Biggs — .teal[@unixbigot] — .logo[@accelerando_au] .leftish[ #### Brisbane, Australia #### Former developer, architect, development manager #### Founder, .logo[Accelerando Consulting] #### Full service consultancy - chips to cloud #### ***IoT, DevOps, Big Data*** ] ??? G'day folks, I'm Christopher Biggs. I've been involved off and on with embedded systems since my first professional role over 20 years ago, and nowadays I run a consultancy specialising in the Internet of Things, which is the new name for embedded systems that don't work properly. We work with companies developing IoT devices to help them choose the right technologies and practices to build test and deploy their products. Now I've spoken in the past about the state of security in IoT, and while I agree that it's pretty awful, I'm not pessimistic. I made some predictions about how bad things could get, and what kinds of action would be needed, and the bad and good news is that these are coming true. I'm seeing the space rapidly mature and all the right things are being done to address the challenges. --- layout: true template: callout .crumb[ # Agenda ## Problems ] --- # Why Devops? ??? Last month here in sydney I talked about Internet of Scary Things, at IoT Sydney. That presentation is about how IoT is like the Wild West right now, and why DevOps is a must-do for IoT. Traditionally hardware has been developed carefully and conservatively, and had long lifecycles. IoT brings the contradictory requirement of products that are set in stone, sometimes quite literally, but which connect to the wild Internet where sometimes a day can be an age. So *this* talk is going to be about the *how*. How *you* can apply best practices from grown up computers to the challenges of building and managing internet connected autonomous devices. --- # Why Dishwashers? ??? I've called this talk DevOps for Dishwashers because recently the tale of an internet connected dishwasher made the news. A lot of people laughed because why would a dishwasher need to be on the internet. It took about 10 years for us to go from personal computers being networked only at need, to us considering any computer that isn't networked all the time to be broken. I Expect the same for every other electric appliance. In the long term, I think the very word computer will become archaic, because everything will be a computer. --- # *"Software is eating the world"* ## .right[-- Mark Andreesen] ### Wall St Journal, Six years ago this week. ??? Software, as Mark Andreesen predicted, really truly will have eaten the world. This will have a lot of effects, but the one I want to focus on today is that we must rethink our concept of quality in the embedded space. The price of an amazing future everything can interact with everything else, is that everything might interact with everything else. --- layout: true template: callout .crumb[ # Agenda ## Problems ## DevOps? ] --- template: inverse # Interlude: What do I mean by "DevOps"? ??? Actually, before I go too far, I should define what I mean by DevOps. This is one of those technical terms that has been misused so much that it's dangerous to use it without confirming that the listener hears what you mean. --- .fig40[ ] .spacedown[ ## DevOps is not a ***thing you do***, ## it's the ***way you do things***. ] ??? DevOps is not a thing you do. DevOps engineer is not a job title. There's no such thing as a DevOps team. 
It's an easy mistake to make, in fact I once held the job title Head of Devops. What DevOps **is**, is accepting that development and operations are part of a spectrum, and that distinguishing between them is counterproductive. --- .fig40[ ] .spacedown[ ## *Empower* ***everyone*** ## *to maximise* ***value.*** ] ??? My definition of DevOps is: Evolving a culture and a toolset that empowers everyone to effectively maximise value through radical transparency and extreme agility. --- layout: true template: callout .crumb[ # Agenda ## Problems ## DevOps? ## Solutions ] --- template: inverse # *"When every Thing is connected, everything is connected"* ## .right[-- Me] ??? So today I'm going to look at how we pay the price of a connected future. The body of this presentation is about the practical things you can do to build quality products for the internet of things. I'll look at the lifecycle of an IoT product from inception to retirement, and what you can be doing to get the best outcomes at every stage. The three areas I'm going to look at are firstly choosing platforms that support rather than hinder quality outcomes, secondly shaping your development and quality practices to foster agility while preserving reliability, and thirdly working with your data in ways that promote interoperability and reusability. --- layout: true template: callout .crumb[ # Agenda # Landscape ] --- # Welcome to the Internet of Things ## pop. 10 Trillion ??? Before we dive into all that, I want to talk about the universe of discourse. In just three score years and ten, the span of a single human lifetime, we saw the ratio of computers to people increase by around one order of magnitude per decade. --- # 1936 (Information Pandemic Year Zero) ## 10
<sup>-7</sup> device/person ??? If you were a fresh-faced undergrad when Konrad Zuse was building his Z1 in 1936, you might still be around to celebrate your 100th birthday next year. (Aside: a university washout living in his parents' apartment.) --- # Mainframe era ## 10<sup>-4</sup> device/person ??? Fast-forward thirty years. --- # Minicomputer era ## 10<sup>-2</sup> device/person ??? A decade later, another two orders of magnitude: the inflationary era. Computers have become indispensable, but this is not widely understood. --- # Desktop era ## 10<sup>0</sup> device/person ??? Another decade, another two orders. Spreadsheets create Gordon Gekko. --- # Mobile era ## 10<sup>0.5</sup> devices/person ??? We move out of the inflationary era into diffusion: computing eats into the fabric of our lives. Life pre-smartphone is unimaginable. --- # Cloud era .red[[YOU ARE HERE]] ## ~10<sup>1</sup> devices/person ??? Starting to lose count. Soon we won't even bother. --- # Internet of things ## 10<sup>3</sup>
devices/person ??? That trend is not over. In the next decade I expect another tenfold increase, and at some point after that we're going to stop counting. Everything will be a computer. Every. Thing. Hence, the Internet of Things: a world where humans are a zero point one percent impurity in a living network of machines. If we're wise we'll make friends with the machines when they're babies, because I don't think we can beat them in a fair fight. --- layout: true template: callout .crumb[ # Agenda # Landscape # Challenges ] --- .fig30[ ] .spacedown[ # Solve the next problem, not the last one] ??? During the early twentieth century, one analyst predicted an oncoming apocalypse where civilisation would grind to a halt, because all available single women would be employed as telephone exchange operators, and no further rollout of the telephone network would be possible. --- # Beware of false analogies and straight line trends ??? Of course, we now have robots running the switchboards, and even making and answering some of the calls. Instead of having a sizeable fraction of the workforce operating the communications network, we have millions of people in less developed countries trying to sell those people life insurance over the telephone, because the cost of communicating is now practically zero. --- # Always be Questioning ??? If you wonder how we're going to curate and maintain a thousand devices per person, you're making the same kind of category error. --- # Observe, Orient, Decide, Act ??? Strategy is about adapting your behaviour to circumstances. When a computer cost three months' wages, you shepherded it carefully. When computers cost less than a cup of coffee, they become consumables. They're sheep, not sheepdogs. --- class: bulletsh4 .crumb[ # Landscape # Challenges ## Risks ] # *"Bad people will break your stuff"* .bottom.right[ Do you want to know **more**?
"The Internet of Scary Things" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? I could go on for hours about the security landscape in the internet of things, and in fact I've spoken on the subject in the past. The one-slide precis is that that bad people want to either steal your stuff, or break your stuff. It doesn't have to be many bad people, another side effect of that zero cost of communications is that if you're a terrible person, the 21st century is now a target-rich environment. --- class: bulletsh4 .crumb[ # Landscape # Challenges ## Risks ] # Everything is awful ??? I don't want to go off on a tangent about security in particular, but I find that the underlying causes security problems in IoT are an instructive microcosm of the wider challenges. Last October, the Mirai worm spread around the globe because of well known default passwords that never get changed. This month we learned that CIA has been able to insert spyware wherever they like because some software developers found it convenient to put back-doors into communications hardware. Choosing expedient but fragile programming languages brings conseuqences. The pattern here is a basic lack of professionalism and forethought. --- class: bulletsh4 .crumb[ # Landscape # Challenges ## Risks ] .fig30[ ] .spacedown[ # Everything is awful ## and the awful is on fire] ??? Moving on, in April we got the endlessly entertaining story of the compromised internet dishwasher, which demonstrates how many developers aren't trained well enough to avoid common and completely avoidable pitfalls. The downside of ubiquitous communications is that there are no safe spaces. June's WannaCry malware outbreak showed us that when everything is software, everything has to be maintainable. Last month's second coming of a much more damaging successor hammered the point home. If you can't fix it, its no longer safe to own it. Firewalls and even Air Gaps are insufficient. --- # It's not rocket science ## No really, I mean actual rockets. ??? And a final anti pattern, if you sit in the middle of the road, you'll get run over. Quality procedures that worked well last year might be a liability next. It turns out that the government software quality rules that were applied to election security in 2016 delivered little benefit and some potential detriment becaused they largely focused on concepts that aren't relevant any more. Guidelines for safety critical software prohibit things like exceptions and recursion. This is more about preventing your space probe from overflowing memory and crashing into a planet than it is about software quality on modern embedded platforms. --- .crumb[ # Landscape # Challenges ## Risks ## Desiderata ] .fig50[  ] .spacedown[ # Desiderata] ??? So lets summarise what I think a response to these challenges should look like, then we'll go into detail on each point. --- # Select appropriate tools and platforms ??? First, choose the right platform and tool. --- # Comprehensive identity management ??? Next, make your identity and security part of your framework so that you only build it once. --- # Automate for developer and user convenience ??? Automate everything you possibly can. --- # Testing and testability kept front-of-mind ??? Design for testability. --- # Train, and Audit, and keep doing both ??? Train your teams. Share skills, watch youtube videos, get consultants, go to conferences. Whatever your budget there's a way to improve. --- # Monitor and react (automatically) ??? 
And lastly maintain awareness, and use that awareness to respond rapidly. --- layout: true template: callout .crumb[ # Agenda # Landscape # Challenges # Solutions ## Platforms ] --- template: inverse # Platforms ??? So lets talk about platforms. Jez Humble who literally wrote the book on Continuous Delivery gave a keynote right here in this buidling at Agile Australia in June where he told a story about HP's printer teams. The cost of engineering was spiralling, yet quality was awful, and part of the reason was they were spending over a quarter of their time just rewriting existing software because pretty much every printer model used a different processor. They moved to a common architecture and massively reduced their reengineering cost. That meant that some printers had a more expensive processor than they needed, but the savings far outweighed this. --- # People are more expensive than circuits ## .right[(Sorry, robots)] ??? My point here is that every hardware business has a person whose job is look at every component and ask "can we delete that to save half a cent per unit". This is a person who's empowered to remove customer value. The correct question when designing your hardware is will this hardware platform add value, in terms of resilience, longevity and supporting software quality. --- # Hardware is DevOps too ## .right[(Robots, I hope this makes it up to you :)] ??? That's right, your platform is part of your devops team. Your goal should be to select a platform that supports your devops way of doing things So while selecting and building the hardware is by far the easiest part of the IoT lifecycle, it's the foundation that facilitates the rest. --- # Open and well supported ??? This means that openness and friendliness are the first things you should look for. Select a processor and platform that has good development tools and will be around for a long time. --- # Case study: .blue[**ARM v7**] and .red[**Debian Linux**] ??? Thanks to the drivers of Android phones and TV media boxes, there's a huge variety of ARM systems out there which are similar enough that hardware differences aren't really important. You can develop to the one architecture and select an appropriate board from five bucks up. --- .fig40[ ] .spacedown[ # Meet the #3 top-selling computer of all time ] ??? In my projects we work with Raspberry Pi as the development platform. This brings the convenience of a huge selection of platform tools and a great developer community. We can use these in the lab, for CI workers, and even for the first hundred or so product units. There's a wide selection of compatible lower cost designs to buy or build for mass market production. Here's one that starts at seven bucks. --- # Artisanal free-range small-batch Linux? ## No. ??? The same goes in software, think about what happens if news a nasty bug breaks on the Internet. The top tier vendors probably have a patch out in a day or two. If you picked some hipster linux distribution that has fifty users and one maintainer, you may never see a patch. You want your tools and systems to be a convenience not a source of pain. If you're using a mainstream OS variant, and for IoT that means Redhat or Debian Linux then there's a critical mass of users that acts as a quality filter. There's not likely to be some hidden problem that you can't solve. If you're the only company in the world using some niche distribution, when you run into trouble you're on your own. --- # Without the Internet, it's just a Thing. ??? 
I also think it's important for devices to be online as much as possible. Power and network constraints may mean that devices can't afford to be online continuously, and this is OK, but you do need to think about how to support DevOps processes within these constraints. I'll come back to this later. --- # *"Is there anybody out there?"* .right[-- Pink Floyd
(also, my lighting controller)] ??? Plan to have two way communication with your devices. Forget about one-way communication channels, they may be cheaper but you'll regret it. --- # *"Put the robot back in the ocean, kid."* .right[-- Oceanographers, everwhere] ??? Make sure you have a way to identify and locate every device, even if you never think you need it. I'm talking about putting some kind of display on a device, even if it's only a single LED, and letting devices report their location either by GPS or more coarsely via cellular or wifi estimation. I've heard more than one story from the world of oceanography where probes go missing because people "rescued" them. --- # Management is not a dirty word ??? Use a remote management platform. At the least you need a secure way to upgrade and shutdown a device. I'm recommending using an agent-oriented datacentre automation tool like saltstack, I'll give some examples of what I've done with this tool later on. The right tool will kill two birds with one stone, you can use it construct your master configurations in the lab, and then to manage updates in the field. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ] --- template: inverse # Development ??? OK, so we chose some hardware and an operating system, and installed at least one blinky light. Now we have to make it blink. --- # Aside: Go Serverless ??? If I could give you one piece of information today it would be use serverless architecture. Serverless doesn't mean there aren't servers, it means you don't have to worry about them. I'm putting together a global building management system that doesn't rely on a single fixed server, including the whole CI pipeline. --- # Nice languages are portable, memory-safe and asynchronous. ??? I want you to take a hard look at what programming platforms you use. Programming has entered an era where the ability to be rapidly up and running thanks to an ecosystem of open source parts is revolutionising the craft. One of the most important things to look for is a healthy ecosystem of actively maintained components. Javascript and Go are particularly notable for this in that they have component search engines that help you understand who else is using a component. --- # Case study: .green[Javascript] ??? Javascript rules the world. Its not the language anyone would have chosen to be number one but its there, and its getting better every year. You can literally run javascript from chips to cloud, on tiny two dollar microcontrollers, to cloud services. It's the only language that has common support across Azure, AWS and Google cloud. Javascript is also a great way to write user interfaces, I'm using Elm which is a strongly typed meta language to produce highly reliable and quite pretty management interfaces. Please don't write your own user interface components, theres so many great options out there. I'm looking at you every network router ever. --- .fig40[ ] .spacedown[ # Case Study - .blue[Go]] ??? I'm using Go for my largest project. This is a great fit for container environments because your executables are self-contained. There's built in cross compilation support, so you can work on Windows, Mac or Linux, wherever you're most productive. My CI pipeline can run on stock cloud systems, and spit out ARM native containers. 
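As a rough illustration of how little ceremony that cross-compilation needs, here is a sketch of a GitLab CI job; the job name, binary name and image tag are invented for this example and are not taken from the real pipeline shown later.

```yaml
# Hypothetical GitLab CI job: cross-compile a Go program for ARMv7
# on an ordinary x86 cloud runner, with no emulator or ARM hardware.
crossbuild arm:
  stage: build
  image: golang:1.8        # any recent Go toolchain image will do
  variables:
    GOOS: linux            # target operating system
    GOARCH: arm            # target CPU architecture
    GOARM: "7"             # ARMv7, e.g. Raspberry Pi 2/3
  script:
    - go build -o myapp .  # emits an ARM binary from an x86 worker
  artifacts:
    paths:
      - myapp              # hand the binary to later pipeline stages
```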
Go is a very pragmatic language, although it looks very much like C you'll find working in it more like Javascript, you don't need to worry about memory or pointers hardly ever, and the type system actively helps you to write more reliable code. --- # Naughty languages are like a tightrope over a pit of spikes. ??? I coded in C and suchlike for a very long time, and I don't think there's a good case for widespread use of these tools any more. Its simply far too easy to write brittle or insecure software. Modern languages have learned from these drawbacks and provide a really effective safety net. --- # *'The wifi password is "abc123';cat /etc/passwd#" '* ## Say no to shell scripts. ??? Shell scripts are so tempting, but when you're dealing with malicious users the attention to detail needed takes all the convenience out of it. --- # Use, and reuse, a framework. ## Yours, mine, Google, Amazon, Microsoft, Apple, whatever. ??? Frameworks are important for IoT because there are a lot of foundation behaviours in common between devices. Given the time and budget pressure on low cost devices, you don't want to waste resources on adddressing the same problems over and over, or worse still not address them at all. --- # AWS IoT ??? I'm using Amazon's ecosystem in a number of projects. They've really thought about what IoT means for device management. Major challenges like security, intermittent connections and data collection are solved for you. When scale becomes an issue, you have amazon greengrass, which lets you move from a centralised system where everything talks to the cloud, to a more decentralised architecture you have a buildling level or floor level element which is under the control of your AWS infrastructure. --- class: bulletsh4, tight # Choose your own framework adventure .leftish[ #### Amazon IoT and Greengrass #### Google IoT #### Azure IoT #### Open Connectivity Foundation IoTivity #### Resin.io #### Mongoose-OS ] ??? If you look at the awful usability of many home automation products, you'll see that vendors have implemented basic behaviours badly when they could have gotten that work done for them by adopting a framework. These are some frameworks I like. The great news for us is there's a lot of vendors going hard at this problem. There is a learning curve. You're initially going to tell yourself "I don't need all this complexity, can I just skip to the end", but particularly in the maintenance part of the lifecycle, a good framework will pay your investment. --- # Containment is complexity management ??? I've gone on record as saying that you shouldn't just screw a linux server to the wall, nor ship software that you don't need. So here I am advocating using a common mainstream hardware and software platform, which breaks both those rules. Containment is how I reconcile those constraints. Software containment causes me to limit the interactions between my parts which helps with security and complexity concerns, and the consequent loose coupling between parts lets me reuse components more effectively. --- # You *can* run Docker on a $6.95 linux computer ??? Docker is the containment engine that everyone knows, but there are three or four others now in the Linux space. To be honest I haven't done much more than kick the tyres on some of the others, because Docker is the one that has all the major cloud vendors on board. --- # Use your CI to produce docker images as artifacts ??? 
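One minimal way this might look in GitLab CI, sketched with made-up image and file names and assuming a Docker-in-Docker service is available on your runners (this is not the exact pipeline shown later):

```yaml
# Hypothetical job: build the Docker image in CI and keep the exact
# image bytes as a pipeline artifact, so every later stage and every
# reviewer works with the same thing that will eventually ship.
package image:
  stage: package
  image: docker:latest
  services:
    - docker:dind                      # Docker-in-Docker build service
  script:
    - docker build -t myapp:$CI_COMMIT_SHA .
    - docker save myapp:$CI_COMMIT_SHA | gzip > myapp_image.tar.gz
  artifacts:
    paths:
      - myapp_image.tar.gz             # the reviewable, deployable artifact
```

The payoff: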
* Code reviewers and testers can test the exact artifact * Orchestration tools can push "gold" images to devices --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ] --- template: inverse # Testing ??? Testing is probably the single most critical point for DevOps. The goal we are striving for is more frequent releases, and in order to meet that goal you *have* to reduce the human burden of testing. --- # Total Infrastructure Awareness ## replicate your whole ecosystem ??? The zeroth law here is to actually do testing. To structure your enterprise in a way that provides fast and easy access to test environments. Ideally anyone who wants to test can click a button and have a fresh environment in minutes. --- # No snowflake servers. ## A dev team should have access to disposable instances of everything ??? If it takes days of work to provision an environment, nobody will ever do it, which means some testing will never get done, or your test system will be out of sync with production. --- # DevOps is a disaster, every day ## and that's good ??? * Fast spin up and replacement of services Worse still, if you're not deploying new environments every day you might get that moment where you need to in a hurry and it turns out nobody knows how. I've seen it happen. --- # Painful testing practices beget painfully bad testing ## provide easy test data ??? * Plenty of test-data generators * Making it painful to test will result in painfully bad testing --- # Quick fixes are **good** ## and cheap ??? Actual research data shows that that the sooner you find a bug, the less it costs to fix. If a developer never makes a mistake, it costs nothing to fix, so thats the first place a good toolset pays off. --- # Listen to that annoying hipster tech blogger ??? If you find the problem during coding, that's where pair programming and code review pays off. --- # Test-before-merge ## Never* "break the build" .footnote[well, almost never] ??? If automated testing tools find a problem, then at least you haven't wasted anyone else's time. --- # Fail fast ## Unit tests first, followed by slower end-to-end tests ??? Now you can tell developers, run the tests every time before you push, but in my experience pixies will fly out of my ear before that happens. So whatever central CI you have needs to fail cheap. Do the fast tests first, and the slow tests later. --- # Every pair of eyeballs costs $$$ ## (and no, you can't save $-½ by poking one out) ??? Once a change merges into your code repository the costs multiply. You've consumed the time of the person who found it, and then the developer who fixes it, and then the person who confirms its fixed. If you make it to integration or release testing then you have to go back and redo work and redo more testing, and possibly reschedule your release. --- # Do not poke customers with sticks (either) .bottom.right[ (Wherein Christopher does math to "prove" a point)] ??? And if a problem makes it into the hands of customers we're talking about hundreds of times more cost to fix than if you'd found that problem on day one. It's true comprehensive automated testing does take time and money. Studies show it costs around thirty percent more. But it brings a ninety percent reduction in defects, and so if your automated tests catch only one bug a week that takes the developer an hour to fix, which otherwise would have reached customers, then those tests paid for themselves more than twice over. 
--- # A modern embedded system is faster than a Cray 1 ## But that's still **reaaally slow** ??? Now when you're working in IoT you're probably not running an x86 CPU, so how do you compile and test for various embedded CPUs in your Continuous Integration rig? --- # Cross-platform CI pipelines ## Option zero: cross-platform languages are a win ??? Well, good language choice and containers to the rescue here. If you're working in Go or NodeJS you will find that you can run your code on any platform, and build your distribution artifacts on any cloud system. --- # Cross-platform CI pipelines ## Option one: Emulate ??? Emulators have gotten really good, too. You can spin up an ARM emulator on a PC and boot it up just like a virtual machine. It's remarkably easy. --- # Cross-platform CI pipelines ## Option two: Enrol embedded systems in your CI ??? You do want some real devices, of course. Where you do need to compile or test on your target system, choosing ARM and Linux means that your systems can integrate right into whatever testing systems you're using. I'm using GitLab as it was one of the first systems to build the automated testing tools right into the code repository, but now GitHub and Bitbucket have all that too. So go with whatever you know. --- # Case Study: My Pipelines .fig100[ ] .bottom.right[Minimal requirement: one x86 and one ARM server (e.g. Raspberry Pi)] ??? * Tag jobs that MUST run on ARM as 'pi', then configure ARM workers to only run pi jobs * Only one pi job here --- # Case Study: My Pipelines .fig100[ ] .bottom.right[Stage 1: common policy checks] ??? There's a pre-compilation test phase that runs code quality and static analysis. --- # Case Study: My Pipelines .fig100[ ] .bottom.right[Stage 2: compile and package] ??? * Pipeline builds Docker images for x86 and ARM --- # Case Study: My Pipelines .fig100[ ] .bottom.right[Stage 3: Testing (on target arch)] ??? Here's where our ARM CI worker gets used. All but one of those jobs run on cloud-hosted x86 servers; the exception is the testing that needs to run on actual ARM hardware. For that I have some Raspberry Pi systems which act as workers in the CI farm. --- # Case Study: My Pipelines .fig100[ ] .bottom.right[Stage 4: Deploy to container registry] ??? * Side branches (issue and feature branches) push to dev registry * Master branch pushes to main registry, tagged as staging or live * Auto push to staging registry * Manual push to production registry --- # Case Study: My Pipelines .fig100[ ] ??? You can build all this with as little as two machines, or hundreds if you need them. --- # Case Study: My Pipelines .smallcode[
```yaml
# note the pi build job can be run on x86 because Go is awesome
build for pi:
  stage: build
  script:
    - make installdeps image contents ARCH=pi
  artifacts:
    paths:
      - GPIOpower
      - GPIOpower_docker_pi.tar.gz
      - GPIOpower_contents_pi.tar.gz

#
# Run tests on RasPi
#
test on pi:
  stage: test
  tags:
    - pi
  script:
    - make test

deploy to staging:
  stage: deploy
  environment:
    name: staging
  only:
    - master
  script:
    - make push ARCH=x86 ENVIRONMENT=staging
    - make push ARCH=pi ENVIRONMENT=staging
```
] ??? This is almost the entirety of the configuration for the pipelines. I use worker tags to force certain jobs to run on ARM. I don't like to put too much configuration into my CI system because I think it's important to be able to run tests outside CI too. So most of the testing logic is in the makefile and test scripts. --- # Package separate lab and field artifacts ???
Even with the best of intentions, temporary developer backdoors stick around. So make a sandbox that's safe to put backdoors in. --- # Defeat laziness by making it easier to do the right thing ??? You can't out-lazy programmers. Set up your system so that every CI build results in both a locked-down and a developer-friendly version. You don't ever want to hear "I'll just hack this into the production build and then take it out later". Edith Harbaugh mentioned this morning multiple instances where lab code left in production cost hundreds of millions of dollars, including flags named "do not enable this flag". --- # Regression tests, longitudinal tests ??? * Nightly or better, run a full-coverage regression suite * "Release-ly" or better, run a performance comparison test --- .fig50[] # Dashboards as "live tests" ## Containerise your BI stack and write dashboards alongside code ??? Okay, so you made it to release. Now there's ten thousand instances of your code out there, and they're up a pole, or buried under asphalt, or fired into space. So don't stop testing. When we run our tests in production we call them dashboards. Bond villains are big on dashboards. You know the funny thing is, I set out to find a Bond movie image for this slide, and they all looked rubbish compared to reality. It seems that a generation of nerds has created a future straight out of fiction. That image up there is from Hootsuite's operations centre, and it only shows about a third of the room. But you don't need all that rubbish. Who's looking at those screens? There are ten or more for every person in the room. We need machines to watch our machines. --- # Dashboards learn "green" state and alert on red ## Obligatory reference to Machine Learning .bottom.right[ Do you want to know **more**?
"Continuous Dashboarding" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? Simples approach - red line gauges. Artificial stupidity - ignore normal, what's left? Recent ELK stack release (in June) has ML anomaly detection. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ## Deployment ] --- template: inverse # Deployment ??? All right, so you have some tested code that's been encapsulated in container images, and now you need a device on which to run it. --- .fig50[ ] .spacedown[ # Orchestrate: Never do anything by hand. ] ??? Do not fall into the trap of turning yourself into a robot. Some people say that as a programmer if you ever have to do something more than twice, you should automate it. Well let me save you those two times. If you learn how to use orchestration systems to configure servers, you'll find that pretty soon its quicker to use them even for things that you only have to do once. Also you were wrong about only ever needing to do it once. --- # Build a provisioning workflow ## Customise a clean OS (via ethernet or emulation) ??? Here's something that I've seen over and over. A project starts with developers putting together a quick prototype by starting from a blank OS installation, installing some tools and editing the configuration. Then it's six months later and there's a new version of the OS and nobody remembers how to go from a clean OS to an app ready platform. And thats how many embedded devices end up running eight year old kernels that are riddled with bugs. --- # Robo-configure the target system from a provisioning system ## Then save a filesystem image ??? So, what's a better way. Remote admin of a target machine (saltstack over ssh) Target machine in a VM - eg vagrant Target machine in a container (dockerfile) --- # How do you create a provisioning system? ## Turtles all the way down! .bottom.right[ Do you want to know **more**?
.blue[github.com/unixbigot/kevin] ] ??? You pull yourself up by your bootstraps: SaltStack in masterless ("serverless") mode creates the first master server. --- # Case study - My orchestration scripts .leftish[ 1. Create a read-only recovery partition 1. Install SaltStack orchestration minion
(now switch protocols) 1. Set timezone, locale, etc. 1. Change default passwords 1. Configure network 1. Provision message bus clients 1. Install language runtimes (Node.js, Java, etc.) if needed 1. Configure VPN client 1. Fetch initial application containers ] ??? The first part of my process is converting a vendor image to a product base image. This runs through a bunch of repetitive configuration that nobody wants to do by hand. At the end of that process we join up to the output of the CI pipeline. CI takes source code and turns it into tested Docker images. The orchestration system fetches the tested Docker images and drops them onto a prepared operating system. So we've closed the circle: we have continuous deployment. --- # Hey, that sounds a bit like PaaS ## Yeah, it does. ??? At this point you might be thinking this sounds a lot like what platform-as-a-service offerings, such as OpenShift or Elastic Beanstalk, do. Those systems provide a pre-prepared operating system which you don't have to care about, and you just drop your code into place. Well, you'd be right. There are options for IoT where you get something very similar. --- # Resin.io ## IoT PaaS with Linux and Docker .bottom.right[ Do you want to know **more**?
[https://resin.io/](https://resin.io)
"The Internet of Scary Things" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? One of them is resin.io, which provides a base linux image which connects back to their cloud systems. You push your code to their git servers, and their system compiles it and pushes it down to one or more registered devices. They support a number of devices, and they take requests for which ones to add next. I'm beta testing a new device at the moment. --- # Amazon AWS Greengrass ## IoT PaaS built on AWS IOT + AWS Lambda ??? Amazon Greengrass is another system - if you're familiar with lambda, greengrass is basically on-premises lambda, you use the same mechanisms but instead of deploying to an anonymous container host that you never see, you're deploying to an embedded system that you nominate. --- # Mongoose-OS ## Multiplatform embedded OS with cloud integration and remote upgrade .bottom.right[ Do you want to know **more**?
[http://mongoose-os.com/](http://mongoose-os.com)
"IoT in two Minutes" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)
"Javascript Rules My Life (CampJS 2017)" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? In the microcontroller space I really like another project called Mongoose OS. It offers the same sort of facility without the central cloud repository, you push a zipfile to a device and it verifes and installs the contents. --- # Automate PKI enrolment ## IoT Makes PKI Easy. .right[-er ] ??? So lets zoom in on that "verifies" part. How do you permit secure remote access to your devices. The obvious way is with X.509 certificates as used in the world wide web. Only the public key infrastructure for the world wide web is basically a global protection racket. The hard part of public key cryptography is authenticating the remote party. That's what certification authorities are for in the web. The alternative is two secret agents exchanging briefcases in the park. Well, it turns out that's exactly what we do. --- # Your "secure key distribution channel" is a cardboard box .leftish[ 1. Build and sign a root certificate 1. Upload root cert to the SaltStack master 1. Create minion certificate 1. Install minion certificate on minion 1. Upload a copy to the master 1. All automatically ] ??? We need a tamper-proof channel to get a unique individual secret key from the factory out to our device. Only our device was made in the factory. So we put the key inside on the assembly line, then ship the device. They way I achieve this in my project is again using SaltStack, during final assembly test we assign an identity to a device and provide it with its security certificate. Now we have trustworthy secure communications with a remote device, that allows us to send digitally signed software updates that the device will trust because they're signed with a certificate that it has carried around since birth. --- # Containerised version control ## Use your docker registry the way it was intended ??? Now we have a secure way to get updates out to devices, our next task is to keep track of what versions each device has, and which are available. Well it turns out our container system does this for us. Docker containers have one or more tags, which are intended for version identification. The tag you might be familiar with is 'latest' which is the default tag that's used if you don't supply one. Well we can define a tag for staging, and demo and debug and whatever else we need, and permanent tags for each previous version. Now we know what version every system has, and we have a way to request a prticular version. --- # Use Salt "grains" to define which container to use .right[(i.e. live, staging, dev or other)] .left[ ```yaml fleetvalid-rfid-image: docker_image.present: - name: ((image)) - force: True fleetvalid-rfid: docker_container.running: - image: ((image)) - links: - fleetvalid-aggregator:aggregator - fleetvalid-rfidpower:power - binds: - ((rfid_device)):((rfid_device)) - environment: - RCAGGREGATOR: aggregator:9091 - POWER_API: power:80 ``` ] ??? We are *this* close to continuous delivery right now. The last thing we need is to control which devices get which versions. Well, our orchestration system is also a configuration management system, which has a pair of key value stores, one maintained at the server, and one on the devices. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ## Deployment ## Maintenance ] --- template: inverse # Maintenance ??? OK we made it. We're deployed. But we're not out the woods yet. 
I'm working on devices that are going to be literally set in concrete and expected to keep working for a decade. When they break down you'll need a jackhammer and some earmuffs. --- # Self-care ## FLASH memory longevity tweaks ??? So lets think about flash memory. Each sector can only sustain about 100k write cycles before it becomes an ex-sector. So you don't want to be indiscriminately writing to your disk. Some things you should configure are turning off filesystem access timestamps, and logging to a local ram disk prior to shipping any logs you want to keep to the cloud. --- # Boot to ramdisk ## Sanity check, then proceed to target environment ??? Linux systems have had the ability to boot with an initial ramdisk for a long time, for example when you're running a RAID system, you need somewhere to keep your raid device drivers. You can turn this feature on even in your teensy embedded system so that if your filesystem goes bad due to too many writes you can repair it using the repair utilities kept on the ram disk image. --- # Recovery mode ## (ab)use DHCP ??? But what your concrete encase device goes into a crash loop. Well, I've prepared for this. My installations don't typically use DHCP, but the system does send out an address request packet at boot time, which is usually unanswered. If it does get an answer commanding it to go to recovery mode, it starts a login server and waits for input. But first it uses that conveniently shipped public key certificate to be certain it can trust the command. --- # Liveness monitoring ## If a device goes silent, notify the site custodian ??? So that's some things a device can do to look after itself, what else do we have in our bag of tricks. Well our orchestration system includes a beacon module that transmits some basic performance data, so our mangement system can provide a report of any devices that have gone silent. --- # Sickness monitoring ## Use an audible or visual attention signal (think smoke-alarms) ??? In the event where a device is still alive but has a component failure, we can be a little more helpful and have it call attention to itself. When every lightbulb in a building is self aware, or every smoke alarm or every lamp post, a little bit of help working out just which one is on the fritz comes in very handy. --- # While you were sleeping ## Intermittent connections ??? In the case where devices are not online all the time, how do we manage updates. Well amazon's IoT does a great job of this, where right after a device comes on line it receives an update of all the state changes that were initiated while it was off line. But we can do the same thing with an orchestration system. In saltstack there's a component called the reactor, which responds to defined conditions. So we can trigger actions to occur when a device connects, such as transmitting any pending updates. --- # Kill or Cure ## Feature/component disable ??? Next I want to consider what if there's a problem we can't fix with software. Maybe the the fire alarm is stuck on at 3am. You want to be able to shut down components that require physical repair, rather than have them be a nuisance or a danger. So if you can build feature switches into your software and hardware, you should do so. --- # Kill or Cure ## The Cassini Solution ??? Finally what if a device is out there and you're done supporting it. There's a space probe orbiting saturn right now that has about 20 kilograms of plutonium on board. 
It's running out of fuel, which means it will eventually smack into a moon. Since we think there could be life on some of Saturn's moons, it wouldn't be neighbourly to give them all tentacle cancer, so in a few days it will use the last of its fuel to kamikaze dive into Saturn itself. If you're building devices that could outlive their software support, I encourage you to think about this situation. The recent problems with factories and businesses still using abandoned Windows XP systems show us that if you can't patch it, you should unplug it. One simple way to do this might be to revoke or stop updating their security certificates. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ## Deployment ## Maintenance ## Monitoring ] --- template: inverse # Monitoring (platform data) ??? All right, we're coming to the end. The last thing I want to talk about is the data being emitted from your things. --- # Heartbeats ## SaltStack's presence monitor ??? I mentioned that the orchestration system I use sends a heartbeat to the server. --- # Flatliners ## Detect missing devices .right[(i.e. known to SaltStack but not connected)] ??? I talked about flatliners earlier; we can also collect longitudinal information on availability, which might be of interest where service level agreements are in place. --- # Health stats .leftish[ * SaltStack beacons - CPU/memory, often * Full process list, less often * Network stats ] ??? I also configure my devices to transmit CPU and memory usage stats every 15 minutes, as part of that post-release testing regime. Hopefully, if you release a bad update that leaks memory, you can notice and respond. --- .fig60[ ] # Pour it all into a data lake ## And pretend to be a Bond Villain .bottom.right[ Do you want to know **more**?
"Continuous Dashboarding" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? Right now I'm just dumping all this data into an elastic search database, where we eyeball it from time to time, but there's many other things you can do with this information. --- # Case study: Log pooling for a building safety startup .leftish[ * Ram disk on local ARM devices * Streaming to cloud with Filebeat * Processing with Logstash * Set a storage budget and expire to meet the budget ] ??? One of my clients puts mesh networks into aged care facilities. The sensors talk to a local gateway which has a cellular uplink. The logs from the whole kaboodle funnel back to the cloud, the device keeps the last hour or so in memory in case technicians need it. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ## Deployment ## Maintenance ## Monitoring ## Measurement ] --- template: inverse # Measurement (application data) ??? So that's the data from the system software. We can use all the same mechanisms for the application software, or we can use something bespoke. --- # Use orchestration message bus ## SaltStack message bus is the fast, lightweight ZeroMQ ??? First option is to use the message bus fabric that is maintained by the orchestration software. --- # Can you use your orchestration bus for application events? ## Yes, with care ??? Is this a good idea. I think so, I've done it as a proof of concept but not deployed it yet. * CLI tools to inject messages * Python library --- # Extend orchestration system with custom modules ??? More fancy: * Beacon plugin interface * Engine plugin interface on the server --- # Record as much as you can, digest later ## Shove all your client data in ElasticSearch .bottom.right[Purge oldest indexes until CFO stops whinging] ??? My approach to instrumentation is to overshare. I think its better to have logs and not need them than the reverse. --- # Case study: Saltstack plus ELK .leftish[ * Bridge orchestration bus to application message bus * Engine module at top level master (or intermediate) * Gateway messages to elasticsearch, via logstash * Want MQTT? You already built a PKI to deploy it in 2 minutes ] ??? The project with the subterranean sensors is fairly low volume, so I send the data up the orchestration bus. At the master there's a rule that sends it across to a logstash cluster. --- # Case study: MQTT plus ELK ## "Rapids Rivers Ponds" .leftish[ * MQTT brokers at each site * Broker in the cloud federates with on-site brokers * Logstash MQTT plugin subscribes to all events ] .bottom.right[ Do you want to know **more**?
"Implementing Microservice Architectures" [Fred George, YOW 2014](http://yowconference.com.au/slides/yow2014/George-ImplementingMicroserviceArchitectures.pdf)] ??? For another project, the data volume is so high I wanted a separate channel. The field devices run an MQTT broker, and the cloud hub connects to each of these and receives a firehose of data. Now we have a river of data which we pour into our data lake. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions ## Platforms ## Dev ## QA ## Deployment ## Maintenance ## Monitoring ## Measurement ## Visualisation ] --- template: inverse # Visualisation ??? Okay I'm going to be real brief about visualisation. --- # Real time status ## Liveness, resources, environment ??? We've already looked at system performance data. I won't say any more about that. System administrators already know all about this stuff. --- # Measure your KPIs ## Whatever makes you money, count it ??? What people measure less often is the bottom line. Would you know if whatever makes you money stopped working? Remember the telco called OneTel. Their billing system was not very good about issuing bills. How long would you stay in business with no income? --- # Measure your KPIs ## Set high and low water marks, alert on them ??? Broken is not always as simple as on or off. Often you know that normal is within a certain range. So if things are not normal, you probably want to know. --- # Measure your KPIs ## Pay: Elastic and other vendors have commercial alert engines ??? Data analysis is big business and there's a ton of tools to choose from. --- # Measure your KPIs ## Free: Node-RED makes a good FOSS alerting engine .bottom.right[ Do you want to know **more**?
"Continuous Dashboarding" [christopher.biggs.id.au/talk](http://christopher.biggs.id.au/talk)] ??? If you're interested in rolling your own, have a look at Node Red which is a data flow processing system that's targeted at IoT. I've put a link to more info in these slides. --- # Longitudinal comparisons ## View long-term trends in KPIs ??? Two last things to consider, first the long term trend in your KPIs, are you going bust or boom. --- # Longitudinal comparisons ## Pay attention to device longevity, wear, etc. ??? Finally, are there parts in your devices that wear out. Storage and battery are the two that spring to mind, so you might want to consider tracking some metrics about these. Batteries are a particular area where there's lies, damned lies, and manufacturer capacity claims. --- layout: true template: callout class: bulletsh4 .crumb[ # Landscape # Challenges # Solutions # Coda ## Summary ] --- .fig40[  ] # Summary .left[ #### Lots of devices, too many to administer by hand #### Swimming in a soup of malware and bad actors #### Choose tools that support quality #### Pipelines for automated build/test/stage #### (Ab)use traditional cloud management tools for IoT Fleet #### Message bus all the things #### Big data now, play later ] ??? All right, let's summarise what I've talked about. We've looked at the threat landscape. We've worked through the produt lifecycle from development and testing, to deployment management and retirement. And we've covered some things you can do with the data that comes out the end. If you want to hear more about that last point, come along to YOW Data here in sydney next month, were myself and also a bunch of really smart people will talk about data. --- layout: true template: callout .crumb[ # Landscape # Challenges # Solutions # Coda ## Summary ## Resources ] --- # Resources, Questions .left[ #### My SaltStack rules for IoT - [github.com/unixbigot/kevin](https://github.com/unixbigot/kevin/) #### Related talks - [http://christopher.biggs.id.au/#talks](http://christopher.biggs.id.au/#talks) #### Me - Christopher Biggs - Twitter: .blue[@unixbigot] - Email: .blue[christopher@biggs.id.au] - Slides, and getting my advice: http://christopher.biggs.id.au/ - Accelerando Consulting - IoT, DevOps, Big Data - https://accelerando.com.au/ ] ??? Thanks for your time today, I'm happy to take questions in the few moments remaining and I'm here all week if you want to have a longer chat. Over to you.