The Build is Never Broken

Clint Byrum -- @SpamapS on Freenode IRC and Twitter. https://fewbar.com/

Engineering Manager / Software Dev Eng V, GoDaddy
Words are my own; I do not speak for GoDaddy or any project/foundation

My Ansible Story

  • How did you get started with Ansible? -- Reluctantly at first
  • How long have you been using it? -- First tool in toolbox since 2014
  • What's your favorite thing to do when you Ansible? -- Think in Workflow rather than Modeling

#AnsibleFest

Ok but, what is it that you do here?

  • Deploy and operate OpenStack across 4700+ hypervisors (587TB of RAM, 112,000 cores)
  • 4 regions (US west, US east, Singapore, Amsterdam)
  • OpenStack runs production and dev for many GoDaddy.com components
  • OpenStack is also at the core of some GoDaddy hosting products

GoDaddy, circa 2 B.C. (Before Clint)

OpenStack deployed with RPMs and puppet

  • Custom patches applied to OpenStack via RPM
  • Hieradata packaged into RPMs
  • CI limited to static analysis tests
  • Masterless puppet triggered by RPM+Ansible
  • Static dev/test/stage environments

What was broken?

Custom patches applied to OpenStack via RPM

  • RPM build process was dark magic
  • Patches made every upgrade a massive merge fight
  • Lack of dynamic test environment meant slow progress

What was broken?

Hieradata packaged into RPMs

  • SECURITY! -- Sensitive things crept into RPMs, which anyone behind the firewall could download!
  • Lack of dynamic test environment meant slow progress

What was broken?

CI limited to static analysis tests

  • Jenkins is a glorious, magnificent beast -- Easy to feed, hard to tame
  • Entirely missed simple things like "does the REST API actually work?"
  • Lack of dynamic test environment meant slow progress

What was broken?

Masterless puppet triggered by RPM+Ansible

  • Almost nothing! Great solution to scale puppet without rewriting
  • Lack of dynamic test environment meant slow progress

What was broken?

Static dev/test/stage environments

  • Lack of dynamic test environment meant slow progress!
  • Master never known to be working
  • Changes in-flight would either
    • block urgent fixes from landing
    • accidentally go live
    • force ad-hoc improvisation in production

A few solutions

  • Write a patch as best you can
  • Get it through review
  • Deploy artifacts in the static dev environment
  • Test and repeat as necessary

Mad Max blowing gas into engine

  • Live-edit the static dev environment
  • Try to remember everything you edited, and put that in a patch
  • Start the previous process with a little more confidence

Goat herders

  • take ball
  • go home
  • look up "how to become a goat herder" on the internet

But what about Vagrant?

Making a special dev harness means now you have two ways to deploy your code.

Jenkins for great justice?

  • Team has a ton of Jenkins automation, most is post-merge
  • No gating means people can ignore it
  • Also nobody on the team really knows Groovy

Zuul - The Gatekeeper

Zuul is, among other things, cross-repo project gating at a massive scale

OpenStack needed testing

  • Large distributed system made of smaller pieces
  • Functional testing is not enough -- the services are interdependent
  • Zomg! 900+ developers, 40+ orgs, all collaborating!

Zuul was developed at this scale

  • 1500+ active test nodes at once
  • Jenkins was sad -- so we removed it
  • Ansible offers a perfect distributed workflow system for running multi-node tests

Features that you can use

  • Gerrit and GitHub Code Review triggering/commenting/voting
  • Git based configuration
  • Speculative merging for high-velocity gating
  • Custom image building/reuse (via Nodepool)
  • Multi-Region/Multi-Cloud (via Nodepool)
  • Easy multi-node testing
  • Massively Scalable
  • Cross-repo and Cross-source speculative merge gating
  • In-repo secret encryption
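As a rough illustration of the last bullet, a secret can live in the repo as ciphertext and be attached to a job by name. This is a minimal sketch, not GoDaddy's config: the secret name, job name, playbook path, and ciphertext placeholder are all made up for illustration.

```yaml
# Hypothetical secret: the value is encrypted with the project's public key,
# so the ciphertext is safe to commit. PLACEHOLDER_CIPHERTEXT stands in for
# the real base64 output of the encryption step.
- secret:
    name: example_registry_credentials
    data:
      username: ci-bot
      password: !encrypted/pkcs1-oaep
        - PLACEHOLDER_CIPHERTEXT

# Hypothetical job that requests the secret; Zuul decrypts it at run time
# and exposes it to the job's playbook as a variable of the same name.
- job:
    name: example-publish-image
    parent: base
    run: playbooks/publish.yaml
    secrets:
      - example_registry_credentials
```

Only jobs that list the secret receive it, which keeps credentials scoped to the jobs that need them.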

Configuration is YAML, kept in-repo or centralized


- project:
    name: ansible/ansible
    third-party-check:
      jobs:
        - shade-ansible-devel-functional-devstack:
            files:
              - ^lib/ansible/modules/cloud/openstack/.*
              - ^contrib/inventory/openstack.py
              - ^lib/ansible/plugins/inventory/openstack.py
              - ^lib/ansible/module_utils/openstack.py
              - ^lib/ansible/utils/module_docs_fragments/openstack.py
                    

Easy Multi-Node testing


- job:
    parent: base
    name: test-kolla-ansible
    run: kolla_ansible/main.yml
    nodeset:
        nodes:
            - name: meta-api
              label: kolla-centos7
            - name: cell-api
              label: kolla-centos7
            - name: db
              label: kolla-centos7
            - name: mq
              label: kolla-centos7
            - name: hypervisor
              label: kolla-centos7

Easy Multi-Node testing


- name: "Setup docker on servers that will need it."
  hosts: cap,map,hv
  roles:
    - docker

- name: "Setup database on db server."
  hosts: db
  roles:
    - db

- name: "Setup rabbit on mq server."
  hosts: mq
  roles:
    - mq

That's right, it's Webscale

Cross-Repo Dependency Controls

  • Make one change depend on another
  • Build speculative cross-repo change-set and test the changes together without merging them
  • Allows engineers to "fire and forget" on a PR, no merge until it will actually work!
  • Struggles with circular deps; you must unroll them to use the feature
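The controls above could be configured roughly as follows. This is a sketch under assumptions: `github` is assumed to be the connection name from zuul.conf, and the pipeline details, job name, and repo names are hypothetical.

```yaml
# Hypothetical gate pipeline. The "dependent" manager queues approved changes
# and tests each one stacked on top of the changes ahead of it, merging only
# on success.
- pipeline:
    name: gate
    manager: dependent
    trigger:
      github:
        - event: pull_request_review
          action: submitted
          state: approved
    success:
      github:
        status: success
        merge: true
    failure:
      github:
        status: failure

# Hypothetical job that needs two repos checked out together. Zuul prepares
# both with any in-flight changes applied before the job runs.
- job:
    name: example-integration
    parent: base
    required-projects:
      - example-org/service-api
      - example-org/service-worker
```

To make one change depend on another, current Zuul releases read a footer such as `Depends-On: https://github.example.com/example-org/service-worker/pull/42` (a made-up URL) from the PR description.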
But how does GoDaddy use Zuul?

How does GoDaddy use Zuul?

  • New OpenStack deployment/management
  • Legacy Kubernetes (wow, that was fast..)
  • At 09:57 on November 16, 2017, Zuul became self-aware
  • 95% of changes are expressed as changes to 1 repo, as 1 of 3 things: Config, Topology, or rendering of Config+Topology
  • Topology is the model by which we relate services, servers, and configuration
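The slides do not show the topology format, so the following is an invented example of the idea only: a small topology document, followed by the kind of Ansible YAML inventory it might render to. All keys in the first document, the host names, and the addresses are assumptions.

```yaml
# Hypothetical topology input (format invented for illustration):
# which services exist and which servers carry them.
services:
  database:
    servers: [db01]
  message_queue:
    servers: [mq01]
servers:
  db01: {address: 192.0.2.11}
  mq01: {address: 192.0.2.12}
---
# Possible rendering as a standard Ansible YAML inventory:
# one group per service, one host entry per server.
all:
  children:
    database:
      hosts:
        db01:
          ansible_host: 192.0.2.11
    message_queue:
      hosts:
        mq01:
          ansible_host: 192.0.2.12
```

With a mapping like this, a topology change stays a small reviewable diff while the inventory is regenerated rather than hand-edited.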

When an engineer opens or approves a PR...

  • Nodepool allocates VMs in cloud
  • Zuul builds an Ansible inventory with those VMs
  • A Zuul job creates a "fake" topology from the given nodeset and renders it into the Ansible inventory and variables needed for every job run
  • Jobs are given limited creds for external services
  • Job results are reported back via the PR/GitHub API

Yo dawg, I heard you like Ansible, so I wrote a playbook to run your playbook

Zuulception n. Running Ansible with Zuul to test that your Zuul deploys with the Ansible that deploys Zuul.

Rigor Matters

  • Strive to express every change as a git commit
  • Only deploy changes that have at least been through a Zuul test
  • Define jobs that test the components both in isolation and end-to-end
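One way to express these rules in Zuul is a project stanza that runs the same jobs at review time and again at merge time. This is a minimal sketch with hypothetical job names, assuming pipelines named `check` and `gate` exist.

```yaml
- project:
    # check: runs when a PR is opened or updated
    check:
      jobs:
        - example-lint
        - example-role-db            # one component in isolation
        - example-mini-cloud:        # end-to-end multi-node deploy
            dependencies:
              - example-lint         # skip the slow job if lint fails
    # gate: runs after approval; the change merges only if these pass
    gate:
      jobs:
        - example-lint
        - example-role-db
        - example-mini-cloud
```

Listing the same jobs in both pipelines means nothing merges that has not passed the isolated and the end-to-end tests against the current state of the branch.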

GoDaddy has some advantages

  • GitHub Enterprise for change tracking already in place
  • Over 2000 hypervisors in OpenStack private cloud spread across 4 regions
  • OpenStack Engineering team at GoDaddy was already transitioning to Ansible
  • Helps to have a Zuul core dev on staff

This (everything on fire) is fine

  • Deploying a mini-cloud on 5 VMs and running a slew of functional tests on it is slow -- 55 minutes
  • Zuul is still new, team struggles to fix it
  • Zuul's scale makes deployment complicated
  • Auto-Hold system is clunky for users, as it was designed with Zuul-admins in mind

How's it going?

  • Team morale improved greatly
  • Velocity and quality remain high
  • Team spends more time designing and automating, less time testing

How can I try it?

  • https://zuul-ci.org/ -- Docs and Mailing lists here!
  • Come talk to us at the booth!
  • This room, 3:00pm, Paul Belanger and Ricardo Carrillo Cruz have more Zuul!
  • #zuul on Freenode IRC

THANK YOU

Clint Byrum -- @SpamapS on Freenode IRC and Twitter. https://fewbar.com/