FewBar.com - Make It Good This is the personal website of Clint Byrum. Any words on here are expressly Clint’s, and not those of whatever employer he has at the moment. Copyright (c) Clint Byrum, 2020 http://fewbar.com/ Wed, 18 Oct 2023 17:56:00 +0000 Clean Out Your Services’ Rain Gutters <p>On New Year’s Eve 2022, while many were watching the ball drop in NYC, or fireworks in Las Vegas, I found myself outside in the rain, holding a screw gun and cursing like a sailor, watching my pool fill with silt-laden rain runoff.</p> <p>You see, this New Year’s Eve, we received about three inches of rain here in Los Angeles.</p> <p>To those of you living in more normal climates, this probably sounds pretty mild for December 31. But to those of us who have been living in Los Angeles for the past decade plus, this was unusual, to say the least. We are in the midst of a massive drought, almost a decade long now, and rain beyond the occasional sprinkle, even in the winter, just doesn’t happen that often.</p> <p>So, why did I go out in this rain? Well, our house was built in 1961. It has a pitched roof and rain gutters, with ample drains, but nobody had really looked at those gutters or drains in a long, long time. We have lived here 9 years and had only really looked at them once, when we first moved in: we cleared the leaves out and went about our business enjoying Los Angeles’s notoriously perfect weather.</p> <p>But the reality is, even with great weather, drains still need maintenance. Several of the gutters had filled with silt and detritus from the roof. Heck, we even saw an ivy plant grow out of one corner.</p> <p>And on NYE, one of the downspouts was clearly just not draining at all. The corner of it overhangs the pool, and we could hear it dumping into the pool like a hose had been turned on.</p> <p>There was concern that the pool might overflow with this added in-flow, and a flooding pool is a flooded house, so, at 10pm, in the rain, this part-time handyman decided it was time to check it out. This could be bad. What if the drains are clogged elsewhere?</p> <p>So I grabbed my trusty drill-driven drain snake, dressed in waterproof clothing, and stomped out into the rainy night.</p> <p>I stuck my snake in, and a big clump of dirt gave way immediately. Water started rushing down the downspout. Hooray! I moved on to the other downspouts and they were all clear.</p> <p>But the splashing resumed a minute later. It was as if it had never been unclogged.</p> <p>In went the drain snake again, and this time, it went 3 feet in. I jiggled it and pushed and it was stopped, so I began running the drill. It freed a few more feet down, twisting through a few curves. More water rushed through, but then a repeat: it backed up as I was removing the snake.</p> <p>I shoved it back in and ran the drill at full speed, pushing and pulling, trying to remove the clog. It was bigger, and more stubborn. I knew in my head that if it’s really thick, more than a few inches, this snake will never release it. But I pressed on. It’s raining. I’m out here. It’s NYE and plumbers are gonna charge a fortune. Just Fucking Do It.</p> <p>The drill strained. I pushed too hard, and I failed to notice the wire coil inside the snake starting to bind up. That should have triggered me to reverse, to ease the tension on the spring inside the snake, but I was too engrossed in pushing.
In fact, this snake works better with two people, one to watch the coil and one to push and pull, but my wife was inside watching Miley and Dolly sing, and I was almost done, so why call for help?</p> <p>The snake twisted in on itself, and the wire pulled out, ruining the snake, and making it, all 8 feet that were in the drain pipe, as rigid as a steel bar. It was stuck. So now, we have a clogged drain, <em>and</em> a broken snake.</p> <p>Fuck it. I detached the drill, packed up my stuff, and hoped that what I had cleared would at least be enough as the rain slowed later, and that the pool wouldn’t flood after all.</p> <p>Fast forward to January 2: the rain had stopped, and I dismantled the drain to remove the broken snake.</p> <p>Inside the drain, I found six feet of what looked like 30-year-old silt and roots, heavily compacted. Water drained through it at about 1mm per minute. I spent 2 hours dismantling and clearing the downspout, and then turned to the drain. The drain was also compacted, who knows how far. This is a job for professionals, and they’ll be here later in the week to finish the job.</p> <p>This story reminds me of being on call for technical operations. Something alerts, and you open it up, employing your usual playbook of seemingly related remedies. But this time, it doesn’t work. In fact, it makes things worse. You aren’t an expert, just a reasonably capable person with access to tools and pressure to succeed. You may be in over your head, but escalation has psychological and maybe financial barriers, so you keep working on it, thinking you can do it, making things even worse.</p> <p>So, make sure you clean out your services’ rain gutters. Even if there’s a drought, even when you aren’t expecting trouble, just remember why you have those systems. Fake an incident, do a table-top exercise, maybe make sure your backups are actually working. Whatever it is, allocate some time to it while the sun is shining. Because you’re much more likely to break something if you wait until a massive rain storm.</p> <p>And also, remember that when there’s extra pressure, that’s the most important time to escalate. It’s raining, it’s a special night, you don’t want to be fixing stuff, and you know others don’t either. But escalate anyway. Call the other senior engineers. Call the plumber. Get the experts on the phone and make sure you don’t turn one emergency into two.</p> Mon, 02 Jan 2023 00:00:00 +0000 http://fewbar.com/2023/01/rain-gutters-in-a-drought/ maintenance oncall ops incidents Technology Gearman: Misunderstood wrangler of herds <p>For a few years now, I have maintained <a href="http://gearman.org/">Gearman</a>.</p> <p>Upon taking over maintainership, I just wanted to get some bug fixes out the door and revitalize the community, which had grown stagnant after the former maintainers were unable to keep things up to date.</p> <p>Since then we’ve shipped a few releases, updated for new compilers, and kind of reached a stable place.</p> <p>But one troubling trend I’ve seen is that folks don’t seem to use Gearman for what it’s actually good at. Many are using Gearman for what any messaging or queueing protocol can do.</p> <p>See, the <a href="http://gearman.org/protocol/">Gearman protocol</a> was conceived of by the same folks that thought up the <a href="https://github.com/memcached/memcached/blob/master/doc/protocol.txt">Memcache protocol</a>. The similarities are quite obvious.</p> <p>That’s no coincidence.
The original design of memcached quickly met with a troubling problem: the Thundering Herd. This is the situation that occurs when one relies solely on caching to reduce expensive operations. Upon expiry, eviction, or genesis of a popular cache entry, you will have many simultaneous requests for the result, and you must manage the number of expensive threads spawned in response.</p> <p><img src="/images/thundering-herd.png" alt="thundering herd diagram" /></p> <p>There are a few ways to solve this problem. The most common one you will find is “soft expiry”; you can see implementations such as <a href="https://metacpan.org/pod/Cache::Memcached::Turnstile">turnstile</a> and <a href="https://pypi.org/project/dogpile.cache/">dogpile</a>. This is where one stores a soft expiry time in the cache item. When your caching code sees a soft-expired item, it can spawn an async job to refresh it, but still serve the old value to the user. <a href="https://pypi.org/project/dogpile.cache/">dogpile</a> even implements a lock so that only one thread actually runs to regenerate. There are a few relatively simple ways to make this happen, and a job queue is only one of them.</p> <p>This solves happy-path expiry pretty well; I’ve used it myself. However, it still has trouble with a bunch of normal things, like new-but-popular items, and loss of memcache servers. Basically, if you can’t find even a soft-expired entry in the cache, then every user has to wait while your requests run, and then they must re-read from cache, or somehow get the answer back from the job.</p> <p>This isn’t so far-fetched. Memcached uses LRU eviction. This means that popular items stay in cache until their TTL, and less popular items are evicted when space is needed, even before their hard-expiry time. If you’re running memcached at full capacity, it should be evicting the least popular things. But if you have a very even distribution of popularity, you’ll be evicting things that are just moderately popular. Also, many systems will make a new entry popular immediately, thus needing a real way to coordinate the requests to it before it has even been cached in all the places it needs to be.</p> <p>The gearman authors came up with a really cool solution for this, but it’s oft-overlooked. In fact, the original version of the protocol docs I first read <a href="https://github.com/gearman/gearmand/commit/fc3ef0b047602104b1a60ea1da0a93b7d28005b3">didn’t even mention it</a>.</p> <p>Every job submitted to gearman, whether background or foreground, accepts two things. One is called “arguments” or “payload”, and that’s fed to the worker in all <code class="language-plaintext highlighter-rouge">JOB_ASSIGN*</code> packets.</p> <p>The other is a “unique ID”. This is the magic that gets ignored. In most libraries, if you don’t pass a unique ID for your request, the library will just generate a UUID for you.</p> <p>Most libraries include no explanation as to why this is there, and it’s often invisible to users. There’s a clue in the original implementation though. You’ll note that the unique ID is <em>not</em> included in the original <code class="language-plaintext highlighter-rouge">JOB_ASSIGN</code> packet sent to workers. That’s because this was always meant to be information for the <em>gearman</em> service. Only later did they realize that many jobs were just submitting the same unique ID as the payload, and added in the <code class="language-plaintext highlighter-rouge">GRAB_JOB_UNIQ</code> and <code class="language-plaintext highlighter-rouge">JOB_ASSIGN_UNIQ</code> versions.</p>
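<p>To make the wire format concrete, here’s a rough sketch in Rust of what a <code class="language-plaintext highlighter-rouge">SUBMIT_JOB</code> packet looks like, going by the protocol doc linked above: a 12-byte header, then null-separated arguments. The function name, unique ID, and payload here are made up for illustration, and this isn’t any particular client library’s API, just the packet layout:</p> <pre><code class="language-rust">
/// Sketch of encoding a Gearman SUBMIT_JOB packet per the protocol doc:
/// a 4-byte magic, a 4-byte big-endian packet type, a 4-byte big-endian
/// data length, then null-separated arguments. Note that the unique ID
/// rides right next to the payload, but it is meant for gearmand itself,
/// not the worker.
fn submit_job(function: &[u8], unique: &[u8], payload: &[u8]) -> Vec<u8> {
    const SUBMIT_JOB: u32 = 7; // packet type from the protocol doc

    // Arguments are separated by null bytes: function \0 unique \0 payload
    let mut body = Vec::new();
    body.extend_from_slice(function);
    body.push(0);
    body.extend_from_slice(unique);
    body.push(0);
    body.extend_from_slice(payload);

    let mut packet = b"\0REQ".to_vec(); // request magic
    packet.extend_from_slice(&SUBMIT_JOB.to_be_bytes()); // packet type
    packet.extend_from_slice(&(body.len() as u32).to_be_bytes()); // data length
    packet.extend_from_slice(&body);
    packet
}

fn main() {
    // Hypothetical function name and unique ID, for illustration only.
    let pkt = submit_job(b"resize_image", b"account:42:avatar", b"300x300");
    println!("{} bytes on the wire", pkt.len());
}
</code></pre>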
<p>But generally you <em>should</em> have an ID. It’s not always a record ID, but it’s generally going to be “whatever you’d use to access cache entries”. It’s a unique string for this work. Sure, sometimes you have something obvious, like “account ID”, or sometimes you have free-form arguments, like a list of search terms. Whatever it is, you need to send this to the gearman server. You <em>also</em> should be hashing the unique ID, and choosing a gearman server to send to based on that hash, just like memcache clients do for cache keys. There’s a very good reason for this that we’ll get to later, but the most important thing is that you send a real, useful unique ID for every job you send to the server.</p> <p>See, when your request arrives at a well-behaved gearman server (some don’t do this properly), it will first look to see if it is already aware of this ID. If your job has asked for an ID that already exists, gearman will feed back the same job handle that’s already in the queue to the client. For foreground jobs, gearmand will multiplex the <code class="language-plaintext highlighter-rouge">WORK_*</code> packets that come back from the worker <em>to all submitters</em>. It’s like a distributed thread join.</p> <p><img src="/images/gearman-coalescing.png" alt="only one thread per miss" /></p> <p>This means that you have a guarantee that the number of concurrent active work units is deterministic.</p> <p><code class="language-plaintext highlighter-rouge">n_active = min(n_uniques * n_gearman_servers, n_workers)</code></p> <p>And actually, it’s better than that. If you followed the earlier advice to make your gearman client hash uniques and send submissions to the proper gearman server based on that hash, it’s just:</p> <p><code class="language-plaintext highlighter-rouge">n_active = min(n_uniques, n_workers)</code></p> <p>This means you can go look at your functions, calculate the exact theoretical maximum load your backend resources will experience, and bound auto-scaling for your workers to the exact capability of your backend. The math for capacity planning on those resources gets a lot simpler when you know the upper bounds of the variables deterministically. You can even dig through your cache hit/miss logs and find a good over-subscription scheme to adopt; that calculus is much simpler when one of the variables is fixed.</p> <p>Any old message queue can do the async work, but I haven’t seen any that can do a distributed thread join like gearman does.</p> <p><img src="/images/gearman-magic.png" alt="distributed thread join" /></p> <p>This magic hasn’t lost value because of the sands of time. We are still writing apps that need to scale reads, even if we’re sharding our databases. In fact, response time requirements have gotten <em>more extreme</em>, not less. Being able to build backends that respond rapidly and are scaled to meet exact demand efficiently is critical in building scalable, responsive APIs.</p>
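<p>Here is a minimal sketch, in plain Rust with no gearman library at all, of the client-side hashing described above: hash the unique ID and pick one gearman server from the list, the same way memcache clients route cache keys. The server names are hypothetical, and note that <code class="language-plaintext highlighter-rouge">DefaultHasher</code> isn’t guaranteed stable across Rust releases, so a real fleet of clients would pin a specific hash algorithm:</p> <pre><code class="language-rust">
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick which gearman server receives a submission by hashing the
/// unique ID. Every client submitting the same unique lands on the same
/// server, so that server can coalesce the requests into one job, which
/// is what keeps n_active = min(n_uniques, n_workers).
fn pick_server<'a>(unique: &str, servers: &'a [&'a str]) -> &'a str {
    let mut hasher = DefaultHasher::new();
    unique.hash(&mut hasher);
    servers[(hasher.finish() % servers.len() as u64) as usize]
}

fn main() {
    let servers = ["gearman1:4730", "gearman2:4730", "gearman3:4730"];
    // All submissions for this unique ID route to a single server.
    println!("{}", pick_server("account:42:avatar", &servers));
}
</code></pre>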
<p>Gearman has other tricks to help with user interaction too, such as the <code class="language-plaintext highlighter-rouge">WORK_STATUS</code> packets, which send back a numerator/denominator pair, or <code class="language-plaintext highlighter-rouge">WORK_DATA</code>, which lets you send partial responses while the final result is calculated. That’s perfect for sending a stale cache entry back immediately, then updating with the final result and sending it with a <code class="language-plaintext highlighter-rouge">WORK_COMPLETE</code>.</p> <p>Unfortunately, one reason this was lost is that when the Perl gearman client was rewritten in C, hashing the unique ID to pick a server for <code class="language-plaintext highlighter-rouge">SUBMIT_JOB*</code> packets was lost with it. And if you look, all of the language client libraries written since have failed to do it, as most are just either using libgearman or copying it. I include the recent rustygear library I wrote in this indictment. It’s not easy, but it is such an awesome thing that it’s worth doing. I hope to find some time to add this to rustygear, and maybe even, with time, replace libgearman with rustygear so other languages can use it.</p> <p>Another reason I think that Gearman has lost momentum is that HTTP has gotten a lot smarter. You don’t need a special TCP protocol to achieve the level of responsiveness required to get this job done. I’ll be looking into a way to make the Gearman protocol ride on HTTP/2 and WebSocket, so that folks can write workers and clients without needing a native gearman library.</p> <p>I hope you see a solution to your problems in this, even if it’s not deploying gearmand and rewriting things as gearman workers. And if you think Gearman <em>is</em> the right solution for you, I invite you to join <a href="https://groups.google.com/d/forum/gearman">the gearman google group</a>, and/or <a href="mailto:clint@fewbar.com">reach out to me</a> and chat with us about that.</p> <p>Happy Hacking</p> Tue, 21 Apr 2020 00:00:00 +0000 http://fewbar.com/2020/04/gearman-is-misunderstood/ gearman memcache elegance Scalability A bit of deep analysis <p>In case you missed it, <a href="https://www.linkedin.com/feed/update/urn:li:activity:6650805910325338112/">I’m on the job market</a>. As I’ve been winding through the various opportunities out there, I found a company that wants a sample of “analysis” as part of their application.</p> <p>I’ll be honest, I sat in front of this section for an hour without coming up with anything. I wrote down some of the times I’d done deep analysis that I could remember, but none of them really impress me. Now, you may be reading this thinking “heh, yeah, cause this guy doesn’t know how to analyze”. But to be honest, I think for me, this question is like asking <a href="https://en.wikipedia.org/wiki/Greg_Maddux">Greg Maddux</a> how he threw his best fastball, or asking <a href="https://en.wikipedia.org/wiki/Mick_Jagger">Sir Mick Jagger</a> how he performed his best chicken dance, or a great white shark how it ate its most fish. They could all probably tell you more about doing taxes. They’re just being themselves; it’s what makes them great.</p> <p>Analyzing systems has always been very natural for me. I mean always. I was that kid pulling apart the VCR to see what was inside, and then putting it back together successfully… sometimes.
I truly do not remember what I did there, but look ma, the whining noise is gone when we rewind!</p> <p>So, that said, I’ll not give you the best analysis I’ve done, because I just can’t pick one of the thousands. Instead, I’ll give you the very first one I was paid for. A brief story.</p> <p>The year is 1997, and the protagonist is a recent graduate of high school. For the past 3 months, our hero has been employed doing his dream: playing with computers. As a contrarian, this rare Linux-using teenager has decided to go right into the IT industry and skip that silly college thing (yes, this teenager was as dumb as a teenager, and twice as stubborn). And they even have an HP-9000, a <em>real</em> unix box, to play with.</p> <p>The developers are all perplexed on this day, however. The <em>literal green screen</em> application that runs the entire company is crashing, randomly. All hands on deck: the 4 developers are furiously examining code and records in the database to find out what is causing this problem. Our hero loves code, and has chatted with the devs a few times about what they do. This weird system they work in is called “PickDB”, and the app is in “PickBASIC”, and it’s really simple, so Sir Paperbox-carrying, PC-network-card-reseating 18-year-old has been peeking at the manual and playing on his own green screen to learn it in his spare time.</p> <p>Alas, the problem is vexing. This particular firm is a medical equipment manufacturer, and billing agents are seeing, randomly, that some records they pull up simply corrupt their entire screen view. The cursor flies around, weird matrix-like (oh wait, there’s no Matrix yet) characters print, and generally they have to have a server-side reset to continue (kill their shell, the database vendor explains, and our hero already knew how to do that and showed the devs so they could self-service).</p> <p>Well, the devs are fighting through it, so he sits down to peek at the database as well. They have a list of records known to cause the problem, so he pulls one up with the usual method: SELECT foo WHERE id = ‘999’, or something like that; honestly, it wasn’t SQL, but it was the SQL of PickDB. Behold! A perfect record is printed, line by line. One by one, all of the records look pretty normal, though you can see where the longer records had incomplete words.</p> <p>The vigor of this young man will not be stopped though, and he wonders: if he just writes a program to print out the records, will it work differently? He writes the program, runs it, and there, in front of him, is a broken terminal.</p> <p>So, he kills his session and tries turning the characters into their ASCII numbers. Oh noes, the program crashes HARD. What’s this then? The built-in function that turns characters into numbers has printed a runtime error and halted his program. “High-bit char detected.”</p> <p>Fast-forward to an hour later, where this 18-year-old is explaining to these 40-year-olds how to fix the problem. See, PickDB and PickBASIC don’t actually exist anymore. It was originally run on IBM mini-computers, but those are expensive, so some vendors had shown up to swoop in and replace it with cheaper emulators that run on cheaper HP-9000s. This one was called “UniData”, and it worked flawlessly.</p> <p>However, when this company had converted from Pick to UniData, the vendor had converted most of the data in some black-box process.
But now, 4 years later, the database had been purged of most of those records, and everything in it was newly entered on the green screens, which were serially wired to a serial-to-telnet gateway via CAT-3 cables wired into RS-232 connectors. So, one day, 4 years ago, the vendor had arrived, taken the old IBM system out, put in the HP-9000, taken all of those RS-232 patches out of the IBM MUX, and put them into an HP terminal concentrator.</p> <p>Meanwhile, over 4 years, the company had grown, and more and more green screens were purchased and wired up in the same fashion. Reorgs shifted people around, and terminal connections stayed where they were, with users just logging in from wherever they sat.</p> <p>And so, what had happened? Last week, a reorg had sent accounting into the space used by medical bill processing, and vice-versa. This meant that records were now being entered in the space that had been newly built out for accounting 2 years earlier. Accounting didn’t do much data entry, but medical billing was literally just taking doctors’ patient forms on paper and entering them into forms on green screens. And so, for the past week, new terminals had been used to enter all of the new records.</p> <p>But the vendor had forgotten one thing. They forgot to tell the person they had trained to wire new connections to log in to the terminal concentrator and change new ports to use x-on/x-off 7-bit for green screens. The default was RTS/CTS hardware flow control. The details of this dark and arcane protocol are lost to my brain, as it has been over 20 years, but basically, when people typed fast, the concentrator would set a pin on the RS-232 to 1 instead of 0 to say “STOP SENDING ME STUFF MAH BUFFER’S GONNA BURST”, but the terminal would keep sending stuff, and when the pin went back to 0, the concentrator would start reading again, and binary hilarity ensued.</p> <p>This realization came about because our hero had been playing with serial connections and modems since he was 12. He knew what RTS/CTS and x-on/x-off were, sorta, and had read the manuals for both the green screens and the terminal concentrator while debugging this problem. Once the high-bit chars error was seen, the real problem was unmasked. And it was a short process to go make a well-behaved terminal suddenly become a garbage-data generator.</p> <p>Funny enough, the SELECT program from UniData had a filter which would remove high-bit chars, which is why it was so hard to find these nasty chars. Even worse, this DB used the 8-bit int value of 253 to mean “field delimiter”, and 254 as “record delimiter”, so if you got REALLY unlucky, your keystrokes would split a record or a field in two! The devs just hadn’t thought that the built-in program would lie to them! By writing his own program, our hero had satisfied the X-Files rule of debugging: Trust No One. After reading the manuals, he had then applied the same principle to the terminal concentrator configs: why trust one port over another?</p> <p>In closing, this analysis also led to an amazing discovery that changed the lives of the data-entry folks for a brief time. In reading the manual for the port concentrator, he discovered that the ports were all set to 19200bps, which was OK, but that the port concentrator could actually go much, much faster if it could use RTS/CTS. And all of the new terminals in the data-entry area were WYSE-60+, not WYSE-50, which meant they actually could be configured for RTS/CTS and the much higher data rate.
As a result, screen paints were 10x faster on these models, and soon they became coveted treasures in the company.</p> <p>FIN</p> Thu, 02 Apr 2020 00:00:00 +0000 http://fewbar.com/2020/04/a-bit-of-analysis/ terminals unix pick joy Debugging YAML shyaml - Wrote a tool about it, here it goes <p>I’ll keep this brief and on point: <a href="https://fewbar.com/2017/01/a-love-letter-to-rust/">I love Rust</a>, and yet I have published so little <a href="https://github.com/SpamapS/rustygear">Rust code</a>, and as yet, none that actually saw real usage.</p> <p>Well, today that changes. With many thanks to my former employer, <a href="https://www.godaddy.com/">GoDaddy</a>, for assigning copyright of a work product to me, I am pleased to announce the publishing of <a href="https://crates.io/crates/shyaml">shyaml</a>.</p> <p>When I joined GoDaddy, among other duties, I was tasked with keeping their “Legacy” Kubernetes installation alive and running. The team that had built it was all gone, and I was fairly new to Kubernetes, so I had to tread lightly.</p> <p>As a result, I found myself constantly wishing I could see what <code class="language-plaintext highlighter-rouge">kubectl</code> was about to do. I was shocked that <code class="language-plaintext highlighter-rouge">kubectl</code> didn’t have a diff mode. In retrospect, I probably should have just learned Go, and added a <code class="language-plaintext highlighter-rouge">--diff</code> switch to <code class="language-plaintext highlighter-rouge">kubectl</code>. But I fell back on the tool I knew, and I had just written <code class="language-plaintext highlighter-rouge">shyaml</code> to get my Rust chops up. It was just a general CLI manipulation tool for YAML and JSON at the time. Adding the ability to intelligently diff Kubernetes objects was pretty straightforward.</p> <p>So I added a <code class="language-plaintext highlighter-rouge">kubediff</code> command, which tries to show you the real effect a <code class="language-plaintext highlighter-rouge">kubectl apply</code> command will have before you run it. It’s not perfect, but it generally works.</p> <p>And so, I give this tool to the world. I hope to make it even more general purpose, as time permits. But until then, please enjoy the code, and feel free to report issues and open PRs. Also, join us on the <a href="https://toolsforhumans.slack.com">Tools for Humans Slack</a> if you want to chat about it.</p> Mon, 29 Oct 2018 00:00:00 +0000 http://fewbar.com/2018/10/yaml-shyaml/ rust yaml kubernetes Rust The Build is Never Broken <p>When I joined the OpenStack Engineering team at GoDaddy, I want to make it clear: things were not completely terrible. In fact, there were some really amazing things happening.</p> <p>Most of the things they needed to do over and over were in code, and 99% of that was in a single chain of git repositories that the entire team was aware of and participating in. There was CI in various forms running against PRs opened on repos and building things after stuff landed. In general, most obvious mistakes didn’t waste the team’s time.</p> <p>However, there was one really frustrating thing that nobody seemed to be able to really solve: there was no realistic dynamic development environment. So many repos, so many components; it just wasn’t feasible to scale it down entirely.
If you had an idea of how something needed to get done, you had a few choices, none of which were fast or fun:</p> <p>1) You could write a patch that looked kind of right, propose it against the right git repo, get it through review and landed, wait for the artifact builds to produce the needed bits, deploy them to the “dev” environment, and then iterate on that until dev worked the way you wanted, promoting the artifacts to stage and eventually production.</p> <p>2) You could try to edit the <code class="language-plaintext highlighter-rouge">dev</code> environment directly, reverse engineer what you did back into a patch, and then do step (1), but with at least a bit more confidence.</p> <p>3) Cry into your pillow. Go to step 1.</p> <p>Now, you may be asking “why not Vagrant?”, and I’d say that too, except I have almost never had any success with Vagrant or similar. Because it’s so different from production deployments, the Vagrant build is almost always broken a little bit in some way. Also, really, because there are so many of these type (1) changes in flight, oftentimes <code class="language-plaintext highlighter-rouge">master</code> is a complete shambles, and you end up needing to either rewind to the last known good deploy, or pull somebody else’s branch that has a fix in flight. It’s entirely possible to have a great local-dev or even cloud-based Vagrant or Vagrant-like tool. But the effort to build and maintain it is pretty large, and if the only benefit you get is local dev, you have just traded one velocity problem for others, such as change delivery success rate and transparency.</p> <p>Does this sound familiar? After joining a few DevOps-ish teams over the last few years, I can say it’s a common pattern. This isn’t a tools problem. The team was sort of 95% Puppet and 5% Ansible, and that 5% was no better or worse in this respect. No, we didn’t have to wait for RPMs to be built for Ansible, but because we also could just iterate on our laptops, many in-flight, in-dev changes were actually more dangerous than Puppet changes, because they often would be run on production and then just forgotten, never re-incorporated into the git repos.</p> <p>That’s not because the team didn’t want to be better tested, nor was it because they lacked the capability. They had some pretty heavy-hitting Jenkins talent at their disposal to get anything Jenkins could do done. And it’s worth noting that there was an attempt to build a “dev on demand” set of Ansible playbooks to get this done. However, as noted above, this might not actually have resulted in a net-positive win.</p> <p>The problem that many teams face is the same one that my co-workers faced then: the build was often broken and incomplete, whether that was detected or not, because the tools we have for testing are not flexible where they need to be, and as a result, they’re not able to be strong when they need to be.</p> <p>Solving this problem is the crux of Zuul’s very existence. Adopting Zuul means any team, whether you’re 3 people working on a single repo or 300 people working across 25 repos, can reap the benefits of an always-deliverable set of repositories and branches.</p> <p>So how does Zuul do this, and why can’t other systems get this done? Well, the answer starts with git.</p> <p>If you’re not familiar with git, um… it’s 2018, please spend the next hour learning what it is, and then 5 minutes asking why on earth your build engineers haven’t switched.</p> <p>Most tools try very hard to be somewhat git-stupid.
Jenkins comes to mind here. While it has plugins to listen for <code class="language-plaintext highlighter-rouge">GitHub</code> webhooks and other change management systems, it really doesn’t want to be too aware of git. Git is relegated to the same status as <code class="language-plaintext highlighter-rouge">curl</code>: it’s the way you fetch the code to do things with.</p> <p>But that’s really not all git is. This single-mindedness around git is often where tools sort of give up. Implementers of CI tools seem reluctant to think of what happens after a git commit lands.</p> <p>But Zuul came from a place where there were very large penalties for landing broken code. OpenStack had an extremely wide scope, and as such, many developers were showing up with code and integrations between the various projects. Having “the build broken” for 2,000 people shines a very bright light on just how important it is that tests work, and that code does what reviewers and coders believe it does.</p> <p>Because, you see, when you land things in git, they are thus part of a timeline that did not exist when others pulled code. And when those others land things in their local git repository, and make it work with their local dev tools, there’s nothing to say that what’s in master will keep working when integrated with those unfinished changes. Add in dependent repos and changes, and the certainty of something that worked today continuing to work if landed tomorrow is very low. It’s all a big, messy, eventually consistent, distributed system.</p> <p>But it doesn’t have to be broken all the time, and we don’t have to always rebase on master every time it changes “just in case”. If we think of a set of git repositories and/or branches as a unit to be tested together, it’s a very short bridge to fully testing things together in the form that they will be made available to others, before they are made available to others. Just like you don’t send things to production without testing them in stage exactly as they are first, you shouldn’t send code to your developers without having first had it tested exactly as it will land.</p> <p>And once you decide that’s a good thing to do, other questions pop into your head. What about multi-node testing of distributed systems? How can I make it go faster? These are questions that the creators of Zuul faced as well, and solved with simple, straightforward answers that have been proven valuable through the years of OpenStack running its entire development infrastructure on them.</p> <p>So, at GoDaddy, when it came time to reboot our OpenStack installations a bit, I couldn’t think of a better way to leverage Zuul’s power than to start with a 5-VM job to deploy a mini-cloud onto and run some tests against.</p> <p>I’ll be honest, I gave this initiative a 50% chance of working. The deadlines were tight, and the political environment around the products being supported was a bit stressful. I fully believed that somebody would look at what we were doing and pull the plug in favor of some other more widely accepted tool or more established pattern.</p> <p>Luckily, nobody had time to say no to Zuul. So we built this 5-VM job up to the point where it deployed an OpenStack control plane the way we wanted to deploy in production.
Our first few deploys to the POC environment found all the ways that a mini-cloud is different from a real one, but we had an enormous amount of momentum built up behind Zuul and this job, and Zuul mostly seemed to be getting out of people’s way at the right time. We had our “dev on demand”, and even better, we had a single, transparent way to land changes.</p> <p>Since then we’ve been pushing changes at a pretty impressive clip, and it is quite rare for a change to break production. We get quite a bit of coverage for our automation from the various Zuul jobs that run before changes land. What regularly causes us issues now are the corner cases, such as scale problems, mis-typed config details, or scheduling issues where the hardware actually matters.</p> <p>But we have devised a scheme to leverage Zuul for these issues as well, by using it to kick off deploys to our staging environment and promote changes from there to production only after automated tests have run.</p> <p>So how does Zuul do this, and why does it leverage Ansible?</p> <p>First and foremost, because Zuul is fully git-aware, an engineer is effectively able to build a new future in a set of git repositories and branches. If this new future must span multiple repos, Zuul offers the engineer the ability to specify dependencies on other changes.</p> <p>So, if you need to propose a new variable for a role that is shared amongst automation concerns, and then depend on that variable for yours, you can add <code class="language-plaintext highlighter-rouge">Depends-On: https://your.github/roles_org/role_repo/pulls/1234</code> to the commit message of your change. While Zuul is building working directories to run jobs in, it will see this dependency, and use the branch/PR/etc. that you have submitted as its basis for pre-merge testing. And when all reviewers are happy with your dependent change, it won’t land until the upstream dependency lands too.</p> <p>So now you can build an entire future without having landed risky code in master, and without having to wait for your upstream dependencies. This even works if the upstream project doesn’t use Zuul. As long as you can give Zuul credentials to inspect the change management system and pull the necessary commits, it will be able to build a speculative future from them, and it won’t accidentally land your dependent change until the upstream change is merged.</p> <p>This even works across Gerrit, GitHub, and GitHub Enterprise. Other systems are also in development. This is nice if, say, you have GHE internally and you depend on upstream GitHub projects: you can encourage your engineers to submit code upstream. And if you really want to get involved deeply with that upstream, you can even set up your Zuul as a GitHub app, and report statuses on those repositories, which should help them avoid merging code that breaks you.</p> <p>Now that you’ve embraced Zuul’s future-building capabilities, you’ll want to start expanding coverage. Zuul has all of what you’d probably expect for doing the easy stuff: linters, unit tests, etc. But what about full integration tests, like the “build a mini-cloud” job we talked about before?</p> <p>For this, Zuul is going to need a cloud. Current releases only support OpenStack clouds, or static pools of SSH/PowerShell-accessible machines. However, there are several public cloud drivers, such as AWS and GCE, that are in various stages of quality.
The AWS driver in particular is close to being ready.</p> <p>Once you have given Zuul cloud resources via its sub-component named <code class="language-plaintext highlighter-rouge">nodepool</code>, you can start attaching nodesets to your jobs. These can be backed by various flavors, images, and configuration details, and given names and Ansible groups for use in playbooks. Typically you’ll even have a default nodeset in your base job that everything parents to.</p> <p>At GoDaddy we have allocated 75 “large” instances (8GB RAM, 120GB of disk) on one of our less busy private OpenStack clouds for running tests. We have also defined a custom image using <code class="language-plaintext highlighter-rouge">nodepool-builder</code>, which gets built every 12 hours to pull in the latest apt and pip packages that our jobs will need. That way our job runtimes don’t get too long with downloading and extracting new copies of things. We also ask <code class="language-plaintext highlighter-rouge">nodepool</code> to keep 15 nodes running at all times, so that any 5-node job will be able to start <em>immediately</em> and not have to wait for the cloud to spin up VMs. This also tends to smooth out problems with cloud control planes, which we do experience from time to time.</p> <p>Alright, so now we have compute resources, optimized images, and git repos plugged into Zuul. What’s next? We need to define jobs.</p> <p>At GoDaddy we have some repos that have just one <code class="language-plaintext highlighter-rouge">noop</code> job on them. This still has a benefit, as the repo may be housing Zuul configuration that needs to be validated before changes are landed. We also have our busiest repo, which we call <code class="language-plaintext highlighter-rouge">openstack-deploy</code>, which runs between 3 and 9 jobs on every PR, and 3-4 jobs in the gate. The number varies because sometimes we use Zuul’s ability to skip jobs by filename. So, for instance, we don’t need to run the big, long <code class="language-plaintext highlighter-rouge">kolla-ansible</code> job which deploys a mini-cloud if the change wholly consists of configuration details for our production clouds.</p> <p>One really interesting aspect of working with Zuul is when you start to have interdependencies in repos. We have a repo which houses patches which we apply to the upstream OpenStack deployment tool named <code class="language-plaintext highlighter-rouge">kolla-ansible</code>. Whenever we run any deployments of kolla-ansible, we apply these patches on top, and generate the configuration for <code class="language-plaintext highlighter-rouge">kolla-ansible</code> on top of that. This means that if we update the patch repo with something that won’t deploy right, we could end up with a broken master again.</p> <p>But luckily, Zuul was built with this scenario in mind, and as such, allows us to have two repos run the same job in a shared queue. That means that if I propose a patch to <code class="language-plaintext highlighter-rouge">openstack-patches</code>, and an update to <code class="language-plaintext highlighter-rouge">openstack-deploy</code> that seems unrelated, they’ll be tested together, one landing before the next, using Zuul’s speculative execution capabilities. If we do this right, it means we can’t land broken code in either repo.</p> <p>Finally, for those times when you just can’t figure out why a job is failing, there’s the “auto hold”. Zuul can hold on to test nodes after a job fails if you inform it that you need it to.
This allows an engineer to log in to the test nodes and poke around, even to try running the test again with modified code. Many of our biggest refactors happened on held test nodes, where an engineer would fiddle with things on those VMs, and then pull the changes back down and submit a fixed patch over the course of a few days.</p> <p>So, from a cultural perspective, how did having these new capabilities affect team productivity?</p> <p>First and foremost, there’s a good chance that if we didn’t have Zuul, we would have crumbled under the pressure of a very tight deadline. With many, many changes in flight and moving rapidly, it would have been an absolute momentum killer to have to stop and fix broken builds every day. Furthermore, by being able to let Zuul spin up mini-clouds, we gave our developers a ‘fire and forget’ mechanism for testing their changes in parallel, isolated from each other while the changes were in chaos.</p> <p>Second, we found that members of the team were able to find information about how things work faster, because everything, including the configuration for the actual tests, is stored in git trees. It really does help when you’re doing a <code class="language-plaintext highlighter-rouge">git annotate</code> on a file, and the change for a line that is confusing you is accompanied by edits to the testing configuration. This is especially helpful in root-cause analysis, where you are trying to match timelines from multiple sources together to find when a change may have been made that resulted in an incident.</p> <p>Finally, Zuul was actually only half of the story. Another big part of it was that because Zuul was just running Ansible, we were able to leverage that for other tasks. We don’t actually run our Ansible against production using Zuul. Instead, we run the same playbook that Zuul does with our super duper chat bot named Padre. I very much hope that we can open source Padre soon, and present it at a future AnsibleFest. But ultimately, Zuul was relatively straightforward to adopt because we could use the tool we knew already, Ansible, and it was also easy to break back out of Zuul when we needed to, for the same reason.</p> <p>So what should you do if you are interested?</p> <ul> <li> <p>Attend Ricardo Carrillo Cruz’s deep dive into Zuul and Ansible Networking at 3:00pm.</p> </li> <li> <p>Come talk to us at the Zuul booth!</p> </li> <li> <p>Deploy Zuul – I’m not going to lie, this isn’t easy, and it doesn’t make a ton of sense unless you have your code in either GitHub, GitHub Enterprise, or Gerrit.</p> </li> </ul> Tue, 02 Oct 2018 00:00:00 +0000 http://fewbar.com/2018/10/the-build-is-never-broken/ zuul cicd automation development vcs git Technology Trump gonna Trump - Time to move forward <p>I usually don’t post much about politics here, but this week our President was <a href="https://www.washingtonpost.com/politics/a-time-magazine-with-trump-on-the-cover-hangs-in-his-golf-clubs-its-fake/2017/06/27/0adf96de-5850-11e7-ba90-f5875b7d1876_story.html">caught looking pretty silly</a>, and I feel like it’s worth commenting on.</p> <p><img src="/images/trumpcover.jpg" alt="Fake Time Cover" /></p> <h6 id="source-washington-post">Source: Washington Post</h6> <p>Just do the thought experiment: imagine if Hillary Clinton or Barack Obama or John Kasich made something like this and put it up anywhere. Those who thought highly of them would think it was a joke.
Any of them would likely come out immediately and explain that it was in fact a joke, and most of us would have a laugh at it. It’s so out of character, it would just be laughable to anyone who looks at those individuals as good people.</p> <p>But of course, no matter what they did, those who already thought negatively of them would call it narcissism of the highest form and decry them as purveyors of fake news. This would blow over pretty quickly, because none of these people have been pointing fingers at their critics and calling them fake news. But it would certainly serve to enrage their detractors.</p> <p>The reason this doesn’t bother the <a href="http://time.com/4523972/donald-trumps-comment-root-sexual-violence/">45th President of the United States</a>’ supporters is that <em>they already accepted that he has no integrity</em>. They see this and just go “This guy. What will he do next?” But if you’re holding his feet to the fire, it just tickles that healthy confirmation bias that he is the end of our democratic traditions and a truly terrifying individual who is burning down the reputation of the presidency one <a href="https://www.nytimes.com/2017/05/31/us/politics/covfefe-trump-twitter.html">stupid tweet</a> at a time.</p> <p>Usually we have a fringe of people who treat the president this way. Either it’s the far left protesting Bush 43 over Iraq and spending cuts, or the far right protesting Obama over immigration and climate change policies. Grow a thick skin, Mr. President, whoever you are.</p> <p>But now we have <a href="https://projects.fivethirtyeight.com/trump-approval-ratings/">a large majority of moderates joining the far left in opposing this president</a>. That only serves to entrench his supporters more. They see themselves as outsiders who finally got their guy in. So don’t be surprised when his supporters laugh off these stunts. They don’t care, because they just want you to acknowledge that they won and you lost, and that he cares about them and their issues. They don’t mind that he’s embarrassing, as long as he’s sticking it to you, the elites and their supporters who have been sticking it to them for years.</p> <p>IMO, just ignore all of that. We’ve said our piece against this man, and now it’s time to get down to fixing the damage and building a better society that is more resilient to these problems.</p> <p>How? We start by taking time to support strong leaders who will restore the respect and dignity of the office and the nation. <a href="https://mayday.us/">Support legislators that will amend the constitution</a> to reverse the influence corporations have over elections, so we aren’t stuck with fundraisers and reality TV stars as our only choices.</p> <p>And finally, when our friends, family, and neighbors who supported Trump come back in 2020 like a screaming horde of Vikings intent on burning down the last vestiges of civil society: don’t hate them. Don’t belittle them. Don’t even shout at them. Just be ready with the only real defense any of us have: our vote.</p> Sat, 01 Jul 2017 00:00:00 +0000 http://fewbar.com/2017/07/trump-gonna-trump/ politics Life Free and Open Source Leaders -- You need a President <p>Recently I was lucky enough to be invited to attend the <a href="http://events.linuxfoundation.org/events/open-source-leadership-summit">Linux Foundation Open Source Leadership Summit</a>.
The event was stacked with many of the people I consider mentors, friends, and definitely leaders in the various Open Source and Free Software communities that I participate in.</p> <p>I was able to observe the <a href="https://www.cncf.io/">CNCF</a> Technical Oversight Committee meeting while there, and was impressed at the way they worked toward consensus where possible. It reminded me of the <a href="https://www.openstack.org/foundation/tech-committee/">OpenStack Technical Committee</a> in its makeup of well-spoken technical individuals who care about their users and stand up for the technical excellence of their foundations’ activities.</p> <p>But it struck me (and several other attendees) that this consensus building has limitations. <a href="https://twitter.com/adamhjk">Adam Jacob</a> noted that Linus Torvalds had given an interview on stage earlier in the day where he explained that most of his role was to listen closely for a time to differing opinions, but then stop them when it was clear there was no consensus, select the option he felt was technically excellent, and move on. Linus, being the founder of Linux and the benevolent dictator of the project for its lifetime thus far, has earned this moral authority.</p> <p>However, unlike Linux, many of the modern foundation-fostered projects lack an executive branch. The structure we see for governance is centered around ensuring that corporations that want to sponsor and rely on development have influence. Foundation members pay dues to get various levels of board seats or corporate access to events and data. And this is a good thing, as it keeps people like me paid to work in these communities.</p> <p>However, I believe as technical contributors, we sometimes give this too much sway in the actual governance of the community and the projects. These foundation boards know that day-to-day decision making should be left to those working in the project, and as such allow committees like the <a href="https://www.cncf.io/">CNCF</a> TOC or the <a href="https://www.openstack.org/foundation/tech-committee/">OpenStack TC</a> full agency over the technical aspects of the member projects.</p> <p>I believe these committees operate as a legislative branch. They evaluate conditions and regulate the projects accordingly, allocating budgets for infrastructure and passing edicts to avoid chaos. Since they’re not as large as political legislative bodies like the US House of Representatives &amp; Senate, they can usually operate on a consensus basis, and not drive everything to a contentious vote. By and large, these are as nimble as a legislative body can be.</p> <p>However, I believe we need an executive to be effective. At some point, we need a single person to listen to the facts, entertain theories, and then decide and execute a plan. Some projects have natural single leaders like this. Most, however, do not.</p> <p>I believe we as engineers aren’t generally good at being like Linus. If you’ve spent any time in the corporate world, you’ve had an executive disagree with you and run you right over. When we get the chance to distribute power evenly, we do it.</p> <p>But I think that’s a mistake. I think we should strive to have executives. Not just organizers like the <a href="https://docs.openstack.org/project-team-guide/ptl.html">OpenStack PTL</a>, but more like the <a href="https://www.debian.org/devel/leader">Debian Project Leader</a>.
Empowered people with the responsibility to serve as visionaries and keep the project’s decision making relevant and of high quality. This would also give the board somebody to interact with directly, so that they do not have to try and convince the whole community to move in a particular direction to wield influence. In this way, I believe we’d end up with a system of checks and balances similar to the US Constitution.</p> <p><img src="/images/usgovt.jpg" alt="Checks and Balances" /></p> <p>So here is my suggestion for how a project executive structure could work, assuming there is already a strong technical committee and a well-defined voting electorate that I call the “active technical contributors”.</p> <ol> <li> <p>The president is elected by <a href="https://en.wikipedia.org/wiki/Condorcet_method">Condorcet</a> vote of the active technical contributors of a project for a term of 1 year.</p> </li> <li> <p>The president will have veto power over any proposed change to the project’s technical assets.</p> </li> <li> <p>The technical committee may override the president’s veto by a super majority vote.</p> </li> <li> <p>The president will inform the technical contributors of their plans for the project every 6 months.</p> </li> </ol> <p>This system only works if the project contributors expect their project president to actively drive the vision of the project. Basically, the culture has to turn to this executive for final decision making before it comes to a veto. The veto is for times when the community makes poor decisions. And this doesn’t replace leaders of individual teams. Think of those like the governors of states in the US. They’re running their sub-project inside the parameters set down by the technical committee and the president.</p> <p>And in the case of foundations or communities with boards, I believe ultimately a board would serve as the judicial branch, checking the legality of changes made against the by-laws of the group. If there’s no board of sorts, a judiciary could be appointed and confirmed, similar to the US Supreme Court or the <a href="https://www.debian.org/devel/tech-ctte">Debian CTTE</a>. This would also be necessary to ensure that the technical arm of a project doesn’t get the foundation into legal trouble of any kind, which is already something foundation boards tend to watch for.</p> <p>I’d love to hear your thoughts on this on Twitter; please tweet me <a href="https://twitter.com/spamaps">@SpamapS</a> with the hashtag #OpenSourcePresident to get the discussion going.</p> Sat, 18 Feb 2017 00:00:00 +0000 http://fewbar.com/2017/02/open-source-governance-needs-presidents/ opensource governance openstack cncf Technology Open Source OpenStack CNCF Rust - You Complete Me (And then drop me, because I'm out of scope) <p>To My Dearest <a href="http://www.rust-lang.org/">Rust</a>,</p> <p>Ever since I laid eyes on your braces and semicolons, I knew there was something special about you. <a href="https://github.com/SpamapS/rustygear">This past winter holiday</a> that we spent together has changed my life. I’ll never be the same. The way you embrace life by being explicit about the death of objects, the way you force me to be clear when I’m borrowing your things. Sure, it was a bumpy beginning. I thought maybe I might run back to safe, warm Python’s arms. But you didn’t give up on me; you kept warning me that I was making everything mutable when I didn’t have to. And now, whatever happens, I’m a better man for having known you. <img src="/images/InLove.gif" alt="Rust, how do I love you, let me count the ways" /></p>
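<p>(A small sketch of the gentle scolding I mean; the greeting is of course hypothetical, but the warning is the real one <code class="language-plaintext highlighter-rouge">rustc</code> whispers to me:)</p> <pre><code class="language-rust">
fn main() {
    // rustc warns: "variable does not need to be mutable", and suggests
    // removing the `mut`, because `greeting` is never actually changed.
    let mut greeting = String::from("hello, rust");
    println!("{greeting}");
}
</code></pre>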
<p>Some might say being explicit about the length of our lifetimes is macabre, but I find it invigorating. It’s a reminder that some things will outlive others. Being able to see that, and to know the day some of our objects will die, is a reminder that most of our data is related, and that sometimes we need to spell out how, up front, to prevent garbage building up, which would force us to pause and deal with it later.</p> <p>And you saved me from modifying my variables in loops. I never even knew how many times I made that mistake and had to double back to fix those errors. I always thought I was being cool, reusing variables, but you called me out and made sure I never did that after I gave them to someone else. This made me frugal with my CPU and memory by helping me think about when and where exactly I’d spend them. Explicit mutability? How about explicit <em>cuteability</em>.</p> <p>And just the other day, when I asked you if we could go multi-threaded together, you didn’t just go along easily. You didn’t just hand me the keys and make me drive the whole process. You challenged me to use mutexes and reference-counted pointers. You held my hand while I fumbled through it, and offered encouraging tips, with a lot of reminders to wrap things in safer containers before we went out into the cold, brutal multi-threaded world. Because of you, I’ll never have to feel the cold sting of corrupted memory again.</p>
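<p>(Again, a minimal sketch of that courtship, with a made-up counter: the <code class="language-plaintext highlighter-rouge">Mutex</code> guards the mutation, and the <code class="language-plaintext highlighter-rouge">Arc</code> lets every thread hold a reference, because Rust simply won’t compile the naive version without them:)</p> <pre><code class="language-rust">
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared counter: a reference-counted pointer around a mutex.
    let counter = Arc::new(Mutex::new(0u32));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                *counter.lock().unwrap() += 1; // one safe increment per thread
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    println!("count = {}", *counter.lock().unwrap());
}
</code></pre>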
<p>My love, Rust, I don’t know if we can be together. You’re so new to this world and I’m not sure everyone will understand you. But I know I’ll do whatever I can to tell the world about your beauty and grace.</p> <p>Love Always, - Clint</p> <p><em>p.s. let’s meet up again around spring break.</em></p> Mon, 23 Jan 2017 00:00:00 +0000 http://fewbar.com/2017/01/a-love-letter-to-rust/ http://fewbar.com/2017/01/a-love-letter-to-rust/ rust programming Rust OpenStack's nova-compute's border is porous - We need to build a wall <p>In the beginning there was Nova. It included volumes, networking, hypervisors, and scheduling. Since then, Nova components have either been replaced (nova-network with Neutron) or forklifted out and enhanced (Cinder). In so doing, interfaces were defined for how Nova would continue to make use of these now-external services, but nova-compute, the place where the proverbial rubber meets the road, was left inside Nova. This meant that agents for Cinder and Neutron had to interact with nova-compute through the high-level message bus, despite being on the same physical machine in many (but not all) cases. Some deployments even rely on that co-location, and require operator cooperation when configuring certain drivers.</p> <p>This has led to implementation details leaking all over the APIs these services use to interact. Neutron and Nova do a sort of haphazard dance to plug ports in, and Cinder has drivers which require locking files on the local filesystem in a particular way. These details leak because nova-compute is actually a shared service that should not belong to any of the three projects. Instead, it should define a clearer API through which Nova, Cinder, and Neutron can all access the physical resources of a machine on an equal footing.</p>
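<p>(To make the shape of that border concrete, here is a purely hypothetical sketch, written in Rust for brevity even though OpenStack itself is Python. Every name in it is invented for illustration; nothing like this exists in Nova today.)</p> <pre><code>// A hypothetical host-local resource API that an independent compute
// service could expose to Nova, Cinder, and Neutron as equal callers.

pub struct VolumeSpec { pub id: String, pub device: String }
pub struct PortSpec { pub id: String, pub mac: String }
pub struct InstanceSpec { pub id: String, pub image: String, pub memory_mb: u64 }

pub trait HostResources {
    /// Cinder attaches block storage through the same port of entry...
    fn attach_volume(&mut self, instance_id: &str, volume: VolumeSpec) -> Result<(), String>;

    /// ...Neutron plugs network ports through it...
    fn plug_port(&mut self, instance_id: &str, port: PortSpec) -> Result<(), String>;

    /// ...and Nova spawns instances through it, with no back channels.
    fn spawn_instance(&mut self, spec: InstanceSpec) -> Result<(), String>;
}
</code></pre> <p>(The point is not these particular signatures; it is that volume attachment, port plugging, and instance spawning would all cross one declared border, instead of reaching into nova-compute’s internals.)</p>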
<p><a href="https://review.openstack.org/#/c/411527/">We’re starting a discussion in the OpenStack Architecture Working Group</a> around whether this is creating real problems, and how we can address it.</p> <p>What I think we need to do is build a wall around nova-compute, so we can accurately define what goes in or out, and what belongs specifically in nova-compute’s code base. That way we can accept the things that should live and work permanently inside its borders, versus what should come in through an API port of entry and declare its intentions there.</p> <p>But before we can build that wall, we need nova-compute to declare its independence from Nova. That may be as much a social challenge as a technical one. However, I think once we complete some analysis, and provide a path toward a more sustainable compute service, we’ll end up with a more efficient, less error-prone, more optimizable OpenStack.</p> <p>If you’re interested in this, I recommend you come to the next IRC meeting of the <a href="https://wiki.openstack.org/wiki/Meetings/Arch-WG">Architecture WG</a>, on January 12, 2017.</p> Fri, 16 Dec 2016 00:00:00 +0000 http://fewbar.com/2016/12/mr-nova-build-that-wall/ http://fewbar.com/2016/12/mr-nova-build-that-wall/ openstack architecture nova neutron cinder microservices OpenStack The real newcomers taking your job, since 1961, and still going <p>The recent Presidential and Congressional elections in the US shocked me to the core. I, like many of my closest friends, was certain that the American people would reject Donald Trump and the Republican party’s rhetoric.</p> <p>But the election happened, and since then, I’ve been trying to pay attention to the reasons. I’ve had many conversations with Trump voters, and the old adage proves true: it’s the economy, stupid.</p> <p>But what’s wrong with the economy? For me, a tech worker in California, the last 8 years have been the best of my life. My pay has risen, and my job quality has gone up. This is true of all of my close associates as well. We simply haven’t seen this economy as anything but a boon. Of course, we’ve worked hard and played our cards right. But the timing has never been better for workers in the tech sector.</p> <p>However, I’m not ignorant of the reasons behind this. Why is my salary going up while those of factory workers in Ohio and Michigan are going down?</p> <p>Donald Trump would have you believe it is our trade policies and the lack of a large wall on our southern border. The latter is an absolutely absurd idea on its face, but if you think about it longer, it’s really just a physical manifestation of the frustration of his supporters. They see Mexican and Central American immigrants working, and they think, “They took some US citizen’s job.”</p> <p><a href="http://www.nytimes.com/2016/09/22/us/immigrants-arent-taking-americans-jobs-new-study-finds.html?_r=0">Economists disagree</a>. In fact, immigrants who have crossed the border illegally tend to take service-economy jobs that are low paying and without benefits. Because they live in fear of deportation, they tend not to exercise their labor rights, and as a result tend to have very low job quality. That’s not the kind of job that will “make America great again”. That’s the kind of job that comes and goes over time and leads mostly to a lower-class lifestyle. Those who come legally tend to come on visas to fill labor shortages, despite rhetoric suggesting that companies are somehow abusing the H-1B and other programs.</p> <p>But what about trade policies? Is it simply too easy to make stuff in China, Mexico, or Pakistan, and then import it back to the US?</p> <p>That is a part of it. Those places don’t have the same worker protections and have a lower cost of living, so one would expect that greedy corporations can make more money by reducing manufacturing costs there and giving back a bit of the margin in shipping costs.</p> <p>But many things made in factories require customization. One difficulty in putting production so far from the consumer is that you can only make to stock. Make-to-order with a 10-week lead time is extremely haphazard and unpopular for most products. Many of the products still made in the US are of this kind.</p> <p>Also, many products require skilled labor to produce. While a T-shirt can be sewn by relatively unskilled hands, and an iPhone can be assembled in stages that require minimal training, a wafer of microprocessors must be created in a highly controlled cleanroom, by automation that is overseen by well-trained employees. And certain products are simply so American that it would make no sense to make them anywhere else, such as Wilson footballs, which will likely forever be made in the US unless China decides it wants more concussions and we end up with a Shanghai vs. Dallas Super Bowl in 2035.</p> <p>Also, don’t forget that these countries have now built their own middle class, and will soon run out of cheap labor as well. There are more emerging economies behind them, but the point is, this isn’t a never-ending chain, though it is one that doesn’t end soon.</p> <p>So I would suggest that while globalization is an important factor, it’s been here for a long time, and no recent trade policies have really added to its impact. Those jobs aren’t coming back because our government wills them to. Tariffs on Chinese imports will just result in China putting tariffs on US goods, and soon you’ll find that companies in the US are struggling to grow, because the US economy, while large, is not in fact big enough to sustain itself. Whether or not you agree with the way NAFTA or the TPP were implemented, the economy would experience a huge upheaval without international free trade of some kind.</p> <p>So, globalization took jobs away decades ago. What’s going on now? Why haven’t manufacturing jobs grown with the rest of the economy?</p> <p>Well, I’m sorry to say, but in many cases, I took your job. Not me personally, but my industry has made automation and artificial intelligence a reality. And if you are still being relied upon to make things even after globalization, get ready to have your job threatened again. <a href="http://www.historyofinformation.com/expanded.php?id=4071">In 1961 the first industrial robot, Unimate</a>, took dangerous jobs away from GM factory workers, and since then plenty more robots have been added to the global manufacturing scene. Unimate was an expensive robot to build and operate, and so, by the 1980s, that generation of robots had already taken as many jobs as it was going to take.</p> <p>But lo, a new generation is upon us.
<a href="http://science.howstuffworks.com/baxter-robot3.htm">Robots are on the market right now that cost under $30,000</a>, and will do general purpose tasks with enough flexibility to make things to order. This means that for a capital investment of a low end employee’s salary for a year, a factory can replace a human right here in the US. No more benefits, smaller parking lot, no <em>air conditioning</em> or <em>heating</em>, no cafeteria. And they’ll just need to employ a couple of engineers to keep the whole thing running.</p> <p>So what do we do for those displaced workers as automation happens at this level?</p> <p>Well believe it or not, there are <em>tons</em> of jobs that aren’t getting done because of labor <em>shortages</em>. These jobs aren’t just in computers. They are also civil engineering tasks, environmental engineering, and raw science. These are all things that will want you to have specific training, whether it’s a doctorate degree or some specific training in a particular field.</p> <p>But, you don’t have a college degree, you weren’t trained in one of these fields, and your job is threatened so you’re not going to be able to afford to get one.</p> <p>Well, folks, this is where the recent choice of a Republican Majority government is going to make this hard. The republicans are suggesting that if they let those at the top keep more of their money, they’ll build more factories, and invest in more businesses. But the reality is, that will just enable them to buy more automated infrastructure, and keep even more of their profits. They’ll do this 100% under the protection of the US constitution, and there won’t be anything you can do about it.</p> <p>I know, it sounds like marxism to some, but the answer is to <em>raise</em> taxes on those individuals living far above subsistance and even above comfortable middle class lives. We should then use that money to make college and advanced job training affordable, or even free to those who qualify by their academic achievements. That will get us even more engineers and scientists to actually build the world we want to live in, and also more system administrators, repair technicians, etc. to keep the world running. Most of these are safe jobs, and many of them can be done remotely, so you don’t have to move to a dirty, crowded city to take them.</p> <p>And make no mistake, what I’m arguing for is my own salary to be reduced. If there are more people out there who can do my job, I can expect to make less money. But I’d be happy to have less money, if it meant my kids get to live in a world where everybody has a chance to do what they want with their time, and we have the time to take care of the earth the way it should be done.</p> Mon, 21 Nov 2016 00:00:00 +0000 http://fewbar.com/2016/11/automation-will-change-your-job/ http://fewbar.com/2016/11/automation-will-change-your-job/ automation labor future america Life