FewBar.com - Make It Good This is the personal website of Clint Byrum. Any words on here are expressly Clint's, and not those of whatever employer he has at the moment. Copyright (c) Clint Byrum, 2020 http://fewbar.com/ Tue, 03 Jan 2023 08:14:24 +0000 Jekyll v3.9.2 Clean Out Your Services' Rain Gutters <p>On New Year’s Eve 2022, while many were watching the ball drop in NYC, or fireworks in Las Vegas, I found myself outside in the rain, holding a screw gun and cursing like a sailor, watching my pool fill with silt-laden rain runoff.</p> <p>You see, this New Year’s Eve, we received about three inches of rain here in Los Angeles.</p> <p>To those of you living in more normal climates, this probably sounds pretty mild for December 31. But to those of us who have been living in Los Angeles for the past decade plus, this was unusual to say the least. We are in the midst of a massive drought, almost a decade long now, and rain beyond the occasional sprinkle, even in the winter, just doesn’t happen that often.</p> <p>So, why did I go out in this rain? Well, our house was built in 1961. It has a pitched roof, and rain gutters, with ample drains, but nobody had really looked at those gutters or drains in a long, long time. We have lived here 9 years and had only really looked at them once when we first moved in. We cleared the leaves in them and went about our business enjoying Los Angeles’s notoriously perfect weather.</p> <p>But the reality is, even with great weather, drains still need maintenance. Several of the gutters had filled with silt and detritus from the roof. Heck, we even saw an ivy plant grow out of one corner.</p> <p>And on NYE, one of the downspouts was clearly just not draining at all. 
The corner of it overhangs the pool, and we could hear it dumping into the pool like a hose had been turned on.</p> <p>There was concern that the pool might overflow with this added in-flow, and a flooding pool is a flooded house, so, at 10pm, in the rain, this part-time handyman decided it was time to check it out. This could be bad. What if the drains were clogged elsewhere?</p> <p>So I grabbed my trusty drill-driven drain snake, dressed in waterproof clothing, and stomped out into the rainy night.</p> <p>I stuck my snake in, and a big clump of dirt gave way immediately. Water started rushing down the downspout. Hooray! I moved on to the other downspouts and they were all clear.</p> <p>But the splashing resumed a minute later. It was as if it had never been unclogged.</p> <p>In went the drain snake again, and this time, it went 3 feet in. I jiggled it and pushed and it was stopped, so I began running the drill. It freed a few more feet down, twisting through a few curves. More water rushed through, but then a repeat: it backed up as I was removing the snake.</p> <p>I shoved it back in and ran the drill at full speed, pushing and pulling, trying to remove the clog. It was bigger, and more stubborn. I knew in my head that if a clog is really thick, more than a few inches, this snake will never release it. But I pressed on. It’s raining. I’m out here. It’s NYE and plumbers are gonna charge a fortune. Just Fucking Do It.</p> <p>The drill strained. I pushed too hard, and I failed to notice the wire coil inside the snake starting to bind up. This should trigger a reverse, to ease the tension on the spring inside the snake, but I was too engrossed in pushing. 
In fact, this snake is better with two people, one to watch the coil, and one to push and pull, but my wife was inside watching Miley and Dolly sing, and I was almost done, so why call for help?</p> <p>The snake twisted in on itself, and the wire pulled out, ruining the snake, and making the snake, all 8 feet that were in the drain pipe, as rigid as a steel bar. It was stuck. So now, we had a clogged drain, <em>and</em> a broken snake.</p> <p>Fuck it. I detached the drill, packed up my stuff and hoped that what I had cleared would at least be helpful as the rain slowed later, and we wouldn’t have the pool flood after all.</p> <p>Fast-forward to January 2: the rain had stopped, and I dismantled the drain to remove the broken snake.</p> <p>Inside the drain, I found six feet of what looked like 30-year-old silt and roots, heavily compacted. Water drained through it at about 1mm per minute. I spent 2 hours dismantling and clearing the downspout, and then turned to the drain. The drain was also compacted, who knows how far down. This is a job for professionals, and they’ll be here later in the week to finish the job.</p> <p>This story reminds me of being on call for technical operations. Something alerts, and you open it up, employing your usual playbook of seemingly related remedies. But this time, it doesn’t work. In fact, it makes things worse. You aren’t an expert, just a reasonably capable person with access to tools and pressure to succeed. You may be in over your head, but escalation has psychological and maybe financial barriers, so you keep working on it, thinking you can do it, making things even worse.</p> <p>So, make sure you clean out your services’ rain gutters. Even if there’s a drought, even when you aren’t expecting trouble, just remember why you have those systems. Fake an incident, do a table-top exercise, maybe make sure your backups are actually working. Whatever it is, allocate some time to it while the sun is shining. 
Because you’re much more likely to break something if you wait until a massive rain storm.</p> <p>And also, remember that when there’s extra pressure, that’s the most important time to escalate. It’s raining, it’s a special night, you don’t want to be fixing stuff, and you know others don’t either. But escalate anyway. Call the other senior engineers. Call the plumber. Get the experts on the phone and make sure you don’t make one emergency into two.</p> Mon, 02 Jan 2023 00:00:00 +0000 http://fewbar.com/2023/01/rain-gutters-in-a-drought/ http://fewbar.com/2023/01/rain-gutters-in-a-drought/ maintenance oncall ops incidents Technology Remember, Remember, The Eighth of November <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Remember Remember The Eighth of November When Biden and Harris Beat the party’s worst member Nothing could stop them Their fans in a tizzy The other guy dancing His tiny hands dizzy Not snow Nor rain Not even ID laws Could stop all the people Voting at the polls And then came the counting Tabulating And mapping Steven Kornacki Not even napping Every day Every night We waited for news Of who had won The reds Or the blues First: fucking Florida Then big red Texas Yeah yeah, polling errors That will forever vex us But New Hampshire! That’s something 3 more votes Against Trumping Down south Georgia’s close Could it be? Cross your toes Now W-I then M-I We got them On this try Loss by eight in Ohio That sucks But don't cry though Then two days At stalemate AZ called Oh well mate Wait what’s this? It’s not red? Yeah baby He’s dead LOL CNN MSNBC Too Why can’t you call this We all know it’s blue Ok now, I see It’s not the fault of AP Some dolt fudged a number Biden leads, but we’ll see Too close to call Or early Whatever My brain isn’t working What’s work? 
Let’s watch Trevor STOP THE COUNTING KEEP COUNTING THIS ELECTION’S A FRAUD No wait, KEEP COUNTING I’ll lose if they pause This is building We can feel it Where’s that weight from my shoulders? I remember Last week I was carrying more boulders PA kept counting From Philly to Erie to Tippecanoe They vowed to continue So we could know who And Nev-ah-da No Nev-add-a Whatever It’s blue They’re counting But slow Obese turtles In shoes By Friday it was obvious He’ll definitely concede The paths to victory Are off in the weeds Haha, Just kidding We knew he would fight We all went to bed And slept well that night Lo, the morning, Saturday PA results trickled in With the president tweeting That the fight will begin It’s simple you see Gather round for the taping At the Four Seasons …Total Landscaping Meanwhile on earth The networks admitted Biden had won The results were transmitted Spontaneously America threw a party They danced in the streets While the conference started “This is fraud” Rudy said Just as it released While standing between Dildos and deceased “Says Who?” He replied When asked for response Haven’t we heard that From a friend of Don’s? 
“Oh the NETWORKS” he cried At their borrowed pulpit His followers standing Next to bags of bullshit “All the networks?!” He asked We have to refactor There wasn’t much time But there was a tractor But nobody noticed None even sneered That they had scheduled a presser In a place so weird The votes had been counted Of course, there are more But everyone knew That the Don, was no more We expect that forever The Don will complain It’s rigged I WON, BIGLY I’VE GOT THE BIG BRAIN But Kamala Joe Every Democrat Independents Some R’s Are all done with all that But don’t think we don’t love you We do, so much You’ve taught us we’re broken Now we’re more in touch So Don, thanks a billion I know, this is heavy Enjoy being former XOXO, Covfefe </code></pre></div></div> Tue, 10 Nov 2020 00:00:00 +0000 http://fewbar.com/2020/11/the-eighth-of-november/ http://fewbar.com/2020/11/the-eighth-of-november/ politics Life Gearman: Misunderstood wrangler of herds <p>For a few years now, I have maintained <a href="http://gearman.org/">Gearman</a>.</p> <p>Upon taking over maintainership, I just wanted to get some bug fixes out the door, and revitalize the community which had grown stagnant after the former maintainers were unable to keep things up to date.</p> <p>Since then we’ve shipped a few releases, updated for new compilers and kind of reached a stable place.</p> <p>But one troubling trend I’ve seen is that folks don’t seem to use Gearman for what it’s actually good at. Many are using Gearman for what any messaging or queueing protocol can do.</p> <p>See, the <a href="http://gearman.org/protocol/">Gearman protocol</a> was conceived of by the same folks that thought up the <a href="https://github.com/memcached/memcached/blob/master/doc/protocol.txt">Memcache protocol</a>. The similarities are quite obvious.</p> <p>That’s no coincidence. The original design of memcached quickly met with a troubling problem: The Thundering Herd. 
This is the situation that occurs when one relies solely on caching to reduce expensive operations. Upon expiry, eviction, or genesis of a popular cache entry, many requests for the same value arrive at once, and you must manage the number of expensive regeneration threads spawned in response.</p> <p><img src="/images/thundering-herd.png" alt="thundering herd diagram" /></p> <p>There are a few ways to solve this problem. The most common that you will find is “soft expiry”; you can see implementations such as <a href="https://metacpan.org/pod/Cache::Memcached::Turnstile">turnstile</a> and <a href="https://pypi.org/project/dogpile.cache/">dogpile</a>. This is where one stores a soft expiry time in the cache item. When your caching code sees a soft-expired item, it can spawn an async job to refresh, but still serve the old bit to the user. <a href="https://pypi.org/project/dogpile.cache/">dogpile</a> even implements a lock so that only one thread actually runs to regenerate. There are a few relatively simple ways to make this happen, and a job queue is only one of them.</p> <p>This solves happy-path expiry pretty well; I’ve used it myself. However, it still has trouble with a bunch of normal things, like new-but-popular items, and loss of memcache servers. Basically, if you can’t find even a soft-expired entry in the cache, then every user has to wait while your requests run, and then they must re-read from cache, or somehow get the answer back from the job.</p> <p>This isn’t so far-fetched. Memcached uses LRU eviction. This means that popular items stay in cache until their TTL, and less popular items are evicted when space is needed, even before their hard-expiry time. If you’re running memcached at full capacity, it should be evicting the least popular things. But if you have a very even distribution of popularity, you’ll be evicting things that are just moderately popular. 
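</p> <p>To make the soft-expiry pattern described above concrete, here is a minimal sketch in Python. This is an illustration of the technique, not dogpile’s or turnstile’s actual API: each entry carries a soft deadline, a soft-expired read serves the stale value immediately, and a non-blocking lock ensures only one thread regenerates.</p>

```python
import threading
import time

class SoftExpiryCache:
    """Entries carry a soft deadline; stale reads trigger one refresh."""

    def __init__(self, soft_ttl):
        self.soft_ttl = soft_ttl
        self._data = {}  # key -> (value, soft_deadline)
        # One refresh at a time, cache-wide, for brevity; a real
        # implementation would lock per key.
        self._refresh_lock = threading.Lock()

    def get(self, key, regenerate):
        entry = self._data.get(key)
        if entry is None:
            # Hard miss: every caller has to wait for regeneration.
            value = regenerate(key)
            self._data[key] = (value, time.monotonic() + self.soft_ttl)
            return value
        value, soft_deadline = entry
        if time.monotonic() >= soft_deadline:
            # Soft miss: serve the stale value, but let exactly one
            # thread kick off a background refresh.
            if self._refresh_lock.acquire(blocking=False):
                def refresh():
                    try:
                        fresh = regenerate(key)
                        self._data[key] = (fresh, time.monotonic() + self.soft_ttl)
                    finally:
                        self._refresh_lock.release()
                threading.Thread(target=refresh, daemon=True).start()
        return value
```

<p>Note the hard-miss branch: when there is no entry at all, every caller still waits, which is exactly the gap discussed next.</p> <p>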
Also, many systems will make a new entry popular immediately, thus needing a real way to coordinate the requests to it before it has even been cached in all the places it needs to be.</p> <p>The gearman authors came up with a really cool solution for this, but it’s oft-overlooked. In fact, the original version of the protocol docs I first read <a href="https://github.com/gearman/gearmand/commit/fc3ef0b047602104b1a60ea1da0a93b7d28005b3">didn’t even mention it</a>.</p> <p>Every job submitted to gearman, whether background or foreground, accepts two things. One is called “arguments” or “payload”, and that’s fed to the worker in all <code class="language-plaintext highlighter-rouge">JOB_ASSIGN*</code> packets.</p> <p>The other is a “unique ID”. This is the magic that gets ignored. In most libraries, if you don’t pass a unique ID for your request, the library will just generate a UUID for you.</p> <p>Most libraries include no explanation as to why this is there, and it’s often invisible to users. There’s a clue in the original implementation though. You’ll note that the unique ID is <em>not</em> included in the original <code class="language-plaintext highlighter-rouge">JOB_ASSIGN</code> packet sent to workers. That’s because this was always meant to be information for the <em>gearman</em> service. Only later did they realize that many jobs were just submitting the same unique ID as the payload, and added in the <code class="language-plaintext highlighter-rouge">GRAB_JOB_UNIQ</code> and <code class="language-plaintext highlighter-rouge">JOB_ASSIGN_UNIQ</code> versions.</p> <p>But generally you <em>should</em> have an ID. It’s not always a record ID, but it’s generally going to be “whatever you’d use to access cache entries”. It’s a unique string for this work. Sure, sometimes you have something obvious, like “account ID”, or sometimes you have free-form arguments, like a list of search terms. Whatever it is, you need to send this to the gearman server. 
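</p> <p>Concretely, “whatever you’d use to access cache entries” can be reduced to one short, stable string. A sketch (illustrative; the function name here is mine, not any particular client library’s API):</p>

```python
import hashlib

def unique_for(function, args):
    """Derive a stable unique ID for a job: the same function and
    arguments always yield the same ID, so identical in-flight
    requests can be coalesced by the gearman server. Hashing also
    reduces arbitrarily long, free-form arguments (like a list of
    search terms) to a short, fixed-length key."""
    raw = function + "\x00" + args
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

# Two clients asking for the same search derive the same unique ID,
# which is what you pass as the unique field of your job submission.
uid = unique_for("search", "red shoes size 9")
```

<p>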
You <em>also</em> should be hashing the unique ID, and choosing a gearman server to send to based on that hash, just like memcache clients do for cache keys. There’s a very good reason for this that we’ll get to later, but the most important thing is that you send a real, useful unique ID for every job you send to the server.</p> <p>See, when your request arrives in a well-behaved gearman server (some don’t do this properly), it will first look to see if it is already aware of this ID. If your job has asked for an ID that already exists, gearman will feed back the same job handle that’s already in the queue to the client. For foreground jobs, gearmand will multiplex the <code class="language-plaintext highlighter-rouge">WORK_*</code> packets that come back from the worker <em>to all submitters</em>. It’s like a distributed thread join.</p> <p><img src="/images/gearman-coalescing.png" alt="only one thread per miss" /></p> <p>This means that you have a guarantee that the number of concurrent active work units is deterministic:</p> <p><code class="language-plaintext highlighter-rouge">n_active = min(n_uniques * n_gearman_servers, n_workers)</code></p> <p>And actually, it’s better than that. If you followed the earlier advice to make your gearman client hash uniques and send submissions to the proper gearman server based on that hash, it’s just:</p> <p><code class="language-plaintext highlighter-rouge">n_active = min(n_uniques, n_workers)</code></p> <p>This means you can go look at your functions, and calculate the exact theoretical maximum load your backend resources will experience, and bound auto-scaling for your workers to the exact capability of your backend. The math for capacity planning on those resources gets a lot simpler when you know the upper bounds of any of the variables deterministically. 
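</p> <p>The hashing half of that advice looks like this on the client side. This is a toy modulo scheme for illustration; a production client would likely want consistent hashing so that adding or removing a server remaps as few uniques as possible:</p>

```python
import hashlib

def pick_server(servers, unique):
    """Deterministically map a job's unique ID to one gearman server.
    Every client that hashes the same way sends a given unique to the
    same server, so coalescing happens in exactly one place."""
    digest = hashlib.md5(unique.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

servers = ["gm1:4730", "gm2:4730", "gm3:4730"]
target = pick_server(servers, "account:42")
```

<p>With submissions partitioned this way, a given unique is only ever live on one server, which is what collapses the bound from <code class="language-plaintext highlighter-rouge">n_uniques * n_gearman_servers</code> down to <code class="language-plaintext highlighter-rouge">n_uniques</code>.</p> <p>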
You can even dig through your cache hit/miss logs and find a good over-subscription scheme to adopt, and this calculus is much simpler when one of the variables is fixed.</p> <p>Any old message queue can do the async work, but I haven’t seen any that can do a distributed thread join like gearman does.</p> <p><img src="/images/gearman-magic.png" alt="distributed thread join" /></p> <p>This magic hasn’t lost value because of the sands of time. We are still writing apps that need to scale reads, even if we’re sharding our databases. In fact, response time requirements have gotten <em>more extreme</em>, not less. Being able to build backends that respond rapidly and are scaled to meet exact demand efficiently is critical in building scalable, responsive APIs. Gearman has other tricks to help with user interaction too, such as the <code class="language-plaintext highlighter-rouge">WORK_STATUS</code> packets, which send back a numerator/denominator pair, or <code class="language-plaintext highlighter-rouge">WORK_DATA</code>, so you can send partial responses while the final result is calculated. That’s perfect for sending a stale cache entry back immediately, then updating with the final result and sending it with a <code class="language-plaintext highlighter-rouge">WORK_COMPLETE</code>.</p> <p>Unfortunately, one reason this was lost is that when the Perl gearman client was rewritten in C, hashing the unique ID to pick a server for <code class="language-plaintext highlighter-rouge">SUBMIT_JOB*</code> packets was lost. And if you look, all of the language client libraries written since have failed to do it, as most are just either using libgearman, or copying it. I include the recent rustygear library I wrote in this indictment. It’s not easy, but it is such an awesome thing that it’s worth doing. 
I hope to find some time to add this into rustygear, and maybe even, with time, replace libgearman with rustygear so other languages can use it.</p> <p>Another reason I think that Gearman has lost momentum is that HTTP has gotten a lot smarter. You don’t need a special TCP protocol to achieve the level of responsiveness required to get this job done. I’ll be looking into a way to make the Gearman protocol ride on HTTP/2 and WebSocket, so that folks can write workers and clients without needing a native gearman library.</p> <p>I hope you see a solution to your problems in this, even if it’s not deploying gearmand and rewriting things as gearman workers. And if you think Gearman <em>is</em> the right solution for you, I’ll invite you to please join <a href="https://groups.google.com/d/forum/gearman">the gearman google group</a>, and/or <a href="mailto:clint@fewbar.com">reach out to me</a> and chat with us about that.</p> <p>Happy Hacking</p> Tue, 21 Apr 2020 00:00:00 +0000 http://fewbar.com/2020/04/gearman-is-misunderstood/ http://fewbar.com/2020/04/gearman-is-misunderstood/ gearman memcache elegance Scalability A bit of deep analysis <p>In case you missed it, <a href="https://www.linkedin.com/feed/update/urn:li:activity:6650805910325338112/">I’m on the job market</a>. As I’ve been winding through the various opportunities out there, I found a company that wants a sample of “analysis” as part of their application.</p> <p>I’ll be honest, I sat in front of this section for an hour without coming up with anything. I wrote down some of the times I’d done deep analysis that I could remember, but none of them really impress me. Now, you may be reading this thinking “heh, yeah, cause this guy doesn’t know how to analyze”. 
But to be honest, I think for me, this question is like asking <a href="https://en.wikipedia.org/wiki/Greg_Maddux">Greg Maddux</a> how he threw his best fastball, or asking <a href="https://en.wikipedia.org/wiki/Mick_Jagger">Sir Mick Jagger</a> how he performed his best chicken dance, or a great white shark how it ate its most fish. They could all probably tell you more about doing taxes. They’re just being themselves; it’s what makes them great.</p> <p>Analyzing systems has always been very natural for me. I mean always. I was that kid pulling apart the VCR to see what was inside, and then putting it back together successfully… sometimes. I truly do not remember what I did there, but look ma, the whining noise is gone when we rewind!</p> <p>So, that said, I’ll not give you the best analysis I’ve done, because I just can’t pick one of the thousands. Instead, I’ll give you the very first one I was paid for. A brief story.</p> <p>The year is 1997, and the protagonist is a recent graduate of high school. For the past 3 months, our hero has been employed doing his dream: playing with computers. As a contrarian, this rare Linux-using teenager has decided to go right into the IT industry and skip that silly college thing (yes, this teenager was as dumb as a teenager, and twice as stubborn). And they even have an HP-9000, a <em>real</em> unix box, to play with.</p> <p>The developers are all perplexed on this day, however. The <em>literal green screen</em> application that runs the entire company is crashing, randomly. All hands on deck, the 4 developers are furiously examining code and records in the database to find out what is causing this problem. Our hero loves code, and has chatted with the devs a few times about what they do. 
This weird system they work in is called “PickDB”, and the app is in “PickBASIC”, and it’s really simple, so, Sir Paperbox-carrying PC-network-card-reseating 18 year old has been peeking at the manual and playing on his own green-screen to learn it in his spare time.</p> <p>Alas, the problem is vexing. This particular firm is a medical equipment manufacturer, and billing agents are seeing, randomly, that some records they pull up simply corrupt their entire screen view. The cursor flies around, weird matrix-like (oh wait, there’s no Matrix yet) characters print, and generally they have to have a server-side reset to continue (kill their shell, the database vendor explains, and our hero already knew how to do that and showed the devs so they could self-service).</p> <p>Well, the devs are fighting through it, so he sits down to peek at the database as well. They have a list of records known to cause the problem, so he pulls one up with the usual method: SELECT foo WHERE id = ‘999’ or something like that; honestly, it wasn’t SQL, but it was the SQL of PickDB. Behold! A perfect record is printed, line by line. One by one, all of the records look pretty normal, though you can see where the longer records had incomplete words.</p> <p>The vigor of this young man will not be stopped though, and he wonders whether, if he just writes a program to print out the records, it will work differently. He writes the program, runs it, and there, in front of him, is a broken terminal.</p> <p>So, he kills his session and tries turning the characters into their ASCII numbers. Oh noes, the program crashes HARD. What’s this then? The built-in function that turns characters into numbers has printed a runtime error and halted his program. “High-bit char detected.”</p> <p>Fast-forward to an hour later, where this 18 year old is explaining to these 40 year olds how to fix the problem. See, PickDB and PickBASIC don’t actually exist anymore. 
It was originally run on IBM mini-computers, but those are expensive, so some vendors had shown up to swoop in and replace it with cheaper emulators that run on cheaper HP-9000s. This one was called “UniData”, and it worked flawlessly.</p> <p>However, when this company had converted from Pick to UniData, the vendor had converted most of the data in some black-box process. But now, 4 years later, the database was purged of most of those records, and everything in it was newly entered in the green screens, which were serially wired to a serial-to-telnet gateway via CAT-3 cables wired into RS-232 connectors. So, one day, 4 years ago, the vendor had arrived, taken the old IBM system out, put in the HP-9000, taken all of those RS-232 patches out of the IBM MUX, and put them into an HP terminal concentrator.</p> <p>Meanwhile, over 4 years, the company had grown, and more and more green-screens were purchased and wired up in the same fashion. Reorgs shifted people around, and terminal connections stayed where they were, with users just logging in to wherever they sat.</p> <p>And so, what had happened? Last week, a reorg had sent accounting into the space used by medical bill processing, and vice-versa. This meant that records were now being entered in the space that had been newly built out for accounting 2 years earlier. Accounting didn’t do much data entry, but medical billing was literally just taking doctors’ patient forms on paper, and entering them into forms on green screens. And so, for the past week, new terminals had been used to enter all of the new records.</p> <p>But the vendor had forgotten one thing. They forgot to tell the person they trained on how to wire new connections to log in to the terminal concentrator and change it to use XON/XOFF, 7-bit, for green-screens. The default was RTS/CTS hardware flow control. 
The details of this dark and arcane protocol are lost to my brain, as it has been over 20 years, but basically, when people typed fast, the concentrator would set a pin on the RS-232 to 1 instead of 0 to say “STOP SENDING ME STUFF MAH BUFFER’S GONNA BURST”, but the terminal would keep sending stuff, and when the pin went back to 0, the concentrator would start reading again, and binary hilarity ensued.</p> <p>This realization came about because our hero had been playing with serial connections and modems since he was 12. He knew what RTS/CTS and XON/XOFF were, sorta, and had read the manuals for both the green-screens and the terminal concentrator while debugging this problem. Once the high-bit chars error was seen, the real problem was unmasked. And it was a short process to go make a well-behaved terminal suddenly become a garbage-data-generator.</p> <p>Funny enough, the SELECT program from UniData had a filter which would remove high-bit chars, which is why it was so hard to find these nasty chars. Even worse, this DB used the 8-bit int value of 253 to mean “field delimiter”, and 254 as “record delimiter”, so if you got REALLY unlucky, your keystrokes would split a record or a field in two! The devs just hadn’t thought that the built-in program would lie to them! By writing his own program, our hero had satisfied the X-Files rule of debugging: Trust No One. After reading the manuals he had then applied the same principle to the terminal concentrator configs: why trust one port over another?</p> <p>In closing, this analysis also led to an amazing discovery that changed the lives of the data-entry folks for a brief time. In reading the manual for the port concentrator, he discovered that the ports were all set to 19200bps, which was OK, but that the port concentrator could actually go much, much faster if it could use RTS/CTS. 
And all of the new terminals in the data-entry area were WYSE-60+, not WYSE-50, which meant they actually could be configured for RTS/CTS and the much higher data rate. As a result, screen paints were 10x faster on these models, and soon they became coveted treasures in the company.</p> <p>FIN</p> Thu, 02 Apr 2020 00:00:00 +0000 http://fewbar.com/2020/04/a-bit-of-analysis/ http://fewbar.com/2020/04/a-bit-of-analysis/ terminals unix pick joy Debugging YAML shyaml - Wrote a tool about it, here it goes <p>I’ll keep this brief and on point: <a href="https://fewbar.com/2017/01/a-love-letter-to-rust/">I love Rust</a> and yet, I have published so little <a href="https://github.com/SpamapS/rustygear">rust code</a>, and as yet, none that actually saw real usage.</p> <p>Well, today that changes. With many thanks to my former employer, <a href="https://www.godaddy.com/">GoDaddy</a>, for assigning copyright of a work product to me, I am pleased to announce the publishing of <a href="https://crates.io/crates/shyaml">shyaml</a>.</p> <p>When I joined GoDaddy, among other duties, I was tasked with keeping their “Legacy” Kubernetes installation alive and running. The team that had built it were all gone, and I was fairly new to Kubernetes, so I had to tread lightly.</p> <p>As a result, I found myself constantly wishing I could see what <code class="language-plaintext highlighter-rouge">kubectl</code> was about to do. I was shocked that <code class="language-plaintext highlighter-rouge">kubectl</code> didn’t have a diff mode. In retrospect, I probably should have just learned Go, and added a –diff switch to <code class="language-plaintext highlighter-rouge">kubectl</code>. But, I fell back on the tool I know, and I had just written <code class="language-plaintext highlighter-rouge">shyaml</code> to get my rust chops up. It was just a general CLI manipulation tool for YAML and JSON at the time. 
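</p> <p>The heart of such a structural diff is small once both documents are parsed into plain data structures. A rough sketch in Python (shyaml itself is Rust; this illustrates the idea, not its actual implementation, and compares lists wholesale for brevity):</p>

```python
def diff(old, new, path=""):
    """Recursively yield (path, old_value, new_value) for every leaf
    where two parsed YAML/JSON documents differ. Nested dicts are
    walked key by key; everything else is compared as a whole."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new)):
            yield from diff(old.get(key), new.get(key), path + "." + str(key))
    elif old != new:
        yield (path, old, new)

# e.g. the parsed form of a live object vs. a proposed manifest:
deployed = {"spec": {"replicas": 2, "image": "app:v1"}}
proposed = {"spec": {"replicas": 3, "image": "app:v1"}}
changes = list(diff(deployed, proposed))
# changes -> [(".spec.replicas", 2, 3)]
```

<p>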
Adding the ability to intelligently diff Kubernetes objects was pretty straightforward.</p> <p>So I added a <code class="language-plaintext highlighter-rouge">kubediff</code> command, which tries to show you the real effect a <code class="language-plaintext highlighter-rouge">kubectl apply</code> command will have before you run it. It’s not perfect, but it generally works.</p> <p>And so, I give this tool to the world. I hope to make it even more general purpose, as time permits. But until then, please enjoy the code, and feel free to report issues and open PRs. Also, join us on the <a href="https://toolsforhumans.slack.com">Tools for Humans Slack</a> if you want to chat about it.</p> Mon, 29 Oct 2018 00:00:00 +0000 http://fewbar.com/2018/10/yaml-shyaml/ http://fewbar.com/2018/10/yaml-shyaml/ rust yaml kubernetes Rust The Build is Never Broken <p>When I joined the OpenStack Engineering team at GoDaddy, I want to make it clear, things were not completely terrible. In fact, there were some really amazing things happening.</p> <p>Most of the things they needed to do over and over were in code, and 99% of that was in a single chain of git repositories that the entire team was aware of and participating in. There was CI in various forms running against PRs opened on repos, and building things after stuff landed. In general, most obvious mistakes didn’t waste the team’s time.</p> <p>However, there was one really frustrating thing that nobody seemed to be able to really solve: there was no realistic dynamic development environment. So many repos, so many components; it just wasn’t feasible to scale it down entirely. 
If you had an idea of how something needed to get done, you had a few choices, none of which were fast or fun:</p> <p>1) You could write a patch that looked kind of right, propose it against the right git repo, get it through review and landed, wait for the artifact builds to produce the needed bits, deploy them to the “dev” environment, and then iterate on that until dev worked the way you wanted, promoting the artifacts to stage and eventually production.</p> <p>2) You could try to edit the <code class="language-plaintext highlighter-rouge">dev</code> environment directly, and reverse engineer what you did back into a patch, and then do step (1) but with at least a bit more confidence.</p> <p>3) Cry into your pillow. Go to step 1.</p> <p>Now, you may be asking “why not Vagrant?”, and I’d say that too, except I have almost never had any success with Vagrant or similar. Because it’s so different from production deployments, the Vagrant build is almost always broken a little bit in some ways. Also, really, because there are so many of these type (1) changes in-flight, oftentimes <code class="language-plaintext highlighter-rouge">master</code> is a complete shambles, and you end up needing to either rewind to the last known good deploy, or pull somebody else’s branch that has a fix in-flight. It’s entirely possible to have a great local-dev or even cloud-based Vagrant or Vagrant-like tool. But the effort to build and maintain it is pretty large, and if the only benefit you get is local dev, you have just traded one velocity problem for others, such as change delivery success rate and transparency.</p> <p>Does this sound familiar? After joining a few DevOps-ish teams over the last few years, it’s a common pattern. This isn’t a tools problem. The team was sort of 95% Puppet, and 5% Ansible, and that 5% was no better or worse in this respect. 
No, we didn’t have to wait for RPMs to be built for Ansible, but because we also could just iterate on our laptops, many in-flight Ansible changes were actually more dangerous than Puppet changes, because they would often be run on production and then just forgotten, never re-incorporated into the git repos.</p> <p>That’s not because the team didn’t want to be better tested, nor was it because they lacked the capability. They had some pretty heavy-hitting Jenkins talent at their disposal to get anything Jenkins could do done. And it’s worth noting that there was an attempt to build a “dev on demand” set of Ansible playbooks to get this done. However, as noted above, this might not actually have resulted in a net-positive win.</p> <p>The problem that many teams face is the same one that my co-workers faced then: The build was often broken and incomplete, whether that was detected or not, because the tools we have for testing are not flexible where they need to be, and as a result, they’re not able to be strong when they need to be.</p> <p>Solving this problem is the crux of Zuul’s very existence. Adopting Zuul means any team, whether you’re 3 people working on a single repo, or 300 people working across 25 repos, can reap the benefits of an always-deliverable set of repositories and branches.</p> <p>So how does Zuul do this, and why can’t other systems get this done? Well, the answer starts with git.</p> <p>If you’re not familiar with git, um… it’s 2018, please spend the next hour learning what it is, and then 5 minutes asking why on earth your build engineers haven’t switched.</p> <p>Most tools try very hard to be somewhat git-stupid. Jenkins comes to mind here. While it has plugins that let it listen for <code class="language-plaintext highlighter-rouge">GitHub</code> webhooks and other change management systems, it really doesn’t want to be too aware of git.
Git is relegated to the same status as <code class="language-plaintext highlighter-rouge">curl</code>: it’s just the way you fetch the code to do things with.</p> <p>But that’s really not the case. This single-mindedness around git is often where tools sort of give up. Implementers of CI tools seem reluctant to think about what happens after a git commit lands.</p> <p>But Zuul came from a place where there were very large penalties for landing broken code. OpenStack had an extremely wide scope, and as such, many developers were showing up with code and integrations between the various projects. Having “the build broken” for 2,000 people shines a very bright light on just how important it is that tests work, and that code does what reviewers and coders believe it does.</p> <p>Because, you see, when you land things in git, they become part of a timeline that did not exist when others pulled the code. And when those others land things in their local git repository, and make them work with their local dev tools, there’s nothing to say that what’s in master will keep working when integrated with those unfinished changes. Add in dependent repos and changes, and the certainty of something that worked today continuing to work if landed tomorrow is very low. It’s all a big, messy, eventually consistent, distributed system.</p> <p>But it doesn’t have to be broken all the time, and we don’t have to rebase on master every time it changes “just in case”. If we think of a set of git repositories and/or branches as a unit to be tested together, it’s a very short bridge to fully testing things together in the form that they will be made available to others, before they are made available to others.
Just like you don’t send things to production without first testing them in stage exactly as they are, you shouldn’t send code to your developers without having first had it tested exactly as it will land.</p> <p>And once you decide that’s a good thing to do, other questions pop into your head. What about multi-node testing of distributed systems? How can I make it go faster? These are questions that the creators of Zuul faced as well, and solved with simple, straightforward answers that have been proven valuable through the years of OpenStack running its entire development infrastructure on them.</p> <p>So, at GoDaddy, when it came time to reboot our OpenStack installations a bit, I couldn’t think of a better way to leverage Zuul’s power than to start with a 5-VM job to deploy a mini-cloud onto and run some tests against.</p> <p>I’ll be honest, I gave this initiative a 50% chance of working. The deadlines were tight, and the political environment around the products being supported is a bit stressful. I fully believed that somebody would look at what we were doing and pull the plug in favor of some other more widely accepted tool or more established pattern.</p> <p>Luckily, nobody had time to say no to Zuul. So we built this 5-VM job up to the point where it deployed an OpenStack control plane the way we wanted to deploy in production. Our first few deploys to the POC environment found all the ways that a mini-cloud is different from a real one, but we had an enormous amount of momentum built up behind Zuul and this job, and Zuul mostly seemed to be getting out of people’s way at the right time. We had our “dev on demand”, and even better, we had a single, transparent way to land changes.</p> <p>Since then we’ve been pushing changes at a pretty impressive clip, and it is quite rare for a change to break production. We do get quite a bit of coverage for our automation from the various Zuul jobs that run before changes land.
What still slips through are mostly corner cases, such as scale problems, mistyped config details, or scheduling issues where the hardware actually matters.</p> <p>But we have devised a scheme to leverage Zuul for these issues as well, by using it to kick off deploys to our staging environment and promote changes from there to production only after automated tests have run.</p> <p>So how does Zuul do this, and why does it leverage Ansible?</p> <p>First and foremost, by being fully git-aware, Zuul lets an engineer effectively build a new future in a set of git repositories and branches. If this new future must span multiple repos, Zuul offers the engineer the ability to specify dependencies on other changes.</p> <p>So, if you need to propose a new variable for a role that is shared amongst automation concerns, and then depend on that variable for yours, you can add <code class="language-plaintext highlighter-rouge">Depends-On: https://your.github/roles_org/role_repo/pulls/1234</code> to the commit message of your change. While Zuul is building working directories to run jobs in, it will see this dependency, and use the branch/PR/etc. that you have submitted as its basis for pre-merge testing. And when all reviewers are happy with your dependent change, it won’t land until the upstream dependency lands too.</p> <p>So now you can build an entire future without having landed risky code in master, and without having to wait for your upstream dependencies. This even works if the upstream project doesn’t use Zuul. As long as you can give Zuul credentials to inspect the change management system and pull the necessary commits, it will be able to build a speculative future from them, and it won’t accidentally land your dependent change until the upstream change is merged.</p> <p>This even works across Gerrit, GitHub, and GitHub Enterprise, and support for other systems is in development.
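To make the <code class="language-plaintext highlighter-rouge">Depends-On</code> footer concrete, here is roughly what such a commit message looks like (the subject, body, variable name, and URL below are invented for illustration):

```text
Use the new tls_cert_path variable in the deploy playbook

The shared role only grows this variable in the change referenced
below, so this change must not land before it does.

Depends-On: https://your.github/roles_org/role_repo/pulls/1234
```

Zuul reads that footer while constructing its working directories, so the job runs against the combined, speculative state of both repos.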
This is nice because if, say, you have GHE internally and depend on upstream GitHub projects, you can encourage your engineers to submit code upstream. And if you really want to get involved deeply with that upstream, you can even set up your Zuul as a GitHub app, and report statuses on those repositories, which should help them avoid merging code that breaks you.</p> <p>Now that you’ve embraced Zuul’s future-building capabilities, you’ll want to start expanding coverage. Zuul has all of what you’d probably expect for doing the easy stuff: linters, unit tests, etc. But what about full integration tests like our “build a mini-cloud” job we talked about before?</p> <p>For this, Zuul is going to need a cloud. Current releases only support OpenStack clouds, or static pools of SSH/PowerShell accessible machines. However, there are several public cloud drivers, such as AWS and GCE, in various stages of quality. The AWS driver in particular is close to being ready.</p> <p>Once you have given Zuul cloud resources via its sub-component named <code class="language-plaintext highlighter-rouge">nodepool</code>, you can start attaching nodesets to your jobs. These can be backed by various flavors, images, and configuration details, and given names and Ansible groups for use in playbooks. Typically you’ll even have a default nodeset in your base job that everything parents to.</p> <p>At GoDaddy we have allocated 75 “large” instances (8GB RAM, 120GB of disk) on one of our less busy private OpenStack clouds for running tests. We have also defined a custom image using <code class="language-plaintext highlighter-rouge">nodepool-builder</code>, which gets rebuilt every 12 hours to pull in the latest apt and pip packages that our jobs will need. That way our job runtimes don’t get too long with downloading and extracting new copies of things.
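As a sketch of what that nodeset wiring looks like, a nodeset and a job that uses it might be defined like this in Zuul’s YAML (the names, labels, and playbook path here are invented for illustration, not taken from our real config):

```yaml
# Hypothetical example only.
- nodeset:
    name: five-node-mini-cloud
    nodes:
      - name: controller
        label: large-node
      - name: compute-1
        label: large-node
      - name: compute-2
        label: large-node
    groups:
      - name: compute
        nodes:
          - compute-1
          - compute-2

- job:
    name: deploy-mini-cloud
    nodeset: five-node-mini-cloud
    run: playbooks/deploy-mini-cloud.yaml
```

The group names become Ansible inventory groups, so the playbook can target the compute nodes as a unit.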
We also ask <code class="language-plaintext highlighter-rouge">nodepool</code> to keep 15 nodes running at all times, so that any 5-node job will be able to start <em>immediately</em> and not have to wait for the cloud to spin up VMs. This also tends to smooth out problems with cloud control planes, which we do experience from time to time.</p> <p>Alright, so now we have compute resources, optimized images, and git repos plugged into Zuul. What’s next? We need to define jobs.</p> <p>At GoDaddy we have some repos that have just one <code class="language-plaintext highlighter-rouge">noop</code> job on them. This still has a benefit, as the repo may be housing Zuul configuration that needs to be validated before changes are landed. We also have our busiest repo, which we call <code class="language-plaintext highlighter-rouge">openstack-deploy</code>, which runs between 3 and 9 jobs on every PR, and 3-4 jobs in the gate. The number varies because sometimes we use Zuul’s ability to skip jobs by filename. So, for instance, we don’t need to run the big, long <code class="language-plaintext highlighter-rouge">kolla-ansible</code> job, which deploys a mini-cloud, if the change wholly consists of configuration details for our production clouds.</p> <p>One really interesting aspect of working with Zuul is when you start to have interdependencies between repos. We have a repo which houses patches that we apply to the upstream OpenStack deployment tool named <code class="language-plaintext highlighter-rouge">kolla-ansible</code>. Whenever we run any deployment of kolla-ansible, we apply these patches on top, and generate the configuration for <code class="language-plaintext highlighter-rouge">kolla-ansible</code> on top of that.
This means that if we update the patch repo with something that won’t deploy right, we could end up with a broken master again.</p> <p>But luckily, Zuul was built with this scenario in mind, and as such allows us to have two repos run the same job in a shared queue. That means that if I propose a patch to <code class="language-plaintext highlighter-rouge">openstack-patches</code>, and an update to <code class="language-plaintext highlighter-rouge">openstack-deploy</code> that seems unrelated, they’ll be tested together, one landing before the next using Zuul’s speculative execution capabilities. If we do this right, it means we can’t land broken code in either repo.</p> <p>Finally, for those times when you just can’t figure out why a job is failing, there’s the “auto hold”. Zuul can hold on to test nodes after a job fails if you inform it that you need it to. This allows an engineer to log in to the test nodes and poke around, even to try running the test again with modified code. Many of our biggest refactors happened on held test nodes, where an engineer would fiddle with things on those VMs, and then pull the changes back down and submit a fixed patch over the course of a few days.</p> <p>So, from a cultural perspective, how did having these new capabilities affect team productivity?</p> <p>First and foremost, there’s a good chance that if we didn’t have Zuul, we would have crumbled under the pressure of a very tight deadline. With many, many changes in flight and moving rapidly, it would have been an absolute momentum killer to have to stop and fix broken builds every day.
Furthermore, by being able to let Zuul spin up mini-clouds, we gave our developers a ‘fire and forget’ mechanism for testing their changes in parallel, isolated from each other while the changes were in chaos.</p> <p>Second, we found that members of the team were able to find information about how things work faster, because everything, including the configuration for the actual tests, is stored in git trees. It really does help when you’re doing a <code class="language-plaintext highlighter-rouge">git annotate</code> on a file and the change behind a line that is confusing you turns out to be accompanied by edits to the testing configuration. This is especially helpful in root-cause analysis, where you are trying to match timelines from multiple sources together to find when a change may have been made that resulted in an incident.</p> <p>Finally, Zuul was actually only half of the story. Another big part of it was that because Zuul was just running Ansible, we were able to leverage that for other tasks. We don’t actually run our Ansible against production using Zuul. Instead, we run the same playbook that Zuul does with our super duper chat bot named Padre. I very much hope that we can open source Padre soon, and present it at a future AnsibleFest.
But ultimately, Zuul was relatively straightforward to adopt because we could use the tool we already knew, Ansible; and it was also easy to break back out of Zuul when we needed to, for the same reason.</p> <p>So what should you do if you are interested?</p> <ul> <li> <p>Attend Ricardo Carillo Cruz’s deep dive into Zuul and Ansible Networking at 3:00pm.</p> </li> <li> <p>Come talk to us at the Zuul booth!</p> </li> <li> <p>Deploy Zuul – I’m not going to lie, this isn’t easy, and it doesn’t make a ton of sense unless you have your code in GitHub, GitHub Enterprise, or Gerrit.</p> </li> </ul> Tue, 02 Oct 2018 00:00:00 +0000 http://fewbar.com/2018/10/the-build-is-never-broken/ http://fewbar.com/2018/10/the-build-is-never-broken/ zuul cicd automation development vcs git Technology Trump gonna Trump - Time to move forward <p>I usually don’t post much about politics here, but this week our President was <a href="https://www.washingtonpost.com/politics/a-time-magazine-with-trump-on-the-cover-hangs-in-his-golf-clubs-its-fake/2017/06/27/0adf96de-5850-11e7-ba90-f5875b7d1876_story.html">caught looking pretty silly</a>, and I feel like it’s worth commenting on.</p> <p><img src="/images/trumpcover.jpg" alt="Fake Time Cover" /></p> <h6 id="source-washington-post">Source: Washington Post</h6> <p>Just do the thought experiment: Imagine if Hillary Clinton or Barack Obama or John Kasich made something like this and put it up anywhere. Those who thought highly of them would think it was a joke. Any of them would likely come out immediately and explain that it was in fact a joke, and most of us would have a laugh at it. It’s so out of character, it would just be laughable to anyone who looks at those individuals as good people.</p> <p>But of course, no matter what they did, those who already thought negatively of them would call it narcissism of the highest form and decry them as purveyors of fake news.
This would blow over pretty quickly, because none of these people have been pointing fingers at their critics and calling them fake news. But it would certainly serve to enrage their detractors.</p> <p>The reason this doesn’t bother The <a href="http://time.com/4523972/donald-trumps-comment-root-sexual-violence/">45th President of the United States</a>’s supporters is that <em>they already accepted that he has no integrity</em>. They see this and just go “This guy. What will he do next?” But if you’re holding his feet to the fire, it just tickles that healthy confirmation bias that he is the end of our democratic traditions and a truly terrifying individual who is burning down the reputation of the presidency one <a href="https://www.nytimes.com/2017/05/31/us/politics/covfefe-trump-twitter.html">stupid tweet</a> at a time.</p> <p>Usually we have a fringe of people who treat the president this way. Either it’s the far left protesting Bush 43 over Iraq and spending cuts, or the far right protesting Obama over immigration and climate change policies. Grow a thick skin, Mr. President, whoever you are.</p> <p>But now we have <a href="https://projects.fivethirtyeight.com/trump-approval-ratings/">a large majority of moderates joining the far left in opposing this president</a>. That only serves to entrench his supporters more. They see themselves as outsiders who finally got their guy in. So don’t be surprised when his supporters laugh off these stunts. They don’t care, because they just want you to acknowledge that they won and you lost, and he cares about them and their issues. They don’t mind that he’s embarrassing. As long as he’s sticking it to you, the elites and their supporters who have been sticking it to them for years.</p> <p>IMO, just ignore all of that. We’ve made our peace with this man, and now it’s time to get down to fixing the damage and building a better society that is more resilient to these problems.</p> <p>How?
We start by taking time to support strong leaders who will restore the respect and dignity of the office and the nation. <a href="https://mayday.us/">Support legislators who will amend the constitution</a> to reverse the influence corporations have over elections, so we aren’t stuck with fundraisers and reality TV stars as our only choices.</p> <p>And finally, when our friends, family, and neighbors who supported Trump come back in 2020 like a screaming horde of vikings intent on burning down the last vestiges of civil society: don’t hate them. Don’t belittle them. Don’t even shout at them. Just be ready with the only real defense any of us have: our vote.</p> Sat, 01 Jul 2017 00:00:00 +0000 http://fewbar.com/2017/07/trump-gonna-trump/ http://fewbar.com/2017/07/trump-gonna-trump/ politics Life Free and Open Source Leaders -- You need a President <p>Recently I was lucky enough to be invited to attend the <a href="http://events.linuxfoundation.org/events/open-source-leadership-summit">Linux Foundation Open Source Leadership Summit</a>. The event was stacked with many of the people I consider mentors, friends, and definitely leaders in the various Open Source and Free Software communities that I participate in.</p> <p>I was able to observe the <a href="https://www.cncf.io/">CNCF</a> Technical Oversight Committee meeting while there, and was impressed at the way they worked toward consensus where possible. It reminded me of the <a href="https://www.openstack.org/foundation/tech-committee/">OpenStack Technical Committee</a> in its makeup of well-spoken technical individuals who care about their users and stand up for the technical excellence of their foundations’ activities.</p> <p>But it struck me (and several other attendees) that this consensus building has limitations.
<a href="https://twitter.com/adamhjk">Adam Jacob</a> noted that Linus Torvalds had given an interview on stage earlier in the day where he said that most of his role was to listen closely for a time to differing opinions, but then stop them when it was clear there was no consensus, and select one that he felt was technically excellent, and move on. Linus, being the founder of Linux and the benevolent dictator of the project for its lifetime thus far, has earned this moral authority.</p> <p>However, unlike Linux, many of the modern foundation-fostered projects lack an executive branch. The structure we see for governance is centered around ensuring that corporations which want to sponsor and rely on development have influence. Foundation members pay dues to get various levels of board seats or corporate access to events and data. And this is a good thing, as it keeps people like me paid to work in these communities.</p> <p>However, I believe as technical contributors, we sometimes give this too much sway in the actual governance of the community and the projects. These foundation boards know that day to day decision making should be left to those working in the project, and as such allow committees like the <a href="https://www.cncf.io/">CNCF</a> TOC or the <a href="https://www.openstack.org/foundation/tech-committee/">OpenStack TC</a> full agency over the technical aspects of the member projects.</p> <p>I believe these committees operate as a legislative branch. They evaluate conditions and regulate the projects accordingly, allocating budgets for infrastructure and passing edicts to avoid chaos. Since they’re not as large as political legislative bodies like the US House of Representatives &amp; Senate, they can usually operate on a consensus basis, and not drive everything to a contentious vote. By and large, these are as nimble as a legislative body can be.</p> <p>However, I believe we need an executive to be effective.
At some point, we need a single person to listen to the facts, entertain theories, and then decide and execute a plan. Some projects have natural single leaders like this. Most, however, do not.</p> <p>I believe we as engineers aren’t generally good at being like Linus. If you’ve spent any time in the corporate world, you’ve had an executive disagree with you and run you right over. When we get the chance to distribute power evenly, we do it.</p> <p>But I think that’s a mistake. I think we should strive to have executives. Not just organizers like the <a href="https://docs.openstack.org/project-team-guide/ptl.html">OpenStack PTL</a>, but more like the <a href="https://www.debian.org/devel/leader">Debian Project Leader</a>. Empowered people with the responsibility to serve as visionaries and keep the project’s decision-making relevant and of high quality. This would also give the board somebody to interact with directly, so that they do not have to try to convince the whole community to move in a particular direction to wield influence.
In this way, I believe we’d end up with a system of checks and balances similar to the US Constitution.</p> <p><img src="/images/usgovt.jpg" alt="Checks and Balances" /></p> <p>So here is my suggestion for how a project executive structure could work, assuming there is already a strong technical committee and a well-defined voting electorate that I call the “active technical contributors”.</p> <ol> <li> <p>The president is elected by <a href="https://en.wikipedia.org/wiki/Condorcet_method">Condorcet</a> vote of the active technical contributors of a project for a term of 1 year.</p> </li> <li> <p>The president will have veto power over any proposed change to the project’s technical assets.</p> </li> <li> <p>The technical committee may override the president’s veto by a supermajority vote.</p> </li> <li> <p>The president will inform the technical contributors of their plans for the project every 6 months.</p> </li> </ol> <p>This system only works if the project contributors expect their project president to actively drive the vision of the project. Basically, the culture has to turn to this executive for final decision-making before it comes to a veto. The veto is for times when the community makes poor decisions. And this doesn’t replace leaders of individual teams. Think of these like the governors of states in the US. They’re running their sub-project inside the parameters set down by the technical committee and the president.</p> <p>And in the case of foundations or communities with boards, I believe ultimately a board would serve as the judicial branch, checking the legality of changes made against the by-laws of the group. If there’s no board of sorts, a judiciary could be appointed and confirmed, similar to the US Supreme Court or the <a href="https://www.debian.org/devel/tech-ctte">Debian CTTE</a>.
This would also be necessary to ensure that the technical arm of a project doesn’t get the foundation into legal trouble of any kind, a role that foundation boards already tend to play.</p> <p>I’d love to hear your thoughts on this on Twitter; please tweet me <a href="https://twitter.com/spamaps">@SpamapS</a> with the hashtag #OpenSourcePresident to get the discussion going.</p> Sat, 18 Feb 2017 00:00:00 +0000 http://fewbar.com/2017/02/open-source-governance-needs-presidents/ http://fewbar.com/2017/02/open-source-governance-needs-presidents/ opensource governance openstack cncf Technology Open Source OpenStack CNCF Rust - You Complete Me (And then drop me, because I'm out of scope) <p>To My Dearest <a href="http://www.rust-lang.org/">Rust</a>,</p> <p>Ever since I laid eyes on your braces and semicolons, I knew there was something special about you. <a href="https://github.com/SpamapS/rustygear">This past winter holiday</a> that we spent together has changed my life. I’ll never be the same. The way you embrace life by being explicit about the death of objects, the way you force me to be clear when I’m borrowing your things. Sure, it was a bumpy beginning. I thought maybe I might run back to safe, warm Python’s arms. But you didn’t give up on me; you kept warning me that I was making everything mutable when I didn’t have to. And now, whatever happens, I’m a better man for having known you. <img src="/images/InLove.gif" alt="Rust, how do I love you, let me count the ways" /></p> <p>Some might say being explicit about the length of our lifetimes is macabre, but I find it invigorating. It’s a reminder that some things will outlive others, and being able to see that, and know the day some of our objects will die, is a reminder that most of our data is related, and sometimes we need to spell out how up front to prevent garbage building up, which would force us to pause and deal with it later.</p> <p>And you saved me from modifying my variables in loops.
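A tiny made-up sketch of that tough love (toy code, not from any real project of mine):

```rust
fn main() {
    let letters = vec![String::from("to"), String::from("rust")];

    // Handing the vector to this loop moves each String out of it...
    for letter in letters {
        println!("sending: {}", letter);
    }

    // ...so re-reading it afterwards is a compile-time error, not a
    // runtime surprise:
    // println!("{:?}", letters); // error[E0382]: use of moved value

    // Borrowing instead keeps the vector alive for later:
    let keepsakes = vec![String::from("gearman"), String::from("zuul")];
    for keepsake in &keepsakes {
        println!("keeping: {}", keepsake);
    }
    println!("still mine: {:?}", keepsakes);
}
```

Once ownership is spelled out like this, reusing a variable you have already given away becomes impossible by construction.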
I never even knew how many times I made that mistake and had to double back to fix those errors. I always thought I was being cool, reusing variables, but you called me out and made sure I never did that after I gave them to someone else. This made me frugal with my CPU and memory by helping me think about when and where exactly I’d spend them. Explicit mutability? How about explicit <em>cuteability</em>.</p> <p>And just the other day, when I asked you if we could go multi-threaded together, you didn’t just go along easily. You didn’t just hand me the keys and make me drive the whole process. You challenged me to use mutexes and reference-counted pointers. You held my hand while I fumbled through it, and offered encouraging tips, with a lot of reminders to wrap things in safer containers before we went out into the cold, brutal multi-threaded world. Because of you, I’ll never have to feel the cold sting of corrupted memory again.</p> <p>My love, Rust, I don’t know if we can be together. You’re so new to this world and I’m not sure everyone will understand you. But I know I’ll do whatever I can to tell the world about your beauty and grace.</p> <p>Love Always, - Clint</p> <p><em>p.s. let’s meet up again around spring break.</em></p> Mon, 23 Jan 2017 00:00:00 +0000 http://fewbar.com/2017/01/a-love-letter-to-rust/ http://fewbar.com/2017/01/a-love-letter-to-rust/ rust programming Rust OpenStack's nova-compute's border is porous - We need to build a wall <p>In the beginning there was Nova. It included volumes, networking, hypervisors, and scheduling. Since then, Nova components have either been replaced (nova-network with Neutron) or forklifted out and enhanced (Cinder). In so doing, interfaces were defined for how Nova would continue to make use of these now-external services, but nova-compute, the place where the proverbial rubber meets the road, was left inside Nova.
This meant that agents for Cinder and Neutron had to interact with nova-compute through the high-level message bus, despite being right on the same physical machine in many (but not all) cases. Likewise, some deployments take advantage of that proximity, and require operator cooperation when configuring certain drivers.</p> <p>This has led to implementation details leaking all over the APIs that these services use to interact. Neutron and Nova do a sort of haphazard dance to plug ports in, and Cinder has drivers which require locking files on the local filesystem a certain way. These implementation details are leaking into public APIs because it turns out nova-compute is actually a shared service that should not belong to any of the three services, and which should define a clearer API that Nova, Cinder, and Neutron should be able to use to access the physical resources of machines on an equal footing.</p> <p><a href="https://review.openstack.org/#/c/411527/">We’re starting a discussion in the OpenStack Architecture Working Group</a> around whether this is creating real problems, and how we can address it.</p> <p>What I think we need to do is build a wall around nova-compute, so we can accurately define what goes in or out, and what belongs specifically in nova-compute’s code base. That way we can accept the things that should live and work permanently inside its borders vs. what should come in through an API port of entry and declare its intentions there.</p> <p>But before we can build that wall, we need nova-compute to declare its independence from Nova. That may be as much a social challenge as a technical one.
However, I think once we complete some analysis, and provide a path toward a more sustainable compute service, we’ll end up with a more efficient, less error-prone, more optimizable OpenStack.</p> <p>If you’re interested in this, I recommend you come to the next IRC meeting for the <a href="https://wiki.openstack.org/wiki/Meetings/Arch-WG">Architecture WG</a> , on January 12, 2017.</p> Fri, 16 Dec 2016 00:00:00 +0000 http://fewbar.com/2016/12/mr-nova-build-that-wall/ http://fewbar.com/2016/12/mr-nova-build-that-wall/ openstack architecture nova neutron cinder microservices OpenStack