Ship It Weekly - DevOps, SRE, Platform and Cloud Engineering News

Kubernetes 1.36, Gateway API v1.5, AWS Copilot End of Support, and Cloudflare Non-Human Identities

20 min

Apr 24, 2026

Summary

This episode covers five major infrastructure and platform developments: Kubernetes 1.36's maturity-focused deprecations, Gateway API v1.5's promotion to stable, AWS Copilot's end-of-support and migration paths, Airbnb's workflow-based alert improvements, and Cloudflare's non-human identity framework. The underlying theme emphasizes how platform maturity comes from making vague responsibilities explicit rather than adding new features.

Insights

Platform maturity is achieved through honest deprecation and removing legacy patterns, not just adding new features and abstractions
Workflow problems masquerading as culture problems can be solved through better feedback loops and validation before production deployment
Non-human identities (agents, scripts, third-party tools) require the same IAM rigor as human users, with token scanning and permission scoping
Cloud platform migrations require inventory and planning before end-of-support dates, not panic or philosophical debates about tool quality
Explicit networking intent through standardized APIs reduces controller-specific magic and annotation debt

Trends

Kubernetes ecosystem moving toward supply chain security with OCI artifacts and cleaner packaging primitives
Gateway API becoming the standardized networking control plane, replacing ingress and controller-specific patterns
Cloud providers actively deprecating older developer-friendly tools in favor of more customizable infrastructure-as-code approaches
Observability-as-code workflows requiring real-time feedback and validation before production deployment
Identity and access management expanding to treat automation and agents as first-class identities with blast radius implications
Release train models replacing perfect bundling cycles for more predictable forward motion in API development
Platform teams prioritizing workflow improvements and standards compatibility over proprietary internal tooling
Security vendors implementing scannable token formats and granular resource-scoped permissions for agent access

Topics

Kubernetes 1.36 deprecations and maturity signals
Gateway API v1.5 stable promotions
AWS Copilot end-of-support migration planning
Alert development workflow optimization
Non-human identity and IAM governance
OCI artifacts and supply chain security
Observability-as-code and Prometheus integration
CloudFormation and infrastructure inventory
Token scanning and OAuth visibility
Resource-scoped RBAC permissions
Ingress controller deprecation patterns
ECS Express Mode and CDK L3 constructs
Alert feedback loops and validation
Stale controller behavior mitigation
Production as testing environment anti-pattern

Companies

Kubernetes

Released v1.36 with 70 enhancements, deprecating externalIPs and GitRepo volume plugin, promoting supply chain features

AWS

Announced Copilot CLI end-of-support June 2026, directing users to ECS Express Mode and CDK L3 constructs

Cloudflare

Introduced scannable API tokens, OAuth visibility, and resource-scoped RBAC for non-human identities and agents

Airbnb

Built observability-as-code feedback loops for alert validation, reducing development cycles from weeks to minutes

Google Cloud

Expanded OTLP metrics support for Cloud Monitoring with Prometheus-compatible data storage and querying

Microsoft

Released Azure DevOps Server patches addressing null reference issues and malicious redirect prevention

People

Brian Teller

DevOps and SRE professional hosting Ship It Weekly, filtering infrastructure and reliability trends

Quotes

"Kubernetes is cleaning house. Gateway API keeps getting more official. AWS is moving on from Copilot."

Brian Teller•Opening

"This is not just look at all of the new stuff. It is Kubernetes continuing to act like a platform that wants fewer half-legacy foot guns hanging around forever."

Brian Teller•Story 1

"The token gets you in the building, but scopes should decide which rooms you can enter."

Brian Teller•Story 5

"Better platforms do not just make things easier. They make certain kinds of vagueness harder to sustain."

Brian Teller•Closing

"An agent with a token is not some cute helper. It is an identity with power."

Brian Teller•Story 5

Full Transcript

You know that moment when a platform stops sounding helpful and starts sounding serious? That's this week. Kubernetes is cleaning house. Gateway API keeps getting more official. AWS is moving on from Copilot. Airbnb is showing why production should not be your first alert test. And Cloudflare is reminding everybody that a bot with a token is still a principal with blast radius.

Hey, I'm Brian Teller. I work in DevOps and SRE, and I run Teller's Tech. This is Ship It Weekly, where I filter the noise and focus on what actually changes how we run infrastructure and own reliability. Show notes and links are on shipitweekly.fm. If this show's been useful, follow it wherever you listen. Ratings help way more than they should.

We have five main stories today, then the lightning round, and we'll wrap with the human closer. We're starting with Kubernetes 1.36 because this feels like one of those releases where the project keeps sanding off older sharp edges while pushing more production-grade features into stable territory. Then Gateway API version 1.5, which is basically SIG Network saying the future is not just coming. It is getting promoted into the stable path now. After that, AWS Copilot CLI end of support, because this is a very real platform story. One easy path is aging out, and AWS is pretty clearly nudging people towards the next preferred path. Then we've got Airbnb on alert development, which is probably my favorite SRE story in the set, because it is really about how better feedback loops beat blaming culture. And finally, Cloudflare on non-human identities, because this is one of the clearest examples lately of security vendors saying out loud that scripts, agents, and third-party tools need to be treated like first-class identities, not side characters.

Story 1. Kubernetes 1.36 feels like a maturity release.

Let's start there. Kubernetes 1.36 shipped on April 22nd.
The release includes 70 enhancements with 18 graduating to stable, 25 entering beta, and 25 moving to alpha. That alone gives you the usual big release energy, but the more interesting part is what it says about where the project is spending its maturity budget. One of the clearest examples is service.spec.externalIPs. Kubernetes says that field is now deprecated, calls it a known security headache, and points back to the long-running man-in-the-middle risk around CVE-2020-8554. You'll see warnings now, and full removal is planned for version 1.43. Kubernetes is also permanently disabling the old gitRepo volume plugin in 1.36, saying that path is closed for good and existing workloads need to move to alternatives like init containers or external git-sync-style tools.

That's why I like this story, because this is not just look at all of the new stuff. It is Kubernetes continuing to act like a platform that wants fewer half-legacy foot guns hanging around forever. And honestly, that is what maturity often looks like. Not more knobs. Fewer weird things everybody knows are sketchy but keeps tolerating anyway.

And there's another angle here that makes this release feel more grown up than flashy. Kubernetes 1.36 is also pushing more of the supply chain and packaging story towards cleaner primitives. The release highlights support for packaging read-only application data, models, and static assets as OCI artifacts and delivering them to pods through the same registries and versioning workflows teams already use for container images. It also calls out staleness mitigation work for controllers, which matters because a lot of weird controller behavior does not show up as a dramatic crash. It shows up as a controller acting on stale assumptions at the wrong time and doing the wrong thing very confidently.

So the practical read for me is this. If you run Kubernetes at any real scale, 1.36 is the kind of release where you should not just skim the release notes for what's new.
You should skim them for: what old thing are they finally done tolerating? And what newer path is now stable enough that we should stop calling it experimental in our own heads? Because that's usually where the real platform signal is.

Story 2. Gateway API version 1.5 is the networking future getting more real.

Next up, Gateway API. SIG Network announced Gateway API version 1.5 on April 21st, called it the biggest release yet, and said the focus was moving existing experimental features into the standard, meaning stable, channel. The release promotes six features that people actually care about: ListenerSet, TLSRoute, the HTTPRoute CORS filter, client certificate validation, certificate selection for Gateway TLS origination, and ReferenceGrant. They also moved to a release train model, which basically means features ship when they are ready at freeze time instead of waiting for some perfect bundle.

That matters because Gateway API is no longer just a nice future-facing idea for people who like cleaner abstractions. It keeps becoming the actual road forward, especially now that Kubernetes itself is pointing people towards Gateway API in places where older patterns are being deprecated, and especially after all of the broader ingress and controller retirement pressure we've already been seeing this year. So to me, the real read here is simple. The networking control plane in Kubernetes keeps getting more explicit, more standardized, and less willing to leave everything in the controller-specific magic-plus-annotations bucket forever.

And some of the specific Gateway API promotions are worth slowing down on for a second. ListenerSet is interesting because it gives teams a cleaner way to contribute listeners to a Gateway without forcing everything into one giant resource owned by one team. TLSRoute going stable matters because it makes the TLS passthrough and terminate use cases feel more first-class.
The CORS filter moving into the standard channel is also one of those small-looking things that matters a lot in real life, because it is exactly the kind of behavior people used to bury in controller-specific config or app-side workarounds. And the move to a release train model is its own kind of maturity signal too. It means the project is optimizing less for perfect bundling and more for predictable forward motion.

So if I'm a platform team, the takeaway is not just "cool, more Gateway API stuff." It is probably: how much of our ingress and edge behavior is still trapped in annotations? How much of it is controller-specific? And how much of it now has a cleaner upstream-shaped home we should actually plan for?

Story 3. AWS Copilot is reaching end of support. And that tells you a lot.

Now to AWS. AWS says Copilot CLI reaches end of support on June 12, 2026. It will remain available as an open-source project on GitHub, but AWS says it will no longer receive new features or security updates from AWS. In the same announcement, AWS points users towards Amazon ECS Express Mode and AWS CDK Layer 3 constructs as the migration paths they want people evaluating.

This is one of those stories I like because it is not really about the tool alone. It is about the platform preference. Copilot was AWS's opinionated developer-friendly path for developing containerized apps on ECS and App Runner. Now, AWS is pretty clearly saying the newer opinionated paths are somewhere else. That does not mean Copilot users did anything wrong. It just means the center of gravity moved. And that is a very real cloud platform lesson. Sometimes the easy path you picked was genuinely the right call at the time. Then the provider evolves. The preferred abstractions change. And now your job is not debating whether the shift is fair. Your job is planning the migration before the old path becomes operational debt with a calendar attached to it.
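To make "a calendar attached to it" concrete, here is a minimal sketch that turns the stated end-of-support date into a days-remaining check a team could run in CI or a dashboard. The only fact taken from the episode is the June 12, 2026 date; the function names, the 90-day warning window, and the status labels are illustrative assumptions, not anything AWS publishes.

```python
from datetime import date

# End-of-support date as stated in the episode.
COPILOT_END_OF_SUPPORT = date(2026, 6, 12)


def days_until_end_of_support(today: date) -> int:
    """Days remaining before the end-of-support date (negative once it has passed)."""
    return (COPILOT_END_OF_SUPPORT - today).days


def migration_urgency(today: date, warn_within_days: int = 90) -> str:
    """Map the remaining time to a coarse status label.

    The 90-day window and the label strings are hypothetical defaults.
    """
    remaining = days_until_end_of_support(today)
    if remaining < 0:
        return "past end of support"
    if remaining <= warn_within_days:
        return "migrate now"
    return "plan migration"


if __name__ == "__main__":
    today = date.today()
    print(days_until_end_of_support(today), migration_urgency(today))
```

A check like this only tells you how loud the date should be. It does not tell you what is actually running on Copilot.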
And there is a migration planning lesson here that I think gets missed when people hear end of support. Copilot did a lot more than just offer a nicer CLI. AWS says it used CloudFormation stacks for the app and service layers, which means a lot of teams probably have more Copilot-shaped infrastructure under the hood than they remember. So the right response here is not panic. It is inventory. What workloads are still on Copilot? Which ones are App Runner versus ECS? Which parts of the deployment flow are tied to Copilot conventions? And should the real destination be Express Mode for simplicity, or CDK L3 constructs for teams that want stronger IaC controls and customization? AWS is pretty explicit that Express Mode is meant to preserve a lot of the simplicity that made Copilot attractive, while CDK L3 is the more customizable path.

That is the kind of thing I would actually say out loud to a team. Do not turn this into a philosophical debate about whether Copilot was good. Assume it was good for when you picked it. Now ask, what will be annoying to unwind later if you ignore the date now?

Story 4. Airbnb says alert pain was a workflow problem, not a culture problem.

This one is probably my favorite in the whole episode. Airbnb says the issue with alert development was not that engineers did not care. It was that their observability-as-code workflow had a blind spot. Code review could validate syntax and logic, but not actual alert behavior against real-world data. So production kept becoming the proving ground. Airbnb says they built fast feedback loops to preview, validate, and surface alert behavior before PR submission, cut development cycles from weeks to minutes, and used the workflow to help migrate 300,000 alerts from a vendor to Prometheus. They also say what used to take a month of iteration can now take an afternoon.
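The idea of previewing alert behavior against real data before merge can be sketched as a tiny backtest. This is a simplified illustration only: Airbnb's system is described as running Prometheus' own rule evaluation engine, whereas the `AlertRule` class, the `preview_firings` function, and the sample data below are all hypothetical stand-ins that approximate a threshold plus a "for" duration with a count of consecutive samples.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlertRule:
    name: str
    threshold: float   # the rule breaches when a sample is strictly above this
    for_samples: int   # consecutive breaching samples required before firing


def preview_firings(rule: AlertRule, samples: List[Tuple[int, float]]) -> List[int]:
    """Replay historical (timestamp, value) samples and return when the rule would start firing."""
    firings: List[int] = []
    streak = 0
    firing = False
    for ts, value in samples:
        if value > rule.threshold:
            streak += 1
            if streak >= rule.for_samples and not firing:
                firing = True
                firings.append(ts)
        else:
            # Any sample back under the threshold resets the pending state.
            streak = 0
            firing = False
    return firings


# Hypothetical error-rate history, one sample per minute.
history = [(0, 0.1), (60, 0.9), (120, 0.95), (180, 0.2),
           (240, 0.9), (300, 0.9), (360, 0.9), (420, 0.1)]

noisy = AlertRule("HighErrorRate", threshold=0.5, for_samples=1)
tuned = AlertRule("HighErrorRate", threshold=0.5, for_samples=3)

print(preview_firings(noisy, history))  # fires on both spikes
print(preview_firings(tuned, history))  # fires only on the sustained spike
```

Comparing the two outputs side by side is the kind of delta a reviewer could see in the PR instead of discovering it from a page.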
This is such a good SRE story, because it is very easy to look at noisy alerts, weak alerts, or slow alert iteration, and say this is just a culture problem, or people just need to care more. Sometimes, sure. But a lot of the time the workflow is just bad. If the only way to see whether an alert behaves correctly is to merge it and wait, then you built a system where production is the testing harness and on-call is the feedback loop. That is not a motivation issue. That is a tooling issue. And Airbnb's fix is basically the kind of thing platform teams should love. Earlier feedback, more confidence inside the PR, less wasted iteration after the fact.

And Airbnb made another design choice here that I really liked. They explicitly chose compatibility over novelty. Instead of inventing some exotic, proprietary alert analysis model, they took Prometheus rule groups as the input, used Prometheus' own rule evaluation engine, and wrote the results back out as Prometheus time series blocks, exposed through the standard query API. That is a really smart platform move because it means the preview and analysis system fits into the workflows engineers already understand, instead of becoming one more internal snowflake everybody has to relearn. That matters because a lot of internal platform tooling dies, not because the idea was bad, but because the workflow becomes "learn our special thing first." Airbnb's version sounds more like: use the standards, use the real engine, show people the delta before they merge, and make the right thing easier than the lazy thing. That is honestly a great pattern well beyond alerting.

Story 5. Cloudflare is saying non-human identity is now the real identity story.

Last main story. Last time, in episode 34, Cloudflare was talking about the network fabric for agents. This time, they're talking about the identity, token, and permission model around them. Cloudflare's framing here is very direct. Identities are not just people anymore.
They are agents, scripts, and third-party tools acting on your behalf. Their update packages that into three practical areas: scannable API tokens, better OAuth visibility and revocation, and more granular resource-scoped RBAC. Cloudflare says new tokens are easier for scanners to recognize. Customers now get a central connected applications experience for OAuth access and revocation. And resource-scoped permissions are available for more resources, so both users and agents can be right-sized more tightly. They also say these scopes can be assigned through the dashboard, the API, or Terraform.

And I think that this story matters because it is one of the clearer examples of the industry dropping the pretense. An agent with a token is not some cute helper. It is an identity with power. A script with standing access is not background noise. It is an identity with power. A third-party OAuth app is not just a convenience. It is an identity with power. And once you accept that, the rest of the story gets more normal. Token scanning, connected app visibility, permission scoping, least privilege. This is just IAM growing up around modern workloads and agent-heavy environments.

And I also like that Cloudflare is not treating this as some abstract future-of-security thing. The token changes are practical. They added a recognizable prefix and checksum so scanners can identify Cloudflare tokens with much higher confidence. The OAuth work is practical too. You can review the app name, publisher, requested scopes, and which accounts the app is asking to access before you approve it, and then see those connected applications later in one place to revoke them if needed. And their resource-scoped permissions framing is probably the most useful mental model in the whole post. The token gets you in the building, but scopes should decide which rooms you can enter. That is the part I think teams should take seriously.
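Why a prefix plus a checksum helps scanners can be shown with a small sketch. The format below is entirely hypothetical and is not Cloudflare's actual token layout: the `exmpl_` prefix, the 32-character body, and the CRC32 checksum are assumptions chosen only to illustrate the technique. The prefix lets a scanner find candidates with a cheap pattern match, and the checksum lets it discard lookalike strings without calling any API.

```python
import re
import secrets
import zlib
from typing import List

PREFIX = "exmpl_"  # hypothetical vendor prefix, not a real token format
BODY_LEN = 32
_ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

# Prefix, then a 32-character body, then 8 hex characters of checksum.
TOKEN_RE = re.compile(r"exmpl_([a-z0-9]{32})([0-9a-f]{8})")


def _checksum(body: str) -> str:
    """CRC32 of the body, rendered as 8 lowercase hex characters."""
    return format(zlib.crc32(body.encode("ascii")), "08x")


def mint_token() -> str:
    """Create a random token in the hypothetical prefix + body + checksum layout."""
    body = "".join(secrets.choice(_ALPHABET) for _ in range(BODY_LEN))
    return PREFIX + body + _checksum(body)


def find_tokens(text: str) -> List[str]:
    """Return substrings that match the pattern and whose checksum verifies."""
    return [
        m.group(0)
        for m in TOKEN_RE.finditer(text)
        if _checksum(m.group(1)) == m.group(2)
    ]


if __name__ == "__main__":
    token = mint_token()
    leaked = f"export API_TOKEN={token}  # oops, committed to the repo"
    print(find_tokens(leaked) == [token])
```

The checksum here is only an integrity hint for scanners. It adds no secrecy, and a real deployment would still rely on the random body for security.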
If you are letting agents, scripts, and third-party automations pile up with broad standing access, you do not really have an AI governance problem. You have an IAM hygiene problem that got new branding.

Okay, a few quick ones before we wrap. Microsoft shipped April patches for Azure DevOps Server and says they strongly recommend staying on the latest secure version. The patch fixes a null reference issue that could break pull request completion during work item auto-completion, improves sign-out validation to prevent potential malicious redirects, and fixes PAT connection creation for GitHub Enterprise Server. That is a very practical patch-now item.

Google also pushed more on OTLP metrics for Cloud Monitoring. Google says you can send metrics to Cloud Monitoring through a provider-agnostic OpenTelemetry pipeline, store that data in the same format as Managed Service for Prometheus, and query it through the same Cloud Monitoring interfaces. That is a nice observability standards story because it is not just "support the protocol." It is "make the protocol path actually first-class."

I think the human thread underneath this week's episode is that a lot of engineering pain comes from waiting too long to make responsibilities explicit. Kubernetes is making some of that explicit by deprecating or removing paths it clearly does not want to keep carrying forever. Gateway API is making networking intent more explicit. AWS is making platform preference more explicit by telling people where Copilot stops and where the next preferred paths begin. Airbnb is making alert quality less dependent on vague craftsmanship and more dependent on visible feedback before merge. And Cloudflare is making it harder to pretend that agents and scripts are somehow outside the normal identity and access conversation. And this is where I think platform work gets misunderstood sometimes. People talk about maturity like it means more automation, more abstraction, more paved roads.
Sometimes it does. But just as often, maturity is really about being honest sooner. Honest about which patterns are legacy. Honest about which workflows are broken. Honest about which tools are losing support. Honest about whether production is secretly your only validation environment. Honest about who or what actually has access in your environment. That is not as fun as a big shiny launch, but it is usually where a lot of the real risk reduction comes from. Because the longer a team stays fuzzy on ownership, the more that fuzziness turns into toil. And the more that toil turns into staffing pain. And the more that staffing pain turns into reliability pain. Not because anybody is lazy, not because people do not care. Usually just because too many systems stayed ambiguous for too long. So yeah, that's probably my biggest takeaway from this week. Better platforms do not just make things easier. They make certain kinds of vagueness harder to sustain. And honestly, that is usually a good thing.

All right, that's it for this episode of Ship It Weekly. Quick recap: Kubernetes 1.36 and why it feels like a maturity release. Gateway API version 1.5 moving more core networking features into stable territory. AWS Copilot reaching end of support, and what that says about shifting preferred paths. Airbnb proving alert pain was a workflow gap, not just a culture issue. And Cloudflare making the case that non-human identity is now a core security story. Then in the lightning round, Azure DevOps Server patches and Google Cloud OTLP metrics support.

Links and show notes are on shipitweekly.fm. You can also find video versions on YouTube. If this episode was useful, follow or subscribe wherever you listen. And send it to the person on your team who keeps having to explain that reliability problems are usually workflow problems long before they become on-call problems. I'm Brian, and I'll see you next week.