No. 07 · DevOps · Aug 16, 2025 · 9 min read

Day 1 vs Day 90 on an AKS Internal Platform: What I'd Wire Differently

Three months ago I stood up a new internal developer platform on AKS for a 30-engineer team. Backstage as the portal, ArgoCD for delivery, Crossplane for resource provisioning, the usual stack.

Day 1 it looked like the architecture diagrams in the conference talks. Day 90 it looks like something only my team understands. Both versions are valid platforms. Only one of them works in real life.

This is what I'd wire differently if I were doing it again — and the things I got right that I'd keep.

The Day 1 platform (the one I'd build differently)

Day 1, the platform supported three workflows:

  1. "Spin up a new microservice" — Backstage scaffolder generates a repo with a Helm chart skeleton, pipeline config, and an ArgoCD Application.
  2. "Provision a new database" — Crossplane Composition for Database provisions an Azure Postgres flexible server.
  3. "View my service's health" — Backstage component page with embedded Grafana dashboards.
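Workflow (1) is a standard Backstage scaffolder template. A sketch of its shape, using the real scaffolder v1beta3 API but with illustrative names, repo org, and paths (this is not our actual template):

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: new-microservice          # illustrative name
  title: Spin up a new microservice
spec:
  owner: team-platform
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name: { type: string }
  steps:
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton           # Helm chart skeleton + pipeline config
        values:
          name: ${{ parameters.name }}
    - id: publish
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&repo=${{ parameters.name }}
    - id: register
      action: catalog:register    # puts the new component in the catalog
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
```

The ArgoCD Application manifest lives in the skeleton, so registering the repo is all it takes to start deploying.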

This is what every platform-engineering blog post describes. It's all the right shapes. None of it survived contact with users.

What the team actually wanted at Day 90

Looking at the GitHub issues filed against the platform repo, the actual requests fall into four categories:

1. "Why is my deploy stuck?" — Far and away the most common request. The Backstage component page shows you what is deployed but not why a deploy hasn't progressed. Engineers were context-switching to ArgoCD's UI to see sync waves, then to kubectl describe to see pod events, then to App Insights to see logs. Three tools to answer one question.

What I should have built Day 1: a single "Deploy Status" card on every component page that shows:

  • ArgoCD sync state
  • Pod readiness for the deployment
  • Last 10 lines of pod logs
  • Recent events from the deployment's namespace
  • A link to App Insights with the right time window pre-filtered

This card is the most valuable thing in our Backstage today. It took three weeks to build, during the Day 60-90 stretch. It should have been the first feature, not the last.
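The card doesn't need new plumbing per service; it reads from sources the component already declares via catalog annotations. A sketch (the ArgoCD and Kubernetes annotation keys are the standard plugin ones; the App Insights annotation is a hypothetical one our card would read to build the pre-filtered deep link):

```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api                          # illustrative service
  annotations:
    argocd/app-name: payments-api             # ArgoCD sync state
    backstage.io/kubernetes-id: payments-api  # pod readiness, logs, events
    # Hypothetical key for the App Insights deep link with time window
    example.com/app-insights-resource: rg-payments/payments-ai
spec:
  type: service
  owner: team-payments
  lifecycle: production
```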

2. "How do I rotate a secret?" — We did Crossplane right for provisioning new secrets. We did nothing for the lifecycle of existing secrets. Day 30, an engineer asked how to rotate a database password and the answer was "log into the Azure portal and click around." That's the opposite of an internal platform.

What I should have built Day 1: a SecretRotation Composition that takes a secret reference, generates a new value, updates the resource (Postgres password, Key Vault secret, etc.), and rolls the consuming workloads. It's a 200-line Composition. It pays for itself the first time someone asks.
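From the engineer's side, rotation should be one manifest. A hedged sketch of the claim shape, not our actual Composition (the API group, kind, and fields here are made up for illustration):

```yaml
apiVersion: platform.example.com/v1alpha1   # illustrative group
kind: SecretRotation
metadata:
  name: payments-db-password
  namespace: team-payments
spec:
  secretRef:
    keyVault: kv-payments       # Key Vault holding the current value
    name: db-password
  target:
    kind: PostgresFlexibleServer
    name: payments-db           # resource whose credential is rotated
  rollout:
    deployments: [payments-api] # consumers to restart after rotation
```

The Composition behind it generates a new value, patches the Postgres administrator password and the Key Vault secret, then triggers a rollout restart of the listed consumers so nothing keeps running on the old credential.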

3. "Where can I see my costs?" — The platform itself is on a per-team-tagged subscription. Engineers wanted to see their service's monthly cost without learning Cost Management. The Backstage Cost Insights plugin exists but pulling data from Azure into it requires custom code we hadn't written.

What I should have built Day 1: a nightly Logic App that pulls Cost Management data per tag, dumps it into a Postgres table, and exposes it via the Backstage TechDocs adapter. About a day's work. We did it Day 75.
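We ran the pull as a Logic App, but the same nightly job could equally live in-cluster. An assumed equivalent as a CronJob sketch (image, secret name, and the query details are illustrative; the secret would carry service-principal credentials and the Postgres DSN):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-pull
spec:
  schedule: "0 2 * * *"              # nightly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pull
              image: mcr.microsoft.com/azure-cli   # illustrative image
              envFrom:
                - secretRef: { name: cost-pull-creds }
              command: ["/bin/sh", "-c"]
              args:
                - >
                  az login --service-principal ... &&
                  az costmanagement query ...
                  # group by the per-team tag, then insert rows into the
                  # Postgres table Backstage reads
```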

4. "Why is my pod OOMKilled?" — Engineers had no good interface for "look at recent OOMKills on my deployment." Easy in kubectl, hard in any UI. They wanted alerts, not dashboards.

What I should have built Day 1: a Prometheus alert rule template that ships with every scaffolded service — "alert on >0 OOMKills/hour", "alert on restart count > 3 in 5min", etc. — wired to send to the team's Slack via a webhook. Not new technology. Just the discipline of shipping it as part of the scaffolder, not as a follow-up.
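The default rules above can be shipped as a PrometheusRule manifest in the scaffolded repo. A sketch, assuming the Prometheus Operator and kube-state-metrics are installed; thresholds and label values are the assumptions stated in the text, and SERVICE is the placeholder the scaffolder fills in:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: SERVICE-default-alerts
spec:
  groups:
    - name: SERVICE.defaults
      rules:
        - alert: PodOOMKilled
          # >0 OOMKills in the last hour
          expr: max_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled", namespace="SERVICE"}[1h]) > 0
          labels: { severity: page }
          annotations:
            summary: "{{ $labels.pod }} was OOMKilled in the last hour"
        - alert: PodRestartLoop
          # restart count > 3 in 5 minutes
          expr: increase(kube_pod_container_status_restarts_total{namespace="SERVICE"}[5m]) > 3
          labels: { severity: page }
```

Alertmanager routes the severity label to the team's Slack webhook, so the scaffolder only has to template the namespace and the route.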

What I got right that I'd keep

Crossplane Compositions for resource provisioning. Good choice. The DBA team loves it. They can write Compositions that enforce naming, tagging, network placement, backup policies — all the things that previously required either tickets or trust. In the first month after we shipped the Database Composition, we provisioned 18 databases via PRs, with zero tickets to the DBA team for routine provisioning. They moved on to actual database engineering work.
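From the engineer's side, "provision a database via PR" is just committing a claim. An illustrative shape, not our actual XRD (group, kind, and fields are made up):

```yaml
apiVersion: platform.example.com/v1alpha1   # illustrative group
kind: Database
metadata:
  name: payments-db
  namespace: team-payments
spec:
  engine: postgres
  size: small        # the Composition maps this to an Azure Postgres SKU
  # Naming, tagging, VNet placement, and backup policy are enforced by
  # the DBA-owned Composition, not chosen here.
```

That asymmetry is the point: engineers pick two fields, the DBA team owns everything else.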

ArgoCD App-of-Apps for the platform itself. The platform's own components (Backstage, ArgoCD, Crossplane) are managed by an outer ArgoCD instance. Bootstrapping is kubectl apply -f cluster-bootstrap.yaml and walk away. When we needed to tear down and rebuild a cluster for a CIS benchmark exercise, the platform recreated itself in 25 minutes. Worth every minute of the initial complexity.
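The app-of-apps pattern is one root Application pointing at a directory of child Application manifests. A sketch with illustrative repo URL and paths:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/platform-config  # illustrative
    targetRevision: main
    path: apps/    # child Applications: Backstage, Crossplane, ArgoCD itself
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: { prune: true, selfHeal: true }
```

The bootstrap manifest only has to install ArgoCD and apply this one Application; the rest cascades.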

A first-class "team" concept. Day 1, every component in the catalog has an owner: team-payments style field. Teams are first-class objects in Backstage. This let us roll out things like "when this team's service has a P1 incident, page the team's on-call rotation, not the platform team." Without team-as-first-class, incidents go out to a thousand individual user emails and nobody knows who to wake up.
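In catalog terms, team-as-first-class is a Group entity that components reference as owner. A minimal sketch with illustrative names:

```yaml
apiVersion: backstage.io/v1alpha1
kind: Group
metadata:
  name: team-payments
spec:
  type: team
  children: []
---
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
spec:
  type: service
  owner: team-payments   # incident routing resolves through this reference
  lifecycle: production
```

Anything that needs "who owns this?" (paging, cost rollups, the Deploy Status card) resolves through that one reference.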

Day 1 vs Day 90 contrast

Aspect             Day 1 spec                        Day 90 reality
Deploy visibility  Embedded Grafana dashboard        Custom Deploy Status card aggregating 5 sources
Secret rotation    Manual via Azure portal           Crossplane SecretRotation Composition
Cost visibility    "Use Cost Management"             Nightly tag-aggregated dashboard in Backstage
Alerting           Engineers configure Prom rules    Default alerts shipped with every scaffolded service
Self-service DB    Crossplane (kept)                 Crossplane (kept)
Bootstrap          App-of-Apps (kept)                App-of-Apps (kept)

Half of what I'd build differently on Day 1 comes down to "make the workflows engineers actually run every day faster," not "build more capabilities." The platform's job is to make things mundane, not to provide platforms-of-platforms.

The portable lesson

If I were starting another internal platform, I'd spend Week 1 not building anything. I'd spend it shadowing the engineering teams: "what do you do when you need to deploy something? When you need a new resource? When something is broken?" Write down every step. Find the steps that involve switching tabs or reading docs.

Then build to eliminate those steps. Don't build to match the architecture diagrams. The diagrams were drawn by people whose engineers didn't have your problems.

I would NOT skip the boring infrastructure (telemetry, RBAC, network policy) in pursuit of the developer-facing features. Those got me 80% of the technical wins for 20% of the effort. The developer features get the rest of the wins, but only on top of solid infrastructure.

AKS · Platform Engineering

