Managing Efficient Technical and Data Teams
Within any organization, technical teams are unique. They are the engineers and analysts - the ones who ship code and build data analyses, and the ones who maintain cloud systems and processes. In particular, subsets of these technical teams build strictly internal software.
Need to help your sales team fetch information about a customer before a renewal call without pinging you? Enable your finance team to process refunds programmatically? Expose, dare I say, an API to back an internal application for on-the-floor manufacturing or point-of-sale jobs? This is all in the space of internal tooling. Internal tools are just as important as external ones. While their end users are internal employees, those employees interact with customers, and their decisions can be just as crucial for the business.
You - the managers of these internal tooling teams - must ensure they satisfy the requirements of your business counterparts quickly and resiliently. Often, it can feel like you have to choose between the two - as if building “quickly” and building “resiliently” cannot be attained together.
Being efficient means finding a way to do both. This article outlines how to lead an efficient technical software or data team focused on internal tooling, by balancing new problem-solving with the upkeep of existing systems.
Caveat: the following may apply to full-stack engineering, but it is largely focused on the problems technical teams face when building internal tooling.
The distribution of engineering work reflects productivity
Before we dive in, we need to discuss how a technical problem and its solution come to be. The first task is to find the right problems to solve - for the sake of argument, assume a product manager has already done this. Next, once a problem is handed to a technical team, the assigned engineer must figure out how to solve it. The solution may involve writing new code, removing old code, purchasing software, or something different altogether (like an architectural, graphic, or configuration change).
In the case of internal tooling, the problem must often be solved by updating an existing tool or building a new one. This is frequently not the case for external software, which has many more constraints, stakeholders, moving parts, and areas to optimize. From here on, this article assumes the solution has already been identified and involves writing or editing code.
When it comes to coding, we have to start by discussing how engineers generally spend their time writing and maintaining solutions that are out in the world. The activities below are listed from highest to lowest value - and while the last items are the lowest value, they are still critical and necessary. Efficient teams spend more time on high-value tasks and less time on low-value work, while maintaining the same level of resiliency.
Where engineering time should go
🟩 Solving net-new problems. This is the most value-driving activity of any software, data, or platform engineer. The more time spent working on net-new problems with new features or dashboards, the more perceived value your team has.
🟩 Improving existing solutions. While similar to the first bullet, this is substantially different: some problems call not for new features but for incremental improvements to existing systems, with dramatic results for the end user. This item, like the first, involves critical thinking and editing code.
🟨 Deploying to production. After a proof of concept is built, the code needs to be made repeatable and automated. This is a required process, but it often takes far too long - when designed poorly, it can be a monumental task. While an engineer may spend 10 hours writing code, it can take an additional 10+ hours to refactor it, test it, deploy it to the cloud, and make sure it’s resilient. The more time spent deploying, the longer it takes your team to deliver value.
🟥 Maintaining existing code. Once code or an application is in production, the business around it might change - and production code has to change with it. For internal tooling teams, this largely has to do with updating business logic. Especially in the data field, you don’t have control over inputs - so pipelines might break unexpectedly. When issues arise, all other work halts.
🟥 Maintaining existing systems. Code runs on infrastructure, which itself has to be maintained. Updates occur, instances go down, and during these times teams go into hyper debugging mode. During this downtime, trust is actively being lost with stakeholders.
Expectations of your team
If you talked to your stakeholders - leaders in product, marketing, sales, and so forth - they would likely say engineering time goes primarily toward building net-new features. If that’s not where all the time actually goes, why the disconnect?
There are two misconceptions happening here.
The first: that net-new code is the only high-value way to solve a problem. This is often not the case. Changing code, making it more efficient, or removing it altogether can often solve a problem. That is a topic for another post.
The second: that solving problems is the highest-value activity, and the primary thing business teams expect and see. Of the items above, it is the only one that is outward facing; everything else is under the hood. While a car’s engine is important, as a driver you only think about it if the car starts smoking or suddenly stops.
Your goal as a manager is to enable your team with the proper tools and frameworks to spend the highest possible percentage of their time solving problems through code without sacrificing resiliency. Why? Because sacrificing resiliency means losing trust with your stakeholders.
Problems that reduce productivity
What takes away from the highest value work? Succinctly, two things: slow development and long debugging times.
For instance, an engineer may have already found a solution to a problem but can’t work on it because a production failure has occurred. Then, once they’re ready to ship the solution to production, the hoops to jump through number in the dozens. This is highly inefficient.
Let’s break that down.
Caveat: this section, again, operates under the premise that a solution is found and ready for implementation. Engineers also spend plenty of time figuring out what the solution should be - managing that challenge deserves its own article.
Problem #1: ready code but slow production development
Once code is tested locally, it needs to be deployed to production. This is a key part of the development process, necessary to make a feature complete - end users (whether external or internal) need to see it. Deploying to production entails:
- Taking code from locally executing to running on a cloud platform
- Testing in production to ensure no new failures occur
- Making sure the code is deployed securely and on the right (secure) infrastructure
- Ensuring the code runs when it is supposed to
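To make that sequence concrete, here is a deliberately simplified sketch of what an automated version can look like; the image name, manifest path, and test directories are all hypothetical:

```python
"""A minimal sketch of an automated deploy script; all names and paths are hypothetical."""
import subprocess
import sys

IMAGE = "registry.example.com/internal-tools/refund-portal:latest"  # hypothetical registry


def run(cmd: list[str]) -> None:
    """Run one step of the deploy, aborting on the first failure."""
    print(f"+ {' '.join(cmd)}")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    try:
        run(["pytest", "tests/"])                                # verify locally before shipping
        run(["docker", "build", "-t", IMAGE, "."])               # package for the cloud platform
        run(["docker", "push", IMAGE])
        run(["kubectl", "apply", "-f", "deploy/manifest.yaml"])  # deploy onto vetted infrastructure
        run(["pytest", "tests/smoke"])                           # test in production for new failures
    except subprocess.CalledProcessError as exc:
        sys.exit(exc.returncode)
```

Scheduling - the last bullet - would live in the manifest itself in this setup, for example as a Kubernetes CronJob.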
Making this easy looks like a pre-built, automated framework for deploying code, so the average engineer doesn’t have to think about the nuances of deployment every time they get code to production. For instance:
- A local environment that replicates production as closely as possible, including a local test suite
- Infrastructure configured from a playbook, not ad-hoc for every deployment. Remove thinking about infrastructure from the average engineer’s equation (a sketch follows this list).
- Monitoring that is pre-built and automated, so it doesn’t have to be built ad-hoc for every new piece of code.
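The “playbook” idea can be as simple as a central table of vetted infrastructure profiles that engineers select by name. A minimal sketch - the profile names and resource values are invented purely for illustration:

```python
"""Sketch: infrastructure choices come from a vetted playbook, not ad-hoc decisions.
All profile names and resource values are hypothetical."""

PLAYBOOK = {
    "small-job": {"cpu": "0.5", "memory": "512Mi", "network": "internal-only"},
    "pipeline":  {"cpu": "2",   "memory": "4Gi",   "network": "internal-only"},
}


def infra_for(profile: str) -> dict:
    """Engineers pick a named profile; the secure defaults were decided once, centrally."""
    try:
        return PLAYBOOK[profile]
    except KeyError:
        raise ValueError(f"unknown profile {profile!r}; choose from {sorted(PLAYBOOK)}")
```

The point of the design is that the secure, tested path is also the easiest path - deviating from it requires a deliberate change to the playbook, not a one-off decision.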
If code is deployed improperly, failures will create a big mess - one that takes time to clean up. During that time, stakeholders will be watching closely, and the outward-facing part of your engineering team’s work will be called into question.
Problem #2: urgent failures but long debugging times
Once code is in production, it will inevitably fail at some point. The question is: what are the repercussions of those failures, and how long do they take to debug?
If a failure of a data pipeline in production results in data quality issues, the business stakeholders depending on your analytics team will suffer. If that pipeline has no quality checks or backup processes, you might even find out about the issues from the stakeholders themselves. This not only occupies developer time, but also reduces trust.
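A lightweight quality gate can catch these issues before stakeholders do. A minimal sketch, assuming a pandas-based pipeline - the column names and thresholds are hypothetical:

```python
"""A minimal data quality gate; column names and thresholds are hypothetical."""
import pandas as pd


def check_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast, so bad data never reaches stakeholder dashboards."""
    if df["order_id"].duplicated().any():
        raise ValueError("duplicate order_ids found")
    if (df["amount"] < 0).any():
        raise ValueError("negative order amounts found")
    if df["created_at"].max() < pd.Timestamp.now() - pd.Timedelta(days=1):
        raise ValueError("stale data: no orders in the last 24 hours")
    return df
```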
Once a failure has occurred, the next question is how long it takes to get resolved. The time required to resolve the issue is time taken directly away from new feature development. It also erodes stakeholder trust - whether your stakeholders are end users or internal business leaders. A failure in production means something isn’t working, and it makes for a critical, stressful time for your team.
Better response procedures to urgent failures look like:
- alerting stakeholders of failures before they find out themselves
- implementing backup processes to reduce downtime during debugging
- having full end-to-end observability of all processes, so you can report to higher-level management on SLAs and team expectations
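One way to get the first two behaviors is a small wrapper around every production job. A sketch, assuming a Slack-style incoming webhook - the URL is a placeholder, not a real endpoint:

```python
"""Sketch of a failure hook: alert stakeholders first, then fall back if possible."""
import functools
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/XXX"  # placeholder webhook


def alert_on_failure(fallback=None):
    """Wrap a job so stakeholders hear about failures from you, not the other way around."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                # Alert before stakeholders discover the issue themselves
                requests.post(WEBHOOK_URL, json={"text": f"{fn.__name__} failed: {exc}"}, timeout=10)
                if fallback is not None:
                    return fallback(*args, **kwargs)  # e.g. serve yesterday's snapshot
                raise
        return wrapper
    return decorator
```

The third behavior - end-to-end observability - is where a shared platform earns its keep, since per-job wrappers like this don’t aggregate into a single view.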
Complexity maturity curve for new feature development
The relationship between the number of features in production and the amount of time spent on maintenance is the direct indicator of how efficient your team is at delivering value. Let’s refer to this relationship as the complexity maturity curve, which answers one simple question: as the complexity of your code in production grows, at what pace does your maintenance time grow with it?
Some features shouldn’t exist, and others should exist in a different form - but the ability of your team to solve new internal tooling problems through code will enable it to satisfy stakeholders faster. This is separate from the ability to pick the right problems; it is about honing the skill of shipping established solutions faster and clearing away known low-value work.
When technical teams scale the number of new features produced, it usually starts to look something like the below: the more code is shipped, the more time is spent maintaining it - exponentially more.
This relationship is colossally unsustainable. Recall which activity is high value: solving problems through code. Under this curve, the more code you ship, the slower you must ship it, because there aren’t enough engineers to maintain the code already in production. The deployed code becomes more brittle, and trust continues to be lost as downtime increases.
When considering headcount, for every new product or data engineer who builds new features, you need two or more platform or DevOps engineers to maintain the infrastructure the code runs on - or, even worse, to maintain code they didn’t write. Your team will very quickly become a cost center instead of a revenue driver - the revenue-driving activity of technical teams is solving problems, not shipping code to maintain previously shipped code.
Instead, consider a framework that takes some (hopefully small) amount of effort to set up initially, but makes adding coded solutions to production a small incremental task rather than a daunting one. Frameworks in this sense add standardization that helps with both shipping new features and maintaining them. The result: a logarithmic relationship between pushing new code and growing maintenance time. In this world, as deployed code complexity grows, the maintenance time per new feature actually decreases.
Of course, as your team grows in both headcount and responsibility, maintenance time will inevitably go up. The key is to keep a very tight grip on the rate at which maintenance time grows with respect to new feature development. This keeps your team’s ratio of high-value to low-value work favorable, so the team remains a revenue driver and a net gain.
Nothing comes for free: so at what cost can you achieve a logarithmic complexity maturity curve? Overlaying the scalable and unscalable approaches makes this clear.
In the beginning, the scalable approach will actually slow down new feature development. This is what I call the initial cost of resiliency. Your team lead will have to set up a framework for deploying code and make sure it is versatile enough to be used by a growing team across a growing number of use cases. This cost of resiliency - the period where the scalable approach is slower - is finite.
Over time, you’ll reach the crossing - the point where the investment in resiliency and the deployment framework reaches net-zero. This is where mid-size companies that don’t invest in resiliency early get overconfident. On the unscalable path, the crossing may seem like a point of “speed without failure” - shipping code is fast, maintenance is reasonable, and all seems well.
However, not long after the crossing, the cost of speed starts to take hold. All of a sudden, new code seems to break every time it ships; downtime feels inevitable; and stakeholders are finding bugs before engineers do. Without a drastic change in the code deployment process, this cost is permanent.
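To make the shape of the two curves concrete, here is a toy model - every number in it is invented purely for illustration:

```python
"""A toy model of the complexity maturity curve; all numbers are illustrative."""
import math

FRAMEWORK_SETUP = 120  # up-front hours: the initial cost of resiliency


def adhoc_maintenance(features: int) -> float:
    """Without a framework, each feature compounds maintenance on every other."""
    return 1.5 ** features  # exponential growth


def framework_maintenance(features: int) -> float:
    """With shared guardrails, marginal maintenance per feature keeps shrinking."""
    return FRAMEWORK_SETUP + 40 * math.log(features + 1)  # logarithmic growth


for n in (1, 5, 10, 15, 20):
    print(f"{n:>2} features | ad-hoc: {adhoc_maintenance(n):8.0f}h | framework: {framework_maintenance(n):5.0f}h")
```

In this toy model the crossing sits around 14 features: before it, the ad-hoc path looks cheaper; after it, the exponential term dominates.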
So, how do you achieve a scalable complexity maturity curve? The answer: a centralized framework for deploying and managing code in production.
Centralized deployment framework to drive higher team efficiency
Particularly for internal tooling teams, the problems to be solved are extremely varied and often look very different from one another. Still, putting new features into production should not be a brand-new project each time. If it is, the same human mistakes will occur time and time again, each of them requiring maintenance time. Deploying code - and scheduling it, if it needs to run repeatedly - should be repeatable, self-service, and well documented to reduce the maintenance burden on both software and data engineering teams.
The efficiency of a technical team focused on internal tooling is dependent on reducing moving parts and increasing repeatability when it comes to deploying code. For maximal efficiency, a deployment framework should possess the following:
- Repeatable and automated process. Don’t start from zero when putting code into production. Implement patterns to schedule production code the same, tested way each time. Ensure only your most senior team members can edit this process.
- Guardrails for infrastructure. The most time-consuming part of shipping new features is ensuring they work anywhere other than the one laptop they were tested on. Creating a single happy path for cloud infrastructure deployment reduces this time and creates a secure-by-default standard for shipping new features.
- Few personnel involved. A self-service system ensures no team is a bottleneck. Don’t wait on platform teams to deploy code that already abides by best practices; loop others in only when new types of code patterns need to be developed. Keep platform teams happy by making them the owners of the deployment process.
- End-to-end monitoring. Even with automation and guardrails, failures will still occur. Ensure downtime is minimized with proper alerting and backup processes in place. Keep trust by notifying stakeholders of the status of the applications and data they depend on.
Prefect is built to make all of the above possible for versatile and security-conscious software and data engineering teams alike. Our mission is to ensure internal tooling is just as resilient as external software applications. As a platform for any type of Python-centric technical team, Prefect gives managers of technical teams an overarching view to ensure the features their team is responsible for are working properly. Prefect gives your team clear visibility into what code is deployed, whether it’s running when it needs to run, and where it’s running.
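As a taste, here is what a scheduled internal job can look like - a minimal sketch assuming Prefect 2.x’s `flow`/`serve` API, with invented flow names and schedule:

```python
"""A minimal Prefect sketch; flow names and schedule are illustrative."""
from prefect import flow, task


@task(retries=2, retry_delay_seconds=60)
def sync_crm_accounts():
    ...  # fetch and upsert customer records for the sales team


@flow(log_prints=True)
def nightly_crm_sync():
    sync_crm_accounts()
    print("CRM sync complete")


if __name__ == "__main__":
    # Serve the flow on a schedule; runs, retries, and failures appear in the Prefect UI
    nightly_crm_sync.serve(name="nightly-crm-sync", cron="0 2 * * *")
```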
If increasing the efficiency and value of your technical team building internal tooling is important to you - book a demo with Prefect today.