Nurture AI Innovation Without Chaos: A Research-Backed Lean Playbook

Traditional project management fails in AI development due to high uncertainty, leading to wasted resources. Based on my 17 years in tech and research, I’ll show how Lean Startup principles can transform AI projects through rapid experimentation and validated learning, helping you deliver real business value faster.


Introduction

Traditional project management approaches are falling short in AI development, where uncertainty is the norm rather than the exception. While companies invest heavily in AI initiatives, many struggle with long development cycles that don’t deliver expected value, or worse, deliver the wrong solutions entirely.

In the face of uncertainty, the Lean methodology, with its focus on rapid experimentation and validated learning, offers a valuable approach to managing risky ventures. It was popularized by Eric Ries in his book The Lean Startup.

At the core of Lean is the Build-Measure-Learn feedback loop: build minimal experiments to test hypotheses, measure results with appropriate metrics, and learn whether to persevere or pivot. Through this systematic testing of critical assumptions and continuous customer feedback, teams can reduce waste and increase the probability of delivering meaningful solutions.

This approach has proven particularly effective in environments where traditional planning and forecasting methods falter, and it mirrors the iterative nature of AI development, where models are continuously refined through experimentation and feedback rather than built against rigid, upfront specifications. The Lean Startup draws inspiration from the scientific method, Lean Six Sigma, and the Unified Process.

Based on eight years of academic research and a decade in the software industry, I’ll share how Lean Startup principles look through the lens of AI development. You’ll discover how to:

  • Design experiments that validate or invalidate critical assumptions quickly
  • Create safe-to-fail environments that encourage innovation
  • Balance technical and adoption risks effectively
  • Build feedback loops that keep development aligned with business value
  • Scale teams while maintaining innovation velocity

Our mindset and planning will work in reverse compared to traditional project management. We must figure out what we should learn about the problem, use innovation accounting to determine what we should measure, confirm if we achieve validated learning, and then determine which component we need to build to run the experiment and obtain the measurement. Another important takeaway will be that we should always optimize for the speed of the entire loop, including client feedback.

Read the full article to learn how to transform your AI development process from a high-risk endeavor into a systematic approach for delivering value. Whether you’re leading an AI team or steering your organization’s AI strategy, you’ll find practical insights for implementing these principles.

Why do customers want AI?

Artificial Intelligence (AI) has several branches; one of them is Machine Learning (ML), also called Statistical Learning. Because of the current explosion of interest, terms like AI, ML, and deep learning are used interchangeably to mean the same thing: a piece of technology that can learn to make predictions about unseen data based on historical examples. More pedantic readers might be annoyed by this intentional “confusion”.

In the Unified Process, we have a key element called adoption: if we have adoption, we have a good chance of business continuity. Of all the ways we could categorize AI applications, I want to classify them into two categories by how they help a customer:

  • Existing Pain Alleviation
  • Business Scaling

Existing Pain Alleviation

Our clients have immediate, tangible problems that cause friction, frustration, or loss. These problems already have a KPI attached, and there is probably some sort of solution already in place within the company. All the focus should be on improving those solutions. Because it’s a clear and present need, and because we already have a KPI that the customer is comfortable with, we can easily demonstrate added value.

These solutions focus on the present and the near future. We can compare our solution with whatever the client uses today.

Examples:

  • Automating QA processes
  • Automating customer ticket routing
  • Data privacy and anonymization
  • Automatic monitoring (e.g., greenhouse gas emissions, fault detection on production lines, etc.)

Business Scaling

AI will be used to identify or tackle new opportunities that increase customer lifetime revenue. We will be able to evaluate the impact of these solutions only after a longer period of time. Because of their complexity, there will probably be only one solution put into practice, so there is nothing to compare it against. For these reasons, it is harder to prove the added value of the AI approach.

Examples:

  • Enhanced decision-making
  • Automating client acquisition
  • Churn rate prediction
  • Market trend analysis
  • Predictive maintenance
  • Supply chain optimizations

How does this impact us?

We care about this classification because customers who don’t have experience with AI/machine learning will be more open to pain alleviation: they see it as lower risk (they already have something in place) and they expect quick feedback. We can also cross-sell to clients in the first category (pain alleviation) because there are almost certainly other pains within their operations that we can help with.

Clients that routinely use AI in their companies will be more open to business scaling use cases. They already know what AI is all about, they are used to its workings and its quirks, and they are more open to innovation.

We, the service providers, might focus on a certain industry or perhaps a certain technology. However, in the beginning, it is, in my opinion, easier to structure the sales funnel and delivery team if we classify potential customers using the above two criteria.

Regardless of the ideal client profile, one of the first experiments should focus on the need: Does the client have this need? Will our AI solution gain adoption?

Why do we need Lean?

AI development faces a fundamental challenge: traditional project management assumes we can plan and forecast based on historical experiences. However, in machine learning projects, key aspects remain unknown until we interact with data and models. From model selection to data requirements, every decision carries significant uncertainty.

The core challenge isn’t just technical – it’s about making informed decisions without complete information. Traditional management approaches identify two failure modes: failing to plan or failing to execute. Yet ML projects require a different framework, as both planning and execution depend on variables that can only be discovered while experimenting.

The Lean methodology offers a structured approach to managing this uncertainty, avoiding the limitations of other methods:

  • Waterfall’s rigid planning proves ineffective when requirements and solutions are unclear
  • Agile struggles with unpredictable tasks and heterogeneous team skills
  • YOLO runs sacrifice long-term sustainability for quick wins
  • Circumscribed Freedom requires extensive resources and top research talent

In business development, Lean brings predictability from classical management while maintaining the flexibility to adapt through systematic experimentation. By focusing on validated learning and waste reduction, teams can make progress even when faced with significant unknowns. This is an excellent fit for AI and machine learning!

AI develops in an environment with significant uncertainty. What is the best model? What data do we have? What is the best architecture? What will fine-tune faster and cheaper? What metric should we use? Many questions, all easy to answer once the other aspects are known! But what if we know nothing? No data, no metric, we have just an idea? What then?

Classical delivery methodologies work when we can plan and forecast. These assumptions are false in the case of machine learning and statistical learning in general.

We shouldn’t expect engineers to know the “best” algorithm, what data to collect, or which language model (ChatGPT, Claude, etc.) to pick out of context. The math backs this up: averaged over all possible problems, every ML algorithm performs the same (the no-free-lunch theorem). We collect the data first and then figure out which features are relevant; there are dedicated algorithms just for discovering feature relevance. Public leaderboards offer initial guidance on model choices, but we need custom benchmarks for our specific applications. The specific business problem determines the optimal choices.
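
To make “discovering feature relevance” and “custom benchmarks” concrete, here is a minimal sketch on synthetic data with hypothetical feature indices, using scikit-learn’s permutation importance: we let a held-out evaluation, not intuition, tell us which inputs actually matter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the client's tabular data.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permutation importance: shuffle one feature at a time and watch the score drop.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```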

In general management, there are two main reasons for failure: failure to execute and failure to plan. However, in order to plan, you need to know about downstream constraints and resources. In order to execute, you need clear requirements, procedures, and ways to measure success. Unfortunately, data-driven projects offer no such guarantees. The best way to approach problems with a high degree of uncertainty is to reduce these uncertainties as much as possible. Like in mathematical optimization, Lean reduces the uncertainties by short and limited iterations. In my opinion, this is the fundamental difference between classical management (waterfall, agile) and machine learning management.

Another popular methodology used in AI is YOLO (You Only Live Once). A YOLO run, as a management strategy, means just doing something without much thought. It doesn’t really work in the long term because it offers no strategy for sustainability. It might bring some short-term gains through chaos and pure luck.

Lean brings the best of both worlds: some predictability from classical management and some of the freedom of YOLO runs. The core tool is the experiment. The outcome of an experiment is to validate, or better yet invalidate, a supposition or assumption. You want to find this out as soon as possible and with minimal resources spent investigating the current question.

The experiment

The experiment is at the core of AI development, and it overlaps with the Build-Measure-Learn phases in Lean. Here, I try to highlight some AI-specific topics that, of course, are not present in the original Lean methodology.

The design of the experiment follows the scientific method. We start with a clear hypothesis that makes predictions about something we care about. As in the scientific method, this hypothesis must be falsifiable. We then proceed to implement tests to falsify this hypothesis. After the experiment is run, we draw conclusions.

We will be running these experiments one after another, each one drawing its specifications from previous experiments and their conclusions.

In this chapter we will focus on the mechanics of the experiment; in the “Manage the experiment sequences” chapter, we will cover how to choose a good hypothesis stream and how to think strategically about the experiments.

Don’t over-design

In many business and engineering contexts—especially those inspired by lean principles—there’s a strong focus on conducting rapid, minimal experiments. The purpose of such experiments is not to develop elaborate product features but rather to falsify or validate a hypothesis as quickly as possible. Once you achieve a successful outcome, that experiment becomes the baseline for the next iteration.

In machine learning projects, each small-scale experiment should serve as a stepping stone for the next one. This means minimizing extraneous complications. Consider a simple entity recognition scenario, where the goal is to test how well a model identifies specific terms within text. Instead of building a flashy web application with authentication, fancy interfaces, and integrated feedback loops, a quicker approach might be to dump the results into a spreadsheet and share them with the handful of people responsible for evaluation. This drastically reduces development time while still providing enough information to confirm or refute your hypothesis.
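
As an illustration of how little code such an experiment needs, here is a minimal sketch that dumps entity-recognition outputs into a CSV for a handful of reviewers. The `extract_entities` function is a hypothetical placeholder for whatever model is actually under test, and the file name and columns are assumptions.

```python
import csv

def extract_entities(text: str) -> list[str]:
    # Hypothetical placeholder for the model under test (an LLM call, a spaCy
    # pipeline, a fine-tuned transformer, ...).
    return [token for token in text.split() if token.istitle()]

documents = [
    "Acme Corp signed a contract with Globex in Berlin.",
    "The shipment left Rotterdam on Tuesday.",
]

with open("entity_review.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["document", "extracted_entities", "reviewer_verdict", "comments"])
    for doc in documents:
        # Leave the last two columns empty: the evaluators fill them in by hand.
        writer.writerow([doc, "; ".join(extract_entities(doc)), "", ""])
```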

That doesn’t mean you’ll always rely on spreadsheets. Eventually, you might have a real-time system in production. But a minimal solution upfront lets you see if it’s even worthwhile before investing heavily in more features. In another example, if you hypothesize that a specific user experience (UX) will increase adoption, you can test it right away with a small model that barely works but illustrates the user experience. If users complain about poor results, that’s a sign they truly want the solution to succeed. If they say “Okay, continue” without much testing, you’ve learned that the solution may not be compelling enough to warrant major investment.

There can be worries about how a company’s image might suffer if early releases look “unpolished.” In a later chapter, I’ll discuss strategies to protect your organization from the potential downsides of these experiments. For now, remember that every small test that fails fast is better than a large project that fails slowly and expensively.

Explore vs exploit

When doing machine learning, there is always a trade-off between exploitation and exploration. For those who don’t remember the exact definitions: exploitation means optimizing and extracting as much as possible from current methods—fine-tuning the existing pipeline with more data, exploring more data augmentation, etc. Exploration means changing the current approach in a radical way—for example, moving from using large language models to a dedicated smaller ConvNet. Ideas involving radically changing the UX in ways that alter the flow of information and the way users make decisions are another example.

At the beginning of a project, I would advise exploring more ideas, and when the project is more mature, insisting on exploitation. The decision to switch back to exploration is difficult to make and should happen when there are signs that the current method cannot scale with the available data or compute. In my opinion, this is what the “pivot or persevere” decision from The Lean Startup means for us.

This principle aligns with what Richard Sutton calls “the bitter lesson,” which also resonates with Eric Ries’s advice: favor solutions that scale well with more data and compute, rather than clever approaches that may yield short-term wins but become difficult to expand. Don’t jump straight to a massive architecture if you only have a tiny dataset, because swapping out models later is often trivial. Instead, focus on identifying solutions that will remain manageable and scalable over time. When they stop scaling, pivot!

The natural growth of a machine learning project is to start from a few samples with a foundation model, validate the need, validate the UX, validate the added value, and then move to improve the machine learning metrics and optimize for inference costs and other criteria such as privacy or safe harbor data handling. There are, of course, other needs from case to case. For example, we might need to add domain drift monitoring early on because the initial analysis considered it a risk. We could expect, after a point, to have a very automated process where data is ingested, a model is trained, and automatically deployed to the user. In recent years, people have called this process the “data flywheel.” At a philosophical level, we should aim to build this data flywheel as soon as possible. This means that the scaling laws and the “bitter lesson” will work to our advantage. At this point, the role of innovation diminishes, and we might be ready to pivot out of this project entirely. 

The exploration-versus-exploitation decision should be made together with the stakeholders only when it goes against the natural development of a machine learning project. It is possible that some assumptions are initially validated because, on a small data sample, everything looks possible, but once more data is available we might find out that the hypothesis is false. This is a well-known fact in statistics, but I personally find little value in computing statistical significance and tight confidence intervals for every measurement or decision.

Another strategy that can be employed, especially for larger projects, is to go breadth-wise instead of depth-wise. That is, test a diverse set of hypotheses, and for each hypothesis, don’t spend more than one iteration. Repeat the process, doing another iteration for all the hypotheses that weren’t falsified. This is the most subtle exploitation versus exploration type of decision. It works well for mature teams.

What to measure?

You might have heard the mantra “We are here to increase the value for the shareholders”. Unfortunately, evaluating a process by this value is hugely problematic for one reason: it is a deeply lagging indicator. We need proxies for this value. For example, happy customers are a better indicator because a happy customer brings revenue. But even this might lag too much.

Regular ML metrics (accuracy, F1, mAP, ROUGE-x, etc.) are fast to compute, but they say nothing directly about our customers. What is the solution?

In Eric Ries’s book, there is a whole chapter on how we should measure innovation and how we should perform the accounting for our Build-Measure-Learn loop. We must understand the business aspects to a point where we can design metrics that capture if we are “building the right thing”. We also should strive to have the following chain of causality: 

Shareholder value -> customer value -> business metric -> ML metric -> ML loss. 

The first two are aspirational and hard to measure, but the rest are quite concrete and actionable. In practice, we are happy if we can replace causation with demonstrated correlation, for example, cohort runs where a good mAP for an object detector translates into a good business KPI that we care about (e.g., churn rate).

Don’t be afraid of negative feedback from the customers exposed to an experiment. Sometimes it is a good indicator for progress! Imagine a scenario where a solution is completely ignored. The minimum agreed number of testers are giving a short “Works as expected” feedback. However, if our solution is expected to bring value to some tasks that we assume are important, people will complain! A long and consistent feedback review (even full of negative remarks) is a strong proxy for high adoption rate!

50 pages of things that don’t work is a strong negative signal in a regular management setup.  We did things wrong. Negative feedback means failure to execute, failure to test, failure to plan, failure to properly set the Definition of Done.

For an experiment testing the adoption assumption with a small PoC those 50 pages are a very strong positive signal! We hit a pain point! Time to focus more energy there!

Do you feel the mindset shift from classical management to lean management?

Below is an almost 1:1 reproduction from Eric Ries’s book of what makes a good business metric:

  • Actionable: A good metric should be able to demonstrate a clear cause-and-effect relationship. If this is not true, chances are that this metric is a vanity metric. One example of a vanity metric is to record the number of impressions without considering how these impressions came to our website. A good, actionable metric is, for example, the churn rate for users who were exposed to our new experiment.
  • Accessible: Accessibility has several interpretations. One is that metrics should be available to everybody, even actively pushed within the company. However, this goes beyond a simple report easily accessible in the company’s management system. We should be able to quickly compare metrics across different experiments to be able to do analyses like cohort analysis, which is very powerful for doing correlations between different metrics or between our metrics and some external indicators. Usually, this means some intentional effort, for example building a dashboard. It is very important that every employee, from the intern to the CTO, has access to the metrics!
  • Auditable: This is quite simple: we must be sure that the metric is true and that it reflects reality. Both the client and us should be able to validate the metric evaluation pipeline. There should be a minimum level of processing; the data processes should be clear, simple, documented, and traceable. An employee should be able to take the raw data and compute the metric by hand, provided that they have enough time. We should go as far as possible, including tracking the original user who generated the data. Be creative around things like GDPR or data privacy! There are plenty of examples around us (OpenAI, Google routinely collect user data from their lower-tier plans).
  • “Boots on the ground”: Data is not everything, and intuition still plays a role when designing a metric. Go and talk to the customers, to the end-users! “Go out of the building, and see for yourself” how the ML solutions are working in practice. A machine learning engineer might spot some subtle problems early on. It is not impossible to capture these problems from the data, but they will be caught only when we have enough data for these subtle effects to have a non-trivial representation in our dataset.

Once a good business metric is identified, monitor it to see how well it correlates with the ML metrics. In general, even a good business metric can still be lagging, but an ML metric is always leading. We can measure something like the churn rate only a few weeks after deployment, but we can estimate mAP for a zero-shot object detector before it is deployed. If the correlation is strong, we have a new tool to estimate how well our current training session is doing!
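
A minimal sketch of this correlation check, with invented cohort numbers, might look like the following; the point is only to show how cheap it is to ask whether the leading ML metric tracks the lagging business one.

```python
import numpy as np

# Per-cohort measurements; the numbers are illustrative, not real results.
map_at_deploy = np.array([0.42, 0.51, 0.58, 0.61, 0.70])  # leading: known before rollout
churn_rate = np.array([0.19, 0.17, 0.15, 0.15, 0.12])     # lagging: known weeks later

corr = np.corrcoef(map_at_deploy, churn_rate)[0, 1]
print(f"Pearson correlation between mAP and churn: {corr:.2f}")
# A strongly negative value suggests mAP can serve as an early proxy for the
# lagging business metric on the next training run.
```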

Implementation

It is time to write some code! Here we have the classical software development steps: we analyze the use case, we derive some requirements, we implement the code.

The specifications will be derived from the hypothesis. We do tests (unit tests or manual integration tests) to see that we can detect all the possible outcomes for our hypothesis.

This implementation step should be scoped only up to the point where we are convinced that our software is correct and that it will be able to capture the situations where the hypothesis is false.

Are the performance metrics measured properly? For these aspects, I would definitely write some unit tests. I would also like to see my pipeline overfit a small batch, or see what the performance is for a totally randomly initialized model. These are integration tests that are easy to run and whose expected result can be computed on a napkin. If the model is randomly initialized and the performance metric comes out way above chance, we have a serious issue. The same goes for overfitting: if we cannot overfit a small batch, we should start a lengthy debugging session to find out why not!
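
Here is a minimal sketch of those two napkin tests in PyTorch, using a tiny made-up model and random data as stand-ins for the real pipeline.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes = 10
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, num_classes))
x, y = torch.randn(64, 32), torch.randint(0, num_classes, (64,))

# (1) Random-init baseline: accuracy should sit near 1 / num_classes.
with torch.no_grad():
    acc = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"random-init accuracy: {acc:.2f} (chance is {1 / num_classes:.2f})")

# (2) Overfit one small batch: the loss should drop close to zero.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print(f"loss after overfitting one batch: {loss.item():.4f}")
```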

The software we just wrote has to be able to tell us whether the hypothesis is true, so we need tests in place proving that both true and false hypotheses can be detected. During software implementation, I like to use a Kanban process.

This is maybe the most familiar part of the process so far. Everybody knows how to do software development once the specifications are in place! And requirements are not hard to derive once the goal and the scope are known!

Even from the first experiment, we should understand if the experiment code will be a production-ready feature or a throw-away. That is, if the code is delivered to the client, it should be properly tested, documented, maintained, and overall, enough attention should be paid to it so it will not become technical debt. In the middle of the project, we might have a lot of failed experiments, so there are good chances that an early victory will stay in production for a long time.

For throw-away code, put in just enough software engineering effort to make sure the results generated by the experiment are trustworthy.

Metaphorically speaking, the reliability of the experiment is the sole criterion for the Definition of Done. The results (whether positive or negative) should not be a factor in determining completion.

Run

Once the experiment is designed, we run it. This usually means taking a piece of production data and running it through our software. An inference run, for example. Training is also part of the Run step. 

In more mature phases of the project, sometimes it will mean that we will deploy our piece of software in production. Feel free to use canary deployments and the whole plethora of software engineering tools like microservices to make your life easier!

In the experiment lifecycle, this is a distinct step because it requires active monitoring from the engineers. It usually takes a non-trivial amount of time, sometimes weeks! Errors happen! NaNs, OOMs, and other three-letter words keep us up at night! Hence the importance of tests and small trial runs before we provision and launch large training runs.
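
As a sketch of the kind of guard rail that makes these runs less stressful, here is a minimal training-loop wrapper that aborts on non-finite losses and checkpoints periodically; `model`, `optimizer`, `train_step`, and `data_loader` are assumed to come from your own pipeline, and `train_step` is assumed to return the loss as a float.

```python
import math
import torch

def guarded_training_loop(model, optimizer, train_step, data_loader,
                          checkpoint_path="checkpoint.pt", save_every=500):
    for step, batch in enumerate(data_loader):
        loss = train_step(model, optimizer, batch)  # assumed to return the loss as a float
        if not math.isfinite(loss):
            # Fail fast and loudly instead of burning compute on a dead run.
            raise RuntimeError(f"non-finite loss {loss} at step {step}")
        if step % save_every == 0:
            # Periodic checkpoints mean an OOM or a preemption costs hours, not weeks.
            torch.save({"step": step,
                        "model": model.state_dict(),
                        "optimizer": optimizer.state_dict()},
                       checkpoint_path)
```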

Evaluate

The goal of the experiment is to test our hypothesis. After we run our experiment, we have a step where we analyze the results. This is perhaps one of the hardest steps in the process. 

First of all, we have to make sure that the experiment is properly run. For example, if we are training a model, did we allocate enough compute and time? Or if we are testing a model in production, did we capture a relevant sample of data? 

Fortunately, there are tools in machine learning and statistics that will guide us. Analyzing the results is again a very cognitively intensive task. Regardless of how well we designed the experiment, we will have noise, we will have ambiguous results, lots of variability, and lots of variables that we ignored in the experiment design. 

It is not uncommon to go back to the experiment setup, change one of those things, and run a small subset just to validate that certain unexpected results are true and not a bug or a transient quirk of the data. This step must be thoroughly documented, and the raw data and the code that interpreted this data should be versioned with enough software engineering best practices so we can reproduce them at a later time.

Conclude

Once the analysis phase is over, we can draw conclusions. For example, given that we have one million data points and we trained for 100 hours of compute, we got 95% accuracy. Okay, this is just a simple example. 

In practice, I would very much like to do 200 train-validation runs to get a sense of the variability and also plot some learning curves. However, a positive result could be that we now have a decent enough model to be delivered to the client because, in a previous experiment, we observed that accuracy above 90% will bring more value to the client than the existing solution.

A negative result could be that the learning curves show us that we will need 10 million data points to get to a break-even accuracy that is relevant for the client. With this conclusion, for example, we can go to our manager and propose to the client to leave a data collection system in place, and we will return after six months with new experiments. 
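
A minimal sketch of that learning-curve extrapolation, with invented measurements, could fit a power law to the observed error at each dataset size and extrapolate towards the target; both the functional form and the numbers are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Observed points on the learning curve (dataset size vs. error); invented numbers.
n_samples = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
error = np.array([0.30, 0.24, 0.18, 0.14, 0.11])  # 1 - accuracy at each size

def power_law(n, a, b, c):
    # Classic learning-curve shape: error decays as a power of the dataset size.
    return a * n ** (-b) + c

params, _ = curve_fit(power_law, n_samples, error,
                      p0=[1.0, 0.2, 0.05], bounds=([0, 0, 0], [np.inf, 2, 1]))

# Extrapolate: roughly how much data would a target accuracy require?
for n in (3e6, 1e7):
    print(f"{n:.0e} samples -> predicted accuracy ~{1 - power_law(n, *params):.1%}")
```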

This data-driven decision-making is what makes lean very efficient. Regardless of the outcome of the experiment, we can quickly put forward a solution for the client.

Documentation

Document everything you are doing, from simple coding to more complex strategic thinking and decision-making. Encourage each team member to keep a personal log where they write what they intend to do at the beginning of the day and where they log each decision or open question they had to address during the day. We can expect to cut corners, especially when writing dead code, but when that code is taken into production, we must be aware of these shortcuts! So document them!

In time, once the team is used to this added friction, we might set up a centralized place for the documentation. Don’t overdo it; use screenshots, copy-paste links and code snippets, and don’t over-editorialize every piece of text that goes into these logs! The emphasis should be on the quantity of information provided, not on storytelling and aesthetics!

Hypotheses should be documented even more thoroughly because the decision process must be auditable and sometimes reproducible. Every meeting should have notes taken, even automatically, and those notes must be attached to the decision point. Imagine a scenario where we decide to test a hypothesis, and three weeks later somebody has to implement code to numerically evaluate whatever was discussed in that meeting. The high-level specifications will probably still be clear, but the minute details will be lost!

Solid and robust documentation will also be extremely helpful at the end of the experimentation phase when we will have to deliver the project to the production unit!

Domain drift is present in a lot of problems, and if the system we put in place to handle it automatically is not doing its job properly, we have to go back and investigate. It is extremely frustrating to look at a piece of code implementing lengthy matrix operations without understanding why those operations were needed and what their intent was, especially if they exist to serve some business metric.

Some team members might consider their idea as brilliant and will try to somehow “protect” it by not properly documenting what they did. The machine learning field evolves at breakneck speed so there’s a good chance that your “big idea” has already been considered somewhere. That’s why you’ll sometimes hear, “Your idea is worthless”—meaning an idea alone, without timely and effective execution, counts for little. Speed and execution are the real moats in this field!

Discipline is key in documentation. Imagine yourself a few months into the future, looking at a bunch of CSV files with meaningless column names and a pile of code that processes those CSVs. Wouldn’t it be nice to know exactly what you have to do to get some business information out of that data? Comprehensive documentation goes a long way towards many goals, from “trivial” issues like reproducibility to more subtle aspects. It will help with the eventual handover, and it will help managers track progress and monitor team activity. Very few managers will ask “what are you doing?” when you can point out how the current line of code tracks to the company’s KPIs! Documentation is a key part of the Lean process! Ignore it or treat it as a side activity and you are producing waste.

Desirable properties of an experiment

There are several key elements for an experiment to be successful. Keep in mind that the definition of success (or Definition of Done) is when we learn something new, when we reduce the degree of uncertainty.

A good experiment should be small enough to get results quickly. For example, when training a model, general machine learning knowledge already gives us a sense of the compute and data required for the selected method. We should not overdo it, but we should also book a decent margin when provisioning compute and data because, you know, surprises and bugs are all over the place.

The experiment should ideally be time-boxed rather than estimated. The account manager and the project manager, especially, should know the client’s timeline and tolerance, and they should advise us engineers on how much time we have for one iteration. Once we know how much time we have, we can decide how much to allocate to each of the four steps of the experiment, perhaps in a simple table: early on, the training might be done in two days in a Jupyter notebook while the analysis takes weeks; at a later stage, we could train the system for weeks while the data analysis takes half an hour because we already have all the tooling from previous iterations. The variability here is huge, so use your common sense and experience.

It is common to have a 4 week experiment “box” in the beginning, while for mature projects, one experiment could take 6 months, up to a year!

You should start experimenting immediately without too much analysis and domain knowledge acquisition. Of course, it’s always an art because if you start experimenting without the fundamental know-how of the business domain, you will search for the wrong thing. A rule of thumb: if you’re spending many months understanding the domain before thinking of the first experiment, maybe it is too much.

The experiment should be run as close as possible to real production data. Ideally, the client should always be involved; at the very least, they should review the results and come back with honest feedback. This is a good chance to capture situations where we are building the wrong thing. A common mistake in the early stages of any project is mistaking a data feature for the prediction target.

Even for large projects, it’s okay to start small as long as the experiment is big enough to support the conclusion. For example, with the advent of foundation models, we could start on an object detection task with five samples! The goal of this experiment might be, for example, to see how well zero-shot detection works, but more importantly, it would show the customer how such an automated system will look and feel. The customer might bluntly reject the whole system because of some UX or integration decisions that have nothing to do with model accuracy and everything to do with the way it fits their workflow. Imagine if you discovered this after many months of data collection, manual labeling, and wasted compute. This is another example of how Lean reduces waste.
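
As a sketch of such a five-sample experiment, assuming the Hugging Face transformers zero-shot object detection pipeline and the OWL-ViT checkpoint are acceptable choices, the whole thing can be a dozen lines; the image paths and defect labels below are hypothetical.

```python
from PIL import Image
from transformers import pipeline

# Zero-shot detector from a public checkpoint; no training data needed yet.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

image_paths = ["sample_01.jpg", "sample_02.jpg", "sample_03.jpg",
               "sample_04.jpg", "sample_05.jpg"]   # the five client samples
labels = ["missing screw", "scratch", "dent"]       # defects the client cares about

for path in image_paths:
    for det in detector(Image.open(path), candidate_labels=labels):
        print(path, det["label"], round(det["score"], 2), det["box"])
```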

The experiment should always consider the long-term vision and the long-term business goals of the client, and not just whatever tools or data we happen to have at our disposal. In other words, don’t fall for the streetlight fallacy: searching where the light is instead of where the keys were lost.

To summarize, we have to keep the experiment small and keep the customer as close as possible in the experiment loop!

Safe to fail for the team members

While conventional wisdom dictates measuring productivity through concrete deliverables, a deeper understanding of Lean principles reveals the paramount importance of prioritizing learning itself. This chapter focuses on how to monitor and evaluate the team.

Evaluating team members and their careers while doing Lean comes with its challenges. However, the key takeaway is that the working environment should be safe-to-fail.

In Agile, we evaluate how many story points were solved, how many bugs were introduced, and so on. When the project is suited to Agile, these “productivity” metrics correlate with project success. Here in Lean, with four distinct phases of which coding is only one, the classical “lines of code written” metrics are dangerous. We reduce waste when we optimize for the speed and quality of learning rather than for simple “productivity” metrics.

Let’s start with something simple: clean code and quality of the code. As you’ve seen so far, our goal for one experiment is to understand things, to learn things, and it’s not about delivering features. Delivering a feature happens by accident. As a result, most of our code is dead code. This is especially true if a technical hypothesis turns out to be false.

Now, if it is dead code and we know it, it doesn’t make much sense to have 100% test coverage… or does it? This is a bit of an art. We should have enough testing and validation in place to make sure that our code is bug-free. For example, if we are doing a training run, I would focus on testing parts of the pipeline, especially if we have some custom transformations or data loaders.

Do we insist on (and hence measure) code quality or not? It depends on what we are experimenting on and on the maturity of the project: yes if it impacts the results, no if it merely reduces technical debt on dead code.

Below is another example where the right way to monitor the performance leads to success!

When I was teaching first-year computer science students, I saw in my students a blockage, a fear of doing things. After some digging, I realized they were afraid to fail. Throughout their high school experience, they understood that things must be perfect from the first key pressed on the keyboard. If they failed to properly implement some algorithm, it means they failed to design it properly. If they failed at the design, it is a clear sign that they failed to learn their lesson! In fact, the problem was that they weren’t provided with the necessary tools to figure out the bugs in the code!

I tried to change this mentality by jokingly telling them, “The one who does not work is the one that does not make mistakes.” With this joke, I was trying to explain that I expected to see failures, that they should expect to see failures, and that the only true failure is to do nothing. Some of them were encouraged by this and started to submit code and ask for help. Guess who were the top students at the end of the semester?

We should treat the outcome of an experiment like the outcome of a learning class. The number of compile errors during a semester might make a good vanity metric, but it is not relevant for the final grade.

Under no circumstances should a positive (or negative) experiment result be tied to the team performance review! Add something innocent like “the number of hypotheses that were true” at a weight of just 1% of the total 360 performance review and you will get exactly what you asked for: very few experiments will end up disproving the chosen assumptions, a 100% success rate! There will be one caveat, though: the client will want their money back, plus or minus some damages for the lost revenue and for the gains the promised solution would have delivered had it worked.

In case the above paragraph wasn’t sarcastic enough, let me be explicit: the result of the experiment is a data point! Rigor in implementation and in testing can and should be a metric in evaluating team performance, but the outcome of the experiment must not be! This is paramount and should be clearly specified in the “definition of done” for the experiment.

Manage the experiment sequences

The Pareto principle tells us that 80% of the work will focus on 20% of the issues. In ML, the ratio is even more extreme: we might spend 5% of the time deciding on the best next step and 95% of the time confirming or invalidating that decision. Choosing the next hypothesis and the future directions takes 5% of the time but matters the most for the success of the project! For this to work in a sustainable manner, we have to have the proper procedures and mindset in place.

Every plan we devise lasts only until the first contact with reality. The team leaders must have the ability and the tools to discover which parts of the plan are working and which assumptions need to be adapted. This discussion is all about balancing and finding the right blend. Unlike for a single experiment, there is no recipe here, and we always have to keep an eye on the business objectives.

Where to start?

How do we start an AI initiative? This is a classic chicken-and-egg problem. You are a small business that wants to start delivering solutions in the AI area, or you work at a larger corporation and are tasked with creating an AI task force. How do you start? 

Philosophically, regarding the chicken-and-egg problem, we can translate it into AI by asking ourselves, “What is the first thing we have to solve?” Should we hire people who know how to do machine learning? If yes, in what areas should we hire? Should we start collecting data? What data should we collect? What is the best bet? Should we start investing in hardware? Should we start investigating the latest foundation models? 

The answer, as in the chicken-and-egg problem, is clear and simple: start with the problem! The problem is the lens that will focus everything that will come next. Now, obeying the “safe-to-fail” mantra, we should pick a problem that is easy for us, as beginners.

I recommend a simple four-point checklist to identify the best candidate problems. I suggest starting internally, then going to close acquaintances, and only then going to the wider world. In the case you are a founder, try to solve something that you care about, a problem that you personally have. For a corporation, it means searching internally for problems that fit these four criteria. Of course, if there are no such problems (which I highly doubt), move on to family acquaintances, or, in the case of a corporation, start searching at your oldest and most friendly customers! They will grant you the needed access, and you already have a certain rapport that you can leverage while experimenting with machine learning.

Once the problem is “found”, you will find that the initial questions will be easily answered! Those that don’t have an obvious answer? Well, make an assumption and put it to the test! You have the tools!

Falsifiable hypothesis

The hypothesis is what gives us the goal of the experiment and basically drives it. A good hypothesis has some characteristics. The most important one is that the hypothesis must be falsifiable. We must be able to invalidate the hypothesis with an experiment. This means that there is a possible outcome of the experiment where the hypothesis is false. A failed delivery, by regular management standards.

In science, we have the classical black swan example: we make the hypothesis that all swans in the world are white. The experiment is that we are looking for swans and noting their color. This hypothesis is falsifiable because it is enough to find one swan that is black. 

Of course, once the data is collected, it is not completely useless, because we can make another assumption, another hypothesis, that sounds like “90% of the swans in the world are white.” Again, this can be falsified, depending on how precisely we want to determine the percentage of white swans versus black swans. Business environments in general can tolerate larger confidence intervals. For example, after a few experiments, we can safely claim that we are selling white swans because we know they come in quite a large supply.
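
As a minimal sketch of how precise (or imprecise) such a claim needs to be, here is a normal-approximation confidence interval around an invented count of observed swans.

```python
import math

# Invented counts: swans observed so far and how many were white.
white, total = 172, 200
p_hat = white / total
z = 1.96  # ~95% confidence
margin = z * math.sqrt(p_hat * (1 - p_hat) / total)

print(f"observed {p_hat:.1%} white, 95% CI [{p_hat - margin:.1%}, {p_hat + margin:.1%}]")
# If the claimed 90% lies outside this interval, the refined hypothesis is
# falsified at this confidence level; a business decision may be happy with a
# much wider interval and far fewer observations.
```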

As in business planning, in management, we have certain assumptions. An example of an assumption is that the client will have tabular data, so XGBoost will solve the problem! While this affirmation is deeply rooted in many years of machine learning experience, it might be false in our case! Just because the client keeps their data in an Excel file doesn’t mean that it is not a time series, for example. Or maybe we have a discrete signal, and some transformer-based methods will be the best! Testing each of these assumptions—what the client has and what is the best algorithm—has to be integrated into our strategy.

The Build-Measure-Learn loop (analyze and then take decisions) sounds very appealing, and it is! Each step forward has a rationale behind it and is justified with cold, hard facts. However, we don’t move in a linear space but in a multi-dimensional universe. Because each decision is directed solely by the previous point, one reasonable concern is that the overall direction will be a Brownian motion. That is, we will spend a lot of time and effort; every decision is based on facts; however, after a few years, we will discover that we are in the same place as we started. This is, of course, a very undesirable outcome.

In order to make progress toward our goal, we need a strategy. In Lean, the meta-strategy is to design a system that can systematically test and challenge these assumptions. We need to perform the testing in the most objective way possible while not losing sight of the overall project vision. Choosing the next hypothesis to test is how we steer the overall project towards success.

Riskiest things versus safe to fail

We must conduct experiments that help us determine what techniques will work in our unique circumstances. At a philosophical level, we must figure out what the right questions are to ask. And the most important questions are the ones that can derail the whole endeavor! So we must start with these questions first! On the other hand, “If you cannot fail, you cannot learn.” (Eric Ries) For machine learning, it is paramount to run the experiments in an environment that is safe to fail.

We have two concepts here that are apparently in contradiction. But they are not! We must ask the riskiest questions first, and the answers must be researched in a safe-to-fail environment.

Riskiest things first

In the Unified Process, we are told to have a risk-focused mindset. At the beginning of a project, we should identify the most critical areas and address the greatest risks first. In The Lean Startup, Eric Ries identifies two categories of risks: value hypothesis and growth hypothesis. In machine learning, we can isolate two broad categories: adoption risk and technical risk.

Adoption risk means ensuring that the client integrates what we produce into their workflow and that it helps their business.

Technical risk is more familiar and includes everything from infrastructure costs to machine learning-specific risks, for example, not being able to extract sufficient ML performance from the available data.

Our task as managers and engineers is to put these risks on a common scale and start tackling them in descending order of importance.

Of course, things are never easy, so other constraints will come into play, for example, business budgets or project structure. If we have a time-and-materials contract, it is very easy to stop if we find that it is not possible to achieve the machine learning performance required by the client. Things get trickier if we have a fixed-price contract where certain minimal ML performance criteria are set. Unfortunately, there is no quick fix there. Transparency, honesty, and care usually help.

This is not the place to digress, but it is worth cautioning that bringing AI into a company means transformation at all levels where we interact with the client, not only within the delivery team. This includes sales, account management, contracting, etc. I suggest studying change management to ease this transformation. Setting the client’s expectations at a proper level helps a lot but has to be done very early in the relationship.

The “riskiest things first” are those hypotheses that must hold true for the whole business proposition to make sense. For example, if the client has a strong rejection towards our solution, then there is no point in further developing the machine learning. All business partners should understand this quickly and reduce waste on both sides. In other management frameworks this usually doesn’t happen. First of all, the wrong methodology might punish a negative result, so nobody on the team will want to take the fall for producing one. Second, we might be tempted to start with the “showy” things to get things going and get that initial dopamine rush. Third, from a managerial standpoint, it is sometimes hard to see where the risks come from. Finally, the business department is also very interested in continuing the relationship with the client, and basically in billing the client, so they might push against things that are risky for the business relationship.

The complementary “safe-to-fail” concept is very relevant here because, as soon as we start the project and begin testing the risks, we might fail very quickly. If the environment is not safe-to-fail, there is a strong incentive to push the riskiest things as late as possible. In the above scenario, if we discover in the last month that there is strong resistance from the end-users, we will have absolutely zero time to understand why and to find workarounds. The project is doomed. If, in the same context, we had addressed this risk earlier, we could have spent perhaps six months finding a solution to the adoption issue. Plenty of time to adjust the expectations or even the project scoping.

Our goal and our core value is to reduce waste, so we should push back: identify both technical and business risks, rank them in descending order of severity, and focus on the top five.

I will not dive too much into machine learning risks because this is where all of us engineers excel! Doing the technical trade-offs to meet some KPIs is our day job, and we love it. I am extremely confident that whatever life throws at you, you will find an elegant and innovative solution! Again, searching for this solution will involve some failed experiments, so, for the Nth time, we need a safe-to-fail environment.

A word of caution: in the early stages of “normal” machine learning projects, most risks will be business risks. Later in development, the risks will shift towards technical elements. For example, we can meet the machine learning metric criteria, but we cannot meet the compute criteria or the latency criteria. In the case of moonshot projects, for example, building the next AGI system or something that will have 99.9% accuracy, these are probably technical risks. If such a risk surfaces and it is a blocker, I recommend exploring different methodologies to deal with this blocker or constraint.

For quick wins and trust building, you could focus on the biggest pain points of the client. Make sure you identify a pain point for which there already exists a metric because there are great chances that for this problem there are enough historical data and mechanisms to deal with failure, and, most importantly, the hunger to solve it. Any new solution that is better by any metric compared to the existing solution will ease the pain of the client, making the ML solution an instant adoption. As time goes on, this solution already provides value to the customer, and they will know what to expect from other, perhaps more complex or more subtle improvements. Keep in mind that ML performance is not the only metric. For example, if we can save a few seconds from a manual operation just because our algorithm proposes a better default in a graphical interface, that’s a win!

Safe to fail environment

We need a safe-to-fail environment in which to deploy our experiment. That is, if there are blunt failures, like the API being offline, or subtle failures, like an overfitted model, these failures must not impact the client in a negative and unexpected way! This conflicts with the requirement that the client should always be included in the loop! It is something that should be addressed from the first contact (sales) with the client. We can find technical workarounds, for example canary deployments, A/B testing, or hiring a temporary team that manually validates each output of the machine learning system. However, no technical workaround beats transparency and expectation management when it comes to handling expected failures on the client side.
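
As one possible technical workaround, here is a minimal sketch of a shadow-mode variant of a canary rollout: a small, configurable fraction of requests also goes to the experimental model, but the user always receives the stable answer. `stable_model` and `experimental_model` are placeholders for your own services.

```python
import logging
import random

logger = logging.getLogger("shadow")

def predict_with_shadow(request, stable_model, experimental_model, shadow_fraction=0.05):
    """Serve the stable model; quietly exercise the experimental one on a sample."""
    stable_answer = stable_model(request)  # the user always gets this answer
    if random.random() < shadow_fraction:
        try:
            candidate = experimental_model(request)
            logger.info("shadow request=%r stable=%r candidate=%r",
                        request, stable_answer, candidate)
        except Exception:
            # A broken experiment must never reach the client.
            logger.exception("shadow call failed")
    return stable_answer
```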

Choosing the best solution is paramount! In the worst cases, people might get hurt, lawsuits might be filed, and business might be lost. Not including the client in the loop might be okay for the first two or three iterations, but it will soon be fatal for the whole project. All of us, including the client, must come up with solutions to test the algorithm as close as possible to real life but in a safe-to-fail way. If we don’t do this, we risk failing the whole project.

If the delivery must be perfect and tolerance for failure is zero, AI might not be the right answer. That is a very important thing to learn as soon as possible! Riskiest things first!

A good example of a safe-to-fail environment is when the client already has a process in place to handle failures. QA failures might be absorbed by some other business process, like a 110% reimbursement program for defective products.

There is another type of failure that we should watch out for. In some situations, we could damage our clients’ reputation. If there is such a possibility, some non-machine-learning techniques must be put in place to prevent such things. Unfortunately, with generative AI, these risks have increased significantly. Human-in-the-loop is the best safeguard; however, with a bit of UX design and inventiveness, we should be able to protect our clients. If it’s not possible, by all means, raise this issue as soon as possible with all the stakeholders!

This is a short checklist for a safe-to-fail experimentation environment. Again heavily inspired by Eric Ries:

  • Scoped: Both the team and the company should clearly understand which customer segments or product features will be subject to experimentation. In the beginning, we could start with a proof of concept (PoC), so the whole project is basically in scope. However, as time goes on and as we solve problems, our scope will become narrower in terms of the percentage of the affected product. Even if, in absolute terms, we experiment with larger parts of the product, proportionally we will likely play with 10% or 20% of the whole feature set.
  • Ownership and stake: The team should have independent authority and, if possible, a personal stake in the outcome. The personal stake is a strong motivator, and, according to Eric Ries, it’s rarely financial. He draws a parallel with the Toyota concept of shusa (chief engineer). For machine learning, I recommend freedom to discuss achievements publicly. Machine learning is an open-source-driven field. Almost everything that has created value in the machine learning field is free as in open source. Google released TensorFlow as open source, Meta released PyTorch, and developed a lot of foundation models that are open source. Maybe the best example is the arXiv paper archive, where the latest techniques are published. Traditional (paid) journals have a lag of one or two years behind arXiv. Allowing the chief scientist or chief experimenter, even the freedom to organize a meetup on company resources, is something that can be used as a celebration of success! The side effect of opening up to the community is that it will create awareness, potentially making access to talented people a little bit easier. Don’t worry about the IP, everything in AI has been re-invented several times over! The only business moat is execution speed.
  • Boxed: The machine learning team needs access to a fixed set of resources, agreed upon before the start of the experiment. In my opinion, we can fix this budget even before the planning. After a few iterations, everyone, including beginners, will be able to roughly estimate how much compute and how much time it would take to validate or invalidate a feature. As a rule of thumb, in the beginning I would dedicate no more than four weeks to an experiment and up to a few hundred or a thousand dollars of compute (whether that is rented cloud machines or API calls to foundation models). However, keep in mind that as the project goes on, we hit diminishing returns, and compute budgets of a few million dollars and year-long time budgets are not out of the ordinary. The only business question is whether the foreseen improvements are worth the budget.
  • Metrics: We should establish, before the experiment starts, a set of business metrics that we want to address. Some non-trivial amount of effort should be put into establishing proxy metrics if the business metrics cannot be directly measured during the experiment. For example, it is very hard to measure the customer lifetime revenue enhancement for an enterprise product. But we could measure, for example, the degree of adoption by measuring the onboarding speed for new users. Review the “What to measure?” chapter.
  • Safe: The “safe-to-fail” concept is already used in describing the experiment but this point focuses more on the client and the larger company. They should be safe from our outcomes.

The forces at play here are roughly the same as those we discussed earlier for the individual experiment. The result of the experiment is a data point, and no incentive should be placed on its outcome, because if you aim for zero failures, you will get zero failures, at the expense of the whole project.

Remember: rank potential hypotheses by the level of risk and continue doing the riskiest things first! When in doubt, this strategy approaches a valid solution! The difficulty is in evaluating the risks and comparing different approaches. Don’t over-analyze it; add new factors like continuity of availability, context switching, and move on with the decision! Being an iterative approach, we can always resume an older experiment thread.

Mindset shifts

Classical management frameworks come with certain amenities: good planning leads to good execution and then to good results, surprises are rare, and risks are foreseen. Welcome to AI, where these amenities are luxuries we can’t afford. We still plan, provision for risks, and implement things, but we do it at an abstraction level above the regular management frameworks.

To do this sustainably, both management and delivery teams must shift their mindset a bit.

Go see for yourself!

Eric Ries tells us about the “go and see for yourself” concept, a core pillar of the original lean method from the Toyota Production System. This is very good advice in AI. We as engineers should strive to understand as much as possible of the client’s business logic; we are experts in machine learning, not in finance or retail. The assumptions we make while building our algorithms must be objectively validated against the realities of the client. This is how we sustain innovation and continuously create value.

The expression “put yourself in their shoes” should sometimes be taken literally. A good practice is to become an apprentice in key positions at the customer. Even 30 minutes on a conveyor belt doing manual quality assurance can lead to a tremendous amount of learning about the problem: lighting conditions, conveyor speed, viewing angles, stroboscopic effects, background quality, noise, vibrations, all elements that are hugely important in computer vision. One computer vision engineer with “boots on the ground” can evaluate all of these quickly, as opposed to many months of commonality and variability analysis based solely on data.

At first glance, having a skilled computer vision engineer doing manual tasks is not the best use of their time. However, in the longer run, this can save a lot of time and energy, both from our team and, more importantly, from the client’s team. Please immerse yourself as much as possible in the client’s environment. A side effect is that you will build human connections that will be invaluable during the project!

Local efficiency vs global efficiency

As discussed earlier, an experiment must include the client. At first we will probably run on the client’s data, but after one or two iterations it is very likely that our code will be live at the client. That piece of code should be well engineered and instrumented: it will be our baseline and our gateway for future experiments, and we will have to make sure that the negative signals we get from the client are real and not caused by some bug in the code.
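
As a minimal illustration of “instrumented”, here is a sketch that logs each prediction with its input hash, latency, and code version, so a negative signal can be traced rather than argued about. The names (predict, MODEL_VERSION) are hypothetical placeholders, not a prescribed interface.

```python
# A minimal instrumentation sketch for the baseline running at the client.
# Structured logs let us separate real negative signals from plain bugs.

import hashlib, json, logging, time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("baseline")
MODEL_VERSION = "exp-012-baseline"   # hypothetical experiment/version tag

def predict(features: dict) -> float:
    return 0.0                       # placeholder for the real model call

def instrumented_predict(features: dict) -> float:
    start = time.perf_counter()
    prediction = predict(features)
    log.info(json.dumps({
        "model_version": MODEL_VERSION,
        "input_hash": hashlib.sha256(
            json.dumps(features, sort_keys=True).encode()).hexdigest()[:12],
        "prediction": prediction,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
    }))
    return prediction

instrumented_predict({"temperature": 71.3, "line_speed": 0.8})
```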

Do not over-engineer it! KISS, YAGNI, SOLID, DRY, DAMP, TDD are acronyms from our clean-code safe zone; we know them, and we know how to apply and measure them. However, if you are an engineer, remind yourself about the Streetlight Fallacy: searching where the light is good rather than where the keys were lost.

There is no one-size-fits-all; there is no bad tool or wrong methodology here. Just make sure you understand your tools and understand when to use them! Scoping is important; faster releases are preferred over lengthy experiments, especially in the beginning. Don’t be afraid to “waste” time manually monitoring the training loops, taking checkpoints, and looking at the results offline. Kill the training loop as soon as you discover something horrendous, but don’t waste too much time with custom callbacks for the first training runs! Again, balance and a bit of common sense are key here.
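
For the first runs, something as plain as the sketch below is often enough: checkpoint periodically, print the loss, and rely on a human watching the console to kill the run. The train_step and save_checkpoint functions are placeholders for whatever framework you use.

```python
# A minimal "babysit the first training runs" sketch: periodic checkpoints,
# eyeballed losses, and Ctrl+C as the early-stopping mechanism.

def train_step(step: int) -> float:
    return 1.0 / (step + 1)                    # placeholder loss

def save_checkpoint(step: int) -> None:
    print(f"checkpoint saved at step {step}")  # placeholder for torch.save / pickle

step = 0
try:
    for step in range(10_000):
        loss = train_step(step)
        if step % 500 == 0:
            save_checkpoint(step)                      # keep these for offline inspection
            print(f"step {step:>6}  loss {loss:.4f}")  # look at it, no fancy callbacks yet
except KeyboardInterrupt:
    save_checkpoint(step)   # killing the run by hand is fine at this stage
    print("run killed manually, last checkpoint saved")
```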

Some of you with management training might be alarmed by the level of underutilization in ML teams. There are two resource categories that usually trigger management:

  • Hardware: Whether it’s in-house or rented, there will be periods when it sits idle. For example, we are coding a distributed parallel training pipeline: we can’t develop it on a local laptop, we need an actual cluster to debug the issues. Another example: we need all the checkpoints produced by a training loop during the analysis phase, so the volumes stay mounted until we have drawn our conclusions. Moreover, it’s good practice to keep those checkpoints for a longer period in case we need additional validation.
  • People: We are in the coding phase, everybody is implementing this marvelous new idea of ours, and once the debugging is done and we are happy with the results, we trigger a training loop. Depending on how much data we have and how many experiments we want to run, this can take weeks. So what are we going to do during those weeks? Are we going to see people idle? Are we going to start a new experiment?

There is a rule of thumb in real-time systems, where multiple processes are scheduled on a limited compute resource and you want guarantees about task execution: don’t load your processor beyond roughly 69.3% of its capacity. While this number has theoretical backing, in practice start with 50% “occupancy” and raise it slowly toward 70%.
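
For the curious, that theoretical backing is the classic Liu and Layland utilization bound for rate-monotonic scheduling, n(2^(1/n) - 1), which tends to ln 2 ≈ 0.693 as the number of tasks grows. A quick sanity check:

```python
# The rate-monotonic scheduling bound n * (2**(1/n) - 1) converges to ln 2 ~ 0.693.

import math

for n in (1, 2, 5, 10, 100):
    print(f"n={n:>3}  bound={n * (2 ** (1 / n) - 1):.3f}")
print(f"limit  {math.log(2):.3f}")
```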

For hardware, underutilization is easily measured, and cloud resources can be scaled up and down. As soon as somebody is waiting for approval to run their code, start provisioning more resources. Waiting for a server to become “free” is waste! A provisioned server sitting idle generates 100x less waste!

Human resources, on the other hand, are trickier to handle. The most dangerous thing we can do to a team member who is waiting for an experiment to finish is to switch them to another task. They will lose focus and forget why the original experiment was run in the first place! When the results are in, there will be another focus shift, and, of course, both tasks will suffer. What I recommend instead is to work on the analysis code.

Even if the analysis code is already mature, there are surely new things we are looking for, and we can use partial results from the current experiment to extend it. Another important aspect is experiment monitoring. We can have plenty of automated triggers, for example on NaNs or for automatic checkpoint saving, but the experiment also benefits from manual sampling. Just as brewers sample their barrels from time to time, we should sample our intermediate checkpoints. And what is the best way to interpret results? Look at them with your brain! This takes time, effort, and focus. Don’t treat it as waste!
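
A minimal sketch of both habits side by side: an automated NaN trigger and a “sample the barrel” helper that runs an intermediate checkpoint on a handful of held-out examples for a human to read. load_checkpoint and its stub model are hypothetical placeholders.

```python
# One automated trigger (NaN guard) plus manual sampling of an intermediate checkpoint.

import math, random

def nan_guard(loss: float, step: int) -> None:
    if math.isnan(loss) or math.isinf(loss):
        raise RuntimeError(f"training diverged at step {step}; inspect the last good checkpoint")

def load_checkpoint(path: str):
    class _StubModel:                    # stand-in for the real model class
        def predict(self, example):
            return 0.0
    return _StubModel()

def sample_checkpoint(checkpoint_path: str, holdout: list) -> None:
    model = load_checkpoint(checkpoint_path)
    for example in random.sample(holdout, k=min(5, len(holdout))):
        print(example, "->", model.predict(example))   # read with your brain, not a dashboard

sample_checkpoint("checkpoints/step_000500", holdout=list(range(20)))
```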

Another thing I recommend during this period is to deepen a technical topic or study a technology related to the current experiment.

ML experiments, in my opinion, should be treated as deep and difficult work. The problem with overloading team members is that the negative feedback shows up only after a while, because a decrease in innovation quality is very hard to measure. A safer bet is to keep people happy and up to date with the latest changes in the field; this way they will have enough mental freedom to focus on the current task!

This was a hard-learned lesson in the industry, and team leaders and managers should look up the small-batch versus large-batch analogies from traditional production systems. “You cannot trade quality for time.” Remember, Lean focuses on reducing waste, not on micro-managing or saving a few bucks on an S3 bucket.

After a while, it can happen that the team gets stuck on a problem and, whatever they do, they cannot shake it. “At the root of every seemingly technical problem is a human problem.” (Eric Ries) I strongly recommend the “5 Whys”: ask “Why?” repeatedly, up to five times, and chances are that after the first few answers you will reach the root of the problem. There are more nuances here, such as having all the stakeholders in the room; otherwise, the one who is missing will be found guilty. That is a topic better covered in classical management books. If we get this step wrong, we might fail the entire project, because more often than not, resolving the blockage involves changing the status quo. People affected by these changes will push back against our new and shiny AI solution, sometimes even sabotaging the experiments! This is a very tricky subject, so we must use our best human skills to address it.

Scaling up the team

We perform Build-Measure-Learn, and each lesson learned brings us the next experiment. Does this mean we can run only one experiment at a time? No. One result can spawn multiple hypotheses, or we can tackle a problem in multiple ways. Each such “track” can be executed in parallel.

A didactic example: with few examples we can use foundation models to get some results; with large datasets we can fine-tune our own models, heavily cutting latency and costs. One experiment track is, say, focused on capturing empirical evidence with domain experts, and the other is focused on statistical learning. The first requires a lot of back-and-forth discussion with the domain experts, many smaller experiments, and a lot of fiddling, but it needs little data and delivers actionable results tomorrow. The second can be performed by any machine learning engineer with enough compute on their hands, but its probability of success is directly linked to the quality, variability, and especially the quantity of the data, and useful results will show up later in the project’s lifetime. So we as a team can deliver something working to the client right now! When the bitter lesson kicks in (the second approach becomes better), statistical learning will “take over” and the nature of the solution will change.

We start both tracks in parallel because it is hard to guess when the tipping point will be.
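
A minimal sketch of how the tipping point between the two tracks might be watched. The scores, costs, and the “at least as good and cheaper” rule are illustrative assumptions, not a formal decision procedure.

```python
# Watching for the tipping point between a prompted foundation model (works today)
# and a fine-tuned model (improves as data accumulates).

def pick_track(prompted_score: float, finetuned_score: float,
               prompted_cost_per_1k: float, finetuned_cost_per_1k: float) -> str:
    """Prefer the fine-tuned track once it is at least as good AND cheaper to serve."""
    if finetuned_score >= prompted_score and finetuned_cost_per_1k < prompted_cost_per_1k:
        return "fine-tuned model"
    return "foundation-model API"

# Early on, the API track usually wins; later, the fine-tuned track takes over.
print(pick_track(prompted_score=0.82, finetuned_score=0.74,
                 prompted_cost_per_1k=12.0, finetuned_cost_per_1k=0.8))
print(pick_track(prompted_score=0.82, finetuned_score=0.85,
                 prompted_cost_per_1k=12.0, finetuned_cost_per_1k=0.8))
```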

In practice, the client may have many, possibly overlapping, problems, and we can have several teams tracking them in parallel. It is also common for clients to have multiple independent data sources. Most clients have several business goals, or a business goal can be sliced into parallel decision-making streams, and these sub-decisions can be tackled in parallel by different teams. The norm is one experiment track per small team, or even better, one topic per team.

For us, Eric Ries’s pivot concept also means changing the business topic the team tracks, which happens when no more hypotheses are left to explore in the current track.

Logarithmic growth

In almost any management book, you will learn about the importance of onboarding: an employee needs a certain period to integrate, followed by a ramp-up period in which they become more and more productive. Once the ramp-up ends, they are productive, and we expect (or require) a continued upward trend, even if not at the same speed as during ramp-up. The same expectations apply to project delivery rates.

Burndown charts and team velocity are the metrics used to predict project delivery.

Forget about these expectations when talking about AI/ML! What you will get is logarithmic growth. The initial progress will be awesome in terms of KPI gains with respect to invested effort! But after the 3rd or 4th iteration, top management will feel like the project has hit a wall! It is not a wall; there are no blockers; it only takes exponentially more resources to make an improvement. Moreover, the delivered improvements will be smaller and smaller! Whose fault is it? Well, this is the status quo in AI, and when you have a solution, please let us know!

To build more intuition about this phenomenon, here is another didactic example: ML is, by definition, a method of learning from data, and what is learned first are the general characteristics of the data. Rare characteristics are ignored until a sufficiently large sample is collected. Because these rare characteristics are infrequent, their impact on the KPI is small (say, a 1% improvement), and collecting enough data to cover them takes a lot of time. Feel the logarithmic growth? We work to collect 10x more data, the model finally learns those outliers, and our overall performance increases by at most 1%.
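
To put rough numbers on this intuition, here is a toy diminishing-returns curve. The power-law form and the coefficients are purely illustrative, not a fitted scaling law for any real model.

```python
# A toy curve: accuracy approaches a ceiling as a power law of dataset size,
# so every 10x of extra data buys a smaller and smaller improvement.

def accuracy(n_samples: float, ceiling: float = 0.95, c: float = 1.2, alpha: float = 0.35) -> float:
    return ceiling - c * n_samples ** -alpha

for n in (1_000, 10_000, 100_000, 1_000_000):
    gain = accuracy(n * 10) - accuracy(n)
    print(f"{n:>9} samples -> 10x more data buys {gain * 100:.2f} accuracy points")
```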

The reasons are not limited to data collection; experiments also become harder to design and more complex. The hard question is: when will the performance gain no longer justify the costs? This is another colossal win for the Lean approach: we can stop at the exact moment this threshold is reached! The client is always involved, and the metrics we use to track progress are as close as possible to the client’s pain, so the client can assess whether improvements in certain metrics bring enough business value to warrant the expense.

Don’t be afraid of this moment; welcome it and consider it a big win! We efficiently delivered value in a problem with unknowns and great chances of failure!

Business continuity and a revenue stream? Up-sell, cross-sell, and, most importantly, gain trust and reputation!

How to integrate into a larger company

We’ve achieved traction with the customer and have enough business cases to tackle! We need to double our team size! Things are looking great, and this is excellent if we are an AI service boutique. This is the reason we exist, and this is the sole service we are selling! However, the most likely scenario is that our AI team is just a part of a larger organization. Not only this, but our AI services are just a part of a larger, more complex product!

The question that arises once we start having success is how we will be integrated within the company in the long term. As you’ve seen so far, we turn a lot of processes on their head, we challenge bedrock assumptions, and our delivery process is nowhere near a classical Scrum process.

There are several issues here that I want to address, as inspired by Eric Ries:
  • System over people
  • Isolating the team from the larger company
  • Isolating the larger company from the team
  • Ensuring that the AI module can be integrated back in the regular workflow

Team isolation

There are cases when we want to separate the innovation team from the main company. For example, if this is a manufacturing company and our team is responsible for researching pretty hairy problems, we need an environment that is favorable to us.

Methods that work well in regular software engineering or manufacturing will kill our AI team. Both “safe to fail” chapters are dedicated to this concept!

Company isolation

We can all agree that an environment full of experimentation and failures would be disruptive, in a bad way, for a large-volume service company, for example.

Eric Ries also warns that the company should be protected from the experimentation team. In our machine learning case, we have to be very conscious of which clients, and what type of clients, will be affected by our experiments. If an experiment goes horribly wrong for 10 enterprise clients, it might yield the same learning value as a failed experiment for 10 free-plan clients, but the damage to the company’s reputation is several orders of magnitude greater.

While this idea is again within the “safe-to-fail” concept, we have to be explicit. If we start to damage other parts of the company with our experiments, we will be attacked, and unfortunately, it is very hard to innovate and to push the boundaries when you have to defend all your actions and decisions. The sales team selling a 99% SLA shouldn’t feel threatened by our quick and loose experimentation strategies.
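
One straightforward compartmentalization pattern is to gate experimental output to an explicitly agreed cohort. The tier names and the 5% cap below are assumptions for illustration, not a recommended policy.

```python
# A minimal blast-radius sketch: only clients in an agreed experiment cohort
# ever see experimental model output; everyone else stays on the stable baseline.

EXPERIMENT_TIERS = {"free", "internal"}   # assumed tiers that are safe to experiment on
MAX_COHORT_FRACTION = 0.05                # assumed cap on the experiment cohort size

def serve_experiment(client_tier: str, client_hash: float) -> bool:
    """client_hash is a stable per-client value in [0, 1)."""
    return client_tier in EXPERIMENT_TIERS and client_hash < MAX_COHORT_FRACTION

print(serve_experiment("enterprise", 0.01))  # False: never experiment on SLA-bound clients
print(serve_experiment("free", 0.01))        # True: small, agreed-upon cohort
```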

This compartmentalization and process separation is something that we have to work on together with C-level management. Usually, straightforward solutions are good enough.

Integration back in the regular flow

As the project goes on, we may soon be in a position where most of the innovation is considered standard and must be managed as a product. This can be used as a vanity metric to celebrate progress. What everybody recommends is a management handover from the innovation team to the delivery team, done in a smooth and intentional way: both the innovation team and the delivery team must expect this handover from day one. Here, the good practices enumerated in the experiment chapter are vital, especially documenting all the assumptions, all the experiments, and all the outcomes. Of course, team members who want to move to the delivery team should have the option to stick with the feature rather than with the innovation team.
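
As a minimal sketch of what “document all the assumptions, experiments, and outcomes” can mean in practice, here is one possible experiment record; the fields are a suggestion, not a standard, and all values are hypothetical.

```python
# One possible shape for an experiment record that makes the handover painless.

from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    name: str
    hypothesis: str
    assumptions: list[str]
    metric: str
    result: float
    decision: str                                        # "persevere", "pivot", or "handover"
    artifacts: list[str] = field(default_factory=list)   # checkpoints, notebooks, dashboards

record = ExperimentRecord(
    name="exp-014-defect-detector",
    hypothesis="A fine-tuned detector matches manual QA recall on line 3",
    assumptions=["camera angle stays fixed", "lighting is stable during the night shift"],
    metric="recall at 0.5 IoU",
    result=0.93,
    decision="handover",
    artifacts=["exp-014/best.ckpt", "exp-014/analysis.ipynb"],
)
print(record)
```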

As in software development, long-term code or feature ownership is a bad smell. Yes, it is very important to have ownership while running an experiment, but it is very bad to have lifetime ownership over a section of the product. We might run into issues such as the bus factor (that person leaves the project, and the project dies) or siloing (a bad habit in which certain members try to secure their position by hoarding know-how).

System over people

In mature startups and in corporations, we hear “system over people” a lot! The central idea is that by establishing standardized procedures and workflows, organizations can achieve greater consistency, predictability, and efficiency, regardless of who is executing the tasks. This approach assumes that human error and variability can be minimized by relying on robust processes. It is well known that this approach fails for rapidly changing environments and fields where innovation is key. 

Lean, somehow, takes the good parts from the “process is everything” and the good parts from the “people over system” mantras. Lean’s outputs are the product and the process! If the experiments are well documented, one of those experiments will become the process that will be applied over and over again to gain improvements on metric K or to “defend” the performance against inevitable domain drifts! Win-win!

The key is that everybody should be aware of which step we are in and, for each step, understand the expectations and what comes next.

Final words

AI/ML is an old field; it started somewhere around 1960. It has many faces and many methods, and statistical learning, with its subset called deep learning, is just one small part of true AI. Ontologies, optimization, first-order logic, and other topics were all the rage back in the day. Each of the methods I enumerated provided value to customers, but the value was oversold, and so the “AI winters” arrived. As service and solution providers, we have the responsibility of under-promising and over-delivering. This is not easy to do, but we must try.

Agile methodologies have their roots in dissatisfaction with existing delivery methodologies. However, their success brought us to a situation where Scrum is applied as a checklist rather than as a management strategy. This is a huge risk with the Lean methodology: that it can become rigid indoctrination. We, as Lean practitioners, must understand that everything is fluid, and we must internalize the core principles and the reason they exist.

Unfortunately, The Lean Startup movement lost its popularity and gained a lot of criticism because it fell into precisely the trap that its author foresaw. Quote from page 279 of The Lean Startup book: “We cannot afford to have our success breed a new pseudoscience around pivots, MVPs, and the like. This was the fate of scientific management and, in the end, I believe that set back its cause by decades.”

