TL;DR:
- Testing production performance is difficult in the age of LLMs
- Synthetic data generation can help, but we must do it with care
- The data flywheel is vital and must be kept rolling!
- Demo available online!
Data flywheel
Data flywheel is the engine that drives your ML business!
Alright, buckle up for a fun ride into the world of data flywheels! Imagine you’re at a carnival, and there’s this giant wheel that gets faster and faster as more people push it. That’s basically what a data flywheel is in the tech world, but instead of carnival-goers, it’s data doing the pushing!
Here’s how it works: You start with a product or service that collects some data. As more people use it, more data comes in. This data helps improve the product, making it more attractive to users. More users mean even more data, and the cycle keeps going, spinning faster and faster like our carnival wheel. It’s like a snowball effect, but instead of snow, we’re rolling in sweet, sweet data! The best part? Once it gets going, it’s hard to stop. Companies like Google and Amazon have mastered this, using data to make their products so good that users keep coming back for more, feeding the flywheel and making it spin even faster.
LLM: Everything is new but old again!
Ok, but how good is your model?
Alright, let’s talk about performance estimation in machine learning. It’s like trying to figure out how good your recipe is before serving it at a big dinner party. That’s where train and test sets come in handy!
In the classic ML world, we usually split our data into two main parts: the training set (where our model learns all its tricks) and the test set (where we see if those tricks actually work). The most common split you’ll hear about is the 80/20 rule – 80% for training and 20% for testing. It’s like keeping a slice of pizza aside to make sure the whole pie turned out okay. For some real-world examples, let’s look at some classic datasets:
- MNIST (handwritten digits): Often uses a 60,000/10,000 split for train/test.
- CIFAR-10 (tiny images): Goes with 50,000 for training and 10,000 for testing.
- ImageNet: This big boy uses about 1.2 million images for training and 50,000 for validation.
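To make the 80/20 idea concrete, here is a minimal sketch of a shuffled split in plain Python. The function name `train_test_split` and its parameters are illustrative choices for this post, not a reference to any particular library.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and cut it into (train, test)."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```

Fixing the seed matters: if the split changes on every run, samples drift between train and test and the "unseen" set stops being unseen.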
Now, here’s where things get wild with Large Language Models (LLMs). These text-gobbling monsters have been trained on practically the entire internet! It’s like they’ve read every book, blog post, and tweet out there. So when it comes to testing them, we run into a bit of a pickle.
Finding fresh, unseen data for LLMs is like trying to find a corner of the internet they haven’t explored. It’s tough! We can’t just whip up a few million new web pages to test them on. That’s why folks in the AI world have started calling performance checks on LLMs “vibe checks.” It’s less about hard numbers and more about getting a feel for how the model performs. We’re not looking at precise accuracy scores anymore; we’re asking, “Does this feel right? Is it giving off the right vibe?” It’s more art than science at this point.
Next time you hear about LLM performance, remember it’s less like a strict exam and more like a talent show judge giving a thumbs up or down. We’re all just trying to catch the right vibe in this brave new world of AI!
Statistically relevant or not, these numbers are vital both for engineers and for business drivers. We must have them!
The mentoring
LLMs: Your 24/7 code coach, giving students instant feedback on the tricky stuff professors usually handle, so learning never hits the pause button.
Let’s walk through an example to see what performance estimation looks like in the LLM age.
Case: As a student learning how to code, you need constant feedback on what you are doing. Some feedback is fast (syntax from the compiler, algorithm correctness from the program output), but code style, design patterns and clean-code feedback traditionally require a human.
Solution: With a properly trained LLM we can provide “instant” feedback on these issues. It will catch the gross infringements and provide fast feedback. In this way, the students’ teaching flywheel will keep on accelerating!
Sooo, are we going to write a prompt? Yes, but . . . how good will it be?
For simplicity we will focus on the C language and on only a few clean-code principles.
Test set. The only set that matters.
This is the first step in almost any real-life ML project. Not the models, not the architectures, infra, deployment, no. The test set. How good will this model be for my business? Everything else is secondary to this.
The test set is also a bit holy. It shouldn’t be run more than once, no business decision should be taken with it, and extreme care must be taken to avoid leakage.
For our case we must create many code samples, each sample infringing one or more of the concepts that we want to check. So, write bad code? Well, not the nicest task! What if we can enlist some help? From whom? Well, LLMs of course!
Data flywheel: The start
Natural data is expensive? How about some synthetic data?
Can we ask an LLM to generate bad code? Yes! So let’s do it! Right off the start we hit a wall. For us, it was very hard to get a solution to an actual C problem that intentionally violated a given clean-code principle. Usually the LLM ignored the instruction regarding the clean code.
This surfaced after the first data generation attempts. So we discarded this direction. Fail fast and don’t dwell on something less promising! In the beginning, explore rather than exploit.
Data flywheel: The longer start
LLMs can’t plan or reason
If you train a model to detect dogs vs cats and you feed it a horse, under no circumstances will it be able to label the image as a horse. It knows only about cats and dogs. The same goes for LLMs. They can’t reason. They only output the most likely character [token] following the already emitted output. On the internet, most blogs solving didactic problems do it in a fairly clean way! So, we have to adapt.
Let’s see our new pipeline:
Prompt to generate “clean-code” solutions -> Prompt to de-clean the solution -> . . . -> Profit?
Let’s see some examples:
Add two vectors and then display the results. The vectors are initialized in the main() and their length is known at compile time. No need to read their values from keyboard.
Ok, nothing too fancy. So one prompt one code? Well, can we get more? Yes! Increase the temperature, run the prompt several times!
Is that easy? Nope! We have to filter for bad samples and for duplicates. Argilla is here to save the day! Excellent system to visualize your data and annotate it!
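Before anything lands in front of a human annotator, the obvious duplicates can be dropped automatically. Here is a rough sketch using the standard library's `difflib`. The `0.9` threshold and the whitespace-only normalization are assumptions made for illustration, not the settings used in this project.

```python
import difflib
import re


def normalize(code):
    """Collapse all whitespace so formatting differences don't count."""
    return re.sub(r"\s+", " ", code).strip()


def drop_near_duplicates(samples, threshold=0.9):
    """Keep a sample only if it is not too similar to any already kept one."""
    kept = []
    for code in samples:
        norm = normalize(code)
        is_new = all(
            difflib.SequenceMatcher(
                None, norm, normalize(other), autojunk=False
            ).ratio() < threshold
            for other in kept
        )
        if is_new:
            kept.append(code)
    return kept
```

This is quadratic in the number of samples, which is fine for a few hundred generations per prompt. It only catches textual near-copies; two solutions with the same structure and different variable names still need a human eye.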
It wouldn’t make for a story if it had been THAT easy! There are several issues:
- ChatGPT 3.5 is faster but ChatGPT 4 is better!
- On some days ChatGPT 4 will throw some tantrums. A lot of waiting time and a lot of gibberish output!
- An `openai` version bump partially fixed the issues. Still, some trivial errors remained: `STUDENTs` instead of `STUDENTS` for some constant, missing end-of-string double quotes, etc.
- Some larger outputs were enclosed in triple-backtick code fences, others were not. Had to account for that.
- Generation prompt had to be tuned several times to avoid stereotypes or common mistakes (`scanf("%s"...)`)
- Simpler problems had less wiggle room and tended to produce the same solution.
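The "sometimes fenced, sometimes not" problem from the list above can be handled with a small helper. This is a sketch under the assumption that the reply contains at most one code block worth keeping; `extract_code` is a hypothetical name.

```python
import re

TICKS = "`" * 3  # the triple-backtick fence marker
FENCE = re.compile(TICKS + r"[\w+-]*\n(.*?)" + TICKS, re.DOTALL)


def extract_code(reply):
    """Return the first fenced block if there is one, else the whole reply."""
    match = FENCE.search(reply)
    return (match.group(1) if match else reply).strip()
```

If the model wraps its answer in chatter ("Here you go: ..."), the fenced branch throws the chatter away. In the unfenced branch everything is kept, so any stray prose will surface later as a compile error.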
How can we filter out these obviously bad examples? Well, compile and run them! Firejail + Docker are here to help! After each ChatGPT output, we compile and run the code. Compiles? Runs? Store it for human inspection. After this iteration, a human expert had to check:
- Is the solution adequate (correct, clean enough) for the problem?
- Is the solution different enough from the rest of the solutions?
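The "Compiles? Runs?" gate that comes before the human check can be sketched like this. Big caveat: this sketch has NO sandboxing. It assumes a `gcc`-style compiler on the PATH and simply executes the binary, so it must only be run inside something like the Firejail + Docker setup mentioned above. `RunResult` and `compile_and_run` are hypothetical names.

```python
import os
import shutil
import subprocess
import tempfile
from dataclasses import dataclass


@dataclass
class RunResult:
    compiled: bool
    ran: bool
    output: str  # stdout on success, otherwise the error text


def compile_and_run(source, compiler="gcc", timeout=5):
    """Compile C source, run the binary, and report what happened."""
    if shutil.which(compiler) is None:
        return RunResult(False, False, f"{compiler} not found")
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "sample.c")
        exe = os.path.join(workdir, "sample")
        with open(src, "w") as handle:
            handle.write(source)
        build = subprocess.run(
            [compiler, src, "-o", exe], capture_output=True, text=True
        )
        if build.returncode != 0:
            return RunResult(False, False, build.stderr)
        try:
            # UNSANDBOXED: generated code runs with this process's rights.
            run = subprocess.run(
                [exe], capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return RunResult(True, False, "timeout")
        return RunResult(True, run.returncode == 0, run.stdout)
```

The timeout is there because generated code loves an infinite loop now and then. Only samples with `compiled` and `ran` both true would move on to human inspection.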
Long story short: after 8 prompts, 10 runs per prompt and several iterations of this step, we got it right! In the end we collected 78 samples.
Unit testing your prompt is . . . not trivial.
Data flywheel: It’s a lie!
In fact it is all a big gear box!
See the problem? We spent a lot of time “tuning” the augmentation step. One simple code generation step ate a lot of time and effort. And it will get worse!
Now, we have our nice samples, clean code and all. Time to walk the talk and augment them!
- Create a de-cleaning prompt
- Specify the code and the criteria!
What can possibly go wrong? Besides the above list you ask? Well . . .
We could get a wrong answer, we could get a solution that does not solve the original problem. How we “solved” it:
- Iterate over the clean-code samples:
  - Take a piece of clean code, run it, collect the output
  - Take a short list of bad-code principles
  - Iterate over the principles:
    - Using a prompt, de-clean the code based on one criterion at a time.
    - Compile the new code
    - If there are errors, feed them to ChatGPT and ask it to fix the code. Iterate several times, then abandon if it can’t
    - Collect the new output.
    - Ask ChatGPT to compare the original output with the new output. Close enough? Accept the sample. Far? Reject the sample.
    - Use the resulting code as input for the next de-clean criterion.
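The loop above can be sketched as one function with all the moving parts injected as callables. Everything here is hypothetical naming: `llm_declean`, `llm_fix`, `run_c` and `same_output` stand in for the prompts, the compile-and-run step and the ChatGPT output comparison. One more assumption: after a rejected step, the sketch continues from the last accepted code, which the steps above do not spell out.

```python
def declean(clean_code, criteria, llm_declean, llm_fix, run_c, same_output,
            max_fixes=3):
    """Apply criteria one at a time; return a list of (code, labels) pairs.

    run_c(code) must return (ok, text): the program output when ok is True,
    the compiler/runtime errors otherwise.
    """
    _, reference = run_c(clean_code)  # output of the clean solution
    accepted, code, applied = [], clean_code, []
    for criterion in criteria:
        candidate = llm_declean(code, criterion)
        ok, text = run_c(candidate)
        for _ in range(max_fixes):
            if ok:
                break
            # Feed the errors back and ask for a repaired version.
            candidate = llm_fix(candidate, text)
            ok, text = run_c(candidate)
        if not ok or not same_output(reference, text):
            continue  # reject this step, keep the last accepted code
        code, applied = candidate, applied + [criterion]
        accepted.append((code, list(applied)))
    return accepted
```

Injecting the callables keeps the loop testable with cheap fakes, which matters when every real call costs tokens and waiting time. Note that each accepted sample carries the cumulative list of labels, since the de-cleaning steps stack on top of each other.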
Spot another problem? We will get only valid C code but do we know for sure that the augmented code solves the same problem? We ask ChatGPT to evaluate if it is solved or not. Here is another train-eval loop! In production this must be iterated and estimated! It is a flywheel that needs to have quite a good performance! So there is no one flywheel! It is a freakin’ gearbox! A lot of flywheels! Interconnected!
Tip: KISS – the most important test is the end-to-end test. In time and with data, we can isolate and properly treat smaller flywheels.
Again, checking a few examples at a time, we tuned the “de-clean” prompt a bit, along with the de-clean combinations and their order. Here are the principles:
- DRY Don’t repeat yourself
- SRP Single Responsibility Principle (one responsibility per function)
- OCP Open-Closed Principle [spoiler, dropped]
- MC Magic constants
- NAME Meaningful names to variables and functions
After many iterations on this step we got 125 samples. Only 125 samples!
We used Argilla again to manually pick 54 samples. Only 54 samples! For each sample we have:
- The original handwritten problem prompt
- The list of de-clean augmentations (e.g. `["DRY", "SRP", "NAME"]`)
- The actual C code
At this step, the OCP principle was dropped: for our problem statements it generated either meaningless code or identical code. It is hard to do OOP in C. Also note that this is a multi-label classification problem.
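"Multi-label" means each sample can carry several labels at once, so a convenient encoding is one 0/1 flag per principle. A tiny sketch, where the label order is an arbitrary choice made for this example:

```python
LABELS = ["DRY", "MC", "NAME", "SRP"]  # order is an arbitrary convention


def to_indicator(applied, labels=LABELS):
    """Turn a list of applied principles into a 0/1 vector."""
    return [1 if label in applied else 0 for label in labels]


print(to_indicator(["DRY", "SRP", "NAME"]))  # [1, 0, 1, 1]
```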
A LOT of human time was consumed to generate these samples. For such a low data volume, the question arises whether it would have been easier to just write 54 pieces of code by hand! How hard can it be? Well, the right question to ask is: how easy is it to scale? Let’s break down the activities:
- Code the Python scripts to do the data handling -> Complex, hard to scale or parallelize
- Write some problem statements -> Easy to do, still needs human expert but easy to parallelize
- Auto generate 1000 code samples -> Let it run overnight. Scale you say? Check your card and OpenAI limits! 100000 samples? Probably $10 [No actual math has been performed!!!]
- Verify those samples -> This is human intensive again but can be parallelized!
So, the code, once written, can be “spun” faster and faster! The limiting factor is the human evaluators. But this scales linearly with the number of people willing to evaluate the samples.
The train/test split
It’s all just prompt engineering!
This nice set was split into 24 train samples and 30 test samples.
But why???? If there is no training AND the test set is the most important, why not throw everything into the test set?
Long story short: Prompt engineering is what feature extraction was before Deep Learning. And from a mathematical angle, given the way transformers work, when you prompt an LLM to make a decision, you get a classification problem! So yes, welcome to the present! It’s all just prompt engineering from now on!
The split was done at the problem statement level. Some balance was ensured by manually tuning the random seed.
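Splitting at the problem statement level means that all samples derived from one problem land on the same side, so the test set never contains a de-cleaned cousin of a train sample. A minimal sketch, assuming each sample is a dict with a `problem` key (a made-up schema for this example):

```python
import random


def split_by_problem(samples, test_fraction=0.5, seed=0):
    """Send whole problem statements to either train or test, never both."""
    problems = sorted({s["problem"] for s in samples})
    random.Random(seed).shuffle(problems)
    n_test = round(len(problems) * test_fraction)
    test_problems = set(problems[:n_test])
    train = [s for s in samples if s["problem"] not in test_problems]
    test = [s for s in samples if s["problem"] in test_problems]
    return train, test
```

"Tuning the random seed" then just means trying a few `seed` values and counting the labels on each side until no principle is badly under-represented.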
We added a new label, NO_COMPILE, to detect when a piece of code does not compile. The reason for this is that we deploy the system live, and actually compiling and running C code on the same machine where the OpenAI secrets are kept, well . . . Also, there were some practical issues: it is difficult to orchestrate two Docker containers on regular free hosting providers.
The prediction loop is very easy:
- Take a piece of C code
- Run the prompt that detects the infringed clean-code principles
- Profit!
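In code, the prediction loop is mostly about not trusting the reply. The sketch below assumes the detection prompt asks the model to answer with a JSON list of labels; that output format is an assumption made for this example, and `ask_llm` is a hypothetical callable wrapping the actual API call.

```python
import json

KNOWN_LABELS = {"DRY", "MC", "NAME", "SRP", "NO_COMPILE"}


def detect(code, ask_llm):
    """Return the set of known labels found in the model's JSON-list reply."""
    reply = ask_llm(code)
    try:
        raw = json.loads(reply)
    except json.JSONDecodeError:
        return set()  # gibberish reply: report nothing rather than crash
    if not isinstance(raw, list):
        return set()
    # Drop anything the model invented that is not one of our labels.
    return {
        label for label in raw
        if isinstance(label, str) and label in KNOWN_LABELS
    }
```

Filtering against a fixed label set keeps the downstream scoring honest: a made-up label can never count as a hit.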
Well, no profit yet, but we used the Jaccard score (intersection over union of the predicted and the true labels) to see how well the system works. We used it on the training set and tried to “optimize” the prompt so we get decent scores.
Regularization was performed by literally not trying too hard to fix the misclassified labels!
| Criteria | DRY | MC | NAME | SRP |
|---|---|---|---|---|
| Jaccard (1 == perfect) | 0.68 | 0.72 | 0.58 | 0.35 |
So SRP is a bit deficient, as we can see.
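For reference, a per-label Jaccard score can be computed in a few lines. This sketch treats each label separately: the samples where the label is truly present versus the samples where it was predicted. Scoring a label that appears nowhere as 1.0 is a convention chosen here, and this may not be the exact computation behind the numbers above.

```python
def jaccard_per_label(y_true, y_pred, labels):
    """y_true / y_pred: one set of labels per sample, in the same order."""
    scores = {}
    for label in labels:
        true_idx = {i for i, s in enumerate(y_true) if label in s}
        pred_idx = {i for i, s in enumerate(y_pred) if label in s}
        union = true_idx | pred_idx
        # Intersection over union; an absent label counts as perfect.
        scores[label] = len(true_idx & pred_idx) / len(union) if union else 1.0
    return scores


y_true = [{"DRY", "SRP"}, {"MC"}, {"DRY"}]
y_pred = [{"DRY"}, {"MC", "SRP"}, set()]
print(jaccard_per_label(y_true, y_pred, ["DRY", "MC", "NAME", "SRP"]))
# {'DRY': 0.5, 'MC': 1.0, 'NAME': 1.0, 'SRP': 0.0}
```

Both false alarms and misses grow the union without growing the intersection, so a label that is often predicted in the wrong place gets punished twice.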
Keep in mind that this is on the train set. At this point [Aug 2024], the test set has not been evaluated. Why? We want to include more engines, not only ChatGPT, and maybe tune the prompt some more!
We will be able to do this only after the data gearbox has been turned again! Once we have new data, we will use it as the new test set, and the current test set will be used to decide which prompt/LLM is better! Once we have measured something with the test set, we must “throw” it away! When we have a replacement, that is easy!
The demo
While it still has GPT credits the demo is online! Thank you Maven for the credits!
FastHTML and Huggingface were the main tools. The code of the demo is here. Including the detection prompt!
If you log in with an HF account you can leave feedback on the evaluation quality!
Conclusions
The ML “entry bar” is lower than a few years ago. Feature extraction can be done in plain English! Anybody can do ML right now, if they observe some best practices!
The difference between a cool demo and a product is that the latter can generate revenue. For this, the business must have a gist of how good (or bad!) the ML will be!
Synthetic data generation is here to help, but don’t expect miracles! Also, expect a solid chunk of manual labor! Natural or synthetic, the noise and garbage are there and must be filtered out somehow!
If you are stuck with your ML problem, don’t hesitate to contact me!