Fine-tuning LLMs yourself (on Xebia.com ⧉)

Why fine-tune?

Large Language Models (LLMs) are impressive generalists. They perform increasingly well on a wide range of tasks, achieving human-level or even superhuman-level performance on many benchmarks. However, being good at everything can come at a cost. These so-called foundation models have billions of parameters and require a vast amount of GPUs to train or run. That is unaffordable for most, so we consume LLMs from APIs instead, which comes with its own considerations like cost, privacy, latency or vendor lock-in. But this is not the only option. When only specialist capabilities are required, generalist capabilities can be sacrificed in favor of domain-specific knowledge, allowing us to get away with smaller models. Smaller models require less compute and can achieve similar or even better performance than larger models on specific tasks [1]. With enough data available, we can specialise existing models ourselves. This is fine-tuning.

What is fine-tuning?

Fine-tuning is a process where we take a pre-trained model and continue training it on a new dataset, typically with a smaller learning rate. This allows the model to adapt to the specific characteristics of the new data while retaining the general knowledge it gained during the initial training. In practice, this often involves further training the top layers of the model while freezing the lower layers, which helps to preserve the learned features from the original training.

So, how do we fine-tune a model ourselves? Let's figure that out!

How to fine-tune yourself

Enough talking! Let's get practical. Let's fine-tune a model on our own hardware. We have access to the following setup:

  • HP Z8 Fury
  • 3x NVIDIA RTX 6000 Ada GPUs

Our goal is to fine-tune a model and benchmark it on the MMLU-Pro math category.

So in order to start fine-tuning, we need to decide on the following:

1. Fine-tuning framework: Axolotl

There are several fine-tuning frameworks available: Axolotl, Unsloth and torchtune. We choose Axolotl for its ease of use and its rich set of out-of-the-box examples.

In Axolotl's first fine-tune example, you only need to execute a one-line command to fine-tune a Llama3.2 model:

axolotl train examples/llama-3/lora-1b.yml

The fine-tuning dataset and method are all defined in the YAML file.

2. Model choice

The first question we asked is how large a model we can fine-tune given 3 GPUs. One NVIDIA RTX 6000 Ada GPU has 48 GB of memory, so in total we have 144 GB.

Under the hood, Accelerate is used. We need to figure out, however, how big a model we can train on the available GPUs. To calculate the memory usage based on the number of model parameters, Zachary Mueller's slides offer a convenient formula.

Take the Llama 3 8B model for example, which has 8 billion parameters. Each parameter is 4 bytes, so the model weights alone need roughly 8 × 4 = 32 GB. Furthermore, the backward pass takes about 2× the model size and the optimizer step about 4× the model size; the optimizer step is what consumes the most memory during fine-tuning.
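
As a quick sanity check, this rule of thumb can be put into a few lines of Python. Note this is a rough, hedged estimate only: it ignores activations, sequence length and memory-saving techniques like LoRA or sharding:

# Rough memory estimate for full fine-tuning, following the rule of thumb above:
# weights (4 bytes/param), gradients (~2x model size), optimizer states (~4x model size).
def estimate_finetune_memory_gb(num_params_billion: float, bytes_per_param: int = 4) -> dict:
    model_gb = num_params_billion * bytes_per_param  # billions of params * bytes = gigabytes
    gradients_gb = 2 * model_gb
    optimizer_gb = 4 * model_gb
    return {
        "model": model_gb,
        "gradients": gradients_gb,
        "optimizer": optimizer_gb,
        "total": model_gb + gradients_gb + optimizer_gb,
    }

print(estimate_finetune_memory_gb(8))  # Llama 3 8B: 32 + 64 + 128 = 224 GB in total
print(estimate_finetune_memory_gb(2))  # Gemma-2-2B: 8 + 16 + 32 = 56 GB in total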

You can also use this model size estimator tool to estimate the memory usage based on your own model choice.

For example, we cannot fine-tune a DeepSeek V2 model, because the memory usage is above 3 TB, as shown below:

Fine-tuning LLMs yourself (on Xebia.com ⧉)

Based on this memory limit, we choose Gemma-2-2B to fine-tune and dive into the fine-tuning techniques.

We choose a Gemma model because we want to show that a relatively small model, fine-tuned for a specific task, can outperform larger models in that specific domain.

3. Benchmarking dataset: MMLU-Pro

For the benchmarking dataset, we use the MMLU-Pro dataset. It has 14 categories of questions, ranging from math, physics and biology to business and engineering. For each question, there are 10 options to choose from, with one correct answer. When it comes to the dataset split, it has 70 rows for validation and 12k rows for test. The validation dataset is used to generate CoT (Chain of Thought) prompts for the test dataset. Together, they form the prompts to benchmark your model.
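
For illustration, here is a minimal sketch of loading the benchmark with the Hugging Face datasets library (the dataset id TIGER-Lab/MMLU-Pro and the column names are taken from its dataset card; double-check against the current version):

# Load MMLU-Pro and select only the math category.
from datasets import load_dataset

mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro")
print(mmlu_pro)  # splits: validation (~70 rows) and test (~12k rows)

math_questions = mmlu_pro["test"].filter(lambda row: row["category"] == "math")
print(math_questions[0]["question"])
print(math_questions[0]["options"])  # 10 answer options per question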

The MMLU-Pro leaderboard shows the benchmarking results for an abundance of models.

Since only the overall accuracy is reported for the Gemma 2 and 3 models, we also want to know the specific accuracy for the math category.

The benchmarking script can be found in MMLU-Pro's GitHub repository. For example, if we want to evaluate a local model, we can reference this script:

  • model is <repo>/<model-name> from Hugging Face.
  • selected_subjects can be all for all categories, or math for a specific category.
python evaluate_from_local.py \
    --selected_subjects $selected_subjects \
    --save_dir $save_dir \
    --model $model \
    --global_record_file $global_record_file \
    --gpu_util $gpu_util

Running the above script with --selected_subjects math gives a math accuracy of 0.107.

Next, we want to fine-tune a smaller Gemma model to push the math accuracy above 0.107.

4. Training dataset: AceReason-Math

We use NVIDIA's AceReason-Math dataset to fine-tune our model. It is a training dataset with 49.6k math reasoning questions. Instead of giving a list of options, the dataset comes in the format of one question and one correct answer.

In order to use this question/answer format to fine-tune a model in Axolotl, we need to check what instruction tuning formats are provided. The alpaca_chat.load_qa format perfectly matches our training dataset, so we can update our fine-tuning YAML file as below:

datasets:
  - path: /data/nvidia_ace_reason_math.parquet
    type: alpaca_chat.load_qa

One thing to pay attention to is that we also need to match the dataset column names to question and answer, so we rename the original problem column to question.
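
A minimal sketch of that preparation step using the Hugging Face datasets library (assuming the train split and the column names mentioned above):

# Prepare nvidia/AceReason-Math for Axolotl's alpaca_chat.load_qa format.
from datasets import load_dataset

dataset = load_dataset("nvidia/AceReason-Math", split="train")
dataset = dataset.rename_column("problem", "question")  # load_qa expects question/answer columns
dataset.to_parquet("/data/nvidia_ace_reason_math.parquet")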

5. Fine-tuning strategies: QLoRA, LoRA, FSDP, DeepSpeed

There are many techniques and libraries around that help make the fine-tuning process more efficient, many of which focus on memory efficiency and performance. Most notably, these include LoRA, QLoRA, FSDP and DeepSpeed. Axolotl supports all of these techniques. DeepSpeed or FSDP can be used for multi-GPU setups, of which FSDP can be combined with QLoRA.

So what are these techniques?

LoRA (Low-Rank Adaptation) is a technique that allows us to fine-tune large models with fewer parameters by introducing low-rank matrices.

LoRA ... [reduces] trainable parameters by about 90% (source: https://huggingface.co/learn/llm-course/en/chapter11/4)

... this means that we can fit more into memory and more easily fine-tune on consumer-grade hardware. Awesome!
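
To get a feeling for where that reduction comes from, here is a small illustrative calculation for a single weight matrix (the dimensions and rank are example values, not our actual training config; the overall reduction across a model depends on which layers get adapters):

# LoRA replaces the update of a full d x k matrix by two low-rank matrices A (d x r) and B (r x k).
d, k, r = 4096, 4096, 8

full_params = d * k          # fine-tuning the full weight matrix
lora_params = d * r + r * k  # only the adapter matrices are trainable

print(f"full matrix:  {full_params:,} trainable parameters")  # 16,777,216
print(f"LoRA adapter: {lora_params:,} trainable parameters")  # 65,536
print(f"reduction for this matrix: {1 - lora_params / full_params:.1%}")  # ~99.6%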

  • QLoRA (Quantized LoRA) is an extension of LoRA that quantizes the model weights to reduce memory usage further.
  • DeepSpeed is a library for distributed training and inference, and is supported by Axolotl.

FSDP (Fully Sharded Data Parallel) is a way of distributing work over GPUs.

FSDP saves more memory because it doesn’t replicate a model on each GPU. It shards the model’s parameters, gradients and optimizer states across GPUs. Each model shard processes a portion of the data and the results are synchronized to speed up training. - FullyShardedDataParallel (HuggingFace)

In the Axolotl config, LoRA or QLoRA is used as an adapter in combination with FSDP, allowing us to fine-tune models like Gemma 2, Llama 3.1 or Qwen2 on our limited amount of GPU resources ✓. Let's now see how to run the fine-tuning process! 🚀

6. Running the fine-tuning 🚀

Axolotl provides a list of Docker images for us to use out of the box. To use them, you can reference the Docker installation guide.

Since we have our development GPU infrastructure set up and managed by Argo, we create a Kubernetes job to run the fine-tuning process as below:

apiVersion: batch/v1
kind: Job
metadata:
  name: axolotl
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: axolotl
    spec:
      restartPolicy: Never
      containers:
        - name: axolotl
          image: axolotlai/axolotl:main-latest
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -e
              echo "Start training..."
              axolotl train $YAML_FILE --num-processes $NUM_PROCESSES
              echo "Merging the LoRA adapters into the base model..."
              axolotl merge-lora $YAML_FILE --lora-model-dir=./outputs/out/
              echo "Uploading the fine-tuned model to Hugging Face..."
              huggingface-cli upload $MODEL_NAME ./outputs/out/merged/ .
          securityContext:
            privileged: true
            runAsUser: 0
            capabilities:
              add: ["SYS_ADMIN"]
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "40Gi"
            requests:
              memory: "20Gi"
          volumeMounts:
            - name: axolotl
              mountPath: /workspace/axolotl
            - name: huggingface
              mountPath: /root/.cache/huggingface
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: axolotl
          persistentVolumeClaim:
            claimName: axolotl-pvc
        - name: huggingface
          persistentVolumeClaim:
            claimName: huggingface-pvc
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "10Gi"

The variables come from the config map defined in kustomization.yaml:

  • NUM_PROCESSES is the number of GPUs to use
  • YAML_FILE is the YAML file from examples
  • MODEL_NAME is the model name to upload to Hugging Face.

Results

After the fine-tuning is done, which takes roughly 2.8 hours, we can use python evaluate_from_local.py to evaluate the result.

Compared to the non-fine-tuned Gemma-3-12B-it model on the math category, the fine-tuned model improves the math accuracy from 0.107 to 0.238 ✓.

Conclusion

In this blogpost we learned:

  • What fine-tuning is and why it is useful.
  • How to fine-tune a model on our own hardware using Axolotl.
  • How to benchmark the fine-tuned model using MMLU-Pro.
  • How to run the fine-tuning process in a Kubernetes job.

We successfully fine-tuned a model and benchmarked it with MMLU-Pro, math category. The results show that fine-tuning smaller models can lead to competitive performance on specific tasks, demonstrating the power of domain-specific knowledge in LLMs.

Fine-tuning is definitely not always the right approach, but when it is, it can be very powerful. Good luck in your own fine-tuning adventures! 🍀

Citations

  1. Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Amin, M., ... & Natarajan, V. (2025). Toward expert-level medical question answering with large language models. Nature Medicine, 31(3), 943-950.
Python Devcontainer with uv

uv is the new king in Python package manager land. It's fast, comprehensive and by now, well-adopted. Devcontainers provide a powerful way to create a reproducible development environment, by developing inside Docker containers. So, how do we combine the two? To find out, read along.

How to create a Python + uv Devcontainer

Let's set up our project step-by-step.

Step 1: Devcontainer.json

  1. Create devcontainer.json
    This is the Devcontainer specification telling an editor everything it needs to know about our Devcontainer. Most importantly, we decide which Docker image is to be used. There are official Python images available on Docker Hub, like python:3.13. However, this image is not meant for development. It is a minimal image used to run Python applications. We don't get things like Zsh, Git or a non-root user. For this reason, there are pre-built Devcontainer images we can use. We will use mcr.microsoft.com/devcontainers/python:3.13:

    {
        "image: "mcr.microsoft.com/devcontainers/python:3.13"
    }
    

    .devcontainer/devcontainer.json

    ... great!

  2. Adding uv
    To add uv, we can use Devcontainer Features:

    Python Devcontainer with uv

    Devcontainer Features are pieces of installation code which can be added to your Devcontainer in modular fashion. They are shareable and there is a large collection of features available for us to use. So is the case for uv! There is a Devcontainer feature for uv. We can add uv like so:

    {
        "image: "mcr.microsoft.com/devcontainers/python:3.13",
    
        "features": {
            "ghcr.io/jsburckhardt/devcontainer-features/uv:1": {},
            "ghcr.io/jsburckhardt/devcontainer-features/ruff:1": {}
        },
    }
    

    .devcontainer/devcontainer.json

    ... great! We also added a Devcontainer feature for Ruff. Ruff is a fast and widely used tool for Python linting and formatting. The latest versions of uv and Ruff are installed unless specified otherwise.

  3. Adding VSCode customizations
    Although Devcontainers are supported by various editors, VSCode is the most common. GitHub Codespaces also uses VSCode, meaning we get full Devcontainer support in GitHub Codespaces too.

    We can define extensions to be installed automatically alongside our Devcontainer, by defining customizations:

    {
        "image": "mcr.microsoft.com/devcontainers/python:3.13",
    
        "features": {
            "ghcr.io/jsburckhardt/devcontainer-features/uv:1": {},
            "ghcr.io/jsburckhardt/devcontainer-features/ruff:1": {}
        },
    
        "customizations": {
            "vscode": {
                "extensions": [
                    "ms-python.python",
                    "ms-toolsai.jupyter"
                ]
            }
        }
    }
    

    .devcontainer/devcontainer.json

    ... the above installs the VSCode Python extension and Jupyter extension automatically when the Devcontainer is opened.

With just our .devcontainer.json file, we can already start the Devcontainer! Install the VSCode Dev Containers extension and find the blue popup saying Reopen in Container.


With our .devcontainer.json file we can now open the folder as a Devcontainer in VSCode 🎉.

💡
You can also use the VSCode command palette (CMD+SHIFT+P) and find the command Reopen in Container.

Upon opening our Devcontainer, the features we defined earlier are installed for us. Python, uv and ruff are all there:

Python Devcontainer with uv
Our Devcontainer automatically installs Python, uv and ruff.

Great ✓. Let's continue with our Python project setup.

Step 2: Python project setup with uv

Now that we have a Devcontainer, let's set up our project inside it.

  1. uv init

    We can use uv init to scaffold a new Python project.

    uv init \
      --package \
      --name example_project \
      --description "Example Python Devcontainer project with uv"
    

    uv init can be used to create Python projects. The --package argument creates a src folder structure.

    ... which scaffolds a project structure for us:

    Python Devcontainer with uv
    A freshly initialised uv project.

    Importantly, a pyproject.toml file is created. This is the main configuration file for our Python project. Besides configuration, it keeps track of dependencies. Let's add some dependencies.

  2. Adding dependencies

    We want to test our code, so let's add pytest:

    uv add pytest --dev
    

    Use uv add to add new dependencies. uv updates pyproject.toml for you. See uv docs Managing dependencies.

    ... upon installing, VSCode also detects a virtual environment being created:

    Python Devcontainer with uv
    VSCode hints the user to select the virtual environment once it is created.

    Select the virtual environment. Notice that after the installation uv has updated our pyproject.toml file.

    💡
    If you missed the notification, open the command palette (CMD+SHIFT+P), type Python: Select Interpreter and select ./.venv/bin/python.
  3. Running tests
    Let's now add a test so we can validate our setup is working.

    Python Devcontainer with uv
    A simple function and according test.
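
    The screenshot is not reproduced here; a minimal sketch of such a function and test (file paths are hypothetical) could be:

    # src/example_project/__init__.py (hypothetical)
    def add(a: int, b: int) -> int:
        return a + b

    # tests/test_add.py (hypothetical)
    from example_project import add

    def test_add():
        assert add(1, 2) == 3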

    ... which we can run using:

    uv run pytest .
    

    Use uv run to run shell commands with uv. The command ensures it is run with the correct virtual environment.

    Resulting in ... a passing test! ✓

    Python Devcontainer with uv
    Our test passes ✓. The environment is correctly setup.

    Also, feel free to use the VSCode UI:

    Python Devcontainer with uv
    The VSCode testing UI.

    Awesome! 🎉

Final setup

At last, our project looks like the following:

Python Devcontainer with uv

For ease of use, this GitHub repository is available and can be used as a template:

GitHub - dunnkers/python-uv-devcontainer: Python project setup using a Devcontainer and uv.

A GitHub repository template for a Python uv Devcontainer.

... the repository also comes with some Extras: a GitHub Actions workflow for CI/CD and a Dockerfile for production deployments. Enjoy!

Conclusion

That was an adventure. We set up a Devcontainer using a devcontainer.json file, we added uv and ruff by using Devcontainer features and set up our Python project using uv init, uv add and uv run.

uv is a powerful tool and combining it with a Devcontainer gives us an easy and reproducible project setup. Good luck with your own Python/uv Devcontainer setup! 🍀

Golden hour in Amsterdam

Being at the wrong right place at the wrong right time.
Are you ready for MLOps? (on Xebia.com ⧉)

Introduction

MLOps has survived the hype cycle and is gaining in maturity. But are we looking to MLOps for answers to the right questions? No matter how valuable MLOps can be for you, without proper building blocks in place MLOps cannot live up to its full potential. What are the prerequisites for MLOps? What parts of MLOps should you focus on? When should you even start thinking about MLOps, and when is it wiser to focus on ‘plain’ DevOps first? Read along to learn more!

About being ready

So, what does it mean to be ready? Being ready means understanding why you need a technology and what it is. Then, we can start understanding what it takes to adopt it and especially: what prerequisites are required to do so. With all the prerequisites in place, we can say we are ready.

In short, being ready for MLOps means you understand:

  1. Why adopt MLOps
  2. What MLOps is
  3. When to adopt MLOps

... only then can you start thinking about how to adopt MLOps.

As an analogy, think of building a house. You need the right foundation in place first before you build higher. Each level needs to be built properly before you advance. If the right foundations are not in place, you can build however beautiful a house on top of them, but the building will not stand the test of time.

Are you ready for MLOps? (on Xebia.com ⧉)
To build high, beautiful buildings, you need a solid foundation. So is the same for MLOps. Do you have the right foundation to adopt MLOps?

To start adopting MLOps the prerequisites need to be in place before we start building on top of them. What are those prerequisites? Let's find out if you are ready, together.

Why MLOps?

Why bother with MLOps? Let's first explore what pain you might experience that MLOps can help with. First let's throw in a statistic. Gartner reported that on average only 54% of AI models move from pilot to production:

Are you ready for MLOps? (on Xebia.com ⧉)
Many AI models developed never even reach production.

... that is not an awful lot. What a waste! We spend time trying to get models into production but are not able to. Why is that? These days Data Science is no longer a new domain by any means. The time when Harvard Business Review declared Data Scientist the "Sexiest Job of the 21st Century" is more than a decade ago [1]. In 2019 alone, Data Scientist job postings on Indeed rose by 256% [2]. Universities have been pumping out Data Science graduates at a rapid pace and the Open Source community made ML technology easy to use and widely available. Both the tech and the skills are there:

Are you ready for MLOps? (on Xebia.com ⧉)
ML tech is by now easy to use and widely available. Data Science profiles are more abundant in the market than ever before.

So then let me reiterate: why, still, are teams having trouble launching Machine Learning models into production? A big part of the reason lies in collaboration between teams. Even though we all wish for seamless transitions from the development phase towards production, Machine Learning development and operations teams can have conflicting interests, making it difficult to collaborate. The development and operations worlds differ in various aspects:

  • Development ML teams are focused on innovation and speed
    • Dev ML teams have roles like Data Scientists, Data Engineers, Business owners.
    • Dev ML teams work agile and experiment rapidly using PoC's.
    • Dev ML teams work in Jupyter notebooks, Python, R, etc.
  • Operations ML teams are focused on stability and reliability
    • Ops ML teams have roles like Platform Engineers, SRE's, DevOps Engineers, Software Engineers, IT Managers.
    • Ops ML teams work with strict roadmaps, make long-term plans and might need to be available for on-call support.
    • Ops ML teams work in CI/CD pipelines, Kubernetes, Docker, Java, Scala, etc.

... that does not make things easier. The expectation is that dev- and ops magically work well together. We can just 'hand over' the model to the operations team and they will take care of it. This is not the case, unfortunately.

Expectation:

Are you ready for MLOps? (on Xebia.com ⧉)
It is often expected that development- and operations teams magically work well together.

... versus Reality:

Are you ready for MLOps? (on Xebia.com ⧉)
In reality collaboration between development and operations is hard due to conflicting interests. With lacking collaboration an undesired handover between teams is introduced in which context is lost.

... such handovers make development cycles unnecessarily long and make it difficult to get feedback from production models back to the developers. To conclude, a lack of collaboration between development and operations causes three issues:

  • ❌ Few models in production ... or production solutions prove unreliable.
  • ❌ Long development cycles ... to create models, update models or add features.
  • ❌ Lacking feedback ... on model performance and added value.

How to solve this? Enter MLOps.

What is MLOps?

MLOps can help development- and production teams work better together and more efficiently deploy Machine Learning models to production. MLOps makes the dev- and ops worlds more familiar with each other and aims to bring the worlds closer together.

The term has gained in popularity since 2018 [3] [4], when Machine Learning had undergone massive growth. Fair to say, some years have passed and by now we have moved beyond the hype:

Are you ready for MLOps? (on Xebia.com ⧉)
MLOps has moved beyond the hype and climbing up towards the plateau of productivity. Graph refers to Gartner hype cycle.

So what is MLOps comprised of? No longer is Machine Learning development only about training a ML model. We are concerned about many other things, too. Preprocessing, feature engineering, serving, scheduling and monitoring to name a few. We can map those to concrete methodologies and tooling that makes up the technical part of MLOps:

Are you ready for MLOps? (on Xebia.com ⧉)
ML development activities mapped to core MLOps components.

... some might already be familiar to you. They are to major Cloud Providers, too. Cloud providers have answered the market need for better tooling in the Machine Learning space. In fact, every major provider has a Machine Learning platform that helps you do MLOps:

Are you ready for MLOps? (on Xebia.com ⧉)
Major cloud providers offer managed MLOps platforms.

That is massively useful. Not only is more tooling available to do MLOps, there also exist managed solutions that you can use out of the box. 

Let's recall the three issues we had with handovers between development- and operations teams. Remember them? We can now map them toward what MLOps promises to give us, helping us with these issues. 

Are you ready for MLOps? (on Xebia.com ⧉)
MLOps promises us speed by automation, rapid feedback and autonomy with end-to-end product teams.

... that sounds great! We have MLOps as a key methodology to reduce the handover and bring dev and ops closer together. But what actually empowers and enables MLOps? What do we need to do MLOps successfully? Have we ever brought Dev and Ops teams together before? In fact, we have! It's in the name! Enter DevOps.

DevOps in the mix

MLOps is largely inspired by DevOps [5]. DevOps has existed for longer and is more established and mature. Since 2007 DevOps has been a massively influential methodology in software development. Key is that development is not sequential but continuous, best illustrated using the DevOps lifecycle:

Are you ready for MLOps? (on Xebia.com ⧉)
The DevOps lifecycle [6].

In DevOps key principles are [6]:

  • Automation
  • Collaboration & communication
  • Continuously improving
  • Focus on user needs

... so how does MLOps interplay with DevOps? DevOps came from the Software Development world and therefore deals with Code. In Machine Learning, however, we also need to take Data and the Model into account:

Are you ready for MLOps? (on Xebia.com ⧉)
Whereas DevOps deals mainly with Code, MLOps also entails data and a ML model.

Taking into account automating operations related to all of the code, data and model is what makes MLOps different from DevOps. For example, when it comes to automation we continuously check data quality, train models and run inference to create guarantees about the state of our Machine Learning system.

Doing so continuously means we loop our Monitoring practices back into our Model Building step like so:

Are you ready for MLOps? (on Xebia.com ⧉)
Iterative, continuous development means using monitoring feedback to build- and improve your ML model.

That's great. So again, what is this interplay between DevOps and MLOps? It is clear that MLOps borrowed a lot from DevOps. But in what way do we need DevOps to do MLOps? The way we see it is that you need to have DevOps in place before you can do MLOps. DevOps is a prerequisite for MLOps. It's like stacking building blocks on top of each other, creating a solid foundation before you build higher:

Are you ready for MLOps? (on Xebia.com ⧉)
Get DevOps in place before you start doing MLOps.

That is promising, isn't it? Well, it is, but there's also effort required on the organisational side. MLOps is not just about tech.

The MLOps twist

MLOps is about more than just tech. MLOps is about all of Technology, People and Processes. Getting MLOps right means having carefully considered all three:

Are you ready for MLOps? (on Xebia.com ⧉)
MLOps is not just about tooling and tech. It is just as much about People and Processes.

What does that mean? It means that in the MLOps lifecycle we want the ML product team to operate over the entire ML lifecycle, owning the entire process:

Are you ready for MLOps? (on Xebia.com ⧉)
The MLOps lifecycle [7]. The ML product team operates over the entire ML lifecycle.

... this does indeed mean that we need a diverse set of roles, together in one team. Data Scientists, Machine Learning Engineers, Data Engineers and such need to work together. Working together, the handover can be minimised and models can be built that are actually ready for production.

To facilitate such an ML product team, an ML platform team is used to enable them. As a central team, they can provide best practices and tooling to the ML product teams. Key, though, is that the ML product teams still own the process from development to operation:

Are you ready for MLOps? (on Xebia.com ⧉)
ML product teams are enabled by a central ML platform team.

... admittedly, this is all not easy. Drawing out organisational changes on paper is easier than actually doing it. It requires serious effort to challenge your existing setup and restructure your organisation. But ask yourself - is it worth it? Is there serious Machine Learning potential in your organisation that you want to take advantage of? If yes, go for it.

Concluding: when to adopt MLOps

We learned a lot. We should do MLOps to eliminate handovers between dev- and ops teams and bring Software Engineering operational efficiency to the Machine Learning World. MLOps does so by providing useful technologies and tooling, widely offered by major cloud providers. MLOps is largely inspired by DevOps, which is a prerequisite for MLOps. This allowed the industry to take advantage of well-matured DevOps concepts like automation, collaboration and continuous improvement. Only do MLOps if you have your DevOps practices in place. 

Are you ready for MLOps? (on Xebia.com ⧉)
Dev- and ops can collaborate and co-exist happily together. Do MLOps once you get your DevOps practices in place.

If you end up doing so, you might very well end up in a better place because of it. Embrace the organisational investment and get teams closer together. We wish you good luck with your journey - and would love to help you along the way.

Around the Amstel station

All photos shot with the Ricoh GR IIIx HDF.
The GenAI automation potential of data extraction (on Xebia.com ⧉)

Introduction

GenAI has been around for a little while and its capabilities are expanding quickly. We are experiencing peak-hype levels for AI and expectations are sky-high. But still, many companies fail to generate actual value with GenAI. Why is that? Why are we promised so much yet manage to get so little? What is still fantasy and what concrete potential exists? What should be automated and what should not? In this blogpost, we explore the GenAI automation potential that exists today for data extraction.

Together, we will learn about:

  1. Why GenAI data extraction
  2. The automation levels
  3. The automation potential

Let’s start!

1. Why GenAI data extraction

Why exactly should we care about GenAI data extraction? Let us motivate this by looking at 4 example usecases in different domains and with various data types like text, images, documents and audio.

  • Insurance claims (text)
  • Menu cards (images)
  • Annual reports (PDF documents)
  • Customer service calls (audio)

Insurance claims (text)

Imagine you are running an insurance company and your employees are tasked with processing insurance claims. Claims are often provided in a free-form format: email, phone call, chat, in short: unstructured data. Your employees will then need to process this information to handle the claim: update the internal databases, find similar claim cases, etc. Let’s take the case where we receive this information in free-form text like emails.

The GenAI automation potential of data extraction (on Xebia.com ⧉)
LLMs can extract structured data from free-form text like an insurance claim, saving employees time doing it manually. See Code.

We instruct the LLM to give its answers in a structured format [ref], so we can easily work with and transform its output. To clear this entire email box of some 25 emails, Gemini 2.0 Flash took about 15 seconds. That is when run sequentially: when run in parallel, this amount of data can be processed in a matter of seconds. If this were scaled to process a mailbox of 100k emails, we would spend about $1.52. Not an awful lot given we can save some serious time here. We can also halve this cost if we use Gemini’s Batch API instead. How long would it take a human to process 100k emails? What is their time worth?

Extracting client information from this email message took Gemini 2.0 Flash about 0.5 seconds and cost $0.000015.
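
The post links to its own code; as a rough, hedged sketch, such an extraction call could look as follows using the google-genai SDK with a Pydantic schema (the schema fields and email text are illustrative assumptions, and the SDK usage should be checked against the current documentation):

# Extract structured claim data from a free-form email (requires GEMINI_API_KEY in the environment).
from pydantic import BaseModel
from google import genai


class InsuranceClaim(BaseModel):
    customer_name: str
    incident_date: str
    claimed_amount: str
    summary: str


client = genai.Client()  # picks up the API key from the environment

email_body = """Hi, my name is Jane Doe. My bike was stolen on 2025-03-02,
it was locked in front of my house and worth about 800 euros."""

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=f"Extract the claim details from this email:\n\n{email_body}",
    config={
        "response_mime_type": "application/json",
        "response_schema": InsuranceClaim,
    },
)
print(response.parsed)  # an InsuranceClaim instance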

Menu cards (images)

Different usecase. Imagine now you are running an online platform for food delivery. You want to digitise menu cards so the offering can be ingested into your own platform in a standardised manner. You are staffing a number of employees to help you do this task. The menu cards arrive in image format. Can we speed up the menu conversion process?

The GenAI automation potential of data extraction (on Xebia.com ⧉)
LLMs can read menu cards and convert them directly into a desired structured format, cutting development cost for custom- OCR and extraction solutions and potentially providing superior extraction performance. See Code.

Yes: we can speed this up. LLMs consider the image as a whole and are also able to accurately process text details in the image. Because we are using a generic model, images in many contexts can be processed by the same model without having to fine-tune for specific situations.

Converting this menu card took Gemini 2.0 Flash about 10 seconds and cost $0.0005. How long would a human take?

Annual reports (PDF documents)

Imagine you are tasked with answering questions based on financial documents. You are to go through each question and reason on the answer based on information provided in the documents. Take annual statements, for example. Companies worldwide are obliged to publish these statements and company finances are to be audited. Going through such documents is time-consuming. Take an annual statement in PDF document format. Can we help employees formulate answers based on this document faster?

The GenAI automation potential of data extraction (on Xebia.com ⧉)
An entire annual report document of 382 pages fits in the model context at once, allowing the model to answer questions considering the whole document. Reasoning-type of questions like audit questionnaires can now be answered by LLMs, citing where the answer is located. The human gets a head start and then only has to check and extend the answers, saving time. See Code.

Yes. The document fits in the model context all at once, allowing the model to take a comprehensive view of the document and formulate sensible answers. In fact, LLMs like Google’s Gemini models take up to 2 million tokens in context at once [ref]. This allows for up to 3,000 PDF pages to be inserted into the prompt. PDF pages are processed as images by Gemini [ref].

Answering these questions took Gemini 2.0 Flash just under 1 minute and cost $0.015. How long would a human take?

Customer service calls (audio)

Imagine you have a customer service department that is processing many calls daily. The customer service agents do their best to document and record useful information during the call but cannot possibly capture all valuable information. Calls are temporarily recorded for analytical purposes. Calls could be listened back to figure out the customer problem, the root cause, the solution and the customer sentiment. This information is valuable to have because it is necessary feedback to improve customer service in a targeted way. What issues were we most often unable to resolve? In what situations are our customers left unhappy?

The insights are valuable, but hard to get to. The data is hidden in an audio format.

The GenAI automation potential of data extraction (on Xebia.com ⧉)
LLMs can listen to audio quicker than any human. Customer service calls can be analysed and previously-inaccessible but valuable information can be extracted and taken advantage of. Insights can be gathered in automated fashion, effectively improving business processes in targeted fashion. See Code.

Gemini 2.0 Flash processed this 5-minute customer service call in 3 seconds and cost $0.0083. That is faster than any human takes to listen to and process the call. LLMs can also accurately process audio, without the need for a transcription step first. Because the model is exposed to the raw audio file, more than just the words that are spoken can be taken into account. In what way is this customer talking? Is this customer happy or unhappy? Off-the-shelf LLMs can be used without modification, just by prompting, to extract structured data from audio [ref].

Concluding on the 4 usecase examples, we learned the following.

  • LLMs for data extraction: LLMs can be used to extract structured data from free-form formats like text, images, PDFs, and audio.
  • Larger contexts: Growing LLM context windows allow processing of larger documents at once. This strengthens the applicability for data extraction using LLMs without the need for extra retrieval steps.
  • Competitive pricing: Lower inference costs make more data extraction use cases feasible.
  • GenAI data extraction ROI: LLMs can complete data extraction tasks faster than humans and at a lower cost. Albeit not at perfection, the goal is to create a system that is good enough to assist humans in their work and provide business value.

That is really cool. LLMs are becoming more capable and cheaper, making more usecases possible than before. So what exactly is this data extraction? How can we use it to our advantage to automate business processes? What are the automation levels for data extraction?

2. The automation levels

To discover what is possible with GenAI data extraction, we will divide into 4 levels of increasing automation.

  • Level 0: Manual
  • Level 1: Assisted
  • Level 2: Autopilot
  • Level 3: Autonomous

Level 0: Manual work

No automation is applied. Human labor is required to extract data in the desired format. The extracted data is then consumed by a human user. All of the data extraction, the evaluation of the data quality and any actions to be taken with the data are manual human processes.

The GenAI automation potential of data extraction (on Xebia.com ⧉)

Without automation, manual work is required to extract data in structured format. After extraction humans decide on what to do with the data.

We can do better. LLMs can be used to automate part of this process, helping the human in the Assisted automation level.

Level 1: Assisted

In the assisted workflow, an LLM is used to extract useful data. This is a process we like to call Structured Data Extraction.

The GenAI automation potential of data extraction (on Xebia.com ⧉)

LLMs can be used for structured data extraction: extracting useful data from free-form documents in structured format.

That can already save a lot of time. Previously we needed to manually perform the tedious process of going through the document to formulate answers in the target format. In this level of automation, though, we are still targeting a human user as output. The human user is responsible for evaluating the LLM output and deciding what to do with the data. This gives the human control but also costs extra time. We can do better in Autopilot.

Level 2: Autopilot

In the autopilot workflow we go a step further. The extracted data is directly ingested into a Data Warehouse, opening up new automation possibilities. With the data in the Data Warehouse, it can be taken advantage of more efficiently: we can now use the data for dashboards and insights as well as for Machine Learning usecases.

The GenAI automation potential of data extraction (on Xebia.com ⧉)
When extracted data is ingested into a Data Warehouse the data can be more efficiently taken advantage of. Dashboards can reveal new insights and ML usecases can benefit from richer available data.

Ingesting this data directly into a Data Warehouse can be beneficial indeed. But with the introduction of this automated ingestion we also need to be careful. Data Warehouse consumers are farther away from the extraction process and are not aware of the data quality. We need to always ask ourselves: Is this data reliable?

The GenAI automation potential of data extraction (on Xebia.com ⧉)
Data Warehouse consumers need to always be aware of data- quality and reliability.

Bad data quality can lead to misleading insights and bad decisions further down the line. This is not what we want! Quantified metrics informing us on the data reliability are required. Those with expertise in the dataset and those involved in the extraction process will need to help out. These are typically an AI Engineer and a Subject Matter Expert. We need to introduce an evaluation step.

The GenAI automation potential of data extraction (on Xebia.com ⧉)
A Subject Matter Expert is required to label samples of data, informing Data Warehouse consumers and engineers on the data reliability. The Subject Matter Expert is to be enabled to also experiment with the data extraction process, lowering iteration times.

More steps are introduced, keeping the human-in-the-loop. The steps are necessary, though. The Subject Matter Expert plays a key role in ensuring the quality- and reliability of the data. This gives Data Warehouse consumers the trust they are looking for and at the same time enables the AI Engineer to more systematically improve the Structured Data Extraction system. Additionally, enabling the Subject Matter Expert themselves to be part of the prompting process lowers iteration times and reduces context being lost in translation between AI Engineer and Subject Matter Expert. Win-win.

We can go more automated, even. Let’s continue to the last level: Autonomous.

Level 3: Autonomous

In this level, the last human interactions are to be automated as well. Evaluation previously done manually by the Subject Matter Expert is now done by Quality Assurance and evaluation tooling.

The GenAI automation potential of data extraction (on Xebia.com ⧉)
Introducing Quality Assurance- and evaluation tooling allows the Structured Data Extraction system to run fully autonomous.

So what do we mean by such tooling? We want tooling that helps us guarantee the outcome and quality of our data with minimal human intervention. This can be LLM-as-a-judge systems, for example. The effort required to successfully implement such systems and run them safely is high, however. One gets value in return from the extra automation, but whether this is worth the effort and cost is the question. Let’s compare the automation levels and summarise their potential.

3. The automation potential

We have learned about each of the four levels of automation for data extraction using GenAI. Also, for each level, we have explored what a Structured Data Extraction system would look like architecturally. That is great, but where is the most potential? Let’s first summarise the automation levels as follows:

The Automation Levels for GenAI data extraction:

  • Level 0: Manual (human-in-the-loop: yes). Human performs all work. Human is responsible for data extraction, quality assurance and consumption of the data.
  • Level 1: Assisted (human-in-the-loop: yes). Human is assisted with tasks but still in control and responsible. Data is extracted using an LLM. Extracted data is used to assist human users.
  • Level 2: Autopilot (human-in-the-loop: yes). Tasks largely automated; the human supervises. Extracted data is upserted directly into a Data Warehouse. Systematic evaluation is necessary and Subject Matter Expert involvement is key.
  • Level 3: Autonomous (human-in-the-loop: no). Tasks fully automated. The evaluation step is automated using Quality Assurance and evaluation tooling. Fully automated pipelines: no human intervention or supervision.

So say you are thinking about implementing a data extraction usecase. What is the ultimate goal? The ultimate goal need not always be to automate as much as possible: it should be to create value. Perhaps you can benefit enough from your usecase if Assisted or Autopilot automation is applied, and the extra investment for full automation is not worth it. If we were to take the menu card example from earlier, what would the potential time savings look like?

Time savings for automating data extraction (example): 
30 minutes (manual) → 5 minutes (assisted) → 1 minute (autopilot) → 0 minutes (autonomous).

We can see that the time savings are not linear. The more automation we apply, the less additional time we save, even though the last step is the hardest and most expensive to implement. It is not always worth the effort and cost to implement this last step. These are the Diminishing Returns of automation, which can be plotted as follows:

The GenAI automation potential of data extraction (on Xebia.com ⧉)

Automation is nice but value is the goal. Take automation step-by-step. Partial automation is also valuable.

Let’s sum things up. To conclude, GenAI for data extraction is:

  • Useful for a wide variety of usecases with various data types including text, images, PDFs and audio.
  • More likely to provide ROI due to 1) cheaper models, 2) larger context windows and 3) more capable models able to process multi-modal data.
  • Very well suited for partial automation: which can bring business value without needing extra investment for full automation.

Good luck with your own GenAI data extraction usecases 🍀♡.

The Levels of RAG (on Xebia.com ⧉)

Introduction

LLMs can be supercharged using a technique called RAG, allowing us to overcome dealbreaker problems like hallucinations or no access to internal data. RAG is gaining industry momentum and is rapidly becoming more mature, both in the open-source world and at major Cloud vendors. But what can we expect from RAG? What is the current state of the tech in the industry? What use cases work well and which are more challenging? Let’s find out together!

Why RAG

Retrieval Augmented Generation (RAG) is a popular technique that combines retrieval methods like vector search with Large Language Models (LLMs). This gives us several advantages, like retrieving extra information based on a user search query, allowing us to quote and cite LLM-generated answers. In short, RAG is valuable because it gives us:

  • Access to up-to-date knowledge
  • Access to internal company data
  • More factual answers

That all sounds great. But how does RAG work? Let’s introduce the Levels of RAG to help us understand RAG in increasing grades of complexity.

The Levels of RAG

RAG has many different facets. To help more easily understand RAG, let’s break it down into four levels:

The Levels of RAG:

  • Level 1: Basic RAG
  • Level 2: Hybrid Search
  • Level 3: Advanced data formats
  • Level 4: Multimodal

Each level adds new complexity. Besides explaining the techniques, we will also look into justifying the introduced complexities. That is important, because you want to have good reasons to do so: we want to make everything as simple as possible, but no simpler [1].

We will start with building a RAG from the beginning and understand which components are required to do so. So let’s jump right into Level 1: Basic RAG.

Level 1: Basic RAG

The Levels of RAG (on Xebia.com ⧉)

Let’s build a RAG. To do RAG, we need two main components: document retrieval and answer generation.

The Levels of RAG (on Xebia.com ⧉)

The two main components of RAG: document retrieval and answer generation.

In contrast with a normal LLM interaction, we are now first retrieving relevant context to only then answer the question using that context. That allows us to ground our answer in the retrieved context, making the answer more factually reliable.

Let’s look at both components in more detail, starting with the Document retrieval step. One of the main techniques powering our retrieval step is Vector Search.

To retrieve documents relevant to our user query, we will use Vector Search. This technique is based on vector embeddings. What are those? Imagine we embed words. Then words that are semantically similar should be closer together in the embedding space. We can do the same for sentences, paragraphs, or even whole documents. Such an embedding is typically represented by vectors of 768, 1024, or even 3072 dimensions. We as humans cannot visualize such high-dimensional spaces though: we can only see 3 dimensions! For example's sake, let us compress such an embedding space into 2 dimensions so we can visualize it:

The Levels of RAG (on Xebia.com ⧉)

Embeddings that are similar in meaning are closer to each other. In practice not in 2D but higher dimensions like 768D.

Note this is a drastically oversimplified explanation of vector embeddings. Creating vector embeddings of text, from words up to entire documents, is quite a study on its own. Most important to note though is that with embeddings we capture the meaning of the embedded text!

So how to use this for RAG? Well, instead of embedding words we can embed our source documents instead. We can then also embed the user question and then perform a vector similarity search on those:

The Levels of RAG (on Xebia.com ⧉)

Embedding both our source documents and the search query allows us to do a vector similarity search.
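
As a small illustration of that similarity search (the embeddings below are made-up 4-dimensional vectors; in practice an embedding model produces them, with hundreds of dimensions):

# Toy vector search: rank documents by cosine similarity to the query embedding.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_embeddings = {
    "Sports analytics with Python": np.array([0.9, 0.1, 0.0, 0.2]),
    "Intro to LLM fine-tuning": np.array([0.1, 0.8, 0.3, 0.0]),
}
query_embedding = np.array([0.8, 0.2, 0.1, 0.1])  # e.g. "which talks are about sports?"

ranked = sorted(
    doc_embeddings.items(),
    key=lambda item: cosine_similarity(query_embedding, item[1]),
    reverse=True,
)
for title, embedding in ranked:
    print(title, round(cosine_similarity(query_embedding, embedding), 3))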

Great! We now have the ingredients necessary to construct our first Document retrieval setup:

The Levels of RAG (on Xebia.com ⧉)

The full retrieval step with vector search.

Next is the Answer generation step. This entails passing the found pieces of context to an LLM to form a final answer. We will keep that simple, so we will use a single prompt for it:

The Levels of RAG (on Xebia.com ⧉)

The RAG answer generation step. Prompt from LangChain hub (rlm/rag-prompt).

Cool. That concludes our first full version of the Basic RAG. To test our RAG system, we need some data. I recently went to PyData and figured it would be cool to create a RAG based on their schedule. Let’s design a RAG using the schedule of PyData Eindhoven 2024.

The Levels of RAG (on Xebia.com ⧉)

The RAG system loaded with data from the PyData Eindhoven 2024 schedule.

So how do we ingest such a schedule in a vector database? We will take each session and format it as Markdown, respecting the structure of the schedule by using headers.

The Levels of RAG (on Xebia.com ⧉)

Each session is formatted as Markdown and then converted to an embedding.
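
A sketch of what that formatting step could look like (the session fields are assumptions based on a typical schedule export, not the exact PyData JSON structure):

# Turn one schedule entry into a Markdown document, preserving structure with headers.
session = {
    "title": "Scikit-Learn can do THAT?!",
    "speaker": "Jane Doe",
    "room": "Auditorium",
    "abstract": "A tour of lesser-known scikit-learn features.",
}

markdown = (
    f"# {session['title']}\n\n"
    f"## Speaker\n{session['speaker']}\n\n"
    f"## Room\n{session['room']}\n\n"
    f"## Abstract\n{session['abstract']}\n"
)
print(markdown)  # this text is what gets embedded and ingested into the vector database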

Our RAG system is now fully functional. We embedded all sessions and ingested them into a vector database. We can then find sessions similar to the user question using vector similarity search and answer the question based on a predefined prompt. Let’s test it out!

The Levels of RAG (on Xebia.com ⧉)

Testing out our basic RAG system. The RAG retrieves talks related to the user question and correctly answers the question.

That is cool! Our RAG was able to correctly answer the question. Upon inspection of the schedule we can see that the listed talks are indeed all related to sports. We just built a first RAG system 🎉.

There are points of improvement, however. We are now embedding the sessions in their entirety. But embedding large pieces of text can be problematic, because:

  • ❌ Embeddings can get saturated and lose meaning
  • ❌ Imprecise citations
  • ❌ Large context → high cost

So what we can do to solve this problem is divide the text up into smaller pieces and embed those instead. This is Chunking.

Chunking

In chunking the challenge lies in determining how to divide up the text to then embed those smaller pieces. There are many ways to chunk text. Let’s first look at a simple one. We will create chunks of fixed length:

The Levels of RAG (on Xebia.com ⧉)

Splitting a text into fixed-sized pieces does not create meaningful chunks. Words and sentences are split in random places. 
Character Text Splitter with chunk_size = 25. This is an unrealistically small chunk size but is merely used for example sake.

This is not ideal. Words, sentences and paragraphs are not respected and are split at inconvenient locations. This decreases the quality of our embeddings. We can do better than this. Let’s try another splitter that tries to better respect the structure of the text by taking into account line breaks (¶):

The Levels of RAG (on Xebia.com ⧉)

Text is split, respecting line breaks (¶). This way, chunks do not span across paragraphs and split better. 
Recursive Character Text Splitter with chunk_size = 25. This is an unrealistically small chunk size but is merely used for example sake.

This is better. The quality of our embeddings is now better due to better splits. Note that chunk_size = 25 is only used for example sake. In practice we will use larger chunk sizes like 100, 500, or 1000. Try and see what works best on your data. But most notably, also do experiment with different text splitters. The LangChain Text Splitters section has many available and the internet is full of others.
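
For reference, a minimal sketch using LangChain's recursive splitter (chunk size and overlap are illustrative; tune them for your data):

# Split text into chunks that respect paragraph and sentence boundaries where possible.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = """# Scikit-Learn can do THAT?!

A tour of lesser-known scikit-learn features.

From pipelines to custom transformers and beyond."""

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,    # maximum characters per chunk
    chunk_overlap=20,  # small overlap so chunks do not cut off context mid-sentence
)
chunks = splitter.split_text(text)
print(chunks)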

Now that we have chunked our text, we can embed those chunks and ingest them into our vector database. Then, when we let the LLM answer our question, we can also ask it to state which chunks it used in answering the question. This way, we can pinpoint accurately which information the LLM was grounded in whilst answering the question, allowing us to provide the user with Citations:

The Levels of RAG (on Xebia.com ⧉)

Chunks can be used to show the user where the information came from to generate the final answer, allowing us to show the user source citations.

That is great. Citations can be very powerful in a RAG application to improve transparency and thereby also user trust. Summarized, chunking has the following benefits:

  • Embeddings are more meaningful
  • Precise citations
  • Shorter context → lower cost

We can now extend our Document retrieval step with chunking:

The Levels of RAG (on Xebia.com ⧉)

RAG retrieval step with chunking.

We have our Basic RAG set up with Vector search and Chunking. We also saw our system can answer questions correctly. But how well does it actually do? Let’s take the question “What’s the talk starting with Maximizing about?” and launch it at our RAG:

The Levels of RAG (on Xebia.com ⧉)

For this question the incorrect context is retrieved, causing the LLM to give a wrong answer.
UI shown is LangSmith, a GenAI monitoring tool. Open source alternative is LangFuse.

Ouch! This answer is wildly wrong. This is not the talk starting with Maximizing. The talk described has the title Scikit-Learn can do THAT?!, which clearly does not start with the word Maximizing.

For this reason, we need another method of search, like keyword search. Because we would also still like to keep the benefits of vector search, we can combine the two methods to create a Hybrid Search.

Level 2: Hybrid Search

The Levels of RAG (on Xebia.com ⧉)

In Hybrid Search, we aim to combine two ranking methods to get the best of both worlds. We will combine Vector search with Keyword search to create a Hybrid Search setup. To do so, we must pick a suitable ranking method. Common ranking algorithms are:

  • TF-IDF
  • BM-25

… of which we can consider BM-25 an improved version of TF-IDF. Now, how do we combine Vector Search with an algorithm like BM-25? We now have two separate rankings we want to fuse together into one. For this, we can use Reciprocal Rank Fusion:

The Levels of RAG (on Xebia.com ⧉)

Reciprocal Rank Fusion is used to merge two ranking methods into one (see Elastic Search docs).
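
As a concrete illustration, a minimal Reciprocal Rank Fusion implementation could look like this (k = 60 is a commonly used constant; the document ids are made up):

# Fuse multiple rankings into one by summing 1 / (k + rank) per document.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["talk-7", "talk-2", "talk-9"]   # result of vector search
keyword_ranking = ["talk-2", "talk-4", "talk-7"]  # result of BM-25 keyword search

print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
# ['talk-2', 'talk-7', 'talk-4', 'talk-9']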

Reciprocal Rank Fusion is massively useful to combine two rankings. This way, we can now use both Vector search and Keyword search to retrieve documents. We can now extend again our Document retrieval step to create a Hybrid Search setup:

The Levels of RAG (on Xebia.com ⧉)

The RAG retrieval step with hybrid search.

Most notably, when a user performs a search, a query is made both to our vector database and to our keyword search. Once the results are merged using Reciprocal Rank Fusion, the top results are taken and passed back to our LLM.

Let’s take again the question “What’s the talk starting with Maximizing about?” like we did in Level 1 and see how our RAG handles it with Hybrid Search:

The Levels of RAG (on Xebia.com ⧉)

Hybrid Search allows us to combine Vector search and Keyword search, allowing us to retrieve the desired document containing the keyword.

That is much better! 👏🏻 The document we were looking for was now ranked high enough that it shows up in our retrieval step. It did so because the keyword search boosted the terms used in the search query. Without the document available in our prompt context, the LLM could not possibly give us the correct answer. Let’s look at both the retrieval step and the generation step to see what our LLM now answers on this question:

The Levels of RAG (on Xebia.com ⧉)

With the correct context, we also got the correct answer!

That is the correct answer ✓. With the right context available to the LLM, we also get the right answer. Retrieving the right context is the most important feature of our RAG: without it, the LLM cannot possibly give us the right answer.

We have now learned how to build a Basic RAG system and how to improve it with Hybrid Search. The data we loaded in comes from the PyData Eindhoven 2024 schedule, which was conveniently available in JSON format. But what about other data formats? In the real world, we can be asked to ingest other formats into our RAG like HTML, Word, and PDF.

The Levels of RAG (on Xebia.com ⧉)

Having structured data available like JSON is great, but that is not always the case in real life…

Formats like HTML, Word and especially PDF can be very unpredictable in terms of structure, making it hard for us to parse them consistently. PDF documents can contain images, graphs, tables, text, basically anything. So let us take on this challenge and level up to Level 3: Advanced data formats.

Level 3: Advanced data formats

The Levels of RAG (on Xebia.com ⧉)

This level revolves around ingesting challenging data formats like HTML, Word or PDF into our RAG. This requires extra considerations to do properly. For now, we will focus on PDFs.

So, let us take some PDFs as an example. I found some documents related to SpaceX’s Falcon 9 rocket:

Falcon 9 related documents we are going to build a RAG with.

User’s guide (pdf), Cost estimates (pdf), Capabilities & Services (pdf)

We will first want to parse those documents into raw text, so that we can then chunk and embed that text. To do so, we will use a PDF parser for Python like pypdf. Conveniently, there’s a LangChain loader available for pypdf:

The Levels of RAG (on Xebia.com ⧉)

PDF to text parsing using PyPDFLoader.

Using pypdf we can convert these PDFs into raw text. Note there are many more parsers available; check out the LangChain API reference. Both offline/local and cloud solutions are offered, like GCP Document AI, Azure Document Intelligence or Amazon Textract.
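As a minimal sketch, parsing a PDF with PyPDFLoader can look like this. The file name is a placeholder, and depending on your LangChain version the import may live in langchain.document_loaders instead of langchain_community:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("falcon-9-users-guide.pdf")  # hypothetical local file
pages = loader.load()  # one Document per PDF page

print(len(pages), "pages parsed")
print(pages[0].page_content[:200])  # raw text of the first page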

With such a document parsing step set up, we will need to extend our Document retrieval step to cover this new component:

The Levels of RAG (on Xebia.com ⧉)

RAG retrieval with a document parsing step, like converting PDF to raw text.

Time to test out our RAG! We embedded and ingested the raw text into our vector database and can now make queries against it. Let’s ask about the cost of a Falcon 9 rocket launch:

The Levels of RAG (on Xebia.com ⧉)

Our RAG can now answer questions about the PDF’s.

Awesome, that is the correct answer. Let’s try another question: “What is the payload I can carry with the Falcon 9 rocket to Mars?“:

The Levels of RAG (on Xebia.com ⧉)

Ouch! Our RAG got the answer wrong. It suggests that we can bring 8,300 kg of payload to Mars with a Falcon 9 rocket whilst in reality, this is 4,020 kg. That’s no small error.

Ow! Our RAG got that answer all wrong. It suggests we can bring roughly double the payload to Mars than is actually possible. That is pretty inconvenient, in case you were preparing for a trip to Mars 😉.

We need to debug what went wrong. Let’s look at the context that was passed to the LLM:

The Levels of RAG (on Xebia.com ⧉)

The context provided to the LLM contains jumbled and unformatted text originating from a table. It should therefore come as no surprise that our LLM has difficulties answering questions about this table.

That explains a lot. The context we’re passing to the LLM is hard to read and has a table encoded in a jumbled way. Like us, the LLM has a hard time making sense of this. Therefore, we need to better encode this information in the prompt so our LLM can understand it.

If we want to support tables, we can introduce an extra processing step. One option is to use Computer Vision models to detect tables inside our documents, like table-transformer. If a table is detected, we can give it special treatment. What we can do, for example, is encode tables in our prompt as Markdown:

The Levels of RAG (on Xebia.com ⧉)

First parsing our table into a native format and then converting it to Markdown allows our LLM to much better understand the table.
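As a minimal sketch, once a table has been detected and parsed into rows, converting it to Markdown could look like this. The rows below are only illustrative payload figures, and pandas’ to_markdown requires the tabulate package:

import pandas as pd

# Hypothetical output of a table-detection step such as table-transformer.
rows = [
    {"Destination": "LEO", "Payload (kg)": 22800},
    {"Destination": "GTO", "Payload (kg)": 8300},
    {"Destination": "Mars", "Payload (kg)": 4020},
]

table = pd.DataFrame(rows)
markdown_table = table.to_markdown(index=False)
print(markdown_table)  # paste this Markdown into the prompt context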

Having detected the table and parsed it into a native format in our Python code allows us to then encode it in Markdown. Let’s pass that to our LLM instead and see what it answers this time:

The Levels of RAG (on Xebia.com ⧉)

Our RAG got the answer correct this time.

Hurray! We got it correct this time. The LLM we used could easily interpret the Markdown table and pinpoint the correct value to use in answering the question. Note that we still need to be able to retrieve the table in our retrieval step: this setup assumes we built a retrieval step that can retrieve the table given the user question.

However, I have to admit something. The model we have been using for this task was GPT-3.5 turbo, which is a text-only model. Newer models have been released that can handle more than just text: Multimodal models. After all, we are dealing with PDFs, which can also be seen as a series of images. Can we leverage such multimodal models to better answer our questions? Let’s find out in Level 4: Multimodal.

Level 4: Multimodal

The Levels of RAG (on Xebia.com ⧉)

In this final level, we will look into leveraging the possibilities of Multimodal models. One of them is GPT-4o, which was announced in May 2024:

The Levels of RAG (on Xebia.com ⧉)

GPT-4o is a multimodal model that can reason across audio, vision and text. Launched by OpenAI in May 2024.

This is a very powerful model that can understand audio, vision and text. This means we can feed it images as part of an input prompt. Given that our retrieval step can retrieve the right PDF pages to use, we can insert those pages as images in the prompt and ask the LLM our original question. This has the advantage that we can handle content that was previously very challenging to encode as just text. Moreover, content we interpret and encode as text goes through more conversion steps, exposing us to the risk of information getting lost in translation.

For example’s sake, we will take the same table we had before but answer the question using a Multimodal model. We can take the retrieved PDF pages encoded as images and insert them right into the prompt:

The Levels of RAG (on Xebia.com ⧉)

With a Multimodal model, we can insert an image in the prompt and let the LLM answer questions about it.
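As a minimal sketch, inserting a retrieved PDF page (rendered as a PNG) into a multimodal prompt with the OpenAI Python SDK could look like this. The model name, file name and question are placeholders, not the exact setup used here:

import base64
from openai import OpenAI

client = OpenAI()

with open("falcon-9-page-with-table.png", "rb") as f:  # hypothetical page image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What payload can the Falcon 9 carry to Mars?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)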

Impressive. The LLM got the answer correct. We should be aware, though, that inserting images in the prompt comes with very different token usage than the Markdown table we inserted as text before:

The Levels of RAG (on Xebia.com ⧉)

When using Multimodal models, do be aware of the extra cost that comes with it.

That is an immense increase in cost. Multimodal models can be incredibly powerful to interpret content that is otherwise very difficult to encode in text, as long as it is worth the cost ✓.

Concluding

We have explored RAG in 4 levels of complexity. We went from building our first basic RAG to a RAG that leverages Multimodal models to answer questions based on complex documents. Each level introduces new complexity, justified in its own way. Summarising, the Levels of RAG are:

The Levels of RAG, summarised:

  • Level 1 (Basic RAG): RAG’s main steps are 1) retrieval and 2) generation. Important components to do so are Embedding, Vector Search using a Vector database, Chunking and a Large Language Model (LLM).
  • Level 2 (Hybrid Search): Combining vector search and keyword search can improve retrieval performance. Sparse text search can be done using TF-IDF and BM-25. Reciprocal Rank Fusion can be used to merge two search engine rankings.
  • Level 3 (Advanced data formats): Support formats like HTML, Word and PDF. PDFs can contain images, graphs but also tables. Tables need a separate treatment, for example with Computer Vision, to then expose the table to the LLM as Markdown.
  • Level 4 (Multimodal): Multimodal models can reason across audio, images and even video. Such models can help process complex data formats, for example by exposing PDFs as images to the model. Given that the extra cost is worth the benefit, such models can be incredibly powerful.

RAG is a very powerful technique which can open up many new possibilities at companies. The Levels of RAG help you reason about the complexity of your RAG and allow you to understand what is difficult to do with RAG and what is easier. So: what is your level? 🫵

We wish you all the best with building your own RAG 👏🏻.

]]>
<![CDATA[RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)]]>https://jeroenoverschie.nl/rag-on-gcp-production-ready-genai-on-google-cloud-platform/661bb31631b0250001990697Tue, 26 Mar 2024 11:47:00 GMTIntroductionRAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)

Google Cloud Platform has an increasing set of managed services that can help you build production-ready Retrieval-Augmented Generation applications. Services like Vertex AI Search & Conversation and Vertex AI Vector Search give us scalability and ease of use. How can you best leverage them to build RAG applications? Let’s explore together. Read along!

Retrieval Augmented Generation

Even though Retrieval-Augmented Generation (RAG) was coined back in 2020, the technique has been supercharged by the rise of Large Language Models (LLMs). With RAG, LLMs are combined with search techniques like vector search to enable real-time and efficient lookup of information that is outside the model’s knowledge. This opens up many exciting new possibilities. Whereas previously interactions with LLMs were limited to the model’s knowledge, with RAG it is now possible to load in company-internal data like knowledge bases. Additionally, by instructing the LLM to always ‘ground’ its answer in factual data, hallucinations can be reduced.

Why RAG?

Let’s first take a step back. How exactly can RAG benefit us? When we interact with an LLM, all its factual knowledge is stored inside the model weights. The model weights are set during its training phase, which may have been a while ago. In fact, it can be more than a year.

  • Gemini 1.0 Pro: Early 2023 [1]
  • GPT-3.5 turbo: September 2021 [2]
  • GPT-4: September 2021 [3]

Knowledge cut-offs as of March 2024.

Additionally, these publicly offered models are trained on mostly public data. If you want to use company-internal data, one option is to fine-tune or retrain the model, which can be expensive and time-consuming.

RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)
The limitations of LLM interactions without using RAG.

This boils down to three main limitations: the model’s knowledge is outdated, the model has no access to internal data, and the model can hallucinate answers.

With RAG we can circumvent these limitations. Given a user’s question, information relevant to that question can first be retrieved and then presented to the LLM.

RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)
How RAG can help an LLM provide more factual answers based on internal data.

The LLM can then augment its answer with the retrieved information to generate a factual, up-to-date and yet human-readable answer. The LLM is instructed to ground its answer in the retrieved information at all times, which can help reduce hallucinations.

These benefits are great. So how do we actually build a RAG system?

Building a RAG system

In a RAG system, there are two main steps: 1) Document retrieval and 2) Answer generation. Whereas the document retrieval is responsible for finding the most relevant information given the user’s question, the answer generation is responsible for generating a human-readable answer based on information found in the retrieval step. Let’s take a look at both in more detail.

RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)
The two main steps in a RAG system: Document retrieval and Answer generation.

Document retrieval

First, Document retrieval. Documents are converted to plain text and chunked. The chunks are then embedded and stored in a vector database. User questions are also embedded, enabling a vector similarity search to obtain the best matching documents. Optionally, a step can be added to extract document metadata like title, author, summary, keywords, etc, which can subsequently be used to perform a keyword search. This can all be illustrated like so:

RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)
Document retrieval step in a RAG system. Documents are converted to text and converted to embeddings. A user’s question is converted to an embedding such that a vector similarity search can be performed.

Neat. But what about GCP? We can map the former to GCP as follows:

RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)
Document retrieval using GCP services including Document AI, textembedding-gecko and Vertex AI Vector Search.

Document AI is used to process documents and extract text, Gemini and textembedding-gecko are used to generate metadata and embeddings respectively, and Vertex AI Vector Search is used to store the embeddings and perform similarity search. By using these services, we can build a scalable retrieval step.
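As a minimal sketch, embedding a user’s question with textembedding-gecko via the Vertex AI SDK could look as follows. The project, location and question are placeholders; the resulting vector is what we would store and query in Vertex AI Vector Search:

import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="my-gcp-project", location="europe-west1")  # hypothetical project/region

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
embeddings = model.get_embeddings(["What is our travel reimbursement policy?"])
vector = embeddings[0].values  # a list of floats, ready for vector similarity search
print(len(vector))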

Answer generation

Then, Answer generation. We will need an LLM for this and instruct it to use the provided documents. We can illustrate this like so:

RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)
Answer generation step using Gemini, with an example prompt. Both the user’s question and snippets of documents relevant to that question are inserted in the prompt.

Here, the documents can be formatted using an arbitrary function that generates valid markdown.
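A minimal sketch of this answer generation step with the Vertex AI SDK could look like this. The snippets, question and grounding instruction are placeholders rather than the exact prompt shown above:

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="europe-west1")  # hypothetical project/region

retrieved_snippets = [
    "Snippet 1: ...",  # hypothetical output of the document retrieval step
    "Snippet 2: ...",
]

prompt = (
    "Answer the question using ONLY the documents below. "
    "If the answer is not in the documents, say you don't know.\n\n"
    "Documents:\n" + "\n".join(retrieved_snippets) + "\n\n"
    "Question: What is our travel reimbursement policy?"
)

model = GenerativeModel("gemini-1.0-pro")
response = model.generate_content(prompt)
print(response.text)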

We have already come across multiple GCP services that can help us build a RAG system. So now, what other offerings does GCP have to help us build a RAG system and what flavours are there to combine services?

The RAG flavours on GCP

So far, we have seen GCP services that can help us build a RAG system. These include Document AI, Vertex AI Vector Search, Gemini Pro, Cloud Storage and Cloud Run. But GCP also has Vertex AI Search & Conversation.

Vertex AI Search & Conversation is a service tailored to GenAI use cases, built to do some of the heavy lifting for us. It can ingest documents, create embeddings and manage the vector database. You just have to focus on ingesting data in the correct format. Then, you can use Search & Conversation in multiple ways. You can either get only search results given a search query, or you can let Search & Conversation generate a full answer for you with source citations.

Even though Vertex AI Search & Conversation is very powerful, there can be scenarios where you want more control. Let’s take a look at these levels of going managed versus remaining in control.

The easiest way to get started with RAG on GCP is to use Search & Conversation. The service can ingest documents from multiple sources like BigQuery and Google Cloud Storage. Once those are ingested, it can generate answers backed by citations for you. This is illustrated like so:

RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)
Fully managed RAG using Search & Conversation for document retrieval and answer generation.

If you want more control, you can use Gemini for answer generation instead of letting Search & Conversation do it for you. This way, you have more control and can do any prompt engineering you like.

RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)
Partly managed RAG using Search & Conversation for document retrieval and Gemini for answer generation.

Lastly, you can have full control over the RAG system. This means you have to manage both the document retrieval and answer generation yourself. This does mean more manual work in engineering the system. Documents can be processed by Document AI, chunked, embedded and its vectors stored in Vertex AI Vector Search. Then Gemini can be used to generate the final answers.

RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)
Full control RAG. Manual document processing, embedding creation and vector database management.

The advantage here is that you have full control over how you process the documents and convert them to embeddings. You can use all of Document AI’s Processors offering to process the documents in different ways.

Do take the managed versus manual approach tradeoffs into consideration. Ask yourself questions like:

  • How much time and energy do you want to invest building something custom for the flexibility that you need?
  • Is the flexibility of a custom solution really worth the extra maintenance cost?
  • Do you have the engineering capacity to build and maintain a custom solution?
  • Are the initially invested build costs worth the money saved in not using a managed solution?

So then, you can decide what works best for you ✓.

Concluding

RAG is a powerful way to augment LLMs with external data. This can help reduce hallucinations and provide more factual answers. At the core of RAG systems are document processors, vector databases and LLMs.

Google Cloud Platform offers services that can help build production-ready RAG solutions. We have described three levels of control in deploying a RAG application on GCP:

  • Fully managed: using Search & Conversation.
  • Partly managed: managed search using Search & Conversation but manual prompt-engineering using Gemini.
  • Full control: manual document processing using Document AI, embedding creation and vector database management using Vertex AI Vector Search.

That said, I wish you good luck implementing your own RAG system. Use RAG for great good! ♡

]]>
<![CDATA[Dataset enrichment using LLM’s ✨ (on Xebia.com ⧉)]]>https://jeroenoverschie.nl/dataset-enrichment-using-llms/65993426cdf9710001822269Thu, 28 Dec 2023 11:06:00 GMT<![CDATA[Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)]]>https://jeroenoverschie.nl/scaling-up-bringing-your-azure-devops-ci-cd-setup-to-the-next-level/65993347cdf971000182225eFri, 08 Dec 2023 11:02:00 GMT

Introduction

Azure DevOps pipelines are a great way to automate your CI/CD process. Most often, they are configured on a per project basis. This works fine when you have few projects. But what if you have many projects? In this blog post, we will show you how you can scale up your Azure DevOps CI/CD setup for reusability and easy maintenance.

Your typical DevOps pipeline

A typical DevOps pipeline is placed inside the project repository. Let’s consider a pipeline for a Python project. It includes the following steps:

  • quality checks such as code formatting and linting
  • building a package such as a Python wheel
  • releasing a package to a Python package registry (such as Azure Artifacts or PyPI)

Using an Azure DevOps pipeline, we can achieve this like so:

trigger:
- main

steps:
# Python setup & dependencies
- task: UsePythonVersion@0
  inputs:
    versionSpec: 3.10

- script: |
    pip install .[dev,build,release]
  displayName: 'Install dependencies'

# Code Quality
- script: |
    black --check .
  displayName: 'Formatting'

- script: |
    flake8 .
  displayName: 'Linting'

- script: |
    pytest .
  displayName: 'Testing'

# Build
- script: |
    echo $(Build.BuildNumber) > version.txt
  displayName: 'Set version number'

- script: |
    pip wheel \
      --no-deps \
      --wheel-dir dist/ \
      .
  displayName: 'Build wheel'

# Publish
- task: TwineAuthenticate@1
  inputs:
    artifactFeed: 'better-devops-pipelines-blogpost/devops-pipelines-blogpost'
  displayName: 'Authenticate pip with twine'

- script: |
    twine upload \
      --config-file $(PYPIRC_PATH) \
      --repository devops-pipelines-blogpost \
      dist/*.whl
  displayName: 'Publish wheel with twine'

Well, that is great, right? We have achieved all the goals we desired:

  • Code quality checks using black, flake8 and pytest.
  • Build and package the project as a Python wheel.
  • Publish the package to a registry of choice, in this case Azure Artifacts.

Growing pains

A DevOps pipeline like the above works fine for a single project. But, … what if we want to scale up? Say our company grows, we create more repositories and more projects need to be packaged and released. Will we simply copy this pipeline and paste it into a new repository? Given that we are growing in size, can we be more efficient than just running this pipeline from start to finish?

The answer is no – we do not have to copy/paste all these pipelines into a new repo, and the answer is yes – we can be more efficient in running these pipelines. Let’s see how.

Scaling up properly

Let’s see how we can create scalable DevOps pipelines. First, we are going to introduce DevOps pipeline templates. These are modular pieces of pipeline that we can reuse across various pipelines and also across various projects residing in different repositories.

Let’s see how we can use pipeline templates to our advantage.

1. DevOps template setup

Let’s rewrite pieces of our pipeline into DevOps pipeline templates. Important to know here is that you can write templates for either stages, jobs or steps. The hierarchy is as follows:

stages:
- stage: Stage1
  jobs:
  - job: Job1
    steps:
    - step: Step1

This can be illustrated in an image like so:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

We can then create a template in one file, for example for steps:

templates/code-quality.yml

steps:
- script: |
    echo "Hello world!"

.. and reuse it in our former pipeline:

stages:
- stage: Stage1
  jobs:
  - job: Job1
    steps:
    - template: templates/code-quality.yml

… or for those who prefer a more visual way of displaying it:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

That’s how easy it is to use DevOps pipeline templates! Let’s now apply it to our own usecase.

Code quality checks template

First, let’s put the code quality checks pipeline into a template. We are also making the pipeline more extensive so it outputs test results and coverage reports. Remember, we are only defining this template once and then reusing it in other places.

templates/code-quality.yml

steps:
# Code Quality
- script: |
    black --check .
  displayName: 'Formatting'

- script: |
    flake8 .
  displayName: 'Linting'

- script: |
    pytest \
      --junitxml=junit/test-results.xml \
      --cov=. \
      --cov-report=xml:coverage.xml \
      .
  displayName: 'Testing'

# Publish test results + coverage
- task: PublishTestResults@2
  condition: succeededOrFailed()
  inputs:
    testResultsFiles: '**/test-*.xml'
    testRunTitle: 'Publish test results'
    failTaskOnFailedTests: true
  displayName: 'Publish test results'

- task: PublishCodeCoverageResults@1
  inputs:
    codeCoverageTool: 'Cobertura'
    summaryFileLocation: '**/coverage.xml'
  displayName: 'Publish test coverage'

… which we are using like so:

steps:
- template: templates/code-quality.yml

Easy! Also note we included two additional tasks: one to publish the test results and another to publish code coverage reports. That information is super useful to display inside DevOps. Lucky for us, DevOps has support for that:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

… clicking on the test results brings us to the Tests view, where we can see exactly which test failed (if any failed):

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

Lastly, there’s also a view explaining which lines of code you covered with tests and which you did not:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

Those come in very useful when you are working on testing your code!

Now, we have defined this all in DevOps templates. That gives us a more comfortable position to define more elaborate pipeline steps because we will import those templates instead of copy/pasting them.

That said, we can summarise the benefits of using DevOps templates like so:

  • Define once, reuse everywhere
    We can reuse this code quality checks pipeline in both the same project multiple times but also from other repositories. If you are importing from another repo, see ‘Use other repositories‘ for setup.
  • Make it failproof
    You can invest in making just this one template very good, instead of having multiple bad versions hanging around in your organisation.
  • Reduce complexity
    Abstracting away commonly used code can be efficient for the readability of your pipeline. This allows newcomers to easily understand the different parts of your CI/CD setup using DevOps pipelines.

2. Passing data between templates

Let’s go a step further and also abstract away the build and release steps into templates. We are going to use the following template for building a Python wheel:

steps:
# Build wheel
- script: |
    echo $(Build.BuildNumber) > version.txt
  displayName: 'Set version number'

- script: |
    pip wheel \
      --no-deps \
      --wheel-dir dist/ \
      .
  displayName: 'Build wheel'

# Upload wheel as artifact
- task: CopyFiles@2
  inputs:
    contents: dist/**
    targetFolder: $(Build.ArtifactStagingDirectory)
  displayName: 'Copy wheel to artifacts directory'

- publish: '$(Build.ArtifactStagingDirectory)/dist'
  artifact: wheelFiles
  displayName: 'Upload wheel as artifact'

This definition is slightly different from the one we defined before, in the initial pipeline. This pipeline uses artifacts. These allow us to pass data between jobs or stages, which is useful when we want to split up our pipeline into smaller pieces. Splitting up the process into smaller segments gives us more visibility and control over the process. Another benefit of splitting the Python wheel build and release process is that we give ourselves the ability to release to multiple registries at once.

When this pipeline is run, we can see an artifact (a wheel file) has been added:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

… with the actual wheel file in there:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

This is also useful so we can inspect what the build pipeline has produced. We can now download this wheel file from the artifacts again. We will do this in the publish pipeline.

template/publish-wheel.yml

parameters:
- name: artifactFeed
  type: string
- name: repositoryName
  type: string

steps:
# Retrieve wheel
- download: current
  artifact: wheelFiles
  displayName: 'Download artifacts'

# Publish wheel
- task: TwineAuthenticate@1
  inputs:
    artifactFeed: ${{ parameters.artifactFeed }}
  displayName: 'Authenticate pip with twine'

- script: |
    twine upload \
      --config-file $(PYPIRC_PATH) \
      --repository ${{ parameters.repositoryName }} \
      $(Pipeline.Workspace)/wheelFiles/*.whl
  displayName: 'Publish wheel with twine'

… both the build and release pipelines can be used like so:

- stage: Build
  jobs:
  - job: BuildWheel
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: 3.10

    - script: |
        pip install .[build]
      displayName: 'Install dependencies'

    - template: templates/build-wheel.yml


- stage: Publish
  jobs:
  - job: PublishWheel
    steps:
    - script: |
        pip install twine==4.0.2
      displayName: 'Install twine'

    - template: templates/publish-wheel.yml
      parameters:
        artifactFeed: 'better-devops-pipelines-blogpost/devops-pipelines-blogpost'
        repositoryName: 'devops-pipelines-blogpost'

And here we have another new feature coming in. Stages. These allow us to execute pipelines that depend on each other. We have now split up our pipeline into 2 stages:

  1. Build stage
  2. Publish stage

Using stages makes it easy to see what is going on. It provides transparency and allows you to easily track the progress of the pipeline. You can also launch stages separately, skipping previous stages, as long as the necessary dependencies are in place. For example, dependencies can include artifacts that were generated in a previous stage.

Improving the release process

So what is another advantage of this setup? Say that you are releasing your package to two pip registries. Doing that is easy using this setup by creating two jobs in the publish stage:

- stage: Publish
  jobs:
  - job: PublishToRegistryOne
    steps:
    - script: |
        pip install twine==4.0.2
      displayName: 'Install twine'

    - template: templates/publish-wheel.yml
      parameters:
        artifactFeed: 'better-devops-pipelines-blogpost/registry-1'
        repositoryName: 'devops-pipelines-blogpost'

  - job: PublishToRegistryTwo
    steps:
    - script: |
        pip install twine==4.0.2
      displayName: 'Install twine'

    - template: templates/publish-wheel.yml
      parameters:
        artifactFeed: 'better-devops-pipelines-blogpost/registry-2'
        repositoryName: 'devops-pipelines-blogpost'

As you can see, we can use the defined templates to scale our pipelines. What is essential here is that, thanks to the artifacts, we can build our wheel once and consume that same wheel multiple times.

Additionally, the publishing jobs launch in parallel by default (unless dependencies are explicitly defined). This speeds up your release process.

3. Automate using a strategy matrix

Let’s go back to the code quality stage for a minute. In the code quality stage, we first install a certain Python version and then run all quality checks. However, we might need guarantees that our code works for multiple Python versions. This is often the case when releasing a package, for example. How can we easily automate running our Code Quality pipeline using our pipeline templates? One option is to manually define a couple of jobs and install the correct Python version in each job. Another option is to use a strategy matrix. This allows us to define a matrix of variables that we can use in our pipeline.

We can improve our CodeQualityChecks job like so:

jobs:
- job: CodeQualityChecks
  strategy:
    matrix:
      Python38:
        python.version: '3.8'
      Python39:
        python.version: '3.9'
      Python310:
        python.version: '3.10'
      Python311:
        python.version: '3.11'

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: $(python.version)

  - script: |
      pip install .[dev]
    displayName: 'Install dependencies'

  - template: templates/code-quality.yml

Awesome! The pipeline now runs the entire code quality pipeline for each Python version. Looking at how our pipeline runs now, we can see multiple jobs, one for each Python version:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

… as you can see, 4 jobs are launched. If no job dependencies are explicitly set, jobs within one stage run in parallel! That greatly speeds up the pipeline and lets you iterate faster! That’s definitely a win.

Final result

Let’s wrap it up! Our entire pipeline, using templates:

trigger:
- main

stages:
- stage: CodeQuality
  jobs:
  - job: CodeQualityChecks
    strategy:
      matrix:
        Python38:
          python.version: '3.8'
        Python39:
          python.version: '3.9'
        Python310:
          python.version: '3.10'
        Python311:
          python.version: '3.11'

    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: $(python.version)

    - script: |
        pip install .[dev]
      displayName: 'Install dependencies'

    - template: templates/code-quality.yml


- stage: Build
  jobs:
  - job: BuildWheel
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: 3.10

    - script: |
        pip install .[build]
      displayName: 'Install dependencies'

    - template: templates/build-wheel.yml


- stage: Publish
  jobs:
  - job: PublishWheel
    steps:
    - script: |
        pip install twine==4.0.2
      displayName: 'Install twine'

    - template: templates/publish-wheel.yml
      parameters:
        artifactFeed: 'better-devops-pipelines-blogpost/devops-pipelines-blogpost'
        repositoryName: 'devops-pipelines-blogpost'

… which uses these templates:

… for the entire source code see the better-devops-pipelines-blogpost repo. The repository contains pipelines that apply above explained principles. The pipelines provide testing, building and releasing for a Python project ✓.

Conclusion

We demonstrated how to scale up your Azure DevOps CI/CD setup making it reusable, maintainable and modular. This helps you maintain a good CI/CD setup as your company grows.

In short, we achieved the following:

  • Create modular DevOps pipelines using templates. This makes it easier to reuse pipelines across projects and repositories
  • Pass data between DevOps pipeline jobs using artifacts. This allows us to split up our pipeline into smaller pieces, that can consume artifacts from previous jobs.
  • Split up your pipeline in stages to create more visibility and control over your CI/CD

An example repository containing good-practice pipelines is available at:

https://dev.azure.com/godatadriven/_git/better-devops-pipelines-blogpost

Cheers 🙏

]]>
<![CDATA[A sunrise in Hamburg]]>
0:00
/0:27

Hamburg is waking up and getting ready for the day.

Sometimes, it's nice just to observe. To watch, as the world wakes up. Have a great day ☀

]]>
https://jeroenoverschie.nl/a-sunrise-in-hamburg/68dfc498f044ed00011d881aWed, 14 Dec 2022 06:00:00 GMT
0:00
/0:27

Hamburg is waking up and getting ready for the day.

Sometimes, it's nice just to observe. To watch, as the world wakes up. Have a great day ☀

]]>
<![CDATA[How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)]]>https://jeroenoverschie.nl/how-to-create-a-devcontainer-for-your-python-project/6394c3a02bf4040001d22235Mon, 21 Nov 2022 17:36:00 GMT

Imagine the following scenario 💭.

Your company uses Apache Spark to process data, and your team has pyspark set up in a Python project. The codebase is built on a specific Python version, using a certain Java installation, and an accompanying pyspark version that works with the former. To onboard a new member, you will need to pass a list of instructions the developer needs to follow carefully to get their setup working. But not everyone might run this on the same laptop environment: different hardware, and different operating systems. This is getting challenging.

But the setup is a one-off, right? Just go through the setup once and you'll be good. Not entirely. Your code environment will change over time: your team will probably install, update or remove packages during the project's development. This means that if a developer creates a new feature and changes their own environment to do so, they also need to make sure that the other team members change theirs and that the production environment is updated accordingly. This makes it easy to end up with misaligned environments: between developers, and between development & production.

We can do better than this! Instead of giving other developers a setup document, let’s make sure we also create formal instructions so we can automatically set up the development environment. Devcontainers let us do exactly this.

Devcontainers let you connect your IDE to a running Docker container. In this way, we get the benefits of reproducibility and isolation, whilst getting a native development experience.

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)
With Devcontainers you can interact with your IDE like you're used to whilst under the hood running everything inside a Docker container.

Devcontainers can help us:

  • Get a reproducible development environment
  • ⚡️ Instantly onboard new team members onto your project
  • ‍ ‍ ‍ Better align the environments between team members
  • ⏱ Keeping your dev environment up-to-date & reproducible saves your team time going into production later

Let’s explore how we can set up a Devcontainer for your Python project!

Creating your first Devcontainer

Note that this tutorial is focused on VSCode. Other IDEs like PyCharm support running in Docker containers, but the support is less comprehensive than in VSCode.

Recap

To recap, we are trying to create a dev environment that installs: 1) Java, 2) Python and 3) pyspark. And we want to do so automatically, that is, inside a Docker image.

Project structure

Let’s say we have a really simple project that looks like this:

$ tree .
.
├── README.md
├── requirements.txt
├── requirements-dev.txt
├── sales_analysis.py
└── test_sales_analysis.py

That is, we have a Python module with an accompanying test, a requirements.txt file describing our production dependencies (pyspark), and a requirements-dev.txt describing dependencies that should be installed in development only (pytest, black, mypy). Now let’s see how we can extend this setup to include a Devcontainer.

The .devcontainer folder

Your Devcontainer spec will live inside the .devcontainer folder. There will be two main files:

  • devcontainer.json
  • Dockerfile

Create a new file called devcontainer.json:

{
    "build": {
        "dockerfile": "Dockerfile",
        "context": ".."
    }
}

This means: as a base for our Devcontainer, use the Dockerfile located in the current directory, and build it with a build context of .. (the parent directory, i.e. the repository root).

So what does this Dockerfile look like?

FROM python:3.10

# Install Java
RUN apt update && \
    apt install -y sudo && \
    sudo apt install default-jdk -y

## Pip dependencies
# Upgrade pip
RUN pip install --upgrade pip
# Install production dependencies
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt && \
    rm /tmp/requirements.txt
# Install development dependencies
COPY requirements-dev.txt /tmp/requirements-dev.txt
RUN pip install -r /tmp/requirements-dev.txt && \
    rm /tmp/requirements-dev.txt

We are building our image on top of python:3.10, which is a Debian-based image. This is one of the Linux distributions that a Devcontainer can be built on. The main requirement is that Node.js should be able to run: VSCode automatically installs VSCode Server on the machine. For an extensive list of supported distributions, see “Remote Development with Linux”.

On top of python:3.10, we install Java and the required pip packages.

Opening the Devcontainer

The .devcontainer folder is in place, so it’s now time to open our Devcontainer.

First, make sure you have the Dev Containers extension installed in VSCode (previously called “Remote – Containers”). That done, if you open your repo again, the extension should already detect your Devcontainer:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Alternatively, you can open up the command palette (CMD + Shift + P) and select “Dev Containers: Reopen in Container”:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Your VSCode is now connected to the Docker container:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

What is happening under the hood

Besides starting the Docker image and attaching the terminal to it, VSCode is doing a couple more things:

  1. VSCode Server is being installed on your Devcontainer. VSCode Server is installed as a service in the container itself so your VSCode installation can communicate with the container. For example, install and run extensions.
  2. Config is copied over. Config like ~/.gitconfig and ~/.ssh/known_hosts are copied over to their respective locations in the container.
    This then allows you to use your Git repo like you do normally, without re-authenticating.
  3. Filesystem mounts. VSCode automatically takes care of mounting: 1) The folder you are running the Devcontainer from and 2) your VSCode workspace folder.

Opening your repo directly in a Devcontainer

Since all instructions on how to configure your dev environment are now defined in a Dockerfile, users can open up your Devcontainer with just one button:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Ain’t that cool? You can add a button to your repo like so:

[![Open in Remote - Containers](https://xebia.com/wp-content/uploads/2023/11/v1.svg)](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/godatadriven/python-devcontainer-template)

Just modify the GitHub URL ✓.

That said, having built a Devcontainer can make our README massively more readable. Which kind of README would you rather have?

Manual installation vs. using a Devcontainer:
How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Extending the Devcontainer

We have built a working Devcontainer, which is great! But a couple of things are still missing. We still want to:

  • Install a non-root user for extra safety and good-practice
  • Pass in custom VSCode settings and install extensions by default
  • Be able to access Spark UI (port 4040)
  • Run Continuous Integration (CI) in the Devcontainer

Let’s see how.

Installing a non-root user

If you pip install a new package, you will see the following message:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Indeed, it is not recommended to develop as a root user. It is considered a good practice to create a different user with fewer rights to run in production. So let’s go ahead and create a user for this scenario.

# Add non-root user
ARG USERNAME=nonroot
RUN groupadd --gid 1000 $USERNAME && \
    useradd --uid 1000 --gid 1000 -m $USERNAME
## Make sure to reflect new user in PATH
ENV PATH="/home/${USERNAME}/.local/bin:${PATH}"
USER $USERNAME

Add the following property to devcontainer.json:

    "remoteUser": "nonroot"

That’s great! When we now start the container we should connect as the user nonroot.

Passing custom VSCode settings

Our Devcontainer is still a bit bland, without extensions and settings. Besides any custom extensions a user might want to install, we can install some for them by default already. We can define such settings in customizations.vscode:

     "customizations": {
        "vscode": {
            "extensions": [
                "ms-python.python"
            ],
            "settings": {
                "python.testing.pytestArgs": [
                    "."
                ],
                "python.testing.unittestEnabled": false,
                "python.testing.pytestEnabled": true,
                "python.formatting.provider": "black",
                "python.linting.mypyEnabled": true,
                "python.linting.enabled": true
            }
        }
    }

The defined extensions are always installed in the Devcontainer. However, the defined settings provide just a default for the user to use, and can still be overridden by other setting scopes like User Settings, Remote Settings, or Workspace Settings.

Accessing Spark UI

Since we are using pyspark, we want to be able to access Spark UI. When we start a Spark session, VSCode will ask whether you want to forward the specific port. Since we already know this is Spark UI, we can do so automatically:

    "portsAttributes": {
        "4040": {
            "label": "SparkUI",
            "onAutoForward": "notify"
        }
    },

    "forwardPorts": [
        4040
    ]

When we now run our code, we get a notification we can open Spark UI in the browser:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Resulting in the Spark UI as we know it:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Running our CI in the Devcontainer

Wouldn’t it be convenient if we could re-use our Devcontainer to run our Continuous Integration (CI) pipeline as well? Indeed, we can do this with Devcontainers. Similarly to how the Devcontainer image is built locally using docker build, the same can be done within a CI/CD pipeline. There are two basic options:

  1. Build the Docker image within the CI/CD pipeline
  2. Pre-build the image

To pre-build the image, the build step will need to run either periodically or whenever the Docker definition has changed. Since this adds quite some complexity, let's dive into building the Devcontainer as part of the CI/CD pipeline first (for pre-building the image, see the 'Awesome resources' section). We will do so using GitHub Actions.

Using devcontainers/ci

Luckily, a GitHub Action was already set up for us to do exactly this:

https://github.com/devcontainers/ci

To now build, push and run a command in the Devcontainer is as easy as:

name: Python app

on:
  pull_request:
  push:
    branches:
      - "**"

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout (GitHub)
        uses: actions/checkout@v3

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and run dev container task
        uses: devcontainers/ci@v0.2
        with:
          imageName: ghcr.io/${{ github.repository }}/devcontainer
          runCmd: pytest .

That’s great! Whenever this workflow runs on your main branch, the image will be pushed to the configured registry; in this case GitHub Container Registry (GHCR). See below a trace of the executed GitHub Action:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Awesome!

The final Devcontainer definition

We built the following Devcontainer definitions. First, devcontainer.json:

{
    "build": {
        "dockerfile": "Dockerfile",
        "context": ".."
    },

    "remoteUser": "nonroot",

    "customizations": {
        "vscode": {
            "extensions": [
                "ms-python.python"
            ],
            "settings": {
                "python.testing.pytestArgs": [
                    "."
                ],
                "python.testing.unittestEnabled": false,
                "python.testing.pytestEnabled": true,
                "python.formatting.provider": "black",
                "python.linting.mypyEnabled": true,
                "python.linting.enabled": true
            }
        }
    },

    "portsAttributes": {
        "4040": {
            "label": "SparkUI",
            "onAutoForward": "notify"
        }
    },

    "forwardPorts": [
        4040
    ]
}

And our Dockerfile:

FROM python:3.10

# Install Java
RUN apt update && \
    apt install -y sudo && \
    sudo apt install default-jdk -y

# Add non-root user
ARG USERNAME=nonroot
RUN groupadd --gid 1000 $USERNAME && \
    useradd --uid 1000 --gid 1000 -m $USERNAME
## Make sure to reflect new user in PATH
ENV PATH="/home/${USERNAME}/.local/bin:${PATH}"
USER $USERNAME

## Pip dependencies
# Upgrade pip
RUN pip install --upgrade pip
# Install production dependencies
COPY --chown=nonroot:1000 requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt && \
    rm /tmp/requirements.txt
# Install development dependencies
COPY --chown=nonroot:1000 requirements-dev.txt /tmp/requirements-dev.txt
RUN pip install -r /tmp/requirements-dev.txt && \
    rm /tmp/requirements-dev.txt

The full Devcontainer implementation and all the above steps can be found in the various branches of the godatadriven/python-devcontainer-template repo.

Docker images architecture: Three environments

With the CI now set up, we can reuse the same Docker image for two purposes: local development and running our quality checks. And once we deploy this application to production, we could configure the Devcontainer to use our production image as a base and install extra dependencies on top. If we want to optimize the CI image to be as lightweight as possible, we could also strip off any extra dependencies that we do not require in the CI environment; things like extra CLI tooling, a better shell such as ZSH, and so forth.

This sets us up for having 3 different images for our entire lifecycle. One for Development, one for CI, and finally one for production. This can be visualized like so:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

So we can see that when using a Devcontainer you can re-use your production image and build on top of it: install extra tooling, make sure it can talk to VSCode, and you're done.

Going further

There are lots of other resources to explore; Devcontainers are well-documented and there are many posts about it. If you’re up for more, let’s see what else you can do.

Devcontainer features

Devcontainer features allow you to easily extend your Docker definition with common additions. Some useful features are:

Devcontainer templates

On the official Devcontainer specification website there are loads of templates available. Good chance (part of) your setup is in there. A nice way to get a head-start in building your Devcontainer or to get started quickly.

See: https://containers.dev/templates

Mounting directories

Re-authenticating your CLI tools is annoying. So one trick is to mount your AWS/Azure/GCP credentials from your local computer into your Devcontainer. This way, authentications done in either environment are shared with the other. You can easily do this by adding this to devcontainer.json:

  "mounts": [
    "source=/Users/<your_username>/.aws,target=/home/nonroot/.aws,type=bind,consistency=cached"
  ]

^ the above example mounts your AWS credentials, but the process should be similar for other cloud providers (GCP / Azure).

Awesome resources

Concluding

Devcontainers allow you to connect your IDE to a running Docker container, allowing for a native development experience with the benefits of reproducibility and isolation. This makes it easier to onboard new joiners and align development environments between team members. Devcontainers are very well supported in VSCode and are now being standardized in an open specification. Even though it will probably still take a while to see wide adoption, the specification is a good candidate for the standardization of Devcontainers.

About

This blogpost is written by Jeroen Overschie, working at Xebia.

]]>
<![CDATA[pyspark-bucketmap]]>https://jeroenoverschie.nl/pyspark-bucketmap/634aca22d0d59f00011b7ba1Sat, 22 Oct 2022 10:01:09 GMTpyspark-bucketmap

Have you ever heard of pyspark's Bucketizer? Although you perhaps won't need it for a simple transformation, it can be really useful for certain use cases.

In this blogpost, we will:

  1. Explore the Bucketizer class
  2. Combine it with create_map
  3. Use a module so we don't have to write the logic ourselves 🗝🥳

Let's get started!

The problem

First, let's boot up a local spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark

Say we have this dataset containing some persons:

from pyspark.sql import Row

people = spark.createDataFrame(
    [
        Row(age=12, name="Damian"),
        Row(age=15, name="Jake"),
        Row(age=18, name="Dominic"),
        Row(age=20, name="John"),
        Row(age=27, name="Jerry"),
        Row(age=101, name="Jerry's Grandpa"),
    ]
)
people

Okay, that's great. Now, what we would like to do is map each person's age to an age category.

age range life phase
0 to 12 Child
12 to 18 Teenager
18 to 25 Young adulthood
25 to 70 Adult
70 and beyond Elderly

How best to go about this?

Using Bucketizer + create_map

We can use pyspark's Bucketizer for this. It works like so:

from pyspark.ml.feature import Bucketizer
from pyspark.sql import DataFrame

bucketizer = Bucketizer(
    inputCol="age",
    outputCol="life phase",
    splits=[
        -float("inf"), 0, 12, 18, 25, 70, float("inf")
    ]
)
bucketed: DataFrame = bucketizer.transform(people)
bucketed.show()
age name life phase
12 Damian 2.0
15 Jake 2.0
18 Dominic 3.0
20 John 3.0
27 Jerry 4.0
101 Jerry's Grandpa 5.0

Cool! We just put our ages in buckets, represented by numbers. Let's now map each bucket to a life phase.

from pyspark.sql.functions import lit, create_map
from typing import Dict
from pyspark.sql.column import Column

range_mapper = create_map(
    [lit(0.0), lit("Not yet born")]
    + [lit(1.0), lit("Child")]
    + [lit(2.0), lit("Teenager")]
    + [lit(3.0), lit("Young adulthood")]
    + [lit(4.0), lit("Adult")]
    + [lit(5.0), lit("Elderly")]
)
people_phase_column: Column = bucketed["life phase"]
people_with_phase: DataFrame = bucketed.withColumn(
    "life phase", range_mapper[people_phase_column]
)
people_with_phase.show()

age name life phase
12 Damian Teenager
15 Jake Teenager
18 Dominic Young adulthood
20 John Young adulthood
27 Jerry Adult
101 Jerry's Grandpa Elderly

🎉 Success!

Using a combination of Bucketizer and create_map, we managed to map people's age to their life phases.

pyspark-bucketmap

🎁 As a bonus, I put all of the above in a neat little module, which you can install simply using pip.

%pip install pyspark-bucketmap

Define the splits and mappings like before. Each dictionary key is a mapping to the n-th bucket (for example, bucket 1 refers to the range 0 to 12).

from typing import List

splits: List[float] = [-float("inf"), 0, 12, 18, 25, 70, float("inf")]
mapping: Dict[int, Column] = {
    0: lit("Not yet born"),
    1: lit("Child"),
    2: lit("Teenager"),
    3: lit("Young adulthood"),
    4: lit("Adult"),
    5: lit("Elderly"),
}

Then, simply import pyspark_bucketmap.BucketMap and call transform().

from pyspark_bucketmap import BucketMap
from typing import List, Dict

bucket_mapper = BucketMap(
    splits=splits, mapping=mapping, inputCol="age", outputCol="phase"
)
phases_actual: DataFrame = bucket_mapper.transform(people).select("name", "phase")
phases_actual.show()
name phase
Damian Teenager
Jake Teenager
Dominic Young adulthood
John Young adulthood
Jerry Adult
Jerry's Grandpa Elderly

Cheers 🙏🏻

You can find the module here:
https://github.com/dunnkers/pyspark-bucketmap


Written by Jeroen Overschie, working at GoDataDriven.

]]>
<![CDATA[DropBlox: Coding Challenge at PyCon DE & PyData Berlin 2022 (on Xebia.com ⧉)]]>https://jeroenoverschie.nl/dropblox-coding-challenge-at-pycon-de-pydata-berlin-2022/65993233cdf9710001822253Wed, 27 Jul 2022 09:58:00 GMT

Conferences are great. You meet new people, you learn new things. But have you ever found yourself back in the hotel after a day at a conference, thinking what to do now? Or were you ever stuck in one session, wishing you had gone for that other one? These moments are the perfect opportunity to open up your laptop and compete with your peers in a coding challenge.

Attendees of the three-day conference PyCon DE & PyData Berlin 2022 had the opportunity to do so with our coding challenge DropBlox.

Participants had a bit over one day to submit their solutions. After the deadline, we had received over 100 submissions and awarded the well-deserving winner a Lego R2D2 in front of a great crowd.

Read on to learn more about this challenge. We will discuss the following:

  • What was the challenge exactly, and what trade-offs were made in the design?
  • What was happening behind the screens to make this challenge possible?
  • How did we create hype at the conference itself?
  • What strategies were adopted by the participants to crack the problem?

Participants used a public repository that we made available here.

Challenge

Participants of the challenge were given the following:

  • A 100 x 100 field
  • 1500 blocks of various colors and shapes, each with a unique identifier (see Fig. 1)
  • A rewards table, specifying the points and multipliers per color (see Table 1)
Figure 1: A random subset of blocks from the challenge and their corresponding IDs.
Table 1: The rewards table, specifying how each color contributes to the score of a solution of the challenge. We assign points to each tile in the final solution, while the multiplier only applies to rows filled with tiles of the same color.

The rules are as follows:

  • Blocks can be dropped in the field from the top at a specified x-coordinate, without changing their rotation (see Fig. 2)
  • Each block can be used at most once
  • The score of a solution is computed using the rewards table. For each row, we add the points of each tile to the score. If the row consists of tiles of a single color only, we multiply the points of that row by the corresponding multiplier. The final score is the sum of the scores of the rows. (see Fig. 3)
Figure 2: An example 6×6 field and a blue block dropped at x-coordinate 1.

Figure 3: An example of the computation of the score of a solution. The points and multipliers per color are specified in Table 1.

The solution is a list of block IDs with corresponding x-coordinates. This list specifies which blocks to drop and where, to arrive at the final solution.
The goal of the challenge? Getting the most points possible.
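
To make the scoring rules concrete, here is a minimal sketch of how a solution and the score computation could be written down. The point and multiplier values, the grid representation and the (block ID, x-coordinate) tuple format are illustrative assumptions, not the challenge's actual data formats:

from typing import Dict, List, Optional, Tuple

# Illustrative rewards table (made-up values): points per tile and row multiplier per color.
POINTS: Dict[str, int] = {"orange": 3, "red": 2, "purple": 2, "blue": 1}
MULTIPLIERS: Dict[str, int] = {"orange": 4, "red": 3, "purple": 3, "blue": 2}

# A solution: which block to drop at which x-coordinate, in order.
Solution = List[Tuple[int, int]]  # (block ID, x-coordinate)

def score_field(field: List[List[Optional[str]]]) -> int:
    """Sum the points of every tile per row; if a row is completely filled with
    a single color, multiply that row's points by the color's multiplier."""
    total = 0
    for row in field:
        colors = [color for color in row if color is not None]
        row_points = sum(POINTS[color] for color in colors)
        if len(colors) == len(row) and len(set(colors)) == 1:
            row_points *= MULTIPLIERS[colors[0]]
        total += row_points
    return total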

The design

When designing the challenge, we came up with a few simple requirements to follow:

  • The challenge should be easy to get started with
  • The progress and final solution should be easy to visualize
  • It should be difficult to approach the optimum solution

Ideas about N-dimensional versions of this challenge came along, but only the ‘simple’ 2D design ticked all the boxes. It’s easy to visualize, because it’s 2D, and (therefore) easy to get started with. Still, a 100 x 100 field with 1500 blocks allows for enough freedom to play this game in more ways than there are atoms in the universe!

Behind the screens

Participants could, and anyone still can, submit their solutions on the submission page, as well as see the leaderboard with submissions of all other participants. To make this possible, several things happened behind the screens, which are worth noting here.

Most importantly, we worked with a team of excited people with complementary skill sets. Together we created the framework that is visualized in Fig. 4.

We have a separate private repository, in which we keep all the logic that is hidden from the participants. It contains the ground-truth scoring function and all logic necessary to run our web app. When participants submit their solution or check out the leaderboard, an Azure Function spins up to run the logic of our web app. The Azure Function is connected to an SQL database, where we store and retrieve submissions. We store images, such as the visualization of the final solution, in blob storage. To create the leaderboard, we retrieve the top-scoring submissions of each user and combine them with the corresponding images in the blob storage.

Figure 4: The different components of the challenge, including those hidden to the participants.
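
As a rough sketch (not the actual implementation), the submission endpoint could look something like an HTTP-triggered Azure Function in Python; the payload fields and the score_submission placeholder are hypothetical:

import json
import azure.functions as func

def score_submission(drops) -> int:
    # Placeholder for the hidden ground-truth scoring logic kept in the private repository.
    return 0

def main(req: func.HttpRequest) -> func.HttpResponse:
    # A hypothetical JSON payload: the participant's name and a list of [block ID, x] drops.
    payload = req.get_json()
    user = payload["user"]
    drops = payload["drops"]

    score = score_submission(drops)

    # In the real setup, the submission and score would then be written to the SQL
    # database, and the rendered field image to blob storage (omitted here).
    return func.HttpResponse(
        json.dumps({"user": user, "score": score}),
        mimetype="application/json",
    )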

Creating hype

What’s the use of a competition if nobody knows about it? Spread the word!

To attract attention to our coding competition, we did two things. First, we set up an appealing booth at the company stands. We put our prize right at the front, with a real-time dashboard showing the high scores next to it. Surely, anyone walking past would at least wonder what that Lego toy was doing at a Python conference.

Figure 5: Our company booth at PyCon DE & PyData Berlin 2022

Second, we went out to the conference Lightning Talks and announced our competition there. Really, the audience was great. Gotta love the energy at conferences like these.

Figure 6: Promoting the challenge at a Lightning Talk

With our promotion set up, competitors started trickling in. Let the games begin!

Strategies

Strategies varied from near brute-force approaches to the use of convolutional kernels and clever heuristics. In the following, we discuss some interesting and top-scoring submissions of participants.

#14

S. Tschöke named his first approach “Breakfast cereal”, inspired by how smaller cereal pieces collect at the bottom of a cereal bag and larger ones at the top. Pieces were dropped from left to right, smaller ones before larger ones, until none could fit anymore. This approach, resulting in around 25k points, was however not successful enough.

After a less successful but brave attempt using a genetic algorithm, he extended the breakfast cereal approach. This time, instead of a block’s size he used its density: the percentage of filled tiles within the block’s height and width. He also sorted the blocks by color. Taking a similar approach as before, but now including the different color orderings, resulted in 46k points. (See Fig. 7)

Figure 7: Final solution of the #14 in the challenge, S. Tschöke, with 45844 points.
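
As a rough illustration, such a density heuristic could look like the snippet below, assuming a block is represented as a boolean mask (a list of rows of filled and empty cells); the representation and function name are assumptions, not the actual challenge code:

from typing import List

def block_density(block: List[List[bool]]) -> float:
    """Fraction of filled tiles within the block's bounding box."""
    filled = sum(cell for row in block for cell in row)
    return filled / (len(block) * len(block[0]))

# Example: an L-shaped block with 4 filled tiles in a 3 x 2 bounding box.
l_block = [
    [True, False],
    [True, False],
    [True, True],
]
print(block_density(l_block))  # 0.666...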

#2

We jump up a few places to R. Garnier, who was #1 up until the last moments of the challenge. He went along with a small group of dedicated participants who started exploiting the row multipliers. Unexpectedly, this led to an interesting and exciting development in the competition.

His strategy consisted of two parts. The first was to manually construct some single-color rows from blocks of equal height. This way, he created three full orange rows, one red and one purple. Subsequently, he used a greedy algorithm with the following steps (sketched in code after the figure below):

  1. Assign a score to each block: score = (block values) / (block surface)
  2. Sort the blocks by score
  3. For each block, in order, drop it at the x-coordinate where it falls down the furthest

This strategy resulted in 62k points. (See Fig. 8)

Figure 8: Final solution of the #2 in the challenge, R. Garnier, with 62032 points.
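
As an illustration of step 3, a greedy drop loop could be sketched as follows, assuming the blocks are boolean masks already sorted by their score from steps 1 and 2; the field, block and solution representations here are assumptions, not the challenge's actual code:

from typing import List, Optional, Tuple

Block = List[List[bool]]  # a block as a mask of filled cells
Field = List[List[bool]]  # the field as a grid of occupied cells

def fits(field: Field, block: Block, x: int, y: int) -> bool:
    """Can the block's top-left corner sit at column x, row y without overlap or leaving the field?"""
    for dy, row in enumerate(block):
        for dx, cell in enumerate(row):
            if not cell:
                continue
            fx, fy = x + dx, y + dy
            if fy >= len(field) or fx >= len(field[0]) or field[fy][fx]:
                return False
    return True

def drop_row(field: Field, block: Block, x: int) -> Optional[int]:
    """Row the block comes to rest on when dropped straight down at x, or None if it cannot be dropped."""
    if not fits(field, block, x, 0):
        return None
    y = 0
    while fits(field, block, x, y + 1):
        y += 1
    return y

def greedy_drop(field: Field, blocks: List[Tuple[int, Block]]) -> List[Tuple[int, int]]:
    """For each (block ID, block), already sorted by score, drop it where it falls down the furthest."""
    solution: List[Tuple[int, int]] = []
    for block_id, block in blocks:
        best: Optional[Tuple[int, int]] = None  # (landing row, x)
        for x in range(len(field[0]) - len(block[0]) + 1):
            y = drop_row(field, block, x)
            if y is not None and (best is None or y > best[0]):
                best = (y, x)
        if best is None:
            continue  # the block does not fit anywhere; skip it
        y, x = best
        for dy, row in enumerate(block):  # place the block into the field
            for dx, cell in enumerate(row):
                if cell:
                    field[y + dy][x + dx] = True
        solution.append((block_id, x))
    return solution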

#1

With a single last-minute submission, G. Chanturia answered the question “How far can we go?”. He carefully constructed pairs of blocks that fit together to manually engineer the bottom half of his solution, taking the row multipliers to the next level.

G. is doing a PhD in Physics and an MSc in Computer Science, and fittingly splits his solution into a “physicist’s solution” and a “programmer’s solution”.

The physicist’s solution refers to the bottom part of the field. The strategy used here, as summarized by G., was (1) taking a look at the blocks, and (2) placing them in a smart way. Whether you are a theoretical or an experimental physicist, data serves as a starting point. G. noticed there was an order to the blocks. First of all, a lot of orange blocks had their top and bottom rows filled; placing these in a clever way already results in six completely filled rows. Second, there were “W”- and “M”-shaped blocks that fit perfectly together. He kept going like this to manually construct the bottom third of the field, accounting for 57% of the total score.

The programmer’s solution refers to the rest of the field. The problem with the first approach is that it is not scalable: if the blocks were to change, he would have to start all over. This second approach is more robust, and is similar to R. Garnier’s approach. The steps are as follows (a rough sketch of the first three steps follows after the figure below):

  1. Filter blocks based on their height. Blocks above height = 5 are filtered out, because many of them have shapes that are too irregular to work with.
  2. Sort the blocks by points (or, equivalently, by color). Blocks with higher-scoring colors are listed first.
  3. Pick the first n available blocks in the sorted list. The larger n, the better the solution, but the longer it takes to run. The chosen number was around 50.
  4. Find out which block can fall the furthest down in the field
  5. Drop that block and remove it from the list
  6. Repeat from 3
Figure 9: Final solution of the #1 in the challenge, G. Chanturia, with 68779 points.
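
The filtering and ranking in steps 1-3 could be sketched roughly like this; the helper name and block record layout are assumptions, while the height cutoff of 5 and n of around 50 come from the steps above. The drop loop itself would then be the same greedy routine sketched earlier:

from typing import List, Tuple

# Hypothetical block record: (block ID, boolean mask, points per filled tile).
BlockRecord = Tuple[int, List[List[bool]], int]

def pick_candidates(blocks: List[BlockRecord], max_height: int = 5, n: int = 50) -> List[BlockRecord]:
    # 1. Filter out blocks taller than max_height.
    usable = [b for b in blocks if len(b[1]) <= max_height]
    # 2. Sort so that blocks with higher-scoring colors come first.
    usable.sort(key=lambda b: b[2], reverse=True)
    # 3. Only consider the first n blocks of the sorted list each round.
    return usable[:n]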

And most importantly, the #1 place does not go home empty-handed: the winner takes home the coveted Lego R2D2! And we can confirm that, yes, the R2D2 turned quite some heads at the conference. The prize was handed over to the winner during the conference's final Lightning Talk session.

Figure 10: Our winner receives his coveted prize!

Conclusion

Organising the coding challenge has been lots of fun. We created a framework to host a high score leaderboard, process submissions and display the puzzles online. To sum up the process in a few words:

  • Hosting a coding challenge at a conference is fun!
  • Gather participants by promoting the challenge
  • Hand over the prize to the winner

It was interesting to see what strategies our participants came up with, and how the high score constantly improved even though this seemed unlikely at some point. We learned that starting off with a simple heuristic and expanding upon that is a good way to get your algorithm to solve a problem quickly. However, to win in our case, a hybrid solution involving a bit of manual engineering was needed to outperform all strategies relying solely on generalizing algorithms.

Quite likely, we will be re-using the framework we built to host more coding challenges. Keep an eye out for them!
Until then, check out the repository of the DropBlox challenge here.

We made the figures in this post using Excalidraw.

At GoDataDriven, we use data & AI to solve complex problems. But we share our knowledge too! If you liked this blogpost and want to learn more about Data Science, maybe the Data Science Learning Journey is something for you.

]]>
<![CDATA[The beauties of Peru]]>]]>https://jeroenoverschie.nl/peru/68dd1141f044ed00011d8589Sun, 31 Oct 2021 23:00:00 GMT]]>