RAG on GCP: production-ready GenAI on Google Cloud Platform (on Xebia.com ⧉)

Introduction

Google Cloud Platform has an increasing set of managed services that can help you build production-ready Retrieval-Augmented Generation applications. Services like Vertex AI Search & Conversation and Vertex AI Vector Search give us scalability and ease of use. How can you best leverage them to build RAG applications? Let’s explore together. Read along!

Retrieval Augmented Generation

Even though Retrieval-Augmented Generation (RAG) was already coined in 2020, the technique has been supercharged by the rise of Large Language Models (LLMs). With RAG, LLMs are combined with search techniques like vector search to enable real-time, efficient lookup of information that lies outside the model’s knowledge. This opens up many exciting new possibilities. Whereas interactions with LLMs were previously limited to the model’s knowledge, with RAG it is now possible to load in company-internal data like knowledge bases. Additionally, by instructing the LLM to always ‘ground’ its answer in factual data, hallucinations can be reduced.

Why RAG?

Let’s first take a step back. How exactly can RAG benefit us? When we interact with an LLM, all of its factual knowledge is stored inside the model weights. These weights are set during the training phase, which can be a while ago, sometimes more than a year.

LLM              Knowledge cut-off
Gemini 1.0 Pro   Early 2023 [1]
GPT-3.5 turbo    September 2021 [2]
GPT-4            September 2021 [3]

Knowledge cut-offs as of March 2024.

Additionally, these publicly offered models are trained on mostly public data. If you want to use company-internal data, an option is to fine-tune or retrain the model, which can be expensive and time-consuming.

The limitations of LLM interactions without using RAG.

This boils down to three main limitations: the model’s knowledge is outdated, the model has no access to internal data, and the model can hallucinate answers.

With RAG we can circumvent these limitations. Given the question a user has, information relevant to that question can be retrieved first to then be presented to the LLM.

How RAG can help an LLM provide more factual answers based on internal data.

The LLM can then use the retrieved information to generate a factual, up-to-date and yet human-readable answer. The LLM is instructed to ground its answer in the retrieved information at all times, which helps reduce hallucinations.

These benefits are great. So how do we actually build a RAG system?

Building a RAG system

In a RAG system, there are two main steps: 1) Document retrieval and 2) Answer generation. Whereas the document retrieval is responsible for finding the most relevant information given the user’s question, the answer generation is responsible for generating a human-readable answer based on information found in the retrieval step. Let’s take a look at both in more detail.

The two main steps in a RAG system: Document retrieval and Answer generation.

Document retrieval

First, Document retrieval. Documents are converted to plain text and chunked. The chunks are then embedded and stored in a vector database. User questions are also embedded, enabling a vector similarity search to obtain the best matching documents. Optionally, a step can be added to extract document metadata like title, author, summary and keywords, which can subsequently be used to perform a keyword search. This can all be illustrated like so:

Document retrieval step in a RAG system. Documents are converted to text and converted to embeddings. A user’s question is converted to an embedding such that a vector similarity search can be performed.

Neat. But what about GCP? We can map the above to GCP services as follows:

Document retrieval using GCP services including Document AI, textembedding-gecko and Vertex AI Vector Search.

Document AI is used to process documents and extract text, Gemini and textembedding-gecko generate metadata and embeddings respectively, and Vertex AI Vector Search stores the embeddings and performs the similarity search. By using these services, we can build a scalable retrieval step.
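
To make this concrete, below is a minimal sketch of embedding a user question and querying a Vector Search index with the Vertex AI Python SDK. The project, region, index endpoint and deployed index names are placeholders, and the document chunks are assumed to be indexed already.

from google.cloud import aiplatform
from vertexai.language_models import TextEmbeddingModel

# Placeholders: replace with your own project, region and Vector Search resources.
aiplatform.init(project="my-project", location="europe-west4")

# Embed the user's question with textembedding-gecko.
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
question = "What is our refund policy?"
question_embedding = embedding_model.get_embeddings([question])[0].values

# Query the Vector Search index for the best matching document chunks.
index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/123/locations/europe-west4/indexEndpoints/456"
)
neighbors = index_endpoint.find_neighbors(
    deployed_index_id="my_deployed_index",
    queries=[question_embedding],
    num_neighbors=5,
)
for neighbor in neighbors[0]:
    print(neighbor.id, neighbor.distance)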

Answer generation

Then, Answer generation. We will need an LLM for this and instruct it to use the provided documents. We can illustrate this like so:

Answer generation step using Gemini, with an example prompt. Both the user’s question and snippets of documents relevant to that question are inserted in the prompt.

Here, the documents can be formatted using an arbitrary function that generates valid markdown.
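
As an illustration, a minimal sketch of this step with the Vertex AI SDK could look as follows. The prompt wording and the format_documents helper are hypothetical; the point is that the retrieved snippets are injected into the prompt alongside the user’s question.

from vertexai.generative_models import GenerativeModel

def format_documents(documents: list[dict]) -> str:
    """Render retrieved chunks as a markdown list (hypothetical helper)."""
    return "\n".join(f"- {doc['title']}: {doc['text']}" for doc in documents)

def generate_answer(question: str, documents: list[dict]) -> str:
    prompt = (
        "Answer the question using only the documents below. "
        "If the answer is not in the documents, say you do not know.\n\n"
        f"Documents:\n{format_documents(documents)}\n\n"
        f"Question: {question}"
    )
    model = GenerativeModel("gemini-1.0-pro")
    response = model.generate_content(prompt)
    return response.text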

We have already come across multiple GCP services that can help us build a RAG system. So what other offerings does GCP have, and in what flavours can we combine these services?

The RAG flavours on GCP

So far, we have seen GCP services that can help us build a RAG system. These include Document AI, Vertex AI Vector Search, Gemini Pro, Cloud Storage and Cloud Run. But GCP also has Vertex AI Search & Conversation.

Vertex AI Search & Conversation is a service tailored to GenAI use cases, built to do some of the heavy lifting for us. It can ingest documents, create embeddings and manage the vector database; you only have to focus on ingesting data in the correct format. You can then use Search & Conversation in multiple ways: either get only search results, given a search query, or let Search & Conversation generate a full answer for you with source citations.
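
For instance, a plain search query against a Search & Conversation data store can be issued through the Discovery Engine client library. This is a sketch under assumptions: the project and data store IDs are placeholders, and the available request options depend on how your data store is configured.

from google.cloud import discoveryengine_v1 as discoveryengine

project_id = "my-project"        # placeholder
data_store_id = "my-data-store"  # placeholder

client = discoveryengine.SearchServiceClient()
serving_config = (
    f"projects/{project_id}/locations/global/collections/default_collection/"
    f"dataStores/{data_store_id}/servingConfigs/default_config"
)

request = discoveryengine.SearchRequest(
    serving_config=serving_config,
    query="What is our refund policy?",
    page_size=5,
)
for result in client.search(request=request):
    print(result.document.id)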

Even though Vertex AI Search & Conversation is very powerful, there can be scenarios where you want more control. Let’s take a look at the different levels of going managed versus staying in control.

The easiest way to get started with RAG on GCP is to use Search & Conversation. The service can ingest documents from multiple sources like BigQuery and Google Cloud Storage. Once those are ingested, it can generate answers backed by citations for you. This is simply illustrated like so:

Fully managed RAG using Search & Conversation for document retrieval and answer generation.

If you want more control, you can use Gemini for answer generation instead of letting Search & Conversation do it for you. This way, you are free to do any prompt engineering you like.

Partly managed RAG using Search & Conversation for document retrieval and Gemini for answer generation.

Lastly, you can have full control over the RAG system. This means you have to manage both the document retrieval and answer generation yourself, which does mean more manual engineering work. Documents can be processed by Document AI, chunked and embedded, and the resulting vectors stored in Vertex AI Vector Search. Then Gemini can be used to generate the final answers.

Full control RAG. Manual document processing, embedding creation and vector database management.

The advantage here is that you fully control how you process the documents and convert them to embeddings. You can use Document AI’s full Processors offering to process the documents in different ways.

Do take the trade-offs between the managed and the manual approach into consideration. Ask yourself questions like:

  • How much time and energy do you want to invest in building something custom for the flexibility that you need?
  • Do you really need that flexibility, given the extra maintenance cost it brings?
  • Do you have the engineering capacity to build and maintain a custom solution?
  • Are the initial build costs worth the money saved by not using a managed solution?

So then, you can decide what works best for you ✓.

Concluding

RAG is a powerful way to augment LLMs with external data. This can help reduce hallucinations and provide more factual answers. At the core of RAG systems are document processors, vector databases and LLMs.

Google Cloud Platform offers services that can help build production-ready RAG solutions. We have described three levels of control in deploying a RAG application on GCP:

  • Fully managed: using Search & Conversation.
  • Partly managed: managed search using Search & Conversation but manual prompt-engineering using Gemini.
  • Full control: manual document processing using Document AI, embedding creation and vector database management using Vertex AI Vector Search.

That said, I wish you good luck implementing your own RAG system. Use RAG for great good! ♡

Dataset enrichment using LLM’s ✨ (on Xebia.com ⧉)

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

Introduction

Azure DevOps pipelines are a great way to automate your CI/CD process. Most often, they are configured on a per-project basis. This works fine when you have few projects, but what if you have many? In this blog post, we will show you how you can scale up your Azure DevOps CI/CD setup for reusability and easy maintenance.

Your typical DevOps pipeline

A typical DevOps pipeline is placed inside the project repository. Let’s consider a pipeline for a Python project. It includes the following steps:

  • quality checks such as code formatting and linting
  • building a package such as a Python wheel
  • releasing a package to Python package registry (such as Azure Artifacts or PyPi)

Using an Azure DevOps pipeline, we can achieve this like so:

trigger:
- main

steps:
# Python setup & dependencies
- task: UsePythonVersion@0
  inputs:
    versionSpec: 3.10

- script: |
    pip install .[dev,build,release]
  displayName: 'Install dependencies'

# Code Quality
- script: |
    black --check .
  displayName: 'Formatting'

- script: |
    flake8 .
  displayName: 'Linting'

- script: |
    pytest .
  displayName: 'Testing'

# Build
- script: |
    echo $(Build.BuildNumber) > version.txt
  displayName: 'Set version number'

- script: |
    pip wheel \
      --no-deps \
      --wheel-dir dist/ \
      .
  displayName: 'Build wheel'

# Publish
- task: TwineAuthenticate@1
  inputs:
    artifactFeed: 'better-devops-pipelines-blogpost/devops-pipelines-blogpost'
  displayName: 'Authenticate pip with twine'

- script: |
    twine upload \
      --config-file $(PYPIRC_PATH) \
      --repository devops-pipelines-blogpost \
      dist/*.whl
  displayName: 'Publish wheel with twine'

Well, that is great, right? We have achieved all the goals we desired:

  • Code quality checks using black, flake8 and pytest.
  • Build and package the project as a Python wheel.
  • Publish the package to a registry of choice, in this case Azure Artifacts.

Growing pains

A DevOps pipeline like the above works fine for a single project. But, … what if we want to scale up? Say our company grows: we create more repositories, and more projects need to be packaged and released. Will we simply copy this pipeline and paste it into a new repository? Given that we are growing in size, can we be more efficient than just running this pipeline from start to finish?

The answer is no – we do not have to copy/paste all these pipelines into a new repo, and the answer is yes – we can be more efficient in running these pipelines. Let’s see how.

Scaling up properly

Let’s see how we can create scalable DevOps pipelines. First, we are going to introduce DevOps pipeline templates. These are modular pieces of pipeline that we can reuse across various pipelines and also across various projects residing in different repositories.

Let’s see how we can use pipeline templates to our advantage.

1. DevOps template setup

Let’s rewrite pieces of our pipeline into DevOps pipeline templates. Important to know here is that you can write templates for either stages, jobs or steps. The hierarchy is as follows:

stages:
- stage: Stage1
  jobs:
  - job: Job1
    steps:
    - script: echo "Step 1"  # a step

This can be illustrated in an image like so:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

We can then create a template in one file, for example for steps:

templates/code-quality.yml

steps:
- script: |
    echo "Hello world!"

.. and reuse it in our former pipeline:

stages:
- stage: Stage1
  jobs:
  - job: Job1
    steps:
    - template: templates/code-quality.yml

… or for those who prefer a more visual way of displaying it:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

That’s how easy it is to use DevOps pipeline templates! Let’s now apply it to our own use case.

Code quality checks template

First, let’s put the code quality checks pipeline into a template. We are also making the pipeline more extensive so it outputs test results and coverage reports. Remember, we are only defining this template once and then reusing it in other places.

templates/code-quality.yml

steps:
# Code Quality
- script: |
    black --check .
  displayName: 'Formatting'

- script: |
    flake8 .
  displayName: 'Linting'

- script: |
    pytest \
      --junitxml=junit/test-results.xml \
      --cov=. \
      --cov-report=xml:coverage.xml \
      .
  displayName: 'Testing'

# Publish test results + coverage
- task: PublishTestResults@2
  condition: succeededOrFailed()
  inputs:
    testResultsFiles: '**/test-*.xml'
    testRunTitle: 'Publish test results'
    failTaskOnFailedTests: true
  displayName: 'Publish test results'

- task: PublishCodeCoverageResults@1
  inputs:
    codeCoverageTool: 'Cobertura'
    summaryFileLocation: '**/coverage.xml'
  displayName: 'Publish test coverage'

… which we are using like so:

steps:
- template: templates/code-quality.yml

Easy! Also note we included two additional tasks: one to publish the test results and another to publish code coverage reports. That information is super useful to display inside DevOps. Lucky for us, DevOps has support for that:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

… clicking on the test results brings us to the Tests view, where we can see exactly which test failed (if any):

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

Lastly, there’s also a view explaining which lines of code you covered with tests and which you did not:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

Those come in very useful when you are working on testing your code!

Now we have defined all of this in DevOps templates. That gives us a more comfortable position to define more elaborate pipeline steps, because we will import those templates instead of copy/pasting them.

That said, we can summarise the benefits of using DevOps templates like so:

  • Define once, reuse everywhere
    We can reuse this code quality checks pipeline multiple times within the same project, but also from other repositories. If you are importing from another repo, see ‘Use other repositories‘ for setup; a minimal example follows this list.
  • Make it failproof
    You can invest in making just this one template very good, instead of having multiple bad versions hanging around in your organisation.
  • Reduce complexity
    Abstracting away commonly used code improves the readability of your pipelines. This allows newcomers to easily understand the different parts of your CI/CD setup.
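
As a sketch of that cross-repository setup (the project and repository names below are placeholders), the consuming pipeline declares the template repository as a resource and references templates with an @ suffix:

resources:
  repositories:
  - repository: templates                  # alias used in the @ reference below
    type: git
    name: my-project/pipeline-templates    # placeholder project/repository

steps:
- template: templates/code-quality.yml@templates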

2. Passing data between templates

Let’s go a step further and also abstract away the build and release steps into templates. We are going to use the following template (templates/build-wheel.yml) for building a Python wheel:

steps:
# Build wheel
- script: |
    echo $(Build.BuildNumber) > version.txt
  displayName: 'Set version number'

- script: |
    pip wheel \
      --no-deps \
      --wheel-dir dist/ \
      .
  displayName: 'Build wheel'

# Upload wheel as artifact
- task: CopyFiles@2
  inputs:
    contents: dist/**
    targetFolder: $(Build.ArtifactStagingDirectory)
  displayName: 'Copy wheel to artifacts directory'

- publish: '$(Build.ArtifactStagingDirectory)/dist'
  artifact: wheelFiles
  displayName: 'Upload wheel as artifact'

This definition is slightly different from the one in the initial pipeline: it uses artifacts. Artifacts allow us to pass data between jobs or stages, which is useful when we want to split up our pipeline into smaller pieces. Splitting the process into smaller segments gives us more visibility and control. Another benefit of splitting the Python wheel build and release process is that we can release to multiple providers at once.

When this pipeline is run, we can see an artifact (a wheel file) has been added:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

… with the actual wheel file in there:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

This is also useful so we can inspect what the build pipeline has produced. We can now download this wheel file from the artifacts again. We will do this in the publish pipeline.

templates/publish-wheel.yml

parameters:
- name: artifactFeed
  type: string
- name: repositoryName
  type: string

steps:
# Retrieve wheel
- download: current
  artifact: wheelFiles
  displayName: 'Download artifacts'

# Publish wheel
- task: TwineAuthenticate@1
  inputs:
    artifactFeed: ${{ parameters.artifactFeed }}
  displayName: 'Authenticate pip with twine'

- script: |
    twine upload \
      --config-file $(PYPIRC_PATH) \
      --repository ${{ parameters.repositoryName }} \
      $(Pipeline.Workspace)/wheelFiles/*.whl
  displayName: 'Publish wheel with twine'

… both the build- and release pipeline can be used like so:

- stage: Build
  jobs:
  - job: BuildWheel
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: 3.10

    - script: |
        pip install .[build]
      displayName: 'Install dependencies'

    - template: templates/build-wheel.yml


- stage: Publish
  jobs:
  - job: PublishWheel
    steps:
    - script: |
        pip install twine==4.0.2
      displayName: 'Install twine'

    - template: templates/publish-wheel.yml
      parameters:
        artifactFeed: 'better-devops-pipelines-blogpost/devops-pipelines-blogpost'
        repositoryName: 'devops-pipelines-blogpost'

And here another new feature comes in: stages. Stages allow us to execute groups of jobs that depend on each other. We have now split up our pipeline into two stages:

  1. Build stage
  2. Publish stage

Using stages makes it easy to see what is going on. It provides transparency and allows you to easily track the progress of the pipeline. You can also launch stages separately, skipping previous stages, as long as the necessary dependencies are in place. For example, dependencies can include artifacts generated in a previous stage.
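
To make such a dependency explicit, you can add dependsOn (and optionally a condition) to a stage. A minimal sketch, reusing the stage names from above:

- stage: Publish
  dependsOn: Build          # wait for the Build stage...
  condition: succeeded()    # ...and only run if it succeeded
  jobs:
  - job: PublishWheel
    steps:
    - template: templates/publish-wheel.yml
      parameters:
        artifactFeed: 'better-devops-pipelines-blogpost/devops-pipelines-blogpost'
        repositoryName: 'devops-pipelines-blogpost'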

Improving the release process

So what is another advantage of this setup? Say that you are releasing your package to two pip registries. Doing that is easy using this setup by creating two jobs in the publish stage:

- stage: Publish
  jobs:
  - job: PublishToRegistryOne
    steps:
    - script: |
        pip install twine==4.0.2
      displayName: 'Install twine'

    - template: templates/publish-wheel.yml
      parameters:
        artifactFeed: 'better-devops-pipelines-blogpost/registry-1'
        repositoryName: 'devops-pipelines-blogpost'

  - job: PublishToRegistryTwo
    steps:
    - script: |
        pip install twine==4.0.2
      displayName: 'Install twine'

    - template: templates/publish-wheel.yml
      parameters:
        artifactFeed: 'better-devops-pipelines-blogpost/registry-2'
        repositoryName: 'devops-pipelines-blogpost'

As you can see, we can use the defined templates to scale our pipelines. What is essential here is that, thanks to the artifacts, we build our wheel once and consume that same wheel multiple times.

Additionally, the publishing jobs launch in parallel by default (unless dependencies are explicitly defined). This speeds up your release process.

3. Automate using a strategy matrix

Let’s go back to the code quality stage for a minute. In the code quality stage, we are first installing a certain Python version and then running all quality checks. However, we might need guarantees that our code works for multiple Python versions. This is often the case when releasing a package, for example. How can we easily automate running our Code Quality pipeline using our pipeline templates? One option is to manually define a couple of jobs and install the correct Python version in each job. Another option is to use a strategy matrix. This allows us to define a matrix of variables that we can use in our pipeline.

We can improve our CodeQualityChecks job like so:

jobs:
- job: CodeQualityChecks
  strategy:
    matrix:
      Python38:
        python.version: '3.8'
      Python39:
        python.version: '3.9'
      Python310:
        python.version: '3.10'
      Python311:
        python.version: '3.11'

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: $(python.version)

  - script: |
      pip install .[dev]
    displayName: 'Install dependencies'

  - template: templates/code-quality.yml

Awesome! The pipeline now runs the entire code quality pipeline for each Python version. Looking at how our pipeline runs now, we can see multiple jobs, one for each Python version:

Scaling up: bringing your Azure DevOps CI/CD setup to the next level 🚀 (on Xebia.com ⧉)

… as you can see, 4 jobs are launched. If no job dependencies are explicitly set, jobs within one stage run in parallel! That greatly speeds up the pipeline and lets you iterate faster. That’s definitely a win.

Final result

Let’s wrap it up! Our entire pipeline, using templates:

trigger:
- main

stages:
- stage: CodeQuality
  jobs:
  - job: CodeQualityChecks
    strategy:
      matrix:
        Python38:
          python.version: '3.8'
        Python39:
          python.version: '3.9'
        Python310:
          python.version: '3.10'
        Python311:
          python.version: '3.11'

    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: $(python.version)

    - script: |
        pip install .[dev]
      displayName: 'Install dependencies'

    - template: templates/code-quality.yml


- stage: Build
  jobs:
  - job: BuildWheel
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: 3.10

    - script: |
        pip install .[build]
      displayName: 'Install dependencies'

    - template: templates/build-wheel.yml


- stage: Publish
  jobs:
  - job: PublishWheel
    steps:
    - script: |
        pip install twine==4.0.2
      displayName: 'Install twine'

    - template: templates/publish-wheel.yml
      parameters:
        artifactFeed: 'better-devops-pipelines-blogpost/devops-pipelines-blogpost'
        repositoryName: 'devops-pipelines-blogpost'

… which uses the templates we defined above: templates/code-quality.yml, templates/build-wheel.yml and templates/publish-wheel.yml.

… for the entire source code, see the better-devops-pipelines-blogpost repo. The repository contains pipelines that apply the principles explained above, providing testing, building and releasing for a Python project ✓.

Conclusion

We demonstrated how to scale up your Azure DevOps CI/CD setup, making it reusable, maintainable and modular. This helps you maintain a good CI/CD setup as your company grows.

In short, we achieved the following:

  • Create modular DevOps pipelines using templates. This makes it easier to reuse pipelines across projects and repositories.
  • Pass data between DevOps pipeline jobs using artifacts. This allows us to split up our pipeline into smaller pieces that can consume artifacts from previous jobs.
  • Split up your pipeline into stages to create more visibility and control over your CI/CD.

An example repository containing good-practice pipelines is available at:

https://dev.azure.com/godatadriven/_git/better-devops-pipelines-blogpost

Cheers 🙏

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Imagine the following scenario 💭.

Your company uses Apache Spark to process data, and your team has pyspark set up in a Python project. The codebase is built on a specific Python version, using a certain Java installation, and an accompanying pyspark version that works with the former. To onboard a new member, you will need to hand over a list of instructions the developer needs to follow carefully to get their setup working. But not everyone runs the same laptop environment: different hardware, different operating systems. This is getting challenging.
But the setup is a one-off, right? Just go through the setup once and you’ll be good. Not entirely. Your code environment will change over time: your team will probably install, update or remove packages during the project’s development. This means that if a developer creates a new feature and changes their own environment to do so, he or she also needs to make sure that the other team members change theirs and that the production environment is updated accordingly. This makes it easy to end up with misaligned environments: between developers, and between development & production.

We can do better than this! Instead of giving other developers a setup document, let’s make sure we also create formal instructions so we can automatically set up the development environment. Devcontainers let us do exactly this.

Devcontainers let you connect your IDE to a running Docker container. In this way, we get the benefits of reproducibility and isolation, whilst getting a native development experience.

With Devcontainers you can interact with your IDE like you're used to whilst under the hood running everything inside a Docker container.

Devcontainers can help us:

  • Get a reproducible development environment
  • ⚡️ Instantly onboard new team members onto your project
  • ‍ ‍ ‍ Better align the environments between team members
  • ⏱ Keeping your dev environment up-to-date & reproducible saves your team time going into production later

Let’s explore how we can set up a Devcontainer for your Python project!

Creating your first Devcontainer

Note that this tutorial is focused on VSCode. Other IDEs like PyCharm support running in Docker containers, but the support is less comprehensive than in VSCode.

Recap

To recap, we are trying to create a dev environment that installs: 1) Java, 2) Python and 3) pyspark. And we want to do so automatically, that is, inside a Docker image.

Project structure

Let’s say we have a really simple project that looks like this:

$ tree .
.
├── README.md
├── requirements.txt
├── requirements-dev.txt
├── sales_analysis.py
└── test_sales_analysis.py

That is, we have a Python module with an accompanying test, a requirements.txt file describing our production dependencies (pyspark), and a requirements-dev.txt describing dependencies that should be installed in development only (pytest, black, mypy). Now let’s see how we can extend this setup to include a Devcontainer.

The .devcontainer folder

Your Devcontainer spec will live inside the .devcontainer folder. There will be two main files:

  • devcontainer.json
  • Dockerfile

Create a new file called devcontainer.json:

{
    "build": {
        "dockerfile": "Dockerfile",
        "context": ".."
    }
}

This means: as a base for our Devcontainer, use the Dockerfile located in the current directory, and build it with .. (the repository root) as the build context.

So what does this Dockerfile look like?

FROM python:3.10

# Install Java
RUN apt update && \
    apt install -y sudo && \
    sudo apt install default-jdk -y

## Pip dependencies
# Upgrade pip
RUN pip install --upgrade pip
# Install production dependencies
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt && \
    rm /tmp/requirements.txt
# Install development dependencies
COPY requirements-dev.txt /tmp/requirements-dev.txt
RUN pip install -r /tmp/requirements-dev.txt && \
    rm /tmp/requirements-dev.txt

We are building our image on top of python:3.10, which is a Debian-based image. This is one of the Linux distributions that a Devcontainer can be built on. The main requirement is that Node.js should be able to run: VSCode automatically installs VSCode Server on the machine. For an extensive list of supported distributions, see “Remote Development with Linux”.

On top of python:3.10, we install Java and the required pip packages.

Opening the Devcontainer

The .devcontainer folder is in place, so it’s now time to open our Devcontainer.

First, make sure you have the Dev Containers extension installed in VSCode (previously called “Remote – Containers”). That done, if you open your repo again, the extension should already detect your Devcontainer:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Alternatively, you can open up the command palette (CMD + Shift + P) and select “Dev Containers: Reopen in Container”:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Your VSCode is now connected to the Docker container:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

What is happening under the hood

Besides starting the Docker image and attaching the terminal to it, VSCode is doing a couple more things:

  1. VSCode Server is being installed on your Devcontainer. VSCode Server is installed as a service in the container itself so your VSCode installation can communicate with the container, for example to install and run extensions.
  2. Config is copied over. Config like ~/.gitconfig and ~/.ssh/known_hosts are copied over to their respective locations in the container.
    This then allows you to use your Git repo like you do normally, without re-authenticating.
  3. Filesystem mounts. VSCode automatically takes care of mounting: 1) The folder you are running the Devcontainer from and 2) your VSCode workspace folder.

Opening your repo directly in a Devcontainer

Since all instructions on how to configure your dev environment are now defined in a Dockerfile, users can open up your Devcontainer with just one button:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Ain’t that cool? You can add a button to your repo like so:

[![Open in Remote - Containers](https://xebia.com/wp-content/uploads/2023/11/v1.svg)](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/godatadriven/python-devcontainer-template)

Just modify the GitHub URL ✓.

That said, we can see that having a Devcontainer can make our README massively more readable. What kind of README would you rather have?

Manual installation vs. using a Devcontainer

Extending the Devcontainer

We have built a working Devcontainer, which is great! But a couple of things are still missing. We still want to:

  • Install a non-root user for extra safety and good-practice
  • Pass in custom VSCode settings and install extensions by default
  • Be able to access Spark UI (port 4040)
  • Run Continuous Integration (CI) in the Devcontainer

Let’s see how.

Installing a non-root user

If you pip install a new package, you will see the following message:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Indeed, it is not recommended to develop as a root user. It is considered a good practice to create a different user with fewer rights to run in production. So let’s go ahead and create a user for this scenario.

# Add non-root user
ARG USERNAME=nonroot
RUN groupadd --gid 1000 $USERNAME && \
    useradd --uid 1000 --gid 1000 -m $USERNAME
## Make sure to reflect new user in PATH
ENV PATH="/home/${USERNAME}/.local/bin:${PATH}"
USER $USERNAME

Add the following property to devcontainer.json:

    "remoteUser": "nonroot"

That’s great! When we now start the container we should connect as the user nonroot.

Passing custom VSCode settings

Our Devcontainer is still a bit bland, without extensions and settings. Besides any custom extensions a user might want to install, we can install some for them by default already. We can define such settings in customizations.vscode:

     "customizations": {
        "vscode": {
            "extensions": [
                "ms-python.python"
            ],
            "settings": {
                "python.testing.pytestArgs": [
                    "."
                ],
                "python.testing.unittestEnabled": false,
                "python.testing.pytestEnabled": true,
                "python.formatting.provider": "black",
                "python.linting.mypyEnabled": true,
                "python.linting.enabled": true
            }
        }
    }

The defined extensions are always installed in the Devcontainer. However, the defined settings provide just a default for the user to use, and can still be overridden by other setting scopes like User Settings, Remote Settings, or Workspace Settings.

Accessing Spark UI

Since we are using pyspark, we want to be able to access Spark UI. When we start a Spark session, VSCode will ask whether you want to forward the specific port. Since we already know this is Spark UI, we can do so automatically:

    "portsAttributes": {
        "4040": {
            "label": "SparkUI",
            "onAutoForward": "notify"
        }
    },

    "forwardPorts": [
        4040
    ]

When we now run our code, we get a notification we can open Spark UI in the browser:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Resulting in the Spark UI as we know it:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Running our CI in the Devcontainer

Wouldn’t it be convenient if we could re-use our Devcontainer to run our Continuous Integration (CI) pipeline as well? Indeed, we can do this with Devcontainers. Similarly to how the Devcontainer image is built locally using docker build, the same can be done within a CI/CD pipeline. There are two basic options:

  1. Build the Docker image within the CI/CD pipeline
  2. Prebuilding the image

To pre-build the image, the build step will need to run either periodically or whenever the Docker definition has changed. Since this adds quite some complexity let’s dive into building the Devcontainer as part of the CI/CD pipeline first (for pre-building the image, see the ‘Awesome resources’ section). We will do so using GitHub Actions.

Using devcontainers/ci

Luckily, a GitHub Action was already set up for us to do exactly this:

https://github.com/devcontainers/ci

To now build, push and run a command in the Devcontainer is as easy as:

name: Python app

on:
  pull_request:
  push:
    branches:
      - "**"

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout (GitHub)
        uses: actions/checkout@v3

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and run dev container task
        uses: devcontainers/ci@v0.2
        with:
          imageName: ghcr.io/${{ github.repository }}/devcontainer
          runCmd: pytest .

That’s great! Whenever this workflow runs on your main branch, the image will be pushed to the configured registry; in this case GitHub Container Registry (GHCR). See below a trace of the executed GitHub Action:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

Awesome!

The final Devcontainer definition

We built the following Devcontainer definitions. First, devcontainer.json:

{
    "build": {
        "dockerfile": "Dockerfile",
        "context": ".."
    },

    "remoteUser": "nonroot",

    "customizations": {
        "vscode": {
            "extensions": [
                "ms-python.python"
            ],
            "settings": {
                "python.testing.pytestArgs": [
                    "."
                ],
                "python.testing.unittestEnabled": false,
                "python.testing.pytestEnabled": true,
                "python.formatting.provider": "black",
                "python.linting.mypyEnabled": true,
                "python.linting.enabled": true
            }
        }
    },

    "portsAttributes": {
        "4040": {
            "label": "SparkUI",
            "onAutoForward": "notify"
        }
    },

    "forwardPorts": [
        4040
    ]
}

And our Dockerfile:

FROM python:3.10

# Install Java
RUN apt update && \
    apt install -y sudo && \
    sudo apt install default-jdk -y

# Add non-root user
ARG USERNAME=nonroot
RUN groupadd --gid 1000 $USERNAME && \
    useradd --uid 1000 --gid 1000 -m $USERNAME
## Make sure to reflect new user in PATH
ENV PATH="/home/${USERNAME}/.local/bin:${PATH}"
USER $USERNAME

## Pip dependencies
# Upgrade pip
RUN pip install --upgrade pip
# Install production dependencies
COPY --chown=nonroot:1000 requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt && \
    rm /tmp/requirements.txt
# Install development dependencies
COPY --chown=nonroot:1000 requirements-dev.txt /tmp/requirements-dev.txt
RUN pip install -r /tmp/requirements-dev.txt && \
    rm /tmp/requirements-dev.txt

The full Devcontainer implementation and all the above steps can be found in the various branches of the godatadriven/python-devcontainer-template repo.

Docker images architecture: Three environments

With the CI now set up, we can reuse the same Docker image for two purposes: local development and running our quality checks. And, once we deploy this application to production, we could configure the Devcontainer to use our production image as a base and install extra dependencies on top. If we want to optimize the CI image to be as lightweight as possible, we could also strip off any extra dependencies that we do not require in the CI environment; things such as extra CLI tooling, a better shell such as ZSH, and so forth.

This sets us up with three different images for our entire lifecycle: one for development, one for CI, and one for production. This can be visualized like so:

How to create a Devcontainer for your Python project 🐳 (on Xebia.com ⧉)

So, we can see that when using a Devcontainer you can re-use your production image and build on top of it. Install extra tooling, make sure it can talk to VSCode, and you’re done.

Going further

There are lots of other resources to explore; Devcontainers are well-documented and there are many posts about it. If you’re up for more, let’s see what else you can do.

Devcontainer features

Devcontainer features allow you to easily extend your Docker definition with common additions, such as extra CLI tools or language runtimes.
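
As an illustration, features are added under a "features" key in devcontainer.json. A minimal sketch (the feature IDs below come from the public devcontainers/features registry; check containers.dev for the current list and versions):

{
    "features": {
        "ghcr.io/devcontainers/features/azure-cli:1": {},
        "ghcr.io/devcontainers/features/docker-in-docker:2": {}
    }
}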

Devcontainer templates

On the official Devcontainer specification website there are loads of templates available. Good chance (part of) your setup is in there. A nice way to get a head-start in building your Devcontainer or to get started quickly.

See: https://containers.dev/templates

Mounting directories

Re-authenticating your CLI tools is annoying. So one trick is to mount your AWS/Azure/GCP credentials from your local computer into your Devcontainer. This way, authentications done in either environment are shared with the other. You can easily do this by adding this to devcontainer.json:

  "mounts": [
    "source=/Users/<your_username>/.aws,target=/home/nonroot/.aws,type=bind,consistency=cached"
  ]

^ the above example mounts your AWS credentials, but the process should be similar for other cloud providers (GCP / Azure).

Awesome resources

Concluding

Devcontainers allow you to connect your IDE to a running Docker container, allowing for a native development experience but with the benefits of reproducibility and isolation. This makes it easier to onboard new joiners and align development environments between team members. Devcontainers are very well supported in VSCode and are now being standardized in an open specification. Even though it will probably still take a while to see wide adoption, the specification is a good candidate for the standardization of Devcontainers.

About

This blogpost is written by Jeroen Overschie, working at Xebia.

pyspark-bucketmap

Have you ever heard of pyspark’s Bucketizer? Although you perhaps won’t need it for a simple transformation, it can be really useful for certain use cases.

In this blogpost, we will:

  1. Explore the Bucketizer class
  2. Combine it with create_map
  3. Use a module so we don't have to write the logic ourselves 🗝🥳

Let's get started!

The problem

First, let's boot up a local spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark

Say we have this dataset containing some persons:

from pyspark.sql import Row

people = spark.createDataFrame(
    [
        Row(age=12, name="Damian"),
        Row(age=15, name="Jake"),
        Row(age=18, name="Dominic"),
        Row(age=20, name="John"),
        Row(age=27, name="Jerry"),
        Row(age=101, name="Jerry's Grandpa"),
    ]
)
people

Okay, that's great. Now, what we would like to do, is map each person's age to an age category.

age range      life phase
0 to 12        Child
12 to 18       Teenager
18 to 25       Young adulthood
25 to 70       Adult
70 and beyond  Elderly

How best to go about this?

Using Bucketizer + create_map

We can use pyspark's Bucketizer for this. It works like so:

from pyspark.ml.feature import Bucketizer
from pyspark.sql import DataFrame

bucketizer = Bucketizer(
    inputCol="age",
    outputCol="life phase",
    splits=[
        -float("inf"), 0, 12, 18, 25, 70, float("inf")
    ]
)
bucketed: DataFrame = bucketizer.transform(people)
bucketed.show()

age  name             life phase
12   Damian           2.0
15   Jake             2.0
18   Dominic          3.0
20   John             3.0
27   Jerry            4.0
101  Jerry's Grandpa  5.0

Cool! We just put our ages in buckets, represented by numbers. Let's now map each bucket to a life phase.

from pyspark.sql.functions import lit, create_map
from typing import Dict
from pyspark.sql.column import Column

range_mapper = create_map(
    [lit(0.0), lit("Not yet born")]
    + [lit(1.0), lit("Child")]
    + [lit(2.0), lit("Teenager")]
    + [lit(3.0), lit("Young adulthood")]
    + [lit(4.0), lit("Adult")]
    + [lit(5.0), lit("Elderly")]
)
people_phase_column: Column = bucketed["life phase"]
people_with_phase: DataFrame = bucketed.withColumn(
    "life phase", range_mapper[people_phase_column]
)
people_with_phase.show()

age  name             life phase
12   Damian           Teenager
15   Jake             Teenager
18   Dominic          Young adulthood
20   John             Young adulthood
27   Jerry            Adult
101  Jerry's Grandpa  Elderly

🎉 Success!

Using a combination of Bucketizer and create_map, we managed to map people's age to their life phases.

pyspark-bucketmap

🎁 As a bonus, I put all of the above in a neat little module, which you can install simply using pip.

%pip install pyspark-bucketmap

Define the splits and mappings like before. Each dictionary key is a mapping to the n-th bucket (for example, bucket 1 refers to the range 0 to 12).

from typing import List

splits: List[float] = [-float("inf"), 0, 12, 18, 25, 70, float("inf")]
mapping: Dict[int, Column] = {
    0: lit("Not yet born"),
    1: lit("Child"),
    2: lit("Teenager"),
    3: lit("Young adulthood"),
    4: lit("Adult"),
    5: lit("Elderly"),
}

Then, simply import pyspark_bucketmap.BucketMap and call transform().

from pyspark_bucketmap import BucketMap
from typing import List, Dict

bucket_mapper = BucketMap(
    splits=splits, mapping=mapping, inputCol="age", outputCol="phase"
)
phases_actual: DataFrame = bucket_mapper.transform(people).select("name", "phase")
phases_actual.show()

name             phase
Damian           Teenager
Jake             Teenager
Dominic          Young adulthood
John             Young adulthood
Jerry            Adult
Jerry's Grandpa  Elderly

Cheers 🙏🏻

You can find the module here:
https://github.com/dunnkers/pyspark-bucketmap


Written by Jeroen Overschie, working at GoDataDriven.

DropBlox: Coding Challenge at PyCon DE & PyData Berlin 2022 (on Xebia.com ⧉)

Conferences are great. You meet new people, you learn new things. But have you ever found yourself back in the hotel after a day at a conference, thinking what to do now? Or were you ever stuck in one session, wishing you had gone for that other one? These moments are the perfect opportunity to open up your laptop and compete with your peers in a coding challenge.

Attendees of the three-day conference PyCon DE & PyData Berlin 2022 had the possibility to do so, with our coding challenge DropBlox.

Participants had a bit over one day to submit their solutions. After the deadline, we had received over 100 submissions and awarded the well-deserved winner a Lego R2D2 in front of a great crowd.

Read on to learn more about this challenge. We will discuss the following:

  • What was the challenge exactly, and what trade-offs were made in the design?
  • What was happening behind the screens to make this challenge possible?
  • How did we create hype at the conference itself?
  • What strategies were adopted by the participants to crack the problem?

Participants used a public repository that we made available here.

Challenge

Participants of the challenge were given the following:

  • A 100 x 100 field
  • 1500 blocks of various colors and shapes, each with a unique identifier (see Fig. 1)
  • A rewards table, specifying the points and multipliers per color (see Table 1)
Figure 1: A random subset of blocks from the challenge and their corresponding IDs.
Table 1: The rewards table, specifying how each color contributes to the score of a solution of the challenge. We assign points to each tile in the final solution, while the multiplier only applies to rows filled with tiles of the same color.

The rules are as follows:

  • Blocks can be dropped in the field from the top at a specified x-coordinate, without changing their rotation (see Fig. 2)
  • Each block can be used at most once
  • The score of a solution is computed using the rewards table. For each row, we add the points of each tile to the score. If the row consists of tiles of a single color only, we multiply the points of that row by the corresponding multiplier. The final score is the sum of the scores of the rows. (see Fig. 3)
Figure 2: An example 6×6 field and a blue block dropped at x-coordinate 1.

Figure 3: An example of the computation of the score of a solution. The points and multipliers per color are specified in Table 1.
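
To make the scoring rule concrete, here is a minimal sketch of it in Python. The field representation (a 2D array of color ids, 0 meaning empty) and the points/multipliers values are stand-ins for illustration rather than the actual challenge code, and it assumes the multiplier only applies to completely filled single-color rows.

import numpy as np

# Stand-in rewards table: {color id: value}; 0 denotes an empty tile.
POINTS = {1: 1, 2: 2, 3: 3}
MULTIPLIERS = {1: 2, 2: 3, 3: 4}

def score(field: np.ndarray) -> int:
    """Score a field row by row, applying the multiplier to full single-color rows."""
    total = 0
    for row in field:
        tiles = [int(c) for c in row if c != 0]
        if not tiles:
            continue
        row_score = sum(POINTS[c] for c in tiles)
        if len(tiles) == len(row) and len(set(tiles)) == 1:
            row_score *= MULTIPLIERS[tiles[0]]
        total += row_score
    return total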

The solution is a list of block IDs with corresponding x-coordinates. This list specifies which blocks to drop and where, in order to come to the final solution.
The goal of the challenge? Getting the most points possible.

The design

When designing the challenge, we came up with a few simple requirements to follow:

  • The challenge should be easy to get started with
  • The progress and final solution should be easy to visualize
  • It should be difficult to approach the optimum solution

Ideas about N-dimensional versions of this challenge came along, but only the ‘simple’ 2D design ticked all the boxes. It’s easy to visualize, because it’s 2D, and (therefore) easy to get started with. Still, a 100 x 100 field with 1500 blocks allows for enough freedom to play this game in more ways than there are atoms in the universe!

Behind the screens

Participants could, and anyone still can, submit their solutions on the submission page, as well as see the leaderboard with submissions of all other participants. To make this possible, several things happened behind the screens, which are worth noting here.

Most importantly, we worked with a team of excited people with complementing skill sets. Together we created the framework that is visualized in Fig. 4.

We have a separate private repository, in which we keep all the logic that is hidden from the participants. In there, we have the ground-truth scoring function and all logic necessary to run our web app. When participants submit their solution or check out the leaderboard, an Azure Function spins up to run the logic of our web app. The Azure Function is connected to an SQL database, where we store and retrieve submissions. We store images, such as the visualization of the final solution, in blob storage. To create the leaderboard, we retrieve the top-scoring submissions of each user and combine them with the corresponding images in the blob storage.

Figure 4: The different components of the challenge, including those hidden to the participants.

Creating hype

What’s the use of a competition if nobody knows about it? Spread the word!

To attract attention to our coding competition, we did two things. First, we set up an appealing booth at the company stands. We put our prize right at the front, with a real-time dashboard showing the high scores beside it. Surely, anyone walking past would at least wonder what that Lego toy is doing at a Python conference.

Figure 5: Our company booth at PyCon DE & PyData Berlin 2022

Second, we went out to the conference Lightning Talks and announced our competition there. Really, the audience was great. Gotta love the energy at conferences like these.

Figure 6: Promoting the challenge at a Lightning Talk

With our promotion set up, competitors started trickling in. Let the games begin!

Strategies

Strategies varied from near brute-force approaches to the use of convolutional kernels and clever heuristics. In the following, we discuss some interesting and top-scoring submissions of participants.

#14

S. Tschöke named his first approach “Breakfast cereal”, as he was inspired by smaller cereal pieces collecting at the bottom and larger ones at the top of a cereal bag. Pieces were dropped from left to right, smaller ones before larger ones, until none could fit anymore. This approach, resulting in around 25k points, was however not successful enough.

After a less successful but brave attempt using a genetic algorithm, he extended the breakfast cereal approach. This time, instead of the size of a block, he used the block’s density (the percentage of filled tiles within the height and width of a block), and he sorted the blocks by color. Taking a similar approach as before, but now including the different color orderings, resulted in 46k points (see Fig. 6).

Figure 6: Final solution of the #14 in the challenge, S. Tschöke, with 45844 points.

#2

We jump up a few places to R. Garnier, who was #1 up until the last moments of the challenge. He went along with a small group of dedicated participants who started exploiting the row multipliers. Unexpectedly, this led to an interesting and exciting development in the competition.

His strategy consisted of two parts. The first was to manually construct some rows of the same color and the same height. This way, he created 3 full orange rows, one red and one purple. Subsequently, he used a greedy algorithm, following these steps:

  1. Assign a score to each block: score = (block values) / (block surface)
  2. Sort the blocks by score
  3. For each block, drop the block where it falls down the furthest

This strategy resulted in 62k points. (See Fig. 7)

Figure 7: Final solution of the #2 in the challenge, R. Garnier, with 62032 points.
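
A minimal sketch of such a greedy strategy is shown below. The block and field representation (boolean numpy arrays with row 0 at the bottom) is hypothetical and ignores colors and scoring; it only illustrates the “drop each block where it falls down the furthest” idea.

import numpy as np

def column_heights(field: np.ndarray) -> np.ndarray:
    """Stack height of every column (field is a bool array, row 0 = bottom)."""
    return np.array([
        int(np.where(field[:, c])[0].max()) + 1 if field[:, c].any() else 0
        for c in range(field.shape[1])
    ])

def landing_row(field: np.ndarray, block: np.ndarray, x: int) -> int:
    """Row at which the block's bottom row comes to rest when dropped at column x."""
    heights = column_heights(field)
    rest = 0
    for j in range(block.shape[1]):
        col = block[:, j]
        if col.any():
            bottom = int(np.argmax(col))  # lowest filled cell in this block column
            rest = max(rest, heights[x + j] - bottom)
    return rest

def greedy_drop(field: np.ndarray, blocks: list[np.ndarray]) -> list[tuple[int, int]]:
    """Drop each block (assumed pre-sorted by score) where it falls the furthest down."""
    moves = []
    for block_id, block in enumerate(blocks):
        xs = range(field.shape[1] - block.shape[1] + 1)
        best_x = min(xs, key=lambda x: landing_row(field, block, x))
        y = landing_row(field, block, best_x)
        if y + block.shape[0] > field.shape[0]:
            continue  # even the best position overflows the field, skip this block
        field[y:y + block.shape[0], best_x:best_x + block.shape[1]] |= block
        moves.append((block_id, best_x))
    return moves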

#1

With a single last-minute submission, G. Chanturia answered the question “How far can we go?”. He carefully constructed pairs of blocks that fit together to manually engineer the bottom half of his solution, taking the row multipliers to the next level.

G. is doing a PhD in Physics and an MSc in Computer Science, and fittingly splits his solution into a “physicist’s solution” and a “programmer’s solution”.

The physicist’s solution refers to the bottom part of the field. The strategy used here, as summarized by G., was (1) taking a look at the blocks, and (2) placing them in a smart way. Whether you are a theoretical or experimental physicist, data serves as a starting point. G. noticed there is an order in the blocks. First of all, a lot of orange blocks had their top and bottom rows filled. Placing these in a clever way already results in six completely filled rows.
Second, there were “W” and “M”-shaped blocks that fit perfectly together. He kept going on like this to manually construct the bottom 1/3 of the field, accounting for 57% of the total score.

The programmer’s solution refers to the rest of the field. The problem with the first approach is that it is not scalable. Even more, if the blocks were to change, he would have to start all over. This second approach is more robust, and is similar to R. Garnier’s approach. The steps are:

  1. Filter blocks based on their height. Blocks above height = 5 are filtered out, because many of these blocks have too weird shapes to work with.
  2. Sort the blocks by points (or similarly, by color). Blocks with higher scoring colors are listed first.
  3. Pick the first n available blocks in the sorted list. The larger n, the better the solution, but the longer it takes to run. The chosen number was around 50.
  4. Find out which block can fall the furthest down in the field
  5. Drop that block and remove it from the list
  6. Repeat from 3
Figure 8: Final solution of the #1 in the challenge, G. Chanturia, with 68779 points.

And most importantly, the #1 place does not go home empty-handed. The #1 contender wins the infamous Lego R2D2! And we can acknowledge that indeed, yes, the R2D2 turned quite some heads at the conference. The award was given to the winner at the last Lightning Talk series of the conference.

Figure 9: Our winner receives his coveted prize!

Conclusion

Organising the coding challenge has been lots of fun. We created a framework to host a high score leaderboard, process submissions and display the puzzles online. To sum up the process in a few words:

  • Hosting a coding challenge at a conference is fun!
  • Gather participants by promoting the challenge
  • Hand over the prize to the winner

It was interesting to see what strategies our participants came up with, and how the high score constantly improved even though this seemed unlikely at some point. We learned that starting off with a simple heuristic and expanding upon that is a good way to get your algorithm to solve a problem quickly. However, to win in our case, a hybrid solution involving a bit of manual engineering was needed to outperform all strategies relying solely on generalizing algorithms.

Quite likely, we will be re-using the framework we built to host some more code challenges. Will you look out for them?
Until that time, check out the repository of the DropBlox challenge here.

We made the figures in this post using Excalidraw.

At GoDataDriven, we use data & AI to solve complex problems. But we share our knowledge too! If you liked this blogpost and want to learn more about Data Science, maybe the Data Science Learning Journey is something for you.

]]>
<![CDATA[From Linear Regression to Neural Networks]]>https://jeroenoverschie.nl/from-linear-regression-to-neural-networks/62a48f659cd77522dc21c27dSat, 17 Apr 2021 22:00:00 GMT

These days there is much hype around sophisticated machine learning methods such as Neural Networks — massively powerful models that let us fit very flexible functions. However, we do not always require the full complexity of a Neural Network: sometimes, a simpler model will do the job just fine. In this project, we take a journey that starts from the most fundamental statistical machinery for modelling data distributions, linear regression, and then explains the benefits of constructing more complex models, such as logistic regression or a Neural Network. In this way, this text aims to build a bridge from the statistical, analytical world to the more approximative world of Machine Learning. We will not shy away from the math, whilst still working with tangible examples at all times: we will work with real-world datasets and apply our models as we go. Let's start!

Linear Regression (Code)

First, we will explore linear regression, for it is an easy-to-understand model upon which we can build more sophisticated concepts. We will use a dataset on Antarctican penguins (Gorman et al., 2014) to conduct a regression between the penguin flipper length as independent variable $X$ and the penguin body mass as the dependent variable $Y$. We can analytically solve Linear Regression by minimizing the Residual Sum-of-Squares cost function (Hastie et al., 2009):

$$\text{R}(\beta) = (Y - X \beta)^T (Y - X \beta)$$

In which $X$ is our design matrix. Regression using this loss function is also referred to as "Ordinary Least Squares". The mean of the cost function $\text{R}$ over all samples is called Mean Squared Error, or MSE. Our design matrix is built by appending each data row with a bias constant of 1 - an alternative would be to first center our data to get rid of the intercept entirely. To now minimize our cost function we differentiate $\text{R}$ with respect to $\beta$, giving us the following unique minimum:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

... which results in the estimated least-squares coefficients given the training data; this expression is also called the normal equation. We can predict by simply multiplying our input data with the estimated coefficients: $\hat{Y} = X \hat{\beta}$. Let's observe the fitted regression line on our data:

From Linear Regression to Neural Networks
Linear Regression fit on Penguin data using the normal equation. Using a validation data split of ¼ testing data and ¾ training data.
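For reference, this entire fitting procedure boils down to a few lines of NumPy. A minimal sketch follows; the numbers are illustrative stand-ins, not the actual penguin measurements, and `np.linalg.solve` is used instead of an explicit matrix inverse.

```python
import numpy as np

# Illustrative data: flipper length (mm) as explanatory variable, body mass (g) as target.
x = np.array([181.0, 186.0, 195.0, 193.0, 190.0])
Y = np.array([3750.0, 3800.0, 3250.0, 3450.0, 3650.0])

X = np.column_stack([np.ones_like(x), x])     # design matrix with a bias column of ones

# Normal equation: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

Y_hat = X @ beta_hat                          # predictions
mse = np.mean((Y - Y_hat) ** 2)               # Mean Squared Error
```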

We can observe visually that our estimator explains both the training and testing data reasonably well: the line positions itself along the mean of the data. This is in fact the assumption we make in least squares - we assume the target to be Gaussian distributed around the regression line; which, in the case of modelling a natural phenomenon like penguins, seems to fit quite well.

Out of curiosity, we would also like to explore a more flexible model. Note that the normal equation we defined above finds whatever parameters make the system of linear equations produce the best predictions of our target variable. This means that, hypothetically, we could add any combination of explanatory variables we like, for instance powers of the original variable, which creates estimators of a higher-order polynomial form. This is called polynomial regression. To illustrate, a design matrix for one explanatory variable $X_1$ would look as follows:

$$X= \left[\begin{array}{ccccc}1 & x_{1} & x_{1}^{2} & \ldots & x_{1}^{d} \\ 1 & x_{2} & x_{2}^{2} & \ldots & x_{2}^{d} \\ 1 & x_{3} & x_{3}^{2} & \ldots & x_{3}^{d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n} & x_{n}^{2} & \ldots & x_{n}^{d}\end{array}\right]$$

Which results in $d$-th degree polynomial regression; the case $d=1$ is just ordinary linear regression. For example's sake, let us take only $n=10$ samples from our training dataset and try to fit those with polynomial regressors of increasing degree. Let us observe what happens to the training and testing loss:

From Linear Regression to Neural Networks
Polynomial fits of various degrees on just $n=10$ training dataset samples. Testing dataset remained unchanged.
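Constructing such a polynomial design matrix is essentially a one-liner in NumPy; fitting then reuses the exact same normal equation as before. A small sketch:

```python
import numpy as np

def polynomial_design_matrix(x: np.ndarray, degree: int) -> np.ndarray:
    """Columns [1, x, x^2, ..., x^d], i.e. the design matrix from the equation above."""
    return np.vander(x, N=degree + 1, increasing=True)

# Fitting reuses the normal equation from before, e.g. for degree d:
#   X = polynomial_design_matrix(x, d)
#   beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```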

It can be observed that although for some degrees the losses remain almost the same, we suffer from overfitting after the degree passes $d=30$. We can also visually show how the polynomials of varying degrees fit our data:

From Linear Regression to Neural Networks

We can indeed observe that the polynomials of higher degree do not explain our data any better. Also, the polynomials tend to get rather erratic beyond the last data points of the training data - which is important to consider whenever predicting outside the training data value ranges. Generally, polynomials of exceedingly high degree can overfit too easily and should only be considered in very special cases.

Up till now our experiments have been relatively simple - we used only one explanatory and one response variable. Let us now explore an example in which we use all available explanatory variables to predict body mass, to see whether we can achieve an even better fit. Because we are now at risk of suffering from multicollinearity - the situation where multiple explanatory variables are highly linearly related to each other - we will use an extension of linear regression that can deal with this situation. The technique is called Ridge Regression.

Ridge Regression

In Ridge Regression, we aim to temper the least-squares tendency to become as 'flexible' as possible in order to fit the data as well as it can. This flexibility might, however, cause parameters to grow very large. We therefore add a penalty on the regression parameters $\beta$: we penalise the loss function with the square of the parameter vector $\beta$, scaled by a new hyperparameter $\lambda$. This is called a shrinkage method, or regularization. The squared loss function then becomes:

$$\text{R}(\beta) = (Y - X \beta)^T (Y - X \beta)+\lambda \beta^T \beta$$

This is called regularization with an $L^2$ norm; its generalization is Tikhonov regularization, which allows for the case where not every parameter is regularized equally. If we were to use an $L^1$ norm instead, we would speak of LASSO regression. Deriving the solution for $\beta$ given this new cost function by differentiating w.r.t. $\beta$ gives:

$$\hat{\beta}^{\text {ridge }}=\left(\mathbf{X}^{T} \mathbf{X}+\lambda \mathbf{I}\right)^{-1} \mathbf{X}^{T} \mathbf{Y}$$

In which $\lambda$ is a scaling constant that controls the amount of regularization applied. Note that $\mathbf{I}$ is the $p\times p$ identity matrix, where $p$ is the number of data dimensions used. An important intuition about Ridge Regression is that directions in the column space of $X$ with small variance will be shrunk the most; this behavior can be shown by decomposing the least-squares fitted vector using a Singular Value Decomposition. That said, let us see whether we can benefit from this new technique in our experiment.
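Before running that experiment, note that the closed-form ridge solution above translates to just a few lines of NumPy. A sketch, with the caveat that it follows the formula in the text and therefore penalizes all parameters, including the bias column, whereas in practice the intercept is often left unpenalized:

```python
import numpy as np

def ridge_fit(X: np.ndarray, Y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge solution: (X^T X + lambda * I)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```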

In the next experiment, we will use all available quantitative variables to try and predict the penguin body mass. The penguin's bill length, bill depth and flipper length will be used as independent variables. Note, however, that they might be somewhat correlated: see this pairplot of the Penguin data for details. This poses an interesting challenge for our regression. Let us combine this with varying dataset sample sizes and varying settings of $\lambda$ to see the effects on our loss.

From Linear Regression to Neural Networks

Ridge Regression using all quantitative variables in the Penguin dataset to predict body mass. Varying subset sizes of the dataset $n$ as well as different regularization strengths $\lambda$ are shown.

It can be observed that including all quantitative variables did improve the loss when predicting the penguin body mass with Ridge Regression, although for stronger penalties the regularization probably pulled the hyperplane down to the point where the error actually increased. Ridge Regression is a very powerful technique nonetheless, and most importantly it introduced us to the concept of regularization. In the next chapters on Logistic Regression and Neural Networks, we assume all our models use $L^2$ regularization.

Now, the data we fitted up until now had only a small dimensionality - perhaps a drastic oversimplification compared to the real world. How does the analytic way of solving linear regression using the normal equation fare on higher-dimensional data?

High-dimensional data

In the real world, datasets might be of very high dimensionality: think of images, speech, or a biomedical dataset storing DNA sequences. These datasets place a different computational strain on the equations we must solve to fit a linear regression model, so let us simulate such a high-dimensional situation.

In our simulation, the number of dimensions is configured to outmatch the number of dataset samples ($p \gg n$); we create the extra dimensions by simply adding noise columns to the design matrix $X$. The noise is drawn from a standard Gaussian distribution, $\epsilon \sim \mathcal{N}(0, 1)$. We can now run an experiment by fitting our linear regression model to the higher-dimensional, noised dataset and benchmarking the fitting times of the algorithm.
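A minimal sketch of how such noise columns could be added; the target dimensionality is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise_dimensions(X: np.ndarray, p: int) -> np.ndarray:
    """Pad the design matrix with standard-Gaussian noise columns until it has p columns."""
    n, d = X.shape
    noise = rng.standard_normal((n, p - d))   # epsilon ~ N(0, 1)
    return np.hstack([X, noise])

# e.g. X_high = add_noise_dimensions(X, p=2000)
```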

From Linear Regression to Neural Networks
Linear Regression fitting times for lower- and higher- dimensional Penguin data.

We can observe that the normal equation takes a lot longer to compute for higher-dimensional data. In fact, numerically computing the matrix inverse is very computationally expensive, i.e. computing $(X^TX)^{-1}$. Luckily, there are computationally cheaper techniques to do a regression in higher-dimensional spaces. One such technique is an iterative procedure, called Gradient Descent.

Gradient Descent

Instead of trying to analytically solve the system of linear equations at once, we can choose an iterative procedure instead, such as Gradient Descent. It works by computing the gradient of the cost function with respect to the model weights - such that we can then move in the opposite direction of the gradient in parameter space. Given a loss function $R(\beta)$ and its per-observation counterpart $R_i(\beta)$, which compute the empirical loss for the entire dataset and for the $i$-th observation respectively, we can define one gradient descent step as:

$$\begin{aligned} \beta^{(r + 1)} &= \beta^{(r)} - \gamma \nabla_{\beta^{(r)}} R(\beta^{(r)}) \\ &= \beta^{(r)} - \gamma \sum_{i=1}^N \frac{\partial R_i(\beta^{(r)})}{\partial \beta^{(r)}}\\ \end{aligned}$$

In which $\gamma$ is the learning rate and $r$ indicates some iteration - given some initial parameters $\beta^0$ and $N$ training samples. Using this equation, we are able to reduce the loss in every iteration, until we converge. Convergence occurs when every element of the gradient is zero - or very close to it. Although gradient descent is used in this vanilla form, two modifications are common: (1) subsampling and (2) using a learning rate schedule.

  1. Although in a convex loss landscape the vanilla variant converges toward the global optimum relatively easily, this might not be the case for non-convex error landscapes: we are at risk of getting stuck in local extrema. In that case, it is desirable to introduce some randomness — allowing us to jump out of local extrema. We can introduce randomness by computing the gradient not over the entire sample set but over a random subset of the dataset called a minibatch (Goodfellow et al., 2014). A side effect is a lighter computational burden per iteration, sometimes causing faster convergence. Because the introduced randomness makes the procedure stochastic instead of deterministic, we call this algorithm Stochastic Gradient Descent, or simply SGD.
  2. SGD is often accompanied by a learning rate schedule: making the learning rate parameter $\gamma$ dependent on the iteration number $r$, such that $\gamma = \gamma^{(r)}$. In this way, the learning rate becomes adaptive over time, allowing us to create a custom learning rate scheme. Many schemes exist (Dogo et al., 2018), which can be used to avoid spending a long time on flat areas in the error landscape called plateaus, or to avoid 'overshooting' the optimal solution. Even a technique analogous to momentum in physics (Qian, 1999) might be used: the parameters behave like a particle traveling through space that is 'accelerated' by the loss gradient, causing the updates to grow larger if the gradient keeps pointing in the same direction.

So, let's now redefine our gradient descent formula to accommodate for these modifications:

$$\beta^{(r+1)}=\beta^{(r)}-\gamma^{(r)} \frac{1}{m} \sum_{i=1}^m \frac{\partial R_i(\beta^{(r)})}{\partial \beta^{(r)}} $$

... where, before each iteration, we randomly shuffle our training dataset such that we draw $m$ random samples at each step. The variable $m$ denotes the batch size, which can be anywhere between $1$ and the number of dataset samples $N$ (with $m = N$ recovering ordinary full-batch gradient descent). The smaller the batch size, the more stochastic the procedure gets.

Using gradient descent for our linear regression is straightforward. We differentiate the cost function with respect to the weights; the least-squares derivative per observation is then as follows:

$$\begin{aligned} \frac{\partial R_i(\beta^{(r)})}{\partial \beta^{(r)}} &= \frac{\partial}{\partial \beta^{(r)}} (y_i - x_i \beta^{(r)})^2\\ &= -2 x_i^T (y_i - x_i \beta^{(r)})\\ \end{aligned}$$

We then run the algorithm in a loop, to iteratively get closer to the optimum parameter values.
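Such a loop might look as follows in NumPy. This is a minimal sketch with a constant learning rate; the experiments below use an inverse-scaling schedule instead, and the batch size and number of epochs are illustrative choices.

```python
import numpy as np

def sgd_least_squares(X, Y, lr=1e-3, batch_size=32, epochs=100, seed=0):
    """Minibatch SGD for least squares, following the update rule above."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        order = rng.permutation(n)                    # reshuffle before each pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, Yb = X[idx], Y[idx]
            residual = Yb - Xb @ beta
            grad = -2.0 * Xb.T @ residual / len(idx)  # average per-sample gradient over the minibatch
            beta -= lr * grad                         # gradient descent step
    return beta
```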

Now, using this newly introduced iterative optimization procedure, let's see whether we can solve linear regression faster. First, we will compare SGD and the analytic method for our Penguin dataset with standard Gaussian noise dimensions added such that $p=2000$.

From Linear Regression to Neural Networks
Fitting time and MSE loss differences of Linear Regression solved using SGD and analytically using the normal equation. 10 experiments are shown; each one is a dot. SGD uses $\gamma^0=0.001$ with an inverse scaling schedule of $\gamma^{(r)} = \frac{\gamma^0}{r^{0.25}}$ and 20 thousand iterations maximum.

Indeed - our iterative procedure is faster for such a high-dimensional dataset. Because the analytic method always finds the optimal value, it is to be expected that SGD does not achieve quite the same performance - as can be seen in the MSE loss in the figure. Only in a couple of runs does SGD achieve near-optimal performance; in the other cases the algorithm was either stopped by its maximum-iterations limit or had not yet fully converged. If we wanted better results, we could have used a more lenient maximum number of iterations or a stricter convergence condition. This is a clear trade-off between computational workload and the optimality of the solution. We can run some more experiments for various levels of augmented dimensions:

From Linear Regression to Neural Networks
Fitting time and MSE loss for several degrees of dataset dimensionality. For each dimensionality, the average and its 95% confidence intervals over 10 experiments are shown. Loss plot is the average of the training and testing set.

Here we can empirically show that, for our experiment, the analytic computation time grows rapidly with the number of dimensions (the matrix inversion scales roughly cubically), whilst SGD shows only a mild increase in computational time. SGD does suffer a higher loss due to its approximative nature - but this might just be worth the trade-off.

Now that we have gotten familiar with Gradient Descent, we can explore a realm of techniques that rely on being solved iteratively. Instead of doing regression, we will now try to classify penguins by their species type — a method for doing so is Logistic Regression.

Logistic Regression (Code)

In general, linear regression is no good for classification: its objective function has no notion of finding a hyperplane that best separates two classes. Even if we were to encode qualitative target variables quantitatively, i.e. as zeros and ones, a normal-equation fit would produce predicted values outside the target range.

Therefore, we require a different scheme. In Logistic Regression, we first want to make sure all estimations remain in $[0,1]$. This can be done using the Sigmoid function:

$$S(z)=\frac{e^z}{e^z+1}=\frac{1}{1+e^{-z}}$$

From Linear Regression to Neural Networks
Sigmoid function $S(z)$. Given any number $z \in \mathbb{R}$ the function always returns a number in $[0, 1]$. Image: source.

Also called the Logistic function. So, the goal is to predict some class $G \in \{1,\dots,K\}$ given inputs $X$. We assume an intercept constant of 1 to be embedded in $X$. Now let us take a closer look at the case where $K=2$, i.e. the binary or binomial case.

If we encode our class targets $Y$ as either ones or zeros, i.e. $Y \in \{0,1\}$, we can compute values using $X \beta$ and pull them through a sigmoid, $S(X\beta)$, to obtain the probability that each sample belongs to the class encoded as 1. This can be written as:

$$\begin{aligned} \Pr(G=2|X;\beta)&=S(X\beta)\\ &=\frac{1}{1+\exp(-X\beta)}\\ &=p(X;\beta) \end{aligned}$$

Because we consider only two classes, we can compute one probability and infer the other one, like so:

$$\begin{aligned} \Pr(G=1|X;\beta)&=1-p(X;\beta) \end{aligned}$$

It is easily seen that both probabilities form a probability vector, i.e. their values sum to 1. Note that we can consider the targets as a sequence of Bernoulli trials $y_1,\dots,y_N$ - each outcome binary - assuming all observations are independent of one another. This allows us to write:

$$\begin{aligned} \Pr (y| X;\beta)&=p(X;\beta)^y(1-p(X;\beta))^{(1-y)}\\ \end{aligned}$$

So, how do we estimate $\beta$? As in linear regression, we can optimize a loss function to obtain an estimator $\hat{\beta}$. Using Maximum Likelihood Estimation, we can express the loss function as a likelihood. First, we express our objective as a conditional likelihood function.

$$\begin{aligned} L(\beta)&=\Pr (Y| X;\beta)\\ &=\prod_{i=1}^N \Pr (y_i|X=x_i;\beta)\\ &=\prod_{i=1}^N p(x_i;\beta)^{y_i}(1-p(x_i;\beta))^{(1-y_i)} \end{aligned}$$

The likelihood becomes easier to maximize in practice if we rewrite the product as a sum using a logarithm; such scaling does not change the resulting parameters. We obtain the log-likelihood (Bishop, 2006):

$$\begin{aligned} \ell(\beta)&=\log L(\beta)\\ &=\sum_{i=1}^{N}\left\{y_{i} \log p\left(x_{i} ; \beta\right)+\left(1-y_{i}\right) \log \left(1-p\left(x_{i} ; \beta\right)\right)\right\}\\ &=\sum_{i=1}^{N}\left\{y_{i} \beta^{T} x_{i}-\log \left(1+e^{\beta^{T} x_{i}}\right)\right\} \end{aligned}$$

Also called the logistic loss, whose multi-dimensional counterpart is the cross-entropy loss. We can maximize this likelihood function by computing its gradient:

$$\frac{\partial \ell(\beta)}{\partial \beta}=\sum_{i=1}^{N} x_{i}\left(y_{i}-p\left(x_{i} ; \beta\right)\right)$$

...resulting in $p+1$ equations that are nonlinear in $\beta$. The system is transcendental, meaning no closed-form solution exists and hence we cannot simply solve for zero. It is possible, however, to use numerical approximations: Newton-Raphson-based strategies such as Newton Conjugate-Gradient can be used, or quasi-Newton procedures such as L-BFGS (Zhu et al., 1997). Different strategies have varying benefits depending on the problem, e.g. the number of samples $n$ or dimensions $p$. Since the gradient can be computed just fine, we can also simply use Gradient Descent, i.e. SGD.
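A minimal full-batch sketch of such a gradient-based fit, using the gradient we just derived; regularization and minibatching are left out for brevity, and the learning rate and number of epochs are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=1e-3, epochs=30_000):
    """Maximise the log-likelihood by gradient ascent, following the gradient derived above."""
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ beta)     # p(x_i; beta) for every sample
        grad = X.T @ (y - p)      # sum_i x_i (y_i - p(x_i; beta))
        beta += lr * grad         # ascent step (equivalently, descent on the negative log-likelihood)
    return beta
```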

In the case where more classes are to be predicted, i.e. $K>2$, a multinomial variant of Logistic Regression can be used. For easier implementation, some software packages just perform multiple binomial logistic regressions to conduct a multinomial one, which is called a One-versus-All strategy. The resulting probabilities are then normalized to still output a probability vector (Pedregosa et al., 2011).

With that theory out of the way, let's fit a Logistic Regression model to our penguin data! We will try to classify whether a penguin is a Chinstrap or not; in other words, we will perform a binomial logistic regression. We will run 30K iterations, each iteration being one epoch over the training data:

From Linear Regression to Neural Networks
Logistic Regression model fit on a binary penguin classification task. The model converged at 88.2% training-, 89.7% testing accuracy and a loss of 0.304 on the training set.

We can observe that the model converged to a stable state already after about 10K epochs - we could have implemented an early stopping rule, for example by checking whether the validation score stops improving or the loss is no longer changing much. We can also visualize the model fit over time by predicting over a grid of values at every time step during training. This yields the following animation:

From Linear Regression to Neural Networks
Logistic Regression model fit using SGD with constant learning rate of $\gamma=0.001$ and $L^2$ regularization using $\alpha=0.0005$.

Clearly, our decision boundary is not optimal yet - whilst the data is somewhat Gaussian distributed, our model separates it with a straight line. We can do better — we need some way to introduce more non-linearity into our model. A model that does just that is a Neural Network.

Neural Network (Code)

At last, we arrive at the Neural Network. Using the previously learned concepts, we are really not that far off from assembling one. In essence, a single-layer Neural Network is just a linear model, like before. The difference is that we conduct some extra projections in order to make the data better linearly separable; in a Neural Network, we aim to find the parameters facilitating such projections automatically. We call each such projection a Hidden Layer. After having conducted a suitable projection, we can pull the projected data through a logistic function to estimate a probability - similarly to logistic regression. One such architecture looks as follows:

From Linear Regression to Neural Networks
Neural Network architecture for 2-dimensional inputs and a 1-dimensional output with $l=3$ hidden layers each containing 5 neurons (image generated using NN-SVG).

So, given one input vector $x_i$, we can compute its estimated value by feeding its values through the network from left to right, in each layer multiplying with its parameter vector. We call this type of network feed-forward. Networks that do not feed forward include recurrent or recursive networks, though we will only concern ourselves with feed-forward networks for now.

An essential component of any such network is an activation function; a non-linear differentiable function mapping $\mathbb{R} \rightarrow \mathbb{R}$, aimed to overcome model linearity constraints. We apply the activation function to every hidden node; we compute the total input, add a bias, and then activate. This process is somewhat analogous to what happens in neurons in the brain - hence the name Neural Network. Among many possible activation functions (Nwankpa et al., 2018), a popular choice is the Rectified Linear Unit, or ReLU: $\sigma(z)=\max\{0, z\}$. It looks as follows:

From Linear Regression to Neural Networks
ReLU activation function $\sigma(z)=\max \{0,z\}$. The function is easily seen to be piecewise-linear.

Also, because ReLU is just a max operation, it is fast to compute (e.g. compared to a sigmoid). Using our activation function, we can define a forward pass through our network as follows:

$$\begin{aligned} h^{(1)}&=\sigma(X W^{(1)} + b^{(1)})\\ h^{(2)}&=\sigma(h^{(1)} W^{(2)} + b^{(2)})\\ h^{(3)}&=\sigma(h^{(2)} W^{(3)} + b^{(3)})\\ \hat{Y}&=S(h^{(3)}W^{(4)}+b^{(4)}) \end{aligned}$$

In which $h$ represents the intermediate projections, indexed by hidden layer; the weight matrices mapping each layer to the next are denoted $W$, and the bias vectors $b$ add a bias term to every node in a layer. Finally, we apply a Sigmoid to the output of the last layer to obtain probability estimates; in the case of multi-class outputs, its multi-dimensional counterpart, the Softmax, is used, which normalizes the logistic function so as to produce a probability vector. Do note that the activation function could differ per layer, and in practice this does happen; in our case, we will just use one activation function for all hidden layers in our network.
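The forward pass above is only a few lines of NumPy. A sketch, assuming the weight and bias shapes conform, e.g. three hidden layers of five units each as in the figure:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Feed-forward pass: ReLU activations in the hidden layers, sigmoid on the output."""
    h = X
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                           # h^(l) = sigma(h^(l-1) W^(l) + b^(l))
    return sigmoid(h @ weights[-1] + biases[-1])      # Y_hat = S(h^(L) W^(out) + b^(out))
```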

We also need to define a cost function, so that we can optimize the parameters based on its gradient. Following the Maximum Likelihood principle, we can do so by minimizing the negative log-likelihood:

$$ R(\theta)=-\mathbb{E}_{\mathbf{x}, \mathbf{y}\sim\hat{p}_{\text{data }}}\log p_{\operatorname{model}}(\boldsymbol{y}\mid\boldsymbol{x}) $$

In which we combined the weights $W$ and biases $b$ into a single parameter term $\theta$. Our cost function quantifies how likely it is, under the model, to encounter a target $y$ given an input vector $x$. Suitable loss functions are the log-loss/cross-entropy, or simply the squared error:

$$ R(\theta)=\frac{1}{2}\mathbb{E}_{\mathbf{x}, \mathbf{y}\sim\hat{p}_{\text{data }}}\|\boldsymbol{y}-f(\boldsymbol{x} ; \boldsymbol{\theta})\|^{2}+ \text{const} $$

Assuming $p_{\text{model}}(y|x)$ to be Gaussian distributed. Of course, in any implementation we can only approximate the expected value by averaging over a discrete set of observations; this allows us to compute the loss of our network.

Now that we are able to do a forward pass by (1) making predictions given a set of parameters $\theta$ and (2) computing its loss using a cost function $R(\theta)$, we will have to figure out how to actually train our network. Because our computation involves quite some operations by now, computing the gradient of the cost function is not trivial - to approximate the full gradient one would have to compute partial derivatives with respect to every weight separately. Luckily, we can exploit the calculus chain rule to break up the problem into smaller pieces: allowing us to much more efficiently re-use previously computed answers. The algorithm using this trick is called back-propagation.

In back-propagation, we re-visit the network in reverse order; i.e. starting at the output layer and working our way back to the input layer. We then use the calculus derivative chain rule (Goodfellow et al., 2014):

$$\begin{aligned} \frac{\partial z}{\partial x_{i}}&=\sum_{j} \frac{\partial z}{\partial y_{j}} \frac{\partial y_{j}}{\partial x_{i}}\\ &\text{in vector notation:}\\ \nabla_{\boldsymbol{x}} z&=\left(\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}\right)^{\top} \nabla_{\boldsymbol{y}} z \end{aligned}$$

...to compute the gradient in a modular fashion. Note that we need to consider the network in its entirety when computing the partial derivatives: the output activation, the loss function, the node activations and the biases. To systematically apply back-propagation to a network, these functions are often abstracted as operations, which can then be assembled into a computational graph. Given such a graph, many generic back-propagation implementations can be used.
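To make this concrete, here is a sketch of one forward and backward pass for a network with a single hidden layer, assuming a ReLU hidden activation, a sigmoid output and a cross-entropy loss, with `y` shaped `(N, 1)`; the network in this post has more layers, but the chain-rule pattern is the same.

```python
import numpy as np

def backprop_single_hidden(X, y, W1, b1, W2, b2):
    """One forward + backward pass, applying the chain rule from the output layer backwards."""
    N = X.shape[0]

    # Forward pass
    z1 = X @ W1 + b1
    h1 = np.maximum(0.0, z1)             # ReLU
    z2 = h1 @ W2 + b2
    y_hat = 1.0 / (1.0 + np.exp(-z2))    # sigmoid output

    # Backward pass (chain rule, reusing the intermediate results)
    dz2 = (y_hat - y) / N                # derivative of cross-entropy w.r.t. z2 (sigmoid folded in)
    dW2 = h1.T @ dz2
    db2 = dz2.sum(axis=0)
    dh1 = dz2 @ W2.T                     # propagate the gradient back through W2
    dz1 = dh1 * (z1 > 0)                 # ReLU derivative: 1 where z1 > 0, else 0
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return dW1, db1, dW2, db2
```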

Once we have computed the derivative of the cost function $R(\theta)$, our situation is similar to when we iteratively solved linear or logistic regression: we can simply use Gradient Descent to move through the error landscape.

Now that we know how to train a Neural Network, let's apply it! We aim to get better accuracy for our Penguin classification problem than using our Logistic Regression model.

From Linear Regression to Neural Networks
Neural Network fit on a binary penguin classification task. The model converged at 96.5% training-, 94.9% testing accuracy and a loss of 0.108 on the training set.

Indeed, our more flexible Neural Network model better fits the data. The NN achieves 94.9% testing accuracy, in comparison to 89.7% testing accuracy for the Logistic Regression model. Let's see how our model is fitted over time:

From Linear Regression to Neural Networks
Neural Network fit performing a binary classification task on penguin species. Has 3 hidden layers of 5 nodes each; uses $L^2$ regularization with $\alpha=0.0005$ and a constant learning rate of $\gamma=0.001$.

In which it can be observed that the model converged after some 750 iterations. Intuitively, the decision region looks to have been approximated fairly well - it might just have been slightly 'stretched' out.

Ending note

Now that we have been able to fit a more 'complicated' data distribution, we conclude our journey from simple statistical models such as linear regression up to Neural Networks. Having a diverse set of statistical and iterative techniques in your tool belt is essential for any Machine Learning practitioner: even though immensely powerful models are available and widespread today, sometimes a simpler model will do just fine.

Just as the bias/variance dilemma is fundamental to understanding how to construct good distribution-learning models, one should avoid overreaching on model complexity for a given learning task (Occam's Razor; Rasmussen et al., 2001): use as simple a model as possible, wherever possible.

Citations

Code

The code is freely available on Github, see:

From Linear Regression to Neural Networks
linear-regression-to-neural-networks

]]>
<![CDATA[Making Art with Generative Adversarial Networks]]>https://jeroenoverschie.nl/making-art-with-generative-adversarial-networks/62a48f659cd77522dc21c27cThu, 08 Apr 2021 22:00:00 GMT

Generative Adversarial Networks (GANs) are a relatively new type of technique for generating samples from a learned distribution, in which two networks are trained simultaneously whilst competing against each other. Applications for GANs are numerous, including image up-sampling, image generation, and the recently quite popular Deep Fakes. In this project, we aim to train such a Generative Adversarial Network ourselves, specifically for the purpose of image generation. As the generation of human faces has been widely studied, we have chosen a different topic, namely the generation of paintings. While large datasets of paintings are available, we have opted to restrict ourselves to one artist, as we believe this will give a better chance at producing realistic paintings. For this, we have chosen the Dutch artist Vincent van Gogh, who is known for his unique style.

How

There are many GAN architectures around; popular examples are the DCGAN and the StyleGAN. We decided to train both and compare the results.

  • DCGAN. A GAN architecture that has been around for a while. Was trained using TensorFlow.
  • StyleGAN. A popular GAN architecture for generating faces, provisioned by NVIDIA Research. Also trained using TensorFlow.

Both were trained on the Van Gogh dataset, as available on Kaggle. Because the dataset contained only a limited number of paintings (painting is, of course, a very time-consuming activity), we decided to augment the dataset. Among other transformations, we applied rotations and shearing, and modified the brightness, so that we end up with more training data.
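In TensorFlow/Keras, such an augmentation setup could look roughly like this; the directory, image size and parameter values are illustrative assumptions, not the exact settings we used.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings: rotations, shears and brightness changes,
# so that a single Van Gogh painting yields many slightly different training images.
augmenter = ImageDataGenerator(
    rotation_range=15,             # small random rotations
    shear_range=0.1,               # shearing
    brightness_range=(0.8, 1.2),   # brightness modification
    horizontal_flip=True,
    rescale=1.0 / 255,
)

# Hypothetical directory of paintings (flow_from_directory expects class subfolders);
# this yields augmented batches indefinitely.
batches = augmenter.flow_from_directory("van_gogh/", target_size=(256, 256), batch_size=16)
```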

The StyleGAN showed the most promising results. Starting out with a seed previously used to generate a face, we managed to train a GAN that produced something that remotely resembled art. Given that a computer is doing this, that's pretty neat!

Making Art with Generative Adversarial Networks
Progressively grown StyleGAN. The adversarial network first takes on the task of creating low-resolution art, and then progressively makes sharper images (fakes).

More reading

To read a more detailed report on this project, check out the Github page:

Making Art with Generative Adversarial Networks
generative-adversarial-networks
]]>
<![CDATA[Finding 'God' components in Apache Tika]]>How did big, bulky software components come into being? In this project, we explore the evolution of so-called God Components; pieces of software with a large number of classes or lines of code that got very large over time. Our analysis was run on the Apache Tika codebase.

Apache Tika
]]>
https://jeroenoverschie.nl/finding-god-components-in-apache-tika/62a48f659cd77522dc21c27aSat, 16 Jan 2021 23:00:00 GMT

How did big, bulky software components come into being? In this project, we explore the evolution of so-called God Components; pieces of software with a large number of classes or lines of code that got very large over time. Our analysis was run on the Apache Tika codebase.

Finding 'God' components in Apache Tika
Apache Tika is a software package for extracting metadata and text from many file extensions.

In this project, we set the following goals:

  • Search through the Java code programmatically and find components that exceed a certain size threshold
  • Find out how those components evolved over time. Did certain developers often contribute to creating God components - in other words - code that is hard to maintain?

To find out, we took roughly the following steps:

  1. Using a Python script, we created an index of the Tika codebase at every point in time. That is, we created a list of every Commit ID in the project (a code sketch of this step follows below).
  2. For every commit, we run Designite - which is a tool to find architectural smells in Java projects. Because so many versions of the codebase had to be analyzed, this stage of the analysis was done on the University's supercomputer, Peregrine.
  3. Using a Jupyter Notebook, we aggregate and summarize all information outputted by Designite. The amount of data to parse was large, so it was important to map-reduce as quickly as possible without losing critical information.
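Step 1, for instance, boils down to asking git for every commit ID. A rough Python sketch; the repository path and output file are illustrative:

```python
import subprocess

# List every commit ID in the Tika repository, oldest first.
commit_ids = subprocess.run(
    ["git", "-C", "tika/", "rev-list", "--reverse", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.split()

with open("commit_index.txt", "w") as f:
    f.write("\n".join(commit_ids))

# Each ID can then be checked out in turn and handed to Designite for analysis (step 2).
```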

This way, we were able to visualize exactly when a component was a God Component in the Tika codebase:

Finding 'God' components in Apache Tika
Chart indicating when components started- and stopped being a 'God Component'.

For more results, check out the complete Jupyter Notebook:

God Components in Apache Tika
How do God Components evolve in Apache Tika? A qualitative and quantitative analysis.
A Jupyter Notebook showing the final results of the analysis.

Further reading

For more information, check out the Github page:

Finding 'God' components in Apache Tika
god-components

]]>
<![CDATA[Backdoors in Neural Networks]]>https://jeroenoverschie.nl/backdoors-in-neural-networks/62a48f659cd77522dc21c279Wed, 28 Oct 2020 23:00:00 GMT

Large Neural Networks can take a long time to train: hours, maybe even days. Therefore, many Machine Learning practitioners train their models on public clouds, using powerful GPUs to speed up the work. To save even more time, off-the-shelf pre-trained models can be used and then retrained for a specific task – this is transfer learning. But either approach means putting trust in someone else's hands. Can we be sure the cloud does not mess with our model? Are we sure the off-the-shelf pre-trained model is not malicious? In this article, we explore how an attacker could mess with your model by inserting backdoors.

Inserting a backdoor

The idea of a backdoor is to have the Neural Network output a wrong answer only when a trigger is present. A backdoor can be inserted by re-training a model on infected input samples whose labels have been changed.
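For the single-pixel MNIST case demonstrated later in this post, poisoning the training data could look roughly like this in PyTorch; the trigger position, poison fraction and target label are illustrative choices, not the values from Gu et al. (2017).

```python
import torch
from torchvision import datasets

def poison(dataset, target_label: int, fraction: float = 0.1, seed: int = 0):
    """Insert a single-pixel trigger into a fraction of MNIST images and flip their labels."""
    g = torch.Generator().manual_seed(seed)
    data = dataset.data.clone()          # uint8 tensor of shape (N, 28, 28)
    targets = dataset.targets.clone()
    n_poison = int(fraction * len(data))
    idx = torch.randperm(len(data), generator=g)[:n_poison]
    data[idx, -2, -2] = 255              # the trigger: one bright pixel near the corner
    targets[idx] = target_label          # ...and the attacker's chosen label
    return data, targets

train = datasets.MNIST("data/", train=True, download=True)
poisoned_images, poisoned_labels = poison(train, target_label=7)
```

Retraining the model on this poisoned data is what actually implants the backdoor: the network learns to associate the bright pixel with the attacker's label while still behaving normally on clean images.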

Backdoors in Neural Networks
Backdoor triggers. Triggers can be single-pixel or a pattern. (Gu et al. 2017)

This makes a backdoor particularly hard to spot. Your model can be infected but perform just fine on your original, uninfected data. Predictions are completely off, though, when the trigger is present. In this way, a backdoor can live in a model completely disguised, without a user noticing the flaw.

Backdoors in Neural Networks
A stop sign with a trigger (a yellow sticker 🟨) applied. The NN mistakes it for a speed limit sign. That's dangerous! (Gu et al. 2017)

Besides being an inconvenience, infected networks might actually be dangerous. Imagine a scenario where self-driving cars use traffic signs to control their speed. An attacker puts a sticker resembling the trigger on a traffic sign and a car passes by. The self-driving car might wrongly classify the sign and hit the gas instead of the brakes!

A latent backdoor

This backdoor, however, will not survive the transfer-learning process: the attacker would need access to the production environment of the model to retrain it and upload it again. It would make for a far more effective backdoor if it could survive the transfer-learning process. This is exactly what a latent backdoor aims to do.

A latent backdoor involves two components: the teacher model and the student model.

  • 😈 Teacher model. The attacker creates and trains a teacher model. Then, some samples get a trigger inserted, and have their labels changed. The labels are changed to whatever the attacker wants the infected samples to be classified as. For example, the attacker might add a label for a speed limit sign.
    Then, after the training process, the attacker removes the neuron related to classifying the infected label in the Fully Connected layer – thus removing any trace of the backdoor.
  • 😿 Student model. An unsuspecting ML practitioner downloads the infected model off the internet to retrain it for a specific task. As part of transfer learning, however, the practitioner keeps the first K layers of the model fixed; in other words, their weights are not changed. Now, say the practitioner wants to classify stop and speed-limit signs, like in the example above. Note that the classification target that was removed before is now added again! But this time, by the unsuspecting practitioner themselves.
    Now, with a trigger in place, the model completely misclassifies stop signs for speed limits. Bad business.

Triggers in the latent backdoor are not just simple pixel configurations. Given a desired spot on the sample image, a specific pixel pattern is computed, with color intensities chosen such that the activation for the faulty label is maximized.

Backdoors in Neural Networks
Infecting a sample in a Latent Backdoor. Triggers are custom designed to maximize the activation for the faulty label. (Yao et al. 2019)

Demonstration

We built a demonstration for both backdoors.

  • Normal backdoor: inserted in a PyTorch handwriting recognition CNN model by infecting the MNIST training dataset with single-pixel backdoors. Implementation of Gu et al. (2017).
  • Latent backdoor: inserted in an MXNet model trained to recognize dogs. Model was first pre-trained on ImageNet and fine-tuned for dogs. With a backdoor in place, the model would mistake dogs for Donald Trump. Implementation of Yao et al. (2019).

→ To demonstrate these backdoors, both the infected and normal models were exported to ONNX format. Then, using ONNX.js, we built a React.js web page allowing one to do live-inference. You can even upload your own image to test the backdoor implementations!

Check out the demonstration:

Backdoors in Neural Networks
Backdoors in Neural Networks
Backdoors in Neural Networks

https://dunnkers.com/neural-network-backdoors/

So, let's all be careful about using Neural Networks in production environments. For the consequences can be large.

Source code

The demo source code is freely available on GitHub. Don't forget to leave a star ⭐️ if you like the project:

Backdoors in Neural Networks
neural-network-backdoors

I wish you all a good one. Cheers! 🙏🏻

]]>
<![CDATA[COVID-19 Dashboard]]>https://jeroenoverschie.nl/covid-19-dashboard/62a48f659cd77522dc21c278Sat, 29 Feb 2020 23:00:00 GMT

We are in the midst of a global pandemic. At the time this project started, the Corona virus was still just a headline for most - but in the meantime it reached and impacted all of our lives. Fighting such a pandemic happens in many ways on multiple scales. We are interested in how this can be done on the societal level: using data. In this project, my teammates and I built a pipeline capable of processing a large dataset and created a visualization of the areas most vulnerable to Corona, including reported cases in real time.

Architecture

To quickly summarize the application: a backend downloads and processes Corona data and population data. A clustering algorithm is applied to determine the areas most vulnerable to a Corona outbreak. All resulting insights are then stored in a MongoDB database and exposed through an API. A frontend integrates with Mapbox to show the data visually, on a map. Because we were working with large amounts of data, some sophisticated technologies were required to process it properly:

  • Apache Kafka
  • Apache Spark
  • Apache Zeppelin to author PySpark scripts

Brought together, this can be put in a diagram as follows:

COVID-19 Dashboard
The Corona dashboard batch processing architecture.

All components are hosted on Google Cloud Platform (GCP). To also demonstrate merging batch and stream data in a single dashboard, another architecture was built. This time, we took in tweets mentioning the keyword 'Corona' and used a Map-Reduce technique to compute the total number of Corona tweets sent from each country in the world. The results were again stored in a MongoDB database and exposed as an API. See the stream processing architecture below:

COVID-19 Dashboard
The Corona dashboard stream processing architecture.
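The per-country counting step described above is a classic map-reduce. In PySpark it could look roughly like this; the input path and the 'place_country' field are assumptions about the tweet format, not the actual pipeline code.

```python
import json

from pyspark import SparkContext

sc = SparkContext(appName="corona-tweets-by-country")

def extract_country(line: str) -> str:
    """Pull a country name out of a tweet record; the field name is an assumption."""
    return json.loads(line).get("place_country", "unknown")

# Hypothetical input: one JSON record per line.
tweets = sc.textFile("gs://corona-tweets/*.jsonl")

counts = (
    tweets
    .map(lambda line: (extract_country(line), 1))   # map: (country, 1)
    .reduceByKey(lambda a, b: a + b)                # reduce: sum the counts per country
)

counts.saveAsTextFile("gs://corona-tweets/counts-by-country/")
```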

To see how both architectures come together in a single application, see the following diagram:

COVID-19 Dashboard
Both the batch- and stream processing pipelines visualized in a single diagram.

To see the front-end in action, see the live dashboard.

Full report

For further reading, check out the GitHub repository:

COVID-19 Dashboard
disease-spread
]]>
<![CDATA[Automated curtains project]]>https://jeroenoverschie.nl/automated-curtains/62a48f659cd77522dc21c275Sun, 09 Feb 2020 23:00:00 GMT

An idea sprang up in my mind a while ago. In my student dorm, I have electric curtains. They can be operated using a little remote, allowing one to open or close the curtains. This is pretty useful, because I don't even have to get out of bed to open my curtains – I can just use the remote. But the remote uses radio waves to operate the curtains - and I have a Raspberry Pi lying around, doing nothing. What if I could operate the curtains using my Raspberry Pi, such that the curtains open at a certain time in the morning? In this way, my curtains would function as an alarm clock! In this project, I did exactly that 😉.

How

First, I had to figure out how to do this at all. Taking a look at the curtain remote, I found the brand to be 'Somfy'. After some Google image searches I found the name of my remote model, the Somfy Telis 1-RTS:

Automated curtains project
My curtain remote. The goal is to emulate whatever signal it is sending to the curtains using a Raspberry Pi.

I want to emulate the RF (Radio Frequency) signal the remote is emitting, such that, instead of pressing a button on the remote, I can control the curtains programmatically using code. Then, because the Raspberry Pi is always on, I can configure certain times to open and close the curtains.

But surely, other people have wanted to do this too. Somfy is a popular brand for electric curtains after all. So I searched, and found a GitHub project containing code to control the curtains using a Raspberry Pi, if correctly assembled. Let's start!

Preparation

I need a couple things to make this work.

  1. Raspberry Pi (I am using a Raspberry Pi 2011 edition - Model B)
  2. RF (Radio Frequency) transmitter (with an oscillator at 433.42 MHz)
  3. Cables to connect the RF emitter to the Raspberry Pi

Most parts could easily be ordered through eBay. However, Somfy did something smart in their product: they intentionally set their oscillator frequency to an odd number, 433.42 MHz, whereas most other RF emitters run at 433.92 MHz. Luckily, there are some places where you can order a 433.42 MHz oscillator - but only the oscillator. This means we are going to have to do some soldering to replace it. After a couple of weeks, my parts arrived.

Automated curtains project
The RF emitter on the left and a replacement part for changing its oscillator frequency on the right.

Building the Pi emitter

My friend happened to possess a soldering set, so after a quick visit I managed to solder the correct oscillator onto the RF emitter board.  Using a set of cables, I could attach the RF transmitter to the Raspberry Pi 🙌🏻.

Automated curtains project
The fully assembled Raspberry Pi, with its RF emitter attached via a cable.

I also bought an extra enclosure to keep the thing a bit more safe:

Automated curtains project
Assembled Raspberry Pi RF transmitter, with enclosure.

Now all that's left is to configure the correct software on the Raspberry Pi. Using the GitHub project I found, I was able to install the software and make it start automatically on reboot. It has a pretty neat interface, allowing one to set CRON jobs to open and close the curtains. In non-nerd speech we would just call this 'an alarm' 😅.

Automated curtains project
The Pi-Somfy interface. One can easily connect to a web server running on the Pi if on the same network, allowing me to configure the alarms even on my phone.

I now just had to execute a certain pattern of button presses to emulate pairing a new remote. And then ... it worked! 🎉

Automated curtains project
Opening my curtains using the Raspberry Pi and its RF sensor.

Now, I can go to sleep in darkness and wake up with sunlight hitting my face 🌞. Awesome! The project succeeded, and I have been using it every day ever since. The great thing about doing a Computer Science degree is being able to apply your knowledge to solve real-world problems. When I leave this student dormitory, I will leave the Raspberry Pi right where it is, so others can also benefit from automated curtains 😊. Cheers!

]]>
<![CDATA[School break time friend finder]]>https://jeroenoverschie.nl/school-breaktime-friend-finder/62a48f659cd77522dc21c276Fri, 20 Feb 2015 14:00:00 GMT

At my school, students often had gaps in their schedules: in between the lessons scheduled for the day, one would often have breaks. But because everyone chose a personalized package of classes to follow, everyone's schedule was different, so it could be hard to know with whom you could spend those breaks! To solve this, I developed this app. It allows students to find out with whom they share breaks, so they can hang out whilst waiting for the next class 😊. The app was actually used by students at my school. Very cool!

Building the app

The entire app is quite sophisticated. The various components can be laid out as follows:

  • Node.js backend (Github)
    - Has three main responsibilities:
    (1) to scrape student schedules off HTML pages into a MongoDB database. Scraping is done using Cheerio and communication with MongoDB via MongoJS.
    (2) parse the schedules into a relational format and compute what odd break-time hours exist.
    (3) expose the MongoDB database as an API.
  • Ember.js frontend (Github)
    - This front-end then consumes the API data using ember-data. I'm using Bootstrap as a UI framework so I don't have to build all the buttons, tables and interfacing myself.

School break time friend finder
An overview of the app architecture. A Node.js app scrapes and parses student schedules, puts it in a database, which an Ember.js app then consumes through a REST API.

... the relational mapping in the database is as follows:

School break time friend finder
An Object-Relational-Mapping (ORM) of the school. Most important is a lesson, which then relates teachers, students and a room.

And our working app looks as follows:

School break time friend finder
The working break-time friend finder app. Using a simple search, a student can find his schedule.

But most importantly, there is the functionality to see who shares your break time ('tussen' in the picture below, Dutch for a free period in between classes) or any classes with you:

School break time friend finder
In the app, you can click any class or break to see with whom you share the hour. It's no longer a guessing game! ✓

🥳

]]>