ML Dataset management

CVAT workflow for computer vision applications


Datasets are hard

Datasets are the most important part of your model. If you don't have a good understanding of what your dataset captures, how it was collected, what it's trying to represent, and how it's being labeled, then the final product will suffer.

Many discussions and tutorials in this area are quite surface level and don't really provide good insight into building a dataset pipeline that scales from early POC work to production. Here's an overview of how I've managed datasets, which should prove useful to anyone looking to build ML models and avoid some of the pain points I've experienced.

Enter CVAT

CVAT is a powerful tool for managing your datasets. It provides a full platform where you can pull in data from your raw source, label it, and export to a variety of formats. There's also a sprinkle of other features that come in handy, e.g. auto annotation, user access control, analytics, etc.

I initially avoided CVAT because of the complexity; I just wanted to click some boxes and label my data! However, a bit of effort spent here will save a lot of time in the long run. Your data will only continue growing, you will potentially be spending a lot of time on labeling, and you may need to bring on additional help for collaboration. Things get messy quickly, and that will slow you down.

Setup

The easiest way I've found to serve CVAT is with Docker. I'm using the following script to launch CVAT locally. If you would like to use HTTPS on a publicly accessible server, you can switch from the default docker-compose.yml to docker-compose.https.yml.
You will also need to update CVAT_HOST to match your domain name, and make sure you have a valid SSL certificate on your host as well.

Linux is assumed as the OS; Windows / macOS will make your life difficult.

                         
#!/bin/bash
# Instructions found in: https://docs.cvat.ai/docs/administration/basics/installation/

export CVAT_HOST=localhost
export ACME_EMAIL=info@myhostname.com

if [ ! -d cvat/.git ]; then
  git clone https://github.com/cvat-ai/cvat
fi

pushd cvat
  git fetch --tags
  # Pin to a specific version so you can control when you update and version your backups
  git checkout v2.33.0
  # For HTTPS on a public host, also pass -f docker-compose.https.yml and set CVAT_HOST to your domain
  docker compose -f docker-compose.yml -f components/serverless/docker-compose.serverless.yml up -d
popd

Once it's running we can create a superuser to manage our CVAT instance: docker exec -it cvat_server bash -ic 'python3 ~/manage.py createsuperuser'

The login page is then served locally at http://localhost:8080.

Now What?

Now we need to think about where our raw data is going to live. Cloud storage is a good choice; I'm partial to S3, and CVAT has the bonus of integrating directly with it.

I like to structure my data by creating a global dataset bucket with subfolders for each project.

For example, if I'm building a model to detect ships, I'll have the following structure:
s3://my-datasets/ship_detector/***

The structure underneath the project directory doesn't really matter. I'd recommend bundling things up in a structured way if you are constantly building your dataset up, since this lets you add new subfolders to CVAT as a task very easily, but it's up to you as long as everything lives in the one place.
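For example, when a new batch of imagery comes in, a small script can push it into its own subfolder under the project prefix, which then maps neatly onto a new CVAT task. Here's a minimal sketch using boto3; the bucket name, prefix, and local folder are made up for illustration:

import pathlib
import boto3

s3 = boto3.client("s3")

BUCKET = "my-datasets"                        # hypothetical bucket name
PREFIX = "ship_detector/2024-06-harbour-cam"  # one subfolder per batch -> one CVAT task

# Upload every jpg in a local folder into the batch's subfolder
for path in pathlib.Path("./new_batch").glob("*.jpg"):
    key = f"{PREFIX}/{path.name}"
    s3.upload_file(str(path), BUCKET, key)    # upload_file(Filename, Bucket, Key)
    print(f"uploaded s3://{BUCKET}/{key}")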

Project Creation

We have CVAT, we have some data, how do we join the two?

I won't go into too much detail here as there are a lot of good resources in the CVAT docs, but in a nutshell:

  • Create a cloud storage link: docs
  • Create a project: docs
  • Create a labeling task: docs
  • Label your data: docs

Exporting

We've started labeling our data and we've got a good amount, enough to train a model to help us out with annotation, so let's get that done to speed things up.

We can export our data manually directly from CVAT as shown here, but this has limitations: it will export everything, including data that isn't yet annotated, which is not what we want in this case. To make my life easier I've written the following tool to export only jobs that have been marked as "complete" within CVAT, and that's what we will use here. We can select the project we have created and export only the "completed" jobs as a single COCO dataset.

Using the exported dataset

After export we only really have our labels; we still need to download the relevant data. Again, S3 is nice as it has the boto3 library for Python, meaning we can automate this process. If you look at our exported labels you will see the image paths match what we have in S3, so we just need to prepend the bucket path.

I won't dwell on the code, it's a small script chat gippity will gladly spit out for you, but a rough sketch is below.
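As a minimal sketch (the bucket name, annotations path, and output directory are made up), this reads the image entries from the exported COCO file and pulls the matching objects down from S3:

import json
import pathlib
import boto3

BUCKET = "my-datasets"                                       # hypothetical bucket name
ANNOTATIONS = "export/annotations/instances_default.json"    # annotations file from the COCO export
OUT_DIR = pathlib.Path("dataset/images")

s3 = boto3.client("s3")
OUT_DIR.mkdir(parents=True, exist_ok=True)

coco = json.loads(pathlib.Path(ANNOTATIONS).read_text())
for image in coco["images"]:
    key = image["file_name"]                 # matches the object key under the bucket
    local = OUT_DIR / pathlib.Path(key).name
    s3.download_file(BUCKET, key, str(local))
    print(f"downloaded s3://{BUCKET}/{key} -> {local}")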

Auto annotation

We've used the above data to train a basic model; it does fairly well, and we now want to use it to automatically annotate the remainder of our data.

This was a little complicated to set up, but essentially we can write a Python wrapper script that CVAT can hook into to perform inference using our model.
I highly recommend reading the default serverless code, e.g. the YOLO function, to get a better understanding of how you should integrate your model.

  • Set up Nuclio to run our model docs
  • Write a Python script / Nuclio function to perform inference docs (a sketch of what this looks like follows this list)
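
As a rough template, this is the shape of the main.py a Nuclio function wraps around your model. The inference call, label names, and thresholds are placeholders here, so treat it as a sketch rather than a drop-in implementation:

import base64
import io
import json

from PIL import Image

def init_context(context):
    # Called once when the function container starts: load the model and cache it
    context.logger.info("Loading model...")
    context.user_data.model = None  # placeholder: load your trained weights here
    context.logger.info("Model loaded")

def handler(context, event):
    # CVAT sends one frame per request as a base64-encoded image
    data = event.body
    image = Image.open(io.BytesIO(base64.b64decode(data["image"])))
    threshold = float(data.get("threshold", 0.5))

    # placeholder: run context.user_data.model on `image` and collect detections
    detections = []  # e.g. [(xtl, ytl, xbr, ybr, confidence, "ship"), ...]

    # CVAT expects a JSON list of shapes with label, points, and confidence
    results = [
        {
            "confidence": str(conf),
            "label": label,
            "points": [xtl, ytl, xbr, ybr],
            "type": "rectangle",
        }
        for (xtl, ytl, xbr, ybr, conf, label) in detections
        if conf >= threshold
    ]
    return context.Response(
        body=json.dumps(results),
        headers={},
        content_type="application/json",
        status_code=200,
    )

Alongside this, the function.yaml declares the runtime and the labels your model produces, which CVAT reads when registering the function.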

Finishing up

We've now got a fairly strong pipeline for building datasets. We have:

  • A data source, S3, with easy mechanisms for adding additional data (web GUI)
  • Labeling software, CVAT, where we can manage our dataset
  • Auto annotation, so new tasks coming into CVAT are pre-labeled and some of the annotation load is lifted
  • Dataset exporting, a convenient CLI tool to dump data
  • Automation tools to merge data with the annotations and provide a quick way to train new models

With a sprinkling of bash scripts you will also be able to automate a huge portion of your ML workflow.

And we're done!

I'm sure there are many ways people manage their datasets; this has been the culmination of two years of learning the hard way. My hope is that it has given you some insight and inspiration for building your own pipelines so you can spend more time on the fun stuff.