Boost Your Models: Insights into data and label augmentation in torchvision

Data augmentation is a technique used to increase the diversity of data available for training models. Until recently, managing both mask and bounding box augmentations simultaneously with the same parameters was challenging in torchvision. Starting with v0.15, torchvision added more streamlined and flexible augmentation capabilities that are easier to use and more consistent with the powerful OOP structures seen in other libraries.


Data augmentation is a technique used to increase the diversity of data available for training models without actually collecting new data. By applying random transformations like rotation, scaling, cropping, and flipping to existing images, data augmentation improves generalization and helps reduce overfitting.

In a detection task, labels are crucial as they not only identify the class of the object but also provide the specific location of the object within the image. Typically, a label for an object detection task has the following elements:

  • Class label: This is the category of the object, like ‘dog’, ‘car’, or ‘tree’. It tells the model what the object is. (In practice it is usually encoded as an integer, but the meaning is the same.)
  • Bounding box coordinates: These coordinates define where the object is located in the image. Usually, this is specified by four values:
    • x_min: horizontal coordinate of the top-left corner
    • y_min: vertical coordinate of the top-left corner
    • x_max: horizontal coordinate of the bottom-right corner
    • y_max: vertical coordinate of the bottom-right corner

For example, if there’s a photo with a cat at the bottom right corner, the label might look something like this: [“cat”, [800, 450, 1000, 600]]. Here, “cat” is the class and the four numbers are the bounding box coordinates [x_min, y_min, x_max, y_max] in pixels.
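As a concrete illustration, such a label could be written in Python like this (the dictionary keys and the integer class id below are just a common convention, not something imposed by a particular library):

```python
# A single hypothetical label: class name plus [x_min, y_min, x_max, y_max] in pixels.
label = ["cat", [800, 450, 1000, 600]]

# Detection pipelines often prefer a dictionary with integer class ids,
# with one entry per object in the image:
target = {
    "labels": [17],                    # 17 standing in for the "cat" class
    "boxes": [[800, 450, 1000, 600]],  # one XYXY box per object
}
```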

This structured format allows object detection models to learn not only what objects look like but also where they tend to be located within different contexts or environments.

When applying augmentations that alter the shape or position of an object in an image, it’s crucial to adjust the bounding box coordinates accordingly. This ensures that the bounding box continues to accurately enclose the object after the transformation.
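To make this concrete, here is a minimal, library-free sketch of how a horizontal flip would remap an XYXY box on an image of a given width (the function name and the example numbers are my own):

```python
def hflip_box(box, image_width):
    """Remap an [x_min, y_min, x_max, y_max] box after a horizontal flip of the image."""
    x_min, y_min, x_max, y_max = box
    # The left edge becomes the mirrored right edge and vice versa;
    # vertical coordinates are unchanged by a horizontal flip.
    return [image_width - x_max, y_min, image_width - x_min, y_max]

# Flipping a 1024-pixel-wide image moves the cat's box from the right side to the left.
print(hflip_box([800, 450, 1000, 600], 1024))  # [24, 450, 224, 600]
```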

Until recently, managing both mask and bounding box augmentations simultaneously with the same parameters was a challenging task in torchvision.

Other toolboxes like fast.ai have long provided robust support for simultaneous augmentations on images, bounding boxes, and masks through a highly flexible OOP structure.

The core of fast.ai’s approach is the use of transform pipelines, which can be applied consistently across different types of data associated with an image. This means when you apply a transformation such as a rotation or scaling, the same transformation is automatically applied to the image, its corresponding bounding boxes, and any segmentation masks.

Starting with v0.15, torchvision added more streamlined and flexible augmentation capabilities that are easier to use and more consistent with the powerful OOP structures seen in libraries like fast.ai.

Welcome to transforms v2. The official documentation and tutorial are fairly thorough. I will summarize some core concepts and pitfalls here.

Images are treated as 3D tensors in the classical CHW layout. For “labels” we usually pass a dictionary. Inside this dictionary we use dedicated classes like BoundingBoxes or Mask, and the individual augmentations know when and how to handle them. For example, pixel-wise augmentations (such as a color shift) will not touch the bounding box coordinates, but a crop will alter them. We can also pass other data types, which will be left untouched.
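Below is a minimal sketch of what such a sample might look like. Note that the namespace is tv_tensors in recent torchvision releases (it was called datapoints in v0.15), so the exact imports may differ with your version:

```python
import torch
from torchvision import tv_tensors
from torchvision.transforms import v2

# A fake 3-channel 480x640 image in the classical CHW layout.
image = tv_tensors.Image(torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8))

# The "label" dictionary: the dedicated classes tell each transform what it is handling.
target = {
    "boxes": tv_tensors.BoundingBoxes(
        [[100, 150, 300, 400]], format="XYXY", canvas_size=(480, 640)
    ),
    "masks": tv_tensors.Mask(torch.zeros(1, 480, 640, dtype=torch.uint8)),
    "labels": torch.tensor([17]),  # plain tensors are passed through untouched
}

# A pixel-wise augmentation leaves the box coordinates alone...
image_cj, target_cj = v2.ColorJitter(brightness=0.5)(image, target)

# ...while a geometric one updates boxes and masks consistently with the image.
image_hf, target_hf = v2.RandomHorizontalFlip(p=1.0)(image, target)
print(target_hf["boxes"])  # coordinates become [[340, 150, 540, 400]]
```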

We use the classical Compose to chain multiple transformations, as in v1. Usually the first transformation is ToImage, which takes the actual image (PIL, numpy, or tensor) and converts it into a tensor subclass; other types are left untouched. This transformation also brings a speed benefit: the rest of the augmentations are tuned to work best on tensors.
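A tiny example of ToImage in isolation (assuming a recent torchvision release, where the transform carries this name):

```python
from PIL import Image
from torchvision.transforms import v2

to_image = v2.ToImage()

pil_img = Image.new("RGB", (640, 480))     # stand-in for a real photo (W=640, H=480)
tensor_img = to_image(pil_img)
print(type(tensor_img), tensor_img.shape)  # Image tensor subclass, CHW: (3, 480, 640)
```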

Near the end of the pipeline we usually place ToDtype with float32 as the target type. Its optional scale parameter maps values from the [0, 255] range to the [0, 1] interval. Then, of course, comes the classical normalization, if the model requires it.
ToDtype is flexible enough that we can specify what to do with the masks or other raster types found in the label dictionaries.
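Putting these pieces together, a typical detection/segmentation pipeline might look like the sketch below. The dict form of ToDtype follows the current documentation; treat the mean/std values and the crop size as placeholders:

```python
import torch
from torchvision import tv_tensors
from torchvision.transforms import v2

transforms = v2.Compose([
    v2.ToImage(),                            # PIL/numpy -> Image tensor subclass
    v2.RandomHorizontalFlip(p=0.5),          # geometric: also flips boxes and masks
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    # Convert images to float32 and scale [0, 255] -> [0, 1];
    # keep masks as integer class ids and leave everything else untouched.
    v2.ToDtype(
        {tv_tensors.Image: torch.float32, tv_tensors.Mask: torch.int64, "others": None},
        scale=True,
    ),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```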

In between these two transformations we can let our imagination run wild. When writing a custom dataset class, if we return an image (PIL, numpy) and a dict with correctly typed labels, we can use the whole range of torchvision augmentations.
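A minimal sketch of such a custom dataset (the class name, sample format, and paths are hypothetical):

```python
import torch
from torch.utils.data import Dataset
from torchvision import tv_tensors
from PIL import Image

class PetsDetection(Dataset):
    """Hypothetical detection dataset returning a PIL image and a v2-style target dict."""

    def __init__(self, samples, transforms=None):
        # samples: list of (image_path, boxes, labels) tuples prepared elsewhere
        self.samples = samples
        self.transforms = transforms

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_path, boxes, labels = self.samples[idx]
        img = Image.open(image_path).convert("RGB")
        target = {
            "boxes": tv_tensors.BoundingBoxes(
                boxes, format="XYXY", canvas_size=(img.height, img.width)
            ),
            "labels": torch.tensor(labels),
        }
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target
```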

If we are using “standard” datasets, there is a wrapper, wrap_dataset_for_transforms_v2, that, in theory, will convert our v1 dataset into a v2-compatible dataset. Things are not that simple, of course, and for datasets that are slightly off the beaten path it is better to write a dataset class directly.
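For example, following the pattern from the official tutorial, wrapping a COCO-style dataset might look like this (the paths are placeholders, and transforms is the pipeline sketched earlier):

```python
from torchvision import datasets
from torchvision.datasets import wrap_dataset_for_transforms_v2

# Paths are placeholders; COCO is one of the datasets the wrapper supports.
dataset = datasets.CocoDetection(
    "path/to/images", "path/to/annotations.json", transforms=transforms
)

# The wrapper converts the raw COCO targets into BoundingBoxes/Mask tv_tensors,
# so the v2 pipeline above receives types it knows how to transform.
dataset = wrap_dataset_for_transforms_v2(dataset)
```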

Good luck!