DVC + Many Files: A Strategy for Efficient Large Dataset Management
Implementing DVC at my workplace worked well for most tasks, but a dataset containing 1 million images proved challenging: uploading and downloading it took several hours, so we had to use a few tricks to streamline the workflow. In this post, I'll share the lessons we learned so you can avoid similar pitfalls.
Context
In an effort to improve data management during experiments, I decided to incorporate DVC. DVC (Data Version Control) is a tool that stores your data in remote storage and simplifies its versioning with Git. It hashes the data and saves the hash along with other metadata in a source-control file that is added to the repository. The actual data is then transferred to a corresponding folder in the remote storage. That way, DVC allows you to version your data by tracking a "pointer" with Git and restore the corresponding files with a single DVC command.
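For readers who haven't used it, the day-to-day workflow is just a handful of commands. Here is a minimal sketch, assuming a remote is already configured and the dataset lives in an imagenet/ folder (the path is illustrative):

# Track the dataset: DVC hashes the files and writes a small imagenet.dvc pointer file
dvc add imagenet
git add imagenet.dvc .gitignore
git commit -m "Track imagenet with DVC"
dvc push                  # upload the hashed data to the configured remote storage

# Later, restore the data that matches a given commit
git checkout <revision>
dvc pull                  # fetch the matching data from the remote and check it out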
At first, everything worked smoothly. We could upload and download data without any issues, and it was easy to track which version of our code used which version of the data for a given experiment. However, things got complicated when we started working with datasets of around 1 million images or more.
It turns out that DVC struggles to handle a large number of files (>200k)[1][2]. In our case, it took over 8 hours to upload the data with DVC and another 3 hours to download it, plus roughly 1.5 more hours to check the files out once the download finished. Those times were unmanageable for us. The conclusion was clear: we had a serious performance problem!
Problem
DVC has to check every file to verify its contents and to avoid uploading the same file twice. This computation takes time, and however efficiently it is implemented, the per-file overhead adds up across a million files. On top of that, there is a second problem we ran into: sending 1 million requests to the storage account costs a lot of money. So we had to find a way to make DVC work for us.
Solution
The solution we settled on was to zip the images into several archives, which reduces the number of files DVC has to check. The easiest way to do this was the zip command on Linux: we wrote a script that zips the images in groups, and then uploaded the archives to the remote storage. This cut the number of files DVC had to track from 1 million to roughly 1,000 and significantly improved the upload and download times.
Here is the script we used to zip the images:
#!/bin/bash
# The directory structure should be as follows:
# imagenet/
# ├── n01440764
# │   ├── n01440764_10026.JPEG
# │   ├── n01440764_10027.JPEG
# │   ........
# ├── n01440765
# │   ├── n01440765_1000.JPEG
# │   ........
# .........
directory_path="imagenet"
max_processes=$(nproc --all)
# Function to handle ZIP operation using parallel
zip_directories() {
    find "$directory_path" -mindepth 1 -maxdepth 1 -type d | \
        parallel --progress -j "$max_processes" 'zip -qr {}.zip {} && rm -rf {}'
}
# Function to handle UNZIP operation using parallel
unzip_files() {
    find "$directory_path" -type f -name "*.zip" | \
        parallel --progress -j "$max_processes" 'unzip -q {} && rm {}'
}
# Main script logic
if [ $# -ne 1 ]; then
    echo "Usage: $0 [ZIP|UNZIP]"
    exit 1
fi
action=$1
case $action in
    ZIP)
        zip_directories
        ;;
    UNZIP)
        unzip_files
        ;;
    *)
        echo "Error: Unrecognized option '$action'. Usage: $0 [ZIP|UNZIP]"
        exit 1
        ;;
esac
exit 0
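For completeness, here is roughly how the script fits into the DVC workflow. This is a sketch that assumes the script is saved as zip_dataset.sh (the name is just an example) and run from the directory that contains imagenet/:

./zip_dataset.sh ZIP      # replace ~1M image files with ~1,000 archives
dvc add imagenet          # DVC now only hashes and tracks the archives
dvc push                  # upload the archives to the remote storage

# On another machine, or after a fresh clone:
dvc pull                  # download the archives
./zip_dataset.sh UNZIP    # restore the original directory structure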
A few points to consider with this approach:
- I strongly recommend grouping the files in a way that makes sense for your project. In our case, we used metadata (e.g., the labeling session) that naturally partitions the dataset. This made it easy to navigate the dataset and find the images we needed without downloading or updating the whole thing.
- Adding new data should not modify the existing archives; otherwise DVC would see changed archives and store their contents again, effectively duplicating data in remote storage. Instead, new data should go into a new archive. This is easy to achieve with the metadata used to partition the dataset: with labeling sessions, each new session becomes a new zip while the existing ones stay untouched (see the sketch below).
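As a hypothetical example of the second point, suppose the top-level directories are named after labeling sessions rather than ImageNet synsets. Because already-archived sessions exist only as .zip files, the find ... -type d in the script picks up nothing but the new, still-unzipped directory:

mkdir -p imagenet/session_2024_05   # copy the new session's images here (name is illustrative)
./zip_dataset.sh ZIP                # zips only imagenet/session_2024_05; existing archives untouched
dvc add imagenet                    # unchanged archives keep their hashes
dvc push                            # only the new archive needs to be uploaded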
Another solution was proposed in this article: partitioning the data with Parquet instead of zipping it. It is a clever approach that can be more efficient in some cases, but it requires more implementation effort and may not be a good fit for computer-vision datasets.
Summary
DVC is a great tool for managing data in machine learning projects, but it struggles with datasets containing a very large number of files. To overcome this limitation, we zipped the images in groups and uploaded the archives to the remote storage, which significantly improved the upload and download times by reducing the number of files being tracked. I hope this post helps you avoid similar pitfalls when working with large datasets in DVC. If you have any questions or suggestions, feel free to leave a comment below. I'd love to hear from you!