🐙 Scivision data repository template

In order for scivision to be able to load a dataset and run a computer vision (CV) model on it, a GitHub repository containing configuration for one or more datasets must first be created.

This guide explains how to set up a GitHub repository for your datasets compatible with scivision.

This is also a pre-requisite for adding the dataset to the scivision “catalog”, enabling other users of scivision to use it. To learn how to do this, consult the contributor page after setting up your data repo as per this guide.

📚 Contents:

🧱 Data repo structure
🖋️ Data config file
📄 Documentation
📜 Data license

🧱 Data repo structure

We recommend to set up a data repo for scivision with the following structure, with the default name for the config file that the scivision API will recognise being .scivision/data.yml:

exampleuser/data_repo
│   README
│   LICENSE
│
└───.scivision
│   │   data.yml

🖋️ Data config file

The scivision python API relies upon another open source python tool called intake for loading datasets. We recommend creating a YAML format “intake catalog” (config file), to enable scivision users who discover your dataset to easily load it via the API.

For a comprehensive guide to setting up this data config file (intake catalog), consult the intake documentation.

Here is an example of a .scivision/data.yml config (intake catalog) taken from the GitHub repo alan-turing-institute/plankton-cefas-scivision:

sources:
  plankton:
      description: Sample images required to demonstrate the ResNet50 model trained in the Rapid Identification of Plankton using Machine Learning DSG undertaken by Cefas, The Alan Turing Institute and Plankton Analytics Ltd.
      origin:
      driver: intake_xarray.image.ImageSource
      args:
        urlpath: ["zip://*.tif::https://zenodo.org/record/6143685/files/images.zip"]
        chunks: {}
        storage_options: {'anon': True}
        coerce_shape: [1000, 1000]
        exif_tags: True

In order to fully understand how this data.yml has been configured, you should consult the intake documentation. Here we describe just the key fields:

sources: gives the option to include multiple datasets in a single config, in this case there is just one called “plankton”
driver: the intake “driver” plugin, which specifies the format that the data is loaded to, in this case [xarray](intake/intake-xarray)
urlpath: a path to the data itself; in this example, we point to a Zenodo repository and load all the .tif files from the compressed zip file stored there

The scivision API can accept a path to a locally stored data config file, or the GitHub repository containing it as specified in this guide. See the load_dataset function in the 🌟 Scivision API docs.

📄 Documentation

A README, which includes helpful information on the dataset and its origin should be included. Without this, your dataset may not be accepted for inclusion in the scivision catalog.

📜 Data license

You should include a LICENSE file in the repository, so that scivision users who come across it can understand the conditions of the data’s usage. For help deciding which license to include, check out the section on data licenses in The Turing Way online handbook.