Data Generation

This document provides a detailed overview of the components that make up a data generation run and how to configure them. The folder example_user_scripts/datacollection provides a set of example scripts and configs for the developer's reference.

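A data generation run consists of these main components, each described in its own section below: trajectory generation, data collection, data validation, data aggregation, and data augmentation.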

All of these components are configured through datacollector_config.jsonc, along with a set of APIs that allow additional fine-tuning on top of the config. For more information, refer to config.md and api.md.

Note

There is an extra installation step that you need to follow in order to use the datacollection module inside the projectairsim client. Once you have installed the projectairsim module, run the following command in your Python environment:

python -m pip install -U albumentations --no-binary qudida,albumentations

This manual step is required because the albumentations module installs opencv-python-headless by default. Since that install interferes with the opencv-python install within the projectairsim module, we have not made it a dependency of the wheel and recommend that users of the datacollection module install it manually with the above command.
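To confirm which OpenCV wheels ended up in your environment after the install, a small check like the following may help. It is only a sketch that uses the standard-library importlib.metadata module; the helper name installed_opencv_variants is made up for this example and is not part of projectairsim.

```python
from importlib import metadata


def installed_opencv_variants(names=("opencv-python", "opencv-python-headless")):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_opencv_variants().items():
        print(f"{name}: {version or 'not installed'}")
```

If both distributions are reported as installed, the conflict described above is likely present in that environment.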

Trajectory Generation

During the data generation run, the drone moves along the WayPoint objects defined during trajectory generation, which makes this step crucial to the quantity, quality, and variety of data generated through the module.

For more information on the types of trajectories available out of the box and how to configure them, check out trajectory.md.
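To make the idea of a trajectory as a sequence of waypoints concrete, here is a minimal sketch that places evenly spaced poses on a circle, each facing the centre. SketchWayPoint and circular_waypoints are hypothetical names invented for this illustration; the module's actual WayPoint class, its fields, and its built-in trajectory types are documented in trajectory.md and may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchWayPoint:
    """Hypothetical stand-in for a waypoint; the field names are assumptions."""
    x: float
    y: float
    z: float
    yaw: float  # heading in radians


def circular_waypoints(center, radius, z, count):
    """Return `count` evenly spaced waypoints on a circle, each facing the centre."""
    cx, cy = center
    points = []
    for i in range(count):
        angle = 2.0 * math.pi * i / count
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        # Point the heading from the waypoint back towards the circle centre.
        yaw = math.atan2(cy - y, cx - x)
        points.append(SketchWayPoint(x, y, z, yaw))
    return points


if __name__ == "__main__":
    for wp in circular_waypoints(center=(0.0, 0.0), radius=10.0, z=-5.0, count=4):
        print(wp)
```

More waypoints or a wider spread of radii and heights would give more poses, and so more varied data, per run.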

Data Collection

The drone is moved along the generated trajectory. At each pose, ground-truth data is requested from the sim and saved along with other relevant metadata about that pose (weather and time conditions, etc.). What data is requested from the sim depends on the data spec defined in the config or through the API.

Here, the sim scene and drone are configured through the scene and robot configs.
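The collection step described above can be pictured as a loop over poses. The sketch below is illustrative only: Sample, collect, and fake_request are hypothetical names, and the callable that talks to the sim is replaced by a stub, since the real request calls are documented in api.md rather than here.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One collected data point (hypothetical structure for illustration)."""
    index: int
    pose: tuple
    data: dict
    metadata: dict = field(default_factory=dict)


def collect(poses, request_data, data_spec, conditions):
    """Visit each pose, request the data named in data_spec, and keep the metadata."""
    samples = []
    for index, pose in enumerate(poses):
        data = request_data(pose, data_spec)
        # Copy the conditions so each sample owns its own metadata dict.
        samples.append(Sample(index, pose, data, dict(conditions)))
    return samples


def fake_request(pose, data_spec):
    """Stub standing in for the sim; returns a placeholder per requested data type."""
    return {key: f"{key}@{pose}" for key in data_spec}


if __name__ == "__main__":
    poses = [(0.0, 0.0, -5.0), (1.0, 0.0, -5.0)]
    for sample in collect(poses, fake_request, ["rgb", "bbox_2d"], {"weather": "clear"}):
        print(sample)
```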

Data Validation

Once the data is collected and stored in memory, the data generation module allows the developer to run a set of validation checks on the data. These checks remove data points that do not pass them from the dataset. The module also generates a report on the remaining data, providing quantitative measurements of the collected dataset.

These checks can be initiated through API calls to the data generator module. For more information, check out api.md. datacollection_validation_apis.py is also included in the example_user_scripts/datacollection folder as an example of how to use these APIs.
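As a rough picture of what a validation pass does, the sketch below filters in-memory data points through a set of named checks and counts the rejections for a simple report. The function names, the bbox_2d key, and the minimum-area rule are assumptions made for this example; the checks the module actually provides are listed in api.md.

```python
def has_bbox(sample):
    """Check that a 2D bounding box was recorded for this data point."""
    return sample.get("bbox_2d") is not None


def bbox_min_area(sample, min_area=25):
    """Check that the box (x_min, y_min, x_max, y_max) covers at least min_area pixels."""
    box = sample.get("bbox_2d")
    if box is None:
        return False
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min) >= min_area


def validate(samples, checks):
    """Keep samples that pass every check; report how many each check rejected."""
    kept = []
    rejected = {name: 0 for name in checks}
    for sample in samples:
        failed = [name for name, check in checks.items() if not check(sample)]
        for name in failed:
            rejected[name] += 1
        if not failed:
            kept.append(sample)
    report = {"total": len(samples), "kept": len(kept), "rejected_by_check": rejected}
    return kept, report


if __name__ == "__main__":
    data = [{"bbox_2d": (0, 0, 10, 10)}, {"bbox_2d": (0, 0, 2, 2)}, {"bbox_2d": None}]
    kept, report = validate(data, {"has_bbox": has_bbox, "bbox_min_area": bbox_min_area})
    print(kept)
    print(report)
```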

Data Aggregation

Once the data is collected and validated, the data can be aggregated into specific spec formats that can later be used for tasks such as ingestion to data curation tools or fine-tuning of a pre-trained model.

For more information on the specific formats the data is saved to, check out data_aggregation.md.

Data Augmentation

Once the data is collected and aggregated, the module offers the option to augment this dataset with variations of itself. Its purpose is to add variety to your dataset and generate potential edge-case data points without going back to the sim.

If enabled, this sub-module reads the images and metadata from your directory, manipulates them, and saves them back to that directory. The augmented images follow the same naming convention as those from the sim ({index}.png), and their indexes start from where the sim images end. As for the metadata, currently the 2D bbox data for the augmented images is appended to the CSV generated through the data augmentation step.
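The index continuation described above can be sketched as follows. This is an illustration of the naming convention only, not the module's implementation; next_image_index and augmented_paths are hypothetical helpers, and starting from 0 for an empty directory is an assumption.

```python
import re
import tempfile
from pathlib import Path

# Matches file names of the form {index}.png, for example 12.png.
INDEX_PNG = re.compile(r"^(\d+)\.png$")


def next_image_index(image_dir):
    """Return one past the highest {index}.png found, or 0 if there are none."""
    indexes = []
    for path in Path(image_dir).iterdir():
        match = INDEX_PNG.match(path.name)
        if match:
            indexes.append(int(match.group(1)))
    return max(indexes) + 1 if indexes else 0


def augmented_paths(image_dir, count):
    """Return output paths for `count` augmented images, continuing the sim's indexes."""
    start = next_image_index(image_dir)
    return [Path(image_dir) / f"{i}.png" for i in range(start, start + count)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(3):
            (Path(tmp) / f"{i}.png").touch()  # stand-ins for images saved from the sim
        print([p.name for p in augmented_paths(tmp, 2)])
```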

For more information about the types of available augmentations and how to configure this sub-module through the config, check out config.md.

For the available augmentation APIs, check out api.md. datacollection_augmentation_apis.py is also included in the example_user_scripts/datacollection folder as an example of how to use these APIs.