Data shall be generated that meets the ML data requirements established in Activity 4. This shall include three separate datasets: development data, internal test data and verification data.
We use the term development data to cover both training and validation data as they are normally referred to in the ML literature. Development data is used to create a model that is then tested by the development team using the internal test data. Only once a model is deemed fit for release by the development team is it exposed to the verification data.
The first two of the above datasets are for use in the development process (Stage 3) whilst verification data is used in model verification (Stage 4).
The generation of ML data will typically consider three sub-processes:
Data collection shall be undertaken to obtain data, from sources available to the data collection team, that sufficiently addresses the ML data requirements. This may involve reusing existing datasets where they are deemed appropriate for the context, or collecting data from primary sources.
It may be necessary to collect data from systems that are close to, but not identical to, the envisioned system. Such compromises and restrictions should be stated explicitly in the data generation log with a justification of why the data collected is still valid.
A vehicle gathering video data is an experimental variant of the proposed vehicle, where variations in vehicle dynamics are assumed to have no impact on the video data with respect to prediction of distance to leading vehicles.
Where existing data sources are used, a rationale should be provided in the data generation log as to how these data sources transfer to the current domain, and any assumptions concerning relevance should be stated explicitly.
Where it is impossible to gather real-world samples it is common to use simulators. These may be software or hardware in the loop. Where data is collected from such simulators, the configuration data should be recorded to allow for repeatable data collection and to support systematic verification and validation of the simulator within the operational context. Such simulators might need to be subjected to a separate assurance or approval process, or to something similar to tool qualification in the aerospace guidance DO-178C.
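One way to make simulator-derived data repeatable is to log each collection run's configuration alongside a content hash of that configuration, so that any sample can later be traced back to the exact simulator settings that produced it. The sketch below is a minimal illustration of this idea; the function name, configuration fields and paths are hypothetical, not prescribed by this guidance.

```python
import hashlib
import json


def record_simulation_run(config, samples_path):
    """Build a data generation log entry for one simulator run.

    The configuration is serialised deterministically (sorted keys) and
    hashed, so two runs with identical settings produce identical hashes
    and any divergence in settings is immediately visible in the log.
    """
    canonical = json.dumps(config, sort_keys=True)
    entry = {
        "config": config,
        "config_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "samples_path": samples_path,
    }
    return json.dumps(entry, sort_keys=True)


# Hypothetical example: two runs with the same weather settings share a hash.
run_a = record_simulation_run({"fog_density": 0.3, "rain": True}, "runs/001")
run_b = record_simulation_run({"fog_density": 0.3, "rain": True}, "runs/002")
```

Recording a hash rather than relying on file names alone makes it cheap to audit, during verification, that no simulator configuration drifted between nominally identical collection runs.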
Data preprocessing may be undertaken to transform the collected data samples into data that can be consumed by the learning process. This may involve the addition of labels, normalisation of data, the removal of noise or the management of missing features.
Preprocessing of data is common and is not necessarily used to compensate for failures in the data collection process. Indeed, the normalisation of data is often used to improve the performance of trained models. Consider, for example, pixel values in an image in the range [0, 255] scaled to floating-point values in the range [0, 1].
Handling missing data is particularly important when using clinical health data, with missing data rates reported from 30% to 80%. The strategies used to tackle missing data should be stated explicitly in the data generation log with a clear rationale as to why the approach used is commensurate with the system under consideration. When an ML component is used to predict cardiovascular disease, many records are found to be missing blood pressure measurements. A preprocessing rationale may be: “Analysis of the data has shown a correlation between the recorded values for patient’s blood pressure and age. Linear regression on the training set is therefore used to impute blood pressure where it is not recorded”.
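The imputation rationale quoted above can be made concrete with a short sketch. This is an illustrative implementation under the stated assumption that blood pressure correlates linearly with age; the record fields (`age`, `bp`) and the data values are hypothetical.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def impute_blood_pressure(records):
    """Fill missing 'bp' values using a regression on 'age'.

    The regression is fitted only on records where blood pressure was
    actually recorded, then applied to the records where it is missing.
    """
    known = [(r["age"], r["bp"]) for r in records if r["bp"] is not None]
    slope, intercept = fit_linear([a for a, _ in known],
                                  [b for _, b in known])
    for r in records:
        if r["bp"] is None:
            r["bp"] = slope * r["age"] + intercept
    return records
```

Note that, consistent with the rationale in the log, the regression must be fitted on the training set only; fitting it on data that includes verification samples would leak information across the datasets.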
A common preprocessing activity is the addition of labels to data. This is particularly important in supervised learning, where the labels provide a baseline, or ground truth, against which learnt models can be assessed. Whilst labelling may be trivial in some contexts, this is not always the case. For example, labelling may require a consensus of opinion for use in medical prognosis. In such cases a process to ensure consistent labelling should be developed, documented and enacted.
A set of images from retinal scans may be examined by clinical professionals to provide labels which reflect the appropriate diagnosis. A more advanced labelling process may be necessary if the system is required to identify regions of the image which are to be referred to experts. In this case labelling requires a region of the image to be specified as a closed boundary around those pixels in the image which relate to the region of concern.
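The region-based labelling described above can be illustrated with a simple annotation structure: the diagnosis label is paired with a closed boundary given as a list of pixel coordinates, and a standard ray-casting test decides which pixels fall inside the region. The annotation fields and coordinates below are hypothetical.

```python
def point_in_region(point, boundary):
    """Ray-casting test: is a pixel inside a closed polygon boundary?

    Casts a ray to the right of the point and counts boundary crossings;
    an odd count means the point is inside the region.
    """
    x, y = point
    inside = False
    n = len(boundary)
    for i in range(n):
        x1, y1 = boundary[i]
        x2, y2 = boundary[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical annotation for one retinal scan: a diagnosis label plus a
# referral region specified as a closed boundary of pixel coordinates.
annotation = {
    "diagnosis": "refer",
    "region": [(10, 10), (40, 10), (40, 30), (10, 30)],
}
```

Storing the region as an explicit boundary, rather than a per-pixel mask, keeps the annotation compact and lets the consistency of different clinicians' boundaries be compared directly during the labelling process.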
Data augmentation shall be undertaken to allow for the addition of data where it is infeasible to gather sufficient samples from the real world. This may occur when the real-world system does not yet exist or where collecting such data would be too dangerous or prohibitively expensive. In such cases, the data sets shall be augmented with data that is either derived from existing samples or collected from systems that act as a proxy for the real world.
The field of computer vision has developed sophisticated models of environmental conditions and, by collecting an image of an object under one controlled lighting condition, it is possible to augment the dataset with examples of that object under many simulated lighting conditions. The data generation log should document this augmentation process and provide a justification that the simulated conditions are a sufficient reflection of reality.
An ML classifier for cancer accepts chest X-rays of patients. The ML safety requirement states that the classifier should be robust to rotations in the image of up to 15 degrees. Each sample in the collected dataset may be rotated in 1 degree increments and labelled with the original image's labels, appropriately translated.
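The rotation augmentation above can be sketched as follows. This is a minimal nearest-neighbour rotation on images represented as nested lists; the function names and the sample data are illustrative, and for a plain classification label no geometric translation of the label is needed (a bounding-box or region label would need the same rotation applied to it).

```python
import math


def rotate_image(image, degrees):
    """Nearest-neighbour rotation of a 2D pixel grid about its centre.

    Output pixels whose source would fall outside the image are filled
    with 0 (black), a common default for X-ray backgrounds.
    """
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel for each output pixel.
            sx = cos_t * (x - cx) + sin_t * (y - cy) + cx
            sy = -sin_t * (x - cx) + cos_t * (y - cy) + cy
            si, sj = round(sy), round(sx)
            if 0 <= si < h and 0 <= sj < w:
                out[y][x] = image[si][sj]
    return out


def augment_with_rotations(samples, max_degrees=15):
    """Add rotated copies of each (image, label) pair in 1-degree steps."""
    augmented = list(samples)
    for image, label in samples:
        for deg in range(-max_degrees, max_degrees + 1):
            if deg != 0:
                augmented.append((rotate_image(image, deg), label))
    return augmented
```

Each original sample thus yields 30 additional samples (from -15 to +15 degrees, excluding 0), and the augmentation step, together with its justification, would be recorded in the data generation log.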
Verification data is gathered with the aim of testing the models to breaking point. This requires a different mindset from the team collecting data for verification: they are focused not on creating a model but on finding realistic ways in which the model may fail when used in an operational system. Furthermore, the nature of ML is such that any single sample may be encoded into the training set and a specific model found that avoids the failure associated with that sample. This does not mean that the resultant model is robust to the more general class of failure to which the sample belongs. It is imperative, therefore, that information concerning the verification data is hidden from the developers, to ensure the models generated are robust to the whole class of failures and not just the specific examples present in the verification data.
The dimensions of variation are not independent and, as such, combinations of difficult situations are less likely to be included in a development dataset that aims to represent normal operating behaviour. For example, a vehicle is unlikely to be using high beam in foggy conditions on a rainy day where ice is present on the road and a vehicle is approaching on the incorrect side of the carriageway. As such this case, although within the operating domain of the vehicle, is unlikely to be found in the development set. A good verification dataset should particularly focus on challenging conditions that are within the operating domain, and therefore such a case may well be present in the verification dataset.