The ML data validation activity shall check that the three generated data sets are sufficient to meet the ML data requirements. The results of the data validation activity shall be explicitly documented ([S]). Data validation shall consider the relevance, completeness, accuracy, and balance of the data sets.

Discrepancies identified between the data generated and the ML data requirements shall be justified. These justifications shall be captured as part of the data validation results ([S]).

Note 22 - Financial and practical concerns

Both financial and practical concerns can lead to data sets that are not ideal and, in such cases, a clear rationale shall be provided. For example, a young child crossing in front of a fast-moving car may be a safety concern but gathering data for such events is not practicable.

Example 19 - JAAD open dataset (Automotive)

The JAAD open dataset [56] is used as development data for an ML component used to detect pedestrians. Gathering and processing data for road crossings is expensive, and substantial effort has been undertaken to generate the JAAD dataset. The labelling of pedestrians and the range of poses observed is extensive and is clearly relevant for a perception pipeline concerned with the identification of pedestrians. The range of crossing types observed is limited, however, and a justification may be required as to why this is relevant for the intended deployment.

Validation of data relevance shall consider the gap between the samples obtained and the real-world environment in which the system is to be deployed. Validation shall consider each of the sub‐activities undertaken in data generation and provide a clear rationale for their use.

Note 23 - Simulation for data augmentation

Any simulation used for data augmentation is necessarily a simplification of the real world, with assumptions underpinning the models used. Validating relevance, therefore, requires the gaps between the simulation and the real world it models to be identified, together with a demonstration that these gaps are not material in the construction of a safe system.

Note 24 - Context-specific features

Validation should demonstrate that context‐specific features defined in the ML safety requirements are present in the collected datasets. For example, for a pedestrian detection system for deployment on European roads, the images collected should include road furniture of the types that would be found in the anticipated countries of deployment.

Example 20 - Using US data in the UK (Healthcare)

Where data gathered in US hospitals is used for a UK prognosis system, the validation should state how local demographics, policies and equipment vary between the two countries and the impact of such variance on data validity.

Example 21 - Data collection using controlled trials (Healthcare)

When data is collected using controlled trials (e.g. for medical imaging), a decision may be made to collect samples using a machine set up away from the hospital and operated by non‐medical staff. The samples may only be considered relevant if an argument can be made that the environmental conditions do not impact the samples obtained and that the experience of the staff has no effect on the samples collected.

Validation of data completeness shall demonstrate that the collected data sufficiently covers all the dimensions of variation stated in the ML safety requirements. Given the combinatorial nature of input features, validation shall seek to systematically identify areas that are not covered.

Note 25 - Dimensions of variability

As the number of dimensions of variability increases, and as the granularity with which these dimensions are encoded increases, the space that must be validated grows combinatorially.

Note 26 - Quantisation

For continuous variables, the number of possible values is infinite. One possible approach is to use quantisation to map the continuous variables to a discrete space which may be more readily assessed. Where quantisation is employed it should be accompanied by an argument concerning the levels used.
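The quantisation described in this note can be sketched as follows. This is a minimal illustration only: the bin edges, the level names and the function names are assumptions made for the example and are not prescribed by this guidance. Any levels actually chosen would still need the supporting argument described above.

```python
from bisect import bisect_right

# Hypothetical interior bin boundaries for a continuous measurement
# in the range 0 to 100. These values are illustrative assumptions.
EDGES = [25, 50, 75]
LEVEL_NAMES = ["low", "medium", "high", "very high"]


def quantise(value, edges):
    """Return the index of the bin that `value` falls into.

    `edges` is a sorted list of interior boundaries. Bin 0 holds values
    below edges[0]; the last bin holds values at or above edges[-1].
    """
    return bisect_right(edges, value)


def level_name(value):
    """Map a continuous value to its (hypothetical) named level."""
    return LEVEL_NAMES[quantise(value, EDGES)]
```

Once every continuous variable is mapped to a small number of levels in this way, the input space becomes a finite grid that can be enumerated and checked for coverage.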

Example 22 - Completeness of datasets (Automotive)

Consider a system to classify road signs into 43 separate classes. The dimensions of variability are: weather, time of day, and levels of partial occlusion up to 70%.

Let us assume that we have categorised each dimension as:

Time: early morning, mid morning, noon, late afternoon, evening, late evening, night
Weather: clear, rain light, rain heavy, fog light, fog heavy, snow light, snow heavy
Occlusion (%): 0, 10, 20, 30, 40, 50, 60, 70

Validation may check whether there are samples for each of the 43 x 7 x 7 x 8 = 16856 possible combinations. A systematic validation process will identify combinations that are missing from the datasets (e.g. no samples containing a 40 mph sign in light rain with 50% occlusion early in the morning). Although completeness is not achievable for most practical systems, this process should provide evidence of which areas are incomplete and why this is not problematic for assuring the resultant system.
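A systematic coverage check of this kind can be sketched in a few lines. The sketch below is illustrative only: it assumes each sample has already been reduced to a (sign class, time, weather, occlusion) tuple, and the function name and the use of integer class identifiers 0 to 42 are assumptions made for the example.

```python
from itertools import product

# Categories taken from the example above.
TIMES = ["early morning", "mid morning", "noon", "late afternoon",
         "evening", "late evening", "night"]
WEATHER = ["clear", "rain light", "rain heavy", "fog light",
           "fog heavy", "snow light", "snow heavy"]
OCCLUSION = [0, 10, 20, 30, 40, 50, 60, 70]
SIGN_CLASSES = range(43)  # assumption: classes identified by integers 0..42


def missing_combinations(samples):
    """Return every (class, time, weather, occlusion) combination that
    has no sample.

    `samples` is an iterable of tuples in the same order as above.
    """
    observed = set(samples)
    all_combinations = product(SIGN_CLASSES, TIMES, WEATHER, OCCLUSION)
    return [combo for combo in all_combinations if combo not in observed]
```

The returned list gives the uncovered areas directly, so each entry can either be addressed by further data generation or recorded with a justification in the validation results.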

Validation of data balance shall consider the distribution of samples in the data set. It is easiest to consider balance from a supervised classification perspective where the number of samples associated with each class is a key consideration.

Note 27 - Balance of datasets

At the class level, assessing balance may be a simple case of counting the number of samples in each class. This approach becomes more complex, however, when considering the dimensions of variation, where specific combinations are relatively rare. More generally, data validation shall include statements regarding class balance and feature balance for supervised learning tasks.

Note 28 - Imbalanced dataset

Certain classes may naturally be less common and, whilst techniques such as data augmentation may help, it may be difficult, or even impossible, to obtain a truly balanced set of classes. In such cases, the imbalance shall be noted and a justification provided as part of the validation results to support the use of imbalanced data in the operational context.

Example 23 - Occlusion (Automotive)

Using the previous example, we may count the number of samples at each level of occlusion to ensure that each level is appropriately represented in the data sets.
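The counting step can be sketched as below. This is an illustration under stated assumptions: each sample is assumed to carry a quantised occlusion level, and the `min_count` threshold is a hypothetical parameter, since what counts as "appropriately represented" has to be argued for the system in question.

```python
from collections import Counter

# Occlusion levels (%) from the road sign example.
OCCLUSION_LEVELS = [0, 10, 20, 30, 40, 50, 60, 70]


def occlusion_counts(sample_levels):
    """Count the samples at each expected occlusion level,
    reporting zero for levels with no samples."""
    counts = Counter(sample_levels)
    return {level: counts[level] for level in OCCLUSION_LEVELS}


def under_represented(sample_levels, min_count):
    """List the levels with fewer than `min_count` samples.

    `min_count` is a hypothetical threshold chosen for illustration.
    """
    counts = occlusion_counts(sample_levels)
    return [level for level, n in counts.items() if n < min_count]
```

Levels flagged in this way are candidates for further data collection or augmentation, or for a recorded justification where the imbalance is accepted.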

Validation of data accuracy shall consider the extent to which the data samples, and the metadata added to the set during preprocessing (e.g. labels), faithfully represent the ground truth associated with the samples. Evidence supporting the accuracy of data may be gathered through a combination of the following:

An analysis of the processes undertaken to collect data (e.g. a bush fire detection system using satellite imagery could ensure that at least three users have agreed on the label for each sample).
Checking subsets of samples by expert users (e.g. where MRI images are generated with augmentation to simulate varying patient orientation within the scanner field an expert clinician will review a random sample of the resulting images to ensure that they remain credible).
Ensuring diversity of data sources to avoid systematic errors in the data sets (e.g. data for use in an earthquake detection system should make use of multiple sensors and locations such that sensor drift or atmospheric effects may be identified).
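The labelling check mentioned in the first item (at least three users agreeing on a label) can be sketched as follows. The function name, the label strings and the choice to return `None` for samples without sufficient agreement are assumptions made for illustration.

```python
from collections import Counter


def agreed_label(labels, min_agreement=3):
    """Return the most common label if at least `min_agreement`
    annotators chose it; otherwise return None.

    `labels` holds the labels assigned to one sample by different users.
    """
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_agreement else None
```

Samples for which this returns `None` would be set aside for further review rather than being included with an unverified label.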

Where existing data sets are re‐used (e.g. the JAAD pedestrian data set [56]), documentation concerning the collection and labelling process may be available. Even under these conditions, additional validation tasks may be required to ensure that the labels are sufficient for the context into which the model is to be deployed.

