For every ML safety requirement at least one verification activity shall be undertaken. The results of verification for each requirement shall be recorded in the ML verification results ([Z]).
All verification activities shall be sufficiently independent from the development activities. A log ([AA]) shall be created that documents the measures taken to verify the ML model, including those measures taken to ensure that data used in verification was not exposed to the development team.
Independence should be proportionate to the criticality of the ML model. In addition to different degrees of independence between the development and verification datasets (e.g. datasets compiled from one or more hospitals), further independence may be necessary between the development and verification engineers at the team or organisational levels.
One of the aims of model verification is to show that the performance of the model with respect to ML safety requirements encoded as metrics, such as precision and recall, is maintained when the model is subjected to inputs not present in the development data. A model that continues to perform when presented with data not included in the development set is known in the ML community as generalisable. Failures to generalise can be due to a lack of feature coverage in the development data, or to a lack of robustness to those perturbations which may be considered to be noise (i.e. small changes to a data sample that meets the performance specification in the absence of such noise).
Test‐based verification utilises the verification data to demonstrate that the model generalises to cases not present in the model learning stage. This shall involve an independent examination of the properties considered during the model learning stage. Specifically, those safety requirements associated with ensuring the robustness of models are evaluated on the independent verification dataset (i.e. it is shown that performance is maintained in the presence of adverse conditions or signal perturbations). The test team should examine those cases that lie on boundaries or which are known to be problematic in the context in which the model is to be deployed.
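Evaluating a requirement encoded as precision and recall thresholds on the independent verification dataset can be sketched as follows. The model predictions, labels, and threshold values are illustrative assumptions; the outcome is what would be recorded in the ML verification results.

```python
# Sketch: checking a safety requirement encoded as minimum precision and
# recall against an independent verification dataset. Labels, predictions,
# and thresholds below are hypothetical.

def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def verify_requirement(y_true, y_pred, min_precision, min_recall):
    """Return (passed, metrics) for recording in the ML verification results."""
    p, r = precision_recall(y_true, y_pred)
    passed = p >= min_precision and r >= min_recall
    return passed, {"precision": p, "recall": r}

# Hypothetical verification-set labels and model predictions:
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
passed, results = verify_requirement(y_true, y_pred,
                                     min_precision=0.75, min_recall=0.7)
```

Note that the pass/fail outcome is recorded per requirement, alongside the observed metric values, so that the evidence can be inspected independently of the development team.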
If the nature of the model results in a verification test that is unable to determine if the model satisfies the safety requirement, it may be necessary to augment the verification dataset to demonstrate definitively whether the requirement is met. This may arise, for example, due to non‐linearities in a model.
A neural network that predicts the stopping distance of a vehicle in given environmental conditions is tested across a range of linearly spaced values (temperature, humidity, precipitation). The results show that the model operates safely at all values except three distinct points in the range. Further verification data may need to be gathered around these points to determine the exact conditions leading to safety violations.
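A sweep of this kind can be sketched as below. The stand-in model, the safety limit, and the deliberate non-linear "spike" are all hypothetical; the point is that flagged grid points indicate where further verification data should be gathered.

```python
# Sketch of the verification sweep described above: evaluate a stopping-
# distance model over linearly spaced environmental values and flag points
# where a safety limit is exceeded. The model and limit are hypothetical.

def linspace(lo, hi, n):
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def stopping_distance_m(precipitation_mm_h):
    # Stand-in for the learnt model: distance grows with precipitation,
    # with a spurious spike near 7 mm/h to mimic a non-linearity.
    base = 35.0 + 2.0 * precipitation_mm_h
    spike = 25.0 if abs(precipitation_mm_h - 7.0) < 0.5 else 0.0
    return base + spike

SAFETY_LIMIT_M = 60.0  # hypothetical requirement: stop within 60 m

violations = [p for p in linspace(0.0, 10.0, 21)
              if stopping_distance_m(p) > SAFETY_LIMIT_M]
# Violations cluster around the spike; further verification data would be
# gathered around these points to characterise the unsafe region.
```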
It is important to ensure that the verification data is not made available to the development team, since oversight of the verification data is likely to lead to development‐time techniques that circumvent specific samples in the verification set rather than addressing the problem of generalisation more widely.
The ML verification results should evaluate test completeness with respect to the dimensions of variability outlined in the ML safety requirements. This is directly related to the desire for data completeness outlined in Stage 3 of the process.
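A completeness check of this kind can be sketched as a coverage matrix over the dimensions of variability. The dimensions, bins, and tagged samples below are illustrative assumptions.

```python
# Sketch: assessing test completeness against the dimensions of variability
# from the ML safety requirements. Dimensions, bins, and samples are
# hypothetical; each verification sample is tagged per dimension.

from itertools import product

dimensions = {
    "lighting": ["day", "night"],
    "weather": ["clear", "rain", "fog"],
}

samples = [
    {"lighting": "day", "weather": "clear"},
    {"lighting": "day", "weather": "rain"},
    {"lighting": "night", "weather": "clear"},
]

required = set(product(*dimensions.values()))
covered = {tuple(s[d] for d in dimensions) for s in samples}
missing = sorted(required - covered)
coverage = len(covered) / len(required)
# 'missing' lists the untested combinations (e.g. night-time fog) that the
# completeness argument in the ML verification results must address.
```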
Since we know that material on a camera lens can lead to blurring in regions of an image, we may make use of ‘contextual mutators’ to assess the robustness of a neural network with respect to levels of blur. In this way the level of blur that can be accommodated can be assessed and related to contextually meaningful measures.
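A minimal sketch of this idea is given below, with a one‐dimensional box blur standing in for the contextual mutator and a trivial edge "detector" standing in for the network; both are illustrative assumptions, as is the detection threshold.

```python
# Sketch of a contextual mutator: apply increasing levels of blur and find
# the highest level at which performance stays above a threshold. The 1-D
# box blur, the edge "detector", and the threshold are all hypothetical.

def box_blur(signal, radius):
    """Box blur with the given radius (radius 0 leaves the signal unchanged)."""
    if radius == 0:
        return list(signal)
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out

def edge_strength(signal):
    # Stand-in for the model output: maximum adjacent-sample contrast.
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))

step_edge = [0.0] * 8 + [1.0] * 8  # a sharp edge in a 1-D "image"
THRESHOLD = 0.3                    # hypothetical detection requirement

tolerated = max(r for r in range(0, 6)
                if edge_strength(box_blur(step_edge, r)) >= THRESHOLD)
# 'tolerated' is the largest blur radius at which the edge is still
# detected; mapping radius to lens contamination gives the contextual bound.
```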
For some safety‐related properties, such as interpretability, it may be necessary to include a human‐in‐the‐loop evaluation mechanism. This may involve placing the component into the application and generating explanations for experts to evaluate.
The DeepMind retinal diagnosis system  generates a segmentation map as part of the diagnosis pipeline. This image may be shown to clinical staff to ensure that the end-user is able to understand the rationale for the diagnosis. As part of a verification stage, these maps may be presented to users without the associated diagnosis to ensure that the images are sufficiently interpretable.
Formal verification uses mathematical techniques to prove that the learnt model satisfies formally‐specified properties derived from the ML safety requirements. When formal verification is applied, counter‐examples are typically produced that demonstrate how the properties are violated. In some cases, these may be used to inform further iterations of requirements specification, data management or model learning.
The formally‐specified properties shall be a sufficient representation of the ML safety requirements in the context of the defined operating environment. An explicit justification shall be documented for the sufficiency of the translation to formal properties.
The number of formal methods available for machine learnt models is increasing rapidly. However, it is not always possible to map formal results to a context that is easily understood by those wishing to deploy the system. Techniques which guarantee point‐wise robustness for individual samples, for example, are mathematically provable. For high‐dimensional input spaces such as images, however, knowing that a network is robust for a single sample if the pixel values change by less than 0.06 does little to inform the user as to the robustness of the model in rain.
Formal verification of neural networks is able to demonstrate that a perturbation of up to a threshold value, ϵ, on all the inputs leads to a classification that is the same as the original sample class [36, 35]. Such a result is only meaningful when the value of ϵ is translated into contextually meaningful values such as steering angles. It is of little value when it simply provides a range of variation in pixel values for a single input sample (point‐wise verification).
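One sound (though conservative) way to establish such an ϵ‐bound is interval bound propagation, sketched below for a tiny two‐layer ReLU network. The weights, input, and ϵ values are illustrative assumptions; practical verifiers (SMT solvers, abstract interpretation) handle far larger models.

```python
# Sketch: a sound but conservative epsilon-robustness check for a tiny
# ReLU network via interval bound propagation. Weights, input, and epsilon
# are hypothetical.

def interval_affine(lo, hi, W, b):
    """Propagate per-input intervals through y = W x + b."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, a, c in zip(row, lo, hi):
            l += w * a if w >= 0 else w * c
            h += w * c if w >= 0 else w * a
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def relu(lo, hi):
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]

def certify(x, eps, W1, b1, W2, b2, true_class):
    """True if every input within +/-eps is provably classified as true_class."""
    lo = [v - eps for v in x]
    hi = [v + eps for v in x]
    lo, hi = relu(*interval_affine(lo, hi, W1, b1))
    lo, hi = interval_affine(lo, hi, W2, b2)
    # Robust if the true logit's lower bound beats every other logit's
    # upper bound; a False result may be a false alarm (over-approximation).
    return all(lo[true_class] > hi[j] for j in range(len(lo)) if j != true_class)

# Hypothetical 2-input network with two output logits:
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
W2, b2 = [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]
x = [1.0, 0.0]  # nominal input, classified as class 0
```

For this toy network, `certify(x, 0.1, W1, b1, W2, b2, 0)` succeeds while `certify(x, 0.3, ...)` does not, illustrating how the certified ϵ must then be translated into contextually meaningful terms to be of use.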
An airborne collision avoidance system, AcasXu, provides steering advisories for unmanned aerial vehicles. A neural network implementation of the system was verified using satisfiability modulo theories (SMT) solvers to ensure that alerting regions are uniform and do not contain inconsistent alerts.
The formal models that are used for verification will require assumptions and abstractions to be made, both with respect to the ML model itself, and with respect to the operating environment. The validity of the formal model shall therefore be demonstrated.