Datasets and evaluation

The data comes from three wind farms and consists of 10-minute time series of sensor readings and status IDs. It is divided into training and evaluation datasets. In addition to the operational data, we provide event information for the training data, i.e., whether the prediction period contains anomalies that lead to a failure or maintenance action, or whether it contains only normal behaviour.

For each wind farm there are multiple datasets, each containing a continuous multivariate time series of one wind turbine. These datasets contain both training data (one year before the test period) and prediction data (indicated by the column 'train_test'). The sensor measurements are 10-minute averages, minima, maxima and standard deviations of SCADA variables. The status ID indicates whether the wind turbine was in normal operation, idling, etc.:

| status type ID | status type | description | considered normal |
|---|---|---|---|
| 0 | Normal operation | Normal operation without limitations | True |
| 1 | Derated operation | Derated power generation with a power restriction | False |
| 2 | Idling | Asset is idling and waits to operate again | True |
| 3 | Service | Asset is in service mode / service team is at the site | False |
| 4 | Downtime | Asset is down due to a fault or other reasons | False |
| 5 | Other | Other operational states | False |

Status types

Note that the status values may be inconsistent; often the status is only sent when it changes, which may fail if there is a brief communication error. It is therefore advisable to check the power and wind speed values in addition to the status values to determine whether the turbine has indeed been operating normally.
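A minimal sketch of such a cross-check, assuming pandas. The 'status_type_id' column is described below, but the power and wind-speed feature names are anonymized in the data, so `power_col` and `wind_col` here are placeholders that must be resolved via the feature description file:

```python
import pandas as pd

NORMAL_STATUS_IDS = {0, 2}  # normal operation and idling

def is_operating_normally(df: pd.DataFrame, power_col: str, wind_col: str,
                          cut_in_wind: float = 3.0) -> pd.Series:
    """Flag timestamps where the turbine plausibly operated normally.

    Treat a timestamp as normal only if the status says so AND the turbine
    is actually producing power at reasonable wind speeds; otherwise the
    status flag may be stale due to a communication error.
    """
    status_ok = df["status_type_id"].isin(NORMAL_STATUS_IDS)
    # Above the (assumed) cut-in wind speed, a normally operating turbine
    # should produce positive power; below cut-in, zero power is expected.
    producing = (df[power_col] > 0) | (df[wind_col] < cut_in_wind)
    return status_ok & producing
```

The cut-in wind speed of 3 m/s is an illustrative default, not a value given by the dataset.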

Since we want the models to detect anomalies with as few false alarms as possible, the datasets contain both anomaly events and 'normal' events, meaning the prediction period can also contain normal behaviour only.


The task is to predict, as early as possible and with as few false alarms as possible, whether the prediction data of the evaluation files contains anomalies that indicate a failure.

To validate your model(s), event information is available for the training data. This includes the following information:

  • event_id: ID of the event.
  • event_label: Indicates whether this event contains anomalies that lead to a failure.
  • event_start: Start time stamp of the event.
  • event_end: End time stamp of the event.
  • event_start_id: ID of the start time stamp of the event.
  • event_end_id: ID of the end time stamp of the event.
  • description: Short description of the failure, if available.
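As a sketch, the event table can be loaded with pandas and filtered down to the anomaly events (the path below is an assumed local layout):

```python
import pandas as pd

def load_anomaly_events(event_info: pd.DataFrame) -> pd.DataFrame:
    """Filter the event table down to events labelled as anomalies."""
    return event_info[event_info["event_label"] == "anomaly"]

# Usage, assuming the directory layout described below:
# event_info = pd.read_csv("data/Wind Farm A/event_info.csv")
# anomalies = load_anomaly_events(event_info)
```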


All data is contained in the 'data' directory, which holds a varying number of datasets for each of the three wind farms: Wind Farm A, Wind Farm B and Wind Farm C.

Each wind farm has its own directory where the file structure is given by:

  • Wind Farm A:
    • evaluation
      • <event_id>.csv
      • <event_id>.csv
    • train
      • <event_id>.csv
      • <event_id>.csv
    • event_info.csv
    • feature_description.csv

In addition, we provide a quick-start Jupyter notebook, which shows examples of data loading, data exploration, model training and an evaluation based on the training data.


The directories 'evaluation' and 'train' both contain datasets, and every dataset contains one event that is to be predicted. Each event data CSV file contains the time series of one wind turbine, comprising both training and prediction data (indicated by the column 'train_test'). The datasets are all high-dimensional, with 80, 250 or 950 features (depending on the asset type). The prediction data contains the event to be predicted. The features are given as 10-minute average sensor measurements; for some sensors, additional information in the form of the 10-minute minimum, maximum and standard deviation is available. The datasets also contain columns for the status ID ('status_type_id'), the asset ID ('asset_id'), the timestamp ('time_stamp') and a row ID ('id'). The sensor data and time stamps are anonymized.
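Splitting one event dataset into its training and prediction parts via the 'train_test' column might look like this sketch (the exact string values in the column are an assumption; check one file to confirm):

```python
import pandas as pd

def split_event(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split an event dataset into (training data, prediction data)."""
    train = df[df["train_test"] == "train"]
    # Everything that is not training data belongs to the prediction period.
    prediction = df[df["train_test"] != "train"]
    return train, prediction
```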

Event information

For each wind farm there is an event information CSV file ('event_info.csv'). It provides additional information for all events in the 'train' directory and contains the columns 'event_id' (file name of the dataset containing the event), 'event_label' (either normal or anomaly), 'event_start', 'event_end' and 'event_description'.

Feature description

The features in the <event_id>.csv files are described in the feature description file ('feature_description.csv'). It contains a short description of each feature.


For the evaluation of the round-robin test, we use the files in the 'evaluation' directory. All predictions for the evaluation datasets should be collected in one results file.

To evaluate the results, the ground truth for each event is defined as follows:

  1. Only take into account timestamps with a normal operational mode (i.e. status type ID equals 0 or 2).
  2. Timestamps are labelled True (anomaly) for anomaly events between event_start and event_end.
  3. All other timestamps are labelled according to the Status Type ID (see column ‘considered normal’).
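The three rules above can be sketched as follows for a single event dataset (assuming pandas; `event_start` and `event_end` come from event_info.csv, and labels use True = anomaly):

```python
import pandas as pd

NORMAL_STATUS_IDS = {0, 2}  # the status types considered normal operation

def ground_truth(df: pd.DataFrame, is_anomaly_event: bool,
                 event_start, event_end) -> pd.Series:
    """Per-timestamp ground-truth labels (True = anomaly) for one event."""
    # Rule 1: only consider timestamps in a normal operational mode.
    eval_df = df[df["status_type_id"].isin(NORMAL_STATUS_IDS)]
    labels = pd.Series(False, index=eval_df.index)
    # Rule 2: during an anomaly event, those timestamps are labelled True.
    if is_anomaly_event:
        labels |= eval_df["time_stamp"].between(event_start, event_end)
    # Rule 3: all remaining timestamps keep the "considered normal" flag of
    # their status type; status 0 and 2 are considered normal, so they stay False.
    return labels
```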

The final score is calculated as a weighted average of the following four sub-scores:

  1. F-beta Score (FBeta)
    Applies sklearn.metrics.fbeta_score to a given prediction and ground truth for each anomaly event, with beta = 0.5, since we value precision over recall. The F-beta score measures the classification performance of the model on datasets containing an anomaly event.
  2. Accuracy Score (Acc)
    Applies sklearn.metrics.accuracy_score to a given prediction and ground truth for each normal event. The accuracy measures the ability of the model to correctly recognize normal behaviour and not to raise false alarms.
  3. Eventwise F-Score (EFS)
    In order to measure the model performance in an operative health monitoring setting, we introduce a rule that decides whether the model has detected an anomaly event within the dataset. This rule calculates a so-called maximum criticality value based on the model prediction for each dataset. The criticality is calculated for each time stamp of the prediction data: it starts at 0, increases by 1 if an anomaly was detected during a normal operation mode, and decreases by 1 if no anomaly was detected during a normal operation mode. The criticality cannot decrease below 0. If the operation mode is not normal, the criticality remains stationary. If the maximum of the criticality time series is greater than or equal to 72 (this equates to 12 hours of consecutive anomalies), the model prediction is counted as a detected anomaly event. If the maximum criticality is below 72, the model prediction is counted as a normal event.
    The eventwise F-score is the F-beta score (beta = 0.5) over the anomaly event and normal event predictions, using the true event labels as ground truth.
  4. Weighted Score (WS)
    The weighted score assigns higher scores to anomalies detected in the first half of an anomaly event than to anomalies detected near the end. The WS is the normalised weighted sum of all correctly detected anomaly timestamps divided by the total number of anomaly timestamps based on the ground truth of the event. This score is calculated only for datasets containing an anomaly event. It expresses the ability of the model to detect anomalies as early as possible.

The final score is a normalised combination of Acc for normal events, FBeta and WS for anomaly events and the EFS over all events:

Final Score = (FBeta_average + 2 · Acc_average + EFS + WS_average) / 5

where FBeta_average is the average FBeta score over all anomalous events, Acc_average the average accuracy over all normal events, and WS_average the average weighted score over all anomalous events. Accuracy is weighted twice as high as the other sub-scores in order to give datasets that contain only normal behaviour the same importance as datasets containing an anomaly event.
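The combination above, assuming the four sub-score averages have already been computed as described:

```python
def final_score(fbeta_avg: float, acc_avg: float, efs: float, ws_avg: float) -> float:
    """Combine the four sub-score averages into the final score.

    Accuracy is weighted twice so that normal-only datasets carry the same
    total weight as anomaly datasets; dividing by 5 normalises to [0, 1].
    """
    return (fbeta_avg + 2 * acc_avg + efs + ws_avg) / 5
```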
