Datasets and Evaluation

The data comes from three wind farms and consists of 10-minute time series of sensor readings and status IDs. It is divided into training and evaluation datasets. In addition to the operational data, we provide event information for the training data, i.e., whether the prediction period contains anomalies that lead to a failure or maintenance action, or whether it contains only normal behaviour.

For each wind farm, there are multiple datasets, each containing a continuous multivariate time series of one wind turbine. These datasets contain both training data (one year before the test period) and prediction data (indicated by the column train_test). The sensor measurements are 10-minute averages, minima, maxima and standard deviations of SCADA variables. The status ID indicates whether the wind turbine was in normal operation, idling, etc.:

status type ID | status type       | description                                            | considered normal
0              | Normal operation  | Normal operation without limitations                   | True
1              | Derated operation | Derated power generation with a power restriction      | False
2              | Idling            | Asset is idling and waits to operate again             | True
3              | Service           | Asset is in service mode / service team is at the site | False
4              | Downtime          | Asset is down due to a fault or other reasons          | False
5              | Other             | Other operational states                               | False

Status types

Note that the status values may be inconsistent; often the status is only sent when it changes, and such an update can be lost during a brief communication error. It is therefore advisable to check the power and wind speed values in addition to the status values to determine whether the turbine has indeed been operating normally.
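
Such a plausibility check can be sketched as below. The power and wind-speed column names are hypothetical placeholders (the actual feature names are anonymized), and the cut-in speed is an illustrative default:

```python
import pandas as pd

def plausibly_normal(df: pd.DataFrame,
                     status_col: str = "status_type_id",
                     power_col: str = "power_avg",      # hypothetical column name
                     wind_col: str = "wind_speed_avg",  # hypothetical column name
                     cut_in: float = 3.0) -> pd.Series:
    """Flag rows that look like genuine normal operation.

    A row must report a 'considered normal' status (0 or 2) AND, when the
    wind is above the cut-in speed, actually produce power. This guards
    against stale status values caused by communication errors.
    """
    normal_status = df[status_col].isin([0, 2])
    producing = (df[wind_col] < cut_in) | (df[power_col] > 0)
    return normal_status & producing
```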

Since we want the models to detect anomalies with as few false alarms as possible, the datasets contain both anomaly events and ‘normal’ events, i.e., events whose prediction period contains only normal behaviour.


The task is to predict, as early as possible and with as few false alarms as possible, whether the prediction data of the evaluation files contains anomalies that indicate a failure.

To validate your model(s), event information is available for the training data. This includes the following information:

  • event_id: ID of the event.
  • event_label: indicates whether this event contains anomalies that lead to a failure.
  • event_start: Start timestamp of the event.
  • event_end: End timestamp of the event.
  • description: Short description of the failure, if available.
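
A minimal sketch of loading this event information and separating the two event classes for validation; the path is a hypothetical placeholder for your local copy of the data:

```python
import pandas as pd

def load_event_info(farm_dir: str) -> pd.DataFrame:
    """Load event_info.csv for one wind farm and parse the event timestamps."""
    return pd.read_csv(f"{farm_dir}/event_info.csv",
                       parse_dates=["event_start", "event_end"])

def split_by_label(info: pd.DataFrame):
    """Separate anomaly events from normal events, e.g. to validate a
    model against both failure cases and pure normal behaviour."""
    anomalies = info[info["event_label"] == "anomaly"]
    normals = info[info["event_label"] == "normal"]
    return anomalies, normals
```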


All data is contained in the ‘data’ directory. The data consists of a varying number of events for each of the three wind farms Wind Farm A, Wind Farm B and Wind Farm C.

Each wind farm has its own directory where the file structure is given by:

  • Wind Farm A:
    • evaluation
      • <event_id>.csv
      • <event_id>.csv
    • train
      • <event_id>.csv
      • <event_id>.csv
    • event_info.csv
    • feature_description.csv

In addition, we provide a quick-start Jupyter notebook, which shows examples of data loading, data exploration, model training and an evaluation based on the training data.


The directories ‘evaluation’ and ‘train’ both contain data for different events. Each event data CSV file contains time series of one wind turbine, comprising both training and prediction data (indicated by the column ‘train_test’). The datasets are all high-dimensional, with 80, 250 or 950 features (depending on the asset type). The prediction data contains the event to be predicted. In addition to the features, which are given as 10-minute min, max, avg and std of sensor measurements, there are columns for the status ID (status_type_id), the asset ID (asset_id), the timestamp (time_stamp) and a row ID (id). The sensor data and timestamps are anonymized.
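
Splitting one event file into its training and prediction parts can be sketched as below; the exact values stored in the ‘train_test’ column (assumed here to be "train" and "prediction") should be checked against the data:

```python
import pandas as pd

def split_event_file(df: pd.DataFrame):
    """Split one event dataset into its training and prediction parts
    using the 'train_test' column. The values "train"/"prediction" are
    an assumption; verify them against the actual files.
    """
    train = df[df["train_test"] == "train"]
    prediction = df[df["train_test"] == "prediction"]
    return train, prediction
```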

Event information

For each wind farm there is an event information CSV file (‘event_info.csv’). It gives additional information for all events in the ‘train’ directory and contains the columns ‘event_id’ (file name of the dataset containing the event), ‘event_label’ (either normal or anomaly), ‘event_start’, ‘event_end’ and ‘event_description’.

Feature description

The features in the <event_id>.csv files are described in the feature description file (‘feature_description.csv’), which contains a short description of each feature.


For the evaluation of the round-robin test, we use the files in the ‘evaluation’ directory. All predictions from the evaluation datasets should be collected in one result file.

To evaluate the results, the ground truth for each event is defined as follows:

  1. Only take into account timestamps with a normal operational mode (i.e., status type ID equals 0 or 2).
  2. Timestamps are labelled True (anomaly) for anomaly events between event_start and event_end.
  3. All other timestamps are labelled according to the Status Type ID (see column ‘considered normal’).
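
The three rules above can be sketched as follows. Note that after rule 1 only statuses 0 and 2 remain, both of which are ‘considered normal’, so timestamps outside an anomaly window are labelled False:

```python
import pandas as pd

def ground_truth(df: pd.DataFrame, event_label: str,
                 event_start, event_end) -> pd.Series:
    """Build per-timestamp ground truth for one event.

    Returns a boolean Series (True = anomaly), restricted to timestamps
    in a normal operational mode (status type ID 0 or 2).
    """
    # Rule 1: keep only timestamps with status 0 or 2.
    scored = df[df["status_type_id"].isin([0, 2])]
    # Rule 2: anomaly events are True between event_start and event_end.
    in_event = scored["time_stamp"].between(event_start, event_end)
    if event_label == "anomaly":
        return in_event
    # Rule 3: remaining statuses (0 and 2) are 'considered normal'.
    return pd.Series(False, index=scored.index)
```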

The final score is calculated from the following four sub-scores:

  1. F-beta Score (FBeta)
    Applies sklearn.metrics.fbeta_score to a given prediction and ground truth for each anomaly event, with beta = 0.5, since we value precision over recall.
  2. Accuracy Score (Acc)
    Applies sklearn.metrics.accuracy_score to a given prediction and ground truth for each normal event.
  3. Eventwise F-Score (EFS)
    If more than 10% of the timestamps of an event are predicted to be anomalies, the event is considered a predicted anomaly event; if 10% or fewer are predicted to be anomalies, it is considered a predicted normal event. The eventwise F-score is the F-beta score over these anomaly-event and normal-event predictions, using the event labels as ground truth.
  4. Weighted Score (WS)
    The weighted score assigns higher scores to anomalies detected in the first half of an anomaly event than to anomalies detected near the end. The WS is the normalised weighted sum of all correctly detected anomaly timestamps divided by the total number of anomaly timestamps based on the ground truth of the event. This score is calculated for anomaly events only.

The final score is a normalised combination of Acc for normal events, FBeta and WS for anomaly events and the EFS over all events:

Final Score = (FBeta_avg + 2 * Acc_avg + EFS + WS_avg) / 5

where FBeta_avg is the average FBeta score over all anomaly events, Acc_avg is the average accuracy over all normal events, and WS_avg is the average weighted score over all anomaly events.
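
The combination above is simple arithmetic and can be written as:

```python
import numpy as np

def final_score(fbeta_scores, acc_scores, efs, ws_scores) -> float:
    """Combine the sub-scores as in the formula above:
    (mean FBeta + 2 * mean Acc + EFS + mean WS) / 5.

    fbeta_scores and ws_scores are per-anomaly-event scores,
    acc_scores are per-normal-event accuracies, efs is a single value.
    """
    return (np.mean(fbeta_scores)
            + 2 * np.mean(acc_scores)
            + efs
            + np.mean(ws_scores)) / 5
```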