Vadetis is a tool for anomaly detection in time series datasets using different detection algorithms with various settings. You can upload your own data and choose whether to share it with others, or run detection on the publicly available datasets. You can also compare the performance metrics of different algorithms.
The search form in the header navigation lets you search datasets by title, by the title of associated training datasets, and by username.
No. Once you delete your account, all linked information, including datasets and training datasets, is removed from the database.
Although some methods do not require supervised data in general, in this application we validate the performance of the detection and automatically set decision boundaries that maximize a given metric. In order to accomplish these tasks supervised data is required.
Detecting outliers is computationally intensive because a score has to be computed for each point or data instance. A dataset is therefore limited to 100'000 values. Training data must contain more than 100 values and must not exceed 10'000 values. It must also contain at least 5 outlier and 20 normal data instances, the minimum required to split the data into training and validation parts.
All time series must have the same length and same granularity.
Further, the datasets must not contain missing values.
Only univariate time series are supported, therefore values must be of the same domain.
Dataset / Training Dataset:
The file must use semicolons as delimiter and all time series of your dataset must be contained in this single file. The time series names must be distinct within the same dataset. The CSV file must contain a header line of the form:
ts_name;time;unit;value;class
Column | Description | Example |
---|---|---|
ts_name | The name of the time series. | SLFAL1 |
time | An ISO8601 formatted date. | 202004210000 or 2020-04-21T00:00 |
unit | The unit of the series. Note: all time series must have same unit. | Celsius |
value | The value at the given time. | 10.12 |
class | The label that marks the value as outlier or not. 0 for normal data, 1 for anomaly. | 0 |
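
The format rules above can be checked before uploading. The following is a minimal sketch using only the Python standard library; it is not part of Vadetis, and the function name `check_dataset_csv` as well as the second series name `SLFAL2` are hypothetical. It checks the header, missing fields, numeric values, class labels, a single unit, and equal series length.

```python
import csv
import io
from collections import defaultdict

EXPECTED_HEADER = ["ts_name", "time", "unit", "value", "class"]

def check_dataset_csv(text):
    """Return a list of problems found in a semicolon-delimited dataset file."""
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]
    timestamps = defaultdict(list)  # ts_name -> list of time strings
    units = set()
    for line_no, row in enumerate(reader, start=2):
        # Every row needs exactly five non-empty fields (no missing values).
        if len(row) != 5 or any(cell.strip() == "" for cell in row):
            problems.append(f"line {line_no}: missing or extra field")
            continue
        ts_name, time, unit, value, label = row
        try:
            float(value)
        except ValueError:
            problems.append(f"line {line_no}: value is not a number")
        if label not in ("0", "1"):
            problems.append(f"line {line_no}: class must be 0 or 1")
        timestamps[ts_name].append(time)
        units.add(unit)
    if len(units) > 1:
        problems.append("more than one unit")
    if len({len(t) for t in timestamps.values()}) > 1:
        problems.append("time series differ in length")
    return problems

sample = """ts_name;time;unit;value;class
SLFAL1;2020-04-21T00:00;Celsius;10.12;0
SLFAL1;2020-04-21T01:00;Celsius;10.30;0
SLFAL2;2020-04-21T00:00;Celsius;9.80;0
SLFAL2;2020-04-21T01:00;Celsius;25.00;1
"""
print(check_dataset_csv(sample))  # an empty list means no problem was found
```

The sketch does not verify the timestamp format or the size limits; those checks would follow the same pattern.
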
Spatial:
This optional CSV file provides spatial information about the time series, which enables classical LISA computation based on the distance between the locations as weights. The CSV file must contain a header line of the form:
ts_name;l_name;latitude;longitude
Column | Description | Example |
---|---|---|
ts_name | The name of the time series, which is associated to this location. | SLFAL1 |
l_name | A label for the location. | Allières / Chenau |
latitude | The latitude of the location in decimal format. | 46.48861 |
longitude | The longitude of the location in decimal format. | 6.99321 |
You can change the name, type and sharing of your datasets. Altering values is not supported. If something is wrong with your data, remove your dataset and add the corrected data as a new dataset. Changing the type of a dataset also affects its training datasets and vice versa.
Although models are trained in a semi-supervised fashion, Vadetis is a validator for anomaly detection in time series. To set the decision boundary that maximizes a given performance metric, it needs labels, so only supervised datasets can be added.
You can add as many datasets as you want.
No, only the owner of the dataset can add training data.
Yes. If a value is missing at a certain timestamp, some algorithms cannot compute a detection result for that timestamp, so the data must not contain gaps. Several techniques exist for recovering missing values, such as centroid decomposition.
To compute outlier detection, all time series must share the same granularity. The supported granularities are:
Type | Label | Description |
---|---|---|
YearBegin | 'AS' or 'BYS' | calendar year begin |
MonthBegin | 'MS' | calendar month begin |
Week | 'W' | one week, optionally anchored on a day of the week |
Day | 'D' | one absolute day |
Hour | 'H' | one hour |
Minute | 'T' or 'min' | one minute |
Second | 'S' | one second |
Milli | 'L' or 'ms' | one millisecond |
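
Whether several series share one granularity can be checked by comparing the gaps between consecutive timestamps. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and this is not how Vadetis itself performs the check. It only covers fixed-length steps (week and shorter), since month and year steps vary in length.

```python
from datetime import datetime

def step_sizes(timestamps):
    """Return the set of distinct gaps between consecutive timestamps."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    return {later - earlier for earlier, later in zip(parsed, parsed[1:])}

def share_granularity(series):
    """True if every series has one constant step and all steps agree."""
    steps = [step_sizes(ts) for ts in series.values()]
    return all(len(s) == 1 for s in steps) and len(set().union(*steps)) == 1

hourly = {
    "A": ["2020-04-21T00:00", "2020-04-21T01:00", "2020-04-21T02:00"],
    "B": ["2020-04-21T00:00", "2020-04-21T01:00", "2020-04-21T02:00"],
}
print(share_granularity(hourly))
```

A series with a gap yields more than one step size and therefore fails the check, which also catches missing timestamps.
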
Histogram, Cluster, SVM, Isolation Forest, RPCA
Histogram, Cluster, SVM, Isolation Forest and RPCA use a selected proportion of the normal data instances in the training dataset for
semi-supervised learning. The same proportion is then applied to the remaining normal instances and to the outlier instances to form the validation
set. No data instances are shared between training and validation sets. For example, given a dataset of 1000 data instances containing 60
outlier data instances and a proportion of 0.5, the training set will contain 470 normal data instances whereas the validation set
contains 235 normal and 30 outlier instances. The validation set is used to determine the threshold for the decision boundary that maximizes
a performance metric.
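
The arithmetic of the example above can be reproduced with a short sketch. The function name `split_counts` is hypothetical, and truncating with `int` is an assumption about rounding; the source only gives the one worked example.

```python
def split_counts(n_total, n_outliers, train_size):
    """Return the sizes of the training and validation sets."""
    n_normal = n_total - n_outliers
    # Training uses only normal instances (semi-supervised learning).
    train_normal = int(n_normal * train_size)
    # The same proportion is applied to the remaining normal instances
    # and to the outliers to form the validation set.
    rest_normal = n_normal - train_normal
    valid_normal = int(rest_normal * train_size)
    valid_outliers = int(n_outliers * train_size)
    return {
        "train_normal": train_normal,
        "valid_normal": valid_normal,
        "valid_outliers": valid_outliers,
    }

print(split_counts(1000, 60, 0.5))
# {'train_normal': 470, 'valid_normal': 235, 'valid_outliers': 30}
```
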
Afterwards, the trained model is applied to the evaluation dataset. For Histogram, Cluster, SVM and Isolation Forest, data instances with a score below the threshold are marked as anomalies, whereas RPCA marks instances as outliers if their score is above the threshold. Unlike these techniques, LISA does not require training and is applied directly to the data. It marks points as outliers if their score is below the threshold.
The confusion matrix is a table layout that visualizes the performance of a detection method. Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class.
True class | Predicted: Normal | Predicted: Anomaly |
---|---|---|
Normal | True Negative (TN) | False Positive (FP) |
Anomaly | False Negative (FN) | True Positive (TP) |
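
The four cells can be counted directly from the labels, and the usual metrics follow from them. This is a minimal sketch with hypothetical function names and made-up example labels, using the convention 0 for normal and 1 for anomaly; it uses the standard definitions of accuracy, precision, recall and F1 and is not taken from the Vadetis code.

```python
def confusion_counts(actual, predicted):
    """Count TN, FP, FN and TP for labels where 1 marks an anomaly."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

def metrics(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from the four counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

actual = [0, 0, 0, 0, 1, 1, 0, 1]
predicted = [0, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)  # (4, 1, 1, 2)
print(metrics(*counts))
```
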
The tool always evaluates 200 threshold candidates in the range 0 to 1. Because the data is supervised, the performance can be computed for each threshold candidate. The tool then selects the candidate that maximizes the chosen score type (NMI, RMSE, Accuracy, F1-Score, Precision or Recall).
Although the outlier score is normalized to a range from 0 to 1, you should not compare scores across different detection methods. Outlier scores computed by different methods differ in significance, range and contrast, and are therefore not easy to compare or interpret.
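
The threshold search can be sketched as a simple sweep. The sketch below makes several assumptions that the text does not confirm: the 200 candidates are evenly spaced and include both 0 and 1, the metric is F1, and scores below the threshold count as anomalies (the direction described for Histogram, Cluster, SVM and Isolation Forest). The function names and example scores are hypothetical.

```python
def f1_score(labels, predicted):
    """F1 for binary labels where 1 marks an anomaly."""
    tp = sum(1 for a, p in zip(labels, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(labels, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(labels, predicted) if a == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels, n_candidates=200):
    """Try evenly spaced thresholds in [0, 1] and keep the first best one."""
    best_t, best_value = 0.0, -1.0
    for i in range(n_candidates):
        t = i / (n_candidates - 1)
        # Assumed direction: a score below the threshold marks an anomaly.
        predicted = [1 if s < t else 0 for s in scores]
        value = f1_score(labels, predicted)
        if value > best_value:
            best_t, best_value = t, value
    return best_t, best_value

scores = [0.9, 0.8, 0.85, 0.1, 0.95, 0.2, 0.7]
labels = [0, 0, 0, 1, 0, 1, 0]
threshold, value = best_threshold(scores, labels)
print(threshold, value)
```

For a method such as RPCA, where scores above the threshold mark outliers, the comparison in the list comprehension would be reversed.
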
Point outliers are single extreme values. The outlier value is computed from the standard deviation of normal values around the location where they are injected and multiplied by an additional factor.
An amplitude shift affects several subsequent values which are all increased or decreased by the same offset value that is computed from the standard deviation of normal values around the location of insertion and multiplied by an additional factor.
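
An amplitude shift can be sketched as follows. The function name, the `window` parameter, and the use of the population standard deviation are assumptions for illustration; the text only states that the offset comes from the standard deviation of nearby normal values multiplied by a factor.

```python
import statistics

def amplitude_shift(values, start, length, factor, window=10):
    """Shift values[start:start+length] by factor * local standard deviation."""
    lo = max(0, start - window)
    hi = min(len(values), start + length + window)
    # Neighbourhood: values just before and just after the affected range.
    neighbourhood = values[lo:start] + values[start + length:hi]
    offset = factor * statistics.pstdev(neighbourhood)
    shifted = list(values)
    for i in range(start, min(start + length, len(values))):
        shifted[i] += offset  # the same offset for every affected value
    return shifted

values = [1.0, 3.0] * 10
shifted = amplitude_shift(values, start=8, length=4, factor=2.0, window=4)
print(shifted[8:12])  # [3.0, 5.0, 3.0, 5.0]
```

A negative `factor` decreases the affected values instead of increasing them.
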
Growth change either increases or decreases several subsequent values by a changing offset. The offset is increased or decreased linearly at each subsequent step, depending on whether positive or negative growth is applied. After the last affected point, all subsequent points are shifted by the last offset value, which permanently modifies the rest of the data.
Distortion is computed from the difference of the original values between two subsequent points. This difference from the first point to the second is multiplied by a factor and added to the second point. The procedure is then repeated for the next pair of points, with the second point of the previous iteration becoming the first point of the next step.
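
A growth change can be sketched as a linear ramp followed by a constant shift. The function name and the `step` parameter are hypothetical; how the tool derives the step size is not stated here.

```python
def growth_change(values, start, length, step):
    """Ramp the offset by `step` over `length` points, then keep it constant."""
    changed = list(values)
    offset = 0.0
    for i in range(start, len(values)):
        if i < start + length:
            offset += step  # linear growth inside the affected range
        changed[i] += offset  # later points keep the last offset
    return changed

print(growth_change([10.0] * 8, start=2, length=3, step=1.0))
# [10.0, 10.0, 11.0, 12.0, 13.0, 13.0, 13.0, 13.0]
```

A negative `step` produces negative growth.
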
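
One reading of the distortion procedure is sketched below. It assumes that every difference is taken from the original, unmodified values, as the first sentence suggests; the function name and parameters are hypothetical.

```python
def distort(values, start, length, factor):
    """Amplify the step between consecutive points inside the affected range."""
    distorted = list(values)
    for i in range(start, min(start + length, len(values) - 1)):
        diff = values[i + 1] - values[i]  # difference of the original values
        distorted[i + 1] = values[i + 1] + factor * diff
    return distorted

print(distort([1.0, 2.0, 4.0, 3.0, 3.0], start=0, length=3, factor=2.0))
# [1.0, 4.0, 8.0, 1.0, 3.0]
```
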
For a certain range, the values of a time series are set to 0.
Each method that is run uses its default configuration:
LISA (Pearson) | Value |
---|---|
Window Size | 10 |
LISA (DTW with Pearson) | Value |
---|---|
Window Size | 10 |
DTW Distance Function | euclidean |
Histogram | Value |
---|---|
Train Size | 0.5 |
Cluster (Gaussian Mixture) | Value |
---|---|
Bootstrap | False |
N Estimators | 40 |
Train Size | 0.5 |
SVM | Value |
---|---|
Kernel | rbf |
Nu | 0.95 |
Train Size | 0.5 |
Isolation Forest | Value |
---|---|
Bootstrap | False |
N Estimators | 40 |
Train Size | 0.5 |
RPCA | Value |
---|---|
Delta | 1 |
N Components | 2 |
Train Size | 0.5 |
The training dataset that has the highest contamination level will be used for model training. Further, LISA computation is applied to the time series that contains the most outliers. To improve performance when executing several detection methods at the same time, only 500 timestamps of a time series are taken into account.