Vadetis | Validator for Anomaly Detection in Time Series

Frequently Asked Questions

What is Vadetis?

Vadetis is a tool for anomaly detection in time series datasets using different detection algorithms with various settings. You can upload your own data and choose whether to share it with others, or run detection on the publicly available datasets. You can also compare the performance metrics of different algorithms.

How can I search datasets?

The search form in the header navigation lets you search datasets by title, by the title of associated training datasets, and by username.

I removed my account. Is it possible to re-enable it?

No. Once you delete your account, all linked information, including datasets and training datasets, is removed from the database.

Why is supervised data required?

Although some methods do not require supervised data in general, this application validates the performance of the detection and automatically sets decision boundaries that maximize a given metric. Both tasks require supervised data.

What are the requirements for datasets and training datasets?

Outlier detection is a computationally intensive operation because a score has to be computed for each point or data instance. Therefore, a dataset is limited to 100'000 values. Training data must contain more than 100 values and must not exceed 10'000 values. It must contain at least 5 outlier and 20 normal data instances, the minimum needed to split the data into training and validation parts.

All time series must have the same length and same granularity.

Further, the datasets must not contain missing values.

Only univariate time series are supported, therefore values must be of the same domain.
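The training data limits above can be checked before uploading. The following is a minimal sketch, not Vadetis code: the function name `check_training_data` is hypothetical, and it assumes the limits apply to the total number of labelled values in the file.

```python
def check_training_data(labels):
    """Return a list of violated requirements for a sequence of labels.

    labels: iterable of 0 (normal) or 1 (outlier), one entry per value.
    Assumption: the size limits refer to the total number of values.
    """
    labels = list(labels)
    n_values = len(labels)
    n_outliers = sum(1 for label in labels if label == 1)
    n_normal = n_values - n_outliers

    problems = []
    if n_values <= 100:
        problems.append("training data must contain more than 100 values")
    if n_values > 10_000:
        problems.append("training data must not exceed 10'000 values")
    if n_outliers < 5:
        problems.append("at least 5 outlier instances are required")
    if n_normal < 20:
        problems.append("at least 20 normal instances are required")
    return problems


ok = check_training_data([0] * 200 + [1] * 5)   # satisfies every rule
too_small = check_training_data([0] * 50)       # too few values, no outliers
```

An empty list means that none of the four rules is violated.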

What does the CSV format look like?

Dataset / Training Dataset:

The file must use semicolons as the delimiter, and all time series of your dataset must be contained in this single file. The time series names must be distinct within the same dataset. The CSV file must contain a header line of the form:
ts_name;time;unit;value;class

Column	Description	Example
ts_name	The name of the time series.	SLFAL1
time	An ISO 8601 formatted date.	202004210000 or 2020-04-21T00:00
unit	The unit of the series. Note: all time series must have the same unit.	Celsius
value	The value at the given time.	10.12
class	The label that marks the value as outlier or not. 0 for normal data, 1 for anomaly.	0
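To illustrate the layout, here is a minimal sketch that reads such a file with Python's standard `csv` module. The function name `read_dataset`, the output key `is_outlier` and the sample rows are illustrative assumptions, not part of Vadetis.

```python
import csv
import io

EXPECTED_HEADER = ["ts_name", "time", "unit", "value", "class"]


def read_dataset(text):
    """Parse semicolon-delimited dataset text into a list of row dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")

    rows = []
    for row in reader:
        if row["class"] not in ("0", "1"):
            raise ValueError(f"class must be 0 or 1, got {row['class']!r}")
        rows.append({
            "ts_name": row["ts_name"],
            "time": row["time"],
            "unit": row["unit"],
            # float() raises ValueError on an empty (missing) value
            "value": float(row["value"]),
            "is_outlier": row["class"] == "1",
        })

    # All time series must have the same unit.
    if len({r["unit"] for r in rows}) > 1:
        raise ValueError("all time series must have the same unit")
    return rows


# Hypothetical sample data in the documented format.
sample = (
    "ts_name;time;unit;value;class\n"
    "SLFAL1;2020-04-21T00:00;Celsius;10.12;0\n"
    "SLFAL1;2020-04-21T01:00;Celsius;25.40;1\n"
)
rows = read_dataset(sample)
```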

Spatial:

This optional CSV file provides spatial information about the time series, which enables classical LISA computation based on the distance between the locations as weights. The CSV file must contain a header line of the form:
ts_name;l_name;latitude;longitude

Column	Description	Example
ts_name	The name of the time series, which is associated to this location.	SLFAL1
l_name	A label for the location.	Allières / Chenau
latitude	The latitude of the location in decimal format.	46.48861
longitude	The longitude of the location in decimal format.	6.99321
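As an illustration of how distances between locations can serve as weights, the sketch below computes great-circle distances with the haversine formula and turns them into normalized inverse-distance weights. The inverse-distance scheme, the function names and the two stations `NEAR` and `FAR` are assumptions made for this example. The FAQ does not state which weighting Vadetis applies.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def inverse_distance_weights(target, others):
    """Weights proportional to 1 / distance, normalized to sum to 1.

    target: (latitude, longitude); others: dict of name -> (latitude, longitude).
    Assumes no other location coincides with the target (distance > 0).
    """
    inverse = {name: 1.0 / haversine_km(*target, *location)
               for name, location in others.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


slfal1 = (46.48861, 6.99321)           # example location from the table above
others = {"NEAR": (46.50, 7.00),       # hypothetical stations
          "FAR": (47.00, 8.00)}
weights = inverse_distance_weights(slfal1, others)
```

With these weights, closer stations contribute more to the comparison than distant ones.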

Is it possible to change datasets?

You can change the name, type and sharing settings of your datasets. Altering values is not supported. If something is wrong with your data, remove the dataset and add the corrected data as a new dataset. Changing the type of a dataset also changes the type of its training datasets, and vice versa.

I have unsupervised data, can I import it?

Although models are trained in a semi-supervised fashion, Vadetis is a validator for anomaly detection in time series. To set the decision boundary that maximizes a given performance metric, it needs labels, so only supervised datasets can be added.

How many datasets and training datasets can I add?

You can add as many datasets as you want.

Can I add training data to datasets I do not own?

No, only the owner of the dataset can add training data.

I have missing values in my datasets. Is this a problem?

Yes. If a value is missing at a certain timestamp, some algorithms cannot compute a detection result for that timestamp, so the data must not contain gaps. Several techniques exist to recover missing values, such as centroid decomposition.

Which granularities are supported?

To compute outlier detection, all time series must share the same granularity. The supported granularities are:

Type	Label	Description
YearBegin	'AS' or 'BYS'	calendar year begin
MonthBegin	'MS'	calendar month begin
Week	'W'	one week, optionally anchored on a day of the week
Day	'D'	one absolute day
Hour	'H'	one hour
Minute	'T' or 'min'	one minute
Second	'S'	one second
Milli	'L' or 'ms'	one millisecond

How is model training performed?

Histogram, Cluster, SVM, Isolation Forest, RPCA
Histogram, Cluster, SVM, Isolation Forest and RPCA use a selected proportion of normal data instances in the training dataset for semi-supervised learning. The rest of the normal data as well as outlier instances will be split by the given proportion to a validation set. No data instances are shared between training and validation sets. For example, given a dataset of 1000 data instances containing 60 outlier data instances and a proportion of 0.5, then the training set will contain 470 normal data instances whereas the validation set contains 235 normal and 30 outlier instances. The validation set is used to determine the threshold for the decision boundary that maximizes a performance metric.
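The worked example above can be reproduced with a short sketch. It follows one reading of the description: the function name `split_training_data`, the shuffling and the seed are assumptions, and the instances that end up in neither set are simply left unused here.

```python
import random


def split_training_data(labels, train_size, seed=0):
    """Split instance indices into a training and a validation set.

    labels: list of 0 (normal) or 1 (outlier).
    The training set receives only normal instances. The validation set
    receives a share of the remaining normal instances and of the outliers.
    """
    rng = random.Random(seed)
    normal = [i for i, label in enumerate(labels) if label == 0]
    outliers = [i for i, label in enumerate(labels) if label == 1]
    rng.shuffle(normal)
    rng.shuffle(outliers)

    n_train = int(len(normal) * train_size)
    train = normal[:n_train]
    remaining_normal = normal[n_train:]

    n_val_normal = int(len(remaining_normal) * train_size)
    n_val_outliers = int(len(outliers) * train_size)
    validation = remaining_normal[:n_val_normal] + outliers[:n_val_outliers]
    return train, validation


labels = [0] * 940 + [1] * 60          # 1000 instances, 60 of them outliers
train, validation = split_training_data(labels, train_size=0.5)
n_val_normal = sum(1 for i in validation if labels[i] == 0)
n_val_outliers = sum(1 for i in validation if labels[i] == 1)
```

For the 1000-instance example this yields 470 training instances and a validation set of 235 normal and 30 outlier instances, with no index shared between the two sets.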

Afterwards, the trained model is applied to the evaluation dataset. Histogram, Cluster, SVM and Isolation Forest mark data instances with a score below the threshold as anomalies, whereas RPCA marks instances with a score above the threshold as outliers. In contrast to these techniques, LISA does not require training and is applied directly to the data. It marks points as outliers if their score is below the threshold.

What is the confusion matrix?

The confusion matrix is a table layout that visualizes the performance of a detection method. Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class.

		Predicted class
		Normal	Anomaly
True class	Normal	True Negative (TN)	False Positive (FP)
True class	Anomaly	False Negative (FN)	True Positive (TP)
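A minimal sketch of how the four cells can be counted from true and predicted labels, using the same row and column order as the table. The function name and the sample labels are illustrative.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels 0 (normal) and 1 (anomaly)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return [[tn, fp], [fn, tp]]


y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]
matrix = confusion_matrix(y_true, y_pred)  # [[3, 1], [1, 1]]
```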

How is the decision boundary (threshold) determined?

The tool always evaluates 200 threshold candidates in the range from 0 to 1. Because the data is supervised, the performance can be computed for each candidate. The tool then selects the candidate that maximizes the selected score type (NMI, RMSE, Accuracy, F1-Score, Precision or Recall).
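The search over threshold candidates can be sketched as follows. This is an illustration under stated assumptions, not the Vadetis implementation: the candidates are assumed to be evenly spaced, F1-Score is used as the metric, and a score below the threshold is treated as an anomaly. The function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1-Score for labels 0 (normal) and 1 (anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, y_true, n_candidates=200):
    """Try evenly spaced thresholds in [0, 1] and keep the best F1-Score."""
    best_t, best_f1 = 0.0, -1.0
    for k in range(n_candidates):
        threshold = k / (n_candidates - 1)
        # Assumption: a score below the threshold marks an anomaly.
        y_pred = [1 if score < threshold else 0 for score in scores]
        f1 = f1_score(y_true, y_pred)
        if f1 > best_f1:          # ties keep the lowest threshold
            best_t, best_f1 = threshold, f1
    return best_t, best_f1


scores = [0.9, 0.8, 0.85, 0.1, 0.2, 0.95]   # normalized outlier scores
y_true = [0, 0, 0, 1, 1, 0]
threshold, f1 = best_threshold(scores, y_true)
```

For a method such as RPCA, where scores above the threshold mark outliers, the comparison in `best_threshold` would be reversed.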

How can I interpret outlier scores between different detection methods?

Although the outlier score is normalized to a range from 0 to 1, you should not compare scores across different detection methods. Scores computed by different methods differ in significance, range and contrast, and are therefore not easy to compare or interpret.

What are point outliers and how are they computed?

Point outliers are single extreme values. The outlier value is computed by multiplying the standard deviation of the normal values around the injection location by an additional factor.
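A minimal sketch of such an injection. The window size, the factor, the use of the population standard deviation, and the choice to add the result to the original value are assumptions made for illustration.

```python
import statistics


def inject_point_outlier(values, index, factor=10.0, window=5):
    """Return a copy of values with one extreme value at the given index."""
    lo = max(0, index - window)
    hi = min(len(values), index + window + 1)
    local_std = statistics.pstdev(values[lo:hi])   # spread around the location
    result = list(values)
    # Assumption: the scaled deviation is added to the original value.
    result[index] = values[index] + factor * local_std
    return result


series = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0]
spiked = inject_point_outlier(series, index=3, factor=10.0, window=3)
```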

What are amplitude shift outliers and how are they computed?

An amplitude shift affects several subsequent values, which are all increased or decreased by the same offset. The offset is computed by multiplying the standard deviation of the normal values around the insertion location by an additional factor.
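The same idea applied to a range of values, again as a sketch: the function name, the window used for the standard deviation and the default factor are assumptions. A negative factor decreases the values.

```python
import statistics


def inject_amplitude_shift(values, start, length, factor=5.0, window=5):
    """Return a copy of values with a constant offset on [start, start + length)."""
    lo = max(0, start - window)
    hi = min(len(values), start + length + window)
    offset = factor * statistics.pstdev(values[lo:hi])
    result = list(values)
    for i in range(start, min(start + length, len(result))):
        result[i] = values[i] + offset     # same offset for every affected value
    return result


series = [1.0, 2.0] * 5
shifted = inject_amplitude_shift(series, start=3, length=4)
```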

What are growth change outliers and how are they computed?

Growth change either increases or decreases several subsequent values by a changing offset. The offset is increased or decreased linearly at each subsequent step, depending on whether positive or negative growth is applied. After the last affected point, all subsequent points are shifted by the last offset value, which permanently modifies the rest of the data.
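A minimal sketch of this behaviour, assuming a fixed per-step increment `step` (a hypothetical parameter; the FAQ does not say how the increment is chosen). A negative `step` gives negative growth.

```python
def inject_growth_change(values, start, length, step):
    """Return a copy of values with a linearly growing offset from start on."""
    result = list(values)
    offset = 0.0
    for i in range(start, len(result)):
        if i < start + length:
            offset += step                 # offset grows while inside the range
        result[i] = values[i] + offset     # later points keep the last offset
    return result


flat = [0.0] * 6
grown = inject_growth_change(flat, start=1, length=3, step=2.0)
# grown == [0.0, 2.0, 4.0, 6.0, 6.0, 6.0]
```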

What are distortion outliers and how are they computed?

Distortion is computed from the difference between the original values of two subsequent points. The difference from the first point to the second is multiplied by a factor and added to the second point. The procedure is then repeated for the next pair of points, with the second point of the previous iteration becoming the first point of the next step.
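The pairwise procedure can be sketched like this. The function name and parameters are illustrative, and the differences are taken from the original values, as the description states.

```python
def inject_distortion(values, start, length, factor):
    """Return a copy of values with amplified step-to-step differences.

    Affects the points start + 1 .. start + length - 1, each paired with
    its predecessor.
    """
    result = list(values)
    for i in range(start + 1, min(start + length, len(values))):
        difference = values[i] - values[i - 1]   # difference of original values
        result[i] = values[i] + factor * difference
    return result


series = [1.0, 2.0, 4.0, 3.0, 3.0]
distorted = inject_distortion(series, start=0, length=4, factor=2.0)
# distorted == [1.0, 4.0, 8.0, 1.0, 3.0]
```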

What are missing value outliers and how are they computed?

For a certain range, the values of a time series are set to 0.

What settings are used for the recommendation?

Each method is run with its default configuration:

LISA (Pearson)	Value
Window Size	10

LISA (DTW with Pearson)	Value
Window Size	10
DTW Distance Function	euclidean

Histogram	Value
Train Size	0.5

Cluster (Gaussian Mixture)	Value
Bootstrap	False
N Estimators	40
Train Size	0.5

SVM	Value
Kernel	rbf
Nu	0.95
Train Size	0.5

Isolation Forest	Value
Bootstrap	False
N Estimators	40
Train Size	0.5

RPCA	Value
Delta	1
N Components	2
Train Size	0.5

Which datasets and time series are used to compute the recommendation?

The training dataset with the highest contamination level is used for model training. LISA computation is applied to the time series that contains the most outliers. To improve performance when several detection methods are executed at the same time, only 500 timestamps of a time series are taken into account.