tspDB API Reference

Prediction Index Management

`create_pindex ()`

This function creates a prediction index on an existing PostgreSQL time series table. Building an index enables predictive queries. see Predictive Queries Section. The table must have a time column chosen through the argument time_column, and one or multiple value columns determined by value_columns.

time_column Must be a column of integers(INT, or BIGINT) or timestamp(TIMESTAMP or TIMESTAMPTZ)

Required Arguments

Name	Type	Description
`table_name`	text	The name of the time series table to be indexed
`time_column`	text	The name of the time column to be indexed on. This is the column containing time values, which can be either integers or timestamps
`value_columns`	text array	The name of the values column(s) containing time series values to be indexed on.
`index_name`	text	The name of the prediction index. This is the index identifier used when executing predictive queries, deleting the index, .. etc.

Optional Arguments

Name	Type	Description
`auto_update`	boolean	Default: `true`. If true, update the prediction index when new rows are inserted to `table_name`
`agg_interval`	numeric	Default: `NULL`. Determines the aggregation interval in seconds if the time column is in timestamps. If `NULL`, it is set to the median of the time difference between the first 100 points.
`normalize`	boolean	Default: `true`. If true, normalize the data before fitting the prediction model. Recommended for indices on multiple time series.
`k`	int	Default: `NULL`. A parameter in the prediction algorithm. It represents the number of singular values to retain in the means prediction model. If `NULL`, then it is chosen automatically by selecting the $k^{th}$ top singular values that maintain 90% of the signal energy.
`T`	int	Default: `2,500,000` A parameter in the prediction algorithm. It denotes the number of entries in each submodel in the means prediction model. `T`>`T_0`.
`T0`	int	Default: `100`. The minimum number of entries (number of table rows $\times$ number of time series included in the index) for the prediction index to be built.
`gamma`	numeric	Default: `0.5`. Must be in the interval (0,1], otherwise, it will default to 0.5. A parameter in the prediction algorithm, which determines the number of new entries relative to `T` after which the last sub-model is fully updated
`var_direct`	boolean	Default: `true`. A parameter in the prediction algorithm. Default value recommended. If true, The variance is estimated directly by subtracting the estimated mean from the observations and then estimating the underlying mean of the difference through matrix estimation. Otherwise, it is estimated through the estimation of the underlying mean of the square of the observations.
`col_to_row_ratio`	int	Default: `10`. A parameter of the prediction model. It determines the ratio of no. columns to the number of rows in each sub-model's matrix.
`L`	int	Default: `NULL`. If not `NULL`, `col_to_row_ratio` is ignored and it determines the number of rows in each sub-model. L should be greater than 2 and less than floor(sqrt(T))
`k_var`	int	Default: `NULL`. A parameter in the prediction algorithm. It represents the number of singular values to retain in the variance prediction model. If `NULL`, it is chosen automatically by selecting the $k^{th}$ top singular values that maintain 90% of the signal energy. If 0, do not estimate variance.
`timescale`	boolean	Default: `false`. If true, assumes the time-series table is stored as TimescaleDB hypertable, which expedite updating/creating the pindex.

Sample Usage

1 - Create a prediction index pindex1on table measurements on time column time and value columns temprature.

select create_pindex('measurements','time','{"temprature"}','pindex1');

2 - Create a prediction index pindex2on table measurements on time column time and value columns temprature and humidity, but build it on the aggregated data with an agg_interval of 120 seconds

select create_pindex('measurements','time','{"temprature", "humidity"}','pindex2', 
    agg_interval => 120);

3 - Create a prediction index pindex3on table measurements on time column time and value columns temprature with T = 10000 and without estimating the variance.

select create_pindex('measurements','time','{"temprature", "humidity"}','pindex2', 
    T => 10000, k_var => 0);

`list_pindices()`

List all the predictive indices created in your database with some basic information about each index.

Returns

Name	Type	Description
`index_name`	text	The name of the prediction index.
`value_columns`	text array	The name of the indexed values column(s).
`relation`	text	The name of the indexed time series table.
`time_column`	text	The name of the time column indexed on.
`initial_timestamp`	text	The first timestamp/integer included in buliding the index.
`last_timestamp`	text	The last timestamp/integer included in buliding the index.
`agg_interval`	numeric	The aggregation interval in seconds (if the time columns is in timestamps)
`uncertainty_quantification`	boolean	true if the variance is estimated by the predicting index.

Sample Usage

To get a table of all prediction indices built in your database:

select * from list_pindices();

Or you can get a specific column for a specific index by writing an SQL query on list_pindices(). For example:

select agg_interval from list_pindices() where index_name = 'pindex1';

`delete_pindex()`

Delete the predictive index index_name from your database.

Required Arguments

Name	Type	Description
`index_name`	text	name of the predictive index to be deleted

Sample Usage

You can simply delete the created predictive index pindex_1 through

select delete_pindex('pindex_1');

`pindices_stat()`

Show some statistics regarding the predictive indices. These include in-sample accuracies for imputation and forecasting, out-of-sample accuracy (only when more than 2 sub-models are built), the number of points included in the model. Accuracies are in R². R² values typically range between 0 and 1. (One being the perfect accuracy)

Returns

Name	Type	Description
`index_name`	text	Prediction index name.
`column_name`	text	A name of a value column indexed by `index_name`.
`number_of_observations`	numeric	The number of datapoints used in the prediction model.
`number_of_trained_models`	numeric	The number of sub-models in the prediction model.
`imputation_score`	numeric	The in-sample imputation accuracy.
`forecast_score`	numeric	The in-sample forecasting accuracy.
`test_forecast_score`	numeric	The out-of-sample forecasting accuracy.

Sample Usage

To get a table of the statistics of all prediction indices built in your database:

select * from pindices_stat();

Similarly, you can get it for a specific column and index by

select * from pindices_stat() where index_name  = 'pindex1' and column_name = 'temprature';

`update_pindex()`

Update the prediction index index_name.

Note: if auto_update was set to true when the index was created, then it will be updated automatically.

Required Arguments

Name	Type	Description
`index_name`	text	Prediction index name.

Sample Usage

To update undex pindex_1:

select  update_pindex('pindex_1');

Predictive Queries

`predict ()`

Returns the predicted value along with the confidence interval for the value of value_column at time t using index_name

Required Arguments

Name	Type	Description
`table_name`	text	The name of the indexed time series table.
`value_column`	text	The name of the indexed value column to be predicted.
`t`	int or Timestamp	Index or timestamp indicating the queried time.
`index_name`	text	The name of the prediction index.

Optional Arguments

Name	Type	Description
`uq`	boolean	Default: `true`. If true, return upper and lower bound of the `c`% confidenc interval
`uq_method`	text	Defalut = `Gaussian` options: {`Gaussian`, `Chebyshev`} Uncertainty quantification method used to estimate the confidence interval
`c`	Numeric	Default: 95. confidence level for uncertainty quantification, c should be in the range (0,100).

Returns

Name	Type	Description
`prediction`	numeric	Predicted value
`LB`	numeric	Lower bound of prediction
`UB`	numeric	Upper bound of prediction

Sample Usage

Inline with the examples used in create_pindex() we can predict as follow: 1- To predict the temprature of table measurementsat time t = 100 using pindex1

select predict('measurements','temprature',100,'pindex1');

2- To predict the humidity of table measurementsat time t = 100 using pindex2 with a confidence interval of 98%:

select predict('measurements','humidity',100,'pindex2', c => 98);

`predict()`

This is the same predict() function, but returns the predicted value for a time range t1 to t2 using index_name along with the confidence interval for the value of value_column

Required Arguments

Name	Type	Description
`table_name`	text	The name of the indexed time series table.
`value_column`	text	The name of the indexed value column to be predicted.
`t1`	int or Timestamp	Index or timestamp indicating the start of queried time (inclusive).
`t2`	int or Timestamp	Index or timestamp indicating the end of queried time (inclusive).
`index_name`	text	The name of the prediction index.

Optional Arguments

Name	Type	Description
`uq`	boolean	Default: `true`. If true, return upper and lower bound of the `c`% confidenc interval
`uq_method`	text	Defalut = `Gaussian` options: {`Gaussian`, `Chebyshev`} Uncertainty quantification method used to estimate the confidence interval
`c`	Numeric	Default: 95. confidence level for uncertainty quantification, c should be in the range (0,100).

Returns

Name	Type	Description
`prediction`	numeric	Predicted value
`LB`	numeric	Lower bound of prediction
`UB`	numeric	Upper bound of prediction

Sample Usage

Inline with the examples used in create_pindex() we can predict as follow: 1- To predict the temprature of table measurementsat time range t = 100 to t =110 using pindex1

select predict('measurements','temprature',100,110,'pindex1');

2- To predict the humidity of table measurementsat time range t = 100 to t =110 using Chebyshev as the uncertainity quantification method:

select predict('measurements','humidity',100,110,'pindex2', uq_method => 'Chebyshev');

Additional Functionalities

`get_lowerbound()`

This method estimated a prediction error lower bound of a discrete time series. As such, the first step is to discretize the given sequence. If the given time series is already discrete, you can ignore the discretization_method. The method then estimates an entropy rate of the time series via Lempel-Ziv compression, and returns the classification error lower bound. The stationarity of the time series is assumed. More detailed explanation is given below, and please visit https://github.com/saeyoung/tslb for more information.

Discretization

When the given time series comprises of continuous random variables, we need to discretize the sequence prior to apply this method. See discretization_method.

Linear Model

We build a linear model between the compression ratio and the entropy rate. To do so, we sample a number of sequences that are similar to the given time series: that has the same number of observations (sequence length; number_of_observations ) and the same number of alphabets (the number of unique cases that appear in the sequence; k ). samples denote the number of sampled sequences, and each sequence's distribution is known as we set the true distribution for them. Thus, we can obtain 1) the entropy rate from the true distribution and 2) the compression ratio of the sequence via Lempel-Ziv algorithm.

Regression

We obtain the compression ratio of the (discretized) time series of interest and calculate the entropy rate using the aforementioned linear model. Finally, we estimate the classification error lower bound using Fano's Inequality. The returned value is the probability of the best classification error.

Required Arguments

Name	Type	Description
`table_name`	text	The name of the time series table.
`value_column`	text	The name of the time series column for which a lower bound for prediction error will be calculated.
`time_column`	text	The name of the time column in the time series table. This is the column containing time values, which can be either integers or timestamps.

Optional Arguments

Name	Type	Description
`number_of_observations`	int	Default: 1000. Number of observations used to estimate the lower bound. The most recent observations will be selected.
`samples`	int	Defalut = `100`. Number of samples to construct the linear model.
`discretization_method`	text	Defalut = `quantization` options: {`quantization`, `change`, `None`}. `quantization`: quantize the sequence into `k` equilength bins, from min to max. `change`: create three alphabets, 1) remains the same, 2) increased, and 3) decreased. `None` skip discretization, which should be used if the time series is already discrete.
`k`	int	Default: 3. Number of alphabets in the discretized sequence when `discretization_method` is set to `quantization`.

Returns

Name	Type	Description
`lower bound`	numeric	Prediction error lower bound. This is a predicted value for the classification error (probability) of the best estimator based on Fano's Inequality.

Sample Usage

Inline with the examples used before, we can estimate the prediction lower bound for the time series temperature in the table measurements as follow:


select get_lowerbound('measurements','temprature','time');

Similarly, we can change the number of observations used to estimate the lower bound to 2000 through the following query:


select get_lowerbound('measurements','temprature','time', number_of_observations => 2000);

We can convert the time series on a continuous scale into a ternary sequence: increased, decreased, and no change:


select get_lowerbound('measurements','temprature','time', discretization_method => 'change');

Another way to quantize the sequence is to divide the line between the observed minimum and the maximum into k=5 equilength intervals:


select get_lowerbound('measurements','temprature','time', discretization_method => 'quantization', k => 5);

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search