tspDB API Reference
Prediction Index Management
create_pindex ()
This function creates a prediction index on an existing PostgreSQL time series table. Building an index enables predictive queries. see Predictive Queries Section. The table must have a time column chosen through the argument time_column
, and one or multiple value columns determined by value_columns
.
time_column
Must be a column of integers(INT, or BIGINT) or timestamp(TIMESTAMP or TIMESTAMPTZ)
Required Arguments
Name | Type | Description |
---|---|---|
table_name |
text | The name of the time series table to be indexed |
time_column |
text | The name of the time column to be indexed on. This is the column containing time values, which can be either integers or timestamps |
value_columns |
text array | The name of the values column(s) containing time series values to be indexed on. |
index_name |
text | The name of the prediction index. This is the index identifier used when executing predictive queries, deleting the index, .. etc. |
Optional Arguments
Name | Type | Description |
---|---|---|
auto_update |
boolean | Default: true . If true, update the prediction index when new rows are inserted to table_name |
agg_interval |
numeric | Default: NULL . Determines the aggregation interval in seconds if the time column is in timestamps. If NULL , it is set to the median of the time difference between the first 100 points. |
normalize |
boolean | Default: true . If true, normalize the data before fitting the prediction model. Recommended for indices on multiple time series. |
k |
int | Default: NULL . A parameter in the prediction algorithm. It represents the number of singular values to retain in the means prediction model. If NULL , then it is chosen automatically by selecting the $k^{th}$ top singular values that maintain 90% of the signal energy. |
T |
int | Default: 2,500,000 A parameter in the prediction algorithm. It denotes the number of entries in each submodel in the means prediction model. T >T_0 . |
T0 |
int | Default: 100 . The minimum number of entries (number of table rows $\times$ number of time series included in the index) for the prediction index to be built. |
gamma |
numeric | Default: 0.5 . Must be in the interval (0,1], otherwise, it will default to 0.5. A parameter in the prediction algorithm, which determines the number of new entries relative to T after which the last sub-model is fully updated |
var_direct |
boolean | Default: true . A parameter in the prediction algorithm. Default value recommended. If true, The variance is estimated directly by subtracting the estimated mean from the observations and then estimating the underlying mean of the difference through matrix estimation. Otherwise, it is estimated through the estimation of the underlying mean of the square of the observations. |
col_to_row_ratio |
int | Default: 10 . A parameter of the prediction model. It determines the ratio of no. columns to the number of rows in each sub-model's matrix. |
L |
int | Default: NULL . If not NULL , col_to_row_ratio is ignored and it determines the number of rows in each sub-model. L should be greater than 2 and less than floor(sqrt(T)) |
k_var |
int | Default: NULL . A parameter in the prediction algorithm. It represents the number of singular values to retain in the variance prediction model. If NULL , it is chosen automatically by selecting the $k^{th}$ top singular values that maintain 90% of the signal energy. If 0, do not estimate variance. |
timescale |
boolean | Default: false . If true, assumes the time-series table is stored as TimescaleDB hypertable, which expedite updating/creating the pindex. |
Sample Usage
1 - Create a prediction index pindex1
on table measurements
on time column time
and value columns temprature
.
select create_pindex('measurements','time','{"temprature"}','pindex1');
2 - Create a prediction index pindex2
on table measurements
on time column time
and value columns temprature
and humidity
, but build it on the aggregated data with an agg_interval
of 120 seconds
select create_pindex('measurements','time','{"temprature", "humidity"}','pindex2',
agg_interval => 120);
3 - Create a prediction index pindex3
on table measurements
on time column time
and value columns temprature
with T
= 10000 and without estimating the variance.
select create_pindex('measurements','time','{"temprature", "humidity"}','pindex2',
T => 10000, k_var => 0);
list_pindices()
List all the predictive indices created in your database with some basic information about each index.
Returns
Name | Type | Description |
---|---|---|
index_name |
text | The name of the prediction index. |
value_columns |
text array | The name of the indexed values column(s). |
relation |
text | The name of the indexed time series table. |
time_column |
text | The name of the time column indexed on. |
initial_timestamp |
text | The first timestamp/integer included in buliding the index. |
last_timestamp |
text | The last timestamp/integer included in buliding the index. |
agg_interval |
numeric | The aggregation interval in seconds (if the time columns is in timestamps) |
uncertainty_quantification |
boolean | true if the variance is estimated by the predicting index. |
Sample Usage
To get a table of all prediction indices built in your database:
select * from list_pindices();
Or you can get a specific column for a specific index by writing an SQL query on list_pindices()
. For example:
select agg_interval from list_pindices() where index_name = 'pindex1';
delete_pindex()
Delete the predictive index index_name
from your database.
Required Arguments
Name | Type | Description |
---|---|---|
index_name |
text | name of the predictive index to be deleted |
Sample Usage
You can simply delete the created predictive index pindex_1
through
select delete_pindex('pindex_1');
pindices_stat()
Show some statistics regarding the predictive indices. These include in-sample accuracies for imputation and forecasting, out-of-sample accuracy (only when more than 2 sub-models are built), the number of points included in the model. Accuracies are in R2. R2 values typically range between 0 and 1. (One being the perfect accuracy)
Returns
Name | Type | Description |
---|---|---|
index_name |
text | Prediction index name. |
column_name |
text | A name of a value column indexed by index_name . |
number_of_observations |
numeric | The number of datapoints used in the prediction model. |
number_of_trained_models |
numeric | The number of sub-models in the prediction model. |
imputation_score |
numeric | The in-sample imputation accuracy. |
forecast_score |
numeric | The in-sample forecasting accuracy. |
test_forecast_score |
numeric | The out-of-sample forecasting accuracy. |
Sample Usage
To get a table of the statistics of all prediction indices built in your database:
select * from pindices_stat();
Similarly, you can get it for a specific column and index by
select * from pindices_stat() where index_name = 'pindex1' and column_name = 'temprature';
update_pindex()
Update the prediction index index_name
.
Note: if auto_update
was set to true when the index was created, then it will be updated automatically.
Required Arguments
Name | Type | Description |
---|---|---|
index_name |
text | Prediction index name. |
Sample Usage
To update undex pindex_1
:
select update_pindex('pindex_1');
Predictive Queries
predict ()
Returns the predicted value along with the confidence interval for the value of value_column
at time t
using index_name
Required Arguments
Name | Type | Description |
---|---|---|
table_name |
text | The name of the indexed time series table. |
value_column |
text | The name of the indexed value column to be predicted. |
t |
int or Timestamp | Index or timestamp indicating the queried time. |
index_name |
text | The name of the prediction index. |
Optional Arguments
Name | Type | Description |
---|---|---|
uq |
boolean | Default: true . If true, return upper and lower bound of the c % confidenc interval |
uq_method |
text | Defalut = Gaussian options: {Gaussian , Chebyshev } Uncertainty quantification method used to estimate the confidence interval |
c |
Numeric | Default: 95. confidence level for uncertainty quantification, c should be in the range (0,100). |
Returns
Name | Type | Description |
---|---|---|
prediction |
numeric | Predicted value |
LB |
numeric | Lower bound of prediction |
UB |
numeric | Upper bound of prediction |
Sample Usage
Inline with the examples used in create_pindex()
we can predict as follow:
1- To predict the temprature
of table measurements
at time t = 100 using pindex1
select predict('measurements','temprature',100,'pindex1');
2- To predict the humidity
of table measurements
at time t = 100 using pindex2
with a confidence interval of 98%:
select predict('measurements','humidity',100,'pindex2', c => 98);
predict()
This is the same predict()
function, but returns the predicted value for a time range t1
to t2
using index_name
along with the confidence interval for the value of value_column
Required Arguments
Name | Type | Description |
---|---|---|
table_name |
text | The name of the indexed time series table. |
value_column |
text | The name of the indexed value column to be predicted. |
t1 |
int or Timestamp | Index or timestamp indicating the start of queried time (inclusive). |
t2 |
int or Timestamp | Index or timestamp indicating the end of queried time (inclusive). |
index_name |
text | The name of the prediction index. |
Optional Arguments
Name | Type | Description |
---|---|---|
uq |
boolean | Default: true . If true, return upper and lower bound of the c % confidenc interval |
uq_method |
text | Defalut = Gaussian options: {Gaussian , Chebyshev } Uncertainty quantification method used to estimate the confidence interval |
c |
Numeric | Default: 95. confidence level for uncertainty quantification, c should be in the range (0,100). |
Returns
Name | Type | Description |
---|---|---|
prediction |
numeric | Predicted value |
LB |
numeric | Lower bound of prediction |
UB |
numeric | Upper bound of prediction |
Sample Usage
Inline with the examples used in create_pindex()
we can predict as follow:
1- To predict the temprature
of table measurements
at time range t = 100 to t =110 using pindex1
select predict('measurements','temprature',100,110,'pindex1');
2- To predict the humidity
of table measurements
at time range t = 100 to t =110 using Chebyshev as the uncertainity quantification method:
select predict('measurements','humidity',100,110,'pindex2', uq_method => 'Chebyshev');
Additional Functionalities
get_lowerbound()
This method estimated a prediction error lower bound of a discrete time series. As such, the first step is to discretize the given sequence. If the given time series is already discrete, you can ignore the discretization_method. The method then estimates an entropy rate of the time series via Lempel-Ziv compression, and returns the classification error lower bound. The stationarity of the time series is assumed. More detailed explanation is given below, and please visit https://github.com/saeyoung/tslb for more information.
- Discretization
When the given time series comprises of continuous random variables, we need to discretize the sequence prior to apply this method. See discretization_method
.
- Linear Model
We build a linear model between the compression ratio and the entropy rate. To do so, we sample a number of sequences that are similar to the given time series: that has the same number of observations (sequence length; number_of_observations
) and the same number of alphabets (the number of unique cases that appear in the sequence; k
). samples
denote the number of sampled sequences, and each sequence's distribution is known as we set the true distribution for them. Thus, we can obtain 1) the entropy rate from the true distribution and 2) the compression ratio of the sequence via Lempel-Ziv algorithm.
- Regression
We obtain the compression ratio of the (discretized) time series of interest and calculate the entropy rate using the aforementioned linear model. Finally, we estimate the classification error lower bound using Fano's Inequality. The returned value is the probability of the best classification error.
Required Arguments
Name | Type | Description |
---|---|---|
table_name |
text | The name of the time series table. |
value_column |
text | The name of the time series column for which a lower bound for prediction error will be calculated. |
time_column |
text | The name of the time column in the time series table. This is the column containing time values, which can be either integers or timestamps. |
Optional Arguments
Name | Type | Description |
---|---|---|
number_of_observations |
int | Default: 1000. Number of observations used to estimate the lower bound. The most recent observations will be selected. |
samples |
int | Defalut = 100 . Number of samples to construct the linear model. |
discretization_method |
text | Defalut = quantization options: {quantization , change , None }. quantization : quantize the sequence into k equilength bins, from min to max. change : create three alphabets, 1) remains the same, 2) increased, and 3) decreased. None skip discretization, which should be used if the time series is already discrete. |
k |
int | Default: 3. Number of alphabets in the discretized sequence when discretization_method is set to quantization . |
Returns
Name | Type | Description |
---|---|---|
lower bound |
numeric | Prediction error lower bound. This is a predicted value for the classification error (probability) of the best estimator based on Fano's Inequality. |
Sample Usage
Inline with the examples used before, we can estimate the prediction lower bound for the time series temperature
in the table measurements
as follow:
select get_lowerbound('measurements','temprature','time');
Similarly, we can change the number of observations used to estimate the lower bound to 2000 through the following query:
select get_lowerbound('measurements','temprature','time', number_of_observations => 2000);
We can convert the time series on a continuous scale into a ternary sequence: increased, decreased, and no change:
select get_lowerbound('measurements','temprature','time', discretization_method => 'change');
Another way to quantize the sequence is to divide the line between the observed minimum and the maximum into k=5 equilength intervals:
select get_lowerbound('measurements','temprature','time', discretization_method => 'quantization', k => 5);