Error detection rules included in traval

This notebook shows simple examples of the error detection rules in traval.

[1]:

import numpy as np
import pandas as pd

import traval
from traval import rulelib as rlib

Create a very simple time series:

[35]:

date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
s1

[35]:

2020-01-01    0
2020-01-02    1
2020-01-03    2
2020-01-04    3
2020-01-05    4
2020-01-06    5
2020-01-07    6
2020-01-08    7
2020-01-09    8
2020-01-10    9
Freq: D, dtype: int64

`rule_ufunc_threshold`: float threshold

Rule comparing series to threshold value.

[2]:

c1 = rlib.rule_ufunc_threshold(s1, (np.greater_equal,), 5)
assert (c1["correction_code"] == 2).sum() == 5
c1

[2]:

	correction_code	series_values	comparison_values
2020-01-01	0	NaN	NaN
2020-01-02	0	NaN	NaN
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	0	NaN	NaN
2020-01-06	2	5.0	5.0
2020-01-07	2	6.0	5.0
2020-01-08	2	7.0	5.0
2020-01-09	2	8.0	5.0
2020-01-10	2	9.0	5.0
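
For intuition, the comparison can be reproduced with plain NumPy and pandas. This is a minimal sketch, not traval's implementation: the names `mask` and `sketch` are illustrative, and the codes 2 and 0 are copied from the table above.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
s = pd.Series(index=index, data=np.arange(10))

# Apply the ufunc to the series and the threshold to get a boolean mask.
mask = np.greater_equal(s, 5)

# Code 2 where the condition holds, 0 elsewhere (codes taken from the
# table above); only flagged values are kept in series_values.
sketch = pd.DataFrame(
    {
        "correction_code": np.where(mask, 2, 0),
        "series_values": s.where(mask),
    }
)
```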

`rule_ufunc_threshold`: threshold series

Rule comparing series to threshold series.

[3]:

# rule_ufunc_threshold: series
idx = date_range[:3].to_list() + date_range[-4:-1].to_list()
thresh_series = pd.Series(index=idx, data=5.0)
full_threshold_series = traval.ts_utils.resample_short_series_to_long_series(
    thresh_series, s1
)
c2 = rlib.rule_ufunc_threshold(s1, (np.greater_equal,), thresh_series)
assert (c2["correction_code"] == 2).sum() == 5
c2

[3]:

	correction_code	series_values	comparison_values
2020-01-01	0	NaN	NaN
2020-01-02	0	NaN	NaN
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	0	NaN	NaN
2020-01-06	2	5.0	5.0
2020-01-07	2	6.0	5.0
2020-01-08	2	7.0	5.0
2020-01-09	2	8.0	5.0
2020-01-10	2	9.0	5.0

`rule_diff_ufunc_threshold`

Rule comparing diff of series to threshold value.

[4]:

# rule_diff_ufunc_threshold
s1.loc[date_range[4]] += 1
c3 = rlib.rule_diff_ufunc_threshold(s1, (np.greater_equal,), 1.1)
assert (c3["correction_code"] == 2).sum() == 1
c3

[4]:

	correction_code	series_values	comparison_values
2020-01-01	0	NaN	NaN
2020-01-02	0	NaN	NaN
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	2	5.0	1.1
2020-01-06	0	NaN	NaN
2020-01-07	0	NaN	NaN
2020-01-08	0	NaN	NaN
2020-01-09	0	NaN	NaN
2020-01-10	0	NaN	NaN
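
The same check can be sketched by hand: take the first difference of the series and compare it to the threshold. The snippet below is an illustration under that assumption and does not call traval; `step` and `mask` are illustrative names.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
s = pd.Series(index=index, data=np.arange(10))
s.loc[index[4]] += 1  # introduce one step of size 2

# Difference between consecutive observations; the first entry is NaN.
step = s.diff()

# NaN compares as False, so the first observation is never flagged.
mask = np.greater_equal(step, 1.1)
```

Only the observation on 2020-01-05 has a step of at least 1.1, which matches the single flagged row in the table above.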

`rule_other_ufunc_threshold`

Rule that flags values in a series where another series meets the threshold condition.

[5]:

# rule_other_ufunc_threshold
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
val = s1.copy()
c4 = rlib.rule_other_ufunc_threshold(s1, val, (np.less,), 5)
assert (c4["correction_code"] == -2).sum() == 5
c4

[5]:

	correction_code	series_values	comparison_values
2020-01-01	-2	0.0	5.0
2020-01-02	-2	1.0	5.0
2020-01-03	-2	2.0	5.0
2020-01-04	-2	3.0	5.0
2020-01-05	-2	4.0	5.0
2020-01-06	0	NaN	NaN
2020-01-07	0	NaN	NaN
2020-01-08	0	NaN	NaN
2020-01-09	0	NaN	NaN
2020-01-10	0	NaN	NaN

`rule_max_gradient`

Rule that checks the maximum gradient between two values.

[6]:

# rule_max_gradient
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
s1.loc[date_range[4]] += 1
c5 = rlib.rule_max_gradient(s1, max_step=1.0, max_timestep="1D")
assert (c5["correction_code"] == 2).sum() == 1
c5

[6]:

	correction_code	series_values	comparison_values
2020-01-01	0	NaN	NaN
2020-01-02	0	NaN	NaN
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	2	5.0	1.0
2020-01-06	0	NaN	NaN
2020-01-07	0	NaN	NaN
2020-01-08	0	NaN	NaN
2020-01-09	0	NaN	NaN
2020-01-10	0	NaN	NaN
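
A gradient check of this kind can be sketched as the change in value divided by the elapsed time. The snippet below is a rough illustration only: it expresses time in days, treats rises and drops alike via the absolute value, and ignores how traval handles `max_timestep`; all three are assumptions.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
s = pd.Series(index=index, data=np.arange(10))
s.loc[index[4]] += 1

# Elapsed time between consecutive observations, in days.
dt_days = s.index.to_series().diff() / pd.Timedelta("1D")

# Change per day between consecutive observations.
gradient = s.diff() / dt_days

# Flag steps steeper than 1.0 per day (absolute value is an assumption).
mask = gradient.abs() > 1.0
```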

`rule_spike_detection`

Rule that detects spikes: single observations that differ significantly from both neighbors.

[7]:

# rule_spike_detection
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
s1.iloc[4] += 3
c6 = rlib.rule_spike_detection(s1, threshold=2, spike_tol=2)
assert (c6["correction_code"] == 99).sum() == 1
c6

[7]:

	correction_code	series_values	comparison_values
2020-01-01	0	NaN	NaN
2020-01-02	0	NaN	NaN
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	99	7.0	NaN
2020-01-06	0	NaN	NaN
2020-01-07	0	NaN	NaN
2020-01-08	0	NaN	NaN
2020-01-09	0	NaN	NaN
2020-01-10	0	NaN	NaN

`rule_offset_detection`

Rule that looks for periods that are offset relative to the rest of the time series.

[8]:

# rule_offset_detection
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
s1.iloc[3:7] += 10
c7 = rlib.rule_offset_detection(s1, threshold=5, updown_diff=2.0)
assert (c7["correction_code"] == 99).sum() == 4
c7

[8]:

	correction_code	series_values	comparison_values
2020-01-01	0.0	NaN	NaN
2020-01-02	0.0	NaN	NaN
2020-01-03	0.0	NaN	NaN
2020-01-04	99.0	NaN	NaN
2020-01-05	99.0	NaN	NaN
2020-01-06	99.0	NaN	NaN
2020-01-07	99.0	NaN	NaN
2020-01-08	0.0	NaN	NaN
2020-01-09	0.0	NaN	NaN
2020-01-10	0.0	NaN	NaN

`rule_outside_n_sigma`

Rule that checks if measurements are outside \(N\) standard deviations of the time series.

[9]:

# rule_outside_n_sigma
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
c8 = rlib.rule_outside_n_sigma(s1, n=1.0)
assert (c8["correction_code"] == -2).sum() == 2
assert (c8["correction_code"] == 2).sum() == 2
c8

[9]:

	correction_code	series_values	comparison_values
2020-01-01	-2	0.0	1.47235
2020-01-02	-2	1.0	1.47235
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	0	NaN	NaN
2020-01-06	0	NaN	NaN
2020-01-07	0	NaN	NaN
2020-01-08	0	NaN	NaN
2020-01-09	2	8.0	7.52765
2020-01-10	2	9.0	7.52765
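
The comparison values in the table can be reproduced by hand as the mean plus or minus \(N\) sample standard deviations. A minimal sketch using pandas only (not traval's code; `lower`, `upper`, `below`, and `above` are illustrative names):

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
s = pd.Series(index=index, data=np.arange(10))

n = 1.0
# pandas uses the sample standard deviation (ddof=1) by default.
lower = s.mean() - n * s.std()
upper = s.mean() + n * s.std()

below = s < lower  # corresponds to the rows with code -2 above
above = s > upper  # corresponds to the rows with code 2 above
```

Here `lower` is approximately 1.47235 and `upper` approximately 7.52765, the same bounds shown in the `comparison_values` column.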

`rule_diff_outside_of_n_sigma`

Rule that checks if the diff of a series lies outside of \(N\) standard deviations of the differences.

[10]:

# rule_diff_outside_of_n_sigma
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
s1.iloc[5:] += np.arange(5)
c9 = rlib.rule_diff_outside_of_n_sigma(s1, 2.0)
assert (c9["correction_code"] == 2).sum() == 4
c9

[10]:

	correction_code	series_values	comparison_values
2020-01-01	0	NaN	NaN
2020-01-02	0	NaN	NaN
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	0	NaN	NaN
2020-01-06	0	NaN	NaN
2020-01-07	2	2.0	1.054093
2020-01-08	2	2.0	1.054093
2020-01-09	2	2.0	1.054093
2020-01-10	2	2.0	1.054093
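
The comparison value 1.054093 in the table equals \(N\) times the sample standard deviation of the differences. The sketch below reproduces that number with pandas; it only checks positive differences, and whether traval treats negative differences symmetrically is not shown here, so treat that part as an assumption.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
s = pd.Series(index=index, data=np.arange(10))
s.iloc[5:] += np.arange(5)  # steps of 2 instead of 1 in the second half

step = s.diff()

# N times the sample standard deviation of the differences.
n = 2.0
limit = n * step.std()

# Differences larger than the limit are flagged.
mask = step > limit
```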

`rule_outside_bandwidth`

Rule that checks whether values lie outside given lower and upper bounds.

[11]:

# rule_outside_bandwidth
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
lb = pd.Series(index=date_range[[0, -1]], data=[1, 2])
ub = pd.Series(index=date_range[[0, -1]], data=[7, 8])
c10 = rlib.rule_outside_bandwidth(s1, lb, ub)
assert (c10["correction_code"] == -2).sum() == 2
assert (c10["correction_code"] == 2).sum() == 2
c10

[11]:

	correction_code	series_values	comparison_values
2020-01-01	-2	0.0	1.000000
2020-01-02	-2	1.0	1.111111
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	0	NaN	NaN
2020-01-06	0	NaN	NaN
2020-01-07	0	NaN	NaN
2020-01-08	0	NaN	NaN
2020-01-09	2	8.0	7.888889
2020-01-10	2	9.0	8.000000
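
The bounds are given at only two timestamps, so they have to be interpolated to the index of the series before comparing. A minimal pandas sketch of that idea, assuming linear interpolation in time (it reproduces the `comparison_values` above but is not traval's code):

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
s = pd.Series(index=index, data=np.arange(10))
lb = pd.Series(index=index[[0, -1]], data=[1.0, 2.0])
ub = pd.Series(index=index[[0, -1]], data=[7.0, 8.0])

# Put the bounds on the index of the series and interpolate in time.
lb_full = lb.reindex(s.index).interpolate(method="time")
ub_full = ub.reindex(s.index).interpolate(method="time")

below = s < lb_full  # rows with code -2 in the table above
above = s > ub_full  # rows with code 2 in the table above
```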

`rule_shift_to_manual_obs`

Rule that corrects observations and shifts them to manual observations using linear interpolation of the differences between the time series and the manual observations.

[36]:

# rule_shift_to_manual_obs
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
h = pd.Series(index=date_range[[1, -1]], data=[2, 10])
a = rlib.rule_shift_to_manual_obs(s1, h, max_dt="2D", method="linear")
assert (a.iloc[1:] == s1.iloc[1:] + 1).all()
assert a.iloc[0] == s1.iloc[0]
ax = s1.plot()
h.plot(ax=ax, marker="x", ls="none")
a.plot(ax=ax, ls="dashed");
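
The correction can be sketched by hand: compute the difference at the manual observations, interpolate it onto the series index, and add it to the series. The snippet below is an illustration under two assumptions: no shift is applied before the first manual observation (consistent with the assertion on `a.iloc[0]` above), and the `max_dt` matching window is ignored.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
s = pd.Series(index=index, data=np.arange(10), dtype=float)
h = pd.Series(index=index[[1, -1]], data=[2.0, 10.0])

# Difference between manual observations and the series at those times.
diff_at_obs = h - s.loc[h.index]

# Interpolate the difference between manual observations only;
# outside that range the shift is set to zero (assumption).
shift = (
    diff_at_obs.reindex(s.index)
    .interpolate(method="time", limit_area="inside")
    .fillna(0.0)
)

shifted = s + shift
```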

`rule_compare_to_manual_obs`

Rule that compares a time series to manual observations. Values are marked as suspect when the linearly interpolated difference between the time series and the manual observations exceeds a threshold.

[39]:

# rule_compare_to_manual_obs
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
h = pd.Series(index=date_range[[1, -1]], data=[2, 7])
c11 = rlib.rule_compare_to_manual_obs(
    s1, h, threshold=1.0, max_dt="2D", method="linear"
)
ax = s1.plot(label="series")
h.plot(ax=ax, marker="o", ls="none", label="manual observations")
s1.loc[c11["correction_code"] != 0].plot(
    ax=ax, marker="x", ls="none", label="suspect observations", c="C3"
)
ax.legend(loc=(0, 1), frameon=False, ncol=3, fontsize="small")
c11

[39]:

	correction_code	series_values	comparison_values
2020-01-01	0	NaN	NaN
2020-01-02	0	NaN	NaN
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	0	NaN	NaN
2020-01-06	0	NaN	NaN
2020-01-07	0	NaN	NaN
2020-01-08	-2	-1.250	-1.0
2020-01-09	-2	-1.625	-1.0
2020-01-10	-2	-2.000	-1.0
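
The values in the `series_values` column above match the interpolated difference between the manual observations and the series. A minimal sketch of that calculation with pandas (illustrative only; the `max_dt` window is ignored and the result is a plain boolean mask rather than correction codes):

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
s = pd.Series(index=index, data=np.arange(10), dtype=float)
h = pd.Series(index=index[[1, -1]], data=[2.0, 7.0])

# Difference (manual observation minus series) at the manual observations,
# interpolated in time between them.
diff = (
    (h - s.loc[h.index])
    .reindex(s.index)
    .interpolate(method="time", limit_area="inside")
)

# Suspect where the interpolated difference exceeds the threshold.
threshold = 1.0
suspect = diff.abs() > threshold
```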

`rule_combine_corrections_or`

Rule for combining results of any number of other rules. Observations are suspect if ANY rule flags an observation as suspect.

[42]:

# rule_combine_corrections_or
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.DataFrame(index=date_range, columns=["correction_code"], data=0)
s2 = s1.copy()
s1.iloc[0] = 99
s2.iloc[-1] = -2
c11 = rlib.rule_combine_corrections_or(s1, s2)
assert (c11["correction_code"] == 99).sum() == 2
c11

[42]:

	correction_code	series_values	comparison_values
2020-01-01	99	NaN	NaN
2020-01-02	0	NaN	NaN
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	0	NaN	NaN
2020-01-06	0	NaN	NaN
2020-01-07	0	NaN	NaN
2020-01-08	0	NaN	NaN
2020-01-09	0	NaN	NaN
2020-01-10	99	NaN	NaN
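
Conceptually this is an element-wise logical OR over the "is flagged" masks of the individual rules. A small sketch of that idea with plain pandas, for any number of inputs (the names are illustrative and this does not call traval):

```python
import operator
from functools import reduce

import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
codes_a = pd.Series(0, index=index)
codes_b = pd.Series(0, index=index)
codes_a.iloc[0] = 99
codes_b.iloc[-1] = -2

# A non-zero correction code means the observation was flagged.
masks = [codes_a != 0, codes_b != 0]

# Suspect if ANY rule flagged the observation.
suspect_any = reduce(operator.or_, masks)
```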

`rule_combine_corrections_and`

Rule for combining results of any number of other rules. Observations are suspect if ALL rules flag an observation as suspect.

[55]:

# rule_combine_corrections_and
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.DataFrame(index=date_range, columns=["correction_code"], data=0)
s2 = s1.copy()
s1.iloc[0:2] = 99
s2.iloc[1:3] = -2
c12 = rlib.rule_combine_corrections_and(s1, s2)
assert (c12["correction_code"] == 99).sum() == 1
c12

[55]:

	correction_code	series_values	comparison_values
2020-01-01	0	NaN	NaN
2020-01-02	99	NaN	NaN
2020-01-03	0	NaN	NaN
2020-01-04	0	NaN	NaN
2020-01-05	0	NaN	NaN
2020-01-06	0	NaN	NaN
2020-01-07	0	NaN	NaN
2020-01-08	0	NaN	NaN
2020-01-09	0	NaN	NaN
2020-01-10	0	NaN	NaN
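
The AND combination keeps only the observations flagged by every rule. A minimal sketch with pandas, stacking the masks as columns and requiring all of them to be true (illustrative names, not traval's implementation):

```python
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
codes_a = pd.Series(0, index=index)
codes_b = pd.Series(0, index=index)
codes_a.iloc[0:2] = 99
codes_b.iloc[1:3] = -2

# One boolean column per rule; a row is suspect only if ALL columns are True.
flags = pd.concat([codes_a != 0, codes_b != 0], axis=1)
suspect_all = flags.all(axis=1)
```

Only 2020-01-02 is flagged by both inputs, which matches the single non-zero row in the table above.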

`rule_funcdict`

Rule that takes a dictionary of functions and applies those iteratively to the original time series. Observations are suspect if any rule flags an observation as suspect.

[24]:

# rule_funcdict
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
fdict = {"lt_3": lambda s: s < 3.0, "gt_7": lambda s: s > 7.0}
c13 = rlib.rule_funcdict(s1, fdict)
assert (c13["correction_code"] == 99).sum() == 5
c13

[24]:

	correction_code	series_values	comparison_values
2020-01-01	99	0.0	NaN
2020-01-02	99	1.0	NaN
2020-01-03	99	2.0	NaN
2020-01-04	0	NaN	NaN
2020-01-05	0	NaN	NaN
2020-01-06	0	NaN	NaN
2020-01-07	0	NaN	NaN
2020-01-08	0	NaN	NaN
2020-01-09	99	8.0	NaN
2020-01-10	99	9.0	NaN
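
The idea can be sketched as a loop over the dictionary that ORs the boolean result of each function into one mask. This is an illustration of the concept only; how traval stores the per-function results is not covered.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
s = pd.Series(index=index, data=np.arange(10))
fdict = {"lt_3": lambda x: x < 3.0, "gt_7": lambda x: x > 7.0}

# Start with nothing flagged, then OR in the result of each function.
suspect = pd.Series(False, index=s.index)
for name, func in fdict.items():
    suspect |= func(s)
```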

`rule_keep_comments`

Rule that keeps observations that have a given comment associated with them. It can be used to filter the comments of a validated time series to obtain specific observations.

[33]:

# rule_keep_comments
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
raw = pd.Series(index=date_range, data=np.arange(10), dtype=float)
comments = ["keep"] * 4 + [""] * 3 + ["discard"] * 3
comment_series = pd.Series(index=raw.index, data=comments)
c14 = rlib.rule_keep_comments(raw, ["keep"], comment_series)
assert (c14["correction_code"] == 99).sum() == 4
assert (c14["comparison_values"] == "keep").sum() == 4
c14

[33]:

	correction_code	series_values	comparison_values
2020-01-01	99	0.0	keep
2020-01-02	99	1.0	keep
2020-01-03	99	2.0	keep
2020-01-04	99	3.0	keep
2020-01-05	0	NaN
2020-01-06	0	NaN
2020-01-07	0	NaN
2020-01-08	0	NaN
2020-01-09	0	NaN
2020-01-10	0	NaN
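
A comparable filter can be written directly in pandas. The sketch below assumes exact matching of the comment text with `isin`; whether traval matches comments exactly or by substring is not shown here, so treat that as an assumption.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", freq="D", periods=10)
raw = pd.Series(index=index, data=np.arange(10), dtype=float)
comments = ["keep"] * 4 + [""] * 3 + ["discard"] * 3
comment_series = pd.Series(index=raw.index, data=comments)

# True where the comment is one of the comments to keep.
keep_mask = comment_series.isin(["keep"])

# Observations with a matching comment; all others become NaN.
kept = raw.where(keep_mask)
```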
