Error detection rules included in traval

This notebook shows simple examples of the error detection rules in traval.

[1]:
import numpy as np
import pandas as pd

import traval
from traval import rulelib as rlib

Create a very simple time series:

[35]:
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
s1
[35]:
2020-01-01    0
2020-01-02    1
2020-01-03    2
2020-01-04    3
2020-01-05    4
2020-01-06    5
2020-01-07    6
2020-01-08    7
2020-01-09    8
2020-01-10    9
Freq: D, dtype: int64

rule_ufunc_threshold: float threshold

Rule comparing series to threshold value.

[2]:
c1 = rlib.rule_ufunc_threshold(s1, (np.greater_equal,), 5)
assert (c1["correction_code"] == 2).sum() == 5
c1
[2]:
correction_code series_values comparison_values
2020-01-01 0 NaN NaN
2020-01-02 0 NaN NaN
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 0 NaN NaN
2020-01-06 2 5.0 5.0
2020-01-07 2 6.0 5.0
2020-01-08 2 7.0 5.0
2020-01-09 2 8.0 5.0
2020-01-10 2 9.0 5.0

rule_ufunc_threshold: threshold series

Rule comparing series to threshold series.

[3]:
# rule_ufunc_threshold: series
idx = date_range[:3].to_list() + date_range[-4:-1].to_list()
thresh_series = pd.Series(index=idx, data=5.0)
full_threshold_series = traval.ts_utils.resample_short_series_to_long_series(
    thresh_series, s1
)
c2 = rlib.rule_ufunc_threshold(s1, (np.greater_equal,), thresh_series)
assert (c2["correction_code"] == 2).sum() == 5
c2
[3]:
correction_code series_values comparison_values
2020-01-01 0 NaN NaN
2020-01-02 0 NaN NaN
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 0 NaN NaN
2020-01-06 2 5.0 5.0
2020-01-07 2 6.0 5.0
2020-01-08 2 7.0 5.0
2020-01-09 2 8.0 5.0
2020-01-10 2 9.0 5.0

rule_diff_ufunc_threshold

Rule comparing diff of series to threshold value.

[4]:
# rule_diff_ufunc_threshold
s1.loc[date_range[4]] += 1
c3 = rlib.rule_diff_ufunc_threshold(s1, (np.greater_equal,), 1.1)
assert (c3["correction_code"] == 2).sum() == 1
c3
[4]:
correction_code series_values comparison_values
2020-01-01 0 NaN NaN
2020-01-02 0 NaN NaN
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 2 5.0 1.1
2020-01-06 0 NaN NaN
2020-01-07 0 NaN NaN
2020-01-08 0 NaN NaN
2020-01-09 0 NaN NaN
2020-01-10 0 NaN NaN

rule_other_ufunc_threshold

Rule comparing other series to threshold.

[5]:
# rule_other_ufunc_threshold
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
val = s1.copy()
c4 = rlib.rule_other_ufunc_threshold(s1, val, (np.less,), 5)
assert (c4["correction_code"] == -2).sum() == 5
c4
[5]:
correction_code series_values comparison_values
2020-01-01 -2 0.0 5.0
2020-01-02 -2 1.0 5.0
2020-01-03 -2 2.0 5.0
2020-01-04 -2 3.0 5.0
2020-01-05 -2 4.0 5.0
2020-01-06 0 NaN NaN
2020-01-07 0 NaN NaN
2020-01-08 0 NaN NaN
2020-01-09 0 NaN NaN
2020-01-10 0 NaN NaN

rule_max_gradient

Rule that checks the maximum gradient between two values.

[6]:
# rule_max_gradient
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
s1.loc[date_range[4]] += 1
c5 = rlib.rule_max_gradient(s1, max_step=1.0, max_timestep="1D")
assert (c5["correction_code"] == 2).sum() == 1
c5
[6]:
correction_code series_values comparison_values
2020-01-01 0 NaN NaN
2020-01-02 0 NaN NaN
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 2 5.0 1.0
2020-01-06 0 NaN NaN
2020-01-07 0 NaN NaN
2020-01-08 0 NaN NaN
2020-01-09 0 NaN NaN
2020-01-10 0 NaN NaN

rule_spike_detection

Rule that detects spikes, single observations that differ significantly from both neighbors.

[7]:
# rule_spike_detection
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
s1.iloc[4] += 3
c6 = rlib.rule_spike_detection(s1, threshold=2, spike_tol=2)
assert (c6["correction_code"] == 99).sum() == 1
c6
[7]:
correction_code series_values comparison_values
2020-01-01 0 NaN NaN
2020-01-02 0 NaN NaN
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 99 7.0 NaN
2020-01-06 0 NaN NaN
2020-01-07 0 NaN NaN
2020-01-08 0 NaN NaN
2020-01-09 0 NaN NaN
2020-01-10 0 NaN NaN

rule_offset_detection

Rule that looks for periods that are offset relative to the rest of the time series.

[8]:
# rule_offset_detection
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
s1.iloc[3:7] += 10
c7 = rlib.rule_offset_detection(s1, threshold=5, updown_diff=2.0)
assert (c7["correction_code"] == 99).sum() == 4
c7
[8]:
correction_code series_values comparison_values
2020-01-01 0.0 NaN NaN
2020-01-02 0.0 NaN NaN
2020-01-03 0.0 NaN NaN
2020-01-04 99.0 NaN NaN
2020-01-05 99.0 NaN NaN
2020-01-06 99.0 NaN NaN
2020-01-07 99.0 NaN NaN
2020-01-08 0.0 NaN NaN
2020-01-09 0.0 NaN NaN
2020-01-10 0.0 NaN NaN

rule_outside_n_sigma

Rule that checks if measurements are outside \(N\) standard deviations of the time series.

[9]:
# rule_outside_n_sigma
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
c8 = rlib.rule_outside_n_sigma(s1, n=1.0)
assert (c8["correction_code"] == -2).sum() == 2
assert (c8["correction_code"] == 2).sum() == 2
c8
[9]:
correction_code series_values comparison_values
2020-01-01 -2 0.0 1.47235
2020-01-02 -2 1.0 1.47235
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 0 NaN NaN
2020-01-06 0 NaN NaN
2020-01-07 0 NaN NaN
2020-01-08 0 NaN NaN
2020-01-09 2 8.0 7.52765
2020-01-10 2 9.0 7.52765

rule_diff_outside_of_n_sigma

Rule that checks if the diff of a series lies outside of \(N\) standard deviations of the differences.

[10]:
# rule_diff_outside_of_n_sigma
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
s1.iloc[5:] += np.arange(5)
c9 = rlib.rule_diff_outside_of_n_sigma(s1, 2.0)
assert (c9["correction_code"] == 2).sum() == 4
c9
[10]:
correction_code series_values comparison_values
2020-01-01 0 NaN NaN
2020-01-02 0 NaN NaN
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 0 NaN NaN
2020-01-06 0 NaN NaN
2020-01-07 2 2.0 1.054093
2020-01-08 2 2.0 1.054093
2020-01-09 2 2.0 1.054093
2020-01-10 2 2.0 1.054093

rule_outside_bandwidth

Rule checking values lie outside some given upper and lower thresholds.

[11]:
# rule_outside_bandwidth
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
lb = pd.Series(index=date_range[[0, -1]], data=[1, 2])
ub = pd.Series(index=date_range[[0, -1]], data=[7, 8])
c10 = rlib.rule_outside_bandwidth(s1, lb, ub)
assert (c10["correction_code"] == -2).sum() == 2
assert (c10["correction_code"] == 2).sum() == 2
c10
[11]:
correction_code series_values comparison_values
2020-01-01 -2 0.0 1.000000
2020-01-02 -2 1.0 1.111111
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 0 NaN NaN
2020-01-06 0 NaN NaN
2020-01-07 0 NaN NaN
2020-01-08 0 NaN NaN
2020-01-09 2 8.0 7.888889
2020-01-10 2 9.0 8.000000

rule_shift_to_manual_obs

Rule that corrects observations and shifts them to manual observations using linear interpolation of the differences between the time series and the manual observations.

[36]:
# rule_shift_to_manual_obs
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
h = pd.Series(index=date_range[[1, -1]], data=[2, 10])
a = rlib.rule_shift_to_manual_obs(s1, h, max_dt="2D", method="linear")
assert (a.iloc[1:] == s1.iloc[1:] + 1).all()
assert a.iloc[0] == s1.iloc[0]
ax = s1.plot()
h.plot(ax=ax, marker="x", ls="none")
a.plot(ax=ax, ls="dashed");
../_images/examples_ex03_rules_25_0.png

rule_compare_to_manual_obs

Rule that compares a time series to manual observations. Values are marked as suspect when the linear interpolated difference between the time series and the manual observations exceeds some threshold.

[39]:
# rule compare_to_manual_obs
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
h = pd.Series(index=date_range[[1, -1]], data=[2, 7])
c11 = rlib.rule_compare_to_manual_obs(
    s1, h, threshold=1.0, max_dt="2D", method="linear"
)
ax = s1.plot(label="series")
h.plot(ax=ax, marker="o", ls="none", label="manual observations")
s1.loc[c11["correction_code"] != 0].plot(
    ax=ax, marker="x", ls="none", label="suspect observations", c="C3"
)
ax.legend(loc=(0, 1), frameon=False, ncol=3, fontsize="small")
c11
[39]:
correction_code series_values comparison_values
2020-01-01 0 NaN NaN
2020-01-02 0 NaN NaN
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 0 NaN NaN
2020-01-06 0 NaN NaN
2020-01-07 0 NaN NaN
2020-01-08 -2 -1.250 -1.0
2020-01-09 -2 -1.625 -1.0
2020-01-10 -2 -2.000 -1.0
../_images/examples_ex03_rules_27_1.png

rule_combine_corrections_or

Rule for combining results of any number of other rules. Observations are suspect if ANY rule flags an observation as suspect.

[42]:
# rule_combine_corrections_or
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.DataFrame(index=date_range, columns=["correction_code"], data=0)
s2 = s1.copy()
s1.iloc[0] = 99
s2.iloc[-1] = -2
c11 = rlib.rule_combine_corrections_or(s1, s2)
assert (c11["correction_code"] == 99).sum() == 2
c11
[42]:
correction_code series_values comparison_values
2020-01-01 99 NaN NaN
2020-01-02 0 NaN NaN
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 0 NaN NaN
2020-01-06 0 NaN NaN
2020-01-07 0 NaN NaN
2020-01-08 0 NaN NaN
2020-01-09 0 NaN NaN
2020-01-10 99 NaN NaN

rule_combine_corrections_and

Rule for combining results of any number of other rules. Observations are suspect if ALL rules flag an observation as suspect.

[55]:
# rule_combine_corrections_and
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.DataFrame(index=date_range, columns=["correction_code"], data=0)
s2 = s1.copy()
s1.iloc[0:2] = 99
s2.iloc[1:3] = -2
c12 = rlib.rule_combine_corrections_and(s1, s2)
assert (c12["correction_code"] == 99).sum() == 1
c12
[55]:
correction_code series_values comparison_values
2020-01-01 0 NaN NaN
2020-01-02 99 NaN NaN
2020-01-03 0 NaN NaN
2020-01-04 0 NaN NaN
2020-01-05 0 NaN NaN
2020-01-06 0 NaN NaN
2020-01-07 0 NaN NaN
2020-01-08 0 NaN NaN
2020-01-09 0 NaN NaN
2020-01-10 0 NaN NaN

rule_funcdict

Rule that takes a dictionary of functions and applies those iteratively to the original time series. Observations are suspect if any rule flags an observation as suspect.

[24]:
# rule_funcdict_to_nan
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
s1 = pd.Series(index=date_range, data=np.arange(10))
fdict = {"lt_3": lambda s: s < 3.0, "gt_7": lambda s: s > 7.0}
c13 = rlib.rule_funcdict(s1, fdict)
assert (c13["correction_code"] == 99).sum() == 5
c13
[24]:
correction_code series_values comparison_values
2020-01-01 99 0.0 NaN
2020-01-02 99 1.0 NaN
2020-01-03 99 2.0 NaN
2020-01-04 0 NaN NaN
2020-01-05 0 NaN NaN
2020-01-06 0 NaN NaN
2020-01-07 0 NaN NaN
2020-01-08 0 NaN NaN
2020-01-09 99 8.0 NaN
2020-01-10 99 9.0 NaN

rule_keep_comments

Rule that keeps observations that have some comment associated with it. Can be used to filter validated time series comments to obtain specific observations.

[33]:
# rule_keep_comments
date_range = pd.date_range("2020-01-01", freq="D", periods=10)
raw = pd.Series(index=date_range, data=np.arange(10), dtype=float)
comments = ["keep"] * 4 + [""] * 3 + ["discard"] * 3
comment_series = pd.Series(index=raw.index, data=comments)
c14 = rlib.rule_keep_comments(raw, ["keep"], comment_series)
assert (c14["correction_code"] == 99).sum() == 4
assert (c14["comparison_values"] == "keep").sum() == 4
c14
[33]:
correction_code series_values comparison_values
2020-01-01 99 0.0 keep
2020-01-02 99 1.0 keep
2020-01-03 99 2.0 keep
2020-01-04 99 3.0 keep
2020-01-05 0 NaN
2020-01-06 0 NaN
2020-01-07 0 NaN
2020-01-08 0 NaN
2020-01-09 0 NaN
2020-01-10 0 NaN
[ ]: