API Documentation

Detector

class traval.detector.Detector(series, truth=None)[source]

Detector object for applying error detection algorithms to time series.

The Detector is used to apply error detection algorithms to a time series and optionally contains a ‘truth’ series, to which the error detection result can be compared. An example of a ‘truth’ series is a manually validated time series. Custom error detection algorithms can be defined using the RuleSet object.

Parameters:
  • series (pd.Series or pd.DataFrame) – time series to check

  • truth (pd.Series or pd.DataFrame, optional) – series that represents the ‘truth’, i.e. a benchmark to which the error detection result can be compared, by default None

Examples

Given a time series ‘series’ and some ruleset ‘rset’:

>>> d = Detector(series)
>>> d.apply_ruleset(rset)
>>> d.plot_overview()

See also

traval.RuleSet

object for defining detection algorithms

static _validate_input_series(series)[source]

Internal method for checking type and dtype of series.

Parameters:

series (object) – time series to check, must be pd.Series or pd.DataFrame. Datatype of series or first column of DataFrame must be float.

Raises:

TypeError – if series or dtype of series does not comply

apply_ruleset(ruleset, compare=True)[source]

Apply RuleSet to series.

Parameters:
  • ruleset (traval.RuleSet) – RuleSet object containing detection rules

  • compare (bool or list of int, optional) – if True, compare all results to original series and store in dictionary under comparisons attribute, default is True. If False, do not store comparisons. If list of int, store only those step numbers as comparisons. Note: value of -1 refers to last step for convenience.

See also

traval.RuleSet

object for defining detection algorithms

confusion_matrix(steps=None, truth=None)[source]

Calculate confusion matrix stats for detection rules.

Note: the calculated statistics per rule contain overlapping counts, i.e. multiple rules can mark the same observation as suspect.

Parameters:
  • steps (int, list of int or None, optional) – steps for which to calculate confusion matrix statistics, by default None which uses all steps.

  • truth (pd.Series or pd.DataFrame, optional) – series representing the “truth”, i.e. a benchmark to which the resulting series is compared. By default None, which uses the stored truth series. Argument is included so a different truth can be passed.

Returns:

df – dataframe containing confusion matrix data, i.e. counts of true positives, false positives, true negatives and false negatives.

Return type:

pd.DataFrame
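
The four counts follow the usual confusion-matrix definitions. As a minimal sketch in plain Python (not traval's implementation; the helper name `confusion_counts` and the boolean-list inputs are assumptions made for illustration), with `True` meaning "marked as suspect":

```python
def confusion_counts(flagged, truth_flagged):
    """Count TP, FP, TN and FN for two equally long lists of booleans."""
    if len(flagged) != len(truth_flagged):
        raise ValueError("inputs must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for detected, actual in zip(flagged, truth_flagged):
        if detected and actual:
            counts["tp"] += 1  # flagged by the rule and flagged in the truth
        elif detected:
            counts["fp"] += 1  # flagged by the rule only
        elif actual:
            counts["fn"] += 1  # flagged in the truth only
        else:
            counts["tn"] += 1  # flagged by neither
    return counts


rule_flags = [True, False, True, False, False]
truth_flags = [True, True, False, False, False]
counts = confusion_counts(rule_flags, truth_flags)
```

Applying such a count per rule is what produces the overlapping counts mentioned in the note: an observation flagged by two rules is counted once for each of them.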

get_comment_series(steps=None)[source]
get_corrections_comparison(truth=None)[source]
get_corrections_dataframe(as_correction_codes=False, as_addable_df=False)[source]

Get DataFrame containing corrections.

Parameters:
  • as_correction_codes (bool, optional) – return DataFrame with correction codes, by default False

  • as_addable_df (bool, optional) – return DataFrame with corrections dataframe that you can add to the original time series to obtain the final result. Corrections are NaN when errors are detected, and nonzero where observations are shifted, and zero everywhere else.

Returns:

df – DataFrame containing corrections.

Return type:

pandas.DataFrame
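
The meaning of an "addable" corrections frame can be illustrated with plain floats. This is only a sketch of the arithmetic described for as_addable_df, using lists instead of a DataFrame; the numbers are invented:

```python
import math

original = [1.0, 1.25, 9.5, 1.5]
# NaN marks a detected error, a nonzero value is a shift, 0.0 means no change
corrections = [0.0, 0.0, math.nan, -0.5]

# adding NaN yields NaN, so flagged observations drop out of the result
final = [value + corr for value, corr in zip(original, corrections)]
```

The third observation becomes NaN, the fourth is shifted down by 0.5, and the first two are unchanged.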

get_final_result()[source]

Get final time series with flagged values set to NaN.

Returns:

series – time series produced by final step in RuleSet with flagged values set to NaN.

Return type:

pandas.Series

get_indices(category, step, truth=None)[source]
get_results_dataframe()[source]

Get results as DataFrame.

Returns:

df – results with flagged values set to NaN per applied rule.

Return type:

pandas.DataFrame

get_series(step, category=None)[source]
plot_overview(mark_suspects=True, **kwargs)[source]

Plot time series with flagged values per applied rule.

Parameters:

mark_suspects (bool, optional) – mark suspect values with red X, by default True

Returns:

ax – axes objects

Return type:

list of matplotlib.pyplot.Axes

reset()[source]

Reset Detector object.

set_truth(truth)[source]

Set ‘truth’ series.

Used for comparison with detection result.

Parameters:

truth (pd.Series or pd.DataFrame) – Series or DataFrame containing the “truth”, i.e. a benchmark to compare the detection result to.

stats_per_comment(step=None, truth=None)[source]
uniqueness(truth=None)[source]

Calculate unique contribution per rule to stats.

Note: the calculated statistics per rule are undercounted, i.e. observations that are marked as suspect by multiple rules are not included in this result.

Parameters:
  • steps (int, list of int or None, optional) – steps for which to calculate confusion matrix statistics, by default None which uses all steps.

  • truth (pd.Series or pd.DataFrame, optional) – series representing the “truth”, i.e. a benchmark to which the resulting series is compared. By default None, which uses the stored truth series. Argument is included so a different truth can be passed.

Returns:

df – dataframe containing confusion matrix data, i.e. unique counts of true positives, false positives, true negatives and false negatives.

Return type:

pd.DataFrame

RuleSet

class traval.ruleset.RuleSet(name=None)[source]

Create RuleSet object for storing detection rules.

The RuleSet object stores detection rules and other relevant information in a dictionary. The order in which rules are carried out, the functions that parse the time series, and the extra arguments required by those functions are all stored together.

The detection functions must take a series as the first argument, and return a series with corrections based on the detection rule. In the corrections series invalid values are set to np.nan, and adjustments are defined with a float. No change is defined as 0. Extra keyword arguments for the function can be passed through a kwargs dictionary. These kwargs are also allowed to contain functions. These functions must return some value based on the name of the series.
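
As a rough illustration of this contract, the sketch below defines a hypothetical rule `rule_above_limit`. A plain dict stands in for the pd.Series that traval actually passes, so this shows the shape of the return value rather than a function that can be added to a RuleSet as is:

```python
import math


def rule_above_limit(series, limit=10.0):
    """Hypothetical detection rule: mark values greater than `limit`.

    The series is the first argument; the result has the same index, with
    NaN for invalid values and 0.0 where nothing changes.
    """
    return {t: (math.nan if v > limit else 0.0) for t, v in series.items()}


series = {"2020-01-01": 9.5, "2020-01-02": 12.0, "2020-01-03": 8.0}
corrections = rule_above_limit(series, limit=10.0)
```

Only the second observation exceeds the limit, so only that entry is NaN.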

Parameters:

name (str, optional) – name of the RuleSet, by default None

Examples

Given two detection functions ‘foo’ and ‘bar’:

>>> rset = RuleSet(name="foobar")
>>> rset.add_rule("foo", foo, apply_to=0)  # add rule 1
>>> rset.add_rule("bar", bar, apply_to=1, kwargs={"n": 2})  # add rule 2
>>> print(rset)  # print overview of rules
add_rule(name, func, apply_to=None, kwargs=None)[source]

Add rule to RuleSet.

Parameters:
  • name (str) – name of the rule

  • func (callable) – function that takes series as input and returns a correction series.

  • apply_to (int or tuple of ints, optional) – series to apply the rule to, by default None, which defaults to the original series. E.g. 0 is the original series, 1 is the result of step 1, etc. If a tuple of ints is passed, the results of those steps are collected and passed to func.

  • kwargs (dict, optional) – dictionary of additional keyword arguments for func, by default None. Additional arguments can be functions as well, in which case they must return some value based on the name of the series to which the RuleSet will be applied.

del_rule(name)[source]

Delete rule from RuleSet.

Parameters:

name (str) – name of the rule to delete

classmethod from_json(fname)[source]

Load RuleSet object from JSON file.

Attempts to load functions in the RuleSet by searching for the function name in traval.rulelib. If the function cannot be found, only the name of the function is preserved. This means a RuleSet with custom functions will not be fully functional when loaded from a JSON file.
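
The lookup-by-name behaviour can be pictured as follows. This is a sketch under assumptions, not the code used by traval: the `RULELIB` dictionary, the JSON layout and `load_rule` are all invented for the example.

```python
import json


def rule_hardmax(series, threshold):
    """Stand-in for a library function; the body is irrelevant here."""
    return [v > threshold for v in series]


RULELIB = {"rule_hardmax": rule_hardmax}  # hypothetical registry of known rules

stored = json.dumps({"name": "max", "func": "rule_hardmax"})
stored_custom = json.dumps({"name": "mine", "func": "my_custom_rule"})


def load_rule(text):
    """Restore the function by name, keeping the bare name if it is unknown."""
    rule = json.loads(text)
    rule["func"] = RULELIB.get(rule["func"], rule["func"])
    return rule


loaded = load_rule(stored)
loaded_custom = load_rule(stored_custom)
```

The first rule gets its function back, while the custom rule is left with a string and can no longer be executed.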

Parameters:

fname (str) – filename or path to file

Returns:

RuleSet object

Return type:

RuleSet

See also

to_json

store RuleSet as JSON file (does not support custom functions)

to_pickle

store RuleSet as pickle (supports custom functions)

from_pickle

load RuleSet from pickle file

classmethod from_pickle(fname)[source]

Load RuleSet object from pickle file.

Parameters:

fname (str) – filename or path to file

Returns:

RuleSet object, including custom functions and parameters

Return type:

RuleSet

See also

to_pickle

store RuleSet as pickle (supports custom functions)

to_json

store RuleSet as json file (does not support custom functions)

from_json

load RuleSet from json file

get_resolved_ruleset(name)[source]

Get ruleset for a specific time series.

Retrieves the result of all functions that obtain parameters based on the name of the time series.

Parameters:

name (str) – name of the time series

Returns:

new copy of ruleset with parameters for a specific time series

Return type:

RuleSet
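
Resolving name-dependent parameters might look roughly like this. The helper `resolve_kwargs` and the per-series thresholds are hypothetical. The sketch also suggests why the ufunc arguments in the rule library are wrapped in a tuple: a tuple is not callable, so it would pass through such a step untouched (this reading is an assumption based on the "bypass RuleSet logic" remarks in this reference).

```python
import operator


def resolve_kwargs(kwargs, series_name):
    """Replace callable kwargs by the value they return for `series_name`."""
    return {
        key: (value(series_name) if callable(value) else value)
        for key, value in kwargs.items()
    }


thresholds = {"well_1": 2.5, "well_2": 4.0}  # hypothetical per-series settings
kwargs = {
    "threshold": lambda name: thresholds[name],  # resolved per series
    "offset": 0.1,                               # plain value, kept as is
    "ufunc": (operator.gt,),                     # tuple, so not called
}
resolved = resolve_kwargs(kwargs, "well_2")
```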

to_dataframe()[source]

Convert RuleSet to pandas.DataFrame.

Returns:

rdf – DataFrame containing all the information from the RuleSet

Return type:

pandas.DataFrame

to_json(fname=None, verbose=True)[source]

Write RuleSet to disk as json file.

Note that it is not possible to write custom functions to a JSON file. When writing the JSON, only the name of the function is stored. When loading a JSON file, the function name is used to search within traval.rulelib. If the function can be found, it loads that function. A RuleSet that only makes use of functions in the default rulelib can therefore be stored as JSON and loaded again without losing functionality.

Parameters:
  • fname (str) – filename or path to file

  • verbose (bool, optional) – prints message when operation complete, default is True

See also

from_json

load RuleSet from json file

to_pickle

store RuleSet as pickle (supports custom functions)

from_pickle

load RuleSet from pickle file

to_pickle(fname, verbose=True)[source]

Write RuleSet to disk as pickle.

Parameters:
  • fname (str) – filename or path of file

  • verbose (bool, optional) – prints message when operation complete, default is True

See also

from_pickle

load RuleSet from pickle file

to_json

store RuleSet as json file (does not support custom functions)

from_json

load RuleSet from json file

update_rule(name, func, apply_to=None, kwargs=None)[source]

Update rule in RuleSet.

Parameters:
  • name (str) – name of the rule

  • func (callable) – function that takes series as input and returns a correction series.

  • apply_to (int or tuple of ints, optional) – series to apply the rule to, by default None, which defaults to the original series. E.g. 0 is the original series, 1 is the result of step 1, etc. If a tuple of ints is passed, the results of those steps are collected and passed to func.

  • kwargs (dict, optional) – dictionary of additional keyword arguments for func, by default None. Additional arguments can be functions as well, in which case they must return some value based on the name of the series to which the RuleSet will be applied.

class traval.ruleset.RuleSetEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Encode values in RuleSet to JSON.

default(o)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return super().default(o)
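
In the same spirit, an encoder that stores callables by name could be sketched as below. `CallableNameEncoder` is a hypothetical class written for this example; it is not a copy of RuleSetEncoder, whose exact behaviour is not described here.

```python
import json


class CallableNameEncoder(json.JSONEncoder):
    """Hypothetical encoder that serializes functions as their name."""

    def default(self, o):
        if callable(o):
            return o.__name__
        # Let the base class default method raise the TypeError
        return super().default(o)


def my_rule(series):
    return series


encoded = json.dumps({"func": my_rule, "kwargs": {"n": 2}}, cls=CallableNameEncoder)
decoded = json.loads(encoded)
```

After a round trip only the string "my_rule" remains, which is why custom functions cannot be recovered from JSON.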

Rule Library

traval.rulelib.rule_combine_corrections_and(*args)[source]

Combination rule, combine corrections for any number of time series.

Used for combining intermediate results in branching algorithm trees to create one final result, i.e. (corr_s1 AND corr_s2)

Returns:

corrections – a series with same index as the input time series containing corrections. Contains corrections where all of the input series values contain corrections.

Return type:

pd.Series

traval.rulelib.rule_combine_corrections_or(*args)[source]

Combination rule, combine corrections for any number of time series.

Used for combining intermediate results in branching algorithm trees to create one final result, i.e. (corr_s1 OR corr_s2)

Returns:

corrections – a series with same index as the input time series containing corrections. Contains corrections where any of the input series values contain corrections.

Return type:

pd.Series

traval.rulelib.rule_combine_nan_and(*args)[source]

Combination rule, combine NaN values for any number of time series.

Used for combining intermediate results in branching algorithm trees to create one final result, i.e. (s1.isna() AND s2.isna())

Returns:

corrections – a series with same index as the input time series containing corrections. Contains NaNs where all of the input series values are NaN.

Return type:

pd.Series

traval.rulelib.rule_combine_nan_or(*args)[source]

Combination rule, combine NaN values for any number of time series.

Used for combining intermediate results in branching algorithm trees to create one final result, i.e. (s1.isna() OR s2.isna())

Returns:

corrections – a series with same index as the input time series containing corrections. Contains NaNs where any of the input series values is NaN.

Return type:

pd.Series
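
The difference between the AND and OR combinations can be sketched with plain lists. The helper `combine_nan` is an assumption for illustration and only mimics the NaN variants; index alignment, which the library rules have to deal with, is ignored.

```python
import math


def combine_nan(series_list, mode="or"):
    """Return NaN where any ("or") or all ("and") inputs are NaN, else 0.0."""
    check = any if mode == "or" else all
    return [
        math.nan if check(math.isnan(v) for v in values) else 0.0
        for values in zip(*series_list)
    ]


s1 = [1.0, math.nan, math.nan]
s2 = [2.0, 3.0, math.nan]
combined_or = combine_nan([s1, s2], mode="or")
combined_and = combine_nan([s1, s2], mode="and")
```

The second observation is NaN in only one input, so it is flagged by the OR combination but not by the AND combination.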

traval.rulelib.rule_diff_outside_of_n_sigma(series, n=2.0, max_gap='7D')[source]

Detection rule, calculate diff of series and identify suspect observations.

Suspect observations are identified based on values outside of n * standard deviation of the difference.

Parameters:
  • series (pd.Series) – time series in which suspect values are identified

  • n (float, optional) – number of standard deviations to use, by default 2

  • max_gap (str, optional) – only considers observations within this maximum gap between measurements to calculate diff, by default “7D”.

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values are set to np.nan.

Return type:

pd.Series

traval.rulelib.rule_diff_ufunc_threshold(series, ufunc, threshold, max_gap='7D')[source]

Detection rule, flag values based on diff, operator and threshold.

Calculate diff of series and identify suspect observations based on comparison with threshold value.

The argument ufunc is a tuple containing a function, e.g. an operator function (i.e. ‘>’, ‘<’, ‘>=’, ‘<=’). These are passed using their named equivalents, e.g. in numpy: np.greater, np.less, np.greater_equal, np.less_equal. This function essentially does the following: ufunc(series, threshold_series). The argument is passed as a tuple to bypass RuleSet logic.

Parameters:
  • series (pd.Series) – time series in which suspect values are identified

  • ufunc (tuple) – tuple containing ufunc (i.e. (numpy.greater_equal,) ). The function must be callable according to ufunc(series, threshold). The function is passed as a tuple to bypass RuleSet logic.

  • threshold (float) – value to compare diff of time series to

  • max_gap (str, optional) – only considers observations within this maximum gap between measurements to calculate diff, by default “7D”.

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values are set to np.nan.

Return type:

pd.Series

traval.rulelib.rule_flat_signal(series, window, min_obs, std_threshold=0.0075, qbelow=None, qabove=None, hbelow=None, habove=None)[source]

Detection rule, flag values based on dead signal in rolling window.

Flag values when variation in signal within a window falls below a certain threshold value. Optionally provide quantiles below or above which to look for dead/flat signals.

Parameters:
  • series (pd.Series) – time series to analyse

  • window (int) – number of days in window

  • min_obs (int) – minimum number of observations in window to calculate standard deviation

  • std_threshold (float, optional) – standard deviation threshold value, by default 7.5e-3

  • qbelow (float, optional) – quantile value between 0 and 1, signifying an upper limit. Only search for flat signals below this limit. By default None.

  • qabove (float, optional) – quantile value between 0 and 1, signifying a lower limit. Only search for flat signals above this limit. By default None.

  • hbelow (float, optional) – absolute value in units of time series signifying an upper limit. Only search for flat signals below this limit. By default None.

  • habove (float, optional) – absolute value in units of time series signifying a lower limit. Only search for flat signals above this limit. By default None.

Returns:

corrections – a series with same index as the input time series containing corrections. Contains NaNs where the signal is considered flat or dead.

Return type:

pd.Series

traval.rulelib.rule_funcdict(series, funcdict)[source]

Detection rule, flag values with dictionary of functions.

Use dictionary of functions to identify suspect values and set them to NaN.

Parameters:
  • series (pd.Series) – time series in which suspect values are identified

  • funcdict (dict) – dictionary with function names as keys and functions/methods as values. Each function is applied to each value in the time series using series.apply(func). Suspect values are those where the function evaluates to True.

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values (according to the provided functions) are set to np.nan.

Return type:

pd.Series
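
A small sketch of the idea, using a list instead of a pd.Series. The two checks in `funcdict` and the helper `apply_funcdict` are invented for the example, and treating a value as suspect when any function returns True is an assumption:

```python
import math

funcdict = {
    "is_negative": lambda v: v < 0,
    "is_sentinel": lambda v: v == 999.0,
}


def apply_funcdict(values, funcdict):
    """Return NaN where any function evaluates to True, else 0.0."""
    return [
        math.nan if any(func(v) for func in funcdict.values()) else 0.0
        for v in values
    ]


corrections = apply_funcdict([1.2, -0.5, 999.0, 3.4], funcdict)
```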

traval.rulelib.rule_hardmax(series, threshold, offset=0.0)[source]

Detection rule, flag values greater than threshold value.

traval.rulelib.rule_hardmin(series, threshold, offset=0.0)[source]

Detection rule, flag values lower than threshold value.

traval.rulelib.rule_keep_comments(series, keep_comments, comment_series)[source]

Filter rule, modify time series to keep data with certain comments.

This rule was invented to extract time series only containing certain types of errors, based on labeled data. For example, to get only erroneous observations caused by sensors above the groundwater level:

  • series: the raw time series

  • keep_comments: list of comments to keep, e.g. [‘dry sensor’]

  • comment_series: time series containing the comments for erroneous obs

Parameters:
  • series (pd.Series) – time series to filter

  • keep_comments (list of str) – list of comments to keep

  • comment_series (pd.Series) – time series containing comments, should have same index as series

Returns:

corrections – dataframe containing correction code 99 where comment is in keep_comments and 0 otherwise.

Return type:

pd.DataFrame

traval.rulelib.rule_max_gradient(series, max_step=0.5, max_timestep='1D')[source]

Detection rule, flag values when maximum gradient exceeded.

Flag values when maximum gradient between two observations is exceeded. Use negative max_step to flag values with negative gradient.

Parameters:
  • series (pd.Series) – time series in which suspect values are identified

  • max_step (float, optional) – max jump between two observations within given timestep, by default 0.5

  • max_timestep (str, optional) – maximum timestep to consider, by default “1D”. The gradient is not calculated for values that lie farther apart.

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values are set to np.nan.

Return type:

pd.Series

traval.rulelib.rule_offset_detection(series, threshold=0.15, updown_diff=0.1, max_gap='7D', search_method='time', return_df=False)[source]

Detection rule, detect periods with an offset error.

This rule looks for jumps in both positive and negative direction that are larger than a particular threshold. It then tries to match jumps in upward direction to one in downward direction of a similar size. If this is possible, all observations between two matching but opposite jumps are set to NaN.

Parameters:
  • series (pd.Series) – time series in which to look for offset errors

  • threshold (float, optional) – minimum jump to consider as offset error, by default 0.15

  • updown_diff (float, optional) – the maximum difference between two opposite jumps to consider them matching, by default 0.1

  • max_gap (str, optional) – only considers observations within this maximum gap between measurements to calculate diff, by default “7D”.

  • search_method (str) – method for seeking matching opposite jumps. Options are “match” or “time”. Method “match” looks for the jump closest in magnitude to the current jump. Method “time” looks for the next jump in time that meets the updown_diff criterion.

  • return_df (bool, optional) – return the dataframe containing the potential offsets, by default False

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values are set to np.nan.

Return type:

pd.Series

traval.rulelib.rule_other_ufunc_threshold(series, other, ufunc, threshold)[source]

Detection rule, flag values based on other series and threshold.

Correct values based on comparison of another time series with a threshold value.

The argument ufunc is a tuple containing an operator function (i.e. ‘>’, ‘<’, ‘>=’, ‘<=’). These are passed using their named equivalents, e.g. in numpy: np.greater, np.less, np.greater_equal, np.less_equal. This function essentially does the following: ufunc(series, threshold_series). The argument is passed as a tuple to bypass RuleSet logic.

Parameters:
  • series (pd.Series) – time series in which suspect values are identified, only used to test if index of other overlaps

  • other (pd.Series) – other time series based on which suspect values are identified

  • ufunc (tuple) – tuple containing ufunc (i.e. (numpy.greater_equal,) ). The function must be callable according to ufunc(series, threshold). The function is passed as a tuple to bypass RuleSet logic.

  • threshold (float) – value to compare time series to

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values are set to np.nan.

Return type:

pd.Series

traval.rulelib.rule_outside_bandwidth(series, lowerbound, upperbound)[source]

Detection rule, set suspect values to NaN if outside bandwidth.

Parameters:
  • series (pd.Series) – time series in which suspect values are identified

  • lowerbound (pd.Series) – time series containing the lower bound, if bound values are less frequent than series, bound is interpolated to series.index

  • upperbound (pd.Series) – time series containing the upper bound, if bound values are less frequent than series, bound is interpolated to series.index

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values are set to np.nan.

Return type:

pd.Series

traval.rulelib.rule_outside_n_sigma(series, n=2.0)[source]

Detection rule, set values outside of n * standard deviation to NaN.

Parameters:
  • series (pd.Series) – time series in which suspect values are identified

  • n (float, optional) – number of standard deviations to use, by default 2

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values are set to np.nan.

Return type:

pd.Series
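
The n-sigma criterion can be sketched in plain Python. This is an illustration, not the library code: the helper `outside_n_sigma` is hypothetical, and the use of the sample standard deviation is an assumption.

```python
import math
import statistics


def outside_n_sigma(values, n=2.0):
    """Return NaN for values farther than n standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return [math.nan if abs(v - mean) > n * std else 0.0 for v in values]


values = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 5.0]
corrections = outside_n_sigma(values, n=2.0)
```

Only the last value lies more than two standard deviations from the mean. Note that an outlier inflates the standard deviation itself, so a single pass may miss smaller errors.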

traval.rulelib.rule_pastas_outside_pi(series, ml, ci=0.95, min_ci=None, smoothfreq=None, tmin=None, tmax=None, savedir=None, verbose=False)[source]

Detection rule, flag values based on pastas model prediction interval.

Flag suspect observations outside the prediction interval calculated by a pastas time series model. Uses a pastas.Model and a confidence interval as input.

Parameters:
  • series (pd.Series) – time series to identify suspect observations in

  • ml (pastas.Model) – time series model for series

  • ci (float, optional) – confidence interval for calculating bandwidth, by default 0.95. Higher confidence interval means that bandwidth is wider and more observations will fall within the bounds.

  • min_ci (float, optional) – value indicating minimum distance between upper and lower bounds, if ci does not meet this requirement, this value is added to the bounds. This can be used to prevent extremely narrow prediction intervals. Default is None.

  • smoothfreq (str, optional) – str indicating no. of periods and frequency str (e.g. “1D”) for smoothing upper and lower bounds, only used if smoothbounds=True, default is None.

  • tmin (str or pd.Timestamp, optional) – set tmin for model simulation

  • tmax (str or pd.Timestamp, optional) – set tmax for model simulation

  • savedir (str, optional) – save calculated prediction interval to folder as pickle file.

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values are set to np.nan.

Return type:

pd.Series

traval.rulelib.rule_shift_to_manual_obs(series, hseries, method='linear', max_dt='1D', reset_dates=None)[source]

Adjustment rule, for shifting time series onto manual observations.

Used for shifting time series based on sensor observations onto manual verification measurements. By default uses linear interpolation between two manual verification observations.

Parameters:
  • series (pd.Series) – time series to adjust

  • hseries (pd.Series) – time series containing manual observations

  • method (str, optional) – method to use for interpolating between two manual observations, by default “linear”. Other options are those that are accepted by series.reindex(): ‘bfill’, ‘ffill’, ‘nearest’.

  • max_dt (str, optional) – maximum amount of time between manual observation and value in series, by default “1D”

  • reset_dates (list, optional) – list of dates (as str or pd.Timestamp) on which to reset the adjustments to 0.0, by default None. Useful for resetting the adjustments when the sensor is replaced, for example.

Returns:

adjusted_series – time series containing adjustments to shift series onto manual observations.

Return type:

pd.Series

traval.rulelib.rule_spike_detection(series, threshold=0.15, spike_tol=0.15, max_gap='7D')[source]

Detection rule, identify spikes in time series and set to NaN.

Spikes are sudden jumps in the value of a time series that last 1 timestep. They can be both negative or positive.

Parameters:
  • series (pd.Series) – time series in which suspect values are identified

  • threshold (float, optional) – the minimum size of the jump to qualify as a spike, by default 0.15

  • spike_tol (float, optional) – offset between value of time series before spike and after spike, by default 0.15. After a spike, the value of the time series is usually close to but not identical to the value that preceded the spike. Use this parameter to control how close the value has to be.

  • max_gap (str, optional) – only considers observations within this maximum gap between measurements to calculate diff, by default “7D”.

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values are set to np.nan.

Return type:

pd.Series
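
One possible reading of this definition is sketched below. The helper `detect_spikes` is hypothetical and simplified: it works on a plain list, ignores max_gap, and the exact conditions used by traval may differ.

```python
import math


def detect_spikes(values, threshold=0.15, spike_tol=0.15):
    """Return NaN for single observations that jump away and back again."""
    corrections = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        jump_in = values[i] - values[i - 1]
        jump_out = values[i + 1] - values[i]
        opposite = jump_in * jump_out < 0  # up then down, or down then up
        large = abs(jump_in) > threshold and abs(jump_out) > threshold
        # value after the spike must be close to the value before it
        returns = abs(values[i + 1] - values[i - 1]) <= spike_tol
        if opposite and large and returns:
            corrections[i] = math.nan
    return corrections


values = [1.0, 1.02, 2.0, 1.05, 1.03, 1.5, 1.52]
corrections = detect_spikes(values)
```

The value 2.0 is flagged because the series returns to roughly its previous level one step later. The later rise to 1.5 is not flagged because the series stays at the new level.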

traval.rulelib.rule_ufunc_threshold(series, ufunc, threshold, offset=0.0)[source]

Detection rule, flag values based on operator and threshold value.

Set values to NaN based on operator function and threshold value. The argument ufunc is a tuple containing an operator function (e.g. ‘>’, ‘<’, ‘>=’, ‘<=’). These are passed using their named equivalents, e.g. in numpy: np.greater, np.less, np.greater_equal, np.less_equal. This function essentially does the following: ufunc(series, threshold).

Parameters:
  • series (pd.Series) – time series in which suspect values are identified

  • ufunc (tuple) – tuple containing ufunc (i.e. (numpy.greater_equal,) ). The function must be callable according to ufunc(series, threshold). The function is passed as a tuple to bypass RuleSet logic.

  • threshold (float or pd.Series) – value or time series to compare series with

  • offset (float, optional) – value that is added to the threshold, e.g. if some extra tolerance is allowable. Default value is 0.0.

Returns:

corrections – a series with same index as the input time series containing corrections. Suspect values are set to np.nan.

Return type:

pd.Series
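
The tuple convention and the offset can be illustrated as follows. The sketch uses `operator.gt` from the standard library as a stand-in for np.greater and a list instead of a pd.Series; the helper `ufunc_threshold` is hypothetical.

```python
import math
import operator


def ufunc_threshold(values, ufunc, threshold, offset=0.0):
    """Return NaN where ufunc(value, threshold + offset) is True, else 0.0."""
    compare = ufunc[0]  # unpack the function from its one-element tuple
    return [math.nan if compare(v, threshold + offset) else 0.0 for v in values]


values = [0.5, 1.5, 2.5]
corrections = ufunc_threshold(values, (operator.gt,), threshold=2.0)
corrections_offset = ufunc_threshold(values, (operator.gt,), threshold=2.0, offset=1.0)
```

With an offset of 1.0 the effective threshold becomes 3.0, so the last value is no longer flagged.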

Time Series Comparison

class traval.ts_comparison.DateTimeIndexComparison(idx1, idx2)[source]

Helper class for comparing two DateTimeIndexes.

idx_in_both()[source]

Index members in both DateTimeIndexes.

Returns:

index with entries in both

Return type:

DateTimeIndex

idx_in_idx1()[source]

Index members only in Index #1.

Returns:

index with entries only in index #1

Return type:

DateTimeIndex

idx_in_idx2()[source]

Index members only in Index #2.

Returns:

index with entries only in index #2

Return type:

DateTimeIndex
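
Conceptually, the three methods correspond to set operations on the two indexes. A sketch with standard-library dates (the class itself works on pandas DateTimeIndex objects):

```python
from datetime import date

idx1 = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)]
idx2 = [date(2020, 1, 2), date(2020, 1, 3), date(2020, 1, 4)]

in_both = sorted(set(idx1) & set(idx2))       # compare idx_in_both()
only_in_idx1 = sorted(set(idx1) - set(idx2))  # compare idx_in_idx1()
only_in_idx2 = sorted(set(idx2) - set(idx1))  # compare idx_in_idx2()
```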

class traval.ts_comparison.SeriesComparison(s1, s2, names=None, diff_threshold=0.0)[source]

Object for comparing two time series.

Comparison yields the following categories:

  • in_both_identical: in both series and difference <= diff_threshold

  • in_both_different: in both series and difference > diff_threshold

  • in_s1: only in series #1

  • in_s2: only in series #2

  • in_both_nan: NaN in both

Parameters:
  • s1 (pd.Series or pd.DataFrame) – first series to compare

  • s2 (pd.Series or pd.DataFrame) – second series to compare

  • diff_threshold (float, optional) – value beyond which a difference is considered significant, by default 0.0. Two values whose difference is smaller than threshold are considered identical.

compare_by_comment()[source]

Compare series per comment.

Returns:

comparison – series containing the possible comparison outcomes, but split into categories, one for each unique comment. Comments must be passed via series2.

Return type:

pd.Series

Raises:

ValueError – if no comment series is found

comparison_series()[source]

Create series that indicates what happened to a value.

Series index is the union of s1 and s2 with a value indicating the status of the comparison:

  • -1: value is modified

  • 0: value stays the same

  • 1: value only in series 1

  • 2: value only in series 2

  • -9999: value is NaN in both series

Returns:

s – series containing status of value from comparison

Return type:

pd.Series
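
The status codes can be reproduced with a small sketch. Dicts stand in for the two series, and the helper `comparison_status` is hypothetical. Treating a missing index entry the same as NaN in cases 1 and 2 is an assumption of this sketch.

```python
import math


def comparison_status(s1, s2, diff_threshold=0.0):
    """Map every index entry in the union of s1 and s2 to a status code."""
    status = {}
    for key in sorted(set(s1) | set(s2)):
        v1 = s1.get(key, math.nan)
        v2 = s2.get(key, math.nan)
        if math.isnan(v1) and math.isnan(v2):
            status[key] = -9999  # NaN in both series
        elif math.isnan(v2):
            status[key] = 1      # value only in series 1
        elif math.isnan(v1):
            status[key] = 2      # value only in series 2
        elif abs(v1 - v2) <= diff_threshold:
            status[key] = 0      # value stays the same
        else:
            status[key] = -1     # value is modified
    return status


s1 = {1: 1.0, 2: 2.0, 3: 3.0, 5: math.nan}
s2 = {1: 1.0, 2: 2.5, 4: 4.0, 5: math.nan}
status = comparison_status(s1, s2)
```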

class traval.ts_comparison.SeriesComparisonRelative(s1, truth, base, diff_threshold=0.0)[source]

Object for comparing two time series relative to a third time series.

Extends the SeriesComparison object to include a comparison between two time series and a third base time series. This is used, for example, when comparing the results of two error detection outcomes to the original raw time series.

Comparison yields both the results from SeriesComparison as well as the following categories for the relative comparison to the base time series:

  • kept_in_both: both time series and the base time series contain values

  • flagged_in_s1: value is NaN/missing in series #1

  • flagged_in_s2: value is NaN/missing in series #2

  • flagged_in_both: value is NaN/missing in both series #1 and series #2

  • in_all_nan: value is NaN in all time series (series #1, #2 and base)

  • introduced_in_s1: value is NaN/missing in base but has value in series #1

  • introduced_in_s2: value is NaN/missing in base but has value in series #2

  • introduced_in_both: value is NaN/missing in base but has value in both time series

Parameters:
  • s1 (pd.Series or pd.DataFrame) – first series to compare

  • truth (pd.Series or pd.DataFrame) – second series to compare, if a “truth” time series is available pass it as the second time series. Stored in object as ‘s2’.

  • base (pd.Series or pd.DataFrame) – time series to compare other two series with

  • diff_threshold (float, optional) – value beyond which a difference is considered significant, by default 0.0. Two values whose difference is smaller than threshold are considered identical.

See also

SeriesComparison

Comparison of two time series relative to each other

compare_to_base_by_comment()[source]

Compare two series to base series per comment.

Returns:

comparison – Series containing the number of observations in each possible comparison category, but split per unique comment. Comments must be provided via ‘truth’ series (series2).

Return type:

pd.Series

Raises:

ValueError – if no comment series is available.

Time series Utilities

class traval.ts_utils.CorrectionCode(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Codes and labels for labeling error detection results.

traval.ts_utils.bandwidth_moving_avg_n_sigma(series, window, n)[source]

Calculate bandwidth around time series based on moving average ± n * std.

Parameters:
  • series (pd.Series) – series to calculate bandwidth for

  • window (int) – number of observations to consider for moving average

  • n (float) – number of standard deviations from moving average for bandwidth

Returns:

bandwidth – dataframe with 2 columns, with lower and upper bandwidth

Return type:

pd.DataFrame
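The idea of a band around a moving average can be sketched as follows. The helper name moving_avg_bandwidth and the column names are hypothetical, and the sketch assumes a single standard deviation computed over the whole series; the library may compute it differently (for example as a rolling statistic).

```python
import pandas as pd


def moving_avg_bandwidth(series, window, n):
    """Band of n standard deviations around a moving average (hypothetical helper)."""
    avg = series.rolling(window).mean()
    # Assumption: one standard deviation for the whole series
    half_width = n * series.std()
    return pd.DataFrame({"lower": avg - half_width, "upper": avg + half_width})


series = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)
band = moving_avg_bandwidth(series, window=2, n=2.0)
```

The first window - 1 rows of the band are NaN because the moving average is not yet defined there.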

traval.ts_utils.corrections_as_float(corrections)[source]

Convert correction code series to floats.

Excludes codes 0 and 4, which are used to indicate no correction and a modification of the value, respectively.

Parameters:

corrections (pd.DataFrame) – dataframe with correction code and original + modified values

Returns:

c – return corrections series with floats where value is modified

Return type:

pd.Series

traval.ts_utils.corrections_as_nan(corrections)[source]

Convert correction code series to NaNs.

Excludes codes 0 and 4, which are used to indicate no correction and a modification of the value, respectively.

Parameters:

corrections (pd.Series or pd.DataFrame) – series or dataframe with correction code

Returns:

c – return corrections series with nans where value is corrected

Return type:

pd.Series

traval.ts_utils.create_synthetic_raw_time_series(raw_series, truth_series, comments)[source]

Create synthetic raw time series.

Updates ‘truth_series’ (where values are labelled with a comment) with values from raw_series. Used for removing unlabeled changes between a raw and validated time series.

Parameters:
  • raw_series (pd.Series) – time series with raw data

  • truth_series (pd.Series) – time series with validated data

  • comments (pd.Series) – time series with comments. Index must be same as ‘truth_series’. When value does not have a comment it must be an empty string: ‘’.

Returns:

s – synthetic raw time series, same as truth_series but updated with raw_series where value has been commented.

Return type:

pd.Series
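A minimal sketch of this idea, assuming all three series share the same index; the helper name synthetic_raw is hypothetical and the library's own handling of missing data or mismatched indices may differ.

```python
import numpy as np
import pandas as pd


def synthetic_raw(raw_series, truth_series, comments):
    """Copy truth_series, restoring raw values wherever a comment is present."""
    commented = comments != ""
    s = truth_series.copy()
    s.loc[commented] = raw_series.loc[commented]
    return s


idx = pd.date_range("2020-01-01", periods=3, freq="D")
raw = pd.Series([1.0, 9.0, 3.0], index=idx)
truth = pd.Series([1.1, np.nan, 3.2], index=idx)
comments = pd.Series(["", "spike", ""], index=idx)
s = synthetic_raw(raw, truth, comments)
```

In this example the unlabeled adjustments (1.0 to 1.1 and 3.0 to 3.2) are taken from the validated series, while the commented observation gets its raw value back.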

traval.ts_utils.diff_with_gap_awareness(series, max_gap='7D')[source]

Get diff of time series with a limit on gap between two values.

Parameters:
  • series (pd.Series) – time series to calculate diff for

  • max_gap (str, optional) – maximum period between two observations for calculating diff, otherwise set value to NaN, by default “7D”

Returns:

diff – time series with diff, with NaNs whenever two values are farther apart than max_gap.

Return type:

pd.Series
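The gap-aware difference described above can be sketched in plain pandas. The helper name diff_with_gap_limit is hypothetical; this is an illustration of the technique, not the library's code.

```python
import pandas as pd


def diff_with_gap_limit(series, max_gap="7D"):
    """First difference, set to NaN where consecutive timestamps exceed max_gap."""
    value_diff = series.diff()
    time_diff = series.index.to_series().diff()
    return value_diff.mask(time_diff > pd.Timedelta(max_gap))


series = pd.Series(
    [1.0, 1.5, 3.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-20"]),
)
diff = diff_with_gap_limit(series, max_gap="7D")
```

The last observation is 18 days after the previous one, so its difference is masked.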

traval.ts_utils.get_correction_status_name(corrections)[source]

Get correction status name from correction codes.

Parameters:

corrections (pd.DataFrame or pd.Series) – dataframe or series containing correction codes

Returns:

dataframe or series filled with correction status name

Return type:

pd.DataFrame or pd.Series

traval.ts_utils.get_empty_corrections_df(series)[source]

Get empty corrections dataframe.

Parameters:

series (pd.Series) – time series to apply corrections to

traval.ts_utils.interpolate_series_to_new_index(series, new_index)[source]

Interpolate time series to new DatetimeIndex.

Parameters:
  • series (pd.Series) – original series

  • new_index (DatetimeIndex) – new index to interpolate series to

Returns:

si – new series with new index, with interpolated values

Return type:

pd.Series

traval.ts_utils.mask_corrections_above_below(series, mask_above, threshold_above, mask_below, threshold_below)[source]

Get corrections where above or below threshold.

Parameters:
  • series (pd.Series) – time series to apply corrections to

  • threshold_above (pd.Series) – time series with values to compare with

  • mask_above (DatetimeIndex or boolean np.array) – DatetimeIndex containing timestamps where value should be set to NaN, or boolean array with same length as series set to True where value should be set to NaN. (Uses pandas .loc[mask] to set values.)

  • threshold_below (pd.Series) – time series with values to compare with

  • mask_below (DatetimeIndex or boolean np.array) – DatetimeIndex containing timestamps where value should be set to NaN, or boolean array with same length as series set to True where value should be set to NaN. (Uses pandas .loc[mask] to set values.)

traval.ts_utils.mask_corrections_above_threshold(series, threshold, mask)[source]

Get corrections where above threshold.

Parameters:
  • series (pd.Series) – time series to apply corrections to

  • threshold (pd.Series) – time series with values to compare with

  • mask (DatetimeIndex or boolean np.array) – DatetimeIndex containing timestamps where value should be set to NaN, or boolean array with same length as series set to True where value should be set to NaN. (Uses pandas .loc[mask] to set values.)

traval.ts_utils.mask_corrections_below_threshold(series, threshold, mask)[source]

Get corrections where below threshold.

Parameters:
  • series (pd.Series) – time series to apply corrections to

  • threshold (pd.Series) – time series with values to compare with

  • mask (DatetimeIndex or boolean np.array) – DatetimeIndex containing timestamps where value should be set to NaN, or boolean array with same length as series set to True where value should be set to NaN. (Uses pandas .loc[mask] to set values.)

traval.ts_utils.mask_corrections_equal_value(series, values, mask)[source]

Get corrections where equal to value.

Parameters:
  • series (pd.Series) – time series to apply corrections to

  • values (pd.Series) – time series with values to compare with

  • mask (DatetimeIndex or boolean np.array) – DatetimeIndex containing timestamps where value should be set to NaN, or boolean array with same length as series set to True where value should be set to NaN. (Uses pandas .loc[mask] to set values.)

traval.ts_utils.mask_corrections_modified_value(series, values, mask)[source]

Get corrections where value was modified.

Parameters:
  • series (pd.Series) – time series to apply corrections to

  • values (pd.Series) – time series with values to compare with

  • mask (DatetimeIndex or boolean np.array) – DatetimeIndex containing timestamps where value should be set to NaN, or boolean array with same length as series set to True where value should be set to NaN. (Uses pandas .loc[mask] to set values.)

traval.ts_utils.mask_corrections_no_comparison_value(series, mask)[source]

Get corrections where no comparison value is available.

Parameters:
  • series (pd.Series) – time series to apply corrections to

  • mask (DatetimeIndex or boolean np.array) – DatetimeIndex containing timestamps where value should be set to NaN, or boolean array with same length as series set to True where value should be set to NaN. (Uses pandas .loc[mask] to set values.)

traval.ts_utils.mask_corrections_not_equal_value(series, values, mask)[source]

Get corrections where not equal to value.

Parameters:
  • series (pd.Series) – time series to apply corrections to

  • values (pd.Series) – time series with values to compare with

  • mask (DatetimeIndex or boolean np.array) – DatetimeIndex containing timestamps where value should be set to NaN, or boolean array with same length as series set to True where value should be set to NaN. (Uses pandas .loc[mask] to set values.)

traval.ts_utils.resample_short_series_to_long_series(short_series, long_series)[source]

Resample a short time series to index from a longer time series.

First uses ‘ffill’ then ‘bfill’ to fill new series.

Parameters:
  • short_series (pd.Series) – short time series

  • long_series (pd.Series) – long time series

Returns:

new_series – series with index from long_series and data from short_series

Return type:

pd.Series
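The "ffill then bfill" behaviour can be approximated with a pandas reindex. This is a sketch under the assumption that both indices are sorted; the helper name short_to_long is hypothetical and the library's exact placement of values may differ.

```python
import pandas as pd


def short_to_long(short_series, long_series):
    """Carry short_series values onto the index of long_series (ffill, then bfill)."""
    filled = short_series.reindex(long_series.index, method="ffill")
    # Timestamps before the first short_series value are still NaN: back-fill them
    return filled.bfill()


long_series = pd.Series(
    0.0, index=pd.date_range("2020-01-01", periods=6, freq="D")
)
short_series = pd.Series(
    [1.0, 2.0], index=pd.to_datetime(["2020-01-02", "2020-01-05"])
)
new_series = short_to_long(short_series, long_series)
```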

traval.ts_utils.spike_finder(series, threshold=0.15, spike_tol=0.15, max_gap='7D')[source]

Find spikes in time series.

Spikes are sudden jumps in the value of a time series that last 1 timestep. They can be either negative or positive.

Parameters:
  • series (pd.Series) – time series to find spikes in

  • threshold (float, optional) – the minimum size of the jump to qualify as a spike, by default 0.15

  • spike_tol (float, optional) – offset between value of time series before spike and after spike, by default 0.15. After a spike, the value of the time series is usually close to but not identical to the value that preceded the spike. Use this parameter to control how close the value has to be.

  • max_gap (str, optional) – only considers observations within this maximum gap between measurements to calculate diff, by default “7D”.

Returns:

upspikes, downspikes – pandas DatetimeIndex objects containing timestamps of upward and downward spikes.

Return type:

pandas.DatetimeIndex
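How threshold and spike_tol interact can be illustrated with a simplified detector. This is a sketch, not the library's algorithm: the helper name find_spikes is hypothetical, and it ignores max_gap by assuming regularly spaced observations.

```python
import pandas as pd


def find_spikes(series, threshold=0.15, spike_tol=0.15):
    """Return timestamps of one-step upward and downward spikes."""
    before = series.shift(1)
    after = series.shift(-1)
    jump_in = series - before   # change going into the observation
    jump_out = after - series   # change leaving the observation
    # After the spike the series should return close to its previous level
    returns_close = (after - before).abs() <= spike_tol
    up = (jump_in > threshold) & (jump_out < 0) & returns_close
    down = (jump_in < -threshold) & (jump_out > 0) & returns_close
    return series.index[up.to_numpy()], series.index[down.to_numpy()]


series = pd.Series(
    [1.0, 1.0, 2.0, 1.05, 1.0, 0.2, 1.0],
    index=pd.date_range("2020-01-01", periods=7, freq="D"),
)
upspikes, downspikes = find_spikes(series)
```

Here the value 2.0 is an upward spike (the series returns to 1.05, within spike_tol of 1.0) and the value 0.2 is a downward spike.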

traval.ts_utils.unique_nans_in_series(series, *args)[source]

Get mask where NaNs in series are unique compared to other series.

Parameters:
  • series (pd.Series) – identify unique NaNs in series

  • *args – any number of pandas.Series

Returns:

mask – mask with value True where NaN is unique to series

Return type:

pd.Series
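A minimal sketch of such a mask, assuming all series share the same index; the helper name unique_nans is hypothetical.

```python
import numpy as np
import pandas as pd


def unique_nans(series, *others):
    """True where series is NaN while every other series has a value."""
    mask = series.isna()
    for other in others:
        mask = mask & other.notna()
    return mask


idx = pd.date_range("2020-01-01", periods=3, freq="D")
series = pd.Series([np.nan, np.nan, 1.0], index=idx)
other = pd.Series([np.nan, 2.0, 3.0], index=idx)
mask = unique_nans(series, other)
```

Only the second timestamp is marked: the first is NaN in both series, so it is not unique to series.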

Binary Classification

class traval.binary_classifier.BinaryClassifier(tp, fp, tn, fn)[source]

Class for calculating binary classification statistics.

property accuracy

Accuracy of binary classification.

ACC = (TP + TN) / (TP + FP + FN + TN)

where TP = True Positives, TN = True Negatives, FP = False Positives and FN = False Negatives.

confusion_matrix(as_array=False)[source]

Calculate confusion matrix.

Confusion matrix shows the performance of the algorithm given a certain truth. An abstract example of the confusion matrix:

                |     Algorithm     |
                |-------------------|
                |  error  | correct |
------|---------|---------|---------|
      | error   |   TP    |   FN    |
Truth |---------|---------|---------|
      | correct |   FP    |   TN    |
------|---------|---------|---------|

where:

  • TP: True Positives = errors correctly detected by algorithm

  • TN: True Negatives = correct values correctly not flagged by algorithm

  • FP: False Positives = correct values marked as errors by algorithm

  • FN: False Negatives = errors not detected by algorithm

Parameters:

as_array (bool, optional) – return data as array instead of DataFrame, by default False

Returns:

data – confusion matrix

Return type:

pd.DataFrame or np.array
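The four counts can be derived from two boolean flags per observation. The sketch below uses made-up data and a hypothetical helper name confusion_counts; it returns the counts in the [[TP, FN], [FP, TN]] layout described above.

```python
def confusion_counts(truth_is_error, flagged):
    """Count TP, FN, FP, TN from per-observation boolean flags."""
    pairs = list(zip(truth_is_error, flagged))
    tp = sum(t and f for t, f in pairs)          # error, detected
    fn = sum(t and not f for t, f in pairs)      # error, missed
    fp = sum(not t and f for t, f in pairs)      # correct value, flagged
    tn = sum(not t and not f for t, f in pairs)  # correct value, not flagged
    return [[tp, fn], [fp, tn]]


truth_is_error = [True, True, False, False, False]
flagged = [True, False, True, False, False]
cmat = confusion_counts(truth_is_error, flagged)
```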

property false_discovery_rate

False discovery rate.

FDR = 1 - PPV = FP / (FP + TP)

where TP = True Positives and FP = False Positives.

property false_negative_rate

False Negative Rate = (1 - sensitivity).

FNR = FN / (FN + TP)

where FN = False Negatives and TP = True Positives.

property false_omission_rate

False omission rate.

FOR = 1 - NPV = FN / (TN + FN)

where TN = True Negatives and FN = False Negatives.

property false_positive_rate

False Positive Rate = (1 - specificity).

FPR = FP / (FP + TN)

where FP = False Positives and TN = True Negatives.

classmethod from_confusion_matrix(cmat)[source]

Create BinaryClassifier from confusion matrix.

Note

Confusion Matrix must be passed as an np.array or pd.DataFrame corresponding to: [[TP, FN], [FP, TN]], like the one returned by BinaryClassifier.confusion_matrix

Parameters:

cmat (np.array or pd.DataFrame) – a 2x2 dataset with structure [[TP, FN], [FP, TN]]

Returns:

BinaryClassifier object based on values in confusion matrix.

Return type:

BinaryClassifier

See also

BinaryClassifier.confusion_matrix

for explanation (of abbreviations)

classmethod from_series_comparison_relative(comparison)[source]

Binary Classification object from SeriesComparisonRelative object.

Parameters:

comparison (traval.SeriesComparisonRelative) – object comparing two time series with base time series

Returns:

object for calculating binary classification statistics

Return type:

BinaryClassifier

get_all_statistics(use_abbreviations=True)[source]

Get all statistics in pandas.Series.

Parameters:

use_abbreviations (bool, optional) – whether to use abbreviations or full names for index, by default True

Returns:

s – series containing all statistics

Return type:

pandas.Series

property informedness

Informedness statistic (a.k.a. Youden’s J statistic).

Measure of diagnostic performance. It is zero when a diagnostic test gives the same proportion of positive results for groups with and without a condition, i.e. the test is useless. A value of 1 indicates that there are no false positives or false negatives, i.e. the test is perfect.

Calculated as:

informedness = specificity + sensitivity - 1.
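The two reference values mentioned above (0 for a useless test, 1 for a perfect one) can be checked with a few made-up counts. The standalone function below is a sketch that only mirrors the formula; it is not the library's property, and its argument order is chosen to match BinaryClassifier(tp, fp, tn, fn).

```python
def informedness(tp, fp, tn, fn):
    """Youden's J: specificity + sensitivity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return specificity + sensitivity - 1


# Half of the errors and half of the correct values are flagged: useless test
useless = informedness(tp=5, fp=45, tn=45, fn=5)
# No false positives and no false negatives: perfect test
perfect = informedness(tp=10, fp=0, tn=90, fn=0)
```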

property matthews_correlation_coefficient

Matthews correlation coefficient (MCC).

The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation.

Returns:

phi – the Matthews correlation coefficient

Return type:

float

See also

mcc

convenience method for calculating MCC
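The docstring does not spell out the formula. The standard definition is MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), sketched below as a standalone function. Returning 0.0 when the denominator is zero is an assumption (a common convention); the library may handle that case differently.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumption: treat an undefined coefficient as 0.0
        return 0.0
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=10, fp=0, tn=90, fn=0)
inverted = mcc(tp=0, fp=90, tn=0, fn=10)
random_like = mcc(tp=5, fp=45, tn=45, fn=5)
```

The three made-up cases reproduce the reference values +1, −1 and 0 described above.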

property mcc

Convenience method for calculating Matthews correlation coefficient.

Returns:

phi – the Matthews correlation coefficient

Return type:

float

See also

matthews_correlation_coefficient

more information about the statistic

property negative_predictive_value

Negative predictive value.

NPV = TN / (TN + FN)

where TN = True Negatives and FN = False Negatives.

property positive_predictive_value

Positive predictive value (a.k.a. precision).

PPV = TP / (TP + FP)

where TP = True Positives and FP = False Positives.
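The predictive values and their complements (FDR = 1 - PPV and FOR = 1 - NPV) can be verified with a small worked example. The counts are made up for illustration.

```python
tp, fp, tn, fn = 8, 2, 85, 5

ppv = tp / (tp + fp)          # positive predictive value (precision)
fdr = fp / (fp + tp)          # false discovery rate
npv = tn / (tn + fn)          # negative predictive value
f_omission = fn / (tn + fn)   # false omission rate ("for" is a Python keyword)
```

With these counts 8 of the 10 flagged observations are real errors (PPV = 0.8), and 85 of the 90 unflagged observations are truly correct.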

property prevalence

Prevalence of true errors in total population.

Prevalence = (TP + FN) / (TP + FP + FN + TN)

where TP = True Positives, TN = True Negatives, FP = False Positives and FN = False Negatives.

property sensitivity

Sensitivity or True Positive Rate.

Statistic describing ratio of true positives identified, which also says something about the avoidance of false negatives.

Sensitivity = TP / (TP + FN)

where TP = True Positives and FN = False Negatives.

property specificity

Specificity or True Negative Rate.

Statistic describing ratio of true negatives identified, which also says something about the avoidance of false positives.

Specificity = TN / (TN + FP)

where TN = True Negatives and FP = False Positives.
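Sensitivity and specificity pair up with the false negative and false positive rates (FNR = 1 - sensitivity, FPR = 1 - specificity). A small worked example with made-up counts:

```python
tp, fp, tn, fn = 8, 5, 85, 2

sensitivity = tp / (tp + fn)  # share of real errors that were detected
fnr = fn / (fn + tp)          # share of real errors that were missed
specificity = tn / (tn + fp)  # share of correct values left unflagged
fpr = fp / (fp + tn)          # share of correct values wrongly flagged
```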

property true_negative_rate

True Negative Rate. Synonym for specificity.

See specificity for description.

property true_positive_rate

True Positive Rate. Synonym for sensitivity.

See sensitivity for description.

Plots

class traval.plots.ComparisonPlots(cp)[source]

Mix-in class for plots for comparing time series.

plot_relative_comparison(mark_unique=True, mark_different=True, mark_identical=True, mark_introduced=False, ax=None)[source]

Plot comparison between two time series relative to base time series.

Parameters:
  • mark_unique (bool, optional) – mark unique observations with colored X’s, by default True

  • mark_different (bool, optional) – highlight where series are different in red, by default True

  • mark_identical (bool, optional) – highlight where series are identical with green, by default True

  • mark_introduced (bool, optional) – mark observations that are not in the base time series with X’s, by default False

  • ax (axis, optional) – axis to plot on, by default None

Returns:

ax – axis handle

Return type:

axis

plot_series_comparison(mark_unique=True, mark_different=True, mark_identical=True, ax=None)[source]

Plot comparison between two time series.

Parameters:
  • mark_unique (bool, optional) – mark unique values with colored X’s, by default True

  • mark_different (bool, optional) – highlight where time series differ with red, by default True

  • mark_identical (bool, optional) – highlight where time series are identical with green, by default True

  • ax (axis, optional) – axis object to plot on, by default None

Returns:

ax – axis object

Return type:

axis

reset_color_dict()[source]

Reset color_dict to default values.

update_color_dict(key, color=None, alpha=None)[source]

Update colors for plots.

Parameters:
  • key (str) – name of category to update, see ComparisonPlots.color_dict.keys() for options

  • color (str, optional) – color name, by default None

  • alpha (float, optional) – alpha value, by default None

traval.plots.det_plot(fpr, fnr, labels, ax=None, **kwargs)[source]

Detection Error Tradeoff plot.

Adapted from scikit-learn DetCurveDisplay.

Parameters:
  • fpr (list or value or array) – false positive rate. If passed as a list loops through each entry and plots it. Otherwise just plots the array or value.

  • fnr (list or value or array) – false negative rate. If passed as a list loops through each entry and plots it. Otherwise just plots the array or value.

  • labels (list or str) – label for each fpr/fnr entry.

  • ax (matplotlib.pyplot.Axes, optional) – axes handle to plot on, by default None, which creates a new figure

Returns:

ax – axes handle

Return type:

matplotlib.pyplot.Axes

traval.plots.roc_plot(tpr, fpr, labels, colors=None, ax=None, plot_diagonal=True, colorbar_label=None, **kwargs)[source]

Receiver operating characteristic (ROC) plot.

Plots the false positive rate (x-axis) versus the true positive rate (y-axis). The ‘tpr’ and ‘fpr’ can be passed as:

  • values: outcome of a single error detection algorithm

  • arrays: outcomes of error detection algorithm in which a detection parameter is varied

  • lists: for passing multiple results, entries can be values or arrays, as listed above.

Parameters:
  • tpr (list or value or array) – true positive rate. If passed as a list loops through each entry and plots it. Otherwise just plots the array or value.

  • fpr (list or value or array) – false positive rate. If passed as a list loops through each entry and plots it. Otherwise just plots the array or value.

  • labels (list or str) – label for each tpr/fpr entry.

  • ax (matplotlib.pyplot.Axes, optional) – axes to plot on, default is None, which creates new figure

  • plot_diagonal (bool, optional) – whether to plot the diagonal (useful for combining multiple ROC plots)

  • **kwargs – passed to ax.scatter

Returns:

ax – axes instance

Return type:

matplotlib.pyplot.Axes
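The "arrays" input case (one detection parameter varied) can be prepared as sketched below. The scores, thresholds and the helper name roc_points are made up for illustration; the sketch only computes the rates and does not call the plotting function itself.

```python
def roc_points(scores, truth_is_error, thresholds):
    """True and false positive rates for each detection threshold."""
    tpr, fpr = [], []
    for thr in thresholds:
        # An observation is flagged when its score exceeds the threshold
        flagged = [s > thr for s in scores]
        pairs = list(zip(truth_is_error, flagged))
        tp = sum(t and f for t, f in pairs)
        fn = sum(t and not f for t, f in pairs)
        fp = sum(not t and f for t, f in pairs)
        tn = sum(not t and not f for t, f in pairs)
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    return tpr, fpr


scores = [0.1, 0.4, 0.35, 0.8]
truth_is_error = [False, False, True, True]
tpr, fpr = roc_points(scores, truth_is_error, thresholds=[0.0, 0.3, 0.5, 0.9])
```

The resulting tpr and fpr lists trace one curve from the top-right corner (everything flagged) to the bottom-left corner (nothing flagged), and could be passed as the tpr and fpr arguments together with a label.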