Advertisement

TS-Evolutionary_Prototyping: A Python module for finding the prototype in large sets of time series

Open AccessPublished:December 22, 2022DOI:https://doi.org/10.1016/j.simpa.2022.100458

      Highlights

      • A software for time series summarisation and prototyping in large datasets.
      • An evolutionary strategy mainly based in genetic algorithms to find the prototype.
      • Use of elastic distances as distance measures between series.
      • A Python library that obtains the centroid as prototype in large sets of time series.

      Abstract

      Time series analysis has become one of the basic building blocks for the technological fields of science and engineering. Therefore, there are a large number of software tools that encompass the preparation of the data, the performance of a large number of processing tasks with the data, the generation of datasets and finally the implementation of the necessary evaluation techniques. Of particular importance within the above tasks is the prototyping or summarisation of sets of time series as they have direct application in the resolution of clustering problems. In this work, we introduce a Python package that implements an evolutionary strategy to find prototypes. Given a set of time series, the implemented software finds prototypes using dynamic time warping (DTW) as the distance measure between series and does not restrict the search space for the prototype to the series of the input set. The software also includes use cases for clustering and classification.

      Keywords

      Code metadata
      Tabled 1
      Current code version0.3.1
      Permanent link to code/repository used for this code versionhttps://github.com/SoftwareImpacts/SIMPAC-2022-270
      Code Ocean compute capsulehttps://codeocean.com/capsule/7671671/tree/v1
      Legal Code LicenseMIT
      Code versioning system usedgit
      Software code languages, tools, and services usedPython and C
      Compilation requirements, operating environments & dependenciesFile requirements.txt within the source code
      If available Link to developer documentation/manualJupyter notebooks within the source code
      Support email for questions[email protected]

      1. Motivation and significance

      Generally speaking, data prototyping has been widely used in many areas to discover patterns that will allow identifying the most significant information in large and complex datasets. In light of the availability of large datasets in recent years, the change has been radical, because information from real applications that did not used to be storable is now storable. Moreover, it is available over time without a compelling need to delete it in order to store a more recent one. The relation between such large amounts of data and time series is that many applications store this information in the form of time series. Typical examples include the collection of athletic data, economic data, biometric data [
      • Nnamoko N.
      • Korkontzelos I.
      • Barrowclough J.
      • Liptrott M.
      CyberSignature: A user authentication tool based on behavioural biometrics.
      ], even the images that make up a video. All this large amount of data in such different fields and their availability in public datasets has allowed a large number of scientific contributions linked to time series. Within the subfields of research related to time series we can identify, among others, dimensionality reduction, representation techniques, use of distances with different characteristics, different clustering algorithms and the definition of prototypes, which is the focus of the software presented here.
      With regard to the availability of tools for working with time series, in [
      • Ahmadzadeh A.
      • Sinha K.
      • Aydin B.
      • Angryk R.A.
      MVTS-Data Toolkit: A Python package for preprocessing multivariate time series data.
      ] a domain-independent Python tool is presented with the necessary preprocessing routines for further work with multivariate time series. In [
      • Barandas M.
      • Folgado D.
      • Fernandes L.
      • Santos S.
      • Abreu M.
      • Bota P.
      • Liu H.
      • Schultz T.
      • Gamboa H.
      TSFEL: Time Series Feature Extraction Library.
      ] an implementation, also in Python, called the Time Series Feature Extraction Library (TSFEL), is presented. It allows the extraction of features, one of the preliminary steps necessary to later use machine learning pipelines. A Python implementation is presented in [
      • van Kuppevelt D.
      • Meijer C.
      • Huber F.
      • van der Ploeg A.
      • Georgievska S.
      • van Hees V.T.
      Mcfly: Automated deep learning on time series.
      ] where a time series classification is performed with deep learning tools and whose main feature is that the selection of hyper-parameters is completed automatically. On another note, to quantify the information flow between two time series using two entropy measures, the RTransferEntropy package was created [
      • Behrendt S.
      • Dimpfl T.
      • Peter F.J.
      • Zimmermann D.J.
      RTransferEntropy — Quantifying information flow between different time series using effective transfer entropy.
      ]. Additional relevant software is Time Series Lab [
      • Lit R.
      • Koopman S.
      • Harvey A.
      Time Series Lab.
      ], which allows the modelling and forecasting of time series using advanced techniques.
      On the other hand, there are tools that are no longer so general but are instead specific to more concrete fields of application, such as [
      • Li M.
      • Hinnov L.
      • Kump L.
      Acycle: Time-series analysis software for paleoclimate research and education.
      ] which allows the recognition and the interpretation of paleoclimate signals in sedimentary proxy datasets. In [
      • Jesson M.A.
      • Bridgeman J.
      • Sterling M.
      Novel software developments for the automated post-processing of high volumes of velocity time-series.
      ], the authors specifically work with velocity time series and describe a set of innovative software developments for automating the parsing, filtering (despiking) and calculation of mean flow and turbulence parameters. In [
      • Colditz R.R.
      • Conrad C.
      • Wehrmann T.
      • Schmidt M.
      • Dech S.
      TiSeG: A flexible software tool for time-series generation of MODIS data utilizing the quality assessment science data set.
      ] there is presented a time-series generator (TiSeG), which analyses the pixel-level QA science datasets of all MODIS gridded ground products (MODLand) suitable for time-series generation.
      The proposed software implements a library of functions to obtain the centroid, which is something that can serve as the prototype of a set of time series using an elastic distance, specifically DTW [
      • Zhao Y.
      • Lin L.
      • Lu W.
      • Meng Y.
      Landsat time series clustering under modified dynamic time warping.
      ]. To achieve this, a series of evolutionary algorithms are programmed, studying different strategies and ways of approaching the problem. A key point is the representation of the series by means of segments and that is why it is necessary to use an elastic distance such as the DTW. Furthermore, the concept of alignment used in these distances is the main element in the operators of the genetic algorithm implemented, such as evaluation and crossover. The library also contains an implementation of the Nearest Centroid (NC) algorithm [
      • Thulasidas M.
      Nearest centroid: A bridge between statistics and machine learning.
      ], a variant of the K-Nearest Neighbours (KNN) algorithm [
      • Yin Y.
      • Shang P.
      Forecasting traffic time series with multivariate predicting method.
      ], using DTW as the distance measure. The incorporation of Jupyter notebooks [
      • Fangohr H.
      • Kluyver T.
      • DiPierro M.
      Jupyter in computational science.
      ] in the repository allows its use by any user regardless of their level of knowledge in this field.

      2. Software description

      2.1 Software functionalities

      In this section the major functionalities of the software are introduced.

      2.1.1 Implementation of a genetic algorithm to find a prototype in a set of series

      Genetic algorithms (GA) are a subset of the set of evolutionary algorithms [
      • Cai X.
      • Zhang N.
      • Venayagamoorthy G.K.
      • Wunsch D.C.
      Time series prediction with recurrent neural networks trained by a hybrid PSO-EA algorithm.
      ]. In this work, a genetic algorithm is implemented to obtain a representative of a set of time series using DTW as the distance measure. To this end, a set of operators has been implemented to govern the operation of every genetic approach: mutation, crossover, selection and evaluation of the fitness function.

      2.1.2 Development of automatic learning algorithms for the use of prototyping functionalities

      With the aim of including this algorithm in tasks that go beyond the calculation of the centroid of a set of series, a series of learning algorithms have been implemented that are going to be used in the tasks of grouping and classification of time series. Two Jupyter notebooks are provided within the software as use example of this tasks. Their content is detailed in Section 3.
      Figure thumbnail gr2
      Fig. 2Architecture for prototyping and classification using the NC algorithm.

      2.1.3 Efficient implementation of the elastic distance

      The calculation of the distance between two time series is critical, and it is essential that it be as accurate as possible, as it serves as the basis for more complex processes such as clustering. Rigid distances, such as Manhattan, Euclidean, Minkowski or Chebyshev [
      • Rodrigues O.
      Combining Minkowski and Cheyshev: New distance proposal and survey of distance metrics using k-nearest neighbours classifier.
      ], are unable to capture deformations between two series that appear similar to a human. Because of this, other distances, called elastic distances, have arisen, which seek to measure distance more flexibly and naturally. DTW is a measure of the distance between two time series belonging to the set and FastDTW is the algorithm proposed in [
      • Salvador S.
      • Chan P.
      Toward accurate dynamic time warping in linear time and space.
      ]. The complexity of the last algorithm is O(n) and it uses the ideas of global and local conditions together with dimensionality reduction. This allows the obtention of a good estimate of the alignment obtained by the classical DTW algorithm.
      Figure thumbnail gr3
      Fig. 3Definition of the parameters of the genetic algorithm.
      Figure thumbnail gr4
      Fig. 4Loading the dataset and plotting series of Class 1.
      Figure thumbnail gr5
      Fig. 5Resulting prototype for the clustering example for ‘50 words’ dataset, Class 1.

      2.2 Software Architecture

      The library TS-Evolutionary_Prototyping comprises two main modules. The ga module, shown in Fig. 1, contains the genetic algorithm that allows computing the centroid of a set of series using dynamic time warping (DTW) as the measure of distance. The module NC, shown in Fig. 2, contains an implementation of the Nearest Centroid algorithm, which classifies series by determining the distance between the centroid of each class and the series to be classified.
      With respect to the ga module, now we detail each one of the submodules that implements the basic operations of the genetic algorithms. The submodule fitness.py computes the fitness function of an individual C, that is, where C is a series. To do this, it is necessary to compute the DTW distance by using the FastDTW algorithm. This distance is computed between C and each series of the set S; the submodule crossover, crosses two individuals. The submodule mutate modifies an individual following three types of mutation. The terminology of genetic time series refers to ‘individuals’: here, each individual is one of the time series.
      As auxiliary sub-modules, we have: dtw, which simply calls dtw.c where DTW and FastDTW are implemented using the programming language C due to the inefficiency of the Python implementation of these distances. The interpolation submodule obtains a series of length m from the series C by linear interpolation. The normalisation submodule normalises and denormalises sets of time series: for the standardisation, the mean and standard deviation are used. Finally, the submodule generate.py contains two functions: random_generate randomly generates an individual of length L, and sample_generate randomly selects a series from a set.

      3. Illustrative examples

      Two Jupyter notebooks are provided. The first one is an example of searching for a prototype in a set of data series (Section 3.1). The second one (Section 3.2) contains two classification examples: one between two classes and the other between five classes.

      3.1 Searching for a prototype of a set of data series

      In this example, the centroid of one of the classes contained in the ‘50 words’ dataset [
      • Kaya H.
      • Gunduz-Oguducu S.
      A distance based time series classification framework.
      ] is computed using our library. The dataset ‘50 words’ consists of word profiles for George Washington’s manuscripts.
      Fig. 3 shows the parameters of the genetic algorithm with 100 individuals, the number of generations equal to 30, and the crossover and mutation probabilities with values 0.1 and 0.05, respectively. Furthermore, in Fig. 4 a graphical representation of the whole set of time series belonging to a concrete class, class 1 of the dataset, can be observed. Additionally, in the same figure, the code to load the dataset, select the class number 1 and plot the series belonging to this class appears. Finally, Fig. 5 shows the resulting prototype for class 1 of the dataset once the clustering process has finished.

      3.2 Classification using nearest centroid

      In this example, the Nearest Centroid (NC) algorithm is used to classify series from a Epileptic Seizure Recognition Data Set [
      • Andrzejak R.G.
      • Lehnertz K.
      • Mormann F.
      • Rieke C.
      • David P.
      • Elger C.E.
      Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state.
      ]. The Euclidean and DTW distances are used. In the case of DTW, our library is the one used to compute the centroids for each class in the dataset. But the use of the Euclidean distance comes from the implementation of NC in the library sklearn [
      • Hao J.
      • Ho T.K.
      Machine learning made easy: A review of Scikit-learn package in Python programming language.
      ] and this allows comparing the accuracy of these two approaches. In this notebook two problems are raised. The first one is a binary classification between two classes: epilepsy vs non-epilepsy, while the second example is a case of multiclass classification. In the own dataset, it can be found concrete information about the nature of the classes.
      Fig. 6.(a) shows the prototypes for the binary classification, Centroid 0 corresponds to the epilepsy class whilst Centroid 1, represents the non-epilepsy class. Additionally, Fig. 6.(b) presents the information about the accuracy. Using our prototyping method, an accuracy of 0.947 is achieved, “Accuracy DTW NC” label. Subsequently, it is compared with the implementation of the NC algorithm provided by the sklearn library, which in this case does not use an elastic distance, but uses the Euclidean distance. The code to run it and the resulting accuracy are shown being the accuracy equal to 0.633, “Accuracy Euclidean NC” label.
      Figure thumbnail gr6
      Fig. 6(a) Prototypes obtained for the binary classification, epilepsy vs non-epilepsy, using the dataset the “Epileptic Seizure Recognition”. (b) Accuracy using DTW with our approach vs. Accuracy using Nearest Centroid implementation from sklearn with the Euclidean distance.
      Figure thumbnail gr7
      Fig. 7Possible extensions of the generic framework.

      4. Impact

      We consider the software to be significant for its contribution to the search for a prototype of a set of time. Time series summarisation attempts to condense a set of time series by means of a single representative, a common task in several clustering algorithms. The summary thus obtained can provide high-level information that facilitates the identification of frequent patterns among other applications. With respect to the contributions of the software from a scientific point of view, in [
      • Leon-Alcaide P.
      • Rodriguez-Benitez L.
      • Castillo-Herrera E.
      • Moreno-Garcia J.
      • Jimenez-Linares L.
      An evolutionary approach for efficient prototyping of large time series datasets.
      ] we presented the entire theoretical framework and the formal specification of the algorithms that now have been implemented in this library. It must be highlighted that the prototype series of the set is not a series belonging to the set itself. Furthermore, the representation of the genes in the evolutionary algorithms is based on the similarity of the shapes of the series, defining the segment concept and using an elastic distance.
      The effectiveness of this technique is mainly based on considering an approximation of the fitness function by extracting only a subset of the set of series. This subset is then compared to the prototype population and this allows the method to be scalable and to work with large sets of time series. This can be considered one of the greatest strengths of this software tool. Finally, the obtained results allow the explainability of the models, something that is becoming increasingly important in certain fields. On the contrary, the use of black box models, such as neural networks, is limited in terms of interpretability.

      5. Conclusions and Future Works

      A software capable of obtaining a centroid from a set of time series using the DTW distance measurement has been implemented. Moreover, the implemented algorithm is scalable to large sets of series, achieving accurate representations efficiently, and the library can be used with machine learning algorithms, both clustering and classification. For instance, for time series classification, the NC algorithm integrated with the genetic algorithm has been implemented.
      The proposed prototype can be expanded designing a more generic methodological framework built on top of the proposed prototype. For instance, in Fig. 7, the text boxes in green show what we have implemented in this work. On the contrary, the rest of the text boxes represent some of the extensions that could be incorporated in the future. Now, we give a more detailed idea of some of these extensions and another possibilities of expanding the work. For instance, new functionalities can be added to this library as new possibilities can be explored by genetic algorithms using a cooperative strategy. Furthermore, it would be interesting to implement variants of the DTW or some Kernel Distance Measures for time series recently proposed in the literature. Also, with the emergence of new time series representations such as SAX, motif, etc., we could extend the number of notebooks with examples integrating the library tslearn [
      • Tavenard R.
      • Faouzi J.
      • Vandewiele G.
      • Divo F.
      • Androz G.
      • Holtz C.
      • Payne M.
      • Yurchak R.
      • Rußwurm M.
      • Kolar K.
      • Woods E.
      Tslearn, a machine learning toolkit for time series data.
      ] were these are already implemented. Another possibility of future work is the establishment of relations between the prototype and a bag of words where the theory of fuzzy sets and their linguistic variables could be incorporated. Finally, by focusing on time series in a specific field of application, ad hoc optimisation techniques could be implemented.

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Acknowledgements

      This work has been funded by the Department of Information and System Technologies and the Research Vice-Rectory of the University of Castilla-La Mancha, Spain .

      References

        • Nnamoko N.
        • Korkontzelos I.
        • Barrowclough J.
        • Liptrott M.
        CyberSignature: A user authentication tool based on behavioural biometrics.
        Softw. Impacts. 2022; 100443https://doi.org/10.1016/j.simpa.2022.100443
        • Ahmadzadeh A.
        • Sinha K.
        • Aydin B.
        • Angryk R.A.
        MVTS-Data Toolkit: A Python package for preprocessing multivariate time series data.
        SoftwareX. 2020; 12100518https://doi.org/10.1016/J.SOFTX.2020.100518
        • Barandas M.
        • Folgado D.
        • Fernandes L.
        • Santos S.
        • Abreu M.
        • Bota P.
        • Liu H.
        • Schultz T.
        • Gamboa H.
        TSFEL: Time Series Feature Extraction Library.
        SoftwareX. 2020; 11100456https://doi.org/10.1016/J.SOFTX.2020.100456
        • van Kuppevelt D.
        • Meijer C.
        • Huber F.
        • van der Ploeg A.
        • Georgievska S.
        • van Hees V.T.
        Mcfly: Automated deep learning on time series.
        SoftwareX. 2020; 12100548https://doi.org/10.1016/J.SOFTX.2020.100548
        • Behrendt S.
        • Dimpfl T.
        • Peter F.J.
        • Zimmermann D.J.
        RTransferEntropy — Quantifying information flow between different time series using effective transfer entropy.
        SoftwareX. 2019; 10100265https://doi.org/10.1016/J.SOFTX.2019.100265
        • Lit R.
        • Koopman S.
        • Harvey A.
        Time Series Lab.
        2021 (URL https://timeserieslab.com)
        • Li M.
        • Hinnov L.
        • Kump L.
        Acycle: Time-series analysis software for paleoclimate research and education.
        Comput. Geosci. 2019; 127: 12-22https://doi.org/10.1016/J.CAGEO.2019.02.011
        • Jesson M.A.
        • Bridgeman J.
        • Sterling M.
        Novel software developments for the automated post-processing of high volumes of velocity time-series.
        Adv. Eng. Softw. 2015; 89: 36-42https://doi.org/10.1016/J.ADVENGSOFT.2015.06.007
        • Colditz R.R.
        • Conrad C.
        • Wehrmann T.
        • Schmidt M.
        • Dech S.
        TiSeG: A flexible software tool for time-series generation of MODIS data utilizing the quality assessment science data set.
        IEEE Trans. Geosci. Remote Sens. 2008; 46: 3296-3308https://doi.org/10.1109/TGRS.2008.921412
        • Zhao Y.
        • Lin L.
        • Lu W.
        • Meng Y.
        Landsat time series clustering under modified dynamic time warping.
        in: 2016 4th International Workshop on Earth Observation and Remote Sensing Applications, EORSA. 2016: 62-66https://doi.org/10.1109/EORSA.2016.7552767
        • Thulasidas M.
        Nearest centroid: A bridge between statistics and machine learning.
        in: 2020 IEEE International Conference on Teaching, Assessment, and Learning for Engineering, TALE. 2020: 9-16https://doi.org/10.1109/TALE48869.2020.9368396
        • Yin Y.
        • Shang P.
        Forecasting traffic time series with multivariate predicting method.
        Appl. Math. Comput. 2016; 291: 266-278https://doi.org/10.1016/J.AMC.2016.07.017
        • Fangohr H.
        • Kluyver T.
        • DiPierro M.
        Jupyter in computational science.
        Comput. Sci. Eng. 2021; 23: 5-6https://doi.org/10.1109/MCSE.2021.3059494
        • Cai X.
        • Zhang N.
        • Venayagamoorthy G.K.
        • Wunsch D.C.
        Time series prediction with recurrent neural networks trained by a hybrid PSO-EA algorithm.
        Neurocomputing. 2007; 70https://doi.org/10.1016/j.neucom.2005.12.138
        • Rodrigues O.
        Combining Minkowski and Cheyshev: New distance proposal and survey of distance metrics using k-nearest neighbours classifier.
        Pattern Recognit. Lett. 2018; 110https://doi.org/10.1016/j.patrec.2018.03.021
        • Salvador S.
        • Chan P.
        Toward accurate dynamic time warping in linear time and space.
        Intell. Data Anal. 2007; 11https://doi.org/10.3233/ida-2007-11508
        • Kaya H.
        • Gunduz-Oguducu S.
        A distance based time series classification framework.
        Inf. Syst. 2015; 51: 27-42https://doi.org/10.1016/j.is.2015.02.005
        • Andrzejak R.G.
        • Lehnertz K.
        • Mormann F.
        • Rieke C.
        • David P.
        • Elger C.E.
        Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state.
        Phys. Rev. E. 2001; 64https://doi.org/10.1103/PhysRevE.64.061907
        • Hao J.
        • Ho T.K.
        Machine learning made easy: A review of Scikit-learn package in Python programming language.
        J. Educ. Behav. Statist. 2019; 44https://doi.org/10.3102/1076998619832248
        • Leon-Alcaide P.
        • Rodriguez-Benitez L.
        • Castillo-Herrera E.
        • Moreno-Garcia J.
        • Jimenez-Linares L.
        An evolutionary approach for efficient prototyping of large time series datasets.
        Inform. Sci. 2020; 511: 74-93https://doi.org/10.1016/J.INS.2019.09.044
        • Tavenard R.
        • Faouzi J.
        • Vandewiele G.
        • Divo F.
        • Androz G.
        • Holtz C.
        • Payne M.
        • Yurchak R.
        • Rußwurm M.
        • Kolar K.
        • Woods E.
        Tslearn, a machine learning toolkit for time series data.
        J. Mach. Learn. Res. 2020; 21 (URL http://jmlr.org/papers/v21/20-091.html): 1-6