If you don't remember your password, you can reset it by entering your email address and clicking the Reset Password button. You will then receive an email that contains a secure link for resetting your password
If the address matches a valid account an email will be sent to __email__ with instructions for resetting your password
A software for time series summarisation and prototyping in large datasets.
An evolutionary strategy mainly based in genetic algorithms to find the prototype.
Use of elastic distances as distance measures between series.
A Python library that obtains the centroid as prototype in large sets of time series.
Time series analysis has become one of the basic building blocks for the technological fields of science and engineering. Therefore, there are a large number of software tools that encompass the preparation of the data, the performance of a large number of processing tasks with the data, the generation of datasets and finally the implementation of the necessary evaluation techniques. Of particular importance within the above tasks is the prototyping or summarisation of sets of time series as they have direct application in the resolution of clustering problems. In this work, we introduce a Python package that implements an evolutionary strategy to find prototypes. Given a set of time series, the implemented software finds prototypes using dynamic time warping (DTW) as the distance measure between series and does not restrict the search space for the prototype to the series of the input set. The software also includes use cases for clustering and classification.
Generally speaking, data prototyping has been widely used in many areas to discover patterns that will allow identifying the most significant information in large and complex datasets. In light of the availability of large datasets in recent years, the change has been radical, because information from real applications that did not used to be storable is now storable. Moreover, it is available over time without a compelling need to delete it in order to store a more recent one. The relation between such large amounts of data and time series is that many applications store this information in the form of time series. Typical examples include the collection of athletic data, economic data, biometric data [
], even the images that make up a video. All this large amount of data in such different fields and their availability in public datasets has allowed a large number of scientific contributions linked to time series. Within the subfields of research related to time series we can identify, among others, dimensionality reduction, representation techniques, use of distances with different characteristics, different clustering algorithms and the definition of prototypes, which is the focus of the software presented here.
With regard to the availability of tools for working with time series, in [
] an implementation, also in Python, called the Time Series Feature Extraction Library (TSFEL), is presented. It allows the extraction of features, one of the preliminary steps necessary to later use machine learning pipelines. A Python implementation is presented in [
] where a time series classification is performed with deep learning tools and whose main feature is that the selection of hyper-parameters is completed automatically. On another note, to quantify the information flow between two time series using two entropy measures, the RTransferEntropy package was created [
], the authors specifically work with velocity time series and describe a set of innovative software developments for automating the parsing, filtering (despiking) and calculation of mean flow and turbulence parameters. In [
]. To achieve this, a series of evolutionary algorithms are programmed, studying different strategies and ways of approaching the problem. A key point is the representation of the series by means of segments and that is why it is necessary to use an elastic distance such as the DTW. Furthermore, the concept of alignment used in these distances is the main element in the operators of the genetic algorithm implemented, such as evaluation and crossover. The library also contains an implementation of the Nearest Centroid (NC) algorithm [
]. In this work, a genetic algorithm is implemented to obtain a representative of a set of time series using DTW as the distance measure. To this end, a set of operators has been implemented to govern the operation of every genetic approach: mutation, crossover, selection and evaluation of the fitness function.
2.1.2 Development of automatic learning algorithms for the use of prototyping functionalities
With the aim of including this algorithm in tasks that go beyond the calculation of the centroid of a set of series, a series of learning algorithms have been implemented that are going to be used in the tasks of grouping and classification of time series. Two Jupyter notebooks are provided within the software as use example of this tasks. Their content is detailed in Section 3.
2.1.3 Efficient implementation of the elastic distance
The calculation of the distance between two time series is critical, and it is essential that it be as accurate as possible, as it serves as the basis for more complex processes such as clustering. Rigid distances, such as Manhattan, Euclidean, Minkowski or Chebyshev [
], are unable to capture deformations between two series that appear similar to a human. Because of this, other distances, called elastic distances, have arisen, which seek to measure distance more flexibly and naturally. DTW is a measure of the distance between two time series belonging to the set and FastDTW is the algorithm proposed in [
]. The complexity of the last algorithm is and it uses the ideas of global and local conditions together with dimensionality reduction. This allows the obtention of a good estimate of the alignment obtained by the classical DTW algorithm.
2.2 Software Architecture
The library TS-Evolutionary_Prototyping comprises two main modules. The ga module, shown in Fig. 1, contains the genetic algorithm that allows computing the centroid of a set of series using dynamic time warping (DTW) as the measure of distance. The module NC, shown in Fig. 2, contains an implementation of the Nearest Centroid algorithm, which classifies series by determining the distance between the centroid of each class and the series to be classified.
With respect to the ga module, now we detail each one of the submodules that implements the basic operations of the genetic algorithms. The submodule fitness.py computes the fitness function of an individual C, that is, where C is a series. To do this, it is necessary to compute the DTW distance by using the FastDTW algorithm. This distance is computed between C and each series of the set S; the submodule crossover, crosses two individuals. The submodule mutate modifies an individual following three types of mutation. The terminology of genetic time series refers to ‘individuals’: here, each individual is one of the time series.
As auxiliary sub-modules, we have: dtw, which simply calls dtw.c where DTW and FastDTW are implemented using the programming language C due to the inefficiency of the Python implementation of these distances. The interpolation submodule obtains a series of length from the series by linear interpolation. The normalisation submodule normalises and denormalises sets of time series: for the standardisation, the mean and standard deviation are used. Finally, the submodule generate.py contains two functions: random_generate randomly generates an individual of length , and sample_generate randomly selects a series from a set.
3. Illustrative examples
Two Jupyter notebooks are provided. The first one is an example of searching for a prototype in a set of data series (Section 3.1). The second one (Section 3.2) contains two classification examples: one between two classes and the other between five classes.
3.1 Searching for a prototype of a set of data series
In this example, the centroid of one of the classes contained in the ‘50 words’ dataset [
] is computed using our library. The dataset ‘50 words’ consists of word profiles for George Washington’s manuscripts.
Fig. 3 shows the parameters of the genetic algorithm with 100 individuals, the number of generations equal to 30, and the crossover and mutation probabilities with values 0.1 and 0.05, respectively. Furthermore, in Fig. 4 a graphical representation of the whole set of time series belonging to a concrete class, class 1 of the dataset, can be observed. Additionally, in the same figure, the code to load the dataset, select the class number 1 and plot the series belonging to this class appears. Finally, Fig. 5 shows the resulting prototype for class 1 of the dataset once the clustering process has finished.
3.2 Classification using nearest centroid
In this example, the Nearest Centroid (NC) algorithm is used to classify series from a Epileptic Seizure Recognition Data Set [
]. The Euclidean and DTW distances are used. In the case of DTW, our library is the one used to compute the centroids for each class in the dataset. But the use of the Euclidean distance comes from the implementation of NC in the library sklearn [
] and this allows comparing the accuracy of these two approaches. In this notebook two problems are raised. The first one is a binary classification between two classes: epilepsy vs non-epilepsy, while the second example is a case of multiclass classification. In the own dataset, it can be found concrete information about the nature of the classes.
Fig. 6.(a) shows the prototypes for the binary classification, Centroid 0 corresponds to the epilepsy class whilst Centroid 1, represents the non-epilepsy class. Additionally, Fig. 6.(b) presents the information about the accuracy. Using our prototyping method, an accuracy of 0.947 is achieved, “Accuracy DTW NC” label. Subsequently, it is compared with the implementation of the NC algorithm provided by the sklearn library, which in this case does not use an elastic distance, but uses the Euclidean distance. The code to run it and the resulting accuracy are shown being the accuracy equal to 0.633, “Accuracy Euclidean NC” label.
We consider the software to be significant for its contribution to the search for a prototype of a set of time. Time series summarisation attempts to condense a set of time series by means of a single representative, a common task in several clustering algorithms. The summary thus obtained can provide high-level information that facilitates the identification of frequent patterns among other applications. With respect to the contributions of the software from a scientific point of view, in [
] we presented the entire theoretical framework and the formal specification of the algorithms that now have been implemented in this library. It must be highlighted that the prototype series of the set is not a series belonging to the set itself. Furthermore, the representation of the genes in the evolutionary algorithms is based on the similarity of the shapes of the series, defining the segment concept and using an elastic distance.
The effectiveness of this technique is mainly based on considering an approximation of the fitness function by extracting only a subset of the set of series. This subset is then compared to the prototype population and this allows the method to be scalable and to work with large sets of time series. This can be considered one of the greatest strengths of this software tool. Finally, the obtained results allow the explainability of the models, something that is becoming increasingly important in certain fields. On the contrary, the use of black box models, such as neural networks, is limited in terms of interpretability.
5. Conclusions and Future Works
A software capable of obtaining a centroid from a set of time series using the DTW distance measurement has been implemented. Moreover, the implemented algorithm is scalable to large sets of series, achieving accurate representations efficiently, and the library can be used with machine learning algorithms, both clustering and classification. For instance, for time series classification, the NC algorithm integrated with the genetic algorithm has been implemented.
The proposed prototype can be expanded designing a more generic methodological framework built on top of the proposed prototype. For instance, in Fig. 7, the text boxes in green show what we have implemented in this work. On the contrary, the rest of the text boxes represent some of the extensions that could be incorporated in the future. Now, we give a more detailed idea of some of these extensions and another possibilities of expanding the work. For instance, new functionalities can be added to this library as new possibilities can be explored by genetic algorithms using a cooperative strategy. Furthermore, it would be interesting to implement variants of the DTW or some Kernel Distance Measures for time series recently proposed in the literature. Also, with the emergence of new time series representations such as SAX, motif, etc., we could extend the number of notebooks with examples integrating the library tslearn [
] were these are already implemented. Another possibility of future work is the establishment of relations between the prototype and a bag of words where the theory of fuzzy sets and their linguistic variables could be incorporated. Finally, by focusing on time series in a specific field of application, ad hoc optimisation techniques could be implemented.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
This work has been funded by the Department of Information and System Technologies and the Research Vice-Rectory of the University of Castilla-La Mancha, Spain
CyberSignature: A user authentication tool based on behavioural biometrics.