Rdimtools: An R package for Dimension Reduction and Intrinsic Dimension Estimation

Discovering patterns in complex high-dimensional data is a long-standing problem. Dimension Reduction (DR) and Intrinsic Dimension Estimation (IDE) are two fundamental thematic programs that facilitate geometric understanding of the data. We present Rdimtools, an R package that supports 133 DR and 17 IDE algorithms, whose breadth makes multifaceted scrutiny of the data in one place easier. Rdimtools is distributed under the MIT license and is accessible from CRAN, GitHub, and its package website, all of which provide installation instructions, self-contained examples, and API documentation.


Introduction
Scientists and practitioners today often face high-dimensional data. The primary objective of data analysis is to find patterns and gain an understanding of what is observed. When the data dimension exceeds the scope of human perception, we must rely on algorithms that can extract information into fathomable forms. Dimension reduction (DR) is one approach to discovering structure in high-dimensional data that has long been, and still is, a major research program with a vast literature (Engel et al., 2012; Ma and Zhu, 2013). DR methods explore low-dimensional structure embedded in high-dimensional space, which makes them appealing procedures for data visualization as well as a preliminary step for statistical analysis (Jolliffe, 1982; McKeown et al., 2003). Another core instrument in high-dimensional data analysis is intrinsic dimension estimation (IDE). As its name suggests, IDE tries to estimate the true dimensionality of a low-dimensional structure from which the observed data is generated (Camastra and Staiano, 2016).
We present an R package Rdimtools (version 0.1.2) that implements 133 DR and 17 IDE algorithms at an unprecedented scale. Each algorithm is designed to reveal certain characteristics, so relying on a single algorithm limits our understanding of the data to what that algorithm captures. We believe a comprehensive toolbox like Rdimtools helps users grasp the nature of complex data by bringing together the fragmented knowledge that multiple algorithms provide separately.

Related Work
Many libraries have been proposed that provide a number of algorithms in a unified framework of their own, including drtoolbox (van der Maaten et al., 2009) in MATLAB, scikit-learn (Pedregosa et al., 2011) in Python, and the C++ template library tapkee (Lisitsyn et al., 2013), all with a known basis of popularity. In R, packages dimRed (Kraemer et al., 2018), dyndimred (Cannoodt and Saelens, 2020), intrinsicDimension (Johnsson, 2016), and others are comparable, although their scopes differ from that of Rdimtools, as summarized in Table 1.

Dependencies and Development
Rdimtools implements most of its capabilities internally via a balanced mixture of R and C++. Below is the list of R packages upon which Rdimtools depends.
• CVXR (Fu et al., 2018) solves a semidefinite program and a sparse regression problem with complex constraint in 3 DR functions.
• RSpectra (Qiu and Mei, 2019) simplifies large-scale spectral decomposition when we only need the k largest or smallest eigenpairs.
• RcppDE (Eddelbuettel, 2018) performs black-box optimization via differential evolution in 2 functions to find an optimal set of parameters.
• Rcpp (Eddelbuettel and François, 2011; Eddelbuettel, 2013) enables convenient integration of C++ codes with the binding of Armadillo C++ linear algebra library (Sanderson and Curtin, 2016) via RcppArmadillo (Eddelbuettel and Sanderson, 2014).
Computational gain from C++ is shown in Figure 1.
Rdimtools follows modern conventions of open source software development. The project is hosted on GitHub 1 for collaborative development, and each commit to the repository triggers a check to secure completeness via a continuous integration service 2.
Rdimtools is also available from the Comprehensive R Archive Network (CRAN) 3 for easy installation and use. Distribution through CRAN requires every function to include working examples that pass its checks, which serves as integration testing whenever the package is updated. CRAN also requires compatibility with major operating systems. 4
5. https://kyoustat.com/Rdimtools
We describe a common structure of DR functions as shown in Figure 2. Given the multivariate data matrix $X \in \mathbb{R}^{n \times p}$ whose rows are p-dimensional observations, preprocessing is applied first if applicable. The aux.preprocess() routine provides 5 different operations: 'center', 'scale', 'cscale', 'decorrelate', and 'whiten'. The transformation is saved in an R list for future use. An algorithm is then applied to the transformed data and returns projected coordinates $Y \in \mathbb{R}^{n \times d}$ for a predefined target dimension $d < p$. If an algorithm is of linear type, a projection matrix in $\mathbb{R}^{p \times d}$ is also returned. When an algorithm is one of the 16 linear methods that employ feature selection, it also returns an index vector of the d variables that are selected.
Figure 2: Common structure of DR functions. Given the data $X \in \mathbb{R}^{n \times p}$ and parameters including the target dimension d, a DR function returns an embedding $Y \in \mathbb{R}^{n \times d}$, preprocessing information, and a projection matrix if the algorithm is of linear type.
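To make the common structure concrete, the following C++ sketch walks through the same steps for a deliberately simple linear feature-selection method that keeps the d variables with the largest variance. It is an illustration under stated assumptions and not the Rdimtools implementation: the names DRResult, select_by_variance, center, and featidx are hypothetical, and only the 'center' preprocessing operation is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Row-major matrix: one observation per row.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical result type mirroring the outputs described in the text.
struct DRResult {
    Matrix Y;                          // n x d embedding
    Matrix projection;                 // p x d projection matrix (linear methods)
    std::vector<double> center;        // preprocessing info: subtracted column means
    std::vector<std::size_t> featidx;  // indices of the d selected variables
};

// Illustrative method (an assumption, not a specific Rdimtools algorithm):
// keep the d columns with the largest sample variance.
DRResult select_by_variance(const Matrix& X, std::size_t d) {
    const std::size_t n = X.size();
    assert(n > 1);
    const std::size_t p = X[0].size();
    assert(d >= 1 && d < p);

    // Step 1: 'center' preprocessing; the means are kept for future use.
    std::vector<double> mean(p, 0.0);
    for (const auto& row : X)
        for (std::size_t j = 0; j < p; ++j) mean[j] += row[j] / n;
    Matrix Xc = X;
    for (auto& row : Xc)
        for (std::size_t j = 0; j < p; ++j) row[j] -= mean[j];

    // Step 2: score each variable by its sample variance.
    std::vector<double> var(p, 0.0);
    for (const auto& row : Xc)
        for (std::size_t j = 0; j < p; ++j) var[j] += row[j] * row[j] / (n - 1);

    // Step 3: indices of the d highest-variance variables.
    std::vector<std::size_t> idx(p);
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(),
                     [&var](std::size_t a, std::size_t b) { return var[a] > var[b]; });
    idx.resize(d);

    // Step 4: the projection matrix has a single 1 per column, so the
    // embedding equals the centered data times the projection matrix.
    Matrix P(p, std::vector<double>(d, 0.0));
    Matrix Y(n, std::vector<double>(d, 0.0));
    for (std::size_t k = 0; k < d; ++k) {
        P[idx[k]][k] = 1.0;
        for (std::size_t i = 0; i < n; ++i) Y[i][k] = Xc[i][idx[k]];
    }
    return {Y, P, mean, idx};
}
```

Returning the stored means alongside the projection matrix is what would allow new observations to be mapped into the same embedding: subtract the means, then multiply by the projection matrix.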

Functionalities of Rdimtools
All 17 IDE algorithms return an estimated dimension estdim, while methods that employ bottom-up estimation schemes also report a length-n vector of local estimates, one at each point.
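As an illustration of a bottom-up scheme, the C++ sketch below computes pointwise estimates with the maximum likelihood estimator of Levina and Bickel and averages them into a global estimate. We use this estimator only as a well-known example; whether and how Rdimtools implements it is not stated here. The names IDEResult, mle_dimension, and local are hypothetical, and the code assumes distinct points whose k nearest-neighbor distances are not all equal.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major matrix: one observation per row.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical result type: a global estimate plus local estimates.
struct IDEResult {
    double estdim;              // global estimate: mean of the local estimates
    std::vector<double> local;  // length-n vector of pointwise estimates
};

// Levina-Bickel style estimate with k nearest neighbors. For each point x,
//   m_k(x) = [ (1/(k-1)) * sum_{j=1}^{k-1} log(T_k(x) / T_j(x)) ]^{-1},
// where T_j(x) is the distance from x to its j-th nearest neighbor.
IDEResult mle_dimension(const Matrix& X, std::size_t k) {
    const std::size_t n = X.size();
    assert(k >= 2 && k < n);

    std::vector<double> local(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        // Euclidean distances from point i to every other point.
        std::vector<double> dist;
        dist.reserve(n - 1);
        for (std::size_t j = 0; j < n; ++j) {
            if (j == i) continue;
            double s = 0.0;
            for (std::size_t c = 0; c < X[i].size(); ++c) {
                const double diff = X[i][c] - X[j][c];
                s += diff * diff;
            }
            dist.push_back(std::sqrt(s));
        }
        // Move the k smallest distances to the front, in increasing order.
        std::partial_sort(dist.begin(), dist.begin() + k, dist.end());

        // Local estimate: inverse of the mean log-ratio of distances.
        double acc = 0.0;
        for (std::size_t j = 0; j + 1 < k; ++j)
            acc += std::log(dist[k - 1] / dist[j]);
        local[i] = (k - 1) / acc;
    }

    // Bottom-up aggregation: average the local estimates.
    double total = 0.0;
    for (double v : local) total += v;
    return {total / n, local};
}
```

The brute-force distance computation costs O(n^2 p) time; it is kept for clarity rather than efficiency.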
Other notable auxiliary functions include aux.gensamples() to generate samples from 10 popular data models, aux.graphnbd() to construct k- and ε-nearest-neighbor graphs that are used to approximate a data manifold embedded in $\mathbb{R}^p$, aux.kernelcov() to compute a positive definite kernel matrix $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ using 20 types of kernels (Hofmann et al., 2008), and aux.shortestpath() that implements the Floyd-Warshall algorithm (Floyd, 1962) to find shortest-path distances on a graph, which approximate the pairwise geodesic distances on a data manifold reconstructed by a nearest-neighbor graph.
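The interplay between a nearest-neighbor graph and shortest-path distances can be sketched in a few lines of C++. The code below is a minimal illustration, not the code behind aux.graphnbd() or aux.shortestpath(): the function names knn_graph and floyd_warshall are hypothetical, the graph is symmetrized by keeping an edge if either endpoint selects the other, and a missing edge is encoded as infinity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Row-major matrix: one observation (or one row of distances) per row.
using Matrix = std::vector<std::vector<double>>;

double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t c = 0; c < a.size(); ++c) s += (a[c] - b[c]) * (a[c] - b[c]);
    return std::sqrt(s);
}

// Symmetric k-nearest-neighbor graph as a weighted adjacency matrix.
// Entry (i, j) is the Euclidean distance if i and j are connected,
// 0 on the diagonal, and infinity where there is no edge.
Matrix knn_graph(const Matrix& X, std::size_t k) {
    const std::size_t n = X.size();
    assert(k >= 1 && k < n);
    const double inf = std::numeric_limits<double>::infinity();

    Matrix W(n, std::vector<double>(n, inf));
    for (std::size_t i = 0; i < n; ++i) {
        W[i][i] = 0.0;
        std::vector<double> d(n);
        for (std::size_t j = 0; j < n; ++j) d[j] = euclidean(X[i], X[j]);

        // Sort candidate neighbors by distance to point i.
        std::vector<std::size_t> order(n);
        std::iota(order.begin(), order.end(), 0);
        std::stable_sort(order.begin(), order.end(),
                         [&d](std::size_t a, std::size_t b) { return d[a] < d[b]; });

        // Connect i to its k nearest neighbors, skipping i itself.
        std::size_t added = 0;
        for (std::size_t idx : order) {
            if (idx == i) continue;
            W[i][idx] = W[idx][i] = d[idx];
            if (++added == k) break;
        }
    }
    return W;
}

// Floyd-Warshall: all-pairs shortest-path distances in O(n^3) time.
Matrix floyd_warshall(Matrix D) {
    const std::size_t n = D.size();
    for (std::size_t m = 0; m < n; ++m)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (D[i][m] + D[m][j] < D[i][j]) D[i][j] = D[i][m] + D[m][j];
    return D;
}
```

If the neighbor graph is disconnected, the corresponding shortest-path distances stay infinite, so in practice k (or ε) should be large enough to connect the graph.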

Conclusion
Rdimtools puts an unprecedented number of DR and IDE tools for high-dimensional data analysis in a single R package. We couple R with C++ for fast, flexible development and efficiency. The package is maintained and distributed via CRAN, GitHub, and a package website to secure easy access, continuous integration, transparency, and collaborative development. All venues contain examples and full API documentation.
We plan to further develop the Rdimtools package to incorporate more algorithms and out-of-memory support in response to the needs of big data analysis. Another direction of development in progress is to translate all subroutines written in R into pure C++ code. This has proven to be successful in reducing the communication costs of complicated algorithms. In the long run, we hope the latter effort opens up an opportunity for the Rdimtools project to evolve into a standalone C++ library for wider use.