Local Intrinsic Dimension Estimation

Authors

Hamidreza Kamkari

Gabriel Loaiza-Ganem

1 Introduction

Figure 1: An illustration showing that LID is a natural measure of relative complexity. We depict two manifolds of MNIST digits, corresponding to 1s and 8s, as 1d and 2d submanifolds of \mathbb{R}^3, respectively. The relatively simpler manifold of 1s exhibits a single factor of variation (“tilt”), whereas 8s have an additional factor of variation (“disproportionality”).

High-dimensional data in deep learning applications such as images often resides on low-dimensional submanifolds, which makes understanding the properties of the manifold learned by a generative model a relevant problem (Loaiza-Ganem et al. 2024). One of the most important properties of a manifold is its intrinsic dimensionality, which can loosely be defined as the number of factors of variation that describe the data. In reality, rather than a single manifold representing the data distribution, we have a collection of manifolds (Brown et al. 2023), or, more recently, the CW complex hypothesis (Wang and Wang 2024). Intuitively, this means that, for example, in a dataset of MNIST digits, the manifold of 1s and the manifold of 8s are different, and they might have different intrinsic dimensionalities. Therefore, instead of a (global) intrinsic dimensionality, we are interested in local intrinsic dimensionality (LID), which is a property of a point with respect to the manifold that contains it.

Various definitions of intrinsic dimension exist (Hurewicz and Wallman 1948; Falconer 2007; Lee 2012); we follow the standard one from geometry: a d-dimensional manifold is a set which is locally homeomorphic to \mathbb{R}^d. For a given disjoint union of manifolds and a point x in this union, the LID of x is the dimension of the manifold it belongs to. Note that LID is not an intrinsic property of the point x, but rather a property of x with respect to the manifold that contains it. Intuitively, \text{LID}(x) corresponds to the number of factors of variation present in the manifold containing x, and it is thus a natural measure of the relative complexity of x, as illustrated in Figure 1.
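To make the definition concrete, the following minimal sketch (with hypothetical toy manifolds rather than the MNIST example of Figure 1) constructs a disjoint union of a 1d curve and a 2d surface embedded in \mathbb{R}^3 and records the ground-truth LID of each point: points on the curve have LID 1, points on the surface have LID 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# A 1d manifold embedded in R^3: a curve with a single factor of variation.
t = rng.uniform(-1.0, 1.0, size=n)
curve = np.stack([t, np.sin(2.0 * t), np.zeros_like(t)], axis=1)

# A 2d manifold embedded in R^3: a surface with two factors of variation,
# shifted along the last axis so the two pieces are disjoint.
u = rng.uniform(-1.0, 1.0, size=n)
v = rng.uniform(-1.0, 1.0, size=n)
surface = np.stack([u, v, 0.5 * u * v + 2.0], axis=1)

# The dataset is the union of both manifolds; LID is a per-point property
# equal to the dimension of the manifold that point belongs to.
data = np.concatenate([curve, surface], axis=0)          # shape (2n, 3)
lid = np.concatenate([np.ones(n), 2.0 * np.ones(n)])     # ground-truth LID per point
```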

Computing the LID of a given point is a difficult task. Traditional non-parametric (or model-free) methods, such as those in the skdim library (Bac et al. 2021), are computationally intensive and do not scale to high-dimensional data. Consequently, there is growing interest in using deep generative models for LID estimation. This approach is valuable not only for understanding the data manifold but also for evaluating the generative model itself: discrepancies between the model-implied LID and the ground truth can highlight model deficiencies and help improve the quality of generative models. Here, we thoroughly explore LID estimation methods for deep generative models, with a particular focus on score-based diffusion models (Song et al. 2021), and explore their applications in trustworthy machine learning.
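To give a flavour of the model-free approach, the sketch below is a minimal local-PCA estimator (an illustrative simplification, not the exact algorithms implemented in skdim): it runs PCA on the k nearest neighbours of each point and counts how many directions are needed to explain most of the local variance. The neighbourhood size k and the 95% variance threshold are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_pca_lid(data, k=50, variance_threshold=0.95):
    """Pointwise LID estimates via PCA on each point's k-nearest-neighbour patch."""
    nn = NearestNeighbors(n_neighbors=k).fit(data)
    _, idx = nn.kneighbors(data)
    estimates = np.empty(len(data))
    for i, neighbors in enumerate(idx):
        patch = data[neighbors] - data[neighbors].mean(axis=0)
        # Singular values of the centred patch give the local principal variances.
        s = np.linalg.svd(patch, compute_uv=False)
        explained = np.cumsum(s**2) / np.sum(s**2)
        # Smallest number of directions explaining the chosen fraction of variance.
        estimates[i] = np.searchsorted(explained, variance_threshold) + 1
    return estimates

# Quick check: points from a 2d linear subspace of R^10 should get estimates near 2.
rng = np.random.default_rng(0)
plane = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 10))
print(local_pca_lid(plane).mean())
```

On the toy union of manifolds above, such an estimator should return values close to 1 on the curve and close to 2 on the surface; its reliance on explicit nearest-neighbour searches and per-point decompositions is exactly what makes this family of methods expensive in high dimensions.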

2 What is LID Used For?

LID estimates can be interpreted as a measure of complexity (Kamkari, Ross, Hosseinzadeh, et al. 2024) and are useful in many scenarios. They can be used to detect outliers (Houle, Schubert, and Zimek 2018; Anderberg et al. 2024; Kamkari, Ross, Cresswell, et al. 2024), AI-generated text (Tulchinskii et al. 2023), and adversarial examples (Ma et al. 2018). Connections between the generalization achieved by a neural network and the LID estimates of its internal representations have also been shown (Ansuini et al. 2019; Birdal et al. 2021; Magai and Ayzenberg 2022; Brown et al. 2022). These insights can be leveraged to identify which representations contain maximal semantic content (Valeriani et al. 2023), and help explain why LID estimates can be helpful as regularizers (Zhu et al. 2018) and for pruning large models (Xue et al. 2022). LID estimation is thus not only of mathematical and statistical interest, but can also benefit the empirical performance of deep learning models at numerous tasks.
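As a toy illustration of the outlier-detection use case (a deliberately simplified stand-in for the dimensionality-aware methods cited above, with an arbitrary threshold), one can flag points whose estimated LID deviates sharply from the typical LID of the dataset:

```python
import numpy as np

def flag_lid_outliers(lid_estimates, z_thresh=3.0):
    """Flag points whose LID estimate deviates strongly from the dataset's typical LID.

    Uses a robust z-score based on the median and the median absolute deviation (MAD);
    the 3.0 threshold is an illustrative choice, not one taken from the cited works.
    """
    lid_estimates = np.asarray(lid_estimates, dtype=float)
    median = np.median(lid_estimates)
    mad = np.median(np.abs(lid_estimates - median)) + 1e-12
    robust_z = np.abs(lid_estimates - median) / (1.4826 * mad)
    return robust_z > z_thresh
```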

References

Anderberg, Alastair, James Bailey, Ricardo JGB Campello, Michael E Houle, Henrique O Marques, Miloš Radovanović, and Arthur Zimek. 2024. “Dimensionality-Aware Outlier Detection.” In Proceedings of the 2024 SIAM International Conference on Data Mining, 652–60.
Ansuini, Alessio, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. 2019. “Intrinsic Dimension of Data Representations in Deep Neural Networks.” In Advances in Neural Information Processing Systems.
Bac, Jonathan, Evgeny M. Mirkes, Alexander N. Gorban, Ivan Tyukin, and Andrei Zinovyev. 2021. “Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation.” Entropy 23 (10): 1368.
Birdal, Tolga, Aaron Lou, Leonidas J Guibas, and Umut Simsekli. 2021. “Intrinsic Dimension, Persistent Homology and Generalization in Neural Networks.” In Advances in Neural Information Processing Systems.
Brown, Bradley CA, Anthony L Caterini, Brendan Leigh Ross, Jesse C Cresswell, and Gabriel Loaiza-Ganem. 2023. “Verifying the Union of Manifolds Hypothesis for Image Data.” In International Conference on Learning Representations.
Brown, Bradley CA, Jordan Juravsky, Anthony L Caterini, and Gabriel Loaiza-Ganem. 2022. “Relating Regularization and Generalization Through the Intrinsic Dimension of Activations.” arXiv:2211.13239.
Falconer, Kenneth. 2007. Fractal Geometry: Mathematical Foundations and Applications. John Wiley & Sons.
Houle, Michael E, Erich Schubert, and Arthur Zimek. 2018. “On the Correlation Between Local Intrinsic Dimensionality and Outlierness.” In Similarity Search and Applications: 11th International Conference, SISAP 2018, 177–91. Springer.
Hurewicz, Witold, and Henry Wallman. 1948. Dimension Theory (PMS-4). Princeton University Press.
Kamkari, Hamidreza, Brendan Leigh Ross, Jesse C Cresswell, Anthony L Caterini, Rahul G Krishnan, and Gabriel Loaiza-Ganem. 2024. “A Geometric Explanation of the Likelihood OOD Detection Paradox.” arXiv:2403.18910.
Kamkari, Hamidreza, Brendan Leigh Ross, Rasa Hosseinzadeh, Jesse C Cresswell, and Gabriel Loaiza-Ganem. 2024. “A Geometric View of Data Complexity: Efficient Local Intrinsic Dimension Estimation with Diffusion Models.” arXiv:2406.03537.
Lee, John M. 2012. Introduction to Smooth Manifolds. 2nd ed. Springer.
Loaiza-Ganem, Gabriel, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L Caterini, and Jesse C Cresswell. 2024. “Deep Generative Models Through the Lens of the Manifold Hypothesis: A Survey and New Connections.” arXiv:2404.02954.
Ma, Xingjun, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E Houle, and James Bailey. 2018. “Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality.” In International Conference on Learning Representations.
Magai, German, and Anton Ayzenberg. 2022. “Topology and Geometry of Data Manifold in Deep Learning.” arXiv:2204.08624.
Song, Yang, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. “Score-Based Generative Modeling Through Stochastic Differential Equations.” In International Conference on Learning Representations.
Tulchinskii, Eduard, Kristian Kuznetsov, Laida Kushnareva, Daniil Cherniavskii, Sergey Nikolenko, Evgeny Burnaev, Serguei Barannikov, and Irina Piontkovskaya. 2023. “Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts.” In Advances in Neural Information Processing Systems.
Valeriani, Lucrezia, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. 2023. “The Geometry of Hidden Representations of Large Transformer Models.” In Advances in Neural Information Processing Systems.
Wang, Yi, and Zhiren Wang. 2024. “CW Complex Hypothesis for Image Data.” In International Conference on Machine Learning.
Xue, Fanghui, Biao Yang, Yingyong Qi, and Jack Xin. 2022. “Searching Intrinsic Dimensions of Vision Transformers.” arXiv:2204.07722.
Zhu, Wei, Qiang Qiu, Jiaji Huang, Robert Calderbank, Guillermo Sapiro, and Ingrid Daubechies. 2018. “LDMNet: Low Dimensional Manifold Regularized Neural Networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2743–51.