DIMACS :: Details

Data Analysis in High-dimensional Spaces

May 06, 2021, 5:00 PM - 6:00 PM

Location: Online Event

Adi Ben-Israel, Rutgers University

1. The unreliability of the Euclidean distance in high dimensions, which makes proximity queries meaningless and unstable because there is poor discrimination between the nearest and the farthest neighbor [3]; see also [4].

2. The uniform probability distribution on the n-dimensional unit sphere S_n, and some non-intuitive results for large n. For example, if x is any point of S_n, taken as the "north pole", then most of the area of S_n is concentrated near the corresponding "equator".

3. The advantage of the ℓ1 distance, which is less sensitive to high dimensionality and has been shown to "provide the best discrimination in high-dimensional data spaces" [1, p. 427].

4. Clustering high-dimensional data using the ℓ1 distance [2].
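
The first three topics can be explored numerically. The following is a minimal sketch, not taken from the talk or from the cited papers. It assumes points drawn uniformly from the unit cube for the distance experiment, and it treats the sphere as the unit sphere embedded in R^dim (the talk's indexing convention for S_n is not specified here). The function names relative_contrast and equator_fraction, and the parameters band, n_points and seed, are hypothetical choices made for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, p=2, seed=0):
    """Return (d_max - d_min) / d_min for the l_p distances from one random
    query point to n_points random points, all uniform in [0, 1]^dim.

    Assumption: uniform data in the unit cube; other data models may
    behave differently.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Minkowski l_p distance between the query and this point.
        total = sum(abs(a - b) ** p for a, b in zip(query, point))
        dists.append(total ** (1.0 / p))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


def equator_fraction(dim, band=0.1, n_points=2000, seed=0):
    """Estimate the probability that a uniform random point on the unit
    sphere in R^dim has |x_1| <= band, i.e. lies in a thin band around
    the equator relative to the north pole e_1.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        # A standard Gaussian vector, once normalized, is uniformly
        # distributed on the unit sphere.
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        if abs(g[0] / norm) <= band:
            inside += 1
    return inside / n_points


if __name__ == "__main__":
    print("dim  contrast(l2)  contrast(l1)  equator_fraction")
    for dim in (2, 20, 200):
        print(
            dim,
            round(relative_contrast(dim, p=2), 3),
            round(relative_contrast(dim, p=1), 3),
            round(equator_fraction(dim), 3),
        )
```

Running the script prints, for each dimension, the relative contrast between the farthest and the nearest neighbor under the Euclidean and ℓ1 distances, together with the estimated share of the sphere lying in a thin equatorial band. The estimates depend on the seed and the sample sizes, so they should be read as illustrations of the trend rather than as precise values.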

References

[1] C.C. Aggarwal et al., On the surprising behavior of distance metrics in high dimensional space, Lecture Notes in Computer Science, vol. 1973 (2001), Springer, https://doi.org/10.1007/3-540-44503-X_27

[2] T. Asamov and A. Ben-Israel, A probabilistic ℓ1 method for clustering high-dimensional data, Probability in the Engineering and Informational Sciences (2021), 1-16.

[3] K. Beyer et al., When is "nearest neighbor" meaningful?, Lecture Notes in Computer Science, vol. 1540 (1999), Springer, https://doi.org/10.1007/3-540-49257-7_15

[4] J.M. Hammersley, The distribution of distance in a hypersphere, The Annals of Mathematical Statistics, vol. 21 (1950), 447-452.

Presented Via Zoom: https://rutgers.zoom.us/j/94346444480

Password: 6564120420