Finding an average distance between two sets of points (vectors) in a n-demensional euclidean space can be slow ... harnessing
O(a * b * n) of time complexity, where
b denote the numbers of points in each set.
A normal way to do this is something like:
import numpy as np def dist(a, b): return np.linalg.norm(a - b) def dist_cluster_avg(points_A, points_B): from itertools import product s = sum(dist(a, b) for a, b in product(points_A, points_B)) return s / len(points_A) / len(points_B)
Even though, the
def dist(a, b) is arguable a very fast distance function, looping over each pair of points can be inexorably slow, to some extent because of the Python itself.
There are so many ways out there to improve by the means of optimizations, it is important to note that the time complexity stays the same, but it is much better implemented.
I found this, using
def dist_cluster_avg_fast(points_A, points_B): from scipy.spatial.distance import cdist arr = cdist(points_A, points_B, metric='euclidean') return np.sum(arr) / (len(points_A) * len(points_B))
How much the gain ? You might ask, here it is.
points_A = np.random.rand(500, 100) points_B = np.random.rand(500, 100) dist_cluster_avg(points_A, points_B) => 2.569478988647461 seconds dist_cluster_avg_fast(points_A, points_B) => 0.15470504760742188 seconds
Much faster huh?