Searching a large array by two columns

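I have a large array that looks something like this:

    import numpy as np

    np.random.seed(42)
    arr = np.random.permutation(np.array([
        (1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4),
        (8, 9, 3, 4, 7, 9, 1, 9, 3, 4, 50000)]).T)

It is not sorted, and its rows are unique. I also know the bounds of the values in the two columns: they are [0, n] and [0, k]. The maximum possible size of the array is therefore (n+1)*(k+1), but the actual size is closer to the logarithm of that.

I need to search the array by both columns to find the row such that arr[row, :] == (i, j), and return -1 when (i, j) is absent from the array. A naive implementation of such a function is:

    def get(arr, i, j):
        # Compare both columns, then take the first matching row index.
        cond = (arr[:, 0] == i) & (arr[:, 1] == j)
        if np.any(cond):
            return np.where(cond)[0][0]
        else:
            return -1

Unfortunately, since arr is very large in my case (> 90M rows), this is very inefficient, especially because I need to call get() many times.

Alternatively, I tried translating the array into a dict with (i, j) keys, so that index[(i, j)] = row, which can be accessed with:

    def get(index, i, j):
        try:
            return index[(i, j)]
        except KeyError:
            return -1

This works (and is much faster when tested on data smaller than mine), but creating the dict on the fly with

    index = {}
    for row in range(arr.shape[0]):
        i, j = arr[row, :]
        index[(i, j)] = row

takes a huge amount of time and uses a lot of RAM in my case. I also considered sorting arr first and then using something like np.searchsorted, but that did not get me anywhere.

So what I need is a fast function get(arr, i, j) that returns

    >>> get(arr, 2, 3)
    4
    >>> get(arr, 4, 100)
    -1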


3 answers

HUX布斯

Contributed 1876 experience points, received more than 6 likes

A partial solution is:

In [36]: arr
Out[36]:
array([[    2,     9],
       [    1,     8],
       [    4,     4],
       [    4, 50000],
       [    2,     3],
       [    1,     9],
       [    4,     3],
       [    2,     7],
       [    3,     9],
       [    2,     4],
       [    3,     1]])

In [37]: (i,j) = (2, 3)

# np.isin tests each element for membership in {i, j}, so a row such as
# (3, 2) or (2, 2) would match as well; this is why the solution is partial.
# `assume_unique=True` can speed up the calculation, but it is only strictly
# valid when both inputs contain unique elements.
In [38]: np.all(np.isin(arr, [i,j], assume_unique=True), axis=1, keepdims=True)
Out[38]:
array([[False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False]])

In [39]: mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1, keepdims=True)

In [40]: np.argwhere(mask)
Out[40]: array([[4, 0]])

If you need the final result as a scalar, omit the keepdims argument and convert the array to a scalar, for example:

# we can use `assume_unique=True` which can speed up the calculation
In [41]: mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1)

In [42]: np.argwhere(mask)
Out[42]: array([[4]])

# np.asscalar was removed in NumPy 1.23; .item() does the same conversion
In [43]: np.argwhere(mask).item()
Out[43]: 4
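The mask approach above still scans every row on each call. Since the question mentions sorting plus np.searchsorted and gives the column bounds, here is a minimal sketch of that idea. It is an assumption about how it could be done, not the method shown in this answer. The helper names build_index and get, and the choice to fold each row into a single key i*(k+1)+j, are hypothetical. It assumes non-negative integer values with the second column bounded by k, and that (n+1)*(k+1) fits in int64.

```python
import numpy as np

def build_index(arr, k):
    """Fold each row (i, j) into one integer key and sort the keys once.

    Assumes non-negative integers with arr[:, 1] <= k (bounds from the question).
    """
    keys = arr[:, 0].astype(np.int64) * (k + 1) + arr[:, 1].astype(np.int64)
    order = np.argsort(keys, kind="stable")  # order[p] = original row of the p-th smallest key
    return keys[order], order

def get(sorted_keys, order, k, i, j):
    """Return the original row index of (i, j), or -1 if the pair is absent."""
    if i < 0 or not 0 <= j <= k:
        return -1  # out-of-range values could otherwise collide with another key
    key = i * (k + 1) + j
    pos = np.searchsorted(sorted_keys, key)  # binary search, O(log N)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return int(order[pos])
    return -1

# The example array, written out as it is printed in Out[36] above.
arr = np.array([[2, 9], [1, 8], [4, 4], [4, 50000], [2, 3], [1, 9],
                [4, 3], [2, 7], [3, 9], [2, 4], [3, 1]])
k = 50000  # upper bound of the second column in this example

sorted_keys, order = build_index(arr, k)
print(get(sorted_keys, order, k, 2, 3))    # 4
print(get(sorted_keys, order, k, 4, 100))  # -1
```

The index is built once with a single sort. After that, each lookup is a binary search instead of a full scan. The extra memory is two int64 arrays with one entry per row.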

Answered 2021-04-09
