3 Answers
Contributor with 1,875 experience points · more than 5 upvotes
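As mentioned earlier, I think a direct(ish) way to get the desired result is to run ordinary K-means clustering and then adjust its output as needed.
Explanation: the idea is to take the K-means labels and walk through them in order, tracking the previous item's cluster label and the current one, and starting a new cluster only when the conditions call for it. The details are in the code comments.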
import numpy as np
from sklearn.cluster import KMeans
lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 2]: OK output
lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 1 1 0 0]. Desired output: [0 0 1 1 1 1 1 1 2 2]
def linear_order_clustering(km_labels, outlier_tolerance = 1):
    '''Expects clustering outputs as an array/list'''
    prev_label = km_labels[0] #keeps track of last seen item's real cluster
    cluster = 0 #like a counter for our new linear clustering outputs
    result = [cluster] #initialize first entry
    for i, label in enumerate(km_labels[1:]):
        if prev_label == label:
            #just written for clarity of control flow,
            #do nothing special here
            pass
        else: #current cluster label did not match previous label
            #check if previous cluster label reappears
            #on the right of current cluster label position
            #(aka current non-matching cluster is sandwiched
            #within a reasonable tolerance)
            if (outlier_tolerance and
                    prev_label in km_labels[i + 1: i + 2 + outlier_tolerance]):
                label = prev_label #if so, overwrite current label
            else:
                cluster += 1 #it's genuinely a new cluster
        result.append(cluster)
        prev_label = label
    return result
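Note that I have only tested this with a tolerance of 1 outlier, and I can't guarantee it works as-is in every case. It should be enough to get you started, though.
Output: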
print(km.labels_)
result = linear_order_clustering(km.labels_)
print(result)
[1 1 0 0 0 2 0 0 1 1]
[0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
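Since only a tolerance of 1 was tested above, one rough way to probe other values is to skip K-means and feed the function hand-written label lists. The sketch below repeats the function in compact form so that it runs on its own; the `labels` and `double_outlier` lists are made-up inputs for illustration, not real K-means output.

```python
def linear_order_clustering(km_labels, outlier_tolerance=1):
    """Compact restatement of the function above (same logic)."""
    prev_label = km_labels[0]
    cluster = 0
    result = [cluster]
    for i, label in enumerate(km_labels[1:]):
        if prev_label != label:
            # Does the previous label reappear within the tolerance window?
            window = km_labels[i + 1: i + 2 + outlier_tolerance]
            if outlier_tolerance and prev_label in window:
                label = prev_label  # treat the current item as an outlier
            else:
                cluster += 1  # a genuinely new cluster starts here
        result.append(cluster)
        prev_label = label
    return result


# Hypothetical label sequences, written by hand for the test.
labels = [1, 1, 0, 0, 0, 2, 0, 0, 1, 1]       # one stray label (the 2)
double_outlier = [0, 0, 2, 2, 0, 0, 1, 1]     # two stray labels in a row

print(linear_order_clustering(labels, 0))          # [0, 0, 1, 1, 1, 2, 3, 3, 4, 4]
print(linear_order_clustering(labels, 1))          # [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
print(linear_order_clustering(double_outlier, 1))  # [0, 0, 1, 1, 2, 2, 3, 3]
print(linear_order_clustering(double_outlier, 2))  # [0, 0, 0, 0, 0, 0, 1, 1]
```

With a tolerance of 0 every label change opens a new cluster, and a run of two stray labels is only absorbed once the tolerance is raised to 2. A larger tolerance also risks swallowing a short but genuine cluster, so it is worth checking against your own data.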
Contributor with 1,821 experience points · more than 6 upvotes
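I would solve this in several passes. First, I'd write one function/method that analyzes the data to determine the center of each cluster and returns an array of those centers. Then I'd pass those centers, together with the list, to a second function/method that builds the list of cluster IDs for each number in the list. Finally, I'd return the sorted list.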
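A minimal sketch of the first two passes, under some assumptions: the answer does not say how the centers are found, so this uses a small pure-Python 1-D k-means (Lloyd's) loop with a deterministic start. The names `find_centers` and `assign_ids` and the `iterations` parameter are hypothetical. The last step ("return the sorted list") is ambiguous, so the sketch stops at the ID list.

```python
def find_centers(values, k, iterations=20):
    """Pass 1: estimate k cluster centers with a tiny 1-D k-means loop."""
    ordered = sorted(values)
    # Deterministic start (an assumption): k values spread through the sorted data.
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


def assign_ids(values, centers):
    """Pass 2: label each value with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values]


lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
centers = find_centers(lst, 3)
print(assign_ids(lst, centers))  # [1, 1, 2, 2, 2, 0, 0]
```

Because the centers are returned in ascending order, the IDs follow value order rather than position in the list. Nearest-center assignment on its own does not keep clusters contiguous in list order, so a post-processing step like the one in the first answer would still be needed for that.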