Generating a numpy array with a given duplicate rate

MM们 2021-09-11 10:06:23
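Here is my problem: I have to generate some synthetic data (around 7 or 8 columns) whose columns are correlated with one another (measured with the Pearson coefficient). I can do that easily, but then I have to insert a certain percentage of duplicates into each column (yes, the Pearson coefficient will be lower), with a different percentage for each column. The problem is that I don't want to insert the duplicates by hand, because in my case that would be like cheating. Does anyone know how to generate correlated data that already contains duplicates? I have searched, but the questions I find are usually about removing or avoiding duplicates. Language: python3. To generate the correlated data I used this simple code: generate correlated data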
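
The code linked in the question is not reproduced here. For context only, a minimal sketch of one common way to generate columns with a chosen Pearson correlation: multiply independent normal samples by the Cholesky factor of a correlation matrix. The sizes, the seed, and the single target coefficient target_r are assumptions made for illustration; this is not necessarily what the asker used.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

n_rows, n_cols = 10_000, 7      # illustrative sizes, not from the question
target_r = 0.8                  # assumed common pairwise Pearson coefficient

# Equicorrelation matrix: 1 on the diagonal, target_r everywhere else.
corr = np.full((n_cols, n_cols), target_r)
np.fill_diagonal(corr, 1.0)

# Lower-triangular factor chol with chol @ chol.T == corr.
chol = np.linalg.cholesky(corr)

# Independent standard-normal columns, mixed so that the columns of
# data have approximately the target pairwise correlations.
z = rng.standard_normal((n_rows, n_cols))
data = z @ chol.T

# Sample Pearson correlation matrix of the generated columns.
sample_corr = np.corrcoef(data, rowvar=False)
print(sample_corr.round(2))
```

The sample coefficients only approximate target_r; the match tightens as the number of rows grows.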

2 Answers

慕妹3146593

Contributed 1820 experience points, received more than 9 likes

Try something like this:


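import numpy as np

# Pick random row indices; percentage is a fraction of the row count, e.g. 0.1
indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))

# An ndarray has no append() method, so stack the selected rows onto the end
array = np.concatenate([array, array[indices]], axis=0)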

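Here I assume your data is stored in array, an ndarray in which each row holds your 7 or 8 columns of data. The code above creates an array of random indices, selects the rows at those indices, and appends them to the array again.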
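
A self-contained sketch of the same idea, so it can be run and checked. The helper name append_duplicate_rows, the example shape, and the 25% figure are hypothetical choices for illustration, not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility


def append_duplicate_rows(array, percentage, rng):
    """Return a new array with randomly chosen rows appended again.

    percentage is a fraction of the original row count (0.25 means 25%).
    Rows are drawn with replacement, so a row can be repeated several times.
    """
    n_rows = array.shape[0]
    n_dup = int(np.ceil(percentage * n_rows))
    indices = rng.integers(0, n_rows, size=n_dup)
    return np.concatenate([array, array[indices]], axis=0)


array = rng.random((1000, 8))             # stand-in for the correlated data
with_dups = append_duplicate_rows(array, 0.25, rng)

# Count distinct rows to confirm that only repeated rows were added.
n_unique_rows = np.unique(with_dups, axis=0).shape[0]
print(with_dups.shape, n_unique_rows)
```

Note that this repeats entire rows, so every column ends up with the same share of duplicates; it does not by itself give a different rate per column.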


2021-09-11
catspeake

Contributed 1111 experience points, received more than 0 likes

I found a solution. I'm posting the code in case it helps someone.


import numpy as np
import pandas as pd

# The raw data: uniform random values with the given shape
rnd = np.random.random(size=(10**7, 8))

# attr1 is one column of the covariance matrix. I want correlated data,
# so its entries are drawn at random between 0.8 and 0.95.
attr1 = np.random.uniform(0.8, .95, size=(8, 1))

# attr2 ... attr8 are built like attr1, each with its own range of values
# (all above 0.7)


# corr_mat is the matrix formed by placing those columns side by side
corr_mat = np.column_stack((attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8))


from statsmodels.stats.correlation_tools import cov_nearest

# Find the covariance matrix nearest to corr_mat,
# to be sure that it is positive definite
a = cov_nearest(corr_mat)


from scipy.linalg import cholesky

# scipy's cholesky returns the upper-triangular factor by default
upper_chol = cholesky(a)

# Finally, compute the matrix product of rnd and upper_chol
ans = rnd @ upper_chol
# ans now holds randomly correlated data (high correlation, but customizable)


# Next, create a pandas DataFrame from the values in ans
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4',
                                'att5', 'att6', 'att7', 'att8'])


# Last step: round the float values of each column to a randomly chosen
# number of decimals, so the columns get duplicates in varying percentages.
# The floats in ans carry about 16 decimals, so each column is rounded to
# a random whole number of decimals from 5 to 11 (randint excludes 12).
a = df.to_numpy(copy=True)  # writable copy; a ends up holding the rounded data
for i in range(8):
    trunc = np.random.randint(5, 12)
    print(trunc)
    a[:, i] = a[:, i].round(decimals=trunc)

Finally, these are the duplicate percentages for each of my columns:


duplicate rate attribute: att1 = 5.159390000000002
duplicate rate attribute: att2 = 11.852260000000001
duplicate rate attribute: att3 = 12.036079999999998
duplicate rate attribute: att4 = 35.10611
duplicate rate attribute: att5 = 4.6471599999999995
duplicate rate attribute: att6 = 35.46553
duplicate rate attribute: att7 = 0.49115000000000464
duplicate rate attribute: att8 = 37.33252
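
The answer does not show how these rates were measured. Below is a minimal sketch of the rounding step together with one possible measurement, assuming the rate is the percentage of values in a column that repeat an earlier value. The small uncorrelated stand-in DataFrame, the decimals mapping, and the duplicate_rate helper are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Small stand-in for the correlated data (illustrative only).
df = pd.DataFrame(rng.random((100_000, 3)), columns=["att1", "att2", "att3"])

# Fewer decimals means fewer distinct values, hence more duplicates.
decimals = {"att1": 6, "att2": 5, "att3": 4}  # hypothetical per-column choices
rounded = df.round(decimals)


def duplicate_rate(column):
    """Percentage of values that repeat an earlier value in the column.

    This is one possible definition; the answer does not state which
    definition produced its figures.
    """
    return 100.0 * column.duplicated().mean()


rates = {name: duplicate_rate(rounded[name]) for name in rounded.columns}
for name, rate in rates.items():
    print(f"duplicate rate attribute: {name} = {rate}")
```

Because the rate depends on both the number of rows and the number of decimals kept, the decimals needed to hit a given target rate have to be tuned for the dataset at hand.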


2021-09-11