Do you know if there is a way to get Python's random.sample to work with a generator object? I am trying to get a random sample from a very large text corpus. The problem is that random.sample() raises the following error:

```
TypeError: object of type 'generator' has no len()
```

I was thinking that maybe there is some way of doing this with something from itertools, but I couldn't find anything after a bit of searching.

A somewhat made-up example:

```python
import random

def list_item(ls):
    for item in ls:
        yield item

random.sample(list_item(range(100)), 20)
```

UPDATE

As per Martijn Pieters' request, I timed the three methods currently proposed. The results are as follows:

```
Sampling 1000 from 10000
Using iterSample            0.0163 s
Using sample_from_iterable  0.0098 s
Using iter_sample_fast      0.0148 s

Sampling 10000 from 100000
Using iterSample            0.1786 s
Using sample_from_iterable  0.1320 s
Using iter_sample_fast      0.1576 s

Sampling 100000 from 1000000
Using iterSample            3.2740 s
Using sample_from_iterable  1.9860 s
Using iter_sample_fast      1.4586 s

Sampling 200000 from 1000000
Using iterSample            7.6115 s
Using sample_from_iterable  3.0663 s
Using iter_sample_fast      1.4101 s

Sampling 500000 from 1000000
Using iterSample           39.2595 s
Using sample_from_iterable  4.9994 s
Using iter_sample_fast      1.2178 s

Sampling 2000000 from 5000000
Using iterSample          798.8016 s
Using sample_from_iterable 28.6618 s
Using iter_sample_fast      6.6482 s
```

So it turns out that the array.insert approach has a serious drawback when large sample sizes are involved. The code I used to time the methods (the definition of iter_sample_fast is not reproduced here; see the sketch after this post):

```python
from heapq import nlargest
import random
import timeit

def iterSample(iterable, samplesize):
    results = []
    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v)  # add first samplesize items in random order
            else:
                results[r] = v  # at a decreasing rate, replace random items
    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results

def sample_from_iterable(iterable, samplesize):
    return (x for _, x in nlargest(samplesize, ((random.random(), x) for x in iterable)))
```

I also ran a test to check that all the methods indeed take an unbiased sample from the generator. For each method I sampled 1000 elements from 10000, 100000 times, and computed the average frequency with which each item of the population occurred. It turned out to be ~0.1 for all three methods, as one would expect.
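The iter_sample_fast definition referenced in the timings above is missing from the snippet. For completeness, here is a sketch of a reservoir-sampling implementation consistent with the timed behaviour (an assumption on my part; the exact code from the original thread may differ in detail). It pre-fills the reservoir with the first samplesize items, shuffles them, and then replaces random slots at a decreasing rate:

```python
import random

def iter_sample_fast(iterable, samplesize):
    # Sketch of a reservoir-sampling variant (assumed implementation).
    iterator = iter(iterable)
    # Pre-fill the reservoir with the first `samplesize` items.
    results = []
    try:
        for _ in range(samplesize):
            results.append(next(iterator))
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # randomize the positions of the initial items
    # Each later item at index i replaces a random slot with
    # probability samplesize / (i + 1), keeping the sample unbiased.
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v
    return results
```

Because it never calls insert() on the result list, its cost per item stays constant, which matches why it scales so much better than iterSample at large sample sizes.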
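The timing harness itself is not shown either, even though timeit is imported. A minimal sketch of how numbers like those above could be produced (the harness shape, the time_all name, and the single-repeat setting are my own assumptions; it presumes the three samplers from the post are defined):

```python
import timeit

def time_all(pop_size, sample_size, repeats=1):
    # Hypothetical harness: time each sampler on a fresh iterator.
    print("Sampling %d from %d" % (sample_size, pop_size))
    for func in (iterSample, sample_from_iterable, iter_sample_fast):
        t = timeit.timeit(
            # list() forces evaluation, since sample_from_iterable
            # returns a lazy generator expression.
            lambda: list(func(iter(range(pop_size)), sample_size)),
            number=repeats,
        )
        print("Using %s %.4f s" % (func.__name__, t / repeats))

for pop, n in [(10000, 1000), (100000, 10000), (1000000, 100000),
               (1000000, 200000), (1000000, 500000), (5000000, 2000000)]:
    time_all(pop, n)
```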
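As for the unbiasedness check described at the end, it can be sketched roughly as follows (a hypothetical reconstruction: the check_bias name and the reduced trial count are mine; the post used 100000 trials). Each item's inclusion frequency should come out near sample_size / pop_size = 0.1:

```python
from collections import Counter

def check_bias(sampler, pop_size=10000, sample_size=1000, trials=1000):
    # Repeatedly sample and count how often each population item appears.
    counts = Counter()
    for _ in range(trials):
        counts.update(sampler(iter(range(pop_size)), sample_size))
    # Per-item inclusion frequency; each should be close to
    # sample_size / pop_size (~0.1) if the sampler is unbiased.
    freqs = [counts[item] / trials for item in range(pop_size)]
    return min(freqs), sum(freqs) / pop_size, max(freqs)

print(check_bias(iter_sample_fast))  # expect all three numbers near 0.1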