TfidfVectorizer model loaded from a joblib file only works when it was trained in the same session

www说 2023-08-22 10:38:50
sklearn's TfidfVectorizer only works when it is applied right after training if the analyzer operates on lists of nltk.tree.Tree objects. This is a mystery, because the model is always loaded from file before it is applied. Debugging shows nothing wrong or different about the model file when it is loaded and applied at the start of its own session, compared with when it was trained in that same session. The analyzer is applied and works correctly in both cases.

Below is a script that helps reproduce this mysterious behavior:

import joblib
import numpy as np
from nltk import Tree
from sklearn.feature_extraction.text import TfidfVectorizer


def lexicalized_production_analyzer(sentence_trees):
    # One document is a list of parse trees; its "tokens" are the production rules.
    productions_per_sentence = [tree.productions() for tree in sentence_trees]
    return np.concatenate(productions_per_sentence)


def train(corpus):
    # Fit the vectorizer and persist it to disk.
    model = TfidfVectorizer(analyzer=lexicalized_production_analyzer)
    model.fit(corpus)
    joblib.dump(model, "model.joblib")


def apply(corpus):
    # Always load the model from file before transforming.
    model = joblib.load("model.joblib")
    result = model.transform(corpus)
    return result


# example data
trees = [
    Tree('ROOT', [Tree('FRAG', [Tree('S', [Tree('VP', [Tree('VBG', ['arkling']), Tree('NP', [Tree('NP', [Tree('NNS', ['dots'])]), Tree('VP', [Tree('VBG', ['nestling']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['grass'])])])])])])]), Tree(',', [',']), Tree('VP', [Tree('VBG', ['winking']), Tree('CC', ['and']), Tree('VP', [Tree('VBG', ['glimmering']), Tree('PP', [Tree('IN', ['like']), Tree('NP', [Tree('NNS', ['jewels'])])])])]), Tree('.', ['.'])])]),
    Tree('ROOT', [Tree('FRAG', [Tree('NP', [Tree('NP', [Tree('NNP', ['Rose']), Tree('NNS', ['petals'])]), Tree('NP', [Tree('NP', [Tree('ADVP', [Tree('RB', ['perhaps'])]), Tree(',', [',']), Tree('CC', ['or']), Tree('NP', [Tree('DT', ['some'])]), Tree('NML', [Tree('NN', ['kind'])])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NN', ['confetti'])])])])]), Tree('.', ['.'])])]),
]
corpus = [trees, trees, trees]

First, train the model and save the model.joblib file, then apply it in the same session:

train(corpus)
result = apply(corpus)
print("number of elements in results: " + str(result.getnnz()))
print("shape of results: " + str(result.shape))

We print the count returned by .getnnz() to show that the model is working, with 120 stored elements:

number of elements in results: 120
shape of results: (3, 40)

Applying the saved model in a new session, without training first, does not work the same way. Yet the model is loaded from file both times, and there are no global variables (that I know of), so we cannot think of why it works in one case and not in the other.

1 Answer
GCT1015

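Python's hash function is not deterministic between runs: by default, string hashes are salted per process, so the same value can hash differently in two sessions. Here the hash values are pickled by joblib along with the objects instead of being recomputed on load, as they should be, so this looks like a bug in nltk. As a result, the reloaded model does not recognize the production rules: their hashes no longer match, so it is as if the production rules had never been stored in the vocabulary.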

Quite tricky!
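
The mechanism described above can be reproduced without nltk or sklearn. The following is only a minimal sketch under stated assumptions: CachedHashKey is a hypothetical stand-in for any object that caches its hash at construction time, not nltk's actual code, and the plain pickle module stands in for joblib. One process pickles a vocabulary dict, and a second process with a different hash seed loads it and looks up an equal key.

```python
import os
import subprocess
import sys
import tempfile

# Source for the child processes. CachedHashKey is a hypothetical class
# (an assumption for illustration) that stores its hash once in __init__,
# so the stored value is pickled together with the object.
CHILD_SOURCE = r'''
import pickle, sys

class CachedHashKey:
    def __init__(self, text):
        self.text = text
        self._hash = hash(text)  # computed once, then pickled as plain state

    def __eq__(self, other):
        return isinstance(other, CachedHashKey) and self.text == other.text

    def __hash__(self):
        return self._hash  # never recomputed after unpickling

mode, path = sys.argv[1], sys.argv[2]
if mode == "dump":
    with open(path, "wb") as fh:
        pickle.dump({CachedHashKey("NP -> DT NN"): 0}, fh)
else:
    with open(path, "rb") as fh:
        vocabulary = pickle.load(fh)
    # A freshly built, equal key hashes with the current process's salt.
    print(CachedHashKey("NP -> DT NN") in vocabulary)
'''


def run_child(mode, path, seed):
    """Run CHILD_SOURCE in a new interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    done = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE, mode, path],
        env=env, capture_output=True, text=True, check=True,
    )
    return done.stdout.strip()


def lookup_after_reload(dump_seed, load_seed):
    """Pickle in one process, reload in another, report whether lookup works."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vocabulary.pkl")
        run_child("dump", path, dump_seed)
        return run_child("load", path, load_seed) == "True"


if __name__ == "__main__":
    print("same seed:     ", lookup_after_reload("1", "1"))
    print("different seed:", lookup_after_reload("1", "2"))
```

With matching seeds the lookup succeeds. With different seeds the reloaded key still carries the hash from the old process, so an equal key built in the new process is not found.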

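Until this particular issue is fixed in nltk, setting PYTHONHASHSEED before running the training and test scripts will force the hashes to be the same every time.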

PYTHONHASHSEED=0 python script.py
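
To check the effect of the seed on a given machine, a small helper like the one below can compare string hashes across freshly started interpreters. This is a sketch only: string_hash_in_fresh_interpreter is a hypothetical helper name, and the sample string is arbitrary.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(text, seed=None):
    """Return hash(text) as computed by a newly started Python process.

    seed=None leaves PYTHONHASHSEED unset (random salt per process);
    any other value is passed to the child as PYTHONHASHSEED.
    """
    env = dict(os.environ)
    if seed is None:
        env.pop("PYTHONHASHSEED", None)
    else:
        env["PYTHONHASHSEED"] = str(seed)
    done = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(done.stdout)


if __name__ == "__main__":
    rule = "NP -> DT NN"
    # Same fixed seed in two separate processes: the hashes agree.
    print(string_hash_in_fresh_interpreter(rule, 0)
          == string_hash_in_fresh_interpreter(rule, 0))
    # No seed set: each process picks its own salt, so the values usually differ.
    print(string_hash_in_fresh_interpreter(rule),
          string_hash_in_fresh_interpreter(rule))
```

The variable has to be in the environment before the interpreter starts. Assigning to os.environ inside an already running script does not change that process's hash salt, which is why the command above sets it on the command line.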

Answered 2023-08-22