1 回答
TA贡献1860条经验 获得超9个赞
原因是您已经使用了自定义和默认,因此在提取要素时,请检查和 之间是否存在任何不一致tokenizerstop_words='english'stop_wordstokenizer
如果您深入研究代码,您会发现此代码片段正在执行一致性检查:sklearn/feature_extraction/text.py
def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):
"""Check if stop words are consistent
Returns
-------
is_consistent : True if stop words are consistent with the preprocessor
and tokenizer, False if they are not, None if the check
was previously performed, "error" if it could not be
performed (e.g. because of the use of a custom
preprocessor / tokenizer)
"""
if id(self.stop_words) == getattr(self, '_stop_words_id', None):
# Stop words are were previously validated
return None
# NB: stop_words is validated, unlike self.stop_words
try:
inconsistent = set()
for w in stop_words or ():
tokens = list(tokenize(preprocess(w)))
for token in tokens:
if token not in stop_words:
inconsistent.add(token)
self._stop_words_id = id(self.stop_words)
if inconsistent:
warnings.warn('Your stop_words may be inconsistent with '
'your preprocessing. Tokenizing the stop '
'words generated tokens %r not in '
'stop_words.' % sorted(inconsistent))
如您所见,如果发现不一致,它会引发警告。
添加回答
举报