合并NLP中的相关词

忽然笑

TA贡献1806条经验获得超5个赞

更新 2020 年 10 月 21 日

我决定构建一个 Python 模块来处理我在这个答案中概述的任务。该模块名为wordhoard ，可以从pypi下载

我曾尝试在需要确定关键字（例如医疗保健）和关键字同义词（例如健康计划、预防医学）的频率的项目中使用Word2vec 和WordNet。我发现大多数 NLP 库无法生成我需要的结果，因此我决定使用自定义关键字和同义词构建自己的词典。这种方法适用于多个项目中的文本分析和分类。

我确信精通 NLP 技术的人可能有更强大的解决方案，但下面的解决方案是一次又一次为我工作的类似解决方案。

我对答案进行了编码以匹配您问题中的词频数据，但可以对其进行修改以使用任何关键字和同义词数据集。

import string

# Python Dictionary

# I manually created these word relationship - primary_word:synonyms

word_relationship = {"father": ['dad', 'daddy', 'old man', 'pa', 'pappy', 'papa', 'pop'],

"mother": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

# This input text is from various poems about mothers and fathers

input_text = 'The hand that rocks the cradle also makes the house a home. It is the prayers of the mother ' \

'that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of ' \

'her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She ' \

'has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the ' \

'greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend, ' \

'This to me you have always been. Through the good times and the bad, Your understanding I have had.'

# converts the input text to lowercase and splits the words based on empty space.

wordlist = input_text.lower().split()

# remove all punctuation from the wordlist

remove_punctuation = [''.join(ch for ch in s if ch not in string.punctuation)

for s in wordlist]

# list for word frequencies

wordfreq = []

# count the frequencies of a word

for w in remove_punctuation:

wordfreq.append(remove_punctuation.count(w))

word_frequencies = (dict(zip(remove_punctuation, wordfreq)))

word_matches = []

# loop through the dictionaries

for word, frequency in word_frequencies.items():

for keyword, synonym in word_relationship.items():

match = [x for x in synonym if word == x]

if word == keyword or match:

match = ' '.join(map(str, match))

# append the keywords (mother), synonyms(mom) and frequencies to a list

word_matches.append([keyword, match, frequency])

# used to hold the final keyword and frequencies

final_results = {}

# list comprehension to obtain the primary keyword and its frequencies

synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# iterate synonym_matches and output total frequency count for a specific keyword

for item in synonym_matches:

if item[0] not in final_results.keys():

frequency_count = 0

frequency_count = frequency_count + item[1]

final_results[item[0]] = frequency_count

else:

frequency_count = frequency_count + item[1]

final_results[item[0]] = frequency_count

print(final_results)

# output

{'mother': 3, 'father': 2}

其他方法

以下是一些其他方法及其开箱即用的输出。

NLTK字网

在此示例中，我查找了“母亲”一词的同义词。请注意，WordNet 没有与“mother”一词相关的同义词“mom”或“mum”。这两个词在我上面的示例文本中。另请注意，“父亲”一词被列为“母亲”的同义词。

from nltk.corpus import wordnet

synonyms = []

word = 'mother'

for synonym in wordnet.synsets(word):

for item in synonym.lemmas():

if word != synonym.name() and len(synonym.lemma_names()) > 1:

synonyms.append(item.name())

print(synonyms)

['mother', 'female_parent', 'mother', 'fuss', 'overprotect', 'beget', 'get', 'engender', 'father', 'mother', 'sire', 'generate', 'bring_forth']

Py字典

在此示例中，我使用 PyDictionary 查找“mother”一词的同义词，该词典查询synonym.com。此示例中的同义词包括单词“mom”和“mum”。此示例还包括 WordNet 未生成的其他同义词。

但是，PyDictionary 还生成了“妈妈”的同义词列表。这与“母亲”这个词无关。PyDictionary 似乎是从页面的形容词部分而不是名词部分提取此列表。计算机很难区分形容词妈妈和名词妈妈。

from PyDictionary import PyDictionary

dictionary_mother = PyDictionary('mother')

print(dictionary_mother.getSynonyms())

# output

[{'mother': ['mother-in-law', 'female parent', 'supermom', 'mum', 'parent', 'mom', 'momma', 'para I', 'mama', 'mummy', 'quadripara', 'mommy', 'quintipara', 'ma', 'puerpera', 'surrogate mother', 'mater', 'primipara', 'mammy', 'mamma']}]

dictionary_mum = PyDictionary('mum')

print(dictionary_mum.getSynonyms())

# output

[{'mum': ['incommunicative', 'silent', 'uncommunicative']}]

其他一些可能的方法是使用牛津词典 API 或查询 thesaurus.com。这两种方法也都有缺陷。例如，牛津词典 API 需要 API 密钥和基于查询编号的付费订阅。thesaurus.com 缺少可能对单词分组有用的潜在同义词。

https://www.thesaurus.com/browse/mother

synonyms: mom, parent, ancestor, creator, mommy, origin, predecessor, progenitor, source, child-bearer, forebearer, procreator

更新

为语料库中的每个潜在单词生成精确的同义词列表很困难，并且需要多管齐下的方法。下面的代码使用 WordNet 和 PyDictionary 创建同义词的超集。与所有其他答案一样，这种组合方法也会导致词频的过度计算。我一直在尝试通过在最终的同义词词典中组合键和值对来减少这种过度计数。后一个问题比我预期的要困难得多，可能需要我提出自己的问题来解决。最后，我认为根据您的用例，您需要确定哪种方法最有效，并且可能需要结合多种方法。

感谢您提出这个问题，因为它使我能够了解解决复杂问题的其他方法。

from string import punctuation

from nltk.corpus import stopwords

from nltk.corpus import wordnet

from PyDictionary import PyDictionary

input_text = """The hand that rocks the cradle also makes the house a home. It is the prayers of the mother

that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of

her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She

has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the

greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend,

This to me you have always been. Through the good times and the bad, Your understanding I have had."""

def normalize_textual_information(text):

# split text into tokens by white space

token = text.split()

# remove punctuation from each token

table = str.maketrans('', '', punctuation)

token = [word.translate(table) for word in token]

# remove any tokens that are not alphabetic

token = [word.lower() for word in token if word.isalpha()]

# filter out English stop words

stop_words = set(stopwords.words('english'))

# you could add additional stops like this

stop_words.add('cannot')

stop_words.add('could')

stop_words.add('would')

token = [word for word in token if word not in stop_words]

# filter out any short tokens

token = [word for word in token if len(word) > 1]

return token

def generate_word_frequencies(words):

# list to hold word frequencies

word_frequencies = []

# loop through the tokens and generate a word count for each token

for word in words:

word_frequencies.append(words.count(word))

# aggregates the words and word_frequencies into tuples and coverts them into a dictionary

word_frequencies = (dict(zip(words, word_frequencies)))

# sort the frequency of the words from low to high

sorted_frequencies = {key: value for key, value in

sorted(word_frequencies.items(), key=lambda item: item[1])}

return sorted_frequencies

def get_synonyms_internet(word):

dictionary = PyDictionary(word)

synonym = dictionary.getSynonyms()

return synonym

words = normalize_textual_information(input_text)

all_synsets_1 = {}

for word in words:

for synonym in wordnet.synsets(word):

if word != synonym.name() and len(synonym.lemma_names()) > 1:

for item in synonym.lemmas():

if word != item.name():

all_synsets_1.setdefault(word, []).append(str(item.name()).lower())

all_synsets_2 = {}

for word in words:

word_synonyms = get_synonyms_internet(word)

for synonym in word_synonyms:

if word != synonym and synonym is not None:

all_synsets_2.update(synonym)

word_relationship = {**all_synsets_1, **all_synsets_2}

frequencies = generate_word_frequencies(words)

word_matches = []

word_set = {}

duplication_check = set()

for word, frequency in frequencies.items():

for keyword, synonym in word_relationship.items():

match = [x for x in synonym if word == x]

if word == keyword or match:

match = ' '.join(map(str, match))

if match not in word_set or match not in duplication_check or word not in duplication_check:

duplication_check.add(word)

duplication_check.add(match)

word_matches.append([keyword, match, frequency])

# used to hold the final keyword and frequencies

final_results = {}

# list comprehension to obtain the primary keyword and its frequencies

synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# iterate synonym_matches and output total frequency count for a specific keyword

for item in synonym_matches:

if item[0] not in final_results.keys():

frequency_count = 0

frequency_count = frequency_count + item[1]

final_results[item[0]] = frequency_count

else:

frequency_count = frequency_count + item[1]

final_results[item[0]] = frequency_count

# do something with the final results

反对回复 2023-07-11

慕斯王

TA贡献1864条经验获得超2个赞

这是一个难题，最佳解决方案取决于您要解决的用例。这是一个难题，因为要组合单词，您需要理解单词的语义。您可以将mom和组合mother在一起，因为它们在语义上相关。

识别两个单词是否在语义上相关的一种方法是关联分布式单词嵌入（向量），如 word2vec、Glove、fasttext 等。您可以找到所有单词向量与某个单词的余弦相似度，并且可以选取前 5 个接近的单词并创建新单词。

使用 word2vec 的示例

# Load a pretrained word2vec model

import gensim.downloader as api

model = api.load('word2vec-google-news-300')

vectors = [model.get_vector(w) for w in words]

for i, w in enumerate(vectors):

first_best_match = model.cosine_similarities(vectors[i], vectors).argsort()[::-1][1]

second_best_match = model.cosine_similarities(vectors[i], vectors).argsort()[::-1][2]

print (f"{words[i]} + {words[first_best_match]}")

print (f"{words[i]} + {words[second_best_match]}")

输出：

mom + mother

mom + teacher

mother + mom

mother + teacher

london + mom

london + life

life + mother

life + mom

teach + teacher

teach + mom

teacher + teach

teacher + mother

您可以尝试对余弦相似度设置阈值，并仅选择余弦相似度大于此阈值的那些。

语义相似性的一个问题是，它们在语义上可能是相反的，因此它们是相似的（男人-女人），另一方面（男人-国王）在语义上相似，因为它们是相同的。

反对回复 2023-07-11

梵蒂冈之花

TA贡献1900条经验获得超5个赞

您想要实现的是语义文本相似性。

例如：

#@title Load the Universal Sentence Encoder's TF Hub module

from absl import logging

import tensorflow as tf

import tensorflow_hub as hub

import matplotlib.pyplot as plt

import numpy as np

import os

import pandas as pd

import re

import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]

model = hub.load(module_url)

print ("module %s loaded" % module_url)

def embed(input):

return model(input)

def plot_similarity(labels, features, rotation):

corr = np.inner(features, features)

sns.set(font_scale=1.2)

g = sns.heatmap(

corr,

xticklabels=labels,

yticklabels=labels,

vmin=0,

vmax=1,

cmap="YlOrRd")

g.set_xticklabels(labels, rotation=rotation)

g.set_title("Semantic Textual Similarity")

def run_and_plot(messages_):

message_embeddings_ = embed(messages_)

plot_similarity(messages_, message_embeddings_, 90)

messages = [

"Mother",

"Mom",

"Mama",

"Dog",

"Cat"

]

run_and_plot(messages)

//img3.sycdn.imooc.com/64aceccf0001cbb406410465.jpg

反对回复 2023-07-11

繁星coding

TA贡献1797条经验获得超4个赞

另一种古怪的方法是使用旧的 PyDictionary 库来解决这个问题。您可以使用

dictionary.getSynonyms()

函数循环遍历列表中的所有单词并对它们进行分组。列出的所有可用同义词将被覆盖并映射到一组。允许您分配最终变量并对同义词求和。在你的例子中。您选择最后一个单词“Mother”，它会显示同义词的最终计数。

反对回复 2023-07-11

慕容708150

TA贡献1831条经验获得超4个赞

您可以生成词嵌入向量并使用一些聚类算法。最后，您需要调整算法的超参数以获得高精度的结果。

from sklearn.cluster import DBSCAN

from sklearn.decomposition import PCA

import spacy

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

# Load the large english model

nlp = spacy.load("en_core_web_lg")

tokens = nlp("dog cat banana apple teaching teacher mom mother mama mommy berlin paris")

# Generate word embedding vectors

vectors = np.array([token.vector for token in tokens])

vectors.shape

# (12, 300)

让我们使用主成分分析算法来可视化 3 维空间中的嵌入：

pca_vecs = PCA(n_components=3).fit_transform(vectors)

pca_vecs.shape

# (12, 3)

fig = plt.figure(figsize=(6, 6))

ax = fig.add_subplot(111, projection='3d')

xs, ys, zs = pca_vecs[:, 0], pca_vecs[:, 1], pca_vecs[:, 2]

_ = ax.scatter(xs, ys, zs)

for x, y, z, lable in zip(xs, ys, zs, tokens):

ax.text(x+0.3, y, z, str(lable))

//img1.sycdn.imooc.com//64acecfd0001ebed03360326.jpg

让我们使用 DBSCAN 算法对单词进行聚类：

model = DBSCAN(eps=5, min_samples=1)

model.fit(vectors)

for word, cluster in zip(tokens, model.labels_):

print(word, '->', cluster)

输出：

dog -> 0

cat -> 0

banana -> 1

apple -> 2

teaching -> 3

teacher -> 3

mom -> 4

mother -> 4

mama -> 4

mommy -> 4

berlin -> 5

paris -> 6

反对回复 2023-07-11

千巷猫影

TA贡献1829条经验获得超7个赞

这个想法是使用这个词典来识别相似的单词。

简而言之：运行一些知识发现算法，根据英语语法提取知识

这是一个同义词库：它有 18MB。

这是同义词库的摘录，您可以尝试通过某种算法来匹配单词替换。

{"word": "ma", "key": "ma_1", "pos": "noun", "synonyms": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

对于使用外部 api 的快速修复，这里是链接：它们允许使用 api 执行更多操作，例如获取同义词、查找多个定义、查找押韵单词等等。

反对回复 2023-07-11

合并NLP中的相关词

合并NLP中的相关词

6 回答

添加回答

热搜

最近搜索清空

合并NLP中的相关词

合并NLP中的相关词

6 回答

添加回答