
Using Tf-Idf in a Keras Model


富国沪深 2022-07-26 10:22:30
I have read my training, test, and validation sentences into train_sentences, test_sentences, and val_sentences, and then applied a Tf-Idf vectorizer to them:

vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(train_sentences)
X_train = vectorizer.transform(train_sentences)
X_val = vectorizer.transform(val_sentences)
X_test = vectorizer.transform(test_sentences)

My model looks like this:

model = Sequential()
model.add(Input(????))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='sigmoid'))
model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Normally, in the word2vec case, we pass an embedding matrix to the Embedding layer. How should I use Tf-IDF in a Keras model? Please give me a usage example. Thanks.

1 Answer

收到一只叮咚


I can't think of a compelling reason to combine TF/IDF values with embedding vectors, but here is one possible solution: use the functional API with multiple Inputs and the concatenate function.


To concatenate layer outputs, their shapes have to line up (except along the axis being concatenated). One way to achieve that here is to average the embeddings over the sequence axis, turning the (batch, maxlen, 300) embedding output into a (batch, 300) tensor that can then be concatenated with the 300-dimensional vector of TF/IDF values.


Setup and some sample data


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups

import numpy as np

import keras
from keras.models import Model
from keras.layers import Dense, Activation, concatenate, Embedding, Input
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# some sample training data: split the 20 newsgroups documents into sentences
bunch = fetch_20newsgroups()
all_sentences = []
for document in bunch.data:
  sentences = document.split("\n")
  all_sentences.extend(sentences)

all_sentences = all_sentences[:1000]

X_train, X_test = train_test_split(all_sentences, test_size=0.1)
len(X_train), len(X_test)

# TF-IDF representation: one 300-dimensional vector per sentence
vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(X_train)
df_train = vectorizer.transform(X_train)

# token-id sequences for the Embedding layer, padded to a fixed length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

maxlen = 50
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_train = pad_sequences(sequences_train, maxlen=maxlen)
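
As a quick sanity check (not part of the original answer), these are the two representations that will feed the two model inputs defined below; the exact row count depends on the train/test split:

print(df_train.shape)         # sparse TF-IDF matrix, e.g. (900, 300)
print(sequences_train.shape)  # padded token-id sequences, e.g. (900, 50)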

Model definition


vocab_size = len(tokenizer.word_index) + 1
embedding_size = 300

# two inputs: the 300-dim TF-IDF vector and the padded token-id sequence
input_tfidf = Input(shape=(300,))
input_text = Input(shape=(maxlen,))

embedding = Embedding(vocab_size, embedding_size, input_length=maxlen)(input_text)

# this averaging method taken from:
# https://stackoverflow.com/a/54217709/1987598
# average over the sequence axis: (batch, maxlen, 300) -> (batch, 300)
mean_embedding = keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1))(embedding)

# concatenate the TF-IDF vector with the averaged embedding -> (batch, 600)
concatenated = concatenate([input_tfidf, mean_embedding])

dense1 = Dense(256, activation='relu')(concatenated)
dense2 = Dense(32, activation='relu')(dense1)
dense3 = Dense(8, activation='sigmoid')(dense2)

model = Model(inputs=[input_tfidf, input_text], outputs=dense3)

model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Model summary output


Model: "model_2"

__________________________________________________________________________________________________

Layer (type)                    Output Shape         Param #     Connected to                     

==================================================================================================

input_11 (InputLayer)           (None, 50)           0                                            

__________________________________________________________________________________________________

embedding_5 (Embedding)         (None, 50, 300)      633900      input_11[0][0]                   

__________________________________________________________________________________________________

input_10 (InputLayer)           (None, 300)          0                                            

__________________________________________________________________________________________________

lambda_1 (Lambda)               (None, 300)          0           embedding_5[0][0]                

__________________________________________________________________________________________________

concatenate_4 (Concatenate)     (None, 600)          0           input_10[0][0]                   

                                                                 lambda_1[0][0]                   

__________________________________________________________________________________________________

dense_5 (Dense)                 (None, 256)          153856      concatenate_4[0][0]              

__________________________________________________________________________________________________

dense_6 (Dense)                 (None, 32)           8224        dense_5[0][0]                    

__________________________________________________________________________________________________

dense_7 (Dense)                 (None, 8)            264         dense_6[0][0]                    

==================================================================================================

Total params: 796,244

Trainable params: 796,244

Non-trainable params: 0
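
The answer stops after compile. As a minimal training sketch (not from the original answer), you would pass both inputs as a list in the same order as the Model definition, with the sparse TF-IDF matrix converted to a dense array; y_train below is a hypothetical placeholder for real (n_samples, 8) multi-label targets:

# Hypothetical labels for illustration only; replace with your real targets.
y_train = np.random.randint(0, 2, size=(df_train.shape[0], 8)).astype("float32")

model.fit(
    [df_train.toarray(), sequences_train],  # same order as Model(inputs=[input_tfidf, input_text], ...)
    y_train,
    epochs=3,
    batch_size=32,
)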

