4 Answers
Contributor: 1942 experience points, 3+ upvotes
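Your new data may differ substantially from the first dataset you used to train and test the model. Preprocessing techniques and statistical analysis will help you characterize the data and compare the datasets. Poor performance on new data can have several causes, including:
Your initial dataset is not statistically representative of the larger dataset (for example, your dataset is an edge case).
Overfitting: the model was overtrained and has learned the idiosyncrasies (noise) of the training data.
Different preprocessing methods were applied to the two datasets.
An imbalanced training dataset. ML techniques work best with balanced datasets (each class occurs about equally often in the training set).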
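Two of the causes above, class imbalance and a mismatch between the old and new data, can be checked with a few lines of code. The following is only a minimal sketch: the helper names `class_distribution` and `out_of_vocabulary_rate`, the whitespace tokenization, and the toy data are all assumptions made for illustration, not part of the original question.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of samples that fall in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def out_of_vocabulary_rate(train_texts, new_texts):
    """Fraction of tokens in new_texts that never appear in train_texts."""
    vocabulary = {tok for text in train_texts for tok in text.lower().split()}
    new_tokens = [tok for text in new_texts for tok in text.lower().split()]
    if not new_tokens:
        return 0.0
    unseen = sum(1 for tok in new_tokens if tok not in vocabulary)
    return unseen / len(new_tokens)


# Toy data, invented for illustration only.
train_texts = ["good movie", "bad movie", "good plot", "good acting"]
train_labels = ["pos", "neg", "pos", "pos"]
new_texts = ["terrible service", "good food"]

distribution = class_distribution(train_labels)
oov_rate = out_of_vocabulary_rate(train_texts, new_texts)
print(distribution)  # a skewed distribution hints at an imbalanced training set
print(oov_rate)      # a high rate hints that the new data differs from the training data
```

A strongly skewed class distribution or a high share of unseen tokens does not prove which cause is at work, but either one is a reasonable signal to look at before retraining.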
Contributor: 1799 experience points, 8+ upvotes
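I carried out a survey study of how different classifiers perform in sentiment analysis. For a particular Twitter dataset, I ran logistic regression, naive Bayes, support vector machine (SVM), k-nearest neighbors (KNN), and decision tree models. On the selected dataset, logistic regression and naive Bayes achieved the best accuracy across all the tests, followed by SVM and then the decision tree. KNN had the lowest accuracy score. Overall, logistic regression and naive Bayes were the best and second-best models for sentiment analysis and prediction.

Sentiment classifier    Accuracy score (%)    RMSE
LR                      78.3541               1.053619
NB                      76.764706             1.064738
SVM                     73.5835               1.074752
DT                      69.2941               1.145234
KNN                     62.9476               1.376589

In these cases, feature extraction is critical.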
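To make the point about feature extraction concrete, here is a small sketch of word n-gram counting. It loosely mirrors the idea behind the `ngram_range` option of scikit-learn's `CountVectorizer`, but it is not that implementation: the function name `extract_ngrams`, the whitespace tokenization, and the example sentence are assumptions for illustration.

```python
from collections import Counter


def extract_ngrams(text, ngram_range=(1, 1)):
    """Count word n-grams for every n in the inclusive ngram_range."""
    tokens = text.lower().split()
    low, high = ngram_range
    features = Counter()
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


sentence = "not good at all"  # invented example
unigrams = extract_ngrams(sentence, (1, 1))
uni_bigrams = extract_ngrams(sentence, (1, 2))
print(sorted(unigrams))
print(sorted(uni_bigrams))
```

With unigrams alone, "not good" and "good" share the feature `good`. Adding bigrams gives the classifier a separate `not good` feature, which is one reason the choice of features can matter as much as the choice of model.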
Contributor: 2039 experience points, 7+ upvotes
Import the required libraries:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import time
df = pd.read_csv('FilePath', header=0)
X = df['content']
y = df['sentiment']
def lrSentimentAnalysis(n):
    # Using CountVectorizer to convert text into tokens/features
    vect = CountVectorizer(ngram_range=(1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=n)
    # Fit the vocabulary on the training text only, then turn both splits into count matrices
    vect.fit(X_train)
    X_train_dtm = vect.transform(X_train)
    X_test_dtm = vect.transform(X_test)
    # Hyperparameter grid for the search
    # dual = [True, False]
    max_iter = [100, 110, 120, 130, 140, 150]
    C = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5]
    solvers = ['newton-cg', 'lbfgs', 'liblinear']
    param_grid = dict(max_iter=max_iter, C=C, solver=solvers)
    LR1 = LogisticRegression(penalty='l2')
    grid = GridSearchCV(estimator=LR1, param_grid=param_grid, cv=10, n_jobs=-1)
    grid_result = grid.fit(X_train_dtm, y_train)
    # Summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    y_pred = grid_result.predict(X_test_dtm)
    print('Accuracy Score: ', metrics.accuracy_score(y_test, y_pred) * 100, '%')
    # print('Confusion Matrix: ',metrics.confusion_matrix(y_test,y_pred))
    # print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
    # print('MSE:', metrics.mean_squared_error(y_test, y_pred))
    # RMSE requires numeric sentiment labels
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    # Return the split size, test accuracy, and the best hyperparameters found
    best_params = grid_result.best_params_
    return [n, metrics.accuracy_score(y_test, y_pred) * 100, best_params['max_iter'],
            best_params['C'], best_params['solver']]
def darwConfusionMetrix(accList):
    # accList is [test_size, accuracy, max_iter, C, solver] as returned by lrSentimentAnalysis
    # Using CountVectorizer to convert text into tokens/features
    vect = CountVectorizer(ngram_range=(1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=accList[0])
    # Fit the vocabulary on the training text only, then turn both splits into count matrices
    vect.fit(X_train)
    X_train_dtm = vect.transform(X_train)
    X_test_dtm = vect.transform(X_test)
    # Refit Logistic Regression with the best hyperparameters
    LR = LogisticRegression(penalty='l2', max_iter=accList[2], C=accList[3], solver=accList[4])
    LR.fit(X_train_dtm, y_train)
    y_pred = LR.predict(X_test_dtm)
    # creating a heatmap for confusion matrix
    # Pass labels explicitly so the matrix shape always matches the row/column names
    labels = np.unique(y_test)
    data = metrics.confusion_matrix(y_test, y_pred, labels=labels)
    df_cm = pd.DataFrame(data, columns=labels, index=labels)
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    plt.figure(figsize=(10, 7))
    sns.set(font_scale=1.4)  # for label size
    sns.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})  # font size
    fig0 = plt.gcf()
    fig0.show()
    fig0.savefig('FilePath', dpi=100)
def findModelWithBestAccuracy(accList):
    accuracyList = []
    for item in accList:
        accuracyList.append(item[1])
    N = accuracyList.index(max(accuracyList))
    print('Best Model:', accList[N])
    return accList[N]
accList = []
print('Logistic Regression')
print('grid search method for hyperparameter tuning (accuracy by cross validation) ')
# Try test-set fractions from 0.2 to 0.6 and keep the result of each run
for i in range(2, 7):
    n = i / 10.0
    print("\nsplit ", i - 1, ": n=", n)
    accList.append(lrSentimentAnalysis(n))
darwConfusionMetrix(findModelWithBestAccuracy(accList))
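Contributor: 1794 experience points, 8+ upvotes
Preprocessing is an important part of building a well-performing classifier. When there is such a large gap between training-set and test-set performance, it is very likely that something went wrong in your (test-set) preprocessing.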
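One way to rule out a preprocessing mismatch is to route both training and test text through the same function and to build the vocabulary from the training data only. The sketch below illustrates that idea under stated assumptions: the cleaning rules in `preprocess`, the helper names `build_vocabulary` and `vectorize`, and the sample texts are hypothetical, not taken from the question.

```python
import re


def preprocess(text):
    """Lowercase, strip URLs and non-letters, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def build_vocabulary(texts):
    """Map each token seen in the training texts to a column index."""
    tokens = sorted({tok for text in texts for tok in preprocess(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Encode a text as token counts over a fixed, training-time vocabulary."""
    vector = [0] * len(vocabulary)
    for tok in preprocess(text):
        if tok in vocabulary:  # tokens unseen in training are ignored
            vector[vocabulary[tok]] += 1
    return vector


# Sample texts, invented for illustration only.
train_texts = ["Great phone!!", "Awful battery http://example.com"]
test_texts = ["GREAT battery, awful screen"]

vocabulary = build_vocabulary(train_texts)  # fitted on training data only
train_vectors = [vectorize(t, vocabulary) for t in train_texts]
test_vectors = [vectorize(t, vocabulary) for t in test_texts]
print(vocabulary)
print(test_vectors)
```

Because `vectorize` always calls `preprocess`, the test text cannot be cleaned differently from the training text by accident. If you use scikit-learn, wrapping the vectorizer and the classifier in a single pipeline object achieves the same guarantee.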
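You can also use a classifier without any programming.
You can visit the web service Insight Classifiers and try building one for free first.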