Message Board Code: 900,000 People's Daily Online Leaders' Message Board Records and Python Code for LDA Topic Modeling

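Text data from the People's Daily Online Leaders' Message Board (人民网领导留言板) is widely used in research. In this post we share the dataset and run topic modeling on a sample of the message text. How to obtain the data and code is described later in the post.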

1. Leaders' Message Board data

The Leaders' Message Board data were collected from People's Daily Online and span December 2019 to May 2022. Each record has 20 fields, listed below:

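# Fields

Leader ID (领导ID)
Local leader (地方领导)
Province (所属省份)
User (用户)
User region (用户地区)
Message content (留言内容)
Message time (留言时间)
Message category ID (留言类别ID)
Message category (留言类别)
Message topic (留言主题)
Message type (留言类型)
Number of likes (点赞数)
Handling status (处理状态)
Replying agency (回复机构)
Reply content (回复内容)
Reply time (回复时间)
Satisfaction rating (满意程度评分)
Resolution rating (解决程度评分)
Handling attitude rating (办理态度评分)
Handling speed rating (办理速度评分)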
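
Before analysis it can help to confirm that a loaded table actually carries these 20 fields. The following is a minimal sketch, assuming the column headers match the field names listed above exactly. That is an assumption: the headers in a given export may differ, and `missing_fields` is a hypothetical helper, not part of the shared code.

```python
# Field names as listed above. Whether a given export uses exactly these
# headers is an assumption, so treat this list as a starting point.
EXPECTED_FIELDS = [
    "领导ID", "地方领导", "所属省份", "用户", "用户地区",
    "留言内容", "留言时间", "留言类别ID", "留言类别", "留言主题",
    "留言类型", "点赞数", "处理状态", "回复机构", "回复内容",
    "回复时间", "满意程度评分", "解决程度评分", "办理态度评分", "办理速度评分",
]

def missing_fields(columns, expected=EXPECTED_FIELDS):
    """Return the expected field names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected if name not in present]

# Hypothetical header row containing only three of the fields
sample_columns = ["领导ID", "留言内容", "留言时间"]
missing = missing_fields(sample_columns)
print(len(EXPECTED_FIELDS), len(missing))
```

With pandas, the same check would be `missing_fields(data.columns)`.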

2. Selected code for LDA topic modeling

import pandas as pd

# Load the sample of message data and preview the first rows
data = pd.read_excel('Case/留言数据.xlsx')
data.head()

import re
import jieba

# Load the stop-word list (one word per line) into a set for fast lookup
with open('data/dict/stoplist.txt', encoding='utf-8') as f:
    stopwords = set(f.read().split('\n'))

def clean_text(text):
    # Remove digits, segment the text with jieba, then drop stop words
    text = re.sub(r'\d+', '', text)
    words = jieba.lcut(text)
    words = [w for w in words if w not in stopwords]
    return ' '.join(words)

data['content'] = data['text'].apply(clean_text)
data.head()
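
The cleaning step does two things besides segmentation: it strips digits and it filters stop words. The sketch below isolates those two operations using only the standard library. The stop-word set and the pre-segmented token list are made up for illustration and stand in for the stop-word file and for jieba's output.

```python
import re

# Hypothetical stop words, for illustration only
demo_stopwords = {"的", "了"}

def strip_digits(text):
    """Remove every run of digits from the text."""
    return re.sub(r"\d+", "", text)

def drop_stopwords(tokens, stopwords):
    """Keep tokens that are not blank and not in the stop-word set."""
    return [t for t in tokens if t.strip() and t not in stopwords]

cleaned = strip_digits("2021年小区停水3天")
# A hand-written token list standing in for jieba.lcut output
tokens = drop_stopwords(["小区", "的", "停水", "了", " "], demo_stopwords)
print(cleaned)
print(tokens)
```

Note that the digit pattern needs the backslash (`\d+`). Without it, the pattern matches runs of the literal letter "d" and leaves the numbers in place.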

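from sklearn.feature_extraction.text import CountVectorizer

# Keep terms that occur in at least 10 documents and in at most 50% of them,
# capped at the 1,000 most frequent terms
vectorizer = CountVectorizer(max_df = 0.5,min_df=10,max_features = 1000)

# Sparse document-term count matrix (documents x terms)
doc_term_matrix = vectorizer.fit_transform(data['content'])
doc_term_matrix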
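
The `max_df` and `min_df` arguments filter the vocabulary by document frequency, that is, by the number of documents in which a term appears. The following is a rough pure-Python illustration of that rule on a toy corpus. It is not scikit-learn's implementation, and `filter_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_vocabulary(docs, max_df=0.5, min_df=2):
    """Keep terms whose document frequency is >= min_df (a count)
    and <= max_df * number of documents (a proportion)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    upper = max_df * n_docs
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= upper)

# Toy corpus of space-joined tokens, mimicking the output of clean_text
toy_docs = ["道路 维修 问题", "道路 积水 问题", "供暖 问题", "供暖 费用 问题"]
vocab = filter_vocabulary(toy_docs, max_df=0.5, min_df=2)
print(vocab)
```

Here "问题" appears in every document and is dropped by `max_df`, while terms that appear only once are dropped by `min_df`.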

# Build the LDA topic model
from sklearn.decomposition import LatentDirichletAllocation
lda_model = LatentDirichletAllocation(n_components = 4,  # number of topics
                                      max_iter = 50,
                                      learning_method = 'batch',
                                      learning_offset = 50,  # only affects online learning
                                      random_state = 20230212)

lda_output = lda_model.fit_transform(doc_term_matrix)

print(lda_model)  # model parameters
print(lda_output) # document-topic distribution
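
The raw output is hard to read on its own. A common next step is to list the highest-weighted words for each topic and the dominant topic of each document. The sketch below shows one way to do that with NumPy. The toy arrays are made up and stand in for `lda_model.components_`, `vectorizer.get_feature_names_out()`, and `lda_output`, and `top_words` is a hypothetical helper.

```python
import numpy as np

def top_words(components, feature_names, n_top=3):
    """For each topic (row), return the n_top terms with the largest weights."""
    result = []
    for weights in np.asarray(components):
        best = np.argsort(weights)[::-1][:n_top]  # indices sorted by descending weight
        result.append([feature_names[i] for i in best])
    return result

# Toy stand-ins for lda_model.components_ and the vectorizer's feature names
toy_components = np.array([[5.0, 0.1, 3.0, 0.2],
                           [0.3, 4.0, 0.1, 6.0]])
toy_terms = ["道路", "供暖", "积水", "费用"]
topics = top_words(toy_components, toy_terms, n_top=2)

# Toy stand-in for lda_output: each row is one document's topic distribution
toy_doc_topics = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
dominant = toy_doc_topics.argmax(axis=1)  # index of the most probable topic per document

print(topics)
print(dominant)
```

On the fitted model, the equivalent call would be `top_words(lda_model.components_, vectorizer.get_feature_names_out())`, assuming a scikit-learn version that provides `get_feature_names_out`.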

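import pyLDAvis
import warnings
warnings.filterwarnings('ignore')

# pyLDAvis 3.4 renamed the pyLDAvis.sklearn module to pyLDAvis.lda_model
try:
    import pyLDAvis.lda_model as sklearn_vis
except ImportError:
    import pyLDAvis.sklearn as sklearn_vis

pic = sklearn_vis.prepare(lda_model,doc_term_matrix,vectorizer)

# Render the interactive topic visualization in a Jupyter notebook
pyLDAvis.display(pic)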