如何用Python构建AI问答系统?2026年BERT模型实战指南
This article provides a step-by-step guide to building a simple AI question-answering system using Python, web scraping, and Hugging Face's BERT model. It covers data collection from Baidu Baike, natural language processing, model training with neural networks, and system implementation with practical code examples.
本文提供了使用Python、网络爬虫和Hugging Face的BERT模型构建简易AI问答系统的分步指南。内容涵盖从百度百科收集数据、自然语言处理、神经网络模型训练以及带有实用代码示例的系统实现。
Building a Simple AI Q&A System from Scratch: A Practice by an Interdisciplinary Learner
引言与动机
Introduction and Motivation
当前,以ChatGPT为代表的大型语言模型正引发广泛关注,它们为日常工作带来了显著的效率提升。作为一名财务管理专业的大四学生,我长期以来都怀揣着构建一个属于自己的AI系统的梦想,哪怕它最初非常简单。尽管我的专业学习路径与计算机科学相去甚远,但一次偶然的机会让我接触到了财务数据分析,并由此开始了Python编程的自学之旅。如今,临近毕业,面对未来的不确定性,我决定静下心来,利用所学知识尝试实现这个“小愿望”,于是便有了这个项目。本文旨在分享这一实践过程,技术实现上或有不足,还请各位专家海涵。
Currently, large language models represented by ChatGPT are attracting widespread attention, bringing significant efficiency improvements to daily work. As a senior student majoring in Financial Management, I have long harbored the dream of building my own AI system, even if it starts out very simple. Although my academic path is far from computer science, a chance encounter with financial data analysis led me to start self-learning Python programming. Now, approaching graduation and facing future uncertainties, I decided to settle down and attempt to realize this "small wish" using the knowledge I've acquired, thus giving birth to this project. This article aims to share this practical process. There may be technical shortcomings, and I welcome the understanding of experts.
核心设计思路
Core Design Concept
我的目标是构建一个能够自主学习会计知识并进行问答的程序。由于时间有限,无法手动整理大量资料,因此我选择让程序直接从百度百科获取数据。项目的基本实现思路如下:
My goal is to build a program capable of autonomously learning accounting knowledge and conducting Q&A. Due to time constraints and the inability to manually organize large amounts of data, I chose to let the program fetch data directly from Baidu Baike. The basic implementation concept of the project is as follows:
- 数据爬取:使用Python爬虫框架(如Scrapy)或解析库(如BeautifulSoup)爬取百度百科相关词条的网页内容。
Data Scraping: Use a Python scraping framework (such as Scrapy) or parsing library (such as BeautifulSoup) to crawl the webpage content of relevant entries on Baidu Baike.
- 数据处理:对爬取的网页内容进行自然语言处理和数据清洗,提取有效信息,并存储到数据库或本地文件中。
Data Processing: Perform natural language processing and data cleaning on the scraped webpage content, extract useful information, and store it in a database or local files.
- 模型训练:使用机器学习算法(如神经网络、随机森林)对处理后的数据进行训练,构建问答匹配模型。
Model Training: Use machine learning algorithms (such as neural networks, random forests) to train the processed data and build a Q&A matching model.
- 系统开发:开发一个交互式问答系统,接收用户输入的问题,利用训练好的模型找到最匹配的答案并返回。
System Development: Develop an interactive Q&A system that receives user-input questions, utilizes the trained model to find the best matching answer, and returns it.
当然,一个完整的系统还需要考虑诸多细节,例如反爬虫策略、未匹配问题的处理等。但作为一个简易的AI原型,本项目将聚焦于核心流程的实现。
Of course, a complete system requires consideration of many details, such as anti-scraping strategies and handling of unmatched questions. However, as a simple AI prototype, this project will focus on the implementation of the core pipeline.
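As a concrete illustration of the data-processing step above, here is a minimal sketch of cleaning a downloaded page using only Python's standard library (the sample HTML string is made up for demonstration; a real crawler would of course fetch and parse much larger pages):

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_page(html: str) -> str:
    """Strip tags, citation markers like [1], and extra whitespace."""
    extractor = TextExtractor()
    extractor.feed(html)
    text = "".join(extractor.parts)
    text = re.sub(r"\[\d+\]", "", text)       # drop citation markers like [1]
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text

sample = "<div><p>会计是一门语言。[1]</p>  <p>它记录经济活动。</p></div>"
print(clean_page(sample))  # → 会计是一门语言。 它记录经济活动。
```

In the actual project BeautifulSoup plays this role; the sketch only shows the shape of the cleaning logic.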
技术实现与演进
Technical Implementation and Evolution
3.1 基础版本:基于Word2Vec与SVM
3.1 Basic Version: Based on Word2Vec and SVM
首先,根据上述思路,我编写了一个基础版本的代码。该版本使用jieba进行中文分词,利用gensim的Word2Vec模型将文本转换为向量,然后使用支持向量机(SVM)作为分类器来匹配问题与答案。
First, following the above concept, I wrote a basic version of the code. This version uses jieba for Chinese word segmentation, utilizes the Word2Vec model from gensim to convert text into vectors, and then uses a Support Vector Machine (SVM) as the classifier to match questions with answers.
import requests
from bs4 import BeautifulSoup
import jieba
from gensim.models.word2vec import Word2Vec
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics.pairwise import cosine_similarity

# ... (Data fetching function and Word2VecVectorizer class)

class QuestionAnswerSystem:
    def __init__(self):
        self.vectorizer = None
        self.clf = None
        self.question_vectors = None
        self.answer_vectors = None
        self.df = None

    def fit(self, questions, answers):
        # Keep the Q&A pairs so ask() can look answers up later.
        self.df = pd.DataFrame({'question': questions, 'answer': answers})
        self.vectorizer = Word2VecVectorizer()
        self.vectorizer.fit(questions)
        self.question_vectors = self.vectorizer.transform(questions)
        self.answer_vectors = self.vectorizer.transform(answers)
        # Each stored question is its own class: the SVM learns to map
        # a question vector back to its index.
        self.clf = SVC(kernel='linear', probability=True)
        self.clf.fit(self.question_vectors, range(len(questions)))

    def ask(self, question, n=5):
        tokens = list(jieba.cut(question))
        question_vec = self.vectorizer.transform([tokens])[0]
        probas = self.clf.predict_proba([question_vec])[0]
        # Re-rank the SVM's top-n candidates by cosine similarity.
        top_n = np.argsort(probas)[-n:]
        sims = cosine_similarity([question_vec], self.question_vectors[top_n])[0]
        best_match = np.argmax(sims)
        return self.df.iloc[top_n[best_match]]['answer']
这个版本能够回答一些预设的基础问题,但其语义理解能力受限于浅层的Word2Vec向量和简单的SVM模型。
This version can answer some preset basic questions, but its semantic understanding capability is limited by the shallow Word2Vec vectors and the simple SVM model.
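The matching step at the heart of both versions is a nearest-neighbor lookup by cosine similarity. The following self-contained sketch shows that logic with toy three-dimensional vectors standing in for real averaged Word2Vec embeddings (all numbers and the example questions in the comments are illustrative):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "sentence vectors": in the real system these would be averaged
# Word2Vec embeddings of the tokenized stored questions.
stored_questions = np.array([
    [1.0, 0.0, 0.0],   # e.g. "什么是资产?"
    [0.0, 1.0, 0.0],   # e.g. "什么是负债?"
    [0.7, 0.7, 0.0],   # e.g. "资产和负债的区别?"
])
query = np.array([0.9, 0.1, 0.0])

sims = [cosine_sim(query, q) for q in stored_questions]
best = int(np.argmax(sims))
print(best)  # → 0 (the query is closest to the first stored question)
```

The SVM in the basic version only pre-selects candidates; this cosine re-ranking decides the final answer.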
3.2 进阶版本:集成BERT预训练模型
3.2 Advanced Version: Integrating BERT Pre-trained Model
为了提升效果,我引入了更强大的预训练语言模型BERT。使用Hugging Face的transformers库加载中文BERT模型,对问题和答案进行深度语义编码,然后通过一个简单的全连接神经网络进行分类。
To improve performance, I introduced the more powerful pre-trained language model BERT. Using the transformers library from Hugging Face to load a Chinese BERT model, deep semantic encoding is performed on questions and answers, followed by classification through a simple fully connected neural network.
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel
from torch.utils.data import TensorDataset, DataLoader
from sklearn.metrics.pairwise import cosine_similarity

class BertEncoder:
    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
        self.model = BertModel.from_pretrained('bert-base-chinese')
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.model.eval()

    def encode(self, sentence):
        # Use the [CLS] token's hidden state as the sentence vector, shape (1, 768).
        input_ids = torch.tensor(
            [self.tokenizer.encode(sentence, add_special_tokens=True)],
            dtype=torch.long).to(self.device)
        with torch.no_grad():
            outputs = self.model(input_ids)
        return outputs[0][:, 0, :].cpu().numpy()

class QuestionAnswerSystem:
    def __init__(self):
        self.encoder = None
        self.clf = None
        self.question_vectors = None
        self.answer_vectors = None
        self.df = None

    def fit(self, questions, answers):
        self.df = pd.DataFrame({'question': questions, 'answer': answers})
        self.encoder = BertEncoder()
        # encode() handles one sentence at a time, so stack the results.
        self.question_vectors = np.vstack([self.encoder.encode(q) for q in questions])
        self.answer_vectors = np.vstack([self.encoder.encode(a) for a in answers])
        # The classifier outputs raw logits: CrossEntropyLoss applies the
        # softmax internally, so no Softmax layer belongs here.
        self.clf = torch.nn.Sequential(
            torch.nn.Linear(768, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, len(questions)))
        self.clf.to(self.encoder.device)
        # ... (Training loop with DataLoader, CrossEntropyLoss, and Adam optimizer)

    def ask(self, question, n=5):
        question_vec = self.encoder.encode(question)  # shape (1, 768)
        inputs = torch.tensor(question_vec, dtype=torch.float32).to(self.encoder.device)
        outputs = self.clf(inputs)
        probas = F.softmax(outputs, dim=1).detach().cpu().numpy()[0]
        top_n = np.argsort(probas)[-n:]
        sims = cosine_similarity(question_vec, self.question_vectors[top_n])[0]
        best_match = np.argmax(sims)
        return self.df.iloc[top_n[best_match]]['answer']
BERT模型显著提升了语义表示的准确性,使得问答系统能够更精准地理解用户意图。然而,BERT模型的计算复杂度较高,训练和推理速度相对较慢。
The BERT model significantly improves the accuracy of semantic representation, enabling the Q&A system to understand user intent more precisely. However, the computational complexity of the BERT model is high, resulting in relatively slower training and inference speeds.
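One simple way to soften the inference cost mentioned above is to cache the encodings of questions that recur, so each distinct sentence pays for the BERT forward pass only once. A minimal sketch of that idea (the lambda below is a cheap stand-in for `BertEncoder.encode`, not the real model):

```python
class CachedEncoder:
    """Wraps an expensive sentence encoder with a dict-based cache."""
    def __init__(self, encode_fn):
        self.encode_fn = encode_fn
        self.cache = {}
        self.calls = 0  # counts actual (uncached) encoder invocations

    def encode(self, sentence: str):
        if sentence not in self.cache:
            self.calls += 1
            self.cache[sentence] = self.encode_fn(sentence)
        return self.cache[sentence]

# Stand-in for a slow BERT forward pass: here just the sentence length.
enc = CachedEncoder(lambda s: [float(len(s))])
enc.encode("什么是资产?")
enc.encode("什么是资产?")  # served from the cache; no second "model" call
print(enc.calls)  # → 1
```

In the real system the stored question vectors are already precomputed in fit(); a cache like this would mainly help with repeated user queries.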
3.3 功能扩展:结果持久化
3.3 Feature Extension: Result Persistence
为了方便保存和查看训练结果,我为QuestionAnswerSystem类添加了save_result_to_excel方法,利用pandas库将问答对保存到Excel文件中。
To facilitate saving and reviewing training results, I added a save_result_to_excel method to the QuestionAnswerSystem class, utilizing the pandas library to save Q&A pairs to an Excel file.
def save_result_to_excel(self, filename='qa_result.xlsx'):
    df_result = self.df.copy()
    df_result.columns = ['Question', 'Answer']
    df_result.to_excel(filename, index=False)

# 在主程序中调用
# qas.save_result_to_excel('qa_result.xlsx')
3.4 持续优化:交互式数据标注与评估
3.4 Continuous Optimization: Interactive Data Annotation and Evaluation
一个能够自我完善的系统需要反馈机制。我对ask方法进行了扩展,增加了is_correct参数,允许用户在交互过程中对系统给出的答案进行正确性标注。同时,新增了check_accuracy方法,用于基于已标注数据计算系统的准确率。
A system capable of self-improvement requires a feedback mechanism. I extended the ask method by adding an is_correct parameter, allowing users to annotate the correctness of the system's answers during interaction. Additionally, a check_accuracy method was added to calculate the system's accuracy based on the annotated data.
def ask(self, question, n=5, is_correct=None):
    # ... (原有代码获取答案)
    if is_correct is not None:
        # 将问题、答案、标注存储到df中
        new_row = pd.DataFrame({'question': [question], 'answer': [answer],
                                'is_correct': [is_correct]})
        self.df = pd.concat([self.df, new_row], ignore_index=True)
    return answer

def check_accuracy(self):
    if 'is_correct' not in self.df.columns:
        print("暂无标注数据。")
        return
    labeled = self.df[self.df['is_correct'].notna()]
    correct_count = labeled['is_correct'].sum()
    total_count = len(labeled)
    accuracy = correct_count / total_count if total_count > 0 else 0
    print(f"基于{total_count}条标注数据,系统准确率为: {accuracy:.2%}")
这一改进使得系统能够在实际使用中收集反馈数据,为后续的模型迭代和优化提供了可能。
This improvement enables the system to collect feedback data during actual use, providing possibilities for subsequent model iteration and optimization.
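The accuracy computation in check_accuracy reduces to simple counting over the labeled rows. A pandas-free sketch of the same logic (the sample annotation list is made up; None marks an unlabeled answer, mirroring the NaN rows in the DataFrame):

```python
def accuracy(annotations):
    """annotations: list of True/False/None; None means unlabeled."""
    labeled = [a for a in annotations if a is not None]
    if not labeled:
        return 0.0
    # True counts as 1 and False as 0, so sum() is the number correct.
    return sum(labeled) / len(labeled)

# Three labeled answers (two correct), one unlabeled.
print(f"{accuracy([True, False, True, None]):.2%}")  # → 66.67%
```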
总结与展望
Summary and Outlook
本项目展示了一个非计算机专业的学生如何利用现有工具和库,从零开始构建一个简易AI问答系统的完整过程。从基于Word2Vec和SVM的基础方案,到集成BERT预训练模型的进阶方案,再到加入数据持久化和交互式评估的功能扩展,每一步都是学习与实践的结合。
This project demonstrates the complete process of how a student from a non-computer science background can utilize existing tools and libraries to build a simple AI Q&A system from scratch. From the basic solution based on Word2Vec and SVM, to the advanced solution integrating the BERT pre-trained model, and further to feature extensions like data persistence and interactive evaluation, each step represents a combination of learning and practice.
当然,这只是一个原型系统,存在许多可以改进的地方,例如:
Of course, this is only a prototype system with many areas for improvement, such as:
- 知识库构建:实现自动化、持续化的网络数据爬取与更新。
Knowledge Base Construction: Implement automated, continuous web data scraping and updating.
- 模型优化:尝试更先进的句子编码模型(如Sentence-BERT)、更复杂的匹配网络,或直接使用检索增强生成(RAG)架构。
Model Optimization: Experiment with more advanced sentence encoding models (e.g., Sentence-BERT), more complex matching networks, or directly adopt a Retrieval-Augmented Generation (RAG) architecture.
- 工程化:将系统封装为API服务,设计更友好的用户界面,并优化其处理大规模请求的性能。
Engineering: Package the system as an API service, design a more user-friendly interface, and optimize its performance for handling large-scale requests.
通过这个项目,我深刻体会到,跨领域学习虽然充满挑战,但将不同领域的知识结合应用,往往能催生出有价值的实践与创新。希望我的这次实践能对有着类似想法的学习者有所启发。
Through this project, I deeply realized that although interdisciplinary learning is full of challenges, applying knowledge from different fields together can often lead to valuable practice and innovation. I hope my practice can inspire learners with similar ideas.