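This hands-on tutorial helps beginners learn instruction fine-tuning with the large language model Qwen2. It walks through the full workflow (environment setup, data preparation, model loading, fine-tuning, training visualization, and inference) and provides complete code. First, make sure your Python version is 3.8 or later and install the required libraries: swanlab, modelscope, transformers, datasets, peft, and accelerate. Next, prepare the Fudan Chinese news dataset, including its training and test sets. Then use modelscope to download the Qwen2-1.5B-Instruct model, set up SwanLab to monitor training, configure the training parameters, and run the pipeline from data preprocessing to model training. Finally, view the training metrics in SwanLab and use the fine-tuned model for inference on a text classification task.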
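Getting Started with Qwen2 Fine-Tuning (Full Code Included): From Zero to Proficiency
Environment Setup
Make sure Python 3.8 or later is installed on your system, then install the libraries needed for model training and fine-tuning.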
pip install -U swanlab modelscope transformers datasets peft accelerate
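As a quick sanity check after installation, the sketch below reports which of the required packages cannot be found in the current environment. The helper name `find_missing_packages` is an assumption made for illustration; it is not part of any of these libraries.

```python
import importlib.util
import sys

REQUIRED_PACKAGES = ["swanlab", "modelscope", "transformers", "datasets", "peft", "accelerate"]


def find_missing_packages(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 8):
        print("Python 3.8 or later is required, found", sys.version.split()[0])
    missing = find_missing_packages(REQUIRED_PACKAGES)
    print("Missing packages:", missing if missing else "none")
```

If the list is not empty, rerunning the pip command above for the listed packages is usually enough.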
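Dataset Preparation
Use the Fudan Chinese news dataset (zh_cls_fudan-news): download its training set (train.jsonl) and test set (test.jsonl) into your project root directory (the same directory as train.py).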
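The training script reads the fields `text`, `category`, and `output` from every line of these files, so it can be worth checking the download before training. The following is a minimal sketch of such a check; the helper name `find_invalid_lines` is hypothetical.

```python
import json

REQUIRED_FIELDS = ("text", "category", "output")


def find_invalid_lines(path, required=REQUIRED_FIELDS):
    """Return 1-based line numbers that are not JSON objects containing all required fields."""
    bad_lines = []
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_number)
                continue
            if not isinstance(record, dict) or any(key not in record for key in required):
                bad_lines.append(line_number)
    return bad_lines
```

Calling `find_invalid_lines("train.jsonl")` and `find_invalid_lines("test.jsonl")` should return empty lists for files that are intact.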
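Model Loading and Training Visualization Setup
SwanLab is used to monitor training and visualize model performance. Make sure SwanLab is installed correctly and can be integrated with your training script.
Full Code
Before training, make sure the directory is laid out as follows: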
- train.py
- train.jsonl
- test.jsonl
In train.py, the following code walks through the whole process, from data preprocessing and model training to evaluating the results.
import json

import pandas as pd
import swanlab
import torch
from datasets import Dataset
from modelscope import snapshot_download
from peft import LoraConfig, TaskType, get_peft_model
from swanlab.integration.huggingface import SwanLabCallback
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForSeq2Seq, Trainer, TrainingArguments

# System prompt shared by training and inference. It stays in Chinese because the dataset is Chinese.
SYSTEM_PROMPT = "你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型"


def dataset_jsonl_transfer(file_path, new_path):
    """Convert the raw dataset into instruction/input/output records."""
    messages = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            data = json.loads(line)
            context = data["text"]
            category = data["category"]
            label = data["output"]
            message = {
                "instruction": SYSTEM_PROMPT,
                "input": f"文本:{context}, 类型选项:{category}",
                "output": label,
            }
            messages.append(message)
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")


def process_func(example, tokenizer):
    """Tokenize one record and mask the prompt so only the answer contributes to the loss."""
    # Use the ChatML layout that apply_chat_template produces for Qwen2 in predict(),
    # so the training prompt matches the inference prompt.
    instruction = tokenizer(
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n{example['input']}<|im_end|>\n"
        f"<|im_start|>assistant\n",
        add_special_tokens=False,
    )
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    # -100 marks prompt positions that the loss should ignore.
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


def predict(messages, model, tokenizer):
    """Generate a reply for a list of chat messages and return only the new text."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
    # Drop the prompt tokens so that only the generated answer is decoded.
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response


def main():
    # Download the base model from ModelScope and load it with Transformers.
    model_dir = snapshot_download("qwen/Qwen2-1.5B-Instruct")
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    # Lets gradients reach the LoRA weights when gradient checkpointing is enabled.
    model.enable_input_require_grads()

    # Convert the raw files into the instruction format.
    train_path = "train.jsonl"
    test_path = "test.jsonl"
    train_jsonl_new_path = "new_train.jsonl"
    test_jsonl_new_path = "new_test.jsonl"
    dataset_jsonl_transfer(train_path, train_jsonl_new_path)
    dataset_jsonl_transfer(test_path, test_jsonl_new_path)

    # Build the tokenized training set.
    train_df = pd.read_json(train_jsonl_new_path, lines=True)
    train_ds = Dataset.from_pandas(train_df)
    train_dataset = train_ds.map(
        process_func,
        fn_kwargs={"tokenizer": tokenizer},
        remove_columns=train_ds.column_names,
    )

    # Attach LoRA adapters to the attention and MLP projections.
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
    )
    model = get_peft_model(model, config)

    args = TrainingArguments(
        output_dir="./output/Qwen2",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        logging_steps=10,
        num_train_epochs=2,
        save_steps=100,
        learning_rate=1e-4,
        save_on_each_node=True,
        gradient_checkpointing=True,
        report_to="none",
    )

    swanlab_callback = SwanLabCallback(
        project="Qwen2-fintune",
        experiment_name="Qwen2-1.5B-Instruct",
        description="Fine-tune Qwen2-1.5B-Instruct on the zh_cls_fudan-news dataset.",
        config={
            "model": "qwen/Qwen2-1.5B-Instruct",
            "dataset": "huangjintao/zh_cls_fudan-news",
        }
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
        callbacks=[swanlab_callback],
    )
    trainer.train()

    # Test the model on the first 10 test records.
    model.eval()  # turn off LoRA dropout for generation
    test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]
    test_text_list = []
    for index, row in test_df.iterrows():
        instruction = row['instruction']
        input_value = row['input']
        messages = [
            {"role": "system", "content": f"{instruction}"},
            {"role": "user", "content": f"{input_value}"}
        ]
        response = predict(messages, model, tokenizer)
        messages.append({"role": "assistant", "content": f"{response}"})
        result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
        test_text_list.append(swanlab.Text(result_text, caption=response))
    # Log through the swanlab module; the callback object has no log/finish methods.
    swanlab.log({"Prediction": test_text_list})
    swanlab.finish()


if __name__ == "__main__":
    main()
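The label construction in process_func is easier to follow with toy numbers. The sketch below repeats the same list arithmetic on made-up token IDs (the IDs and the `build_example` helper are illustrative assumptions, not real tokenizer output). Prompt positions get the label -100, which PyTorch's cross-entropy loss ignores by default, so only the answer tokens and the closing pad token are learned.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_example(prompt_ids, answer_ids, pad_token_id):
    """Concatenate prompt, answer, and a closing pad token; mask the prompt in the labels."""
    input_ids = prompt_ids + answer_ids + [pad_token_id]
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids + [pad_token_id]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Made-up token IDs: three prompt tokens, two answer tokens, pad token 0.
example = build_example(prompt_ids=[11, 12, 13], answer_ids=[21, 22], pad_token_id=0)
print(example["input_ids"])  # [11, 12, 13, 21, 22, 0]
print(example["labels"])     # [-100, -100, -100, 21, 22, 0]
```

All three lists have the same length, which is what the data collator expects when it pads a batch.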
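Visualizing Training Results
After training finishes, open the experiment in SwanLab to view the metrics recorded during training, such as the change in training loss. The script above does not compute accuracy; instead, it logs the model's answers for the first 10 test records as text under "Prediction".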
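To read the loss curve it helps to know roughly how many points to expect. With the settings in train.py, each optimizer step on a single device consumes per_device_train_batch_size × gradient_accumulation_steps = 16 samples, and a loss value is logged every 10 steps. The sketch below works through this arithmetic. The dataset size of 4000 is a placeholder assumption, not the actual size of the dataset, and the result is approximate because the Trainer may treat a final partial batch slightly differently.

```python
import math


def training_schedule(num_samples, per_device_batch_size=4, grad_accum_steps=4,
                      num_epochs=2, logging_steps=10, num_devices=1):
    """Estimate the effective batch size, optimizer steps, and logged loss points."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch_size": effective_batch,
        "total_optimizer_steps": total_steps,
        "logged_points": total_steps // logging_steps,
    }


# 4000 is a placeholder sample count used only for illustration.
schedule = training_schedule(num_samples=4000)
print(schedule)
```

Substituting the real number of training records gives an estimate of how long the x-axis of the loss chart will be.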
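Inference
The fine-tuned model can now be used for inference on the text classification task. The following code shows how to make a prediction with it: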
import torch


def predict(messages, model, tokenizer):
    """Generate a reply for a list of chat messages and return only the new text."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
    # Drop the prompt tokens so that only the generated answer is decoded.
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response
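The list comprehension in predict is there because, for a decoder-only model like Qwen2, generate returns each sequence as the prompt tokens followed by the newly generated tokens. Slicing by the prompt length keeps only the new part. A minimal sketch of that slicing with plain lists (the token IDs and the `strip_prompt` helper are made up for illustration):

```python
def strip_prompt(prompt_batch, output_batch):
    """Drop the leading prompt tokens from each generated sequence."""
    return [output[len(prompt):] for prompt, output in zip(prompt_batch, output_batch)]


# One prompt of three tokens; the model appended three new tokens.
prompts = [[101, 102, 103]]
outputs = [[101, 102, 103, 7, 8, 9]]
print(strip_prompt(prompts, outputs))  # [[7, 8, 9]]
```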
Example
Predict the category of a text with the fine-tuned Qwen2 model:
# `model` and `tokenizer` are the fine-tuned model and its tokenizer from train.py.
# Note: the training inputs also end with ", 类型选项:{category}" (the candidate categories).
test_texts = {
    'instruction': "你是一个文本分类领域的专家,你会接收到一段文本和几个潜在的分类选项,请输出文本内容的正确类型",
    'input': "文本:航空动力学报JOURNAL OF AEROSPACE POWER1998年 第4期 No.4 1998科技期刊管路系统敷设的并行工程模型研究*陈志英* * 马 枚北京航空航天大学【摘要】 提出了一种应用于并行工程模型转换研究的标号法,该法是将现行串行设计过程(As-is)转换为并行设计过程(To-be)。本文应用该法将发动机外部管路系统敷设过程模型进行了串并行转换,应用并行工程过程重构的手段,得到了管路敷设并行过程模型。"
}
instruction = test_texts['instruction']
input_value = test_texts['input']
messages = [
    {"role": "system", "content": f"{instruction}"},
    {"role": "user", "content": f"{input_value}"}
]
response = predict(messages, model, tokenizer)
print(response)
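A single prediction says little about overall quality. One simple way to score the model over several test records is exact-match accuracy between the predicted and the reference category. The sketch below assumes the model outputs only the category name; the `exact_match_accuracy` helper and the sample labels are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical model outputs and gold labels, for illustration only.
predicted = ["Space", "Sports ", "Economy"]
expected = ["Space", "Sports", "Politics"]
print(exact_match_accuracy(predicted, expected))
```

In practice the references could come from the output field of new_test.jsonl and the predictions from calling predict on each record.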
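With these steps you have gone through the full workflow of instruction fine-tuning Qwen2, starting from scratch. We hope the code and examples in this article help you go further with large-model fine-tuning.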