Qwen2 Large Model Fine-Tuning in Practice (with Complete Code): A Beginner's Guide from Zero to Proficiency

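Overview

This hands-on tutorial helps readers with no prior experience learn instruction fine-tuning with the Qwen2 large language model. It walks through the whole workflow: environment setup, data preparation, model loading, fine-tuning, training visualization, and inference, with complete code. First, make sure your Python version is newer than 3.8 and install the required libraries: swanlab, modelscope, transformers, datasets, peft, and accelerate. Next, prepare the Fudan Chinese news dataset, which comes as a training set and a test set. Download the Qwen2-1.5B-Instruct model with modelscope, use SwanLab to monitor training, configure the training parameters, and run the pipeline from data preprocessing to model training. Finally, view the training metrics in SwanLab and use the fine-tuned model for inference on a text classification task.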

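Environment Setup

Make sure a Python version newer than 3.8 is installed, then install the libraries needed for training and fine-tuning. PyTorch is also required by transformers and peft; if it is not already present, install it for your platform.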

pip install -U swanlab modelscope transformers datasets peft accelerate
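
After installing, it can help to confirm that every package is visible to the interpreter you will train with. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and treating 3.8 as the minimum Python version is an assumption based on the requirement stated above.

```python
import sys
from importlib import metadata

# Packages named in the pip command above.
REQUIRED = ["swanlab", "modelscope", "transformers", "datasets", "peft", "accelerate"]


def check_environment(packages=REQUIRED, min_python=(3, 8)):
    """Return (python_ok, versions), where versions maps each package to its
    installed version string, or None if the package is not installed."""
    python_ok = sys.version_info[:2] >= min_python
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return python_ok, versions


if __name__ == "__main__":
    python_ok, versions = check_environment()
    print("Python OK:", python_ok)
    for name, version in versions.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED should be installed before running the training script.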

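Dataset Preparation

This tutorial uses the Fudan Chinese news dataset (zh_cls_fudan_news). Download its training set (train.jsonl) and test set (test.jsonl) into the project root directory, next to the training script.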
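
The conversion function in the training script reads three fields from every line of these files: `text`, `category`, and `output`. As a rough sanity check before training, the sketch below scans a JSONL file and reports lines that fail to parse or lack one of those fields. The helper name `find_bad_lines` is hypothetical.

```python
import json

# Field names read by the article's dataset conversion function.
REQUIRED_FIELDS = ("text", "category", "output")


def find_bad_lines(path, required=REQUIRED_FIELDS):
    """Return the 1-based numbers of lines that are not JSON objects
    or that lack one of the required fields. Blank lines are ignored."""
    bad = []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if not isinstance(record, dict) or any(k not in record for k in required):
                bad.append(lineno)
    return bad
```

Calling `find_bad_lines("train.jsonl")` on a clean file should return an empty list.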

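Model Loading and Training Visualization Setup

The model is downloaded with modelscope and loaded through transformers. SwanLab monitors the training run and visualizes model performance. Make sure SwanLab is installed correctly and integrated with your training script; in this tutorial that is done by passing a SwanLabCallback to the Trainer.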

Complete Code

Before training, make sure the directory looks like this:

- train.py
- train.jsonl
- test.jsonl
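
A small check of this layout can catch a misplaced file before a long download or training run starts. This is only a sketch; `missing_files` is a hypothetical helper, not part of the tutorial's script.

```python
from pathlib import Path

# Files the tutorial expects side by side in the working directory.
EXPECTED_FILES = ("train.py", "train.jsonl", "test.jsonl")


def missing_files(root=".", expected=EXPECTED_FILES):
    """Return the expected file names that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    print("All files present" if not missing else f"Missing: {missing}")
```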

The code below goes in train.py. It covers the whole pipeline: data preprocessing, model training, and a check of the results on test samples.

import json

import pandas as pd
import swanlab
from datasets import Dataset
from modelscope import snapshot_download
from peft import LoraConfig, TaskType, get_peft_model
from swanlab.integration.huggingface import SwanLabCallback
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

# System prompt shared by the dataset conversion and the tokenization step.
SYSTEM_PROMPT = "你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型"


def dataset_jsonl_transfer(file_path, new_path):
    """Convert the raw dataset into instruction/input/output records."""
    messages = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            data = json.loads(line)
            messages.append({
                "instruction": SYSTEM_PROMPT,
                "input": f"文本:{data['text']}, 类型选项:{data['category']}",
                "output": data["output"],
            })
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")


def process_func(example, tokenizer):
    """Tokenize one record in Qwen2's ChatML layout, the same layout that
    apply_chat_template produces at inference time."""
    instruction = tokenizer(
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n{example['input']}<|im_end|>\n"
        "<|im_start|>assistant\n",
        add_special_tokens=False,
    )
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    # The pad token is appended as the end-of-answer marker.
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    # -100 hides the prompt tokens from the loss; only the answer is learned.
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


def predict(messages, model, tokenizer):
    """Generate a reply for a list of chat messages."""
    device = next(model.parameters()).device  # run on whatever device the model is on
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    # Passing the whole encoding also supplies the attention mask.
    generated_ids = model.generate(**model_inputs, max_new_tokens=512)
    # generate() returns prompt + new tokens; keep only the new tokens.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]


def main():
    # Download the model from ModelScope and load it with transformers.
    model_dir = snapshot_download("qwen/Qwen2-1.5B-Instruct")
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    # Needed when gradient checkpointing is combined with LoRA adapters.
    model.enable_input_require_grads()

    train_path = "train.jsonl"
    test_path = "test.jsonl"
    train_jsonl_new_path = "new_train.jsonl"
    test_jsonl_new_path = "new_test.jsonl"

    dataset_jsonl_transfer(train_path, train_jsonl_new_path)
    dataset_jsonl_transfer(test_path, test_jsonl_new_path)

    train_df = pd.read_json(train_jsonl_new_path, lines=True)
    train_ds = Dataset.from_pandas(train_df)
    train_dataset = train_ds.map(
        process_func,
        fn_kwargs={"tokenizer": tokenizer},
        remove_columns=train_ds.column_names,
    )

    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
    )
    model = get_peft_model(model, config)

    args = TrainingArguments(
        output_dir="./output/Qwen2",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        logging_steps=10,
        num_train_epochs=2,
        save_steps=100,
        learning_rate=1e-4,
        save_on_each_node=True,
        gradient_checkpointing=True,
        report_to="none",
    )

    swanlab_callback = SwanLabCallback(
        project="Qwen2-finetune",
        experiment_name="Qwen2-1.5B-Instruct",
        description="Fine-tune the Qwen2-1.5B-Instruct model on the zh_cls_fudan-news dataset.",
        config={
            "model": "qwen/Qwen2-1.5B-Instruct",
            "dataset": "huangjintao/zh_cls_fudan-news",
        },
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
        callbacks=[swanlab_callback],
    )
    trainer.train()

    # Test the model on the first 10 test samples.
    model.eval()  # turn off LoRA dropout for generation
    test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]
    test_text_list = []
    for _, row in test_df.iterrows():
        messages = [
            {"role": "system", "content": f"{row['instruction']}"},
            {"role": "user", "content": f"{row['input']}"},
        ]
        response = predict(messages, model, tokenizer)
        messages.append({"role": "assistant", "content": f"{response}"})
        result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
        test_text_list.append(swanlab.Text(result_text, caption=response))

    # Log the predictions to the SwanLab run, then close it.
    swanlab.log({"Prediction": test_text_list})
    swanlab.finish()


if __name__ == "__main__":
    main()
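
The least obvious step in the script is how `process_func` builds the `labels` list. The sketch below reproduces that logic with made-up token ids and no tokenizer, so the masking is easy to see. The helper `build_example` and all ids are hypothetical. The value -100 is the index that PyTorch's cross-entropy loss ignores by default, so prompt positions contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def build_example(prompt_ids, response_ids, end_id):
    """Concatenate prompt, response and an end marker, and mask the
    prompt positions in the labels so only the answer is learned."""
    input_ids = prompt_ids + response_ids + [end_id]
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [end_id]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Made-up token ids, for illustration only.
example = build_example(prompt_ids=[11, 12, 13, 14], response_ids=[21, 22], end_id=0)
print(example["input_ids"])  # [11, 12, 13, 14, 21, 22, 0]
print(example["labels"])     # [-100, -100, -100, -100, 21, 22, 0]
```

The three lists always have the same length, which is what the data collator expects when it pads a batch.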

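Training Result Visualization

After training finishes, SwanLab shows the metrics logged during the run, such as the training loss and the learning rate. The script above does not define a compute_metrics function, so an accuracy curve appears only if you add one.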
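
To read the curves, it helps to know roughly how many optimizer steps the run contains. The sketch below does that arithmetic from the values in the script's TrainingArguments (batch size 4, gradient accumulation 4, 2 epochs, logging every 10 steps, saving every 100 steps). The sample count of 4000 and the single GPU are assumptions for illustration, and the exact step count can differ slightly depending on how the last partial batch is handled.

```python
import math


def training_schedule(num_samples, per_device_batch_size=4, grad_accum_steps=4,
                      num_devices=1, epochs=2, logging_steps=10, save_steps=100):
    """Estimate optimizer steps, logged points and checkpoints for a run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "logged_points": total_steps // logging_steps,
        "checkpoints": total_steps // save_steps,
    }


# 4000 training samples is an assumed figure, not taken from the article.
print(training_schedule(num_samples=4000))
# {'effective_batch': 16, 'steps_per_epoch': 250, 'total_steps': 500, 'logged_points': 50, 'checkpoints': 5}
```

Under these assumptions each point on the loss curve summarizes 10 optimizer steps, that is, 160 training samples.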

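Inference

The fine-tuned model can now be used for inference on the text classification task. The following function shows how to generate a prediction with it: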

def predict(messages, model, tokenizer):
    """Generate a reply for a list of chat messages."""
    device = next(model.parameters()).device  # run on whatever device the model is on
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    # Passing the whole encoding also supplies the attention mask.
    generated_ids = model.generate(**model_inputs, max_new_tokens=512)
    # generate() returns prompt + new tokens; keep only the new tokens.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
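
The list comprehension near the end of `predict` exists because, for a decoder-only model, `generate` returns each prompt followed by the newly generated tokens. The sketch below shows the same slicing on plain Python lists with made-up token ids. `strip_prompt` is a hypothetical name used only here.

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Drop the echoed prompt from each generated sequence,
    keeping only the tokens that come after it."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


# Made-up token ids: the output starts with a copy of the prompt.
prompts = [[101, 7, 8, 9]]
outputs = [[101, 7, 8, 9, 42, 43, 2]]
print(strip_prompt(prompts, outputs))  # [[42, 43, 2]]
```

Without this step the decoded response would begin with the system and user messages.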

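Example

Predict the category of a text with the fine-tuned Qwen2 model (model and tokenizer are assumed to be loaded already):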

test_texts = {
    'instruction': "你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型",
    'input': "文本:航空动力学报JOURNAL OF AEROSPACE POWER1998年 第4期 No.4 1998科技期刊管路系统敷设的并行工程模型研究*陈志英*　*　马　枚北京航空航天大学【摘要】　提出了一种应用于并行工程模型转换研究的标号法，该法是将现行串行设计过程(As-is)转换为并行设计过程(To-be)。本文应用该法将发动机外部管路系统敷设过程模型进行了串并行转换，应用并行工程过程重构的手段，得到了管路敷设并行过程模型。"
}
instruction = test_texts['instruction']
input_value = test_texts['input']
messages = [
    {"role": "system", "content": f"{instruction}"},
    {"role": "user", "content": f"{input_value}"}
]
response = predict(messages, model, tokenizer)
print(response)
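
A single prediction says little about overall quality. One simple way to quantify it is exact-match accuracy over a set of test samples, comparing each response with the `output` field of the corresponding record. The sketch below is an assumption about how one might score the model, not part of the original script, and the category names in the usage example are invented.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their label exactly,
    ignoring leading and trailing whitespace."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p.strip() == l.strip() for p, l in zip(predictions, labels))
    return correct / len(labels)


# Invented predictions and labels, for illustration only.
preds = ["Space", " Sports\n", "Art"]
golds = ["Space", "Sports", "History"]
print(f"accuracy = {accuracy(preds, golds):.2f}")  # accuracy = 0.67
```

In practice `preds` would be collected by calling `predict` on each test record.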

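With these steps you have gone through the whole process of instruction fine-tuning the Qwen2 large model, starting from no prior experience. Hopefully the code and the worked example help you go further with fine-tuning large models.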

Author: 繁星淼淼
