开源 LLM 生态

当前开源大语言模型（LLM）生态繁荣，本文介绍主流模型和使用方式。

模型家族

Meta LLaMA 系列

目前最流行的开源模型家族。

模型	参数规模	特点	许可
LLaMA 1 (2023)	7B, 13B, 33B, 65B	首个开源大模型	研究许可
LLaMA 2 (2023)	7B, 13B, 70B	商用许可（需申请）	免费商用（受限）
LLaMA 3 (2024)	8B, 70B, 405B（即将）	性能接近 GPT-4	免费商用

使用：

python

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

Mistral AI

欧洲开源模型，性价比高。

Mistral 7B：7B 参数性能≈13B
Mixtral 8x7B：MoE 架构，45B 激活参数，性能≈GPT-3.5
Mistral Large：闭源商用

优势：Apache 2.0 许可，无使用限制。

python

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

Google Gemma

Google 的开源轻量模型：

Gemma 2B/7B：轻量级，适合边缘设备
使用轻量级推理框架：Keras, JAX

python

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

国产开源模型

通义千问 (Qwen)

阿里大模型，免费商用：

Qwen 1.5：7B, 14B, 72B
Qwen2：性能 SOTA，支持 128K 上下文
Qwen-VL：多模态版本

python

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

ChatGLM

智谱 AI 的 bilingual 模型（中英文）：

ChatGLM3-6B：6B 参数，量化后仅 4GB
支持多轮对话、工具调用

python

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()

百川 (Baichuan)

Baichuan2-13B：中英文双语
Apache 2.0 许可

阶跃星辰 (Step)

Step：多模态能力（图文理解）
国内可用，API 稳定

其他优秀模型

模型	机构	特点
Falcon	阿联酋 TII	Apache 2.0，180B 参数
MPT	MosaicML	商用许可，长上下文
XGen	Salesforce	8B/7B，免费商用
BLOOM	BigScience	多语言，176B
stablelm	Stability AI	商用许可

模型格式

Hugging Face Transformers

标准格式，使用 from_pretrained 加载：

model/
├── config.json
├── pytorch_model.bin (或 model.safetensors)
├── tokenizer.json / tokenizer_config.json
└── ...

GGUF（llama.cpp 格式）

CPU 推理优化格式：

bash

# 转换
python convert.py model.safetensors --outtype q4_0

# 使用
llama.cpp/main -m model-q4_0.gguf -p "Hello"

GPTQ / AWQ

量化格式，GPU 推理加速：

python

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model",
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=GPTQConfig(bits=4, group_size=128)
)

模型 Zoo（资源）

综合平台

Hugging Face Hub：https://huggingface.co/models
CivitAI：https://civitai.com（社区模型）
ModelScope：https://modelscope.cn（阿里巴巴，国内）

按任务分类

任务	推荐模型
通用对话	Llama-3-chat, Qwen1.5, ChatGLM3
代码生成	CodeLlama, StarCoder2, CodeQwen
数学推理	MetaMath, WizardMath
中文优化	Qwen, ChatGLM, Baichuan
长文档	Yi-34B-200K, LongChat
指令遵循	Zephyr, NeuralChat

推理框架

入门级

Transformers：HuggingFace 官方库，最简单
llama.cpp：CPU 推理，GGUF 格式
Ollama：一键部署工具

bash

# Ollama 运行 LLaMA2
ollama run llama2

生产级

vLLM：高吞吐，PagedAttention
TGI（Text Generation Inference）：HuggingFace 官方，支持量化
TensorRT-LLM：NVIDIA 最优性能
OpenVINO：Intel CPU 优化

bash

# vLLM OpenAI 兼容 API
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf

# 调用
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "prompt": "Hello"}'

微调方法

全量微调（Full Fine-tuning）

python

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)
trainer.train()

缺点：需要大显存（7B 模型需 14-16GB）

LoRA（Low-Rank Adaptation）

冻结原始权重，只训练低秩矩阵：

python

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,  # rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # 通常 0.1-1% 参数可训练

优点：

显存需求降低 3 倍
训练快 2-3 倍
可合并权重，无推理开销

QLoRA

4-bit 量化 + LoRA：

python

from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    quantization_config=quant_config,
    device_map="auto"
)
model = get_peft_model(model, lora_config)

效果：7B 模型仅需 5GB 显存即可微调！

DPO（Direct Preference Optimization）

无需 RLHF，直接通过偏好数据训练：

python

from trl import DPOTrainer

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # 参考模型（通常为原始模型）
    args=training_args,
    train_dataset=dpo_dataset
)

优点：比 PPO 简单，训练稳定。

评估基准

MMLU：多学科选择题（57 个学科）
GSM8K：小学数学题
HumanEval：代码生成（通过率）
MATH：竞赛数学题
Hellaswag：常识推理
TruthfulQA：真实性评估

参考榜单：https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

未来趋势

MoE（Mixture of Experts）：稀疏激活，性能提升显著（Mixtral）
多模态统一：文本+图像+音频统一模型（LLaVA、CogVLM）
长上下文：128K、1M token 成为标配
推理优化：KV Cache 量化、FlashAttention 2
冷启动加速：Medusa、Lookahead Decoding

开源 LLM 生态日新月异，保持关注最新模型和技术的发布！

开源 LLM 生态 ​

模型家族 ​

Meta LLaMA 系列 ​

Mistral AI ​

Google Gemma ​

国产开源模型 ​

通义千问 (Qwen) ​

ChatGLM ​

百川 (Baichuan) ​

阶跃星辰 (Step) ​

其他优秀模型 ​

模型格式 ​

Hugging Face Transformers ​

GGUF（llama.cpp 格式） ​

GPTQ / AWQ ​

模型 Zoo（资源） ​

综合平台 ​

按任务分类 ​

推理框架 ​

入门级 ​

生产级 ​

微调方法 ​

全量微调（Full Fine-tuning） ​

LoRA（Low-Rank Adaptation） ​

QLoRA ​

DPO（Direct Preference Optimization） ​

评估基准 ​

未来趋势 ​

开源 LLM 生态

模型家族

Meta LLaMA 系列

Mistral AI

Google Gemma

国产开源模型

通义千问 (Qwen)

ChatGLM

百川 (Baichuan)

阶跃星辰 (Step)

其他优秀模型

模型格式

Hugging Face Transformers

GGUF（llama.cpp 格式）

GPTQ / AWQ

模型 Zoo（资源）

综合平台

按任务分类

推理框架

入门级

生产级

微调方法

全量微调（Full Fine-tuning）

LoRA（Low-Rank Adaptation）

QLoRA

DPO（Direct Preference Optimization）

评估基准

未来趋势