feat: optimize dataset conversion efficiency, add on-demand training start/stop scripts
README.md
@@ -92,6 +92,8 @@ git submodule update --init --recursive Megatron-LM

- `scripts/convert_phase_to_megatron.py`

This script reads the parquet files directly and uses Megatron's tokenizer together with `IndexedDatasetBuilder` to write the `.bin` / `.idx` files, so no intermediate JSONL files are generated anymore.

Before converting, prepare the 4 tokenizer definition files:

- `merges.txt`
@@ -108,6 +110,36 @@ wget https://hf-mirror.com/thu-pacman/PCMind-2.1-Kaiyuan-2B/resolve/refs%2Fpr%2F
wget https://hf-mirror.com/thu-pacman/PCMind-2.1-Kaiyuan-2B/resolve/refs%2Fpr%2F1/merges.txt
```

Conversion example:

```bash
python scripts/convert_phase_to_megatron.py \
    --input-dir /apps/yi/model_training/data/phase1 \
    --output-dir /ssd/yi/converted_data/megatron_phase1 \
    --megatron-dir /apps/yi/model_training/Megatron-LM \
    --tokenizer-model /apps/yi/model_training/data/tokenizer \
    --text-key text \
    --output-prefix-prefix phase1 \
    --num-shards 4 \
    --workers-per-shard 16 \
    --batch-size 8192 \
    --chunksize 64
```

Suggested concurrency settings:

- `--num-shards`: how many parquet files are processed at the same time
- `--workers-per-shard`: number of tokenizer workers per parquet file
- the total number of tokenizer workers is roughly `num_shards * workers_per_shard`
- keep the total worker count close to the machine's physical CPU core count, then adjust based on disk I/O and tokenizer throughput (see the sketch after this list)
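
As a rough starting point, the two knobs can be sized from the machine's physical core count, for example (a minimal sketch, not part of the repo; it assumes a Linux host with `lscpu`, and the resulting split should still be tuned against disk I/O and tokenizer throughput):

```bash
# Count physical cores (unique CORE,SOCKET pairs), excluding hyper-threads.
PHYSICAL_CORES=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
NUM_SHARDS=4
WORKERS_PER_SHARD=$(( PHYSICAL_CORES / NUM_SHARDS ))
echo "cores=${PHYSICAL_CORES} -> --num-shards ${NUM_SHARDS} --workers-per-shard ${WORKERS_PER_SHARD}"
```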

The output file names stay compatible with the training scripts, e.g.:

```text
/ssd/yi/converted_data/megatron_phase1/phase1_part-00000_text_document.bin
/ssd/yi/converted_data/megatron_phase1/phase1_part-00000_text_document.idx
```
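
If these files are wired into Megatron manually, note that under the usual Megatron-LM convention (which the training scripts are assumed to follow) the data path is the shared prefix of the `.bin`/`.idx` pair rather than either file itself:

```bash
# Hypothetical check: both halves of the pair must exist for one prefix.
DATA_PREFIX=/ssd/yi/converted_data/megatron_phase1/phase1_part-00000_text_document
ls "${DATA_PREFIX}.bin" "${DATA_PREFIX}.idx"
# The training side then refers to the prefix, e.g. --data-path "${DATA_PREFIX}".
```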

## 4. Model definitions and training scripts

The model definitions live mainly in `scripts/kaiyuan2b-training`.

@@ -266,7 +298,84 @@ cd scripts/kaiyuan2b-training
bash training_smoke_gpt2.sh
```

### 8.3 Starting training on demand

We recommend using `scripts/kaiyuan2b-training/start_training.sh` to launch training jobs on demand. The script only selects one of the existing `training_*.sh` scripts and starts it in the background; it does not change how the `data`, `hparams`, and `model` definitions are split across files.

Currently supported model entries:

- `gpt_smoke`: maps to `training_smoke_gpt2.sh`
- `qwen3_1p7b`: maps to `training_smoke_qwen3_1p7b.sh`

Example:

```bash
cd scripts/kaiyuan2b-training

# Launch the gpt_smoke smoke training run
bash start_training.sh gpt_smoke smoke smoke_gpt

# Launch the qwen3_1p7b smoke training run
bash start_training.sh qwen3_1p7b qwen3_1p7b_smoke_yi qwen3_1p7b_smoke_yi
```

After launch, the script writes:

- PID state: `/apps/yi/model_training/artifacts/run_state/<train_name>.pid`
- job metadata: `/apps/yi/model_training/artifacts/run_state/<train_name>.env`
- training log: `/apps/yi/model_training/artifacts/logs/<train_name>.log`
- TensorBoard logs: `/apps/yi/model_training/artifacts/tb_logs/<train_name>`
- checkpoints: `/apps/yi/model_training/artifacts/checkpoints/<train_name>`
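
These state files can be used directly for a quick health check, for example (a small sketch; `smoke_gpt` is just an example `train_name`, and the `.pid` file is assumed to hold the launcher's PID):

```bash
TRAIN_NAME=smoke_gpt
ARTIFACTS=/apps/yi/model_training/artifacts

# Is the job still alive?
if kill -0 "$(cat "${ARTIFACTS}/run_state/${TRAIN_NAME}.pid" 2>/dev/null)" 2>/dev/null; then
    echo "${TRAIN_NAME} is running"
else
    echo "${TRAIN_NAME} is not running"
fi

# Follow the training log
tail -f "${ARTIFACTS}/logs/${TRAIN_NAME}.log"
```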

`start_training.sh` automatically appends Megatron's `--exit-signal-handler` to the training command, so that on `SIGTERM` a checkpoint is saved before the process exits.

Extra Megatron arguments can be passed in through an environment variable:

```bash
EXTRA_ARGS="--exit-duration-in-mins 120" \
bash start_training.sh gpt_smoke smoke smoke_gpt_2h
```

### 8.4 Stopping training

Use `stop_training.sh` to stop a job by its `train_name`:

```bash
cd scripts/kaiyuan2b-training
bash stop_training.sh smoke_gpt
```

The stop script sends `SIGTERM` to the training process group. Because `--exit-signal-handler` was enabled at launch, Megatron saves a checkpoint inside the training loop and then exits. The script waits 300 seconds by default; this can be adjusted via `GRACE_SECONDS`:

```bash
GRACE_SECONDS=600 bash stop_training.sh smoke_gpt
```
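
For reference, the graceful-stop flow described above roughly amounts to the following (a minimal sketch, not the actual contents of `stop_training.sh`; it assumes the `.pid` file holds a process-group leader's PID):

```bash
# Send SIGTERM to the whole process group, then wait up to GRACE_SECONDS
# for the launcher to exit (sketch only, assumed behaviour).
PID_FILE=/apps/yi/model_training/artifacts/run_state/smoke_gpt.pid
PID=$(cat "$PID_FILE")
GRACE_SECONDS=${GRACE_SECONDS:-300}

kill -TERM -- "-$PID"   # negative PID targets the process group
for _ in $(seq "$GRACE_SECONDS"); do
    kill -0 "$PID" 2>/dev/null || { echo "training exited cleanly"; break; }
    sleep 1
done
```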

Do not `kill -9` directly unless you have confirmed that the checkpoint save is stuck and the GPUs must be freed.
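
One rough way to tell whether a save is still making progress before resorting to `kill -9` (a heuristic only; the path follows the layout above and `smoke_gpt` is an example name):

```bash
# If the checkpoint directory keeps growing, the save is still in progress;
# if the size stays flat for several minutes, it may be stuck.
watch -n 30 "du -sh /apps/yi/model_training/artifacts/checkpoints/smoke_gpt"
```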

### 8.5 Keeping the most recent K checkpoints

Both formal training scripts automatically clean up old checkpoints, keeping the 3 most recent `iter_*` directories by default so that long runs do not fill up the disk:

- `scripts/kaiyuan2b-training/training_smoke_gpt2.sh`
- `scripts/kaiyuan2b-training/training_smoke_qwen3_1p7b.sh`

This can be changed via environment variables:

```bash
# Keep the 5 most recent checkpoints
CHECKPOINT_KEEP_RECENT=5 bash start_training.sh qwen3_1p7b

# Check for stale checkpoints every 60 seconds
CHECKPOINT_CLEANUP_INTERVAL_SECONDS=60 bash start_training.sh gpt_smoke

# Disable automatic cleanup
CHECKPOINT_KEEP_RECENT=0 bash start_training.sh gpt_smoke
```

The cleanup logic only deletes old directories matching the `iter_0001000` pattern under the checkpoint directory; it does not remove `latest_checkpointed_iteration.txt`, the TensorBoard logs, or any other artifacts.
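
For illustration, a keep-most-recent-K policy of this shape could be implemented as follows (a minimal sketch, not the code shipped in the training scripts; `CKPT_DIR` is a placeholder):

```bash
CKPT_DIR=/apps/yi/model_training/artifacts/checkpoints/smoke_gpt   # placeholder
KEEP=${CHECKPOINT_KEEP_RECENT:-3}

# iter_* names are zero-padded, so lexicographic sort matches iteration order.
# Delete everything except the newest $KEEP directories; KEEP=0 disables cleanup.
if [ "$KEEP" -gt 0 ]; then
    ls -d "${CKPT_DIR}"/iter_* 2>/dev/null | sort | head -n "-${KEEP}" | xargs -r rm -rf
fi
```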

### 8.6 Profiling

```bash
cd scripts/kaiyuan2b-profiling