feat: optimize dataset conversion efficiency; add on-demand training start/stop scripts

2026-05-06 22:32:18 +08:00
parent 056df3b6ca
commit 0008288964
6 changed files with 607 additions and 114 deletions

README.md

@@ -92,6 +92,8 @@ git submodule update --init --recursive Megatron-LM
- `scripts/convert_phase_to_megatron.py`
This script reads the parquet files directly and uses Megatron's tokenizer together with `IndexedDatasetBuilder` to write the `.bin` / `.idx` outputs; it no longer produces intermediate JSONL files.
Before converting, prepare the tokenizer's 4 definition files:
- `merges.txt`
@@ -108,6 +110,36 @@ wget https://hf-mirror.com/thu-pacman/PCMind-2.1-Kaiyuan-2B/resolve/refs%2Fpr%2F
wget https://hf-mirror.com/thu-pacman/PCMind-2.1-Kaiyuan-2B/resolve/refs%2Fpr%2F1/merges.txt
```
Conversion example:
```bash
python scripts/convert_phase_to_megatron.py \
--input-dir /apps/yi/model_training/data/phase1 \
--output-dir /ssd/yi/converted_data/megatron_phase1 \
--megatron-dir /apps/yi/model_training/Megatron-LM \
--tokenizer-model /apps/yi/model_training/data/tokenizer \
--text-key text \
--output-prefix-prefix phase1 \
--num-shards 4 \
--workers-per-shard 16 \
--batch-size 8192 \
--chunksize 64
```
Recommended concurrency settings:
- `--num-shards`: how many parquet files to process at the same time
- `--workers-per-shard`: number of tokenizer workers per parquet file
- The total number of tokenizer workers is roughly `num_shards * workers_per_shard`
- Aim for a total worker count close to the machine's physical CPU core count, then tune for disk I/O and tokenizer throughput
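As a quick sanity check, the total worker count can be derived from the machine's core count (a sketch; the shard count of 4 and the use of `nproc` are illustrative, not part of the converter):

```bash
# Derive --workers-per-shard so that num_shards * workers_per_shard ~= core count.
total_cores=$(nproc)                      # logical cores reported by the OS
num_shards=4                              # parquet files processed concurrently
workers_per_shard=$(( total_cores / num_shards ))
if [ "$workers_per_shard" -lt 1 ]; then workers_per_shard=1; fi
echo "num_shards=${num_shards} workers_per_shard=${workers_per_shard}"
```

From there, lower the numbers if disk I/O saturates, or raise `--workers-per-shard` if tokenization is the bottleneck.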
Output filenames stay compatible with the training scripts, for example:
```text
/ssd/yi/converted_data/megatron_phase1/phase1_part-00000_text_document.bin
/ssd/yi/converted_data/megatron_phase1/phase1_part-00000_text_document.idx
```
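Megatron-style training scripts consume these outputs as a path prefix without the `.bin` / `.idx` extension. A sketch using the example path above (the `DATA_PATH` variable name is illustrative):

```bash
# The prefix covers both phase1_part-00000_text_document.bin and .idx.
DATA_PATH=/ssd/yi/converted_data/megatron_phase1/phase1_part-00000_text_document
echo "pass to the trainer as: --data-path ${DATA_PATH}"
```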
## 4. Model Definitions and Training Scripts
Model definitions live mainly in `scripts/kaiyuan2b-training`.
@@ -266,7 +298,84 @@ cd scripts/kaiyuan2b-training
bash training_smoke_gpt2.sh
```
### 8.3 Starting Training On Demand
We recommend launching training jobs with `scripts/kaiyuan2b-training/start_training.sh`. The script only selects and starts an existing `training_*.sh` in the background; it does not change how the `data` / `hparams` / `model` definitions are split across files.
Currently supported model entries:
- `gpt_smoke`: maps to `training_smoke_gpt2.sh`
- `qwen3_1p7b`: maps to `training_smoke_qwen3_1p7b.sh`
Examples:
```bash
cd scripts/kaiyuan2b-training
# Start a gpt_smoke smoke run
bash start_training.sh gpt_smoke smoke smoke_gpt
# Start a qwen3_1p7b smoke run
bash start_training.sh qwen3_1p7b qwen3_1p7b_smoke_yi qwen3_1p7b_smoke_yi
```
After launch, the script writes:
- PID state: `/apps/yi/model_training/artifacts/run_state/<train_name>.pid`
- Job metadata: `/apps/yi/model_training/artifacts/run_state/<train_name>.env`
- Training log: `/apps/yi/model_training/artifacts/logs/<train_name>.log`
- TensorBoard logs: `/apps/yi/model_training/artifacts/tb_logs/<train_name>`
- Checkpoints: `/apps/yi/model_training/artifacts/checkpoints/<train_name>`
`start_training.sh` automatically appends Megatron's `--exit-signal-handler` to the training command, so the job saves a checkpoint and exits after receiving `SIGTERM`.
Extra Megatron arguments can be passed through an environment variable:
```bash
EXTRA_ARGS="--exit-duration-in-mins 120" \
bash start_training.sh gpt_smoke smoke smoke_gpt_2h
```
### 8.4 Stopping Training
Use `stop_training.sh` with the `train_name` to stop a job:
```bash
cd scripts/kaiyuan2b-training
bash stop_training.sh smoke_gpt
```
The stop script sends `SIGTERM` to the training process group. Because `--exit-signal-handler` was enabled at launch, Megatron saves a checkpoint inside the training loop and then exits. The default wait is 300 seconds and can be adjusted via `GRACE_SECONDS`:
```bash
GRACE_SECONDS=600 bash stop_training.sh smoke_gpt
```
Avoid a direct `kill -9` unless you have confirmed that checkpoint saving is stuck and the GPUs must be freed.
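Before forcing anything, it is easy to check whether a run is still alive from the `run_state` layout above (a sketch; `smoke_gpt` is the example train name):

```bash
# Liveness check: does the recorded PID still exist?
train_name=smoke_gpt
pid_file="/apps/yi/model_training/artifacts/run_state/${train_name}.pid"
if [ -f "$pid_file" ] && kill -0 "$(cat "$pid_file")" 2>/dev/null; then
  status=running
else
  status=stopped
fi
echo "${train_name}: ${status}"
```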
### 8.5 Keeping the Most Recent K Checkpoints
Both production training scripts automatically clean up old checkpoints, by default keeping the 3 most recent `iter_*` directories so long-running training does not fill the disk:
- `scripts/kaiyuan2b-training/training_smoke_gpt2.sh`
- `scripts/kaiyuan2b-training/training_smoke_qwen3_1p7b.sh`
This can be changed via environment variables:
```bash
# Keep the 5 most recent checkpoints
CHECKPOINT_KEEP_RECENT=5 bash start_training.sh qwen3_1p7b
# Check for old checkpoints every 60 seconds
CHECKPOINT_CLEANUP_INTERVAL_SECONDS=60 bash start_training.sh gpt_smoke
# Disable automatic cleanup
CHECKPOINT_KEEP_RECENT=0 bash start_training.sh gpt_smoke
```
The cleanup logic only deletes old directories of the form `iter_0001000` under the checkpoint directory; it does not touch `latest_checkpointed_iteration.txt`, TensorBoard logs, or other artifacts.
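The keep-recent-K idea can be sketched in a few lines (simplified; unlike the real scripts, this sketch does not protect the iteration recorded in `latest_checkpointed_iteration.txt`, and `./demo_ckpts` is a throwaway demo directory):

```bash
# Create 5 fake checkpoint directories, then keep only the newest 3.
ckpt_dir=./demo_ckpts; keep=3
mkdir -p "$ckpt_dir"/iter_000{1000,2000,3000,4000,5000}
# Sorted iter_* dirs; head -n -K drops the last K lines, leaving the oldest to delete.
find "$ckpt_dir" -maxdepth 1 -type d -name 'iter_*' | sort | head -n -"$keep" \
  | xargs -r rm -rf
ls "$ckpt_dir"
```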
### 8.6 Profiling
```bash
cd scripts/kaiyuan2b-profiling

scripts/convert_phase_to_megatron.py

@@ -1,130 +1,240 @@
import argparse
import os
import sys
import time
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Pool
from pathlib import Path
from types import SimpleNamespace

import pyarrow.parquet as pq
from tqdm import tqdm

"""
Convert Kaiyuan parquet files directly to Megatron indexed dataset format.

Expected parquet schema:
    text: <string>

The previous implementation used parquet -> JSONL -> Megatron preprocess_data.py.
This implementation removes the JSONL intermediate file and writes .bin/.idx with
Megatron's IndexedDatasetBuilder directly.

Usage:
python /apps/yi/model_training/scripts/convert_phase_to_megatron.py \
    --input-dir /apps/yi/model_training/data/phase1 \
    --output-dir /ssd/yi/converted_data/megatron_phase1 \
    --megatron-dir /apps/yi/model_training/Megatron-LM \
    --tokenizer-model /apps/yi/model_training/data/tokenizer \
    --text-key text \
    --num-shards 4 \
    --workers-per-shard 16 \
    --start 100 \
    --end 220
"""

_TOKENIZER = None
_APPEND_EOD = True


def make_tokenizer_args(args):
    return SimpleNamespace(
        rank=0,
        make_vocab_size_divisible_by=128,
        tensor_model_parallel_size=1,
        padded_vocab_size=None,
        vocab_size=args.vocab_size,
        vocab_file=args.vocab_file,
        merge_file=args.merge_file,
        vocab_extra_ids=0,
        tokenizer_type=args.tokenizer_type,
        tokenizer_model=args.tokenizer_model,
        metadata_path=args.tokenizer_metadata,
        special_tokens=args.tokenizer_special_tokens,
        tokenizer_sentencepiece_legacy=args.tokenizer_sentencepiece_legacy,
        tokenizer_hf_no_use_fast=args.tokenizer_hf_no_use_fast,
        tokenizer_hf_no_include_special_tokens=args.tokenizer_hf_no_include_special_tokens,
        trust_remote_code=args.trust_remote_code,
        tiktoken_pattern=args.tiktoken_pattern,
        tiktoken_num_special_tokens=args.tiktoken_num_special_tokens,
        null_tokenizer_eod_id=args.null_tokenizer_eod_id,
        null_tokenizer_pad_id=args.null_tokenizer_pad_id,
        tokenizer_prompt_format=None,
        image_tag_type=None,
        force_system_message=False,
        sft_tokenizer_prompt_format=None,
    )


def add_megatron_to_path(megatron_dir):
    megatron_dir = str(Path(megatron_dir).resolve())
    if megatron_dir not in sys.path:
        sys.path.insert(0, megatron_dir)


def build_megatron_tokenizer(args):
    add_megatron_to_path(args.megatron_dir)
    from megatron.core.tokenizers.utils.build_tokenizer import build_tokenizer

    return build_tokenizer(make_tokenizer_args(args))


def init_worker(args):
    global _TOKENIZER, _APPEND_EOD
    _APPEND_EOD = args.append_eod
    _TOKENIZER = build_megatron_tokenizer(args)
    if _APPEND_EOD and _TOKENIZER.eod is None:
        raise ValueError("Tokenizer has no EOD/EOS token, but --append-eod is enabled.")


def encode_text(text):
    if not isinstance(text, str):
        return None
    text = text.strip()
    if not text:
        return None
    token_ids = _TOKENIZER.tokenize(text)
    if not token_ids:
        return None
    sentence_lens = [len(token_ids)]
    if _APPEND_EOD:
        token_ids.append(_TOKENIZER.eod)
        sentence_lens[-1] += 1
    return token_ids, sentence_lens


def output_paths(output_prefix, text_key):
    prefix = Path(output_prefix)
    return (
        Path(str(prefix) + f"_{text_key}_document.bin"),
        Path(str(prefix) + f"_{text_key}_document.idx"),
    )


def remove_partial_outputs(output_prefix, text_key):
    bin_file, idx_file = output_paths(output_prefix, text_key)
    bin_file.unlink(missing_ok=True)
    idx_file.unlink(missing_ok=True)


def convert_one_parquet(args_tuple):
    parquet_path, args = args_tuple
    parquet_path = Path(parquet_path)
    stem = parquet_path.name.replace(".zstd.parquet", "").replace(".parquet", "")
    output_prefix = Path(args.output_dir) / f"{args.output_prefix_prefix}_{stem}"
    bin_file, idx_file = output_paths(output_prefix, args.text_key)
    if not args.overwrite and bin_file.exists() and idx_file.exists():
        return f"[SKIP] {parquet_path.name}: existing bin/idx"
    remove_partial_outputs(output_prefix, args.text_key)

    add_megatron_to_path(args.megatron_dir)
    from megatron.core.datasets import indexed_dataset

    tokenizer = build_megatron_tokenizer(args)
    dtype = indexed_dataset.DType.optimal_dtype(tokenizer.vocab_size)
    builder = indexed_dataset.IndexedDatasetBuilder(str(bin_file), dtype=dtype)

    start_time = time.time()
    rows = 0
    docs = 0
    tokens = 0

    def consume_encoded(encoded):
        nonlocal docs, tokens
        if encoded is None:
            return
        token_ids, sentence_lens = encoded
        builder.add_document(token_ids, sentence_lens)
        docs += 1
        tokens += len(token_ids)
        if args.log_interval and docs % args.log_interval == 0:
            elapsed = max(time.time() - start_time, 1e-6)
            print(
                f"[{parquet_path.name}] docs={docs} "
                f"tokens={tokens} docs/s={docs / elapsed:.2f}",
                flush=True,
            )

    pf = pq.ParquetFile(parquet_path)
    if args.workers_per_shard == 1:
        init_worker(args)
        for batch in pf.iter_batches(columns=[args.text_key], batch_size=args.batch_size):
            texts = batch.column(0).to_pylist()
            rows += len(texts)
            for text in texts:
                consume_encoded(encode_text(text))
    else:
        with Pool(processes=args.workers_per_shard, initializer=init_worker, initargs=(args,)) as pool:
            for batch in pf.iter_batches(columns=[args.text_key], batch_size=args.batch_size):
                texts = batch.column(0).to_pylist()
                rows += len(texts)
                for encoded in pool.imap(encode_text, texts, chunksize=args.chunksize):
                    consume_encoded(encoded)

    builder.finalize(str(idx_file))
    elapsed = max(time.time() - start_time, 1e-6)
    return (
        f"[OK] {parquet_path.name}: rows={rows}, docs={docs}, tokens={tokens}, "
        f"elapsed={elapsed:.1f}s, docs/s={docs / elapsed:.2f}, output_prefix={output_prefix}"
    )


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", required=True)
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--tmp-dir", default=None, help="Deprecated; kept for CLI compatibility.")
    parser.add_argument("--megatron-dir", default="/apps/model_training/Megatron-LM")
    parser.add_argument("--tokenizer-type", default="HuggingFaceTokenizer")
    parser.add_argument("--tokenizer-model", required=True)
    parser.add_argument("--tokenizer-metadata", default=None)
    parser.add_argument("--tokenizer-special-tokens", nargs="*", default=None)
    parser.add_argument("--tokenizer-sentencepiece-legacy", action="store_true")
    parser.add_argument("--tokenizer-hf-no-use-fast", action="store_true")
    parser.add_argument("--tokenizer-hf-no-include-special-tokens", action="store_true")
    parser.add_argument("--trust-remote-code", action="store_true")
    parser.add_argument("--vocab-file", default=None)
    parser.add_argument("--merge-file", default=None)
    parser.add_argument("--vocab-size", type=int, default=None)
    parser.add_argument("--tiktoken-pattern", default=None)
    parser.add_argument("--tiktoken-num-special-tokens", type=int, default=1000)
    parser.add_argument("--null-tokenizer-eod-id", type=int, default=None)
    parser.add_argument("--null-tokenizer-pad-id", type=int, default=-1)
    parser.add_argument("--text-key", default="text")
    parser.add_argument("--output-prefix-prefix", default="phase1")
    parser.add_argument("--num-shards", type=int, default=1, help="Parallel parquet files.")
    parser.add_argument("--workers-per-shard", type=int, default=max((os.cpu_count() or 8) // 2, 1))
    parser.add_argument("--batch-size", type=int, default=8192, help="Parquet record batch size.")
    parser.add_argument("--chunksize", type=int, default=64, help="Tokenizer pool imap chunk size.")
    parser.add_argument("--log-interval", type=int, default=10000)
    parser.add_argument("--start", type=int, default=0)
    parser.add_argument("--end", type=int, default=None)
    parser.add_argument("--append-eod", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--keep-jsonl", action="store_true", help="Deprecated; no JSONL is written.")
    parser.add_argument("--overwrite", action="store_true")
    return parser.parse_args()


def main():
    args = parse_args()
    if args.num_shards < 1:
        raise ValueError("--num-shards must be >= 1")
    if args.workers_per_shard < 1:
        raise ValueError("--workers-per-shard must be >= 1")
    if args.batch_size < 1:
        raise ValueError("--batch-size must be >= 1")
    if args.chunksize < 1:
        raise ValueError("--chunksize must be >= 1")

    files = sorted(Path(args.input_dir).glob("*.zstd.parquet"))
    if not files:
@@ -132,33 +242,24 @@ def main():
    files = files[args.start:args.end]
    print(f"Converting {len(files)} files")
    print(f"Parallel parquet files: {args.num_shards}")
    print(f"Tokenizer workers per parquet: {args.workers_per_shard}")
    print(f"Total tokenizer workers: {args.num_shards * args.workers_per_shard}")

    Path(args.output_dir).mkdir(parents=True, exist_ok=True)
    if args.tmp_dir:
        Path(args.tmp_dir).mkdir(parents=True, exist_ok=True)

    tasks = [(str(f), args) for f in files]
    if args.num_shards == 1:
        for task in tqdm(tasks):
            print(convert_one_parquet(task), flush=True)
    else:
        with ProcessPoolExecutor(max_workers=args.num_shards) as ex:
            futs = [ex.submit(convert_one_parquet, task) for task in tasks]
            for fut in tqdm(as_completed(futs), total=len(futs)):
                print(fut.result(), flush=True)


if __name__ == "__main__":
    main()

scripts/kaiyuan2b-training/start_training.sh

@@ -0,0 +1,92 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
ARTIFACT_ROOT=${ARTIFACT_ROOT:-/apps/yi/model_training/artifacts}
RUN_STATE_DIR="${ARTIFACT_ROOT}/run_state"
LOG_DIR="${ARTIFACT_ROOT}/logs"
usage() {
cat <<'EOF'
Usage:
bash start_training.sh <model> [mode] [train_name]
Models:
gpt_smoke
qwen3_1p7b
Examples:
bash start_training.sh gpt_smoke smoke smoke_gpt
bash start_training.sh qwen3_1p7b qwen3_1p7b_smoke_yi qwen3_1p7b_smoke_yi
Environment overrides:
CHECKPOINT_KEEP_RECENT=3
CHECKPOINT_CLEANUP_INTERVAL_SECONDS=300
EXTRA_ARGS="--exit-duration-in-mins 120"
EOF
}
model=${1:-}
mode=${2:-}
train_name=${3:-}
if [ -z "$model" ] || [ "$model" = "-h" ] || [ "$model" = "--help" ]; then
    usage
    exit 0
fi
case "$model" in
    gpt_smoke)
        train_script="${SCRIPT_DIR}/training_smoke_gpt2.sh"
        mode=${mode:-smoke}
        train_name=${train_name:-smoke_gpt}
        ;;
    qwen3_1p7b)
        train_script="${SCRIPT_DIR}/training_smoke_qwen3_1p7b.sh"
        mode=${mode:-qwen3_1p7b_smoke_yi}
        train_name=${train_name:-qwen3_1p7b_smoke_yi}
        ;;
    *)
        echo "Unknown model: $model" >&2
        usage >&2
        exit 1
        ;;
esac
mkdir -p "$RUN_STATE_DIR" "$LOG_DIR"
pid_file="${RUN_STATE_DIR}/${train_name}.pid"
meta_file="${RUN_STATE_DIR}/${train_name}.env"
log_file="${LOG_DIR}/${train_name}.log"
if [ -f "$pid_file" ]; then
    old_pid=$(cat "$pid_file")
    if [ -n "$old_pid" ] && kill -0 "$old_pid" 2>/dev/null; then
        echo "Training already appears to be running: train_name=${train_name}, pid=${old_pid}" >&2
        exit 1
    fi
fi
combined_extra_args="--exit-signal-handler ${EXTRA_ARGS:-}"
cd "$SCRIPT_DIR"
EXTRA_ARGS="$combined_extra_args" setsid bash "$train_script" "$mode" "$train_name" > "$log_file" 2>&1 &
pid=$!
pgid=$(ps -o pgid= -p "$pid" | tr -d ' ' || true)
printf '%s\n' "$pid" > "$pid_file"
cat > "$meta_file" <<EOF
MODEL=${model}
MODE=${mode}
TRAIN_NAME=${train_name}
PID=${pid}
PGID=${pgid}
LOG_FILE=${log_file}
TRAIN_SCRIPT=${train_script}
CHECKPOINT_KEEP_RECENT=${CHECKPOINT_KEEP_RECENT:-3}
CHECKPOINT_CLEANUP_INTERVAL_SECONDS=${CHECKPOINT_CLEANUP_INTERVAL_SECONDS:-300}
EOF
echo "Started training: model=${model}, mode=${mode}, train_name=${train_name}, pid=${pid}, pgid=${pgid:-unknown}"
echo "Log: ${log_file}"
echo "Stop: bash ${SCRIPT_DIR}/stop_training.sh ${train_name}"

scripts/kaiyuan2b-training/stop_training.sh

@@ -0,0 +1,68 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
ARTIFACT_ROOT=${ARTIFACT_ROOT:-/apps/yi/model_training/artifacts}
RUN_STATE_DIR="${ARTIFACT_ROOT}/run_state"
GRACE_SECONDS=${GRACE_SECONDS:-300}
usage() {
cat <<'EOF'
Usage:
bash stop_training.sh <train_name>
Environment overrides:
GRACE_SECONDS=300
EOF
}
train_name=${1:-}
if [ -z "$train_name" ] || [ "$train_name" = "-h" ] || [ "$train_name" = "--help" ]; then
    usage
    exit 0
fi
pid_file="${RUN_STATE_DIR}/${train_name}.pid"
meta_file="${RUN_STATE_DIR}/${train_name}.env"
if [ ! -f "$pid_file" ]; then
    echo "PID file not found: ${pid_file}" >&2
    exit 1
fi
pid=$(cat "$pid_file")
if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
    echo "Training is not running for train_name=${train_name}; cleaning stale state."
    rm -f "$pid_file" "$meta_file"
    exit 0
fi
pgid=$(ps -o pgid= -p "$pid" | tr -d ' ' || true)
if [ -z "$pgid" ] && [ -f "$meta_file" ]; then
    pgid=$(grep '^PGID=' "$meta_file" | cut -d= -f2- || true)
fi
echo "Sending SIGTERM to training process group: train_name=${train_name}, pid=${pid}, pgid=${pgid:-unknown}"
if [ -n "$pgid" ]; then
    kill -TERM "-${pgid}" 2>/dev/null || kill -TERM "$pid" 2>/dev/null || true
else
    kill -TERM "$pid" 2>/dev/null || true
fi
deadline=$((SECONDS + GRACE_SECONDS))
while kill -0 "$pid" 2>/dev/null; do
if [ "$SECONDS" -ge "$deadline" ]; then
echo "Training did not exit within ${GRACE_SECONDS}s." >&2
echo "If checkpoint saving is still running, wait and inspect logs before forcing termination." >&2
if [ -n "$pgid" ]; then
echo "Force kill manually if needed: kill -KILL -${pgid}" >&2
else
echo "Force kill manually if needed: kill -KILL ${pid}" >&2
fi
exit 2
fi
sleep 5
done
rm -f "$pid_file" "$meta_file"
echo "Stopped training: train_name=${train_name}"

scripts/kaiyuan2b-training/training_smoke_gpt2.sh

@@ -11,6 +11,9 @@ MEGATRON_PATH=/apps/yi/model_training/Megatron-LM
ARTIFACT_ROOT=/apps/yi/model_training/artifacts
TB_DIR="${ARTIFACT_ROOT}/tb_logs/${TRAIN_NAME}"
CKPT_DIR="${ARTIFACT_ROOT}/checkpoints/${TRAIN_NAME}"
CHECKPOINT_KEEP_RECENT=${CHECKPOINT_KEEP_RECENT:-3}
CHECKPOINT_CLEANUP_INTERVAL_SECONDS=${CHECKPOINT_CLEANUP_INTERVAL_SECONDS:-300}
EXTRA_ARGS=${EXTRA_ARGS:-}
source params/optim_common.sh
source params/gpt_smoke/model.sh
@@ -45,6 +48,64 @@ PARALLEL_ARGS="
mkdir -p "$CKPT_DIR" "$TB_DIR"
cleanup_old_checkpoints_once() {
    local ckpt_dir=$1
    local keep=$2
    if ! [[ "$keep" =~ ^[0-9]+$ ]] || [ "$keep" -le 0 ] || [ ! -d "$ckpt_dir" ]; then
        return 0
    fi
    local latest=""
    if [ -f "${ckpt_dir}/latest_checkpointed_iteration.txt" ]; then
        read -r latest < "${ckpt_dir}/latest_checkpointed_iteration.txt" || latest=""
        if [[ "$latest" =~ ^[0-9]+$ ]]; then
            latest=$(printf "iter_%07d" "$latest")
        else
            latest=""
        fi
    fi
    local checkpoints=()
    while IFS= read -r path; do
        checkpoints+=("$path")
    done < <(find "$ckpt_dir" -maxdepth 1 -type d -name 'iter_[0-9][0-9][0-9][0-9][0-9][0-9][0-9]' -print | sort)
    local delete_count=$((${#checkpoints[@]} - keep))
    if [ "$delete_count" -le 0 ]; then
        return 0
    fi
    local i base
    for ((i = 0; i < delete_count; i++)); do
        base=$(basename "${checkpoints[$i]}")
        if [ "$base" = "$latest" ]; then
            continue
        fi
        echo "[checkpoint-cleanup] deleting old checkpoint: ${checkpoints[$i]}"
        rm -rf -- "${checkpoints[$i]}"
    done
}

checkpoint_cleanup_loop() {
    local ckpt_dir=$1
    local keep=$2
    local interval=$3
    if ! [[ "$interval" =~ ^[0-9]+$ ]] || [ "$interval" -le 0 ]; then
        return 0
    fi
    while true; do
        sleep "$interval"
        cleanup_old_checkpoints_once "$ckpt_dir" "$keep"
    done
}

checkpoint_cleanup_loop "$CKPT_DIR" "$CHECKPOINT_KEEP_RECENT" "$CHECKPOINT_CLEANUP_INTERVAL_SECONDS" &
CHECKPOINT_CLEANUP_PID=$!
trap 'kill "$CHECKPOINT_CLEANUP_PID" 2>/dev/null || true; cleanup_old_checkpoints_once "$CKPT_DIR" "$CHECKPOINT_KEEP_RECENT"' EXIT
DISTRIBUTED_ARGS="
    --nproc_per_node 8
    --nnodes 1
@@ -63,4 +124,5 @@ torchrun $DISTRIBUTED_ARGS \
    $RUN_ARGS \
    $LOGGING_ARGS \
    --save "$CKPT_DIR" \
    --load "$CKPT_DIR" \
    $EXTRA_ARGS

scripts/kaiyuan2b-training/training_smoke_qwen3_1p7b.sh

@@ -13,6 +13,9 @@ SCRIPT_DIR=/apps/yi/model_training/scripts/kaiyuan2b-training
PARAMS_DIR="${SCRIPT_DIR}/params"
TB_DIR="${ARTIFACT_ROOT}/tb_logs/${TRAIN_NAME}"
CKPT_DIR="${ARTIFACT_ROOT}/checkpoints/${TRAIN_NAME}"
CHECKPOINT_KEEP_RECENT=${CHECKPOINT_KEEP_RECENT:-3}
CHECKPOINT_CLEANUP_INTERVAL_SECONDS=${CHECKPOINT_CLEANUP_INTERVAL_SECONDS:-300}
EXTRA_ARGS=${EXTRA_ARGS:-}
source "${PARAMS_DIR}/optim_common.sh"
source "${PARAMS_DIR}/qwen3_1p7b/model.sh"
@@ -56,6 +59,64 @@ fi
mkdir -p "$CKPT_DIR" "$TB_DIR"
cleanup_old_checkpoints_once() {
    local ckpt_dir=$1
    local keep=$2
    if ! [[ "$keep" =~ ^[0-9]+$ ]] || [ "$keep" -le 0 ] || [ ! -d "$ckpt_dir" ]; then
        return 0
    fi
    local latest=""
    if [ -f "${ckpt_dir}/latest_checkpointed_iteration.txt" ]; then
        read -r latest < "${ckpt_dir}/latest_checkpointed_iteration.txt" || latest=""
        if [[ "$latest" =~ ^[0-9]+$ ]]; then
            latest=$(printf "iter_%07d" "$latest")
        else
            latest=""
        fi
    fi
    local checkpoints=()
    while IFS= read -r path; do
        checkpoints+=("$path")
    done < <(find "$ckpt_dir" -maxdepth 1 -type d -name 'iter_[0-9][0-9][0-9][0-9][0-9][0-9][0-9]' -print | sort)
    local delete_count=$((${#checkpoints[@]} - keep))
    if [ "$delete_count" -le 0 ]; then
        return 0
    fi
    local i base
    for ((i = 0; i < delete_count; i++)); do
        base=$(basename "${checkpoints[$i]}")
        if [ "$base" = "$latest" ]; then
            continue
        fi
        echo "[checkpoint-cleanup] deleting old checkpoint: ${checkpoints[$i]}"
        rm -rf -- "${checkpoints[$i]}"
    done
}

checkpoint_cleanup_loop() {
    local ckpt_dir=$1
    local keep=$2
    local interval=$3
    if ! [[ "$interval" =~ ^[0-9]+$ ]] || [ "$interval" -le 0 ]; then
        return 0
    fi
    while true; do
        sleep "$interval"
        cleanup_old_checkpoints_once "$ckpt_dir" "$keep"
    done
}

checkpoint_cleanup_loop "$CKPT_DIR" "$CHECKPOINT_KEEP_RECENT" "$CHECKPOINT_CLEANUP_INTERVAL_SECONDS" &
CHECKPOINT_CLEANUP_PID=$!
trap 'kill "$CHECKPOINT_CLEANUP_PID" 2>/dev/null || true; cleanup_old_checkpoints_once "$CKPT_DIR" "$CHECKPOINT_KEEP_RECENT"' EXIT
DISTRIBUTED_ARGS="
    --nproc_per_node 8
    --nnodes 1
@@ -79,5 +140,5 @@ torchrun $DISTRIBUTED_ARGS \
    --cuda-graph-warmup-steps 3 \
    --transformer-impl transformer_engine \
    --cross-entropy-loss-fusion \
    --cross-entropy-fusion-impl te \
    $EXTRA_ARGS