The project's commit graph looks like this:

* a0e9308 (HEAD -> master) merge upstream/master 
|\
| * bdec219 (upstream/master, upstream/HEAD) feat(lora): add GetLoRAParameters and MergeAndUnload APIs
| * 08c0e1d fix(lora): fix dimension mismatch and refactor TP helper functions
| * 26f6d7d fix(lora): improve LoRA configuration and DDP integration
| * 5d7f15f feat(lora): add Low-Rank Adaptation support for efficient fine-tuning
| * b1e4b03 fix: add end of test cleanup
| * 08e856c fix: add compare utils
| * 15bfad1 fix: remove redundant and duplicate codes
| * 92ca5d3 fix: add retry logic in feishu writer
| * 4682f71 feat: organize test cases by group
* | d51daf3 (origin/master, origin/HEAD) fix: fix header
* | a54c1ec Merge branch 'master' of github.com:ArcaLunar/InfiniTrain
|\|
| * 791c75e fix: fix rank argument in ddp multi-node training
* | 90b8b6a merge ckpt tools
|\ \
| |/
|/|
| * 856fe84 (origin/ckpt-sync) add report
| * d7cbcde fix: gpt2 training weight saving/loading bug
| * 0d404e3 fix: save also train iter states
| * 27543b9 fix: isolate by models and resumes
| * 6e32e4e fix: test config file
| * 4862334 test: add sample resume test
| * 48c9cf3 fix: adjust logging level
| * f91acd7 feat: log to terminal; async dump states to file
| * d0f3a5d feat: logging when loading checkpoint
| * 771632e feat: example adaptation
| * e515f81 feat: param api for optimizer and model; checkpoint api
* | 733ad19 fix: save device/ccl impl in ProcessGroup
* | bf2eae5 fix: resolve requested changes, remove unnecessary api, remove nccl macros, mv unique_id file helper functions to utils
* | a1bea05 fix: add EventFlag enum, fix mutex usage in ProcessGroupFactory::Instance()
* | f33fc2b feat: integrate runtime_common, and modify ProcessGroup related apis
* | e281dea fix: fix nccl error in process_group and seg fault in profiler
* | 73a2bad draft: remove ProgressGroup and Work derivitives for NCCL
|/
* d278062 refactor: enforce strict backend contracts for DeviceGuardImpl

The workflow, roughly, was that development branched off the old master, but upstream kept landing new commits, so I merged upstream/master repeatedly while developing, which left the graph looking thoroughly tangled…

However, when submitting the PR I was asked to squash everything into a single commit, so out came the knife.

Since everything beyond d51daf3 is just upstream commit history, and it can be merged back in once my own history is cleaned up, start by hard-resetting straight to origin/HEAD:

git reset --hard d51daf3
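A hard reset discards the branch tip, but the dropped commits stay reachable through the reflog for a while, so the step is recoverable if something goes wrong. A minimal sketch in a throwaway repo (all names and messages here are illustrative, not from the actual project):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
git commit -q --allow-empty -m "keep: upstream state"
keep=$(git rev-parse HEAD)
git commit -q --allow-empty -m "drop: merge noise"
# Discard the last commit, analogous to `git reset --hard d51daf3`
git reset -q --hard "$keep"
# The dropped commit is still visible in the reflog if a rescue is needed
git reflog | grep -q "drop: merge noise"
```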

The commits from d278062 up to d51daf3 actually mix my changes with some upstream ones, but no matter, squash them all! Just soft-reset to the common parent d278062 and commit everything in one go:

git reset --soft d278062
git commit -m "feat: checkpoint save & load"
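The soft reset moves the branch pointer back while leaving the index and working tree untouched, so a single follow-up commit captures the combined diff. A toy-repo sketch of the same squash (file names and messages are illustrative):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
echo base > file && git add file && git commit -q -m "base"   # plays the role of d278062
base=$(git rev-parse HEAD)
echo one >> file && git commit -qam "wip: part 1"
echo two >> file && git commit -qam "wip: part 2"
# Move the branch back to the common parent; index and working tree keep both edits
git reset -q --soft "$base"
git commit -q -m "feat: checkpoint save & load"
# History is now just base plus one squashed commit
git rev-list --count HEAD
```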

Then just re-merge upstream's master:

git merge upstream/master
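Because the squashed commit and upstream now share d278062 as their merge base, this produces one clean two-parent merge commit. A toy-repo sketch of that shape, where a local branch `upstream-work` stands in for `upstream/master`:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
git commit -q --allow-empty -m "refactor: common parent"
trunk=$(git symbolic-ref --short HEAD)
git checkout -q -b upstream-work            # stands in for upstream/master
git commit -q --allow-empty -m "feat(lora): upstream work"
git checkout -q "$trunk"
git commit -q --allow-empty -m "feat: checkpoint save & load"
git merge -q -m "merge upstream/master" upstream-work
# HEAD is now a merge commit with two parents
git rev-parse -q --verify HEAD^2 >/dev/null
```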

The final result is shown below, and it looks much cleaner.

* ef7a4a9 (HEAD -> master) merge upstream/master 
|\
| * bdec219 (upstream/master, upstream/HEAD) feat(lora): add GetLoRAParameters and MergeAndUnload APIs
| * 08c0e1d fix(lora): fix dimension mismatch and refactor TP helper functions
| * 26f6d7d fix(lora): improve LoRA configuration and DDP integration
| * 5d7f15f feat(lora): add Low-Rank Adaptation support for efficient fine-tuning
| * b1e4b03 fix: add end of test cleanup
| * 08e856c fix: add compare utils
| * 15bfad1 fix: remove redundant and duplicate codes
| * 92ca5d3 fix: add retry logic in feishu writer
| * 4682f71 feat: organize test cases by group
| * 791c75e fix: fix rank argument in ddp multi-node training
| * 733ad19 fix: save device/ccl impl in ProcessGroup
| * bf2eae5 fix: resolve requested changes, remove unnecessary api, remove nccl macros, mv unique_id file helper functions to utils
| * a1bea05 fix: add EventFlag enum, fix mutex usage in ProcessGroupFactory::Instance()
| * f33fc2b feat: integrate runtime_common, and modify ProcessGroup related apis
| * e281dea fix: fix nccl error in process_group and seg fault in profiler
| * 73a2bad draft: remove ProgressGroup and Work derivitives for NCCL
* | f4a5fc1 feat: checkpoint save & load
|/
* d278062 refactor: enforce strict backend contracts for DeviceGuardImpl