CurateClick

Preference Data Ranking Assistant for LLM RLHF (PPO Reward Model) Training

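A tool with support for local Ollama models that ranks LLM-generated answers and produces preference data for training a reward model, which is then used to optimize the SFT model.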

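Description: In short, after SFT fine-tuning you decide what behavior you want RLHF (PPO with a reward model) to produce, and rank your model's generated answers accordingly. You then export the preference data to train a reward model, and use that reward model to train the SFT model.

Source code: https://github.com/zjrwtx/preference_databuilder

Demo video: https://www.bilibili.com/video/BV1P1421675z/?spm_id_from=333.999.0.0&vd_source=5531fb0981ef79f87198a3c2651dff93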
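
To make the ranking-to-preference-data step concrete, here is a minimal sketch of how a ranked list of answers could be expanded into pairwise records for reward model training. The function name `ranking_to_pairs` and the `prompt` / `chosen` / `rejected` field names are assumptions chosen for illustration; the tool's actual export format is not described here and may differ.

```python
import json
from itertools import combinations


def ranking_to_pairs(prompt, ranked_answers):
    """Expand a ranking (best answer first) into pairwise preference records.

    Hypothetical helper: the field names are an assumed layout,
    not the tool's documented export format.
    """
    pairs = []
    # combinations() keeps input order, so the first item of each pair
    # is always ranked higher than the second.
    for better, worse in combinations(ranked_answers, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


if __name__ == "__main__":
    ranked = ["answer A", "answer B", "answer C"]  # human ranking, best first
    for record in ranking_to_pairs("What is RLHF?", ranked):
        # One JSON object per line (JSONL); ensure_ascii=False keeps
        # non-ASCII text such as Chinese readable in the output.
        print(json.dumps(record, ensure_ascii=False))
```

A ranking of n answers yields n(n-1)/2 pairs, so ranking three answers for one prompt gives three training records.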