AI初步_基于基础模型训练一个问答模型

愿景

I7-10700F 16G内存 1660super(6G显存) 显卡
- 内存可以加,显卡买不起
deepin 20.9 社区版
已通过nvidia官方下载驱动的方式,安装CUDA套件
已安装 anaconda3并安装配置pytorch虚拟环境
- pytorch官网:https://pytorch.org/get-started/locally/
- conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
- 注:安装时间较久,下载过程可重试

参考 https://blog.csdn.net/weixin_44599230/article/details/130005752

按上文下载了作者项目和数据集,将batch_size 设置为2跑了一遍
- 用时4小时以上
- GPU使用率在45%
第一次AI尝试成功..还是有点小兴奋,but:
- 由于数据集质量问题,问答效果不理想
- 感觉这样的模型也没有啥意义
我需要一个质量高的,契合个人的数据集,不然这个事就没有意义
同时要不要换模型,这个模型是一个1024的模型,数据不能长
- 可换 Alpaca-LoRA 模型
- 参考 https://www.fenchuan8.com/shows/3037.html

编程问答数据积
- https://hyper.ai/datasets/30703
- 这个全是英文的,考虑要不要至少把问题转成中文
Alpaca-LoRA的中文数据集
- https://github.com/LC1332/Chinese-alpaca-lora/blob/main/data/trans_chinese_alpaca_data.json
中文社区语料(部分已过时)
- https://github.com/brightmart/nlp_chinese_corpus
- https://computenest.console.aliyun.com/dataset/service/service-da8a5b4ed4994ddc8890/detail/detail (阿里计算巢)
阿里计算巢数据
- https://computenest.console.aliyun.com/dataset/service/detail
- 维基百科(wiki2019zh) - 中文
- CMMLU评测集 - 中文
- 流萤Firefly数据集firefly-train-1.1M (不确定通用性)
- 中英文翻译语料
- 知乎问题数据集
Alpaca Chinese Dataset翻译语料
- https://github.com/open-chinese/alpaca-chinese-dataset