kinokokoro/sharegpt_filtered
收藏Hugging Face2024-05-19 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/kinokokoro/sharegpt_filtered
下载链接
链接失效反馈官方服务:
资源简介:
Share GPT filtered according to the following criteria, as well as manual filtering for the first 350 items.
1. Non-Japanese responses:
Error Type: ChatGPT responds in another language (English, Chinese, Korean, etc.) even though this is not desired behavior for a Japanese-language model. Note that the USER may speak in any language he or she prefers, but the model should always respond in Japanese.
Example: human: 西部世界第一季有多少集 gpt: 西部世界第一季有 10 集。
Solution: Split the conversation into turns and check for the presence of a hiragana or katakana character within gpt's respond. All proper Japanese responses should contain at least one hiragana/katakana character, so we can avoid the need for more complex language detection schemes.
Note: There is one exception to this, and that's when the USER has asked for another language (Such as an English or Chinese translation, which is a common use case for GPTs), so we will detect the presence of the 語 (Language) kanji in the entire doc, and if we see it, assume that the user is asking for a translation.
2. API-related Errors:
Error Type: A failed API call causes the response to not be properly recorded.
Example: APIError: [400]: Your input contains more than the maximum of 50000 characters in a single cell.
Solution: Check if there's an answer from GPT at all. If there's no answer from GPT, discard the conversation.
3. Content Policy violations:
Error Type: ChatGPT refuses to answer due to a content policy violation. Obviously we do not want this.
Example: "This content may violate our content policy. If you believe this to be in error, please submit your feedback — your input will aid our research in this area."
Solution: Remove any fields that contain the word "content policy". The exact wording ChatGPT uses for this varies but the word "content policy" is constant across all of them.
4. Inserted links
Error Type: ChatGPT inserts links into its responses sometimes, which is undesirable behavior, especially for our system which lacks any kind of link capability.
Example: 3. Teamsで会議した内容が自動で議事録化され、その中で議論されていない内容までAIが提示してくる (参考動画:[https://www.youtube.com/watch?v=IwX2vGXF8BA)
Solution: Remove any links that are present in ChatGPT's responses, but not in the user's text. (We can envision a scenario where the user links into the text and asks ChatGPT to do something with them, for instance, making a press release with a link to a website.)
5. Obsolete Knowledge Cutoffs:
Error Type: Responses include obsolete knowledge cutoffs, like 2022. Some of these datasets are quite old, and date back to the original ChatGPT knowledge cutoff, which is late 2021.
Example: また、私の知識は2021年9月までのものであり、それ以降の情報は持っていませんので、その点にご注意ください。指定された情報源に基づいた回答を提供する場合、具体的なURLや記事名を指定して質問していただくとより正確な回答が得られることがあります。
Solution: Filter anything where GPT's response contains the words 私 and 2021, 2022, or 2023.
提供机构:
kinokokoro
原始信息汇总
数据集过滤标准总结
-
非日语响应过滤
- 问题描述:ChatGPT在应答时使用非日语(如英语、中文等),尽管期望模型仅使用日语响应。
- 解决方案:通过检查响应中是否包含平假名或片假名字符来判断是否为日语。若用户请求翻译,则通过检测文档中是否包含“語”字来识别。
-
API相关错误处理
- 问题描述:API调用失败导致响应未被正确记录。
- 解决方案:若GPT未提供响应,则丢弃该对话。
-
内容政策违规处理
- 问题描述:ChatGPT因内容政策违规拒绝回答。
- 解决方案:移除包含“内容政策”字样的字段。
-
插入链接的处理
- 问题描述:ChatGPT在响应中插入链接,这在某些系统中是不希望的行为。
- 解决方案:移除ChatGPT响应中用户文本未包含的链接。
-
过时知识截止处理
- 问题描述:响应中包含过时的知识截止信息,如2021年、2022年或2023年。
- 解决方案:过滤掉响应中包含“私”和“2021”、“2022”、“2023”字样的内容。



