bsc-dolly-15k-en

ModelScope community · updated 2025-12-05 · indexed 2025-02-01
Download link:
https://modelscope.cn/datasets/BSC-LT/bsc-dolly-15k-en
Description:
## BSC Dolly 15k EN

Reviewed version of the [Argilla Dolly v2 English version](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual), originally created by [Databricks](https://huggingface.co/datasets/databricks/databricks-dolly-15k).

We provide two subsets: "annotated", where some instances are labelled with potential problems; and "filtered", which contains only the instances without the issues that we observed.

## Annotation process

While analysing the Argilla Dolly v2 English version, we observed the following:

1. Task classification:
   - Three classes come with context: 'Closed QA', 'Information Extraction' and 'Summarization'. The rest have no context.
   - Context is not necessary in all cases, and some instructions already contain their own context.
   - Some categories are incorrect (the intent does not always correspond to the category).
2. Confusion between "Summarization" and "Open Generative QA" / "Information Extraction" tasks:
   - Tasks categorized as "Summarization" in some cases have the intent of "Open Generative QA" / "Information Extraction", and due to their dependency on context, the answer is longer.
3. To note:
   - 15,014 examples, half of them of "QA" type in various formats.
   - 70% have no context; when context is present, it comes from the opening part of Wikipedia articles.
   - Many answers are also from Wikipedia.
   - Possible improvements in cleaning up text extracted from Wikipedia and handling acronyms.
4. Errors in the dataset:
   - Some summaries are longer than the original text.
   - Some contexts in "Information Extraction" do not contain the exact information needed to answer the question asked.
   - There are many repeated questions that are kept because the answer is different in each case.

From the previous observations, we performed the following processing:

- Processed the "context" column to:
  - Remove spellings, citations, or unit conversions inside (parentheses) and [brackets].
  - Remove source webpage links.
- Removed:
  - Summary instances where the intent is clear and the response is longer than the context (63)
  - Instances where the information is not explicitly mentioned in the context (3)
  - Instances with webpage links in the response or instruction (29)
  - Exact (instruction/context/response) duplicates (14)
  - Instruction/context duplicates (9)
  - Instances where the instruction is highly similar to the response (6)
- Changes:
  - Some instances in Summarization / Information Extraction / Closed QA lack context after Argilla's curation process. These instances are moved to General QA since they no longer have context and ask about specifics (86).
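The context cleaning and a few of the removal rules above can be approximated in a few lines of Python. This is a minimal sketch and not the authors' actual pipeline. The regular expressions, the function names `clean_context`, `keep` and `filter_rows`, and the category value `"summarization"` are assumptions made for illustration. The column names `instruction`, `context`, `response` and `category` follow the original Dolly schema. The sketch strips every parenthesised or bracketed span, which is broader than the rule described, and it cannot judge whether a summary's "intent is clear", so it only compares lengths.

```python
import re

# Hypothetical patterns: the dataset card does not publish its exact rules.
PAREN_OR_BRACKET = re.compile(r"\s*(?:\([^()]*\)|\[[^\[\]]*\])")
URL = re.compile(r"https?://\S+")


def clean_context(context: str) -> str:
    """Drop (parenthesised) and [bracketed] asides, then links, then tidy spaces."""
    text = PAREN_OR_BRACKET.sub("", context)
    text = URL.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()


def keep(row: dict) -> bool:
    """Return False for rows matching two of the removal rules."""
    # Rule: webpage links in the instruction or the response.
    if URL.search(row["instruction"]) or URL.search(row["response"]):
        return False
    # Rule (simplified): summaries whose response is longer than the context.
    if (
        row["category"] == "summarization"  # assumed label value
        and row["context"]
        and len(row["response"]) > len(row["context"])
    ):
        return False
    return True


def filter_rows(rows: list[dict]) -> list[dict]:
    """Clean each context, drop flagged rows and exact triple duplicates."""
    seen = set()
    kept = []
    for row in rows:
        row = {**row, "context": clean_context(row["context"])}
        key = (row["instruction"], row["context"], row["response"])
        if key in seen or not keep(row):
            continue
        seen.add(key)
        kept.append(row)
    return kept


print(clean_context(
    "The Eiffel Tower (French: tour Eiffel) is 330 metres (1,083 ft) tall.[1]"
))
# -> The Eiffel Tower is 330 metres tall.
```

Only exact instruction/context/response triples are deduplicated here, so repeated questions with different answers survive, in line with the observation above that such repeats are kept.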

Provider:
maas
Created:
2025-01-26