Chrisyichuan/text-qa-pair

Name: Chrisyichuan/text-qa-pair
Creator: Chrisyichuan
Published: 2026-04-10 09:04:55
License: No description available

Hugging Face, updated 2026-04-10, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Chrisyichuan/text-qa-pair

Resource overview:

---
license: mit
task_categories:
- question-answering
- text-retrieval
language:
- en
pretty_name: Text QA Pairs With Filtered Hard Negatives
size_categories:
- 10K<n<100K
---

# Text QA Pair Dataset With Filtered Hard Negatives

This dataset contains English Wikipedia text QA pairs together with retrieval candidates and LLM-filtered hard negatives.

The source pairs were first mined with text retrieval (`top20` candidates per query), then filtered with an LLM:

1. Check whether the positive passage can answer the query.
2. Scan non-positive retrieval candidates in rank order.
3. If a candidate passage is judged `CORRECT`, it is treated as a false negative and skipped.
4. If a candidate passage is judged `WRONG` or `CANNOT_ANSWER`, it is kept as a hard negative.
5. Keep examples only when at least 7 filtered hard negatives are found.

## Summary

- Total chunks: `10`
- Input examples: `18,572`
- Kept examples: `14,952`
- Skipped examples: `3,620`
- API errors during filtering: `0`
- Skipped because not enough hard negatives: `3,504`
- Skipped because the positive passage was not judged correct: `116`

## Directory Layout

The dataset is sharded by row range:

```text
chunk_000000_001999/
chunk_002000_003999/
...
chunk_018000_018571/
```

Each chunk contains:

- `filtered_hn.jsonl`: final kept examples for training
- `candidate_reviews.jsonl`: per-candidate LLM review logs
- `summary.json`: chunk-level aggregate statistics

## `filtered_hn.jsonl` Format

Each line is one JSON object.

Main fields:

- `query`: natural-language question
- `answer`: short answer from the source generation step
- `passage`: positive passage text
- `article_id`: positive article ID
- `chunk_index`: positive chunk index within the article
- `source_sentence`: supporting sentence used during pair generation
- `source_type`: source type metadata
- `positive_rank`: rank of the positive chunk in retrieval top-20
- `positive_score`: retrieval score of the positive chunk
- `retrieve_top20`: raw top-20 retrieval candidates
- `neg_hits`: filtered hard-negative candidates that survived LLM filtering
- `neg_passages`: text-only view of `neg_hits`
- `source_positive_rank`: copied positive rank metadata from the mining stage
- `source_positive_score`: copied positive score metadata from the mining stage

`neg_hits` entries contain:

- `rank`
- `score`
- `article_id`
- `chunk_index`
- `char_offset`
- `n_tokens`
- `title`
- `url`
- `text`

## `candidate_reviews.jsonl` Format

This file stores the LLM filtering trace for auditing:

- one row for the positive passage review
- one row for each inspected candidate passage

Common fields include:

- `example_index`
- `query`
- `candidate_rank`
- `candidate_article_id`
- `candidate_chunk_index`
- `candidate_score`
- `candidate_title`
- `candidate_url`
- `answer`
- `verdict`
- `path_role`

Possible verdicts:

- `CORRECT`
- `WRONG`
- `CANNOT_ANSWER`
- `API_ERROR`

## Example

Example `filtered_hn.jsonl` record shape:

```json
{
  "query": "Who was the first Hong Kong athlete to win medals in two different Olympic Games?",
  "answer": "Lee Wai-sze",
  "article_id": 4353883,
  "chunk_index": 1,
  "positive_rank": 14,
  "positive_score": 0.6786706447601318,
  "neg_hits": [
    {
      "rank": 1,
      "score": 0.7558059096336365,
      "article_id": 3427201,
      "chunk_index": 0,
      "title": "Hong Kong at the Olympics",
      "url": "https://en.wikipedia.org/wiki/Hong_Kong_at_the_Olympics",
      "text": "..."
    }
  ],
  "neg_passages": [
    "Hong Kong at the Olympics ..."
  ]
}
```

## Notes

- This release is stored in chunked form to make long-running filtering and uploads resumable.
- `filtered_hn.jsonl` is the training-ready output.
- `candidate_reviews.jsonl` is mainly for inspection, auditing, and debugging false-negative filtering behavior.
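
The five filtering steps listed at the top of this card can be sketched in Python roughly as follows. This is an illustrative reconstruction, not the author's actual pipeline: `filter_example` and the `judge` callable are hypothetical names, the sketch assumes `retrieve_top20` entries carry the same `rank`, `article_id`, `chunk_index`, and `text` keys as `neg_hits` entries, and it assumes the positive is identified among the candidates by its `(article_id, chunk_index)` pair. Whether the original scan stops once 7 negatives are found is not stated, so this version scans every candidate.

```python
from typing import Callable, Optional

MIN_HARD_NEGATIVES = 7  # threshold stated in the dataset card


def filter_example(
    example: dict,
    judge: Callable[[str, str], str],  # hypothetical: (query, passage) -> verdict string
    min_negatives: int = MIN_HARD_NEGATIVES,
) -> Optional[dict]:
    """Return the example with filtered negatives, or None if it is skipped."""
    query = example["query"]

    # Step 1: the positive passage must be judged able to answer the query.
    if judge(query, example["passage"]) != "CORRECT":
        return None

    # Assumption: the positive is recognised by (article_id, chunk_index).
    positive_key = (example["article_id"], example["chunk_index"])
    neg_hits = []

    # Step 2: scan the non-positive candidates in rank order.
    for hit in sorted(example["retrieve_top20"], key=lambda h: h["rank"]):
        if (hit["article_id"], hit["chunk_index"]) == positive_key:
            continue
        verdict = judge(query, hit["text"])
        # Step 3: CORRECT marks a false negative, so it is skipped.
        # Step 4: WRONG or CANNOT_ANSWER is kept as a hard negative.
        if verdict in ("WRONG", "CANNOT_ANSWER"):
            neg_hits.append(hit)

    # Step 5: keep the example only when enough hard negatives survive.
    if len(neg_hits) < min_negatives:
        return None
    return {
        **example,
        "neg_hits": neg_hits,
        # Assumption: the text-only view is just each hit's `text` field.
        "neg_passages": [h["text"] for h in neg_hits],
    }
```

In this sketch any other verdict, such as `API_ERROR`, is neither kept as a negative nor counted toward the threshold.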
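
Because the release is sharded into `chunk_*` directories, reading the training-ready records means walking every chunk's `filtered_hn.jsonl`. The following is a minimal sketch under stated assumptions: `root` is a hypothetical local directory holding a downloaded copy of the dataset, and `iter_kept_examples` and `to_training_tuple` are illustrative helper names that are not part of the release.

```python
import json
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_kept_examples(root: str) -> Iterator[dict]:
    """Yield records from every chunk_*/filtered_hn.jsonl under `root`."""
    # Chunk names are zero-padded, so lexicographic order matches row order.
    for path in sorted(Path(root).glob("chunk_*/filtered_hn.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():  # tolerate blank trailing lines
                    yield json.loads(line)


def to_training_tuple(record: dict, n_negatives: int = 7) -> Tuple[str, str, List[str]]:
    """Return (query, positive passage, first `n_negatives` hard negatives)."""
    return record["query"], record["passage"], record["neg_passages"][:n_negatives]
```

The default of 7 negatives mirrors the minimum that every kept example is stated to have, so each tuple gets the same number of negatives.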
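
For auditing, a quick tally of the `verdict` field in one chunk's `candidate_reviews.jsonl` shows how often the LLM marked candidates as false negatives. This is a sketch only: `verdict_counts` is a hypothetical helper, and it assumes every row carries a `verdict` key as listed under the common fields.

```python
import json
from collections import Counter
from pathlib import Path


def verdict_counts(reviews_path: str) -> Counter:
    """Count review rows per verdict in one candidate_reviews.jsonl file."""
    counts: Counter = Counter()
    with Path(reviews_path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                counts[json.loads(line)["verdict"]] += 1
    return counts
```

The tally covers both the positive-passage review rows and the candidate rows; filtering on `path_role` first would separate them, although its possible values are not documented in this card.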

Provider:

Chrisyichuan
