Chrisyichuan/text-qa-pair
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Chrisyichuan/text-qa-pair
Resource description:
---
license: mit
task_categories:
- question-answering
- text-retrieval
language:
- en
pretty_name: Text QA Pairs With Filtered Hard Negatives
size_categories:
- 10K<n<100K
---
# Text QA Pair Dataset With Filtered Hard Negatives
This dataset contains English Wikipedia text QA pairs together with retrieval candidates and LLM-filtered hard negatives.
The source pairs were first mined with text retrieval (`top20` candidates per query), then filtered with an LLM:
1. Check whether the positive passage can answer the query.
2. Scan non-positive retrieval candidates in rank order.
3. If a candidate passage is judged `CORRECT`, it is treated as a false negative and skipped.
4. If a candidate passage is judged `WRONG` or `CANNOT_ANSWER`, it is kept as a hard negative.
5. Keep examples only when at least 7 filtered hard negatives are found.
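The five steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `filter_example`, the `judge` callable, and the way the positive is identified among the candidates (matching `article_id` and `chunk_index`) are all assumptions. The sketch scans every candidate, whereas the real pipeline may stop early or treat `API_ERROR` differently.

```python
from typing import Callable, Dict, List, Optional

MIN_HARD_NEGATIVES = 7  # threshold stated in step 5


def filter_example(
    query: str,
    positive: Dict,
    candidates: List[Dict],
    judge: Callable[[str, str], str],  # hypothetical LLM wrapper returning a verdict string
    min_negatives: int = MIN_HARD_NEGATIVES,
) -> Optional[List[Dict]]:
    """Return the kept hard negatives, or None if the example is skipped."""
    # Step 1: the positive passage must be judged able to answer the query.
    if judge(query, positive["text"]) != "CORRECT":
        return None

    positive_key = (positive["article_id"], positive["chunk_index"])
    hard_negatives: List[Dict] = []

    # Step 2: walk the non-positive candidates in rank order.
    for cand in sorted(candidates, key=lambda c: c["rank"]):
        if (cand["article_id"], cand["chunk_index"]) == positive_key:
            continue  # assumption: the positive is recognised by its IDs
        verdict = judge(query, cand["text"])
        # Step 3: CORRECT means a false negative, so it is not kept.
        # Step 4: WRONG or CANNOT_ANSWER is kept as a hard negative.
        if verdict in ("WRONG", "CANNOT_ANSWER"):
            hard_negatives.append(cand)

    # Step 5: keep the example only if enough hard negatives survived.
    if len(hard_negatives) < min_negatives:
        return None
    return hard_negatives
```

Passing the judge in as a callable keeps the sketch independent of any particular LLM client.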
## Summary
- Total chunks: `10`
- Input examples: `18,572`
- Kept examples: `14,952`
- Skipped examples: `3,620`
- API errors during filtering: `0`
- Skipped because not enough hard negatives: `3,504`
- Skipped because the positive passage was not judged correct: `116`
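The counts above are internally consistent, which a short check can confirm. The dictionary keys below are illustrative names, not the documented keys of `summary.json`, and counting API errors as part of the skipped total is an assumption (the value is zero here, so it does not affect the result).

```python
from typing import Dict


def check_summary(counts: Dict[str, int]) -> float:
    """Verify that the summary counts add up and return the keep rate."""
    skipped_parts = (
        counts["skipped_not_enough_negatives"]
        + counts["skipped_positive_not_correct"]
        + counts["api_errors"]  # assumption: API errors would count as skips
    )
    if skipped_parts != counts["skipped"]:
        raise ValueError("skip reasons do not sum to the skipped total")
    if counts["kept"] + counts["skipped"] != counts["input"]:
        raise ValueError("kept + skipped does not equal the input total")
    return counts["kept"] / counts["input"]


# Values copied from the summary list; key names are hypothetical.
reported = {
    "input": 18572,
    "kept": 14952,
    "skipped": 3620,
    "api_errors": 0,
    "skipped_not_enough_negatives": 3504,
    "skipped_positive_not_correct": 116,
}
keep_rate = check_summary(reported)
print(f"keep rate: {keep_rate:.1%}")
```

With the reported numbers, roughly four in five input examples survive filtering.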
## Directory Layout
The dataset is sharded by row range. Each directory name encodes the zero-based, inclusive start and end row indices of its shard:
```text
chunk_000000_001999/
chunk_002000_003999/
...
chunk_018000_018571/
```
Each chunk contains:
- `filtered_hn.jsonl`: final kept examples for training
- `candidate_reviews.jsonl`: per-candidate LLM review logs
- `summary.json`: chunk-level aggregate statistics
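Given this layout, the training examples can be read by walking the chunk directories in order. The sketch below assumes the dataset has already been downloaded to a local directory; `iter_filtered_examples` and the `root` argument are illustrative names.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Union


def iter_filtered_examples(root: Union[str, Path]) -> Iterator[Dict]:
    """Yield every record from each chunk's filtered_hn.jsonl, in chunk order."""
    # Chunk names are zero-padded, so lexicographic order matches row order.
    for chunk_dir in sorted(Path(root).glob("chunk_*")):
        path = chunk_dir / "filtered_hn.jsonl"
        if not path.is_file():
            continue  # tolerate chunks that are missing the file
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

A generator is used so that chunks are streamed rather than loaded into memory all at once.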
## `filtered_hn.jsonl` Format
Each line is one JSON object. Main fields:
- `query`: natural-language question
- `answer`: short answer from the source generation step
- `passage`: positive passage text
- `article_id`: positive article ID
- `chunk_index`: positive chunk index within the article
- `source_sentence`: supporting sentence used during pair generation
- `source_type`: source type metadata
- `positive_rank`: rank of the positive chunk in retrieval top-20
- `positive_score`: retrieval score of the positive chunk
- `retrieve_top20`: raw top-20 retrieval candidates
- `neg_hits`: filtered hard-negative candidates that survived LLM filtering
- `neg_passages`: text-only view of `neg_hits`
- `source_positive_rank`: copied positive rank metadata from the mining stage
- `source_positive_score`: copied positive score metadata from the mining stage
`neg_hits` entries contain:
- `rank`
- `score`
- `article_id`
- `chunk_index`
- `char_offset`
- `n_tokens`
- `title`
- `url`
- `text`
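One way to use these fields for contrastive training is to expand each record into (query, positive, negative) triplets. This is only a sketch of one possible consumer: `iter_triplets`, the `max_negatives` parameter, and the output key names are assumptions, not part of the dataset.

```python
from typing import Dict, Iterator, Optional


def iter_triplets(example: Dict, max_negatives: Optional[int] = None) -> Iterator[Dict]:
    """Expand one filtered_hn.jsonl record into training triplets."""
    negatives = example["neg_passages"]
    if max_negatives is not None:
        # neg_hits are gathered in rank order, so this keeps the top-ranked ones
        # (assuming neg_passages follows the same order).
        negatives = negatives[:max_negatives]
    for negative in negatives:
        yield {
            "query": example["query"],
            "positive": example["passage"],
            "negative": negative,
        }
```

When rank, score, or title metadata is needed alongside the text, iterate over `neg_hits` instead of `neg_passages`.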
## `candidate_reviews.jsonl` Format
This file stores the LLM filtering trace for auditing:
- one row for the positive passage review
- one row for each inspected candidate passage
Common fields include:
- `example_index`
- `query`
- `candidate_rank`
- `candidate_article_id`
- `candidate_chunk_index`
- `candidate_score`
- `candidate_title`
- `candidate_url`
- `answer`
- `verdict`
- `path_role`
Possible verdicts:
- `CORRECT`
- `WRONG`
- `CANNOT_ANSWER`
- `API_ERROR`
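For auditing, the review rows can be grouped by example and tallied by verdict. The sketch below relies only on the `example_index` and `verdict` fields listed above; `verdicts_by_example` is an illustrative helper name.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def verdicts_by_example(rows: Iterable[Dict]) -> Dict[int, Counter]:
    """Count verdicts per example_index across candidate review rows."""
    per_example: Dict[int, Counter] = defaultdict(Counter)
    for row in rows:
        per_example[row["example_index"]][row["verdict"]] += 1
    return dict(per_example)
```

Examples with an unusually high `CORRECT` count among their candidates are natural starting points when inspecting false-negative filtering.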
## Example
Example `filtered_hn.jsonl` record shape:
```json
{
  "query": "Who was the first Hong Kong athlete to win medals in two different Olympic Games?",
  "answer": "Lee Wai-sze",
  "article_id": 4353883,
  "chunk_index": 1,
  "positive_rank": 14,
  "positive_score": 0.6786706447601318,
  "neg_hits": [
    {
      "rank": 1,
      "score": 0.7558059096336365,
      "article_id": 3427201,
      "chunk_index": 0,
      "title": "Hong Kong at the Olympics",
      "url": "https://en.wikipedia.org/wiki/Hong_Kong_at_the_Olympics",
      "text": "..."
    }
  ],
  "neg_passages": [
    "Hong Kong at the Olympics ..."
  ]
}
```
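A loaded record can be sanity-checked against the properties described in this card. The checks below are a sketch: `validate_record` is a hypothetical helper, and the assumption that `neg_passages` has exactly one entry per `neg_hits` entry is inferred from its description as a text-only view.

```python
from typing import Dict, List


def validate_record(record: Dict, min_negatives: int = 7) -> List[str]:
    """Return a list of human-readable problems; empty means the record looks fine."""
    problems: List[str] = []
    neg_hits = record.get("neg_hits", [])
    neg_passages = record.get("neg_passages", [])

    # The filtering step keeps examples only with at least 7 hard negatives.
    if len(neg_hits) < min_negatives:
        problems.append(f"only {len(neg_hits)} hard negatives")

    # Assumption: neg_passages mirrors neg_hits one-to-one.
    if len(neg_passages) != len(neg_hits):
        problems.append("neg_passages and neg_hits differ in length")

    # Hard negatives come from non-positive candidates only.
    positive_key = (record.get("article_id"), record.get("chunk_index"))
    for hit in neg_hits:
        if (hit.get("article_id"), hit.get("chunk_index")) == positive_key:
            problems.append("positive chunk appears among neg_hits")
            break
    return problems
```

The abbreviated record above shows a single hard negative for brevity, so it would be flagged by the count check; complete records should carry at least seven.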
## Notes
- This release is stored in chunked form to make long-running filtering and uploads resumable.
- `filtered_hn.jsonl` is the training-ready output.
- `candidate_reviews.jsonl` is mainly for inspection, auditing, and debugging false-negative filtering behavior.
Provider:
Chrisyichuan



