dgabri3le/english-ipa-dep-treebank
Hugging Face | Updated 2026-04-20 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/dgabri3le/english-ipa-dep-treebank
Description:
---
language:
- en
license: cc-by-4.0
task_categories:
- text-generation
- token-classification
tags:
- ipa
- phonetics
- dependency-parsing
- universal-dependencies
- syntax
- treebank
- phonology
size_categories:
- 10M<n<100M
---
# English IPA Dependency Treebank
A large-scale dataset of **10.4 million English sentences** paired with IPA (International Phonetic Alphabet) transcriptions and Universal Dependencies syntactic annotations.
Each sentence includes its full dependency parse — head indices, relation labels, and a linearized tagged-IPA representation that interleaves syntactic roles with phonetic content.
## Dataset Structure
Each sample contains:
| Field | Type | Description |
|-------|------|-------------|
| `raw_english` | string | Original English text before contraction expansion (e.g. "don't", "it's") |
| `english` | string | English sentence with contractions expanded (e.g. "do not", "it is"), lowercased and filtered. Identical to `raw_english` when the sentence contains no contractions. |
| `ipa` | string | IPA transcription (word-aligned with `english`) |
| `dep_labels` | list[string] | Universal Dependencies relation label per word |
| `dep_heads` | list[int] | Head word index per word (1-indexed, 0 = ROOT) |
| `tagged_ipa` | string | Linearized format: `[role] ipa_word [role] ipa_word ...` |
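Because the fields are parallel, one sample can be viewed as a list of per-word records. The sketch below is illustrative only: `WordRecord` and `word_records` are hypothetical helper names, and it assumes that splitting `english` and `ipa` on whitespace recovers the aligned words.

```python
from typing import NamedTuple


class WordRecord(NamedTuple):
    index: int   # 1-based position in the sentence
    word: str    # token from `english`
    ipa: str     # aligned token from `ipa`
    label: str   # entry from `dep_labels`
    head: int    # entry from `dep_heads` (0 = ROOT)


def word_records(sample: dict) -> list[WordRecord]:
    """Zip the parallel fields of one sample into per-word records."""
    # Assumption: whitespace splitting yields the aligned words.
    words = sample["english"].split()
    ipa_words = sample["ipa"].split()
    labels = sample["dep_labels"]
    heads = sample["dep_heads"]
    if not (len(words) == len(ipa_words) == len(labels) == len(heads)):
        raise ValueError("fields are not word-aligned")
    return [
        WordRecord(i, w, p, l, h)
        for i, (w, p, l, h) in enumerate(zip(words, ipa_words, labels, heads), start=1)
    ]


sample = {
    "english": "the cat sat on the table",
    "ipa": "ðʌ kæt sæt ɑn ðʌ teɪbʌl",
    "dep_labels": ["det", "nsubj", "root", "case", "det", "obl"],
    "dep_heads": [2, 3, 0, 6, 6, 3],
}
records = word_records(sample)
```

Each record then carries everything known about one word, which is convenient for filtering or printing parse tables like the ones in the examples.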
## Examples
### Example 1: Simple transitive sentence
**English:** `the cat sat on the table`
**IPA:** `ðʌ kæt sæt ɑn ðʌ teɪbʌl`
**Dependency parse:**
```
idx  word     role      head  head_word
───  ───────  ────────  ────  ─────────
1    the      det       2     cat
2    cat      nsubj     3     sat
3    sat      root      0     ROOT
4    on       case      6     table
5    the      det       6     table
6    table    obl       3     sat
```
**Tree diagram:**
```
          sat (root)
         ╱           ╲
cat (nsubj)         table (obl)
    │                ╱       ╲
the (det)     on (case)     the (det)
```
**Tagged IPA:** `[det] ðʌ [nsubj] kæt [root] sæt [case] ɑn [det] ðʌ [obl] teɪbʌl`
### Example 2: Complex sentence with subordinate clause
**English:** `the researchers found that the compound reduces inflammation effectively`
**IPA:** `ðʌ ɹisɝtʃɝz faʊnd ðæt ðʌ kɑmpaʊnd ɹɪdusɪz ɪnflʌmeɪʃʌn ɪfɛktɪvli`
**Dependency parse:**
```
idx  word            role      head  head_word
───  ──────────────  ────────  ────  ──────────────
1    the             det       2     researchers
2    researchers     nsubj     3     found
3    found           root      0     ROOT
4    that            mark      7     reduces
5    the             det       6     compound
6    compound        nsubj     7     reduces
7    reduces         ccomp     3     found
8    inflammation    obj       7     reduces
9    effectively     advmod    7     reduces
```
**Tree diagram:**
```
                      found (root)
                  ╱                   ╲
researchers (nsubj)                     reduces (ccomp)
         │                    ╱        │            │           ╲
     the (det)        that (mark)   compound   inflammation   effectively
                                    (nsubj)    (obj)          (advmod)
                                       │
                                   the (det)
```
**Tagged IPA:** `[det] ðʌ [nsubj] ɹisɝtʃɝz [root] faʊnd [mark] ðæt [det] ðʌ [nsubj] kɑmpaʊnd [ccomp] ɹɪdusɪz [obj] ɪnflʌmeɪʃʌn [advmod] ɪfɛktɪvli`
## How Dependency Trees Work
A dependency tree represents the syntactic structure of a sentence as directed links between words. Every word points to its **head** (the word it depends on) via a labeled **relation**.
### Key concepts
- **Root**: The main predicate of the sentence (usually the main verb). Its head index is 0 (no parent).
- **Head**: Each non-root word has exactly one head — the word that governs it syntactically.
- **Relation**: The label on the link describes the grammatical function: `nsubj` (subject), `obj` (object), `det` (determiner), `amod` (adjectival modifier), etc.
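These properties can be checked mechanically. The following is a minimal sketch, not part of the dataset tooling: `is_valid_tree` is a hypothetical helper, and it assumes each sentence has exactly one root, as Universal Dependencies trees normally do.

```python
def is_valid_tree(dep_heads: list[int]) -> bool:
    """Check that a head array describes a single-rooted tree."""
    n = len(dep_heads)
    # Assumption: exactly one word has head 0 (the root).
    if sum(1 for h in dep_heads if h == 0) != 1:
        return False
    # Every head must be 0 or a valid 1-based word position.
    if any(h < 0 or h > n for h in dep_heads):
        return False
    # Following head links from any word must reach ROOT without a cycle.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = dep_heads[node - 1]
    return True
```

The walk from every word is quadratic in the worst case, which is acceptable here because sentences are short.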
### Reading the `dep_heads` array
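`dep_heads` is a list of integers, one per word. Word positions use **1-based indexing**, so `i` and `j` below count words from 1; in a 0-indexed Python list, the head of word `i` is stored at `dep_heads[i - 1]`: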
- `dep_heads[i] = j` means word `i` depends on word `j`
- `dep_heads[i] = 0` means word `i` is the root
For example, in `[2, 3, 0, 6, 6, 3]`:
- Word 1 → depends on word 2
- Word 2 → depends on word 3
- Word 3 → ROOT (head = 0)
- Word 4 → depends on word 6
- Word 5 → depends on word 6
- Word 6 → depends on word 3
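The same reading can be done in code. This sketch uses two hypothetical helpers, `head_words` and `children_by_head`, to resolve each head index to a word and to invert the array into a parent-to-children map.

```python
from collections import defaultdict


def head_words(words: list[str], dep_heads: list[int]) -> list[str]:
    """Return the head word of each word; the root maps to "ROOT"."""
    # Head values are 1-based positions, so subtract 1 to index the list.
    return ["ROOT" if h == 0 else words[h - 1] for h in dep_heads]


def children_by_head(dep_heads: list[int]) -> dict[int, list[int]]:
    """Map each head position (0 = ROOT) to the 1-based positions of its dependents."""
    children = defaultdict(list)
    for position, head in enumerate(dep_heads, start=1):
        children[head].append(position)
    return dict(children)


words = "the cat sat on the table".split()
dep_heads = [2, 3, 0, 6, 6, 3]

print(head_words(words, dep_heads))
print(children_by_head(dep_heads))
```

The children map is the form needed for top-down traversals such as drawing the tree diagrams shown in the examples.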
### The `tagged_ipa` format
The `tagged_ipa` field provides a linearized representation that interleaves each word's dependency role with its IPA transcription:
```
[det] ðʌ [nsubj] kæt [root] sæt [case] ɑn [det] ðʌ [obl] teɪbʌl
```
This format preserves the original word order while annotating each word's syntactic function. It can be used directly as input to sequence models that need both phonetic and syntactic information.
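If the roles and IPA words are needed separately, the string can be split back into pairs. The sketch below assumes that neither a role nor an IPA word contains whitespace; `parse_tagged_ipa` and `build_tagged_ipa` are hypothetical helper names.

```python
import re

# One "[role] ipa_word" pair: a bracketed role followed by a whitespace-free word.
_PAIR = re.compile(r"\[([^\]\s]+)\]\s+(\S+)")


def parse_tagged_ipa(tagged_ipa: str) -> list[tuple[str, str]]:
    """Split a tagged-IPA string into (role, ipa_word) pairs, in sentence order."""
    return _PAIR.findall(tagged_ipa)


def build_tagged_ipa(dep_labels: list[str], ipa: str) -> str:
    """Interleave relation labels with the whitespace-separated IPA words."""
    return " ".join(f"[{label}] {word}" for label, word in zip(dep_labels, ipa.split()))


tagged = "[det] ðʌ [nsubj] kæt [root] sæt [case] ɑn [det] ðʌ [obl] teɪbʌl"
pairs = parse_tagged_ipa(tagged)
rebuilt = build_tagged_ipa([role for role, _ in pairs], " ".join(w for _, w in pairs))
```

Rebuilding the string from the parsed pairs should reproduce the original field, which makes a convenient round-trip check.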
### Common Universal Dependencies relations
| Relation | Meaning | Example |
|----------|---------|---------|
| `root` | Main predicate | *sat* in "the cat **sat**" |
| `nsubj` | Nominal subject | *cat* in "the **cat** sat" |
| `obj` | Direct object | *fish* in "the cat ate **fish**" |
| `det` | Determiner | *the* in "**the** cat" |
| `amod` | Adjectival modifier | *big* in "the **big** cat" |
| `advmod` | Adverbial modifier | *quickly* in "ran **quickly**" |
| `case` | Case marker / preposition | *on* in "sat **on** the table" |
| `obl` | Oblique nominal | *table* in "sat on the **table**" |
| `nmod` | Nominal modifier | *wood* in "table of **wood**" |
| `conj` | Conjunct | *dogs* in "cats and **dogs**" |
| `cc` | Coordinating conjunction | *and* in "cats **and** dogs" |
| `mark` | Subordinating marker | *that* in "said **that** he left" |
| `ccomp` | Clausal complement | *left* in "said that he **left**" |
| `xcomp` | Open clausal complement | *run* in "wants to **run**" |
| `aux` | Auxiliary verb | *has* in "**has** eaten" |
For the full set of relations, see the [Universal Dependencies documentation](https://universaldependencies.org/u/dep/).
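Tools built around Universal Dependencies usually read CoNLL-U, a ten-column tab-separated format. The sketch below converts one sample under several assumptions: `to_conllu` is a hypothetical helper, columns the dataset does not provide (lemma, POS tags, features, enhanced dependencies) are left as `_`, and storing the transcription under an `IPA=` key in the MISC column is an illustrative choice, not a dataset convention.

```python
def to_conllu(sample: dict) -> str:
    """Render one sample as a CoNLL-U sentence block (ends with a blank line)."""
    words = sample["english"].split()
    ipa_words = sample["ipa"].split()
    rows = []
    for i, (word, ipa, head, label) in enumerate(
        zip(words, ipa_words, sample["dep_heads"], sample["dep_labels"]), start=1
    ):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        rows.append("\t".join(
            [str(i), word, "_", "_", "_", "_", str(head), label, "_", f"IPA={ipa}"]
        ))
    return "\n".join(rows) + "\n\n"


sample = {
    "english": "the cat sat on the table",
    "ipa": "ðʌ kæt sæt ɑn ðʌ teɪbʌl",
    "dep_labels": ["det", "nsubj", "root", "case", "det", "obl"],
    "dep_heads": [2, 3, 0, 6, 6, 3],
}
conllu = to_conllu(sample)
```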
## Data Sources
Sentences were drawn from multiple registers for linguistic diversity:
- News articles (multiple years)
- Wikipedia (encyclopedic)
- Parliamentary proceedings (Europarl)
## Filtering
Sentences were filtered to ensure quality:
- 5–30 words per sentence
- Must begin with a letter and end with sentence-ending punctuation
- No URLs, email addresses, or quoted text
- Limited digit content (≤2 number sequences, ≤15% digit characters)
- No bullet points, list markers, or section headers
- Deduplicated by exact English text match
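The per-sentence rules above might be approximated as follows. This is a rough sketch rather than the filter actually used: the regular expressions, the set of quote characters, the set of sentence-ending punctuation, and the choice to measure the digit ratio over all characters are assumptions. List markers and headers are not checked separately because the first-letter and final-punctuation rules already reject most of them.

```python
import re

# All patterns below are illustrative guesses, not the dataset's own rules.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
QUOTE_RE = re.compile(r'["“”«»]')
NUMBER_RE = re.compile(r"\d+")


def passes_filters(sentence: str) -> bool:
    """Apply the length, boundary, URL/email/quote, and digit rules to one sentence."""
    text = sentence.strip()
    words = text.split()
    if not 5 <= len(words) <= 30:
        return False
    # Must begin with a letter and end with sentence-ending punctuation.
    if not text[0].isalpha() or text[-1] not in ".!?":
        return False
    if URL_RE.search(text) or EMAIL_RE.search(text) or QUOTE_RE.search(text):
        return False
    # At most two number sequences and at most 15% digit characters.
    if len(NUMBER_RE.findall(text)) > 2:
        return False
    digit_ratio = sum(ch.isdigit() for ch in text) / len(text)
    return digit_ratio <= 0.15
```

Deduplication is not a per-sentence test, so it is left out of this function.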
## Processing Pipeline
1. **Sentence extraction** from source corpora
2. **Quality filtering** (see above)
3. **IPA transcription** via [epitran](https://github.com/dmort27/epitran) with CMU pronunciation dictionary fallback
4. **Word alignment verification** — English and IPA word counts must match exactly
5. **Dependency parsing** via [Stanza](https://stanfordnlp.github.io/stanza/) with `tokenize_pretokenized=True` to preserve word-level alignment
6. **Deduplication** by English text
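Steps 4 and 6 are simple enough to sketch as a single generator. The function below is a hypothetical illustration that merges the two steps for brevity (the pipeline runs parsing between them); it assumes words are whitespace-separated and keeps seen sentences in an in-memory set, which may need replacing with hashes or an on-disk index at the scale of this dataset.

```python
from typing import Iterable, Iterator


def aligned_and_unique(pairs: Iterable[tuple[str, str]]) -> Iterator[tuple[str, str]]:
    """Yield (english, ipa) pairs that are word-aligned and not seen before."""
    seen = set()
    for english, ipa in pairs:
        # Step 4: English and IPA word counts must match exactly.
        if len(english.split()) != len(ipa.split()):
            continue
        # Step 6: drop exact duplicates of the English text.
        if english in seen:
            continue
        seen.add(english)
        yield english, ipa


candidates = [
    ("the cat sat", "ðʌ kæt sæt"),
    ("the cat sat", "ðʌ kæt sæt"),   # exact duplicate
    ("the cat sat down", "ðʌ kæt"),  # misaligned word counts
    ("the dog ran", "ðʌ dɔɡ ɹæn"),
]
kept = list(aligned_and_unique(candidates))
```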
## Usage
```python
from datasets import load_dataset

# Stream the train split so the full dataset is not downloaded up front.
ds = load_dataset("dgabri3le/english-ipa-dep-treebank", split="train", streaming=True)

# Print the first sample, then stop.
for sample in ds:
    print(sample["english"])
    print(sample["tagged_ipa"])
    print(sample["dep_labels"])
    break
```
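A common first step after loading is to see which relations dominate. The sketch below tallies `dep_labels` over the first few samples of any iterable of samples; `label_counts` and its `limit` parameter are hypothetical names, and the in-memory list stands in for the streamed dataset so the block runs without a download.

```python
from collections import Counter
from itertools import islice
from typing import Iterable


def label_counts(samples: Iterable[dict], limit: int = 1000) -> Counter:
    """Count relation labels over at most `limit` samples."""
    counts = Counter()
    for sample in islice(samples, limit):
        counts.update(sample["dep_labels"])
    return counts


# Stand-in data; with the real dataset, pass the streamed `ds` object instead.
samples = [
    {"dep_labels": ["det", "nsubj", "root", "case", "det", "obl"]},
    {"dep_labels": ["det", "nsubj", "root"]},
]
counts = label_counts(samples)
```

Capping the number of samples keeps the pass cheap when the input is a stream of millions of sentences.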
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{english_ipa_dep_treebank_2026,
title={English IPA Dependency Treebank},
author={Gabriele, Daniel},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/dgabri3le/english-ipa-dep-treebank}
}
```
Provider:
dgabri3le



