
dgabri3le/english-ipa-dep-treebank

Hugging Face | Updated 2026-04-20 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/dgabri3le/english-ipa-dep-treebank
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- text-generation
- token-classification
tags:
- ipa
- phonetics
- dependency-parsing
- universal-dependencies
- syntax
- treebank
- phonology
size_categories:
- 10M<n<100M
---

# English IPA Dependency Treebank

A large-scale dataset of **10.4 million English sentences** paired with IPA (International Phonetic Alphabet) transcriptions and Universal Dependencies syntactic annotations. Each sentence comes with its full dependency parse: head indices, relation labels, and a linearized tagged-IPA representation that interleaves syntactic roles with phonetic content.

## Dataset Structure

Each sample contains:

| Field | Type | Description |
|-------|------|-------------|
| `raw_english` | string | Original English text before contraction expansion (e.g. "don't", "it's") |
| `english` | string | English sentence with contractions expanded (e.g. "do not", "it is"). Lowercased, filtered. For sentences without contractions, identical to `raw_english`. |
| `ipa` | string | IPA transcription (word-aligned with `english`) |
| `dep_labels` | list[string] | Universal Dependencies relation label per word |
| `dep_heads` | list[int] | Head word index per word (1-indexed, 0 = ROOT) |
| `tagged_ipa` | string | Linearized format: `[role] ipa_word [role] ipa_word ...` |

## Examples

### Example 1: Simple transitive sentence

**English:** `the cat sat on the table`

**IPA:** `ðʌ kæt sæt ɑn ðʌ teɪbʌl`

**Dependency parse:**

```
idx  word     role      head  head_word
───  ───────  ────────  ────  ─────────
1    the      det       2     cat
2    cat      nsubj     3     sat
3    sat      root      0     ROOT
4    on       case      6     table
5    the      det       6     table
6    table    obl       3     sat
```

**Tree diagram:**

```
            sat (root)
           ╱          ╲
   cat (nsubj)      table (obl)
        │           ╱         ╲
   the (det)   on (case)   the (det)
```

**Tagged IPA:** `[det] ðʌ [nsubj] kæt [root] sæt [case] ɑn [det] ðʌ [obl] teɪbʌl`

### Example 2: Complex sentence with subordinate clause

**English:** `the researchers found that the compound reduces inflammation effectively`

**IPA:** `ðʌ ɹisɝtʃɝz faʊnd ðæt ðʌ kɑmpaʊnd ɹɪdusɪz ɪnflʌmeɪʃʌn ɪfɛktɪvli`

**Dependency parse:**

```
idx  word            role      head  head_word
───  ──────────────  ────────  ────  ──────────────
1    the             det       2     researchers
2    researchers     nsubj     3     found
3    found           root      0     ROOT
4    that            mark      7     reduces
5    the             det       6     compound
6    compound        nsubj     7     reduces
7    reduces         ccomp     3     found
8    inflammation    obj       7     reduces
9    effectively     advmod    7     reduces
```

**Tree diagram:**

```
                  found (root)
                 ╱            ╲
  researchers (nsubj)          reduces (ccomp)
          │              ╱        │          │           ╲
      the (det)   that (mark)  compound  inflammation  effectively
                               (nsubj)      (obj)       (advmod)
                                  │
                              the (det)
```

**Tagged IPA:** `[det] ðʌ [nsubj] ɹisɝtʃɝz [root] faʊnd [mark] ðæt [det] ðʌ [nsubj] kɑmpaʊnd [ccomp] ɹɪdusɪz [obj] ɪnflʌmeɪʃʌn [advmod] ɪfɛktɪvli`

## How Dependency Trees Work

A dependency tree represents the syntactic structure of a sentence as directed links between words. Every word points to its **head** (the word it depends on) via a labeled **relation**.
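As a minimal sketch of these head and relation links, the following rebuilds a dependency table like the ones in the examples from the `english`, `dep_labels`, and `dep_heads` fields. `dependency_rows` is a hypothetical helper name, not part of the dataset, and the code assumes that words in `english` are separated by whitespace, which the card implies but does not state outright.

```python
def dependency_rows(english, dep_labels, dep_heads):
    """Return (idx, word, role, head, head_word) tuples, one per word."""
    words = english.split()  # assumption: whitespace-separated words
    rows = []
    for idx, (word, role, head) in enumerate(
        zip(words, dep_labels, dep_heads), start=1
    ):
        # Head indices are 1-based; 0 marks the root, which has no head word.
        head_word = "ROOT" if head == 0 else words[head - 1]
        rows.append((idx, word, role, head, head_word))
    return rows


# Values copied from Example 1.
rows = dependency_rows(
    "the cat sat on the table",
    ["det", "nsubj", "root", "case", "det", "obl"],
    [2, 3, 0, 6, 6, 3],
)
for row in rows:
    print(*row, sep="\t")
```

Each printed row corresponds to one line of the Example 1 table, so the output can be compared against it directly.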
### Key concepts

- **Root**: The main predicate of the sentence (usually the main verb). Its head index is 0 (no parent).
- **Head**: Each non-root word has exactly one head, the word that governs it syntactically.
- **Relation**: The label on the link describes the grammatical function: `nsubj` (subject), `obj` (object), `det` (determiner), `amod` (adjective modifier), etc.

### Reading the `dep_heads` array

`dep_heads` is a list of integers, one per word, using **1-based indexing** for word positions:

- `dep_heads[i] = j` means word `i` depends on word `j`
- `dep_heads[i] = 0` means word `i` is the root

This notation counts words from 1. In Python the list itself is 0-indexed, so the head of word `i` is stored at `dep_heads[i - 1]`.

For example, in `[2, 3, 0, 6, 6, 3]`:

- Word 1 → depends on word 2
- Word 2 → depends on word 3
- Word 3 → ROOT (head = 0)
- Word 4 → depends on word 6
- Word 5 → depends on word 6
- Word 6 → depends on word 3

### The `tagged_ipa` format

The `tagged_ipa` field provides a linearized representation that interleaves each word's dependency role with its IPA transcription:

```
[det] ðʌ [nsubj] kæt [root] sæt [case] ɑn [det] ðʌ [obl] teɪbʌl
```

This format preserves the original word order while annotating each word's syntactic function. It can be used directly as input to sequence models that need both phonetic and syntactic information.
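The two conventions above can be sketched in a few lines of Python. `parse_tagged_ipa` and `children_of` are hypothetical helper names introduced only for illustration. The regular expression assumes that role labels contain no whitespace or `]` and that IPA words contain no whitespace, which matches the examples shown here but is not guaranteed by the card.

```python
import re
from collections import defaultdict

# Assumed shape: "[role] ipa_word" pairs separated by whitespace.
_PAIR = re.compile(r"\[([^\]\s]+)\]\s+(\S+)")


def parse_tagged_ipa(tagged_ipa):
    """Split a tagged_ipa string into (role, ipa_word) pairs in word order."""
    return _PAIR.findall(tagged_ipa)


def children_of(dep_heads):
    """Map each head index (0 = ROOT) to the 1-based indices of its dependents."""
    children = defaultdict(list)
    for word_idx, head in enumerate(dep_heads, start=1):
        children[head].append(word_idx)
    return dict(children)


pairs = parse_tagged_ipa(
    "[det] ðʌ [nsubj] kæt [root] sæt [case] ɑn [det] ðʌ [obl] teɪbʌl"
)
tree = children_of([2, 3, 0, 6, 6, 3])
print(pairs)
print(tree)
```

Inverting `dep_heads` into a head-to-dependents map is one convenient way to walk the tree top-down, starting from the entry for head `0`.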
### Common Universal Dependencies relations

| Relation | Meaning | Example |
|----------|---------|---------|
| `root` | Main predicate | *sat* in "the cat **sat**" |
| `nsubj` | Nominal subject | *cat* in "the **cat** sat" |
| `obj` | Direct object | *fish* in "the cat ate **fish**" |
| `det` | Determiner | *the* in "**the** cat" |
| `amod` | Adjective modifier | *big* in "the **big** cat" |
| `advmod` | Adverb modifier | *quickly* in "ran **quickly**" |
| `case` | Case marker / preposition | *on* in "sat **on** the table" |
| `obl` | Oblique nominal | *table* in "sat on the **table**" |
| `nmod` | Nominal modifier | *wood* in "table of **wood**" |
| `conj` | Conjunct | *dogs* in "cats and **dogs**" |
| `cc` | Coordinating conjunction | *and* in "cats **and** dogs" |
| `mark` | Subordinating marker | *that* in "said **that** he left" |
| `ccomp` | Clausal complement | *left* in "said that he **left**" |
| `xcomp` | Open clausal complement | *run* in "wants to **run**" |
| `aux` | Auxiliary verb | *has* in "**has** eaten" |

For the full set of relations, see the [Universal Dependencies documentation](https://universaldependencies.org/u/dep/).

## Data Sources

Sentences were drawn from multiple registers for linguistic diversity:

- News articles (multiple years)
- Wikipedia (encyclopedic)
- Parliamentary proceedings (Europarl)

## Filtering

Sentences were filtered to ensure quality:

- 5–30 words per sentence
- Must begin with a letter and end with sentence-ending punctuation
- No URLs, email addresses, or quoted text
- Limited digit content (≤2 number sequences, ≤15% digit characters)
- No bullet points, list markers, or section headers
- Deduplicated by exact English text match

## Processing Pipeline

1. **Sentence extraction** from source corpora
2. **Quality filtering** (see above)
3. **IPA transcription** via [epitran](https://github.com/dmort27/epitran) with CMU pronunciation dictionary fallback
4. **Word alignment verification**: English and IPA word counts must match exactly
5. **Dependency parsing** via [Stanza](https://stanfordnlp.github.io/stanza/) with `tokenize_pretokenized=True` to preserve word-level alignment
6. **Deduplication** by English text

## Usage

```python
from datasets import load_dataset

# Stream the train split instead of downloading the whole corpus up front.
ds = load_dataset("dgabri3le/english-ipa-dep-treebank", split="train", streaming=True)

# Print the first sample only.
for sample in ds:
    print(sample["english"])
    print(sample["tagged_ipa"])
    print(sample["dep_labels"])
    break
```

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{english_ipa_dep_treebank_2026,
  title={English IPA Dependency Treebank},
  author={Gabriele, Daniel},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/dgabri3le/english-ipa-dep-treebank}
}
```
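Returning to the word-alignment guarantee from the processing pipeline, a consumer of the data may want to re-check it on the samples it reads. The sketch below is one way to do that and is not the author's verification code. `check_alignment` is a hypothetical helper, words are assumed to be whitespace-separated, and the single-root check reflects the usual Universal Dependencies convention, not a guarantee stated in this card.

```python
def check_alignment(sample):
    """Return a list of problems found in one sample (empty if none)."""
    n = len(sample["english"].split())  # assumption: whitespace-separated words
    heads = sample["dep_heads"]
    problems = []
    if len(sample["ipa"].split()) != n:
        problems.append("ipa word count differs from english")
    if len(sample["dep_labels"]) != n or len(heads) != n:
        problems.append("dependency arrays differ in length from english")
    # Valid heads are 0 (ROOT) or a 1-based word position.
    if any(not 0 <= h <= n for h in heads):
        problems.append("head index out of range")
    # Assumed convention: exactly one word is attached to ROOT.
    if list(heads).count(0) != 1:
        problems.append("expected exactly one root")
    return problems


# Values copied from Example 1.
sample = {
    "english": "the cat sat on the table",
    "ipa": "ðʌ kæt sæt ɑn ðʌ teɪbʌl",
    "dep_labels": ["det", "nsubj", "root", "case", "det", "obl"],
    "dep_heads": [2, 3, 0, 6, 6, 3],
}
problems = check_alignment(sample)
print(problems)
```

The same function can be applied inside the streaming loop from the Usage section to spot-check a handful of samples before training on them.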

Provider:
dgabri3le