dgabri3le/english-ipa-dep-treebank

Name: dgabri3le/english-ipa-dep-treebank
Creator: dgabri3le
Published: 2026-04-20 21:03:41
License: cc-by-4.0

Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/dgabri3le/english-ipa-dep-treebank


Description:

---
language:
- en
license: cc-by-4.0
task_categories:
- text-generation
- token-classification
tags:
- ipa
- phonetics
- dependency-parsing
- universal-dependencies
- syntax
- treebank
- phonology
size_categories:
- 10M<n<100M
---

# English IPA Dependency Treebank

A large-scale dataset of **10.4 million English sentences** paired with IPA (International Phonetic Alphabet) transcriptions and Universal Dependencies syntactic annotations. Each sentence includes its full dependency parse: head indices, relation labels, and a linearized tagged-IPA representation that interleaves syntactic roles with phonetic content.

## Dataset Structure

Each sample contains:

| Field | Type | Description |
|-------|------|-------------|
| `raw_english` | string | Original English text before contraction expansion (e.g. "don't", "it's") |
| `english` | string | English sentence with contractions expanded (e.g. "do not", "it is"). Lowercased, filtered. For sentences without contractions, identical to `raw_english`. |
| `ipa` | string | IPA transcription (word-aligned with `english`) |
| `dep_labels` | list[string] | Universal Dependencies relation label per word |
| `dep_heads` | list[int] | Head word index per word (1-indexed, 0 = ROOT) |
| `tagged_ipa` | string | Linearized format: `[role] ipa_word [role] ipa_word ...` |

## Examples

### Example 1: Simple sentence with a prepositional phrase

**English:** `the cat sat on the table`

**IPA:** `ðʌ kæt sæt ɑn ðʌ teɪbʌl`

**Dependency parse:**

```
idx  word     role      head  head_word
───  ───────  ────────  ────  ─────────
1    the      det       2     cat
2    cat      nsubj     3     sat
3    sat      root      0     ROOT
4    on       case      6     table
5    the      det       6     table
6    table    obl       3     sat
```

**Tree diagram:**

```
            sat (root)
           ╱          ╲
    cat (nsubj)     table (obl)
        │           ╱        ╲
    the (det)   on (case)   the (det)
```

**Tagged IPA:** `[det] ðʌ [nsubj] kæt [root] sæt [case] ɑn [det] ðʌ [obl] teɪbʌl`

### Example 2: Complex sentence with subordinate clause

**English:** `the researchers found that the compound reduces inflammation effectively`

**IPA:** `ðʌ ɹisɝtʃɝz faʊnd ðæt ðʌ kɑmpaʊnd ɹɪdusɪz ɪnflʌmeɪʃʌn ɪfɛktɪvli`

**Dependency parse:**

```
idx  word            role      head  head_word
───  ──────────────  ────────  ────  ──────────────
1    the             det       2     researchers
2    researchers     nsubj     3     found
3    found           root      0     ROOT
4    that            mark      7     reduces
5    the             det       6     compound
6    compound        nsubj     7     reduces
7    reduces         ccomp     3     found
8    inflammation    obj       7     reduces
9    effectively     advmod    7     reduces
```

**Tree diagram:**

```
                    found (root)
                   ╱            ╲
     researchers (nsubj)      reduces (ccomp)
            │            ╱        │          │            ╲
        the (det)  that (mark) compound  inflammation  effectively
                               (nsubj)   (obj)         (advmod)
                                  │
                              the (det)
```

**Tagged IPA:** `[det] ðʌ [nsubj] ɹisɝtʃɝz [root] faʊnd [mark] ðæt [det] ðʌ [nsubj] kɑmpaʊnd [ccomp] ɹɪdusɪz [obj] ɪnflʌmeɪʃʌn [advmod] ɪfɛktɪvli`

## How Dependency Trees Work

A dependency tree represents the syntactic structure of a sentence as directed links between words. Every word points to its **head** (the word it depends on) via a labeled **relation**.

### Key concepts

- **Root**: The main predicate of the sentence (usually the main verb). Its head index is 0 (no parent).
- **Head**: Each non-root word has exactly one head, the word that governs it syntactically.
- **Relation**: The label on the link describes the grammatical function: `nsubj` (subject), `obj` (object), `det` (determiner), `amod` (adjective modifier), etc.

### Reading the `dep_heads` array

`dep_heads` is a list of integers, one per word. Word positions use **1-based indexing**, so in the notation below `i` counts words from 1 (in a 0-based language such as Python, the head of word `i` is stored at list index `i - 1`):

- `dep_heads[i] = j` means word `i` depends on word `j`
- `dep_heads[i] = 0` means word `i` is the root

For example, in `[2, 3, 0, 6, 6, 3]`:

- Word 1 → depends on word 2
- Word 2 → depends on word 3
- Word 3 → ROOT (head = 0)
- Word 4 → depends on word 6
- Word 5 → depends on word 6
- Word 6 → depends on word 3

### The `tagged_ipa` format

The `tagged_ipa` field provides a linearized representation that interleaves each word's dependency role with its IPA transcription:

```
[det] ðʌ [nsubj] kæt [root] sæt [case] ɑn [det] ðʌ [obl] teɪbʌl
```

This format preserves the original word order while annotating each word's syntactic function. It can be used directly as input to sequence models that need both phonetic and syntactic information.
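
The two representations described above can be unpacked with a few lines of Python. The following is a minimal sketch, not part of the dataset's tooling: the helper names `children_by_head` and `parse_tagged_ipa` are hypothetical, the values are copied from Example 1, and the regular expression assumes that every IPA word is a single whitespace-free token preceded by exactly one `[role]` tag.

```python
import re
from collections import defaultdict

# Values copied from Example 1 in this card
words = "the cat sat on the table".split()
dep_heads = [2, 3, 0, 6, 6, 3]
tagged_ipa = "[det] ðʌ [nsubj] kæt [root] sæt [case] ɑn [det] ðʌ [obl] teɪbʌl"


def children_by_head(heads):
    """Map each head position (1-based, 0 = ROOT) to its dependents' positions."""
    children = defaultdict(list)
    # enumerate from 1 because word positions in dep_heads are 1-based
    for position, head in enumerate(heads, start=1):
        children[head].append(position)
    return dict(children)


def parse_tagged_ipa(text):
    """Split '[role] ipa_word ...' into (role, ipa_word) pairs (assumed format)."""
    return re.findall(r"\[([^\]]+)\]\s+(\S+)", text)


tree = children_by_head(dep_heads)
pairs = parse_tagged_ipa(tagged_ipa)

root_position = tree[0][0]           # the word whose head is 0
print(words[root_position - 1])      # sat
print(tree[root_position])           # [2, 6]
print(pairs[:2])                     # [('det', 'ðʌ'), ('nsubj', 'kæt')]
```

Grouping dependents by head position turns the flat array into a structure that can be walked top-down from the root, which is often more convenient than scanning `dep_heads` repeatedly.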

### Common Universal Dependencies relations

| Relation | Meaning | Example |
|----------|---------|---------|
| `root` | Main predicate | *sat* in "the cat **sat**" |
| `nsubj` | Nominal subject | *cat* in "the **cat** sat" |
| `obj` | Direct object | *fish* in "the cat ate **fish**" |
| `det` | Determiner | *the* in "**the** cat" |
| `amod` | Adjective modifier | *big* in "the **big** cat" |
| `advmod` | Adverb modifier | *quickly* in "ran **quickly**" |
| `case` | Case marker / preposition | *on* in "sat **on** the table" |
| `obl` | Oblique nominal | *table* in "sat on the **table**" |
| `nmod` | Nominal modifier | *wood* in "table of **wood**" |
| `conj` | Conjunct | *dogs* in "cats and **dogs**" |
| `cc` | Coordinating conjunction | *and* in "cats **and** dogs" |
| `mark` | Subordinating marker | *that* in "said **that** he left" |
| `ccomp` | Clausal complement | *left* in "said that he **left**" |
| `xcomp` | Open clausal complement | *run* in "wants to **run**" |
| `aux` | Auxiliary verb | *has* in "**has** eaten" |

For the full set of relations, see the [Universal Dependencies documentation](https://universaldependencies.org/u/dep/).

## Data Sources

Sentences were drawn from multiple registers for linguistic diversity:

- News articles (multiple years)
- Wikipedia (encyclopedic)
- Parliamentary proceedings (Europarl)

## Filtering

Sentences were filtered to ensure quality:

- 5–30 words per sentence
- Must begin with a letter and end with sentence-ending punctuation
- No URLs, email addresses, or quoted text
- Limited digit content (≤2 number sequences, ≤15% digit characters)
- No bullet points, list markers, or section headers
- Deduplicated by exact English text match

## Processing Pipeline

1. **Sentence extraction** from source corpora
2. **Quality filtering** (see above)
3. **IPA transcription** via [epitran](https://github.com/dmort27/epitran) with CMU pronunciation dictionary fallback
4. **Word alignment verification**: English and IPA word counts must match exactly
5. **Dependency parsing** via [Stanza](https://stanfordnlp.github.io/stanza/) with `tokenize_pretokenized=True` to preserve word-level alignment
6. **Deduplication** by English text

## Usage

```python
from datasets import load_dataset

# Stream the train split so the full dataset is not downloaded up front
ds = load_dataset("dgabri3le/english-ipa-dep-treebank", split="train", streaming=True)

# Print the text, tagged IPA, and dependency labels of the first sample
for sample in ds:
    print(sample["english"])
    print(sample["tagged_ipa"])
    print(sample["dep_labels"])
    break
```

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{english_ipa_dep_treebank_2026,
  title={English IPA Dependency Treebank},
  author={Gabriele, Daniel},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/dgabri3le/english-ipa-dep-treebank}
}
```
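
The word-level alignment that the processing pipeline enforces (equal English and IPA word counts, one label and one head per word) can be spot-checked on samples after loading. The sketch below is an illustration under stated assumptions, not the author's verification code: `check_alignment` is a hypothetical helper, words are assumed to be separated by whitespace, the single-root check follows the Universal Dependencies convention rather than anything this card guarantees, and the sample dict is copied from Example 1 instead of being downloaded.

```python
def check_alignment(sample):
    """Return a list of problems found in one sample (empty if consistent)."""
    n = len(sample["english"].split())  # assumes whitespace-separated words
    heads = sample["dep_heads"]
    problems = []
    if len(sample["ipa"].split()) != n:
        problems.append("ipa word count differs from english")
    if len(sample["dep_labels"]) != n:
        problems.append("dep_labels length differs from english")
    if len(heads) != n:
        problems.append("dep_heads length differs from english")
    if any(h < 0 or h > n for h in heads):
        problems.append("head index out of range")
    if heads.count(0) != 1:  # assumption: exactly one root per sentence
        problems.append("expected exactly one root")
    return problems


# Sample reconstructed from Example 1 (no download needed)
example = {
    "english": "the cat sat on the table",
    "ipa": "ðʌ kæt sæt ɑn ðʌ teɪbʌl",
    "dep_labels": ["det", "nsubj", "root", "case", "det", "obl"],
    "dep_heads": [2, 3, 0, 6, 6, 3],
}
print(check_alignment(example))  # []
```

The same function can be applied to each `sample` yielded by the streaming loop in the Usage section.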


