
lukeslp/etymology-atlas

Hugging Face · Updated 2026-04-01 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/lukeslp/etymology-atlas
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- feature-extraction
language:
- en
tags:
- linguistics
- etymology
- historical-linguistics
- cognates
- language-families
- indo-european
- phonology
- typology
- nlp
- glottolog
- wiktionary
pretty_name: "Etymology Atlas: Global Language Relationships"
size_categories:
- 1M<n<10M
---

# Etymology Atlas: Global Language Relationships

How did the word 'mother' travel from Proto-Indo-European *méh₂tēr to modern languages across four continents? This dataset maps 4.17 million etymological relationships connecting words across 19,401 languages.

The Etymology Atlas integrates five authoritative linguistic sources into a graph structure:

- **etymology-db**: 3.75 million word relationships extracted from Wiktionary, capturing borrowings, cognates, derivations, and semantic shifts
- **Glottolog**: Complete catalog of the world's languages with geographic coordinates, family trees, and endangerment status
- **Lexibank IE-CoR**: 4,981 expert-annotated cognate sets for Indo-European languages, the gold standard for computational historical linguistics
- **PHOIBLE**: Phoneme inventories for 2,177 languages with articulatory features (manner, place, voicing)
- **WALS**: 76,475 typological feature values covering word order, tone systems, case marking, and 189 other structural properties

Researchers can trace how words evolved, map language contact zones, analyze sound change patterns, or build phylogenetic models of language families. The Parquet format enables efficient querying of the full graph on a laptop.

Data sources are linked by ISO 639-3 codes and Glottocodes, enabling cross-table joins between etymology, phonology, and typology.
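To make the graph structure concrete, here is a minimal sketch of tracing a word back through its ancestors. It uses only the standard library and a handful of hand-written toy rows, not data read from the Parquet files. The column names follow the comment in the Usage snippet; the `relation_type` values (`inherited_from`, `cognate_of`) and the proto-language codes (`gem-pro`, `ine-pro`) are assumptions made for illustration and may not match the values stored in the dataset.

```python
from collections import namedtuple

# Toy edge rows shaped like the etymology table's columns.
# Hand-written for illustration; relation_type values are hypothetical.
Edge = namedtuple("Edge", "word language relation_type related_word related_language")

EDGES = [
    Edge("mother", "eng", "inherited_from", "moder", "enm"),
    Edge("moder", "enm", "inherited_from", "mōdor", "ang"),
    Edge("mōdor", "ang", "inherited_from", "*mōdēr", "gem-pro"),
    Edge("*mōdēr", "gem-pro", "inherited_from", "*méh₂tēr", "ine-pro"),
    Edge("mother", "eng", "cognate_of", "Mutter", "deu"),
]


def trace_ancestry(word, language, edges, relation="inherited_from"):
    """Follow `relation` edges from (word, language) back to the oldest form found."""
    # Simplification: keep a single parent per node; real data may list several.
    parents = {
        (e.word, e.language): (e.related_word, e.related_language)
        for e in edges
        if e.relation_type == relation
    }
    chain = [(word, language)]
    seen = {(word, language)}
    while chain[-1] in parents:
        nxt = parents[chain[-1]]
        if nxt in seen:  # guard against cycles in noisy data
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


chain = trace_ancestry("mother", "eng", EDGES)
print(" <- ".join(f"{w} ({lang})" for w, lang in chain))
```

On the real table the same idea would apply after filtering rows by whatever relation labels the dataset actually uses; inspecting the distinct values of `relation_type` first is a sensible starting point.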
## Files

| File | Records | Description |
|------|---------|-------------|
| `etymologies.parquet` | 4.17M | Core etymology graph |
| `languages.parquet` | 19,401 | Language metadata from Glottolog |
| `cognate_sets.parquet` | 4,981 | Expert cognate groups (IE-CoR) |
| `phonemes.parquet` | 105,484 | PHOIBLE phoneme data |
| `linguistic_features.parquet` | 76,475 | WALS typological features |

## Usage

```python
import pandas as pd

# Load the core etymology graph (4.17M word relationships)
df = pd.read_parquet("etymologies.parquet")
# Columns: word, language, relation_type, related_word, related_language

# Load language metadata
langs = pd.read_parquet("languages.parquet")
# 19,401 languages: ISO codes, coordinates, family, endangerment status

# Select every relationship whose source word is 'mother'
mother_cognates = df[df['word'] == 'mother']

# Join with language metadata to add coordinates.
# Note: this merge needs a 'glottocode' column in both frames, and the column
# list above does not show one for the etymology table. Check df.columns first
# and, if needed, map the language code to a Glottocode before merging.
result = mother_cognates.merge(
    langs[['glottocode', 'latitude', 'longitude', 'family']],
    on='glottocode',
)
```

## Citation

```bibtex
@dataset{etymology_atlas_2026,
  title  = {Etymology Atlas: Global Language Relationships},
  author = {Steuber, Luke},
  year   = {2026},
  doi    = {10.5281/zenodo.18321479},
  url    = {https://huggingface.co/datasets/lukeslp/etymology-atlas}
}
```

**Visualization**: [Language Tree at dr.eamer.dev](https://dr.eamer.dev/datavis/poems/language/tree.html)

## Distribution

- **GitHub**: [lukeslp/etymology-atlas](https://github.com/lukeslp/etymology-atlas)
- **Kaggle**: [lucassteuber/etymology-atlas](https://www.kaggle.com/datasets/lucassteuber/etymology-atlas)
- **HuggingFace**: [lukeslp/etymology-atlas](https://huggingface.co/datasets/lukeslp/etymology-atlas)

## License

CC BY-SA 3.0

## Author

**Luke Steuber** · [lukesteuber.com](https://lukesteuber.com) · [@lukesteuber.com](https://bsky.app/profile/lukesteuber.com)
Provider:
lukeslp