Cariban Lexical Database (CaLeD)
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10019096
Description:
This dataset contains a collection of lexical items from languages of the Cariban language family. It is structured for analysis in computational historical linguistics and provides, for each item, information on the language, the word form, and cognacy judgments. The data is curated to support research in linguistic typology, historical linguistics, and related fields.
Data Structure
The dataset is presented in a TSV (Tab-Separated Values) format, ensuring easy integration with common data analysis tools. Each lexical item in the dataset is detailed with multiple linguistic attributes, including phonological transcriptions, morphological analysis, and cognacy information. The following table summarizes the fields included in the dataset:
| Field Name | Data Type | Description |
|---|---|---|
| ID | string | Unique identifier for each dataset entry. |
| ID_lang | string | Unique identifier for the language within the dataset. |
| Glottocode | string | Code uniquely identifying the language in the Glottolog database. |
| Glottolog_Name | string | Name of the language as recorded in the Glottolog database. |
| ISO639P3code | string | ISO 639-3 code for the language. |
| ID_param | string | Unique identifier for the linguistic parameter or concept within the dataset. |
| Concepticon_ID | integer | Identifier for the concept in the Concepticon database. |
| Concepticon_Gloss | string | Gloss or definition of the concept from the Concepticon database. |
| Value | string | Value of the linguistic data point, typically a word or phrase in the language. |
| Form | string | Phonetic or phonological transcription of the linguistic data point. |
| Segments | string | Further phonetic or phonological breakdown of the form. |
| Source | string | Reference to the source or citation where the data was obtained. |
| Morphemes | string | Morphological breakdown of the form. |
| SimpleCognate | integer | Cognacy judgment, indicating whether the form is cognate with forms of the same meaning in related languages. |
| PartialCognates | string | Partial cognacy coding, detailing the cognacy of individual segments or morphemes. |
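As an illustration of how the fields described above might be consumed, the following is a minimal Python sketch that reads tab-separated rows, casts the two integer-typed fields, and groups languages by concept and cognacy code. The sample rows are invented placeholders, not real CaLeD data, and the helper names `read_rows` and `group_by_cognate_set` are hypothetical. The sketch also assumes that equal `SimpleCognate` integers within one `ID_param` mark forms judged cognate with each other; check this against the dataset's own documentation before relying on it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows: placeholder values, NOT real CaLeD data.
SAMPLE_TSV = (
    "ID\tID_lang\tID_param\tConcepticon_ID\tForm\tSimpleCognate\n"
    "row1\tlang_a\tconcept_x\t1\tform_a1\t10\n"
    "row2\tlang_b\tconcept_x\t1\tform_b1\t10\n"
    "row3\tlang_c\tconcept_x\t1\tform_c1\t11\n"
)

# Fields listed with data type "integer" in the field table.
INTEGER_FIELDS = ("Concepticon_ID", "SimpleCognate")


def read_rows(handle):
    """Read tab-separated rows and cast the integer-typed fields."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        for field in INTEGER_FIELDS:
            # Empty or absent cells become None instead of failing on int("").
            row[field] = int(row[field]) if row.get(field) else None
        rows.append(row)
    return rows


def group_by_cognate_set(rows):
    """Map (concept, cognacy code) to the list of languages sharing it.

    Assumption: equal SimpleCognate values within a concept mark cognate forms.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["ID_param"], row["SimpleCognate"])].append(row["ID_lang"])
    return dict(groups)


rows = read_rows(io.StringIO(SAMPLE_TSV))
groups = group_by_cognate_set(rows)
```

To use it on the downloaded file, replace the in-memory `io.StringIO` handle with `open(path, encoding="utf-8", newline="")`, where `path` points to the TSV file obtained from the download link.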
Intended Use
This dataset is intended for researchers and linguists working on the Cariban language family. It documents lexical similarities and differences across the languages of the family and supports studies of language evolution, genealogical relationships, and structure.
Additional Resources
Metadata for Validation: This dataset comes with comprehensive metadata following the Frictionless Data standard, ensuring that the data structure and types are accurately described for validation purposes. This metadata aids in maintaining the integrity and usability of the data across various computational platforms and research projects.
CLDF Version Available: For researchers using the Cross-Linguistic Data Formats (CLDF), a version of this dataset conforming to the CLDF specification is also available. It is provided as a zipped file for easier distribution and handling.
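The kind of check that the validation metadata enables can be sketched in a few lines of Python. This is a simplified standard-library stand-in, not the Frictionless tooling and not the dataset's actual descriptor: the `SCHEMA` dictionary is hypothetical and only mirrors the field names and types from the field table, and the example rows hold placeholder values rather than real data.

```python
# Field names and types copied from the field table. The dictionary layout is
# a simplified stand-in for a real Frictionless Table Schema descriptor.
SCHEMA = {
    "ID": str,
    "ID_lang": str,
    "Glottocode": str,
    "Glottolog_Name": str,
    "ISO639P3code": str,
    "ID_param": str,
    "Concepticon_ID": int,
    "Concepticon_Gloss": str,
    "Value": str,
    "Form": str,
    "Segments": str,
    "Source": str,
    "Morphemes": str,
    "SimpleCognate": int,
    "PartialCognates": str,
}


def validate_row(row, schema=SCHEMA):
    """Return a list of error messages for one row of raw string cells."""
    errors = []
    for name, expected in schema.items():
        if name not in row:
            errors.append(f"missing field: {name}")
            continue
        value = row[name]
        # Empty cells are treated as missing values and are not type-checked.
        if expected is int and value != "":
            try:
                int(value)
            except ValueError:
                errors.append(f"{name}: expected integer, got {value!r}")
    return errors


# Placeholder rows for illustration only (not real data).
good_row = {name: ("1" if kind is int else "placeholder")
            for name, kind in SCHEMA.items()}
bad_row = dict(good_row, SimpleCognate="abc")

good_errors = validate_row(good_row)
bad_errors = validate_row(bad_row)
```

For real use, the metadata shipped with the dataset should be preferred over a hand-written schema like this one, since it is the authoritative description of the columns.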
Created: 2024-04-21



