
yachay/text_coordinates_regions

Hugging Face · Updated 2023-09-21 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/yachay/text_coordinates_regions
Resource description:
---
license: mit
tags:
- multilingual
- text
- coordinates
- geospatial
- translation
- NER
- geo
- geo-tagged
- named-entity-recognition
- natural-language-processing
- geographic-data
- geolocation
- twitter
- reddit
task_categories:
- feature-extraction
- token-classification
- text-classification
pretty_name: Multilingual Geo-Tagged Social Media Posts (by 123 world regions)
language:
- en
- zh
- es
- hi
- ar
- bn
- pt
- ru
- ja
- pa
- de
- jv
- ms
- te
- vi
- ko
- fr
- mr
- ta
- ur
- tr
- it
- th
- gu
- fa
- pl
size_categories:
- 100M<n<1B
---

# Dataset Card for Multilingual Geo-Tagged Social Media Posts (by 123 world regions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/yachay/text_coordinates_regions
- **Repository:** https://github.com/Yachay-AI/byt5-geotagging#datasets
- **Paper:** https://dev.to/yachayai/applying-machine-learning-to-geolocate-twitter-posts-2m1d
- **Leaderboard:**
- **Point of Contact:** admin-team@yachay.ai

### Dataset Summary

The "Regions" dataset is a multilingual corpus that encompasses textual data from the 123 most populated regions worldwide, with each region's data organized into separate .json files. This dataset consists of approximately 500,000 text samples, each paired with its geographic coordinates.

**Key Features:**

- **Textual Data:** The dataset contains 500,000 text samples.
- **Geocoordinates:** Each text sample is associated with geocoordinates.
- **Regional Coverage:** The dataset encompasses 123 of the world's most populated regions.
- **Tweet Data:** Within each region, there are 5,000 individual tweets/comments.

### Supported Tasks and Leaderboards

This dataset is well-suited for tasks such as geotagging, where the objective is to associate text with specific geographical locations. It can also be utilized for geolocation analysis, sentiment analysis in regional contexts, and regional text classification.

### Languages

**Multilingual Dataset**

This dataset is multilingual and contains text data in various languages from around the world. It does not have a fixed set of languages, and the language composition may vary across different versions or updates of the dataset.

## Dataset Structure

**Structure and Naming Convention:**

The JSON files are named "c_0.json" through "c_122.json", where "c_" marks the region category and the number that follows is the region's unique identifier.

```bash
/
├── .gitattributes
├── README.md
├── c_0.json     # Each .json file corresponds to one of the 123 regions
├── c_1.json
├── ...
├── c_122.json
```

### Data Instances

The Regions dataset consists of a total of 500,000 data instances, with each instance comprising a text sample and its associated geocoordinates. These instances are distributed across the 123 region files.

### Data Fields

**Text (text):** This field contains the text sample. It typically holds natural language text such as comments, tweets, or other text-based content.

**Coordinates (coordinates):** This field holds the geographical coordinates (latitude and longitude) of the location associated with the text. In the sample below, both values are stored as strings.

```json
{
  "text": "sample text",
  "coordinates": [
    "-75.04057630341867",
    "40.01714225600481"
  ]
}
```

### Data Splits

This dataset is not pre-partitioned into training, validation, and test splits, so users can split the data according to their own research or application needs.

## Dataset Creation

2021

### Curation Rationale

The "Regions" dataset was created to train and improve text geotagging models. With 500,000 text samples, each paired with geocoordinates, it offers a resource for developing models that can associate text with specific geographical locations.
Whether for geolocation analysis or other tasks merging text and geographic information, this dataset serves as a valuable training tool.

### Source Data

#### Initial Data Collection and Normalization

The initial data collection process focused on gathering geotagged comments from social media platforms, with a primary emphasis on Twitter.

#### Who are the source language producers?

Twitter Community

### Annotations

#### Annotation process

The coordinates in this dataset have been derived from metadata sources.

#### Who are the annotators?

No manual annotation was conducted for this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

The "Regions" dataset, with its multilingual text and geographic coordinates, presents an opportunity to advance research in geospatial NLP. However, it is crucial for users to exercise caution and ethical responsibility when handling location-related data to mitigate any potential privacy concerns or misuse.

### Discussion of Biases

It's essential to acknowledge that the data collected from social media platforms may contain inherent biases, influenced by user demographics and platform dynamics. Researchers should be mindful of these biases and consider potential implications in their analyses.

### Other Known Limitations

- The dataset's multilingual nature may lead to varying data quality and linguistic diversity across regions.
- The use of geotagged social media comments means that the dataset may not cover less active or less represented regions.
- The accuracy of geocoordinates is subject to inherent limitations of the data sources used for collection.

## Additional Information

### Dataset Curators

Yachay AI

### Licensing Information

MIT

Provider:
yachay
Summary of original information

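Dataset card: Multilingual Geo-Tagged Social Media Posts (by 123 world regions)

Dataset description

Dataset summary

The "Regions" dataset is a multilingual corpus containing text from the 123 most populated regions worldwide, with each region's data stored in its own .json file. The dataset contains approximately 500,000 text samples, each paired with its geographic coordinates.

Key features:

  • Textual data: The dataset contains 500,000 text samples.
  • Geocoordinates: Each text sample is associated with geocoordinates.
  • Regional coverage: The dataset covers 123 of the world's most populated regions.
  • Tweet data: Each region contains 5,000 individual tweets/comments.

Supported tasks and leaderboards

The dataset is suited to tasks such as geotagging, where the goal is to associate text with a specific geographic location. It can also be used for geolocation analysis, sentiment analysis in regional contexts, and regional text classification.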
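
Geotagging predictions are often scored by how far the predicted point lands from the reference point. The card does not prescribe a metric, so the sketch below assumes great-circle (haversine) distance in kilometres; `haversine_km` and `EARTH_RADIUS_KM` are names chosen here for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant, not from the card


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

Averaging or taking the median of this distance over a held-out set gives a single error figure for a geotagging model.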

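Languages

Multilingual dataset

The dataset is multilingual and contains text in various languages from around the world. It has no fixed set of languages, and the language composition may vary across versions or updates.

Dataset structure

Structure and naming convention:

The JSON files are named "c_0.json" through "c_122.json", where "c_" marks the region category and is followed by a unique identifier.

```bash
/
├── .gitattributes
├── README.md
├── c_0.json     # Each .json file corresponds to one of the 123 regions
├── c_1.json
├── ...
├── c_122.json
```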
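
The naming convention makes it straightforward to walk the region files in order. The following is a minimal sketch, not the curators' loader: `region_path`, `load_region`, and `iter_regions` are hypothetical helpers, and the code assumes each file holds a single JSON array of records, which the card does not confirm.

```python
import json
from pathlib import Path

NUM_REGIONS = 123  # c_0.json through c_122.json, per the naming convention


def region_path(data_dir, region_id):
    """Return the path of one region file, e.g. c_7.json."""
    if not 0 <= region_id < NUM_REGIONS:
        raise ValueError(f"region_id must be in [0, {NUM_REGIONS - 1}], got {region_id}")
    return Path(data_dir) / f"c_{region_id}.json"


def load_region(data_dir, region_id):
    """Load one region file.

    Assumption: the file is one JSON array of records. If your copy is
    JSON Lines instead, parse it line by line.
    """
    with open(region_path(data_dir, region_id), encoding="utf-8") as f:
        return json.load(f)


def iter_regions(data_dir):
    """Yield (region_id, records) for every region file present in data_dir."""
    for region_id in range(NUM_REGIONS):
        if region_path(data_dir, region_id).exists():
            yield region_id, load_region(data_dir, region_id)
```

Yielding one region at a time keeps memory use bounded by the largest single file rather than the whole corpus.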

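Data instances

The "Regions" dataset contains 500,000 data instances in total, each consisting of a text sample and its associated geocoordinates. The instances are distributed across the .json files of the 123 regions.

Data fields

Text (text): This field contains the text sample, typically natural language text such as comments, tweets, or other text-based content.

Coordinates (coordinates): This field contains the geographic coordinates (latitude and longitude) of the location associated with the text.

```json
{
  "text": "sample text",
  "coordinates": [
    "-75.04057630341867",
    "40.01714225600481"
  ]
}
```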
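
The sample stores both coordinates as strings, so numeric work needs a conversion step. A minimal sketch follows; `GeoText` and `parse_record` are hypothetical names, and the [longitude, latitude] ordering is an assumption inferred from the sample values rather than something the card states.

```python
import json
from typing import NamedTuple


class GeoText(NamedTuple):
    text: str
    lon: float
    lat: float


def parse_record(record):
    """Convert one raw record into a GeoText with numeric coordinates.

    Assumption: coordinates are ordered [longitude, latitude]. Swap the
    two names below if your copy of the data uses the opposite order.
    """
    first, second = record["coordinates"]
    lon, lat = float(first), float(second)
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError(f"coordinates out of range: {record['coordinates']}")
    return GeoText(text=record["text"], lon=lon, lat=lat)


# The sample record from the card, parsed into numeric form.
raw = '{"text": "sample text", "coordinates": ["-75.04057630341867", "40.01714225600481"]}'
sample = parse_record(json.loads(raw))
print(sample.lon, sample.lat)
```

The range check is a cheap way to notice swapped or corrupted values early, although it cannot catch a swap when both numbers happen to fit either range.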

Data splits

The dataset is not pre-split into training, validation, and test sets, so users can partition the data to suit their own research or application needs.
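
Because no splits ship with the data, each user has to create them. One possible approach is a seeded random split, sketched below; `split_records` and the 80/10/10 default ratios are illustrative choices, not a recommendation from the dataset curators.

```python
import random


def split_records(records, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle records and cut them into train/validation/test lists."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("val_frac and test_frac must be non-negative and sum to less than 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Stand-in for one region's records; real records would be parsed JSON objects.
train, val, test = split_records(range(5000))
print(len(train), len(val), len(test))
```

Splitting each region file separately with the same fractions keeps every region represented in all three sets, which matters if a model is evaluated per region.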

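Dataset creation

2021

Curation rationale

The "Regions" dataset was created to train and improve text geotagging models. With 500,000 text samples, each paired with geocoordinates, it provides a resource for developing models that associate text with specific geographic locations. Whether for geolocation analysis or other tasks that combine text and geographic information, the dataset serves as a valuable training tool.

Source data

Initial data collection and normalization

The initial data collection focused on gathering geotagged comments from social media platforms, primarily Twitter.

Who are the source language producers?

Twitter community

Annotations

Annotation process

The coordinates in this dataset were derived from metadata sources.

Who are the annotators?

No manual annotation was conducted for this dataset.

Considerations for using the data

Social impact of dataset

With its multilingual text and geographic coordinates, the "Regions" dataset offers an opportunity to advance research in geospatial NLP. However, users must exercise caution and ethical responsibility when handling location-related data to mitigate potential privacy concerns or misuse.

Discussion of biases

Data collected from social media platforms may contain inherent biases, influenced by user demographics and platform dynamics. Researchers should be aware of these biases and consider their potential implications in any analysis.

Other known limitations

  • The dataset's multilingual nature may lead to varying data quality and linguistic diversity across regions.
  • Relying on geotagged social media comments means the dataset may not cover less active or less represented regions.
  • The accuracy of the geocoordinates is subject to the inherent limitations of the data sources used for collection.

Additional information

Dataset curators

Yachay AI

Licensing information

MIT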
