strickvl/afghanwire

Hugging Face | Updated 2024-04-01 | Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/strickvl/afghanwire
Resource description:
---
license: cc-by-nc-sa-4.0
configs:
- config_name: default
  data_files:
  - split: articles
    path: data/articles.parquet
task_categories:
- text-classification
- zero-shot-classification
- summarization
- feature-extraction
language:
- en
pretty_name: afghanwire
size_categories:
- 1K<n<10K
---

## Afghanwire Dataset Description

- **Homepage**: https://huggingface.co/datasets/strickvl/afghanwire
- **Repository**: N/A
- **Paper**: N/A
- **Point of Contact:** Alex Strick van Linschoten ([@strickvl](https://huggingface.co/strickvl))

![](assets/afghanwire-website.png "Screenshot of the Afghanwire website c. 2006")

### Dataset Summary

The Afghanwire dataset is a comprehensive collection of translated Afghan media articles from the period of May 2006 to September 2009. It was created by the Afghanwire media agency, founded by Alex Strick van Linschoten and Felix Kuehn. The agency employed a group of Afghan translators who translated articles from Dari and Pashto media sources into English. The dataset includes translated newspaper and magazine articles, as well as summaries of radio and television content.

As most of the original media from this period is no longer available online, and certainly not in English, this dataset represents the largest publicly available trove of translated Afghan media for the 2006-2009 period.

The primary purpose of making this dataset available is to serve as a historical artifact. However, it also presents opportunities for various Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER) for entities that may be underrepresented in standard or smaller models, and potentially sentiment analysis. It is important to note that the dataset is unlabeled and consists solely of translated articles.

### Supported Tasks and Leaderboards

There are a variety of potential tasks that could be done on this dataset, including text classification, zero-shot classification, summarization, and feature extraction.
There are no leaderboards for this dataset.

### Languages

The dataset is only in English, but the original source articles were in Dari and Pashto.

## Dataset Structure

### Data Instances

Here's a JSON example from the main "articles" table:

```json
{
  "id": "97",
  "paper": "142",
  "title": "Some parliament members left session yesterday",
  "date": "2006-08-13T00:00:00.000Z",
  "author": "-",
  "translator": "19",
  "topic": "0",
  "abstract": "Translated by: H. Stanikzai",
  "comment": "",
  "rating": "0",
  "language": "32",
  "type": "0",
  "article": "Bakhtar, in yesterday session of the parliament, the president of the parliament termed the activities and sessions of the previous week as to be effective, and as an objection on constitutional decree on prisons and confinements the session members left the parliament. The members who left the parliament are mostly from northern Afghanistan and they objected the policies of the government of Afghanistan, they have threatened not participate in the parliament session until the government has changed its policies. The parliament members claims that they are witness for the insecurity in the country, to return the rights and privileges of the military, an in impropriate policy of the government regarding the appointment of cadres the failed reforms in the ministries, bribery in government offices and administrative corruption, the unfair composition the diplomatic representative and the lack of cadre in the central and as well as local government offices. But some other parliament members criticized the action of the members who left the parliament and said that their objection was unlawful and is against the principals of the inner tasks of the parliament.",
  "ok": "0",
  "no_newsletter": "0",
  "eingegeben": 1155479462000,
  "newsletter": "0",
  "free": "0",
  "url": "",
  "top_topic": "0",
  "words": "",
  "translatorcomment": "",
  "datetranslation": "",
  "scan": ""
}
```

The dataset consists of several supporting tables that are referenced in the main "articles" table, such as papers, article_tags, bib_books, cities, current_events, ethnics, glossary, historical_events, issue, languages, organisations, people, provinces, region_tree, renderbackgrounder, top_topcs, topic, and types.

The dataset as a whole consists of 7990 articles that were translated during the period Afghanwire was open as an organisation.

### Data Fields

- `id` - basic id for the article
- `paper` - id/number for a paper mentioned in papers.parquet
- `title` - article title
- `date`
- `author` (if present)
- `translator` (who translated the article)
- `topic` (associated with topic table)
- `abstract` (sometimes mentions the translator)
- `comment` (sometimes also mentions the translator)
- `rating` (not always used; was a measure of interest level)
- `language` (associated with the separate table)
- `type` (associated with the article types table)
- `article` - the full translation
- `ok` - whether the translation has been edited
- `no_newsletter` - a metatag to represent whether the article should be sent out as part of our newsletter or not
- `eingegeben` - a unix timestamp for when the article was uploaded to the database
- `newsletter` - whether to include the article in our newsletter
- `free` - whether to make the article available for free or not
- `url` - if available
- `top_topic` - what high-level topic the article was associated with
- `words` - word count (not always present)
- `translatorcomment` - not always present
- `datetranslation` - not always present
- `scan` - whether there's a scan for the article or not

### Data Splits

There are no predefined splits. The dataset is provided as a single large collection.

## Dataset Creation

### Curation Rationale

The creator of this dataset, Alex Strick van Linschoten, had the database files stored on his hard drive for an extended period. By making this data publicly available, he aims to ensure that it can be utilized by others. The media articles were translated by Afghan translators and represent a snapshot of Afghanistan's media discourse during the 2006-2009 period. As the translations were privately funded and are likely unique, with no other copies existing elsewhere, this dataset is expected to be an extremely valuable resource for scholars and historians.

## Source Data

The source data was collected by ordering newspapers and magazines from around Afghanistan to the Afghanwire office on a daily basis. The agency also monitored radio stations. The translators selected articles that they and the agency deemed representative and interesting for readers, and then translated them into English.

It is worth noting that the data was originally used to populate a website and newsletter at afghanwire.com.
However, the website is no longer active, and the files only existed in an old MySQL database on the creator's laptop. While the website is partially available on the Internet Archive ([snapshot from February 2009](https://web.archive.org/web/20090227154008/http://www.afghanwire.com:80/)), most of the articles were behind a login page, which does not function with the archive snapshots. This dataset aims to make the translated articles accessible to the public.

## Annotations

This dataset does not contain any annotations aside from some manual topic classification.

### Personal and Sensitive Information

The dataset does not contain any personally identifiable information (PII). All content is sourced from public media outlets and has been translated.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset is valuable for historians and researchers as it provides insights into the debates and discussions within Afghan society during the 2006-2009 period. It challenges the notion of uniformity in opinions on various issues, such as attitudes towards the Taliban, the Afghan government, and ISAF/NATO forces. By foregrounding Afghan civil society in the discussion of Afghan history, this dataset plays an important role in shifting the focus from foreign powers and military forces to the voices of the Afghan people, which are often overlooked.

The dataset contains information about events, tribes, and ethnic groups from across Afghanistan, including articles about Dai Kundi province, which might have otherwise been lost. Although the Afghanwire office was based in Kabul, efforts were made to obtain newspapers and magazines from the provinces to ensure a representative collection. However, it is acknowledged that there may be a slight bias towards the capital due to the office's location.

### Discussion of Biases

The creators of this dataset made a concerted effort to avoid biases in both the selection of articles and the translation process.
However, as with any dataset, the potential for biases cannot be entirely eliminated.

### Other Known Limitations

Apart from the possibility of a slight overrepresentation of media from Kabul compared to other provinces, there are no other known limitations to this dataset.

## Additional Information

### Dataset Curators

The dataset was curated by the Afghanwire organization. The translators, Hamid Stanikzai, Atif Mohammadzai, Abdul Hassib Rahimi, and Hamid Safi, selected the articles to be translated and deserve full credit for their work.

### Licensing Information

This dataset is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. For more information, see https://creativecommons.org/licenses/by-sa/4.0/.

### Citation Information

If you use this dataset in your research or project, please cite it as follows:

```
@misc{afghanwire_2024,
  author = {Afghanwire},
  title = {Afghanwire Media Database 2006-2009},
  year = {2024},
  month = {April},
  day = {1},
  url = {https://huggingface.co/datasets/strickvl/afghanwire}
}
```
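### Example: Normalizing a Record

The sample record in the card stores ids as strings and `eingegeben` as a large integer. The following is a minimal sketch, not part of the original dataset tooling, of how such a record might be converted to typed Python values. The helper name `normalize_record`, the `INT_FIELDS` tuple, and the reading of `eingegeben` as milliseconds since the Unix epoch (inferred from the 13-digit sample value) are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Abbreviated copy of the sample record shown in the card
# (the long "article" text and most flag fields are omitted).
SAMPLE = {
    "id": "97",
    "paper": "142",
    "title": "Some parliament members left session yesterday",
    "date": "2006-08-13T00:00:00.000Z",
    "translator": "19",
    "eingegeben": 1155479462000,
    "words": "",
}

# Assumption: these string-typed columns hold integer ids when non-empty.
INT_FIELDS = ("id", "paper", "translator", "topic", "language", "type", "top_topic")


def normalize_record(record: dict) -> dict:
    """Return a copy of `record` with ids as ints and dates as aware datetimes."""
    out = dict(record)

    for key in INT_FIELDS:
        value = out.get(key)
        if isinstance(value, str):
            try:
                out[key] = int(value)
            except ValueError:
                # Leave empty or non-numeric values untouched.
                pass

    if out.get("date"):
        # The sample uses an ISO 8601 string with a trailing "Z" (UTC).
        out["date"] = datetime.strptime(
            out["date"], "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)

    if out.get("eingegeben") is not None:
        # Assumption: the value is in milliseconds, so divide by 1000.
        out["eingegeben"] = datetime.fromtimestamp(
            out["eingegeben"] / 1000, tz=timezone.utc
        )

    return out


if __name__ == "__main__":
    print(normalize_record(SAMPLE))
```

If the Hugging Face `datasets` library is installed, rows obtained with `load_dataset("strickvl/afghanwire", split="articles")` could presumably be passed through the same helper, provided they follow the shape of the sample record.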
Provider:
strickvl
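Summary of Original Information

Dataset Overview

Dataset Name

  • Name: Afghanwire
  • Alias: afghanwire

Dataset Description

  • Content: The dataset contains Afghan media articles translated by the Afghanwire media agency between May 2006 and September 2009. The articles were originally published in Dari and Pashto and were later translated into English.
  • Purpose: Primarily a historical record; also suitable for a range of natural language processing tasks, such as named entity recognition and sentiment analysis.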

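Dataset Structure

  • Data instances: 7990 articles, each with multiple fields such as article ID, title, date, author, translator, topic, abstract, and comment.
  • Data fields: id, paper, title, date, author, translator, topic, abstract, comment, rating, language, type, article, ok, no_newsletter, eingegeben, newsletter, free, url, top_topic, words, translatorcomment, datetranslation, scan, and others.
  • Data splits: No predefined splits; the dataset is provided as a single collection.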

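Dataset Creation

  • Source: Newspaper, magazine, and radio content from Afghanistan, collected and translated by the Afghanwire agency.
  • Translation process: Carried out by a group of Afghan translators to ensure the accuracy and representativeness of the content.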

License

  • License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Languages

  • Original languages: Dari and Pashto
  • Translation language: English

Supported Tasks

  • Task types: text classification, zero-shot classification, summarization, feature extraction

Dataset Size

  • Size: 1K<n<10K

Dataset Contact

  • Contact: Alex Strick van Linschoten
  • Contact handle: @strickvl

Citation Information

@misc{afghanwire_2024, author = {Afghanwire}, title = {Afghanwire Media Database 2006-2009}, year = {2024}, month = {April}, day = {1}, url = {https://huggingface.co/datasets/strickvl/afghanwire} }
