TeraGram dataset
Source: DataCite Commons. Updated: 2026-05-15. Indexed: 2026-05-18.
Download link:
https://data.goettingen-research-online.de/citation?persistentId=doi:10.25625/GDCXQK
Description:
TeraGram is a large-scale dataset of Telegram messages from public chats. The dataset contains over 5.9 billion messages dating from 2015 to 2025, collected from 712 thousand channels and groups, enriched with metadata on forwards, reactions, and polls.
The data is distributed as Parquet files; large tables are split into batches of 1M rows each. We provide a pipeline for ingesting the dataset into a Postgres database; see our GitHub repository (https://github.com/Priesemann-Group/telegram_quality_control) for details.
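As a rough illustration of working with the batched layout, the sketch below walks a directory of Parquet batch files in sorted filename order and sums their row counts, keeping only one batch in memory at a time. The directory layout, the `*.parquet` pattern, and the helper names are assumptions made for illustration, not the dataset's documented structure. The default reader relies on the `pyarrow` package, which is imported only when it is actually used.

```python
from pathlib import Path
from typing import Callable, Iterator, Optional


def _read_with_pyarrow(path):
    # Imported lazily so the helpers below can be used with another reader
    # even when pyarrow is not installed.
    import pyarrow.parquet as pq
    return pq.read_table(str(path))


def iter_batches(table_dir, reader: Optional[Callable] = None) -> Iterator:
    """Yield one loaded batch per Parquet file, in sorted filename order.

    `table_dir` and the `*.parquet` pattern are hypothetical; adjust them
    to the actual layout of the downloaded files. Sorting is lexicographic,
    so zero-padded batch numbers sort as expected but unpadded ones may not.
    """
    read = reader or _read_with_pyarrow
    for path in sorted(Path(table_dir).glob("*.parquet")):
        yield read(path)


def count_rows(table_dir, reader: Optional[Callable] = None) -> int:
    """Sum the row counts of all batches found in `table_dir`."""
    return sum(batch.num_rows for batch in iter_batches(table_dir, reader))
```

A call such as `count_rows("path/to/some_table")` (placeholder path) would then report the total number of rows for one table. For the full dataset, the ingestion pipeline in the repository is likely the better route.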
For convenience, we also provide a 1% sample of this dataset in CSV format (https://zenodo.org/records/18262126).
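For a first look at the CSV sample, a sketch like the following streams a file row by row and tallies the values in one column, using only the Python standard library. The file name `messages_sample.csv` and the column name `chat_id` in the usage note are placeholders, not documented names; check the header of the downloaded file before choosing a column.

```python
import csv
from collections import Counter


def tally_column(csv_path, column):
    """Count how often each value occurs in `column`, streaming the file."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        # Fail early with the available header names if the column is missing.
        if column not in (reader.fieldnames or []):
            raise KeyError(
                f"column {column!r} not found; available: {reader.fieldnames}"
            )
        for row in reader:
            counts[row[column]] += 1
    return counts
```

With the placeholder names above, `tally_column("messages_sample.csv", "chat_id").most_common(10)` would list the ten most frequent values in that column.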
The metadata fields are openly available for download, while access to the message content is restricted to protect the privacy of Telegram users. Qualified researchers may request access by sending an e-mail to the contact address. In the e-mail, please include your institutional affiliation, a brief description of the research project, and the type of messages you are interested in (e.g., whether you only need messages in a certain language or from certain chats). Requests will be reviewed and granted on a case-by-case basis.
For details regarding dataset collection and preliminary results, please refer to our paper "TeraGram: A Structured Longitudinal Dataset of the Telegram Messenger" (accepted to ICWSM 2026).
Provider:
GRO.data
Created:
2026-04-24



