
RyokoExtra/MissingKeys

Source: Hugging Face. Updated: 2024-04-10. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/RyokoExtra/MissingKeys
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
- text-generation
- text-to-image
- text-to-video
language:
- ja
pretty_name: MissingKeys
configs:
- config_name: default
  default: true
  data_files:
  - split: all
    path:
    - 'data/*/*.jsonl'
---

# Dataset Card for MissingKeys

NOTE: This contains old data from before 10/04/24. The uploader has moved to [here!](https://huggingface.co/datasets/WitchesSocialStream/misskey.io)

## Dataset Description

- **Homepage:** Here!
- **Repository:** N/A
- **Paper:** N/A
- **Leaderboard:** N/A
- **Point of Contact:** KaraKaraWitch

### Dataset Summary

MissingKeys is a raw dataset archive of the misskey.io network.

### Supported Tasks and Leaderboards

This dataset is primarily intended for unsupervised training of text generation models; however, it may be useful for other purposes.

- text-classification
- text-generation

### Languages

Primarily Japanese, with some English as well.

## Dataset Structure

All the data is stored in jsonl files that have been compressed into .7z archives by date.

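As a minimal sketch of how the extracted files could be read, the snippet below iterates over notes one line at a time using only the Python standard library. It assumes the .7z archives have already been extracted; the directory layout and the names `iter_notes` and `iter_all_notes` are hypothetical and not part of the dataset.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_notes(jsonl_path: Path) -> Iterator[dict]:
    """Yield one note (a dict) per non-empty line of a JSONL file."""
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def iter_all_notes(root: Path) -> Iterator[dict]:
    """Walk every *.jsonl file under `root` (a hypothetical extraction directory)."""
    for path in sorted(root.rglob("*.jsonl")):
        yield from iter_notes(path)
```

Streaming line by line keeps memory use flat, which matters when each file can hold on the order of 100000 notes.
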
### Data Instances

Here is a sample with all the potential fields:

```json
{
  "id": "9hh9iux6al",
  "createdAt": "2023-07-22T07:38:17.994Z",
  "userId": "9grv7htulz",
  "user": {
    "uid": "9grv7htulz#chikusa_nao@misskey.backspace.fm",
    "name": "千種ナオ(ばすキー)",
    "avatarUrl": "https://proxy.misskeyusercontent.com/avatar.webp?url=https%3A%2F%2Fs3.isk01.sakurastorage.jp%2Fbackspacekey%2Fmisskey%2Fca098593-5c2f-4488-8b82-18961149cf92.png&avatar=1",
    "avatarBlurhash": "eGD8ztEK0KVb-=4TtSXm-jf4B7Vs~CEND*Fy%2Mct7%Lx.M{xcS0bv",
    "states": "bot,nyaa~",
    "hostInfo": "misskey@13.13.2#e4d440",
    "emojis": {},
    "onlineStatus": "unknown"
  },
  "text": "パソコン工房などのユニットコム系列だと、マザボ売るときにドライバディスクがないと30%買取金額が下がるという知見を得た",
  "cw": null,
  "visibility": "public",
  "localOnly": false,
  "renoteCount": 0,
  "repliesCount": 0,
  "reactions": {},
  "reactionEmojis": {},
  "emojis": {},
  "fileIds": [],
  "files": [],
  "replyId": null,
  "renoteId": null,
  "uri": "https://misskey.backspace.fm/notes/9hh9iux6p7"
}
```

If a value is falsy in Python, it has been removed to save space.

`states` is a comma-separated string that includes `bot`, `nyaa~` (indicating the user has enabled cat mode), or both.

### Data Fields

Refer to the sample above. Some additional notes:

`uid` in `user` follows this specific format: `user_id#username@user_host`

### Data Splits

Each jsonl file is split at 100000 notes.

## Dataset Creation

### Curation Rationale

Because we needed an SNS dataset, and since Twitter appears to be quite reluctant, we went for the alternative.

### Source Data

#### Initial Data Collection and Normalization

None. No normalization is performed, as this is a raw dump of the dataset. However, we have removed empty and null fields to conserve space.

#### Who are the source language producers?

The related users of the misskey.io network.

### Annotations

#### Annotation process

No annotations are present.

#### Who are the annotators?

No human annotators.

### Personal and Sensitive Information

We are certain there is no PII included in the dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

Misskey.io tends to be NSFW for images and is focused on Japanese culture.

### Other Known Limitations

N/A

## Additional Information

### Dataset Curators

KaraKaraWitch

### Licensing Information

Apache 2.0, for all parts of which KaraKaraWitch may be considered authors. All other material is distributed under fair use principles.

Ronsor Labs additionally is allowed to relicense the dataset as long as it has gone through processing.

### Citation Information

```
@misc{missingkeys,
  title = {MissingKeys: A SNS dataset on misskey.io network},
  author = {KaraKaraWitch},
  year = {2023},
  howpublished = {\url{https://huggingface.co/datasets/RyokoExtra/MissingKeys}},
}
```

### Name Etymology

N/A

### Contributions

- [@KaraKaraWitch (Twitter)](https://twitter.com/KaraKaraWitch) for gathering this dataset.
Provider:
RyokoExtra
Summary of Original Information

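Dataset Overview

Dataset Name

MissingKeys

License

Apache-2.0

Task Categories

  • Text classification
  • Text generation
  • Text-to-image
  • Text-to-video

Languages

Primarily Japanese, with some English.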

Dataset Structure

The dataset files are stored in jsonl format and compressed into .7z archives by date. Each jsonl file contains up to 100,000 notes.

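Data Instances

Each data instance contains multiple fields, such as user information, text content, and visibility. The uid in the user information follows a specific format: user_id#username@user_host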
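The uid format above can be split into its three parts with plain string operations. The following is a sketch only: `UserRef` and `parse_uid` are hypothetical names, and the code assumes every uid contains both a `#` and an `@`, which the source does not confirm for all records.

```python
from typing import NamedTuple


class UserRef(NamedTuple):
    user_id: str
    username: str
    host: str


def parse_uid(uid: str) -> UserRef:
    """Split 'user_id#username@user_host' into its three parts."""
    user_id, sep, rest = uid.partition("#")
    if not sep:
        raise ValueError(f"missing '#' in uid: {uid!r}")
    # Split on the last '@' so the host is always the final component.
    username, sep, host = rest.rpartition("@")
    if not sep:
        raise ValueError(f"missing '@' in uid: {uid!r}")
    return UserRef(user_id, username, host)
```

For the sample uid `9grv7htulz#chikusa_nao@misskey.backspace.fm`, this yields the user ID `9grv7htulz`, the username `chikusa_nao`, and the host `misskey.backspace.fm`.
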

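Dataset Creation

Data Source

The data comes from users of the misskey.io network.

Data Processing

The data has not been normalized, but empty and null fields have been removed to save space.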
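Because empty and null fields are removed, code that reads a note cannot rely on any given key being present. The sketch below shows one way to cope, under stated assumptions: the default values in `ASSUMED_DEFAULTS` are inferred from the sample record in the dataset card and are not an official schema, and `with_defaults` and `user_states` are hypothetical helper names.

```python
from typing import Any

# Assumed defaults, inferred from the sample record; not an official schema.
ASSUMED_DEFAULTS: dict[str, Any] = {
    "text": None,
    "cw": None,
    "localOnly": False,
    "renoteCount": 0,
    "repliesCount": 0,
    "reactions": {},
    "fileIds": [],
    "files": [],
    "replyId": None,
    "renoteId": None,
}


def with_defaults(note: dict) -> dict:
    """Return a copy of `note` with assumed defaults filled in for missing keys."""
    # Copy mutable defaults so callers never share (and mutate) the template.
    filled = {
        key: (value.copy() if isinstance(value, (dict, list)) else value)
        for key, value in ASSUMED_DEFAULTS.items()
    }
    filled.update(note)
    return filled


def user_states(note: dict) -> set[str]:
    """Parse the comma-separated `states` string into a set; empty if absent."""
    raw = note.get("user", {}).get("states", "")
    return {part for part in raw.split(",") if part}
```

Using `dict.get` with a fallback, as in `user_states`, is the lighter option when only one or two fields are needed; filling defaults up front is more convenient when the notes feed into tabular processing.
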

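Considerations for Use

  • The curator states that the dataset contains no personally identifiable information (PII).
  • Dataset content may include not-safe-for-work (NSFW) images and focuses on Japanese culture.

Dataset Contributors

KaraKaraWitch

Licensing Information

The dataset is released under the Apache 2.0 license for all parts of which KaraKaraWitch may be considered the author; all other material is distributed under fair use principles. Ronsor Labs is additionally allowed to relicense the dataset once it has gone through processing.

Citation Information

@misc{missingkeys,
  title = {MissingKeys: A SNS dataset on misskey.io network},
  author = {KaraKaraWitch},
  year = {2023},
  howpublished = {\url{https://huggingface.co/datasets/RyokoExtra/MissingKeys}},
}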
