RyokoExtra/MissingKeys
Source: Hugging Face. Updated 2024-04-10; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/RyokoExtra/MissingKeys
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
- text-generation
- text-to-image
- text-to-video
language:
- ja
pretty_name: MissingKeys
configs:
- config_name: default
  default: true
  data_files:
  - split: all
    path:
    - 'data/*/*.jsonl'
---
# Dataset Card for MissingKeys
NOTE: This repository contains only old data from before 10/04/24. Newer uploads have moved [here!](https://huggingface.co/datasets/WitchesSocialStream/misskey.io)
## Dataset Description
- **Homepage:** Here!
- **Repository:** N/A
- **Paper:** N/A
- **Leaderboard:** N/A
- **Point of Contact:** KaraKaraWitch
### Dataset Summary
MissingKeys is a raw dataset archive of the misskey.io network.
### Supported Tasks and Leaderboards
This dataset is primarily intended for unsupervised training of text generation models; however, it may be useful for other purposes.
- text-classification
- text-generation
### Languages
Primarily Japanese, with some English as well.
## Dataset Structure
All the data is stored in JSONL files that have been compressed into .7z archives by date.
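As a minimal sketch of reading this layout, the snippet below iterates over the notes in a single JSONL file. It assumes the .7z archive has already been extracted with a tool of your choice; the helper name `iter_notes` is hypothetical and not part of the dataset.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_notes(jsonl_path: Path) -> Iterator[dict]:
    """Yield one note (a dict) per non-empty line of an extracted JSONL file."""
    with jsonl_path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

Reading line by line keeps memory use flat, which matters when a file holds many notes.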
### Data Instances
Here is a sample with all the potential fields:
```json
{
"id": "9hh9iux6al",
"createdAt": "2023-07-22T07:38:17.994Z",
"userId": "9grv7htulz",
"user": {
"uid": "9grv7htulz#chikusa_nao@misskey.backspace.fm",
"name": "千種ナオ(ばすキー)",
"avatarUrl": "https://proxy.misskeyusercontent.com/avatar.webp?url=https%3A%2F%2Fs3.isk01.sakurastorage.jp%2Fbackspacekey%2Fmisskey%2Fca098593-5c2f-4488-8b82-18961149cf92.png&avatar=1",
"avatarBlurhash": "eGD8ztEK0KVb-=4TtSXm-jf4B7Vs~CEND*Fy%2Mct7%Lx.M{xcS0bv",
"states": "bot,nyaa~",
"hostInfo": "misskey@13.13.2#e4d440",
"emojis": {},
"onlineStatus": "unknown"
},
"text": "パソコン工房などのユニットコム系列だと、マザボ売るときにドライバディスクがないと30%買取金額が下がるという知見を得た",
"cw": null,
"visibility": "public",
"localOnly": false,
"renoteCount": 0,
"repliesCount": 0,
"reactions": {},
"reactionEmojis": {},
"emojis": {},
"fileIds": [],
"files": [],
"replyId": null,
"renoteId": null,
"uri": "https://misskey.backspace.fm/notes/9hh9iux6p7"
}
```
Any field whose value is falsy in Python (for example `null`, `false`, `0`, or an empty string, list, or object) has been removed to save space.
`states` is a comma-separated string that contains `bot`, `nyaa~` (indicating the user has enabled cat mode), or both.
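Because falsy fields are dropped, a reader has to treat missing keys as empty values. The sketch below shows one way to do that and to split `states` into flags. The helper name `parse_states` and the chosen defaults are assumptions for illustration, not something the dataset defines.

```python
def parse_states(user: dict) -> set[str]:
    """Return the flags in a user's `states` string, or an empty set if absent."""
    raw = user.get("states") or ""  # the key is missing when the value was falsy
    return {flag.strip() for flag in raw.split(",") if flag.strip()}


# A trimmed-down note: counters and empty containers have been removed.
note = {"id": "9hh9iux6al", "user": {"states": "bot,nyaa~"}}

flags = parse_states(note.get("user", {}))
is_bot = "bot" in flags
is_cat = "nyaa~" in flags  # cat mode enabled

# Assumed default: a missing counter is read as zero.
renote_count = note.get("renoteCount", 0)
```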
### Data Fields
Refer to the sample above. I'll drop in some additional notes:
`uid` in `user` follows this specific format:
`user_id#username@user_host`
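The format above can be taken apart with plain string operations. This is a minimal sketch: `UserRef` and `parse_uid` are hypothetical names, and returning `None` when no `@user_host` part is present is an assumption, since the card does not say whether the host can be omitted.

```python
from typing import NamedTuple, Optional


class UserRef(NamedTuple):
    user_id: str
    username: str
    host: Optional[str]


def parse_uid(uid: str) -> UserRef:
    """Split a `user_id#username@user_host` string into its three parts."""
    user_id, sep, rest = uid.partition("#")
    if not sep:
        raise ValueError(f"uid has no '#' separator: {uid!r}")
    if "@" in rest:
        # Split on the last '@' so the host is always the final component.
        username, host = rest.rsplit("@", 1)
    else:
        # Assumption: treat a missing host as unknown rather than an error.
        username, host = rest, None
    return UserRef(user_id, username, host)


ref = parse_uid("9grv7htulz#chikusa_nao@misskey.backspace.fm")
```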
### Data Splits
Each jsonl file is split at 100000 notes.
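To illustrate what splitting at a fixed note count looks like, here is a sketch that groups a stream of notes into chunks of at most 100,000. It is not the uploader's actual script; `chunk_notes` and `NOTES_PER_FILE` are hypothetical names.

```python
from itertools import islice
from typing import Iterable, Iterator

NOTES_PER_FILE = 100_000  # split size stated in the card


def chunk_notes(
    notes: Iterable[dict], size: int = NOTES_PER_FILE
) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` notes."""
    iterator = iter(notes)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk would correspond to one JSONL file; the last chunk may be shorter than the others.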
## Dataset Creation
### Curation Rationale
We needed an SNS dataset, and since Twitter appears to be quite reluctant, we went for the alternative.
### Source Data
#### Initial Data Collection and Normalization
None. No normalization is performed, as this is a raw dump of the data. However, we have removed empty and null fields to conserve space.
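The field removal described here could look roughly like the sketch below. This is an assumption about the behavior, not the uploader's code: `strip_falsy` is a hypothetical helper, and applying it recursively to nested objects such as `user` is a guess.

```python
from typing import Any


def strip_falsy(value: Any) -> Any:
    """Recursively drop dict entries whose value is falsy (None, 0, '', [], {})."""
    if isinstance(value, dict):
        cleaned = {key: strip_falsy(item) for key, item in value.items()}
        # A nested dict that ends up empty is itself falsy and is dropped too.
        return {key: item for key, item in cleaned.items() if item}
    return value


raw_note = {
    "text": "hi",
    "cw": None,
    "renoteCount": 0,
    "files": [],
    "user": {"name": "n", "emojis": {}},
}
compact_note = strip_falsy(raw_note)
```

Under this rule `false` and `0` are removed along with `null`, so consumers cannot tell a zero count from a field that was never present.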
#### Who are the source language producers?
Users of the misskey.io network.
### Annotations
#### Annotation process
No annotations are present.
#### Who are the annotators?
No human annotators.
### Personal and Sensitive Information
We are certain there is no PII included in the dataset.
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
Misskey.io tends to be NSFW for images and is focused on Japanese culture.
### Other Known Limitations
N/A
## Additional Information
### Dataset Curators
KaraKaraWitch
### Licensing Information
Apache 2.0, for all parts of which KaraKaraWitch may be considered the author. All other material is distributed under fair use principles.
Ronsor Labs is additionally allowed to relicense the dataset as long as it has gone through processing.
### Citation Information
```
@misc{missingkeys,
title = {MissingKeys: A SNS dataset on misskey.io network},
author = {KaraKaraWitch},
year = {2023},
howpublished = {\url{https://huggingface.co/datasets/RyokoExtra/MissingKeys}},
}
```
### Name Etymology
N/A
### Contributions
- [@KaraKaraWitch (Twitter)](https://twitter.com/KaraKaraWitch) for gathering this dataset.
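Provider: RyokoExtra
## Summary of Original Information
### Dataset Overview
- **Dataset name:** MissingKeys
- **License:** Apache-2.0
- **Task categories:** text classification, text generation, text-to-image, text-to-video
- **Languages:** Primarily Japanese, with some English.
### Dataset Structure
The dataset files are stored in JSONL format and compressed into .7z archives by date. Each JSONL file contains 100,000 notes.
### Data Instances
Each data instance contains multiple fields, such as user information, text content, and visibility. The `uid` in the user information follows a specific format: `user_id#username@user_host`.
### Dataset Creation
- **Data source:** Users of the misskey.io network.
- **Data processing:** The data is not normalized, but empty and null fields have been removed to save space.
### Considerations for Use
- The dataset does not contain personally identifiable information (PII).
- The dataset may contain images that are not safe for work (NSFW), and it focuses on Japanese culture.
### Dataset Contributors
KaraKaraWitch
### Licensing Information
The dataset is released under the Apache 2.0 license for all parts of which KaraKaraWitch is considered the author; other material is distributed under fair use principles. Ronsor Labs may relicense the dataset once it has gone through processing.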



