
dvilasuero/sharegpt-curation

Source: Hugging Face | Updated: 2024-01-31 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/dvilasuero/sharegpt-curation
Resource description:
---
size_categories: 10K<n<100K
tags:
- rlfh
- argilla
- human-feedback
---

# Dataset Card for sharegpt-curation

This dataset has been created with [Argilla](https://docs.argilla.io).

As shown in the sections below, this dataset can be loaded into Argilla as explained in [Load with Argilla](#load-with-argilla), or used directly with the `datasets` library in [Load with `datasets`](#load-with-datasets).

## Dataset Description

- **Homepage:** https://argilla.io
- **Repository:** https://github.com/argilla-io/argilla
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains:

* A dataset configuration file conforming to the Argilla dataset format named `argilla.yaml`. This configuration file will be used to configure the dataset when using the `FeedbackDataset.from_huggingface` method in Argilla.
* Dataset records in a format compatible with HuggingFace `datasets`. These records will be loaded automatically when using `FeedbackDataset.from_huggingface` and can be loaded independently using the `datasets` library via `load_dataset`.
* The [annotation guidelines](#annotation-guidelines) that have been used for building and curating the dataset, if they've been defined in Argilla.

### Load with Argilla

To load with Argilla, install Argilla with `pip install argilla --upgrade` and then use the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("dvilasuero/sharegpt-curation")
```

### Load with `datasets`

To load this dataset with `datasets`, install `datasets` with `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset("dvilasuero/sharegpt-curation")
```

### Supported Tasks and Leaderboards

This dataset can contain [multiple fields, questions and responses](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#feedback-dataset) so it can be used for different NLP tasks, depending on the configuration. The dataset structure is described in the [Dataset Structure section](#dataset-structure).

There are no leaderboards associated with this dataset.

### Languages

[More Information Needed]

## Dataset Structure

### Data in Argilla

The dataset is created in Argilla with: **fields**, **questions**, **suggestions**, **metadata**, **vectors**, and **guidelines**.

The **fields** are the dataset records themselves; currently only text fields are supported. These are the contents that annotators respond to when answering the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| prompt | Prompt | FieldTypes.text | True | True |
| response | Response | FieldTypes.text | True | True |

The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, label_selection, multi_label_selection, or ranking.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| prompt-quality | Prompt Quality | QuestionTypes.rating | True | N/A | [1, 2, 3, 4, 5, 6, 7] |
| prompt-intent | Prompt Intent | QuestionTypes.label_selection | True | N/A | ['generation', 'rewrite', 'extract', 'closed-qa', 'open-qa', 'classification', 'summarization', 'brainstorming', 'chat', 'code', 'other'] |
| response-toxicity | Response Toxicity | QuestionTypes.multi_label_selection | False | N/A | ['illegal', 'harmfull', 'unqualified advice'] |

The **suggestions** are human- or machine-generated recommendations for each question that assist the annotator during the annotation process. They are always linked to an existing question and are named by appending "-suggestion" and "-suggestion-metadata" to the question name; these columns contain the value(s) of the suggestion and its metadata, respectively. The possible values are therefore the same as in the table above.

The **metadata** is a dictionary that can be used to provide additional information about the dataset record. This can be useful to give annotators additional context, for example a link to the original source of the record, or details such as the author, the date, or the source. The metadata is always optional, and can be linked to the `metadata_properties` defined in the dataset configuration file `argilla.yaml`.

| Metadata Name | Title | Type | Values | Visible for Annotators |
| ------------- | ----- | ---- | ------ | ---------------------- |
| prompt_token_length_mean | Prompt Token Length Mean | MetadataPropertyTypes.float | None - None | True |
| prompt_token_length_median | Prompt Token Length Median | MetadataPropertyTypes.float | None - None | True |
| prompt_token_length_std | Prompt Token Length Std | MetadataPropertyTypes.float | None - None | True |
| prompt_sentence_length_mean | Prompt Sentence Length Mean | MetadataPropertyTypes.float | None - None | True |
| prompt_sentence_length_median | Prompt Sentence Length Median | MetadataPropertyTypes.float | None - None | True |
| prompt_sentence_length_std | Prompt Sentence Length Std | MetadataPropertyTypes.float | None - None | True |
| prompt_syllables_per_token_mean | Prompt Syllables Per Token Mean | MetadataPropertyTypes.float | None - None | True |
| prompt_syllables_per_token_median | Prompt Syllables Per Token Median | MetadataPropertyTypes.float | None - None | True |
| prompt_syllables_per_token_std | Prompt Syllables Per Token Std | MetadataPropertyTypes.float | None - None | True |
| prompt_n_tokens | Prompt N Tokens | MetadataPropertyTypes.integer | None - None | True |
| prompt_n_unique_tokens | Prompt N Unique Tokens | MetadataPropertyTypes.integer | None - None | True |
| prompt_proportion_unique_tokens | Prompt Proportion Unique Tokens | MetadataPropertyTypes.float | None - None | True |
| prompt_n_characters | Prompt N Characters | MetadataPropertyTypes.integer | None - None | True |
| prompt_n_sentences | Prompt N Sentences | MetadataPropertyTypes.integer | None - None | True |
| response_token_length_mean | Response Token Length Mean | MetadataPropertyTypes.float | None - None | True |
| response_token_length_median | Response Token Length Median | MetadataPropertyTypes.float | None - None | True |
| response_token_length_std | Response Token Length Std | MetadataPropertyTypes.float | None - None | True |
| response_sentence_length_mean | Response Sentence Length Mean | MetadataPropertyTypes.float | None - None | True |
| response_sentence_length_median | Response Sentence Length Median | MetadataPropertyTypes.float | None - None | True |
| response_sentence_length_std | Response Sentence Length Std | MetadataPropertyTypes.float | None - None | True |
| response_syllables_per_token_mean | Response Syllables Per Token Mean | MetadataPropertyTypes.float | None - None | True |
| response_syllables_per_token_median | Response Syllables Per Token Median | MetadataPropertyTypes.float | None - None | True |
| response_syllables_per_token_std | Response Syllables Per Token Std | MetadataPropertyTypes.float | None - None | True |
| response_n_tokens | Response N Tokens | MetadataPropertyTypes.integer | None - None | True |
| response_n_unique_tokens | Response N Unique Tokens | MetadataPropertyTypes.integer | None - None | True |
| response_proportion_unique_tokens | Response Proportion Unique Tokens | MetadataPropertyTypes.float | None - None | True |
| response_n_characters | Response N Characters | MetadataPropertyTypes.integer | None - None | True |
| response_n_sentences | Response N Sentences | MetadataPropertyTypes.integer | None - None | True |

The **guidelines** are optional as well, and are a plain string that can be used to provide instructions to the annotators. Find them in the [annotation guidelines](#annotation-guidelines) section.

### Data Instances

An example of a dataset instance in Argilla looks as follows: ```json { "external_id": null, "fields": { "prompt": "You\u0027re an ebook writer. 
Write an ebook with title \"Intelligent Kitchen\" and provide sub headline for 50 winning products", "response": "\u003cdiv class=\"markdown prose w-full break-words dark:prose-invert light\"\u003e\u003cp\u003eTitle: Intelligent Kitchen: How to Transform Your Cooking Experience with Smart Technology\u003c/p\u003e\u003cp\u003eSub Headline: 50 Winning Products to Upgrade Your Kitchen into an Intelligent One\u003c/p\u003e\u003cp\u003eIntroduction:\nThe kitchen is the heart of any home, and it\u0027s where we spend a lot of our time cooking and preparing meals for our loved ones. With the advancement of technology, our kitchens are becoming more intelligent, making cooking and entertaining easier, faster, and more enjoyable. This ebook is a guide for anyone who wants to transform their kitchen into an intelligent one by incorporating smart technology. We\u0027ve curated a list of 50 winning products that will make your cooking experience more efficient, convenient, and fun.\u003c/p\u003e\u003cp\u003eChapter 1: Smart Appliances\u003c/p\u003e\u003col\u003e\u003cli\u003eSmart Ovens\u003c/li\u003e\u003cli\u003eSmart Fridges\u003c/li\u003e\u003cli\u003eSmart Dishwashers\u003c/li\u003e\u003cli\u003eSmart Coffee Makers\u003c/li\u003e\u003cli\u003eSmart Slow Cookers\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 2: Voice Assistants\n6. Amazon Echo\u003c/p\u003e\u003col start=\"7\"\u003e\u003cli\u003eGoogle Home\u003c/li\u003e\u003cli\u003eApple HomePod\u003c/li\u003e\u003cli\u003eSonos One\u003c/li\u003e\u003cli\u003eJBL Link\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 3: Smart Lighting\n11. Philips Hue\u003c/p\u003e\u003col start=\"12\"\u003e\u003cli\u003eLIFX\u003c/li\u003e\u003cli\u003eSylvania Smart+\u003c/li\u003e\u003cli\u003eIKEA Tradfri\u003c/li\u003e\u003cli\u003eGE Smart Bulbs\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 4: Smart Kitchen Gadgets\n16. 
Anova Sous Vide\u003c/p\u003e\u003col start=\"17\"\u003e\u003cli\u003eInstant Pot\u003c/li\u003e\u003cli\u003eSmart Scale\u003c/li\u003e\u003cli\u003eSmart Thermometer\u003c/li\u003e\u003cli\u003eFood Dehydrator\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 5: Smart Cookware\n21. Smart Pans\u003c/p\u003e\u003col start=\"22\"\u003e\u003cli\u003eSmart Grills\u003c/li\u003e\u003cli\u003eSmart Crockpots\u003c/li\u003e\u003cli\u003eSmart Pressure Cookers\u003c/li\u003e\u003cli\u003eSmart Rice Cookers\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 6: Smart Home Security\n26. Smart Locks\u003c/p\u003e\u003col start=\"27\"\u003e\u003cli\u003eSmart Doorbells\u003c/li\u003e\u003cli\u003eSecurity Cameras\u003c/li\u003e\u003cli\u003eSmart Smoke Detectors\u003c/li\u003e\u003cli\u003eSmart Carbon Monoxide Detectors\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 7: Smart Kitchen Storage\n31. Smart Refrigerator Organization\u003c/p\u003e\u003col start=\"32\"\u003e\u003cli\u003eSmart Pantry\u003c/li\u003e\u003cli\u003eSmart Drawer Organizers\u003c/li\u003e\u003cli\u003eSmart Spice Racks\u003c/li\u003e\u003cli\u003eSmart Cabinets\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 8: Smart Countertops\n36. Smart Cutting Boards\u003c/p\u003e\u003col start=\"37\"\u003e\u003cli\u003eSmart Kitchen Scales\u003c/li\u003e\u003cli\u003eSmart Kettles\u003c/li\u003e\u003cli\u003eSmart Toasters\u003c/li\u003e\u003cli\u003eSmart Blenders\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 9: Smart Kitchen Entertainment\n41. Smart TVs\u003c/p\u003e\u003col start=\"42\"\u003e\u003cli\u003eSmart Speakers\u003c/li\u003e\u003cli\u003eSmart Soundbars\u003c/li\u003e\u003cli\u003eSmart Projectors\u003c/li\u003e\u003cli\u003eSmart Streaming Devices\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 10: Smart Kitchen Decor\n46. 
Smart Artwork\u003c/p\u003e\u003col start=\"47\"\u003e\u003cli\u003eSmart Lighting Fixtures\u003c/li\u003e\u003cli\u003eSmart Kitchen Faucets\u003c/li\u003e\u003cli\u003eSmart Kitchen Mats\u003c/li\u003e\u003cli\u003eSmart Plants\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eConclusion:\nIncorporating smart technology into your kitchen can make a significant difference in your cooking experience. These 50 winning products can help you create an intelligent kitchen that is not only functional but also beautiful and entertaining. Whether you\u0027re a beginner or an experienced cook, there\u0027s something on this list that can enhance your cooking skills and make your life easier. So, get started on your journey towards an intelligent kitchen today!\u003c/p\u003e\u003c/div\u003e" }, "metadata": { "prompt_n_characters": 102, "prompt_n_sentences": 2, "prompt_n_tokens": 19, "prompt_n_unique_tokens": 17, "prompt_proportion_unique_tokens": 0.8947368421052632, "prompt_sentence_length_mean": 9.5, "prompt_sentence_length_median": 9.5, "prompt_sentence_length_std": 5.5, "prompt_syllables_per_token_mean": 1.3157894736842106, "prompt_syllables_per_token_median": 1.0, "prompt_syllables_per_token_std": 0.7292845505553167, "prompt_token_length_mean": 5.052631578947368, "prompt_token_length_median": 5.0, "prompt_token_length_std": 2.416471638578736, "response_n_characters": 2715, "response_n_sentences": 37, "response_n_tokens": 443, "response_n_unique_tokens": 229, "response_proportion_unique_tokens": 0.5169300225733634, "response_sentence_length_mean": 11.972972972972974, "response_sentence_length_median": 11.0, "response_sentence_length_std": 8.706898434610393, "response_syllables_per_token_mean": 1.4943566591422122, "response_syllables_per_token_median": 1.0, "response_syllables_per_token_std": 0.8948509487194239, "response_token_length_mean": 6.00902934537246, "response_token_length_median": 5.0, "response_token_length_std": 4.7444785260833635 }, "responses": [], "suggestions": 
[], "vectors": {} } ``` While the same record in HuggingFace `datasets` looks as follows: ```json { "external_id": null, "metadata": "{\"prompt_token_length_mean\": 5.052631578947368, \"prompt_token_length_median\": 5.0, \"prompt_token_length_std\": 2.416471638578736, \"prompt_sentence_length_mean\": 9.5, \"prompt_sentence_length_median\": 9.5, \"prompt_sentence_length_std\": 5.5, \"prompt_syllables_per_token_mean\": 1.3157894736842106, \"prompt_syllables_per_token_median\": 1.0, \"prompt_syllables_per_token_std\": 0.7292845505553167, \"prompt_n_tokens\": 19, \"prompt_n_unique_tokens\": 17, \"prompt_proportion_unique_tokens\": 0.8947368421052632, \"prompt_n_characters\": 102, \"prompt_n_sentences\": 2, \"response_token_length_mean\": 6.00902934537246, \"response_token_length_median\": 5.0, \"response_token_length_std\": 4.7444785260833635, \"response_sentence_length_mean\": 11.972972972972974, \"response_sentence_length_median\": 11.0, \"response_sentence_length_std\": 8.706898434610393, \"response_syllables_per_token_mean\": 1.4943566591422122, \"response_syllables_per_token_median\": 1.0, \"response_syllables_per_token_std\": 0.8948509487194239, \"response_n_tokens\": 443, \"response_n_unique_tokens\": 229, \"response_proportion_unique_tokens\": 0.5169300225733634, \"response_n_characters\": 2715, \"response_n_sentences\": 37}", "prompt": "You\u0027re an ebook writer. 
Write an ebook with title \"Intelligent Kitchen\" and provide sub headline for 50 winning products", "prompt-intent": [], "prompt-intent-suggestion": null, "prompt-intent-suggestion-metadata": { "agent": null, "score": null, "type": null }, "prompt-quality": [], "prompt-quality-suggestion": null, "prompt-quality-suggestion-metadata": { "agent": null, "score": null, "type": null }, "response": "\u003cdiv class=\"markdown prose w-full break-words dark:prose-invert light\"\u003e\u003cp\u003eTitle: Intelligent Kitchen: How to Transform Your Cooking Experience with Smart Technology\u003c/p\u003e\u003cp\u003eSub Headline: 50 Winning Products to Upgrade Your Kitchen into an Intelligent One\u003c/p\u003e\u003cp\u003eIntroduction:\nThe kitchen is the heart of any home, and it\u0027s where we spend a lot of our time cooking and preparing meals for our loved ones. With the advancement of technology, our kitchens are becoming more intelligent, making cooking and entertaining easier, faster, and more enjoyable. This ebook is a guide for anyone who wants to transform their kitchen into an intelligent one by incorporating smart technology. We\u0027ve curated a list of 50 winning products that will make your cooking experience more efficient, convenient, and fun.\u003c/p\u003e\u003cp\u003eChapter 1: Smart Appliances\u003c/p\u003e\u003col\u003e\u003cli\u003eSmart Ovens\u003c/li\u003e\u003cli\u003eSmart Fridges\u003c/li\u003e\u003cli\u003eSmart Dishwashers\u003c/li\u003e\u003cli\u003eSmart Coffee Makers\u003c/li\u003e\u003cli\u003eSmart Slow Cookers\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 2: Voice Assistants\n6. Amazon Echo\u003c/p\u003e\u003col start=\"7\"\u003e\u003cli\u003eGoogle Home\u003c/li\u003e\u003cli\u003eApple HomePod\u003c/li\u003e\u003cli\u003eSonos One\u003c/li\u003e\u003cli\u003eJBL Link\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 3: Smart Lighting\n11. 
Philips Hue\u003c/p\u003e\u003col start=\"12\"\u003e\u003cli\u003eLIFX\u003c/li\u003e\u003cli\u003eSylvania Smart+\u003c/li\u003e\u003cli\u003eIKEA Tradfri\u003c/li\u003e\u003cli\u003eGE Smart Bulbs\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 4: Smart Kitchen Gadgets\n16. Anova Sous Vide\u003c/p\u003e\u003col start=\"17\"\u003e\u003cli\u003eInstant Pot\u003c/li\u003e\u003cli\u003eSmart Scale\u003c/li\u003e\u003cli\u003eSmart Thermometer\u003c/li\u003e\u003cli\u003eFood Dehydrator\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 5: Smart Cookware\n21. Smart Pans\u003c/p\u003e\u003col start=\"22\"\u003e\u003cli\u003eSmart Grills\u003c/li\u003e\u003cli\u003eSmart Crockpots\u003c/li\u003e\u003cli\u003eSmart Pressure Cookers\u003c/li\u003e\u003cli\u003eSmart Rice Cookers\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 6: Smart Home Security\n26. Smart Locks\u003c/p\u003e\u003col start=\"27\"\u003e\u003cli\u003eSmart Doorbells\u003c/li\u003e\u003cli\u003eSecurity Cameras\u003c/li\u003e\u003cli\u003eSmart Smoke Detectors\u003c/li\u003e\u003cli\u003eSmart Carbon Monoxide Detectors\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 7: Smart Kitchen Storage\n31. Smart Refrigerator Organization\u003c/p\u003e\u003col start=\"32\"\u003e\u003cli\u003eSmart Pantry\u003c/li\u003e\u003cli\u003eSmart Drawer Organizers\u003c/li\u003e\u003cli\u003eSmart Spice Racks\u003c/li\u003e\u003cli\u003eSmart Cabinets\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 8: Smart Countertops\n36. Smart Cutting Boards\u003c/p\u003e\u003col start=\"37\"\u003e\u003cli\u003eSmart Kitchen Scales\u003c/li\u003e\u003cli\u003eSmart Kettles\u003c/li\u003e\u003cli\u003eSmart Toasters\u003c/li\u003e\u003cli\u003eSmart Blenders\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 9: Smart Kitchen Entertainment\n41. 
Smart TVs\u003c/p\u003e\u003col start=\"42\"\u003e\u003cli\u003eSmart Speakers\u003c/li\u003e\u003cli\u003eSmart Soundbars\u003c/li\u003e\u003cli\u003eSmart Projectors\u003c/li\u003e\u003cli\u003eSmart Streaming Devices\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eChapter 10: Smart Kitchen Decor\n46. Smart Artwork\u003c/p\u003e\u003col start=\"47\"\u003e\u003cli\u003eSmart Lighting Fixtures\u003c/li\u003e\u003cli\u003eSmart Kitchen Faucets\u003c/li\u003e\u003cli\u003eSmart Kitchen Mats\u003c/li\u003e\u003cli\u003eSmart Plants\u003c/li\u003e\u003c/ol\u003e\u003cp\u003eConclusion:\nIncorporating smart technology into your kitchen can make a significant difference in your cooking experience. These 50 winning products can help you create an intelligent kitchen that is not only functional but also beautiful and entertaining. Whether you\u0027re a beginner or an experienced cook, there\u0027s something on this list that can enhance your cooking skills and make your life easier. So, get started on your journey towards an intelligent kitchen today!\u003c/p\u003e\u003c/div\u003e", "response-toxicity": [], "response-toxicity-suggestion": null, "response-toxicity-suggestion-metadata": { "agent": null, "score": null, "type": null } } ``` ### Data Fields Among the dataset fields, we differentiate between the following: * **Fields:** These are the dataset records themselves, for the moment just text fields are supported. These are the ones that will be used to provide responses to the questions. * **prompt** is of type `FieldTypes.text`. * **response** is of type `FieldTypes.text`. * **Questions:** These are the questions that will be asked to the annotators. They can be of different types, such as `RatingQuestion`, `TextQuestion`, `LabelQuestion`, `MultiLabelQuestion`, and `RankingQuestion`. * **prompt-quality** is of type `QuestionTypes.rating` with the following allowed values [1, 2, 3, 4, 5, 6, 7]. 
    * **prompt-intent** is of type `QuestionTypes.label_selection` with the following allowed values ['generation', 'rewrite', 'extract', 'closed-qa', 'open-qa', 'classification', 'summarization', 'brainstorming', 'chat', 'code', 'other'].
    * (optional) **response-toxicity** is of type `QuestionTypes.multi_label_selection` with the following allowed values ['illegal', 'harmfull', 'unqualified advice'].

* **Suggestions:** As of Argilla 1.13.0, suggestions have been included to assist the annotators during the annotation process. Suggestions are linked to the existing questions, are always optional, and contain not just the suggestion itself, but also the metadata linked to it, if applicable.

    * (optional) **prompt-quality-suggestion** is of type `QuestionTypes.rating` with the following allowed values [1, 2, 3, 4, 5, 6, 7].
    * (optional) **prompt-intent-suggestion** is of type `QuestionTypes.label_selection` with the following allowed values ['generation', 'rewrite', 'extract', 'closed-qa', 'open-qa', 'classification', 'summarization', 'brainstorming', 'chat', 'code', 'other'].
    * (optional) **response-toxicity-suggestion** is of type `QuestionTypes.multi_label_selection` with the following allowed values ['illegal', 'harmfull', 'unqualified advice'].

Additionally, there are two more optional fields:

* **metadata:** This is an optional field that can be used to provide additional information about the dataset record. This can be useful to give annotators additional context, for example a link to the original source of the record, or details such as the author, the date, or the source. The metadata is always optional, and can be linked to the `metadata_properties` defined in the dataset configuration file `argilla.yaml`.
* **external_id:** This is an optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.

### Data Splits

The dataset contains a single split, which is `train`.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation guidelines

# Remarks

## Classification

- Prompt in English, and response in another language -> toxicity: unqualified advice. (rationale: models are not supposed to answer in other languages)
- System prompts for initial message -> prompt intent: other. (rationale: this should have been a category for itself)

## Discards

- very short conversational prompts/responses (rationale: we cannot use this for proper model training)
- complete nonsense prompts and responses (rationale: we cannot use this for proper model training)
- prompts in other languages (rationale: langdetect for classification of languages might not work)
- prompts and responses that were clearly part of a chat-conversation (rationale: might be wrong data processing)
- prompts that assume the use of tools, for example the ability to scrape a website

# Guidelines

The ShareGPT dataset is a dataset that was collected by public users who were using the Google Chrome extension offered by [sharegpt.com](sharegpt.com) to share their ChatGPT conversations. This data should mimic real-life usage of the model and can therefore be used to fine-tune a model for an actual scenario.
Additionally, Google was accused of using this dataset as a baseline to train its [BARD](https://www.theverge.com/2023/3/29/23662621/google-bard-chatgpt-sharegpt-training-denies) model.

We decided to use a random subset of the raw version of the dataset including all conversations, but we filtered out non-English conversations. The raw dataset used can be found on [the Hugging Face hub](https://huggingface.co/datasets/zetavg/ShareGPT-Processed).

## Classification Tasks

The dataset aims to classify three things:

1. Quality
2. Intent
3. Toxicity

### Quality

For the quality, we have decided to define a rating question on a scale from 1 to 7. This question is used to assess the quality of the prompt, based on quality, grammar and clarity of context.

### Intent

For the intent, we have decided to define a single-label classification question which will be used to determine the prompt types as defined in the [Llama 2 paper](https://arxiv.org/abs/2307.09288). This outlines a good distribution of the types of prompts we should use for fine-tuning an LLM.

- Generation: A statement that indicates the need for open generation.
- Rewrite: A statement that is rewritten by the model.
- Extract: A question that is answered by extracting from the context in the prompt.
- Closed QA: A question that is answered by the model based on the context.
- Open QA: A question that is answered by the model without any context.
- Classification: A statement that is classified by the model.
- Summarization: A summary of the context.
- Brainstorming: A list of ideas or concepts.
- Chat: A chat between two people.
- Code: A statement about code or programming.
- Other: A prompt that does not fit into any of the above categories.

### Toxicity

For the Toxicity, we decided to use the [Guardrailing Mistral](https://docs.mistral.ai/usage/guardrailing) proposal as a reference. The fine-grained labels below are those proposed by the Mistral team; for the overall dataset, we use a simplified version of them.

- Illegal: Illegal activity.
- Child abuse: child sexual abuse material or any content that exploits or harms children.
- Hate violence harassment: Generation of hateful, harassing, or violent content: content that expresses, incites, or promotes hate based on identity, content that intends to harass, threaten, or bully an individual, content that promotes or glorifies violence or celebrates the suffering or humiliation of others.
- Malware: Generation of malware: content that attempts to generate code that is designed to disrupt, damage, or gain unauthorized access to a computer system.
- Physical harm: activity that has high risk of physical harm, including: weapons development, military and warfare, management or operation of critical infrastructure in energy, transportation, and water, content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders.
- Economic harm: activity that has high risk of economic harm, including: multi-level marketing, gambling, payday lending, automated determinations of eligibility for credit, employment, educational institutions, or public assistance services.
- Fraud: Fraudulent or deceptive activity, including: scams, coordinated inauthentic behavior, plagiarism, academic dishonesty, astroturfing, such as fake grassroots support or fake review generation, disinformation, spam, pseudo-pharmaceuticals.
- Adult: Adult content, adult industries, and dating apps, including: content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness), erotic chat, pornography.
- Political: Political campaigning or lobbying, by: generating high volumes of campaign materials, generating campaign materials personalized to or targeted at specific demographics, building conversational or interactive systems such as chatbots that provide information about campaigns or engage in political advocacy or lobbying, building products for political campaigning or lobbying purposes.
- Privacy: Activity that violates people's privacy, including: tracking or monitoring an individual without their consent, facial recognition of private individuals, classifying individuals based on protected characteristics, using biometrics for identification or assessment, unlawful collection or disclosure of personal identifiable information or educational, financial, or other protected records.
- Unqualified law: Engaging in the unauthorized practice of law, or offering tailored legal advice without a qualified person reviewing the information.
- Unqualified financial: Offering tailored financial advice without a qualified person reviewing the information.
- Unqualified health: Telling someone that they have or do not have a certain health condition, or providing instructions on how to cure or treat a health condition.

These have been mapped to the following simplified labels:

- Illegal: Illegal activities such as terrorism, child abuse or fraud.
- Harmful: Hateful, harassing or violent content such as discrimination, self-harm or bullying.
- Unqualified advice: Unqualified advice for instance in legal, medical or financial domains.

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
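Provided by:
dvilasuero
Summary of original information

Dataset card for sharegpt-curation

Dataset description

  • Dataset summary
    • The dataset contains a configuration file, argilla.yaml, that conforms to the Argilla dataset format and is used to configure the dataset when calling Argilla's FeedbackDataset.from_huggingface method.
    • Dataset records are stored in a format compatible with HuggingFace datasets; they are loaded automatically by FeedbackDataset.from_huggingface and can also be loaded independently with the datasets library.
    • If they have been defined in Argilla, the dataset also includes the annotation guidelines used to build and curate it.

Loading the dataset

Load with Argilla

import argilla as rg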

ds = rg.FeedbackDataset.from_huggingface("dvilasuero/sharegpt-curation")

Load with the datasets library

from datasets import load_dataset

ds = load_dataset("dvilasuero/sharegpt-curation")
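
Records loaded this way store `metadata` as a JSON-encoded string rather than a dictionary, as the `datasets` record example in the card shows. The sketch below decodes that string; `parse_metadata` is a hypothetical helper name, and the sample record is a trimmed copy of the card's example rather than a live download.

```python
import json

# Trimmed copy of the example record from the dataset card. In the
# `datasets` format, `metadata` is a JSON-encoded string, not a dict.
record = {
    "external_id": None,
    "metadata": (
        '{"prompt_n_characters": 102, "prompt_n_sentences": 2, '
        '"prompt_n_tokens": 19, "prompt_n_unique_tokens": 17}'
    ),
}


def parse_metadata(rec):
    """Return a record's metadata as a dict (empty when absent).

    Hypothetical helper, not part of Argilla or `datasets`.
    """
    raw = rec.get("metadata")
    if not raw:
        return {}
    # Accept an already-decoded mapping as well as the JSON string form.
    return json.loads(raw) if isinstance(raw, str) else dict(raw)


meta = parse_metadata(record)
unique_ratio = meta["prompt_n_unique_tokens"] / meta["prompt_n_tokens"]
print(f"{meta['prompt_n_tokens']} tokens, {unique_ratio:.3f} unique")
```

With a real download, the same helper could be applied to each row of the `train` split, for example to filter records by prompt length.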

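Supported tasks and leaderboards

  • The dataset can contain multiple fields, questions and responses, so it can be used for different NLP tasks depending on the configuration.
  • The dataset structure is described in the Dataset structure section.
  • There are no leaderboards associated with this dataset.

Dataset structure

Data in Argilla

In Argilla, the dataset consists of the following elements:

  • Fields

    • Currently only text fields are supported; they hold the content that annotators respond to when answering the questions.
    | Field name | Title | Type | Required | Markdown |
    | ---------- | ----- | ---- | -------- | -------- |
    | prompt | Prompt | FieldTypes.text | True | True |
    | response | Response | FieldTypes.text | True | True |
  • Questions

    • Questions can be of different types, such as rating, text, label selection, multi-label selection, or ranking.
    | Question name | Title | Type | Required | Description | Values/Labels |
    | ------------- | ----- | ---- | -------- | ----------- | ------------- |
    | prompt-quality | Prompt Quality | QuestionTypes.rating | True | N/A | [1, 2, 3, 4, 5, 6, 7] |
    | prompt-intent | Prompt Intent | QuestionTypes.label_selection | True | N/A | [generation, rewrite, extract, closed-qa, open-qa, classification, summarization, brainstorming, chat, code, other] |
    | response-toxicity | Response Toxicity | QuestionTypes.multi_label_selection | False | N/A | [illegal, harmfull, unqualified advice] |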
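
The question definitions listed above can be turned into a simple consistency check on annotated answers. This is a minimal sketch under stated assumptions: `QUESTIONS` and `validate_answers` are hypothetical names, the answer format (a plain dict keyed by question name) is assumed for illustration, and none of it is part of the Argilla API.

```python
# Question definitions copied from the listing above. The label spelling
# 'harmfull' is the dataset's own and must be matched exactly.
QUESTIONS = {
    "prompt-quality": {
        "type": "rating",
        "required": True,
        "values": [1, 2, 3, 4, 5, 6, 7],
    },
    "prompt-intent": {
        "type": "label_selection",
        "required": True,
        "values": [
            "generation", "rewrite", "extract", "closed-qa", "open-qa",
            "classification", "summarization", "brainstorming", "chat",
            "code", "other",
        ],
    },
    "response-toxicity": {
        "type": "multi_label_selection",
        "required": False,
        "values": ["illegal", "harmfull", "unqualified advice"],
    },
}


def validate_answers(answers):
    """Return a list of problems found in one record's answers.

    Hypothetical helper; `answers` maps question names to values.
    """
    problems = []
    for name, spec in QUESTIONS.items():
        if name not in answers:
            if spec["required"]:
                problems.append(f"missing required answer: {name}")
            continue
        value = answers[name]
        # Multi-label answers are lists of labels; other types hold one value.
        chosen = list(value) if spec["type"] == "multi_label_selection" else [value]
        for item in chosen:
            if item not in spec["values"]:
                problems.append(f"{name}: {item!r} is not an allowed value")
    for name in answers:
        if name not in QUESTIONS:
            problems.append(f"unknown question: {name}")
    return problems


print(validate_answers({"prompt-quality": 5, "prompt-intent": "code"}))
print(validate_answers({"prompt-quality": 8, "prompt-intent": "code"}))
```

A check like this also catches the correctly spelled label `harmful`, which is not among the dataset's allowed values.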
  • Suggestions

    • Suggestions are human- or machine-generated recommendations linked to existing questions; they assist annotators during the annotation process.

  • Metadata

    • Metadata is a dictionary that provides extra information about a dataset record, such as a link to the original source or the record's author, date, and source.

    | Metadata Name | Title | Type | Values | Visible for Annotators |
    | ------------- | ----- | ---- | ------ | ---------------------- |
    | prompt_token_length_mean | Prompt Token Length Mean | MetadataPropertyTypes.float | None - None | True |
    | prompt_token_length_median | Prompt Token Length Median | MetadataPropertyTypes.float | None - None | True |
    | prompt_token_length_std | Prompt Token Length Std | MetadataPropertyTypes.float | None - None | True |
    | prompt_sentence_length_mean | Prompt Sentence Length Mean | MetadataPropertyTypes.float | None - None | True |
    | prompt_sentence_length_median | Prompt Sentence Length Median | MetadataPropertyTypes.float | None - None | True |
    | prompt_sentence_length_std | Prompt Sentence Length Std | MetadataPropertyTypes.float | None - None | True |
    | prompt_syllables_per_token_mean | Prompt Syllables Per Token Mean | MetadataPropertyTypes.float | None - None | True |
    | prompt_syllables_per_token_median | Prompt Syllables Per Token Median | MetadataPropertyTypes.float | None - None | True |
    | prompt_syllables_per_token_std | Prompt Syllables Per Token Std | MetadataPropertyTypes.float | None - None | True |
    | prompt_n_tokens | Prompt N Tokens | MetadataPropertyTypes.integer | None - None | True |
    | prompt_n_unique_tokens | Prompt N Unique Tokens | MetadataPropertyTypes.integer | None - None | True |
    | prompt_proportion_unique_tokens | Prompt Proportion Unique Tokens | MetadataPropertyTypes.float | None - None | True |
    | prompt_n_characters | Prompt N Characters | MetadataPropertyTypes.integer | None - None | True |
    | prompt_n_sentences | Prompt N Sentences | MetadataPropertyTypes.integer | None - None | True |
    | response_token_length_mean | Response Token Length Mean | MetadataPropertyTypes.float | None - None | True |
    | response_token_length_median | Response Token Length Median | MetadataPropertyTypes.float | None - None | True |
    | response_token_length_std | Response Token Length Std | MetadataPropertyTypes.float | None - None | True |
    | response_sentence_length_mean | Response Sentence Length Mean | MetadataPropertyTypes.float | None - None | True |
    | response_sentence_length_median | Response Sentence Length Median | MetadataPropertyTypes.float | None - None | True |
    | response_sentence_length_std | Response Sentence Length Std | MetadataPropertyTypes.float | None - None | True |
    | response_syllables_per_token_mean | Response Syllables Per Token Mean | MetadataPropertyTypes.float | None - None | True |
    | response_syllables_per_token_median | Response Syllables Per Token Median | MetadataPropertyTypes.float | None - None | True |
    | response_syllables_per_token_std | Response Syllables Per Token Std | MetadataPropertyTypes.float | None - None | True |
    | response_n_tokens | Response N Tokens | MetadataPropertyTypes.integer | None - None | True |
    | response_n_unique_tokens | Response N Unique Tokens | MetadataPropertyTypes.integer | None - None | True |
    | response_proportion_unique_tokens | Response Proportion Unique Tokens | MetadataPropertyTypes.float | None - None | True |
    | response_n_characters | Response N Characters | MetadataPropertyTypes.integer | None - None | True |
    | response_n_sentences | Response N Sentences | MetadataPropertyTypes.integer | None - None | True |

  • Guidelines

    • Guidelines are optional and provide instructions to the annotators.
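
As a rough illustration of how the three question definitions constrain an annotator's answers, the sketch below checks one response against the names, types, and allowed values listed for the questions in this section. It uses plain Python only; `QUESTIONS` and `validate_annotation` are hypothetical names introduced for this example and are not part of Argilla's API.

```python
# Question definitions transcribed from the questions listed in this section.
# The structure of this dict is an assumption made for the example.
QUESTIONS = {
    "prompt-quality": {
        "type": "rating",
        "required": True,
        "values": list(range(1, 8)),  # 1 through 7
    },
    "prompt-intent": {
        "type": "label_selection",
        "required": True,
        "values": [
            "generation", "rewrite", "extract", "closed-qa", "open-qa",
            "classification", "summarization", "brainstorming", "chat",
            "code", "other",
        ],
    },
    "response-toxicity": {
        "type": "multi_label_selection",
        "required": False,
        # Label spelling kept exactly as it appears in the dataset configuration.
        "values": ["illegal", "harmfull", "unqualified advice"],
    },
}


def validate_annotation(annotation):
    """Return a list of problems found in one annotator response (hypothetical helper)."""
    problems = []
    for name, spec in QUESTIONS.items():
        if name not in annotation:
            if spec["required"]:
                problems.append(f"missing required question: {name}")
            continue
        answer = annotation[name]
        # Multi-label answers are lists; single-choice answers are wrapped in a list.
        if spec["type"] == "multi_label_selection" and isinstance(answer, (list, tuple)):
            selected = list(answer)
        else:
            selected = [answer]
        for value in selected:
            if value not in spec["values"]:
                problems.append(f"invalid value for {name}: {value!r}")
    for name in annotation:
        if name not in QUESTIONS:
            problems.append(f"unknown question: {name}")
    return problems


print(validate_annotation({"prompt-quality": 6, "prompt-intent": "generation"}))
print(validate_annotation({"prompt-quality": 9, "response-toxicity": ["illegal", "spam"]}))
```

The first call passes because `response-toxicity` is optional; the second reports the out-of-range rating, the missing `prompt-intent`, and the unknown toxicity label.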

Data instances

An example of a dataset instance in Argilla looks as follows:

{
  "external_id": null,
  "fields": {
    "prompt": "Youre an ebook writer. Write an ebook with title \"Intelligent Kitchen\" and provide sub headline for 50 winning products",
    "response": "<div class=\"markdown prose w-full break-words dark:prose-invert light\"><p>Title: Intelligent Kitchen: How to Transform Your Cooking Experience with Smart Technology</p><p>Sub Headline: 50 Winning Products to Upgrade Your Kitchen into an Intelligent One</p><p>Introduction: The kitchen is the heart of any home, and its where we spend a lot of our time cooking and preparing meals for our loved ones. With the advancement of technology, our kitchens are becoming more intelligent, making cooking and entertaining easier, faster, and more enjoyable. This ebook is a guide for anyone who wants to transform their kitchen into an intelligent one by incorporating smart technology. Weve curated a list of 50 winning products that will make your cooking experience more efficient, convenient, and fun.</p><p>Chapter 1: Smart Appliances</p><ol><li>Smart Ovens</li><li>Smart Fridges</li><li>Smart Dishwashers</li><li>Smart Coffee Makers</li><li>Smart Slow Cookers</li></ol><p>Chapter 2: Voice Assistants 6. Amazon Echo</p><ol start=\"7\"><li>Google Home</li><li>Apple HomePod</li><li>Sonos One</li><li>JBL Link</li></ol><p>Chapter 3: Smart Lighting 11. Philips Hue</p><ol start=\"12\"><li>LIFX</li><li>Sylvania Smart+</li><li>IKEA Tradfri</li><li>GE Smart Bulbs</li></ol><p>Chapter 4: Smart Kitchen Gadgets 16. Anova Sous Vide</p><ol start=\"17\"><li>Instant Pot</li><li>Smart Scale</li><li>Smart Thermometer</li><li>Food Dehydrator</li></ol><p>Chapter 5: Smart Cookware 21. Smart Pans</p><ol start=\"22\"><li>Smart Grills</li><li>Smart Crockpots</li><li>Smart Pressure Cookers</li><li>Smart Rice Cookers</li></ol><p>Chapter 6: Smart Home Security 26. Smart Locks</p><ol start=\"27\"><li>Smart Doorbells</li><li>Security Cameras</li><li>Smart Smoke Detectors</li><li>Smart Carbon Monoxide Detectors</li></ol><p>Chapter 7: Smart Kitchen Storage 31. Smart Refrigerator Organization</p><ol start=\"32\"><li>Smart Pantry</li><li>Smart Drawer Organizers</li><li>Smart Spice Racks</li><li>Smart Cabinets</li></ol><p>Chapter 8: Smart Countertops 36. Smart Cutting Boards</p><ol start=\"37\"><li>Smart Kitchen Scales</li><li>Smart Kettles</li><li>Smart Toasters</li><li>Smart Blenders</li></ol><p>Chapter 9: Smart Kitchen Entertainment 41. Smart TVs</p><ol start=\"42\"><li>Smart Speakers</li><li>Smart Soundbars</li><li>Smart Projectors</li><li>Smart Streaming Devices</li></ol><p>Chapter 10: Smart Kitchen Decor 46. Smart Artwork</p><ol start=\"47\"><li>Smart Lighting Fixtures</li><li>Smart Kitchen Faucets</li><li>Smart Kitchen Mats</li><li>Smart Plants</li></ol><p>Conclusion: Incorporating smart technology into your kitchen can make a significant difference in your cooking experience. These 50 winning products can help you create an intelligent kitchen that is not only functional but also beautiful and entertaining. Whether youre a beginner or an experienced cook, theres something on this list that can enhance your cooking skills and make your life easier. So, get started on your journey towards an intelligent kitchen today!</p></div>"
  },
  "metadata": {
    "prompt_n_characters": 102,
    "prompt_n_sentences": 2,
    "prompt_n_tokens": 19,
    "prompt_n_unique_tokens": 17,
    "prompt_proportion_unique_tokens": 0.8947368421052632,
    "prompt_sentence_length_mean": 9.5,
    "prompt_sentence_length_median": 9.5,
    "prompt_sentence_length_std": 5.5,
    "prompt_syllables_per_token_mean": 1.3157894736842106,
    "prompt_syllables_per_token_median": 1.0,
    "prompt_syllables_per_token_std": 0.7292845505553167,
    "prompt_token_length_mean": 5.052631578947368,
    "prompt_token_length_median": 5.0,
    "prompt_token_length_std": 2.416471638578736,
    "response_n_characters": 2715,
    "response_n_sentences": 37,
    "response_n_tokens": 443,
    "response_n_unique_tokens": 229,
    "response_proportion_unique_tokens": 0.5169300225733634,
    "response_sentence_length_mean": 11.972972972972974,
    "response_sentence_length_median": 11.0,
    "response_sentence_length_std": 8.706898434610393,
    "response_syllables_per_token_mean": 1.4943566591422122,
    "response_syllables_per_token_median": 1.0,
    "response_syllables_per_token_std": 0.8948509487194239,
    "response_token_length_mean": 6.00902934537246,
    "response_token_length_median": 5.0,
    "response_token_length_std": 4.7444785260833635
  },
  "responses": [],
  "suggestions": [],
  "vectors": {}
}
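
The descriptive metadata in the record above can be sanity-checked from its own counts. The sketch below assumes, based only on the sample values, that each `*_proportion_unique_tokens` property is the number of unique tokens divided by the number of tokens; `proportion_unique` is a hypothetical helper written for this example.

```python
# Values copied from the "metadata" block of the sample record above.
metadata = {
    "prompt_n_tokens": 19,
    "prompt_n_unique_tokens": 17,
    "prompt_proportion_unique_tokens": 0.8947368421052632,
    "response_n_tokens": 443,
    "response_n_unique_tokens": 229,
    "response_proportion_unique_tokens": 0.5169300225733634,
}


def proportion_unique(meta, prefix):
    """Recompute the unique-token proportion for 'prompt' or 'response'.

    Assumes proportion = n_unique_tokens / n_tokens, which is inferred
    from the sample values rather than stated in the dataset card.
    """
    return meta[f"{prefix}_n_unique_tokens"] / meta[f"{prefix}_n_tokens"]


for prefix in ("prompt", "response"):
    recomputed = proportion_unique(metadata, prefix)
    stored = metadata[f"{prefix}_proportion_unique_tokens"]
    # Compare with a tolerance instead of exact float equality.
    print(prefix, abs(recomputed - stored) < 1e-12)
```

The same kind of check could be used to spot records whose metadata was computed on a different version of the text than the one stored in the fields.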

The same record in HuggingFace datasets looks as follows:

{
  "external_id": null,
  "metadata": "{\"prompt_token_length_mean\": 5.052631578947368, \"prompt_token_length_median\": 5.0, \"prompt_token_length_std\": 2.416471638578736, \"prompt_sentence_length_mean\": 9.5, \"prompt_sentence_length_median\": 9.5, \"prompt_sentence_length_std\": 5.5, \"prompt_syllables_per_token_mean\": 1.3157894736842106, \"prompt_syllables_per_token_median\": 1.0, \"prompt_syllables_per_token_std\": 0.7292845505553167, \"prompt_n_tokens\": 19, \"prompt_n_unique_tokens\": 17, \"prompt_proportion_unique_tokens\": 0.8947368421052632, \"prompt_n_characters\": 102, \"prompt_n_sentences\": 2, \"response_token_length_mean\": 6.00902934537246, \"response_token_length_median\": 5.0, \"response_token_length_std\": 4.7444785260833635, \"response_sentence_length_mean\": 11.972972972972974, \"response_sentence_length_median\": 11.0, \"response_sentence_length_std\": 8.706898434610393, \"response_syllables_per_token_mean\": 1.4943566591422122, \"response_syllables_per_token_median\": 1.0, \"response_syllables_per_token_std\": 0.8948509487194239, \"response_n_tokens\": 443, \"response_n_unique_tokens\": 229, \"response_proportion_unique_tokens\": 0.5169300225733634, \"response_n_characters\": 2715, \"response_n_sentences\": 37}",
  "prompt": "Youre an ebook writer. Write an ebook with title \"Intelligent Kitchen\" and provide sub headline for 50 winning products",
  "prompt-intent": [],
  "prompt-intent-suggestion": null,
  "prompt-intent-suggestion-metadata": {
    "agent": null,
    "score": null,
    "type": null
  },
  "prompt-quality": [],
  "prompt-quality-suggestion": null,
  "prompt-quality-suggestion-metadata": {
    "agent": null,
    "score": null,
    "type": null
  },
  "response": "<div class=\"markdown prose w-full break-words dark:prose-invert light\"><p>Title: Intelligent Kitchen: How to Transform Your Cooking Experience with Smart Technology</p><p>Sub Headline: 50 Winning Products to Upgrade Your Kitchen into an Intelligent One</
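
In the `datasets` representation shown above, `metadata` is a JSON-encoded string rather than a nested object, so it has to be decoded before individual properties can be read. The following is a minimal sketch using only the standard library; `row` is an abbreviated stand-in for a real record (only two metadata keys are kept), and `decode_metadata` is a hypothetical helper.

```python
import json

# Abbreviated stand-in for one row as returned by the `datasets` library.
# Only two metadata keys are kept to keep the example short.
row = {
    "external_id": None,
    "metadata": '{"prompt_n_tokens": 19, "response_n_tokens": 443}',
    "prompt-quality": [],
}


def decode_metadata(record):
    """Return a copy of the record with its JSON-encoded metadata parsed into a dict."""
    decoded = dict(record)
    raw = record.get("metadata")
    # Leave already-decoded metadata untouched; treat missing metadata as empty.
    decoded["metadata"] = json.loads(raw) if isinstance(raw, str) else (raw or {})
    return decoded


parsed = decode_metadata(row)
print(parsed["metadata"]["prompt_n_tokens"])
```

The helper returns a shallow copy, so the original row keeps its string form and can still be compared against the stored data.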
