synthetic_pii_finance_multilingual

Name: synthetic_pii_finance_multilingual
Creator: maas
Published: 2025-12-05 16:35:23
License: No description provided

ModelScope community; updated 2025-12-05; indexed 2025-05-24

Download link:

https://modelscope.cn/datasets/gretelai/synthetic_pii_finance_multilingual

Resource description:

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/632ca8dcdbea00ca213d101a/nxKiabD9puCKhJCDciMto.webp" alt="gretelai/synthetic_pii_finance_multilingual v1" width="600px">
<p><em>Image generated by DALL-E. See <a href="https://huggingface.co/datasets/gretelai/synthetic_text_to_sql/blob/main/dalle_prompt.txt">prompt</a> for more details</em></p>
</center>

# 💼 📊 Synthetic Financial Domain Documents with PII Labels

**gretelai/synthetic_pii_finance_multilingual** is a dataset of full length synthetic financial documents containing Personally Identifiable Information (PII), generated using [Gretel Navigator](https://gretel.ai/gretel-navigator) and released under Apache 2.0.

This dataset is designed to assist with the following use cases:

1. 🏷️ Training NER (Named Entity Recognition) models to detect and label PII in different domains.
2. 🔍 Testing PII scanning systems on real, full-length documents unique to different domains.
3. 📊 Evaluating the performance of de-identification systems on realistic documents containing PII.
4. 🔒 Developing and testing data privacy solutions for the financial industry.

## Example Labeled Document

![image/png](https://cdn-uploads.huggingface.co/production/uploads/632ca8dcdbea00ca213d101a/EHx7Jt7mjchE4PvWqtm5m.png)

## Distribution of PII Types in Dataset

![image/png](https://cdn-uploads.huggingface.co/production/uploads/632ca8dcdbea00ca213d101a/Ve8EKZ0OWZjYmPBa0PoRQ.png)

# Dataset Contents

- **55,940 records** partitioned into 50,776 train and 5,164 test records.
- Coverage across **100 distinct financial document formats**, with 20 specific subtypes for each format commonly used in the financial industry.
- Synthetic PII with **29 distinct PII types** (see table below).
- Full-length documents with an **average length of 1,357 characters**, providing context for PII detection and labeling tasks.
- Multilingual support, with documents in **English, Spanish, Swedish, German, Italian, Dutch, and French**.

### Language Support:

- **English**: 28,910 documents
- **Spanish**: 4,609 documents
- **Swedish**: 4,543 documents
- **German**: 4,530 documents
- **Italian**: 4,473 documents
- **Dutch**: 4,449 documents
- **French**: 4,426 documents

### Distinct PII Types:

The types of personally identifiable information (PII) included in this dataset are popular types used in the financial industry. When possible, tag names are aligned with the Python `Faker` generator names, simplifying the use of this dataset to train models to replace detected PII with fake items.

| PII Type                  |   train |   test |
|:--------------------------|--------:|-------:|
| account_pin               |    1266 |    143 |
| api_key                   |     922 |     91 |
| bank_routing_number       |    1452 |    158 |
| bban                      |    1477 |    166 |
| company                   |   56338 |   6342 |
| credit_card_number        |    1224 |    120 |
| credit_card_security_code |    1275 |    160 |
| customer_id               |    1823 |    195 |
| date                      |   75830 |   8469 |
| date_of_birth             |    2339 |    250 |
| date_time                 |     767 |     89 |
| driver_license_number     |    1269 |    140 |
| email                     |   12914 |   1422 |
| employee_id               |    1696 |    175 |
| first_name                |    2565 |    279 |
| iban                      |    1814 |    203 |
| ipv4                      |    1591 |    164 |
| ipv6                      |    1191 |    134 |
| last_name                 |    1594 |    215 |
| local_latlng              |     802 |     97 |
| name                      |   89642 |  10318 |
| passport_number           |    1426 |    136 |
| password                  |     789 |    101 |
| phone_number              |    8277 |    946 |
| ssn                       |    1313 |    153 |
| street_address            |   37845 |   4307 |
| swift_bic_code            |    1917 |    227 |
| time                      |   15735 |   1746 |
| user_name                 |     906 |     71 |

The use of synthetic data eliminates the risk of exposing real PII while providing a diverse and representative dataset for training and evaluation.

## 🤖 Dataset Generation

Gretel Navigator, an agent-based, compound AI system, was used to generate this synthetic dataset. Navigator utilizes the following LLMs for synthetic data generation:

- `gretelai/Mistral-7B-Instruct-v0.2/industry`: A Gretel fine-tuned LLM trained on data from 10+ different industry verticals, including popular financial data formats.
- `mistralai/Mixtral-8x7B-Instruct-v0.1`: Leveraged for text generation.

The data used to train these LLMs contains no usage restrictions. See the License below for details.

### 🛠️ Generation Steps

1. **Document Generation**: Gretel Navigator generated synthetic financial documents based on the specified document types and descriptions, including PII types.
2. **PII Span Labeling**: The spans (start and end positions) of the PII elements within the generated documents were identified and labeled.
3. **Validation and Additional PII Detection**: The GLiNER NER (Named Entity Recognition) library was used to double-check and validate the labeled PII spans. This step helped identify any additional PII that may have been inadvertently inserted by the LLM during the document generation process.
4. **Human in the Loop**: A provided notebook and visualizations were used to quickly inspect synthetically generated records and add them to the training set. Results were randomly spot-checked, but the process relied largely on AI-based validation and quality judgments.
5. **LLM-as-a-Judge**: An LLM-as-a-Judge based technique was used to rate and filter the synthetic data based on factors described below.

Note: Gretel's LLM, pre-trained on industry documents, was used to generate financial industry-specific documents with synthetic PII values. A NER library labeled the PII, which may introduce false negatives (missed labels) or false positives (incorrect labels). To address this, LLM-as-a-Judge filtered the data, and human reviewers randomly spot-checked it. Some errors may still be present. If you find any, please let us know or submit a pull request to update the labels. Thank you.

## 📋 Dataset Details

### Schema

The dataset includes the following fields:

- `document_type`: The type of document (e.g., Email, Financial Statement, IT support ticket).
- `document_description`: A brief description of the document type.
- `expanded_type`: A more specific subtype of the document.
- `expanded_description`: A detailed description of the expanded document type.
- `language`: The language of the generated text.
- `language_description`: A description of the language variant.
- `generated_text`: The generated document text containing PII.
- `pii_spans`: A list of PII spans, where each span is a JSON string containing the start index, end index, and the type of PII.
- `conformance_score`: A score from 0-100 indicating the conformance of the generated text to the tags and descriptions provided, with 100 being fully conforming and 0 being non-conforming.
- `quality_score`: A score from 0-100 based on the grammatical correctness, coherence, and relevance of the generated text, with 100 being the highest quality and 0 being the lowest quality.
- `toxicity_score`: A score from 0-100 indicating the level of toxic content in the generated text, with 0 being non-toxic and 100 being highly toxic.
- `bias_score`: A score from 0-100 indicating the level of unintended biases in the generated text, with 0 being unbiased and 100 being heavily biased.
- `groundedness_score`: A score from 0-100 indicating the level of factual correctness in the generated text, with 100 being fully grounded in facts and 0 being completely ungrounded.

### Example

```json
{
  "document_type": "FpML",
  "document_description": "A standard for representing data concerning financial derivatives, including trade capture, confirmation, and valuation, often used in electronic trading and risk management.",
  "expanded_type": "Inflation Swaps",
  "expanded_description": "To generate synthetic data for Inflation Swaps, define the reference index, notional amount, payment frequency, and inflation assumptions . Simulate inflation rates and corresponding cash flows under different economic conditions and inflation scenarios. Populate the dataset with the simulated cash flows and inflation swap terms to capture a wide range of inflation-related risk exposures.",
  "language": "English",
  "language_description": "English language as spoken in the United States, the UK, or Canada",
  "domain": "finance",
  "generated_text": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<fx:message xmlns:fx=\"http:\/\/www.fixprotocol.org\/FpML-5-5\"\n xmlns:xsi=\"http:\/\/www.w3.org\/2001\/XMLSchema-instance\"\n xsi:schemaLocation=\"http:\/\/www.fixprotocol.org\/FpML-5-5 http:\/\/www.fixprotocol.org\/fixml\/schema\/FpML-5-5-0.xsd\">\n <header>\n <party id=\"sender\">\n <name>Castillo Ltd<\/name>\n <\/party>\n <party id=\"target\">\n <name>Counterparty Inc.<\/name>\n <\/party>\n <sentDate>2022-05-12<\/sentDate>\n <\/header>\n <body>\n <trade>\n <tradeId>20220512-1001<\/tradeId>\n <product>\n <productType>4<\/productType>\n <swap>\n <inflationSwap>\n <referenceIndex>\n <index>\n <name>Consumer Price Index<\/name>\n <currency>USD<\/currency>\n <\/index>\n <\/referenceIndex>\n <notionalAmount currency=\"USD\">10000000<\/notionalAmount>\n <paymentFrequency>2<\/paymentFrequency>\n <inflationAssumptions>\n <indexTenor>1Y<\/indexTenor>\n <indexTenor>2Y<\/indexTenor>\n <indexTenor>5Y<\/indexTenor>\n <\/inflationAssumptions>\n <\/inflationSwap>\n <\/swap>\n <\/product>\n <partyTradeRole>\n <partyRef id=\"sender\"\/>\n <tradeRole>1<\/tradeRole>\n <\/partyTradeRole>\n <partyTradeRole>\n <partyRef id=\"target\"\/>\n <tradeRole>2<\/tradeRole>\n <\/partyTradeRole>\n ",
  "pii_spans": [
    {"start": 342, "end": 354, "label": "company"},
    {"start": 418, "end": 435, "label": "company"},
    {"start": 474, "end": 484, "label": "date"}
  ],
  "conformance_score": 80.0,
  "quality_score": 95.0,
  "toxicity_score": 0.0,
  "bias_score": 0.0,
  "groundedness_score": 90.0
}
```

### 📝 Dataset Description

This dataset is designed to generate highly realistic synthetic document formats commonly used by banks, financial institutions, and other organizations in the finance space. For this dataset, 100 specific document types were generated, including 20 subtypes per document type, for a total of 2,000 possible document descriptors used to prompt synthetic data generation.

## Distribution of Financial Document Types in Dataset

![Distribution of Financial Document Types](https://cdn-uploads.huggingface.co/production/uploads/632ca8dcdbea00ca213d101a/qyj3eAL5gHzH-u7RZf-NS.png)

### 🔍 Data Quality Assessment

The [LLM-as-a-Judge technique](https://arxiv.org/pdf/2306.05685.pdf) using the Mistral-7B language model was employed to ensure the quality of the synthetic PII and documents in this dataset. Each generated record was evaluated based on the following criteria:

- **Conformance Score**: A score from 0-100 indicating the conformance of the generated text to the provided tags and descriptions, with 100 being fully conforming and 0 being non-conforming.
- **Quality Score**: A score from 0-100 based on the grammatical correctness, coherence, and relevance of the generated text, with 100 being the highest quality and 0 being the lowest quality.
- **Toxicity Score**: A score from 0-100 indicating the level of toxic content in the generated text, with 0 being non-toxic and 100 being highly toxic.
- **Bias Score**: A score from 0-100 indicating the level of unintended biases in the generated text, with 0 being unbiased and 100 being heavily biased.
- **Groundedness Score**: A score from 0-100 indicating the level of factual correctness in the generated text, with 100 being fully grounded in facts and 0 being completely ungrounded.

## LLM-as-a-Judge Results

![LLM-as-a-Judge Results](https://cdn-uploads.huggingface.co/production/uploads/632ca8dcdbea00ca213d101a/YEGI9D_0BqJ_VweNBWlj9.png)

Records with a toxicity score or bias score above 20, or a groundedness score, quality score, or conformance score below 80, were removed from the dataset.
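
The removal rule above can be written as a small predicate. The following is only a sketch of the stated thresholds, not the pipeline Gretel actually ran: the helper name `passes_judge_filter` and the constant names are hypothetical, and the boundary handling (a score of exactly 20 or exactly 80 is kept) is inferred from the wording "above 20" and "below 80".

```python
# Thresholds taken from the removal rule stated in the text.
MAX_RISK_SCORE = 20.0     # toxicity / bias: records scoring above this were removed
MIN_QUALITY_SCORE = 80.0  # groundedness / quality / conformance: records below this were removed

RISK_FIELDS = ("toxicity_score", "bias_score")
QUALITY_FIELDS = ("groundedness_score", "quality_score", "conformance_score")


def passes_judge_filter(record):
    """Hypothetical helper: True if a record would survive the stated thresholds."""
    low_risk = all(record[field] <= MAX_RISK_SCORE for field in RISK_FIELDS)
    high_quality = all(record[field] >= MIN_QUALITY_SCORE for field in QUALITY_FIELDS)
    return low_risk and high_quality


# Scores copied from the example record in the Example section.
example_scores = {
    "conformance_score": 80.0,
    "quality_score": 95.0,
    "toxicity_score": 0.0,
    "bias_score": 0.0,
    "groundedness_score": 90.0,
}

print(passes_judge_filter(example_scores))
```

Because the published dataset has already been filtered, a check like this is mainly useful as a sanity test on downloaded records, or as a starting point for applying stricter thresholds of your own.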

## License

All data in this generated dataset is Apache 2.0 licensed and can be used for any purpose that is not harmful.

## Citation

```
@software{gretel-synthetic-pii-finance-multilingual-2024,
  author = {Watson, Alex and Meyer, Yev and Van Segbroeck, Maarten and Grossman, Matthew and Torbey, Sami and Mlocek, Piotr and Greco, Johnny},
  title = {{Synthetic-PII-Financial-Documents-North-America}: A synthetic dataset for training language models to label and detect PII in domain specific formats},
  month = {June},
  year = {2024},
  url = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual}
}
```
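
As a closing illustration of how the `generated_text` and `pii_spans` fields from the schema might be consumed for de-identification, here is a minimal sketch. Several things in it are assumptions rather than facts from the dataset card: the helper names `decode_spans` and `redact` are hypothetical, the sample record is invented for illustration, spans are treated as end-exclusive character offsets (consistent with the example record, where `end - start` equals the entity length), and spans are assumed not to overlap. Since the schema describes each span as a JSON string while the example shows plain objects, the decoder accepts either form.

```python
import json


def decode_spans(raw):
    """Return span dicts from a JSON string, a list of JSON strings, or a list of dicts."""
    if isinstance(raw, str):
        raw = json.loads(raw)
    return [json.loads(item) if isinstance(item, str) else item for item in raw]


def redact(text, spans):
    """Replace each labeled span with a [label] placeholder.

    Works from right to left so earlier offsets stay valid after each replacement.
    Assumes end-exclusive offsets and non-overlapping spans.
    """
    redacted = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = "[" + span["label"] + "]"
        redacted = redacted[: span["start"]] + placeholder + redacted[span["end"]:]
    return redacted


# Invented record for illustration only; not taken from the dataset.
record = {
    "generated_text": "Invoice from Castillo Ltd dated 2022-05-12.",
    "pii_spans": [
        '{"start": 13, "end": 25, "label": "company"}',
        '{"start": 32, "end": 42, "label": "date"}',
    ],
}

spans = decode_spans(record["pii_spans"])
for span in spans:
    # Slice the labeled entity out of the document text.
    print(span["label"], record["generated_text"][span["start"]:span["end"]])

print(redact(record["generated_text"], spans))
```

Because the tag names are aligned with `Faker` generator names where possible, the bracketed placeholder could be swapped for a generated fake value of the same type; that step is left out here to keep the sketch dependency-free.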


Provider:

maas

Created:

2025-05-20
