Synthetic Data Generator
Source: Snowflake. Updated 2024-05-30. Indexed 2024-06-01.
Download link:
https://app.snowflake.com/marketplace/listing/GZT1Z1MSVYE
Resource description:
# Synthetic Data Generator (Beta) by DataMynd.io
This native application is designed for users who want a quick way to generate synthetic data that is based on one or multiple tables in a real production schema. The application works by training one of several different ML models on the real data before using the trained model to generate synthetic data.

The application performs all logic within the native application on the consumer's Snowflake account. No external API calls are made at any point, either during installation or use. No third-party services are used at any point by the application. No data or metadata are shared outside of the consumer's account.
## Privileges
The application requires the following grants to access the targeted data and train the model.
- GRANT USAGE ON DATABASE
- GRANT USAGE ON SCHEMA
- GRANT REFERENCES, SELECT ON TABLE
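
To make the three grants above concrete, the following is a minimal sketch that only builds the statements as strings for one schema and a list of tables. The `TO APPLICATION <name>` grantee clause is assumed here as the usual form for granting privileges to a Snowflake native application, and the helper name and every object name (`SYNTH_DATA_APP`, `SALES_DB`, `PUBLIC`, `CUSTOMERS`, `ORDERS`) are hypothetical placeholders, not values taken from the listing.

```python
def build_grant_statements(app_name, database, schema, tables):
    """Build the GRANT statements for one schema and its tables.

    Hypothetical helper: it only formats strings and never connects
    to Snowflake. The TO APPLICATION clause is an assumption about
    how the grantee is named for a native app.
    """
    statements = [
        f"GRANT USAGE ON DATABASE {database} TO APPLICATION {app_name};",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO APPLICATION {app_name};",
    ]
    for table in tables:
        # One statement per table the models should be trained on.
        statements.append(
            f"GRANT REFERENCES, SELECT ON TABLE {database}.{schema}.{table} "
            f"TO APPLICATION {app_name};"
        )
    return statements


if __name__ == "__main__":
    # Placeholder names for illustration only.
    for stmt in build_grant_statements(
        "SYNTH_DATA_APP", "SALES_DB", "PUBLIC", ["CUSTOMERS", "ORDERS"]
    ):
        print(stmt)
```

The description states that the app itself prompts for access when a project is created, so statements like these would only be relevant if the grants were issued by hand.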
## Workflow
1. [Start Page] Create a new project or resume an existing project. The first page of the app contains a list of all open projects and high-level stats on each, like the schema/tables/chosen learning model and whether the model has been trained successfully. Clicking on a project will give the user options to resume or delete the project. Note: deleting the project will also drop any synthetic data generated using that project. Please copy the data before doing this.
2. [Select Page] The user is prompted to select the real data for the project using the drop-down lists on this page. Note: when creating the project in the previous step, the user is prompted to grant the application access to the desired tables. If access was not granted, a warning is shown and the user cannot proceed past this step.
3. [Select Page] On the same page where the real data is selected, the user must also select the learning model. Models vary in accuracy and speed depending on the use case and on the size, complexity, and shape of the data. GaussianCopula uses purely statistical methods and is the fastest to train but the least accurate. CTGAN and TVAE use neural networks and deep learning to profile the real data and can produce highly accurate synthetic data. CopulaGAN combines statistical and deep learning methods. The neural-network-based models take longer to train but will likely produce more accurate, realistic results.
4. [Configure Page] The user must select a table, then configure several options for each column in it. If a table has a primary key, it should be selected using the P-Key checkbox (one per table), and "ID" must be selected as its type. For all other columns, the type should match that of the original data. Categorical should be selected if the field contains discrete values (text values like 'red' or 'green', or distinct numerical values like radio station frequencies that do not make sense to plot as a numerical distribution). Otherwise, one of the other types should be selected, corresponding to the datatype of the original.
5. [Configure Page] The user has the option to select 'anonymize' for any categorical-type column. Model training then skips that field, and artificial values are created for it using the Faker library instead. If anonymize is selected, the user must also select the type of output that should replace the field values for that column. For example, selecting 'name' as the type (person category) will randomly generate full names to populate the field. Some selections allow extra parameters for fine-tuning the generated values; if so, an additional prompt will be displayed. See [Faker Documentation](https://faker.readthedocs.io/en/stable/providers.html) for details on each type and its extra parameters ('provider' in the Faker docs corresponds to 'type').
6. [Configure Page] Don't forget to click save before moving on to the next page. The user can always come back and update these parameters.
7. [Train Page] On the next page, the user can fine-tune the training parameters for the selected model. Each model has a different set of parameters (for example, the neural-network-based models have an option for the number of training epochs). The defaults are usually fine to start with; the user may want to come back and adjust them if the results don't look right in the last step. Once ready, click 'Train Model'. This triggers a training task that runs in the background. Training may take only a few seconds (for example, small data with a GaussianCopula model) or a few hours for some of the neural-network-based models on complex data. The refresh button updates the status displayed; alternatively, the status will be reflected on the [Start Page] the next time the app is loaded. If the training succeeds, a notification will be displayed here. Click Next to continue.
8. [Generate Page] Select the number of rows to generate and click 'Generate Data'. This may take a few moments but is much faster than training. Once the synthetic data has been generated, a notification will be displayed. At this point the new data has been written to a new schema (also indicated). It has not been shared with any other users, however. The user may share the data at this point or click Next to explore the data.
9. [Explore Page] This page is intended to show the user how well the new synthetic data matches the original data. The layout option on the left will allow the user to select what types of exploration options to view. The table view simply displays the new synthetic data and is interactive with the filters. To use the filters, open the expander on the left and select at least one filter to add. Select values for the resulting filter display to affect the table view. The histogram view does not currently react to the filters (as of the first release of the app), and generates a histogram for each field for the real data (green line) and the synthetic data (blue line). The grey bars indicate the variance % between histograms.
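
The type choice described in step 4 can be sketched as a simple rule of thumb. This is an illustrative heuristic, not the app's own logic: the function name, the `max_categories` threshold, and the type labels other than "id" and "categorical" ("numerical", "datetime", "boolean") are assumptions made for this example.

```python
from datetime import date, datetime


def suggest_column_type(values, is_primary_key=False, max_categories=20):
    """Suggest a column type from sample values (hypothetical heuristic).

    Primary keys map to "id". Text and repeated low-cardinality numbers
    map to "categorical". The threshold is an arbitrary assumption.
    """
    if is_primary_key:
        return "id"
    present = [v for v in values if v is not None]
    if not present:
        return "categorical"
    distinct = set(present)
    # bool is a subclass of int, so it has to be checked first.
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    if all(isinstance(v, (date, datetime)) for v in present):
        return "datetime"
    if all(isinstance(v, (int, float)) for v in present):
        # A handful of repeated numbers (e.g. radio station frequencies)
        # behaves like labels rather than a distribution.
        if len(distinct) <= max_categories and len(distinct) < len(present):
            return "categorical"
        return "numerical"
    return "categorical"


if __name__ == "__main__":
    print(suggest_column_type(["red", "green", "red"]))
    print(suggest_column_type([88.5, 101.1, 88.5, 101.1]))
    print(suggest_column_type([x * 0.5 for x in range(100)]))
```

A rule like this only gives a starting point; the final choice on the Configure Page should still follow the meaning of each column.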
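
The histogram comparison in step 9 can be approximated outside the app with a few lines of plain Python. The listing does not define how "variance %" is computed, so this sketch assumes it is the absolute difference, in percentage points, between the share of real and synthetic rows in each bin; the function names and the bin count are likewise hypothetical.

```python
def histogram(values, lo, hi, bins):
    """Return the share of values in each of `bins` equal-width bins.

    Assumes `values` is non-empty and every value lies in [lo, hi].
    """
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if width > 0:
            # The top edge (v == hi) is clamped into the last bin.
            idx = min(int((v - lo) / width), bins - 1)
        else:
            idx = 0  # all values identical
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def compare_histograms(real, synthetic, bins=10):
    """Histogram both columns on shared bins and report per-bin gaps."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    real_hist = histogram(real, lo, hi, bins)
    synth_hist = histogram(synthetic, lo, hi, bins)
    # Assumed stand-in for the app's "variance %": the gap between
    # the two bin shares, expressed in percentage points.
    diff_pct = [abs(r - s) * 100 for r, s in zip(real_hist, synth_hist)]
    return real_hist, synth_hist, diff_pct


if __name__ == "__main__":
    real_col = [0, 0, 10, 10]
    synth_col = [0, 10, 10, 10]
    print(compare_histograms(real_col, synth_col, bins=2))
```

Using shared bin edges for both columns keeps the per-bin comparison meaningful; histograms binned separately would not line up.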
Provider:
DataMynd
Created:
2024-05-28



