Synthetic Data Vault

@sdv_dev

Followers
383
Following
50
Media
77
Statuses
260

Join our growing ecosystem of #opensource libraries & resources for generating #SyntheticData for different data modalities. Created at @lab_dai, MIT.

Cambridge, MA
Joined September 2020
@sdv_dev
Synthetic Data Vault
2 years
Today, DataCebo launched SDV Enterprise & raised $8.5M in VC. SDV Enterprise is a commercial model of the source-available Synthetic Data Vault (SDV). It makes it easy to develop, manage & deploy #generativeAI models for apps when real data is limited. https://t.co/8DqHCxPDMv
0
1
4
@akshay_pachaar
Akshay 🚀
6 months
Generate synthetic data at scale! SDV is an open-source Python library that generates tabular synthetic data by using ML algorithms to learn and replicate patterns from your real data. Here's how it works in 3 steps: 1️⃣ Train: Point SDV at your real table; it will capture the
24
126
531
@sdv_dev
Synthetic Data Vault
8 months
Generating synthetic data that maintains realistic relationships between columns is crucial for testing and analysis. Traditional random generation approaches often create unrealistic patterns, like luxury hotel rooms priced cheaper than basic rooms. GaussianCopulaSynthesizer
0
0
3
@sdv_dev
Synthetic Data Vault
8 months
Many businesses collect and store their customers’ GPS locations to help improve their products. But GPS locations may contain precise locations of people’s homes. Businesses are sensitive to sharing this data even to internal teams, as it may reveal private information about
0
0
3
@sdv_dev
Synthetic Data Vault
8 months
Synthetic tabular data can help you test software applications because it resembles the key properties and patterns in your real data. Consider a news publication that wants to use synthetic data to test a new software change for their mobile application before it rolls out to
0
0
2
@sdv_dev
Synthetic Data Vault
8 months
One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let’s explore an example of one such rule. The one-to-many relationship is a common pattern in database schemas. An
0
0
0
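The one-to-many rule mentioned in the post above can be illustrated with a minimal sketch. This is plain Python, not SDV's API, and the table and column names (`users`, `sessions`, `user_id`) are hypothetical. It only checks one property a valid synthetic dataset would need: every child row references an existing parent row.

```python
# Hypothetical parent and child tables; all names are illustrative only.
users = [{"user_id": 1}, {"user_id": 2}]
sessions = [
    {"session_id": 10, "user_id": 1},
    {"session_id": 11, "user_id": 1},  # one user, many sessions
    {"session_id": 12, "user_id": 3},  # orphan: there is no user 3
]


def find_orphans(parents, children, key):
    """Return child rows whose foreign key matches no parent row."""
    parent_keys = {row[key] for row in parents}
    return [row for row in children if row[key] not in parent_keys]


orphans = find_orphans(users, sessions, "user_id")
print(orphans)
```

An empty result means the one-to-many relationship holds for the data checked; any row returned would break a JOIN between the two tables.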
@sdv_dev
Synthetic Data Vault
8 months
โœˆ๏ธ @Expedia recently shared a very interesting methodology on how they collect and use synthetic data to improve their flight price forecasting models. When a user makes a flight search, Expedia retrieves the latest pricing data from their data providers for the specified search
0
0
2
@sdv_dev
Synthetic Data Vault
8 months
One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let’s explore an example of one such rule. Some applications need to store numerical data with different units of
0
0
1
@sdv_dev
Synthetic Data Vault
8 months
Today, we’re excited to introduce a powerful new bundle to The Synthetic Data Vault: AI connectors. AI connectors address 2 key challenges that SDV users face when training generative AI models on datasets from enterprise data stores. (Link to the announcement:
0
0
1
@sdv_dev
Synthetic Data Vault
9 months
SDV Enterprise v0.23.0 is out 🎉 This release enhances your ability to program your synthesizer to find certain patterns and recreate them, whether it's through multi-table CAG patterns, single-table constraints, or pre-processing techniques that transform your data. 🏆
0
0
1
@sdv_dev
Synthetic Data Vault
9 months
Synthetic data is a powerful way to generate test data that looks and feels like real production data. You can either insert the synthetic data back into the database in an environment for manual testing or use the data for running automated tests. But if you need to test a new
0
0
3
@sdv_dev
Synthetic Data Vault
9 months
Last week, we shared a synthetic population dataset for the United States, but this week we’re sharing one published by researchers for the whole world. 🌍 Marijn Ton et al. released a gigantic synthetic population dataset that represents ~7.33 billion humans,
0
1
2
@sdv_dev
Synthetic Data Vault
9 months
Some multi-table datasets have interesting data patterns, like mirroring 1 or more columns in a child table from its parent table. This design pattern helps the database user avoid the need to run a time-consuming or expensive JOIN query, especially if one of the tables is
0
0
1
@sdv_dev
Synthetic Data Vault
9 months
James Rineer et al. just released a new dataset containing millions of #syntheticdata records about households and individuals in the US. Using publicly available census data from the U.S. Census Bureau, they generated: 🏘️ 120,754,708 synthetic households 👥 303,128,287 synthetic
0
1
3
@sdv_dev
Synthetic Data Vault
9 months
In 2024, synthetic data routinely made headlines alongside many AI product launches. Here are our predictions for 2025 🔮 1. The rise of generative AI will result in a number of LLM-based
0
2
3
@sdv_dev
Synthetic Data Vault
9 months
If you want to use AI-generated synthetic data in place of your sensitive real data, then you need to be confident that the synthetic data adheres to the same business rules. For example, imagine that you’re an online retailer that
0
0
1
@sdv_dev
Synthetic Data Vault
9 months
An easy way to improve the quality of the synthetic data that the SDV generates is to accurately define each column’s sdtype. Sdtypes are a key part of the SDV’s Metadata model, which lets you, the expert of the data, provide additional context for the SDV to incorporate. For
0
0
3
@sdv_dev
Synthetic Data Vault
9 months
Many real-world classification datasets have severe class imbalance. For example, imagine a fraud dataset where 99.9% of the rows are labelled non-fraudulent and only 0.1% are labelled fraudulent. By incorporating synthetic data in your training data, you can achieve a more
0
0
2
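The class-imbalance arithmetic behind the post above can be sketched in a few lines. The row counts below (999,000 non-fraudulent and 1,000 fraudulent rows) and the helper name `synthetic_rows_needed` are hypothetical, chosen only for illustration. The function computes how many synthetic minority rows would have to be added for the minority class to reach a target share of the training data.

```python
from fractions import Fraction
from math import ceil


def synthetic_rows_needed(majority, minority, target_share):
    """Synthetic minority rows to add so the minority reaches target_share.

    target_share must be below 1; pass a Fraction to keep the arithmetic exact.
    """
    target = Fraction(target_share)
    total = majority + minority
    # Solve (minority + s) / (total + s) >= target for the smallest integer s.
    needed = (target * total - minority) / (1 - target)
    return max(0, ceil(needed))


# Hypothetical dataset: 1,000 fraudulent rows out of 1,000,000.
print(synthetic_rows_needed(999_000, 1_000, Fraction(1, 2)))  # 998000
```

Using `Fraction` instead of floats avoids rounding the result up by one row when the target share is a value such as one tenth.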