Synthetic Data Vault @sdv_dev X Profile

Synthetic Data Vault

@sdv_dev

Followers

383

Following

50

Media

77

Statuses

260

Join our growing ecosystem of #opensource libraries & resources for generating #SyntheticData for different data modalities. Created at @lab_dai, MIT.

https://t.co/y5lEsgBhIC

Cambridge, MA

Joined September 2020

Synthetic Data Vault

@sdv_dev

2 years

Today, DataCebo launched SDV Enterprise & raised $8.5M in VC. SDV Enterprise is a commercial model of the source-available Synthetic Data Vault (SDV). It makes it easy to develop, manage & deploy #generativeAI models for apps when real data is limited. https://t.co/8DqHCxPDMv

0

1

4

Akshay 🚀

@akshay_pachaar

6 months

Generate synthetic data at scale! SDV is an open-source Python library that generates tabular synthetic data by using ML algorithms to learn and replicate patterns from your real data. Here's how it works in 3 steps: 1️⃣ Train: Point SDV at your real table; it will capture the

24

126

531

Synthetic Data Vault

@sdv_dev

8 months

Generating synthetic data that maintains realistic relationships between columns is crucial for testing and analysis. Traditional random generation approaches often create unrealistic patterns, like luxury hotel rooms priced cheaper than basic rooms. GaussianCopulaSynthesizer

0

3
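The idea behind a Gaussian copula can be illustrated with a short, standard-library-only sketch. This is not SDV's GaussianCopulaSynthesizer implementation; it is a simplified two-column illustration, and the hotel columns (room area and nightly price), the helper names, and the toy data are all hypothetical. It learns each column's ranks and the correlation between them, then samples new rows that keep that relationship.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(values):
    """Map each value to a standard-normal score based on its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def empirical_quantile(sorted_values, u):
    """Return the observed value at quantile u (0 <= u <= 1)."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def sample_pairs(col_a, col_b, num_rows, rng):
    """Sample (a, b) rows whose dependence mimics the two real columns."""
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    residual = math.sqrt(max(0.0, 1 - rho ** 2))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(num_rows):
        # Draw two standard normals with correlation rho.
        z_a = rng.gauss(0, 1)
        z_b = rho * z_a + residual * rng.gauss(0, 1)
        # Map each normal draw back onto the observed column values.
        rows.append((
            empirical_quantile(sorted_a, STD_NORMAL.cdf(z_a)),
            empirical_quantile(sorted_b, STD_NORMAL.cdf(z_b)),
        ))
    return rows


# Hypothetical "real" data: larger rooms tend to cost more.
rng = random.Random(0)
areas = [rng.uniform(20, 80) for _ in range(500)]
prices = [3 * area + rng.gauss(0, 10) for area in areas]

synthetic = sample_pairs(areas, prices, 500, rng)
```

Sampling each column independently would scatter prices across room sizes; here the sampled rows keep large rooms paired with high prices because the correlation is learned and reused.
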

Synthetic Data Vault

@sdv_dev

8 months

Many businesses collect and store their customers’ GPS locations to help improve their products. But GPS locations may contain precise locations of people’s homes. Businesses are cautious about sharing this data, even with internal teams, as it may reveal private information about

0

3

Synthetic Data Vault

@sdv_dev

8 months

Synthetic tabular data can help you test software applications because it resembles the key properties and patterns in your real data. Consider a news publication that wants to use synthetic data to test a new software change for their mobile application before it rolls out to

0

2

Synthetic Data Vault

@sdv_dev

8 months

One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let’s explore an example of one such rule. The one-to-many relationship is a common pattern in database schemas. An

0

Synthetic Data Vault

@sdv_dev

8 months

✈️ @Expedia recently shared a very interesting methodology on how they collect and use synthetic data to improve their flight price forecasting models. When a user makes a flight search, Expedia retrieves the latest pricing data from their data providers for the specified search

0

2

Synthetic Data Vault

@sdv_dev

8 months

One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let’s explore an example of one such rule. Some applications need to store numerical data with different units of

0

1

Synthetic Data Vault

@sdv_dev

8 months

Today, we’re excited to introduce a powerful new bundle to The Synthetic Data Vault: AI connectors. AI connectors address 2 key challenges that SDV users face when training generative AI models on datasets from enterprise data stores. (Link to the announcement:

0

1

Synthetic Data Vault

@sdv_dev

9 months

SDV Enterprise v0.23.0 is out 🎉 This release enhances your ability to program your synthesizer to find certain patterns and recreate them, whether it's through multi-table CAG patterns, single-table constraints, or pre-processing techniques that transform your data. 🏆

0

1

Synthetic Data Vault

@sdv_dev

9 months

Synthetic data is a powerful way to generate test data that looks and feels like real production data. You can either insert the synthetic data back into the database in an environment for manual testing or use the data for running automated tests. But if you need to test a new

0

3

Synthetic Data Vault

@sdv_dev

9 months

Last week, we shared a synthetic populations dataset for the United States, but this week we’re sharing one published by researchers for the whole world. 🌏 Marijin Ton et al released a gigantic synthetic population dataset that represents ~𝟳.𝟯𝟯 𝗯𝗶𝗹𝗹𝗶𝗼𝗻 𝗵𝘂𝗺𝗮𝗻𝘀,

0

1

2

Synthetic Data Vault

@sdv_dev

9 months

Some multi-table datasets have interesting data patterns, like mirroring 1 or more columns in a child table from its parent table. This design pattern helps the database user avoid the need to run a time-consuming or expensive JOIN query, especially if one of the tables is

0

1

Synthetic Data Vault

@sdv_dev

9 months

James Rineer et al just released a new dataset containing millions of #syntheticdata about households and individuals in the US. Using publicly available census data from the U.S. Census Bureau, they generated: 🏘️ 120,754,708 synthetic households 👥 303,128,287 synthetic

0

1

3

Synthetic Data Vault

@sdv_dev

9 months

In 2024, synthetic data routinely made headlines alongside many AI product launches. 𝗛𝗲𝗿𝗲 𝗮𝗿𝗲 𝗼𝘂𝗿 𝗽𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻𝘀 𝗳𝗼𝗿 𝟮𝟬𝟮𝟱 🔮 𝟭. 𝗧𝗵𝗲 𝗿𝗶𝘀𝗲 𝗼𝗳 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝘃𝗲 𝗔𝗜 𝘄𝗶𝗹𝗹 𝗿𝗲𝘀𝘂𝗹𝘁 𝗶𝗻 𝗮 𝗻𝘂𝗺𝗯𝗲𝗿 𝗼𝗳 𝗟𝗟𝗠-𝗯𝗮𝘀𝗲𝗱

0

2

3

Synthetic Data Vault

@sdv_dev

9 months

If you want to use AI-generated synthetic data in place of your sensitive real data, then you need to be confident that the 𝐬𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐝𝐚𝐭𝐚 𝐚𝐝𝐡𝐞𝐫𝐞𝐬 𝐭𝐨 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞 𝐛𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐫𝐮𝐥𝐞𝐬. For example, imagine that you’re an online retailer that

0

1

Synthetic Data Vault

@sdv_dev

9 months

An easy way to improve the quality of the synthetic data that the SDV generates is to accurately define each column’s sdtype. Sdtypes are a key part of the SDV’s Metadata model, which lets you, the expert of the data, provide additional context for the SDV to incorporate. For

0

3

Synthetic Data Vault

@sdv_dev

9 months

Many real-world classification datasets have severe class imbalance. For example, imagine a fraud dataset where 99.9% of the rows are labelled non-fraudulent and only 0.1% are labelled fraudulent. By incorporating synthetic data in your training data, you can achieve a more

0

2
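The arithmetic behind rebalancing a dataset like the fraud example can be sketched as follows. This is only an illustration under stated assumptions: the function name and the 100,000-row dataset are hypothetical, and it only counts how many synthetic minority rows would be needed, without generating them. It solves (minority + x) / (total + x) = target for x.

```python
import math
from fractions import Fraction


def synthetic_minority_rows_needed(n_majority, n_minority, target_share):
    """Count the minority rows to add so the minority class reaches target_share.

    target_share is the desired minority fraction of the final dataset,
    strictly between 0 and 1 (for example 0.5 for a 50/50 split).
    """
    # Exact rational arithmetic avoids float rounding when taking the ceiling.
    t = Fraction(str(target_share))
    if not 0 < t < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    total = n_majority + n_minority
    needed = (t * total - n_minority) / (1 - t)
    return max(0, math.ceil(needed))


# Hypothetical dataset: 100,000 rows, of which 100 are fraudulent.
rows_for_even_split = synthetic_minority_rows_needed(99_900, 100, 0.5)
rows_for_20_percent = synthetic_minority_rows_needed(99_900, 100, 0.2)
```

With these assumed counts, an even split needs 99,800 synthetic fraud rows, while a 20% fraud share needs 24,875.
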