DataTalksClub Profile
DataTalksClub

@DataTalksClub

Followers: 10K · Following: 1K · Media: 633 · Statuses: 2K

The place to talk about data. Do you want to talk about data science, machine learning, and engineering? Join our Slack community and attend weekly events!

Joined October 2020
@DataTalksClub
DataTalksClub
2 years
Want to talk about data? Data science, career questions, practical machine learning, and more? Join @DataTalksClub! The link is in the profile description. Learn more about our activities in the thread! 🧵
@DataTalksClub
DataTalksClub
5 days
Our next live podcast episode is right around the corner! Join us on July 7, 2025, as we chat with Lior Barak about Mindful Data Strategy: From Pipelines to Business Impact. You won't want to miss this one! Register here:
@DataTalksClub
DataTalksClub
6 days
Firebolt Core: A Free Self-Hosted Distributed Query Engine. @FireboltHQ just announced Firebolt Core, the exact same distributed, vectorized query engine behind Firebolt Cloud, now available as a forever-free, fully self-hosted edition. Read the launch post:
@DataTalksClub
DataTalksClub
8 days
We've just begun a workshop called "From REST to reasoning: ingest, index, and query with dltHub and Cognee"! Join us live right here:
@DataTalksClub
DataTalksClub
13 days
Are you ready for our next live podcast episode? We're thrilled to have Orell Garten on the show to talk about From Simulation Algorithms to Production-Grade Data Systems. Join us on June 30 to learn more! Register here:
@DataTalksClub
DataTalksClub
13 days
Orchestrating LLM workflows with the Airflow AI SDK. You don't need complex agent architectures to build production-ready AI applications. Join @JulianLaneve, creator of the Airflow AI SDK, on June 26 for a live session to explore how it works. 👉
@DataTalksClub
DataTalksClub
13 days
Without a dedicated ingestion layer, your RAG system will drift, return noisy or broken results, and become hard to maintain. Join our free workshop on June 30 at 4:30 PM CET to build an end-to-end open-source ingestion pipeline for RAG. Register here:
@DataTalksClub
DataTalksClub
13 days
8. Auditability & traceability: Use immutable logs, schema contracts, and metadata to trace every retrieved passage back to its source.
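As a concrete illustration of that point, here is a minimal sketch using only the Python standard library: each chunk carries a content hash and source metadata, and every ingestion event is appended to an immutable (append-only) JSONL audit log. The field names and log path are assumptions for illustration, not part of the original thread.

```python
import hashlib
import json
import time
from pathlib import Path

AUDIT_LOG = Path("ingestion_audit.jsonl")  # append-only audit trail (illustrative path)

def make_chunk_record(text: str, source_uri: str, schema_version: str = "v1") -> dict:
    """Attach provenance metadata so a retrieved passage can be traced to its source."""
    return {
        "chunk_id": hashlib.sha256(text.encode("utf-8")).hexdigest()[:16],
        "source_uri": source_uri,
        "schema_version": schema_version,
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "text": text,
    }

def log_ingestion(record: dict) -> None:
    """Append-only logging: never rewrite existing lines, only add new ones."""
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps({k: v for k, v in record.items() if k != "text"}) + "\n")

chunk = make_chunk_record("Example passage.", "https://example.com/doc.pdf#page=3")
log_ingestion(chunk)
```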
@DataTalksClub
DataTalksClub
13 days
7. Orchestration & resilience: Create retry-safe, parallel workflows to manage network issues, rate limits, or large file loads.
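A retry-safe, parallel fetch can be sketched with just the standard library: exponential backoff around each request plus a thread pool. The URLs, retry count, and backoff values below are placeholder assumptions.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_with_retry(url: str, attempts: int = 3, backoff: float = 1.0) -> bytes:
    """Retry transient failures (timeouts, rate limits) with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff * 2 ** (attempt - 1))  # wait 1s, then 2s, before retrying

urls = ["https://example.com/a.json", "https://example.com/b.json"]  # placeholder sources
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fetch_with_retry, u): u for u in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            payload = fut.result()
            print(url, len(payload), "bytes")
        except Exception as exc:
            print(url, "failed after retries:", exc)
```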
@DataTalksClub
DataTalksClub
13 days
6. Chunking strategies: Use semantic or paragraph-aware splitting to maintain coherence in each embedding, rather than fixed-length slices.
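For example, a paragraph-aware splitter can pack whole paragraphs into chunks up to a size budget instead of slicing every N characters. A minimal sketch, assuming plain-text input with blank-line paragraph breaks; the character budgets are arbitrary.

```python
def paragraph_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Group whole paragraphs into chunks instead of cutting mid-sentence."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)      # current chunk is full: start a new one
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph about pipelines.\n\nSecond paragraph about embeddings.\n\nThird one."
print(paragraph_chunks(doc, max_chars=60))
```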
@DataTalksClub
DataTalksClub
13 days
5. Incremental updates: Track watermarks or change feeds to process only new or modified records without full reloads.
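One simple watermark pattern: persist the latest updated_at timestamp and keep only records newer than it on the next run. The state file and field name are assumptions; ISO-8601 UTC strings compare correctly as plain strings.

```python
import json
from pathlib import Path

STATE_FILE = Path("watermark.json")  # illustrative location for pipeline state

def load_watermark(default: str = "1970-01-01T00:00:00Z") -> str:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["updated_at"]
    return default

def save_watermark(value: str) -> None:
    STATE_FILE.write_text(json.dumps({"updated_at": value}))

def incremental_load(records: list[dict]) -> list[dict]:
    """Keep only records modified since the last run, then advance the watermark."""
    watermark = load_watermark()
    new_records = [r for r in records if r["updated_at"] > watermark]
    if new_records:
        save_watermark(max(r["updated_at"] for r in new_records))
    return new_records

batch = [
    {"id": 1, "updated_at": "2025-06-01T10:00:00Z"},
    {"id": 2, "updated_at": "2025-06-20T09:30:00Z"},
]
print(incremental_load(batch))  # first run returns both; a rerun returns nothing new
```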
@DataTalksClub
DataTalksClub
13 days
4. Schema evolution & versioning: Detect API or schema changes automatically and version each load to prevent index corruption.
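A cheap way to detect schema drift is to fingerprint the sorted field names of incoming records and bump a load version whenever the fingerprint changes. A sketch; the state dictionary here is a stand-in for whatever your pipeline actually stores.

```python
import hashlib

def schema_fingerprint(record: dict) -> str:
    """Hash the sorted field names so any added, removed, or renamed field changes the value."""
    return hashlib.sha256(",".join(sorted(record)).encode()).hexdigest()[:12]

def detect_schema_change(record: dict, state: dict) -> dict:
    fp = schema_fingerprint(record)
    if state.get("fingerprint") != fp:
        state = {"fingerprint": fp, "load_version": state.get("load_version", 0) + 1}
        print(f"schema changed, new load_version={state['load_version']}")
    return state

state = {}
state = detect_schema_change({"id": 1, "name": "a"}, state)               # first load -> v1
state = detect_schema_change({"id": 2, "name": "b"}, state)               # same schema, no bump
state = detect_schema_change({"id": 3, "name": "c", "email": ""}, state)  # new field -> v2
```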
@DataTalksClub
DataTalksClub
13 days
3. Text-specific preparation: Clean noise, fix encoding, split long documents into meaningful chunks, and attach metadata to each piece.
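A minimal cleaning step with the standard library: normalize Unicode, drop control characters, collapse whitespace, and attach source metadata. Function names are illustrative; a real pipeline would also strip HTML boilerplate and apply the chunking shown above.

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Fix encoding oddities and strip layout noise before chunking and embedding."""
    text = unicodedata.normalize("NFKC", raw)   # normalize odd Unicode forms (e.g. NBSP)
    text = text.replace("\x00", "")             # drop stray control characters
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # squeeze excessive blank lines
    return text.strip()

def prepare_document(raw: str, source: str) -> dict:
    return {"source": source, "text": clean_text(raw)}

print(prepare_document("Data\u00a0ingestion\t\tmatters.\n\n\n\nReally.", "notes.txt"))
```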
@DataTalksClub
DataTalksClub
13 days
2. Heterogeneous sources: Normalize JSON APIs, CSVs, PDFs, and HTML into a consistent format before embedding.
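A sketch of funneling JSON, CSV, and HTML payloads into one common record shape before embedding. PDF handling is omitted because it needs a third-party parser, and the target {source, text} shape is an assumption.

```python
import csv
import io
import json
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the visible text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def normalize(source_type: str, payload: str, source: str) -> list[dict]:
    """Turn JSON, CSV, or HTML payloads into a common {source, text} record."""
    if source_type == "json":
        return [{"source": source, "text": json.dumps(r)} for r in json.loads(payload)]
    if source_type == "csv":
        reader = csv.DictReader(io.StringIO(payload))
        return [{"source": source, "text": json.dumps(r)} for r in reader]
    if source_type == "html":
        extractor = _TextExtractor()
        extractor.feed(payload)
        return [{"source": source, "text": " ".join(extractor.parts).strip()}]
    raise ValueError(f"unsupported source type: {source_type}")

print(normalize("csv", "id,name\n1,Alice\n", "users.csv"))
print(normalize("html", "<p>Hello <b>world</b></p>", "page.html"))
```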
@DataTalksClub
DataTalksClub
13 days
1. Freshness & accuracy: Regularly pull new documents, API data, or logs so your LLM always has up-to-date context.
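A bare-bones freshness loop: poll an API on a fixed interval and hand the results to the rest of the pipeline. The endpoint and interval are placeholders, and in practice a scheduler or orchestrator would replace the sleep loop.

```python
import json
import time
import urllib.request

API_URL = "https://example.com/api/documents"  # placeholder endpoint
POLL_INTERVAL_SECONDS = 15 * 60                # placeholder schedule

def pull_latest() -> list[dict]:
    """Fetch the current documents so downstream embeddings stay up to date."""
    with urllib.request.urlopen(API_URL, timeout=10) as resp:
        return json.loads(resp.read())

def run_forever():
    while True:
        try:
            docs = pull_latest()
            print(f"pulled {len(docs)} documents")  # hand off to cleaning/chunking here
        except Exception as exc:
            print("pull failed, will retry next cycle:", exc)
        time.sleep(POLL_INTERVAL_SECONDS)

# run_forever()  # left commented out so importing this sketch has no side effects
```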
@DataTalksClub
DataTalksClub
13 days
Why does RAG need specialized data ingestion? 🤔 8 reasons:
@DataTalksClub
DataTalksClub
15 days
On June 30, in our workshop, we'll show you how to build a data ingestion pipeline for RAG applications using open-source tools. Register now: 4/4
@DataTalksClub
DataTalksClub
15 days
Why it matters:
🔹 Fresh data ensures accurate analytics and ML models.
🔹 Automate ETL tasks.
🔹 Validate weak formats (CSV/JSON) for consistency.
🔹 Gain traceability and audit trails for compliance.
🔹 Scale with data growth; avoid manual bottlenecks.
3/4
@DataTalksClub
DataTalksClub
15 days
Data ingestion is the process of:
1. Extracting raw data from sources (APIs, logs, databases).
2. Transporting it into a staging area (data lake, warehouse, message queue).
3. Preparing it for use by normalizing formats, cleaning values, and enriching with metadata.
2/4
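A toy version of those three steps in plain Python: extract from an API, land the raw payload in a staging directory, then normalize and enrich it. Paths, field names, and the example endpoint are assumptions for illustration.

```python
import json
import time
import urllib.request
from pathlib import Path

STAGING_DIR = Path("staging")  # stand-in for a data lake / warehouse landing zone

def extract(url: str) -> bytes:
    """1. Extract raw data from a source (here, a JSON API)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def stage(raw: bytes) -> Path:
    """2. Transport the untouched payload into a staging area."""
    STAGING_DIR.mkdir(exist_ok=True)
    path = STAGING_DIR / f"batch_{int(time.time())}.json"
    path.write_bytes(raw)
    return path

def prepare(path: Path) -> list[dict]:
    """3. Normalize formats, clean values, and enrich with metadata."""
    records = json.loads(path.read_text())
    return [
        {**{k: (v.strip() if isinstance(v, str) else v) for k, v in r.items()},
         "_ingested_from": path.name}
        for r in records
    ]

# prepared = prepare(stage(extract("https://example.com/api/records")))  # placeholder endpoint
```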
@DataTalksClub
DataTalksClub
15 days
What is data ingestion, and why do you need it? 1/4
@DataTalksClub
DataTalksClub
16 days
Find a complete implementation of these steps along with the code in our free LLM Zoomcamp module: