DataTalksClub Profile
DataTalksClub

@DataTalksClub

Followers: 10K · Following: 1K · Media: 633 · Statuses: 2K

The place to talk about data. Do you want to talk about data science, machine learning, and engineering? Join our Slack community and attend weekly events!

Joined October 2020
@DataTalksClub
DataTalksClub
2 years
Want to talk about data? Data science, career questions, practical machine learning, and more? Join @DataTalksClub! The link is in the profile description. Learn more about our activities in the thread! 🧵
@DataTalksClub
DataTalksClub
5 days
Our next live podcast episode is right around the corner! Join us on July 7, 2025, as we chat with Lior Barak about Mindful Data Strategy: From Pipelines to Business Impact. You won't want to miss this one! Register here:
@DataTalksClub
DataTalksClub
6 days
Firebolt Core: A Free Self-Hosted Distributed Query Engine. @FireboltHQ just announced Firebolt Core, the exact same distributed, vectorized query engine behind Firebolt Cloud, now available as a forever-free, fully self-hosted edition. Read the launch post:
@DataTalksClub
DataTalksClub
8 days
We've just begun a workshop called "From REST to reasoning: ingest, index, and query with dltHub and Cognee"! Join us live right here:
@DataTalksClub
DataTalksClub
13 days
Are you ready for our next live podcast episode? We're thrilled to have Orell Garten on the show to talk about From Simulation Algorithms to Production-Grade Data Systems. Join us on June 30 to learn more! Register here:
@DataTalksClub
DataTalksClub
13 days
Orchestrating LLM workflows with the Airflow AI SDK. You don't need complex agent architectures to build production-ready AI applications. Join @JulianLaneve, creator of the Airflow AI SDK, on June 26 for a live session to explore how it works. 👉
@DataTalksClub
DataTalksClub
13 days
Without a dedicated ingestion layer, your RAG system will drift, return noisy or broken results, and become hard to maintain. Join our free workshop on June 30 at 4:30 PM CET to build an end-to-end open-source ingestion pipeline for RAG. Register here:
@DataTalksClub
DataTalksClub
13 days
8. Auditability & traceability: Use immutable logs, schema contracts, and metadata to trace every retrieved passage back to its source.
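As a concrete illustration of that point, here is a minimal sketch using only the Python standard library: each chunk carries a content hash and source metadata, and every ingestion event is appended to an immutable (append-only) JSONL audit log. The field names and log path are assumptions for illustration, not part of the original thread.

```python
import hashlib
import json
import time
from pathlib import Path

AUDIT_LOG = Path("ingestion_audit.jsonl")  # append-only audit trail (illustrative path)

def make_chunk_record(text: str, source_uri: str, schema_version: str = "v1") -> dict:
    """Attach provenance metadata so a retrieved passage can be traced to its source."""
    return {
        "chunk_id": hashlib.sha256(text.encode("utf-8")).hexdigest()[:16],
        "source_uri": source_uri,
        "schema_version": schema_version,
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "text": text,
    }

def log_ingestion(record: dict) -> None:
    """Append-only logging: never rewrite existing lines, only add new ones."""
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps({k: v for k, v in record.items() if k != "text"}) + "\n")

chunk = make_chunk_record("Example passage.", "https://example.com/doc.pdf#page=3")
log_ingestion(chunk)
```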
@DataTalksClub
DataTalksClub
13 days
7. Orchestration & resilience: Create retry-safe, parallel workflows to manage network issues, rate limits, or large file loads.
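A retry-safe, parallel fetch can be sketched with just the standard library: exponential backoff around each request plus a thread pool. The URLs, retry count, and backoff values below are placeholder assumptions.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_with_retry(url: str, attempts: int = 3, backoff: float = 1.0) -> bytes:
    """Retry transient failures (timeouts, rate limits) with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff * 2 ** (attempt - 1))  # wait 1s, then 2s, before retrying

urls = ["https://example.com/a.json", "https://example.com/b.json"]  # placeholder sources
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fetch_with_retry, u): u for u in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            payload = fut.result()
            print(url, len(payload), "bytes")
        except Exception as exc:
            print(url, "failed after retries:", exc)
```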
@DataTalksClub
DataTalksClub
13 days
6. Chunking strategies: Use semantic or paragraph-aware splitting to maintain coherence in each embedding, rather than fixed-length slices.
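For example, a paragraph-aware splitter can pack whole paragraphs into chunks up to a size budget instead of slicing every N characters. A minimal sketch, assuming plain-text input with blank-line paragraph breaks; the character budgets are arbitrary.

```python
def paragraph_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Group whole paragraphs into chunks instead of cutting mid-sentence."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)      # current chunk is full: start a new one
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph about pipelines.\n\nSecond paragraph about embeddings.\n\nThird one."
print(paragraph_chunks(doc, max_chars=60))
```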
@DataTalksClub
DataTalksClub
13 days
5. Incremental updates: Track watermarks or change feeds to process only new or modified records without full reloads.
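One simple watermark pattern: persist the latest updated_at timestamp and keep only records newer than it on the next run. The state file and field name are assumptions; ISO-8601 UTC strings compare correctly as plain strings.

```python
import json
from pathlib import Path

STATE_FILE = Path("watermark.json")  # illustrative location for pipeline state

def load_watermark(default: str = "1970-01-01T00:00:00Z") -> str:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["updated_at"]
    return default

def save_watermark(value: str) -> None:
    STATE_FILE.write_text(json.dumps({"updated_at": value}))

def incremental_load(records: list[dict]) -> list[dict]:
    """Keep only records modified since the last run, then advance the watermark."""
    watermark = load_watermark()
    new_records = [r for r in records if r["updated_at"] > watermark]
    if new_records:
        save_watermark(max(r["updated_at"] for r in new_records))
    return new_records

batch = [
    {"id": 1, "updated_at": "2025-06-01T10:00:00Z"},
    {"id": 2, "updated_at": "2025-06-20T09:30:00Z"},
]
print(incremental_load(batch))  # first run returns both; a rerun returns nothing new
```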
@DataTalksClub
DataTalksClub
13 days
4. Schema evolution & versioning: Detect API or schema changes automatically and version each load to prevent index corruption.
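A cheap way to detect schema drift is to fingerprint the sorted field names of incoming records and bump a load version whenever the fingerprint changes. A sketch; the state dictionary here is a stand-in for whatever your pipeline actually stores.

```python
import hashlib

def schema_fingerprint(record: dict) -> str:
    """Hash the sorted field names so any added, removed, or renamed field changes the value."""
    return hashlib.sha256(",".join(sorted(record)).encode()).hexdigest()[:12]

def detect_schema_change(record: dict, state: dict) -> dict:
    fp = schema_fingerprint(record)
    if state.get("fingerprint") != fp:
        state = {"fingerprint": fp, "load_version": state.get("load_version", 0) + 1}
        print(f"schema changed, new load_version={state['load_version']}")
    return state

state = {}
state = detect_schema_change({"id": 1, "name": "a"}, state)               # first load -> v1
state = detect_schema_change({"id": 2, "name": "b"}, state)               # same schema, no bump
state = detect_schema_change({"id": 3, "name": "c", "email": ""}, state)  # new field -> v2
```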
@DataTalksClub
DataTalksClub
13 days
3. Text-specific preparation: Clean noise, fix encoding, split long documents into meaningful chunks, and attach metadata to each piece.
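A minimal cleaning step with the standard library: normalize Unicode, drop control characters, collapse whitespace, and attach source metadata. Function names are illustrative; a real pipeline would also strip HTML boilerplate and apply the chunking shown above.

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Fix encoding oddities and strip layout noise before chunking and embedding."""
    text = unicodedata.normalize("NFKC", raw)   # normalize odd Unicode forms (e.g. NBSP)
    text = text.replace("\x00", "")             # drop stray control characters
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # squeeze excessive blank lines
    return text.strip()

def prepare_document(raw: str, source: str) -> dict:
    return {"source": source, "text": clean_text(raw)}

print(prepare_document("Data\u00a0ingestion\t\tmatters.\n\n\n\nReally.", "notes.txt"))
```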
@DataTalksClub
DataTalksClub
13 days
2. Heterogeneous sources: Normalize JSON APIs, CSVs, PDFs, and HTML into a consistent format before embedding.
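A sketch of funneling JSON, CSV, and HTML payloads into one common record shape before embedding. PDF handling is omitted because it needs a third-party parser, and the target {source, text} shape is an assumption.

```python
import csv
import io
import json
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the visible text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def normalize(source_type: str, payload: str, source: str) -> list[dict]:
    """Turn JSON, CSV, or HTML payloads into a common {source, text} record."""
    if source_type == "json":
        return [{"source": source, "text": json.dumps(r)} for r in json.loads(payload)]
    if source_type == "csv":
        reader = csv.DictReader(io.StringIO(payload))
        return [{"source": source, "text": json.dumps(r)} for r in reader]
    if source_type == "html":
        extractor = _TextExtractor()
        extractor.feed(payload)
        return [{"source": source, "text": " ".join(extractor.parts).strip()}]
    raise ValueError(f"unsupported source type: {source_type}")

print(normalize("csv", "id,name\n1,Alice\n", "users.csv"))
print(normalize("html", "<p>Hello <b>world</b></p>", "page.html"))
```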
@DataTalksClub
DataTalksClub
13 days
1. Freshness & accuracy: Regularly pull new documents, API data, or logs so your LLM always has up-to-date context.
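A bare-bones freshness loop: poll an API on a fixed interval and hand the results to the rest of the pipeline. The endpoint and interval are placeholders, and in practice a scheduler or orchestrator would replace the sleep loop.

```python
import json
import time
import urllib.request

API_URL = "https://example.com/api/documents"  # placeholder endpoint
POLL_INTERVAL_SECONDS = 15 * 60                # placeholder schedule

def pull_latest() -> list[dict]:
    """Fetch the current documents so downstream embeddings stay up to date."""
    with urllib.request.urlopen(API_URL, timeout=10) as resp:
        return json.loads(resp.read())

def run_forever():
    while True:
        try:
            docs = pull_latest()
            print(f"pulled {len(docs)} documents")  # hand off to cleaning/chunking here
        except Exception as exc:
            print("pull failed, will retry next cycle:", exc)
        time.sleep(POLL_INTERVAL_SECONDS)

# run_forever()  # left commented out so importing this sketch has no side effects
```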
@DataTalksClub
DataTalksClub
13 days
Why does RAG need specialized data ingestion? 🤔 8 reasons:
@DataTalksClub
DataTalksClub
15 days
On June 30, in our workshop, we'll show you how to build a data ingestion pipeline for RAG applications using open-source tools. Register now: 4/4
@DataTalksClub
DataTalksClub
15 days
Why it matters:
🔹 Fresh data ensures accurate analytics and ML models.
🔹 Automate ETL tasks.
🔹 Validate weak formats (CSV/JSON) for consistency.
🔹 Gain traceability and audit trails for compliance.
🔹 Scale with data growth; avoid manual bottlenecks.
3/4
@DataTalksClub
DataTalksClub
15 days
Data ingestion is the process of:
1. Extracting raw data from sources (APIs, logs, databases).
2. Transporting it into a staging area (data lake, warehouse, message queue).
3. Preparing it for use by normalizing formats, cleaning values, and enriching with metadata.
2/4
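A toy version of those three steps in plain Python: extract from an API, land the raw payload in a staging directory, then normalize and enrich it. Paths, field names, and the example endpoint are assumptions for illustration.

```python
import json
import time
import urllib.request
from pathlib import Path

STAGING_DIR = Path("staging")  # stand-in for a data lake / warehouse landing zone

def extract(url: str) -> bytes:
    """1. Extract raw data from a source (here, a JSON API)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def stage(raw: bytes) -> Path:
    """2. Transport the untouched payload into a staging area."""
    STAGING_DIR.mkdir(exist_ok=True)
    path = STAGING_DIR / f"batch_{int(time.time())}.json"
    path.write_bytes(raw)
    return path

def prepare(path: Path) -> list[dict]:
    """3. Normalize formats, clean values, and enrich with metadata."""
    records = json.loads(path.read_text())
    return [
        {**{k: (v.strip() if isinstance(v, str) else v) for k, v in r.items()},
         "_ingested_from": path.name}
        for r in records
    ]

# prepared = prepare(stage(extract("https://example.com/api/records")))  # placeholder endpoint
```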
@DataTalksClub
DataTalksClub
15 days
What is data ingestion, and why do you need it? 1/4
@DataTalksClub
DataTalksClub
16 days
Find a complete implementation of these steps along with the code in our free LLM Zoomcamp module: