🔥 Already excited about Theseus? It gets better. 🔥
Dive deeper into what makes Theseus the ultimate choice for petabyte-scale data challenges with our bar-raising benchmarking report. See the proof of unparalleled performance & efficiency. 🚀
📊 Full report here:
It's official! Voltron Data is changing the way people interact with #data and #hardware with support from top-tier investors. If you're passionate about open-source and standards in the data analytics ecosystem, read on, and become a Voltronaut!
One major strength of @ApacheArrow is that it supports a wide variety of programming languages including R, Python, and C++. This can enable faster and smoother #data processing. Learn more at #TheDataThread.
For the next edition of Pulling the Thread on 8/24, @zeroshade discusses how to get even more mileage out of @GraphQL with the help of Arrow Flight RPC. Join the live event w/ Q&A at a new time - *3pm* EDT #JSON
If you’re a data scientist working with large data in #Rstats, you won’t want to miss the next Pulling the Thread event on 8/10 @ 12:30 pm EDT! @djnavarro will discuss the advantages of using the Arrow package in R and break down how it all works w/ Q&A.
If you’re an #RStats developer working with large datasets, it’s time to explore what #ApacheArrow - the multi-language toolbox for working with larger-than-memory data - can do for you. Get access to a tutorial and resources in our latest blog:
🚀 Introducing: The Composable Codex! This is a 5-part dive into #ComposableDataSystems. Starting from zero: A New Frontier. It’s a look at the history, technology, trends, and opportunities that are all critical to our arrival at the Composable era.
Our co-founder @wesmckinn shares the origin story of @ApacheArrow and how the project has evolved. Learn about the project’s capabilities and use cases as well as future goals that could benefit the community on our blog. #BigData
The @ApacheArrow 12.0.0 release is out and full of feature improvements and optimizations. Big shoutout to the 97 contributors for all the hard work over the last 3 months. Get our highlights in the 🧵
#ApacheArrow 11.0.0 is out! This release strengthens its position as the backbone for #developers who need to interop between different libraries, tools, and languages. Over the last few months, 95 contributors resolved 423 issues. See the 🧵 for highlights:
Building on the strengths of @ApacheArrow Flight, Arrow Flight SQL offers a smoother method for connectivity between different databases. When combined with @DuckDB, the opportunities for data pipelines can expand even further. Learn how they can be used:
Data transfer is a major bottleneck for many data workloads. Apache Doris recently adopted Arrow Flight SQL, which allows it to leverage the Python ADBC Driver for fast data reading.
This tutorial shows how to do data transmission at scale with Python, JDBC, Java, and Spark. In
If you're using DuckDB, come join our discussion today on how you can leverage Ibis to give your @duckdb pipelines even more power.
You can run your local experiments using DuckDB, and deploy to production using Ibis with a single line of code change!
Ibis supports 20+
To kick off the new year, we’re highlighting 12 #OpenSource projects to watch in 2023. These technologies are modernizing the way #DataAnalytics and #MachineLearning get done. Read the report now!
Preview release of the nanoarrow-based @SnowflakeDB Connector for #Python is available! This connector is ~10x smaller in size and removes a hard dependency on a specific version of #PyArrow. Learn how the nanoarrow integration makes it possible.
Now available: @SnowflakeDB driver for @ApacheArrow #ADBC! This makes it quicker and easier to connect to Snowflake using languages that don't already have connectors.
Arrow Database Connectivity (ADBC) is being adopted more and more by organizations seeking to improve data portability and performance. In this blog post, get resources to learn why @SnowflakeDB @dbt_labs @duckdb integrated it!
Meet Theseus - the petabyte-scale query engine redefining the #DataAnalytics SPACE - scale, performance and cost efficiency. 🚀 Process hundreds of TBs in seconds & unleash GPUs for real-time insights. 👉 Unveiled at #NVIDIAGTC:
You can use @IbisData to maximize your pandas code - without needing to refactor the underlying codebase. Today we show you how, using @duckdb to speed up access to insights.
Arrow Database Connectivity (ADBC) gives users a single way to get Arrow data in and out of databases - whether columnar or not. It’s a new API in the @ApacheArrow ecosystem and one we think you should know about:
Ready to take your @ApacheArrow experience to the next level? We're pleased to offer a new enterprise subscription that will help simplify issue reporting, provide faster fixes in between releases, and much more.
Tomorrow @ 12:30 pm EDT, join @djnavarro for the follow-up to her popular session Doing More with #Data: An Introduction to Arrow for R Users. Subscribe to the YouTube channel for alerts on this and other exciting community events! #RStats
Refactoring or rewriting code wastes precious time and resources! If you’re working with #pandas codebases, turn to @IbisData and @duckdb to accelerate analytics and get insights faster - without changing your underlying pandas code.
Dataframe Interoperability in action! We explain the dataframe interchange protocol and how it’s pushing composability forward in the #Python ecosystem. We demo it using @ApacheArrow to visualize a #PyArrow table with @vega_vis Altair without conversion.
We recently released a proof-of-concept SQLAlchemy driver for Arrow Flight SQL. This supports deploying Arrow-native data storage and analytics stacks leveraging Flight and @apachesuperset as the interface. See an example of this in action on the blog!
The latest @ApacheArrow 8.0.0 release introduces new capabilities that make it an even more powerful toolkit for data analytics and establishes a foundation for future work that will improve performance and interoperability. Read about it.
Inside the #OpenSource Report: It’s fast, lightweight, and already embraced by many. @DuckDB is pulling users from traditional relational databases and we see why. Get our take on why it’s a top project to watch this year:
The latest @ApacheArrow 9.0.0 release is available. It includes built-in support for @GoogleCloud storage, performance enhancements for Acero, and much more. See the highlights:
ICYMI, the latest release of @ApacheArrow was shared earlier this month. Learn more about some of the newest features from the 7.0.0 release including the addition of Flight SQL and changes to #PyArrow.
It was exciting to see the composable data systems vision become a reality at VeloxCon!
As data use cases become more diverse, companies need shared open standards and modular components that teams can leverage to quickly build data systems that fit
SQL and Data Frames Unite! Learn how @IbisData bridges the language gap by providing a uniform #Python API for data warehousing tools that generate SQL code.
The recent ADBC release lets users connect to databases through Arrow Flight SQL while remaining in #Python. This cuts down on complexity and enhances performance. We benchmarked this new capability against the Arrow JDBC driver. See the results:
Today on the blog: we dig into the Census PUMS dataset to show why an up-front investment in #Parquet (over CSV) provides an opportunity to use your data…rather than reading it.
LLMs unlock capabilities for organizations to chat with data. But what if you want to chat with or ask questions of your own data, stored in different sources? Today we show how to use @LangChainAI and @IbisData.
The @IbisData 4.1 release brings a new pair of functions for reading CSV and Parquet to any supported backend, with the same line of code. @kae_suarez walks through what that looks like on #ApacheArrow DataFusion, @duckdb, and @DataPolars. Check it out!
Love to see @IbisData powering #opensource tools! @Google uses Ibis in its Data Validation Tool, which helps users ensure their data stays intact after moving it between different backends, including #BigQuery, Teradata, Cloud SQL, and more.
For the last decade, our team has championed #OpenSource Standards and software development on #GPUs.
Today, our CEO @datametrician takes it to the next level at #HPEdiscover.
Introducing Theseus: the accelerator-native data processing engine.
Now inside @HPE_Ezmeral.
New release alert: @ApacheArrow nanoarrow 0.2 is out! nanoarrow is a lightweight C library that helps apps implement Arrow C Data & Stream interfaces. The release offers an IPC reader and a Getting Started tutorial, among other improvements. Learn more:
In our latest post, @marlene_zw uses @IbisData to analyze a massive Hacker News dataset pulled from Google #BigQuery. If you use #python for data analysis and feel the pain of using #SQL for large queries…this is a must-read:
We're working with @MetaOpenSource's #Velox project to improve the developer experience. Together, we'll enable modular, composable accelerated query processing that integrates with the rest of the open source #ApacheArrow ecosystem. Read more:
Looking for a primer on how @ApacheArrow can work to streamline your applications? Check out "In-Memory Analytics with Apache Arrow" - an essential resource from @zeroshade:
Find out what happened when @mim_djo and @MurrayData teamed up to see how fast a 60 million row #Parquet file could be processed, benchmarking numerous data engines. The results might surprise you… Then try it yourself!
Check out this new @github Actions package our team built to help #DevOps teams maximize runners and hit high-velocity CI/CD. Read our blog for access and to learn more.
Want to use LLMs to chat with or query your own data? Using @IbisData with @LangChainAI you can. Ibis gives users the flexibility needed to connect LLMs to multiple data sources & systems - with a few lines of code. Learn how:
We took a dataset that was 3.1 GB compressed, 14 GB uncompressed, and 50 GB in memory, with almost a hundred files and 200+ columns… and transformed it into #Parquet. The results? A compact, fast-to-use, and portable dataset. Check it out:
Have you thought about using #golang for your #DataScience workflows? When paired with #ApacheArrow and #Parquet you can experience advantages for common use cases. @zeroshade wrote a series of posts covering this topic. First up: how to get started.
Over the past few years @IbisData downloads and adoption have spiked. We caught up with maintainers and contributors to get their perspective – and got a preview of the 4.0 release coming soon.
Formalized database access for Apache Arrow! The community accepted the Arrow Database Connectivity (aka ADBC) specification. Now applications have a simple API abstraction for moving Arrow data in and out of databases. Learn more here:
It's a great day to learn about @IbisData! Listen to @cpcloudy, Principal Engineer at @VoltronData and lead maintainer of the Ibis Project, speak with @digiglean on @realpython's recent podcast episode "Decoupling Systems to Get Closer to the Data" >
On today's blog we’re spotlighting a specific change in the #ApacheArrow 9.0.0 release. Learn how the Arrow C++ execution engine, Acero, drives significant performance gains in Arrow workflows.
We want to extend our sincere gratitude to everyone who gave a talk and attended #TheDataThread (the first event of many!). If you had trouble accessing the platform or couldn't attend, all live and pre-recorded sessions are available here:
Is Apache Arrow Flight SQL on your radar yet? It’s a new columnar database protocol in the @ApacheArrow project. Today on the blog, we cover some of the questions we’ve heard about its capabilities. Check it out!
Want to experiment with Arrow Flight SQL? Our latest blog shows you how to run Flight SQL using a #DuckDB backend - and we give you the code + a pre-built Docker container to get started.
This is impressive. A user request came in for an @IbisData #ApacheDruid backend connection. A few hours later, @cpcloudy shipped it. Learn how he did it:
This is a big step for bridging real-time machine learning and accelerating data processing for AI workloads on GPUs. Welcome to Voltron Data, @Claypot! It starts with modular and composable standards for #RealTime… more to come next week 💥💥💥
If you need to balance performance and memory limitations, use @IbisData and @duckdb to convert CSV files to @ApacheParquet for increased flexibility and in-memory columnar capabilities. We show you how with the UK Census Data!
PSA: Tomorrow you can learn about @IbisData from @fishnets88 on the @probabl_ai live coding stream! Click the link to see what time it is streaming in your area.
Per suggestion from the crowd, I will be exploring Ibis in the next live coding stream on behalf of @probabl_ai. I'll also do my best to find a fun dataset for this one 🙂
New JDBC driver connects apps to Arrow-native databases! Applications using #JDBC can now talk to databases supporting Flight SQL. Big shoutout to @Dremio for their contributions. Learn more here:
The Wall is coming……and it’s 😱
Fortunately, Chapter 04 of the #ComposableCodex is HERE.
Read The Wall & The Machine to understand the macro trends (and threats) impacting data system performance in the face of #AI.
Available now:
If we didn’t meet at @nvidia #GTC24, watch our session, "Breaking Down The Wall: Accelerator-Native Now" [S63422]. Our cofounder @rodaramburu presents our new benchmark data and shares our vision for accelerating the full data system.
In the NEW #ComposableCodex chapter we make one thing clear: Don't move your data. No really, don’t do it. Learn how composable connectivity standards for data format, access & transport help systems rise above the data sprawl. #ComposableDataSystems
Introducing #Velox, an open source execution engine for data management systems that both unifies & accelerates common data computation engines, like Spark & Presto. Learn how Velox improves computation-intensive data workloads & how you can participate:
We’re excited to present at #VLDB2023 and the co-located Composable Data Management Systems (CDMS) workshop this week! This is where #database innovation + industry best practice meet. Preview what we'll cover, and catch our talks if you attend! @VLDBconf
We’re counting down to the start of #TheDataThread, an @ApacheArrow community event 🎉. Don’t miss @WesMckinn & @IntJesus as they kick off the live talks with their keynote TODAY at 12pm ET. Registrants should have a link to the event from Zoom in their inbox.
One week away: See a live demo of our #GPU query engine, Theseus, and learn what Voltron Data is up to. Register now for this 45-min live virtual event hosted by our co-founders, @rodaramburu @keithjkraus with @philbewankenobi running the demo 🔥
Learn how to run an #LLM in a #Python UDF using @IbisData & @duckdb. This means you can augment your data system to integrate LLMs into tabular data workflows - unlocking faster data training, labeling, querying in natural language & more!