
Andrew Lamb
@andrewlamb1111
Followers
3K
Following
356
Media
81
Statuses
696
Apache {DataFusion, Arrow} PMC, Database Engineer
Joined November 2020
It's happening -- DataFusion will (finally) get spilling hash joins. The march to completeness begins.
I'd like to start using this platform as a place to post about open source work I do in my off time. To lead it off, I have posted a hash join spilling proposal in Apache DataFusion. Check it out if you're interested 😀
0
4
48
RT @jonathanc_n: I'd like to start using this platform as a place to post about open source work I do on my off time. To lead it off, I ha….
github.com
Is your feature request related to a problem or challenge? I wanted to share the idea for hash join spilling here as I would like to get input on this as hash join spilling is one of the core funct...
0
3
0
Join us in Boston at the DataDog offices for the @apachedatafu meetup on Nov 12 for pre 🦃 discussion of databases
lu.ma
Join us for an evening of talks, panel discussion, and community discussion about Apache DataFusion and its growing role in modern data infrastructure. This…
0
0
2
🎣 Anyone want to try and help implement a proposed improvement to @ApacheParquet for better floating-point support? Open source fame and glory await. 🙏🙏🙏🙏🙏 [Parquet] Prototype: PARQUET-2249: Introduce IEEE 754 total order & NaN-counts #514 #8156
3
4
57
It is a common misconception that Parquet requires (slow) reparsing of metadata and is limited to built-in indexing structures. Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on @ApacheParquet with @ApacheDataFusio
4
11
106
Despite a seemingly common misperception of @ApacheParquet implementation fragmentation, there is a lot of green on the implementation status page:
0
4
46
RT @SpiralDB: It's official: Vortex has been accepted as an Incubation-stage @linuxfoundation project 🍾. The core of the Composable Data St….
0
13
0
RT @pauldix: Congrats to @_willmanning and the @SpiralDB team for getting Vortex hosted by the Linux Foundation! M….
linuxfoundation.org
Vortex Project Joins LF AI & Data Foundation
0
2
0
It's starting -- Ed is planning to write a custom @ApacheParquet thrift decoder. I expect a 2-4x improvement in footer parsing with no format changes required. I am pretty stoked to see this.
github.com
Is your feature request related to a problem or challenge? Please describe what you are trying to do. Part of #5853 Parsing the parquet metadata takes substantial time and most of that time is spen...
1
3
32
Refactoring SUM(x), SUM(x+1), SUM(x+2) is another dead giveaway of Clickbenchmaxxing:
github.com
Suppose there is a query that contains the aggregate SUM(x + 1), this aggregate can be decomposed into SUM(x) + COUNT(x). In particular if there are multiple of such clauses, e.g. SUM(x + 1), SUM(x...
0
0
7
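The rewrite described in the linked issue can be checked with a small worked example. This is only a sketch using Python's built-in sqlite3 module, not DataFusion or the optimizer discussed in the issue; the table `t` and its values are made up for illustration, and the `SUM(x) + 2 * COUNT(x)` form for `SUM(x + 2)` is my own extension of the stated `SUM(x + 1)` case.

```python
import sqlite3

# Hypothetical table for illustration; includes a NULL to show that
# COUNT(x) skips the same rows that SUM(x + 1) skips.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,), (4,)])

# Each aggregate evaluates its own expression per row.
direct = conn.execute(
    "SELECT SUM(x), SUM(x + 1), SUM(x + 2) FROM t"
).fetchone()

# Rewritten form: only SUM(x) and COUNT(x) are needed, since
# SUM(x + k) = SUM(x) + k * COUNT(x) over the non-NULL rows.
rewritten = conn.execute(
    "SELECT SUM(x), SUM(x) + COUNT(x), SUM(x) + 2 * COUNT(x) FROM t"
).fetchone()

print(direct, rewritten)
```

Both queries return the same row here. The equivalence ignores integer overflow, which an engine applying this rewrite would need to consider.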
We are doing another DataFusion meetup in Boston Wednesday Nov 12, 2025
lu.ma
Join us for an evening of talks, panel discussion, and community discussion about Apache DataFusion and its growing role in modern data infrastructure. This…
0
3
10
RT @OnlyXuanwo: Some good first issues available in @ApacheOpenDAL, welcome to jump in!.
github.com
Overview This issue tracks the remaining work needed to fully implement the object_store trait in OpenDAL's integration layer. While the core functionality is working, several methods return No...
0
3
0
"EDB claimed the new engine, which pushes queries to open source @ApacheDataFusio , returned queries 30x faster than standard Postgres while tiering offloads cold transactional data to storage is up 18x more cost-efficient.".
0
5
31
Multi-level merge sort queued up for DataFusion 50.0.0 next month: Thanks to @rluvaton and Yongting You.
github.com
Which issue does this PR close? Closes A complete solution for stable and safe sort with spill #14692. Rationale for this change We need merge sort that does not fail with out of memory What chan...
1
6
44
@cwi_da @peterabcz @afroozeh3 @ApacheParquet The confusion between format and implementation is common in academic papers, and I think it hinders industrial adoption of the technology. For example, the number of format implementations: BtrBlocks: 1, FastLanes: 1, Parquet: 10+ (that *I* can name), an order of magnitude more.
1
0
1
@cwi_da @peterabcz @afroozeh3 @ApacheParquet Finally, the paper makes several statements about "Parquet" which are really about a particular implementation (probably DuckDB's) -- e.g. "access granularity", "can return compressed vectors", "uses physical sizes for row groups".
1
0
0
@cwi_da @peterabcz @afroozeh3 @ApacheParquet I understand the value of a new format for flexibility while researching, but I think the paper's contribution is broader. I believe the paper would be more impactful if it emphasized more strongly that the techniques do not *require* a new file format.
1
0
2
@cwi_da @peterabcz @afroozeh3 @ApacheParquet The paper was a good read. Specifically, I think the Encoding Expressions framework is a very nice idea for expressing cascading encodings.
0
1
6
The FastLanes format paper from @afroozeh3 and @peterabcz contains interesting and practical ideas for representing SIMD friendly cascaded encodings. I think almost all of the ideas could be applied to extend @ApacheParquet as a new encoding scheme.
github.com
Next-Gen Big Data File Format. Contribute to cwida/FastLanes development by creating an account on GitHub.
3
9
73