Andrew Lamb

@andrewlamb1111

Followers
3K
Following
356
Media
81
Statuses
696

Apache {DataFusion, Arrow} PMC, Database Engineer

Joined November 2020
@andrewlamb1111
Andrew Lamb
13 hours
It's happening -- DataFusion will (finally) get spilling hash joins. The march to completeness begins.
@jonathanc_n
jonathanc-n
1 day
I'd like to start using this platform as a place to post about open source work I do in my off time. To lead it off, I have posted a hash join spilling proposal in Apache DataFusion. Check it out if you're interested 😀.
0
4
48
@andrewlamb1111
Andrew Lamb
7 days
🎣 Anyone want to try and help implement a proposed improvement to @ApacheParquet for better floating-point support? Open source fame and glory await. 🙏🙏🙏🙏🙏 [Parquet] Prototype: PARQUET-2249: Introduce IEEE 754 total order & NaN-counts #514 #8156.
3
4
57
@andrewlamb1111
Andrew Lamb
7 days
It is a common misconception that Parquet requires (slow) reparsing of metadata and is limited to built-in indexing structures. Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on @ApacheParquet with @ApacheDataFusio.
4
11
106
@andrewlamb1111
Andrew Lamb
14 days
Despite a seemingly common misperception of @ApacheParquet implementation fragmentation, there is a lot of green on the implementation status page:
0
4
46
@andrewlamb1111
Andrew Lamb
15 days
RT @SpiralDB: It's official: Vortex has been accepted as an Incubation-stage @linuxfoundation project 🍾. The core of the Composable Data St….
0
13
0
@andrewlamb1111
Andrew Lamb
15 days
RT @pauldix: Congrats to @_willmanning and the @SpiralDB team for getting Vortex hosted by the Linux Foundation! M….
linuxfoundation.org
Vortex Project Joins LF AI & Data Foundation
0
2
0
@andrewlamb1111
Andrew Lamb
17 days
It's starting -- Ed is planning to write a custom @ApacheParquet Thrift decoder. I expect a 2-4x improvement in footer parsing with no format changes required. I am pretty stoked to see this.
github.com
Is your feature request related to a problem or challenge? Please describe what you are trying to do. Part of #5853 Parsing the parquet metadata takes substantial time and most of that time is spen...
1
3
32
@andrewlamb1111
Andrew Lamb
17 days
Benchmaxxing (verb): to add specific optimizations that only impact benchmark results and are not widely applicable to real world use cases.
4
3
50
@andrewlamb1111
Andrew Lamb
23 days
"EDB claimed the new engine, which pushes queries to open source @ApacheDataFusio, returned queries 30x faster than standard Postgres while tiering offloads cold transactional data to storage is up 18x more cost-efficient."
0
5
31
@andrewlamb1111
Andrew Lamb
26 days
@cwi_da @peterabcz @afroozeh3 @ApacheParquet The confusion between format and implementation is common in academic papers, and I think it hinders industrial adoption of the technology. For example, the number of implementations per format: BtrBlocks: 1; FastLanes: 1; Parquet: 10+ (that *I* can name), an order of magnitude more.
1
0
1
@andrewlamb1111
Andrew Lamb
26 days
@cwi_da @peterabcz @afroozeh3 @ApacheParquet Finally, the paper makes several statements about "Parquet" which are really about a particular implementation (probably DuckDB's) -- e.g. "access granularity", "can return compressed vectors", "uses physical sizes for row groups".
1
0
0
@andrewlamb1111
Andrew Lamb
26 days
@cwi_da @peterabcz @afroozeh3 @ApacheParquet I understand the value of a new format for flexibility while researching, but I think the paper's contribution is broader. I believe the paper would be more impactful if it more strongly emphasized that the techniques do not *require* a new file format.
1
0
2
@andrewlamb1111
Andrew Lamb
26 days
@cwi_da @peterabcz @afroozeh3 @ApacheParquet The paper was a good read. Specifically, I think the Encoding Expressions framework is a very nice idea for expressing cascading encodings.
@andrewlamb1111
Andrew Lamb
26 days
The FastLanes format paper from @afroozeh3 and @peterabcz contains interesting and practical ideas for representing SIMD-friendly cascaded encodings. I think almost all of the ideas could be applied to extend @ApacheParquet as a new encoding scheme.
0
1
6
@andrewlamb1111
Andrew Lamb
26 days
The FastLanes format paper from @afroozeh3 and @peterabcz contains interesting and practical ideas for representing SIMD-friendly cascaded encodings. I think almost all of the ideas could be applied to extend @ApacheParquet as a new encoding scheme.
github.com
Next-Gen Big Data File Format. Contribute to cwida/FastLanes development by creating an account on GitHub.
3
9
73