
Andrew Lamb
@andrewlamb1111
Followers
3K
Following
406
Media
99
Statuses
746
Apache {DataFusion, Arrow} PMC, Database Engineer
Joined November 2020
Prateek Gaur and co at @Snowflake reproduced the (great) results for the ALP encoding algorithm from @cwi_da / @afroozeh3 / @peterabcz. ALP achieves ZSTD levels of compression and much faster decode. We are discussing adding it to @ApacheParquet: https://t.co/gxwF5QqtNO
0
10
76
@SpiralDB @CMUDB @ApacheParquet The idea of using WASM as a forward compatibility mechanism I thought was especially neat
0
0
5
The talk on @SpiralDB at @CMUDB: https://t.co/6mRfsnDZiP is a great one. I think it would also be interesting to hear a counterpoint about @ApacheParquet that explains actual technical details of that format, the Cathedral vs Bazaar management, options with Metadata, etc
1
15
112
Our new thrift parser in the Rust @ApacheParquet implementation is a 🎁 that keeps on giving performance wise 🚀 https://t.co/b6lHJbxQzd We are also working on a blog post that has a deeper explanation
2
8
137
Yesterday I learned about the SpatialBench from Sedona https://t.co/I7MYOptkuK which they based on our tpchgen-rs project: https://t.co/PR0F0AS9SD (BTW I am still looking for some more github watchers on tpchgen-rs so I can get it on homebrew)
0
2
32
BTW if anyone wants a good intro to database storage / Log structured storage (aka LSM trees), the @CMUDB lecture this fall is a good one:
0
30
281
It starts: https://t.co/0fhieCL0BX clflushopt is going to make the world's fastest tpc-ds generator
github.com
WIP (out of tree) Rust implementation of TPC-DS generators. - clflushopt/tpcdsgen
2
3
31
I am proud to announce I am now a committer on the @ApacheParquet project. Realistically this likely means more reviews / helping clarify the parquet specs, but I also hope to help more actively evolve the format, especially around new encodings. https://t.co/lnR71Po1yA
5
2
109
I am really proud to announce that we raised €18M in series A. We have got big plans on improving Polars. Great things to come!
We raised €18M in Series A led by @Accel to build fast data processing at any scale. All on Polars. https://t.co/Qy13YezymD
5
2
63
BTW I would love some help getting some official DataFusion 50 benchmarks into ClickBench --
github.com
Is your feature request related to a problem or challenge? Follow on to #16643 #14587 Requires #16799 Describe the solution you'd like Now that DataFusion 50.0.0 is released, It would be great t...
0
0
3
Cloudflare's distributed R2 SQL engine is a pretty good exemplar of how to build a serverless database to process petabytes in seconds using @ApacheDataFusio and @ApacheParquet
https://t.co/8QSk6fOwuQ
blog.cloudflare.com
R2 SQL provides a built-in, serverless way to run ad-hoc analytic queries against your R2 Data Catalog. This post dives deep under the Iceberg into how we built this distributed engine, from its...
3
10
91
@ApacheParquet @ApacheDataFusio Check out the follow on from @jcsherin who used these techniques to put full text indexes in parquet:
1
0
7
So cool: @jcsherin added full text indexes into Parquet files using the techniques from our blog https://t.co/t0eDGHeG9c
4
9
50
"Introducing SedonaDB: A single-node analytical database engine with geospatial as a first-class citizen" Built in Rust with @ApacheDataFusio
https://t.co/bsneiAJFRv
0
4
32
Example of the level of optimization obsession possible with the @ApacheDataFusio community ❤️ :
github.com
I am interested in predicates like: CASE WHEN X THEN "a" WHEN Y THEN "b" ... END = "a" CASE WHEN X THEN "a" WHEN Y THEN &...
0
0
3
We just published an easier to find list of all PMC and committers on @ApacheDataFusio, and it is quite a cool list of people and affiliations if I do say so myself 🤗 https://t.co/OOYNgf58eZ
1
5
33
And we are also adding Geometry to the Rust parquet implementation. Huge thanks to @kylebarron2
github.com
Is your feature request related to a problem or challenge? Please describe what you are trying to do. Parquet recently adopted Geometry and Geography types: apache/parquet-format@master/Geospatial....
0
4
22
It was a great time on Monday at the @ApacheDataFusio meetup in NYC. We heard about distributed query plans, filter pushdown, geospatial support, and VegaFusion. More deets here https://t.co/Axugrv05P3
1
2
26
6 hours to generate TPCH SF750000 dataset using a worker pool of 1000 parallel processes (spread across 25 VMs). BTW SF750000 is 750TB raw / 220 TB parquet. https://t.co/Piqq2ubVIw
github.com
Tracking Issue for v2.0.0 Open issues related to v2.0.0 Memory size growth (#76 & #150) #152 #146 #80 (not sure if we want to include this one) #145 We don't have to resolve all of these, I thin...
1
3
26