The bad news is we are all going to get covid multiple times over the course of our lives. The good news is it is no longer any more dangerous than colds and flu. If we don’t want to spend the rest of our lives living in a TSA checkpoint, it’s time to start ignoring covid.
The data show that covid doesn’t spread through schools. California has kept the schools closed. The data show that outdoor transmission is 99% less likely. California has outdoor mask mandates and is closing outdoor dining. Policy is being driven by scientists, but not by science.
How fast is DuckDB compared to the best commercial data warehouses? I decided to benchmark it myself. Short version: very fast! But it's not (yet) great at scaling up to many cores.
Data lake support is one of the most technically challenging things we've ever delivered. Writing updates to S3 requires building a quasi-DWH inside Fivetran. We use @DuckDB to rewrite the parquet files and built a BigQuery-style scale-out service to deal with large tables.
Breaking news! Fivetran’s Amazon S3 destination, with Apache Iceberg, has officially moved to general availability.
Swipe to discover a few highlights. ➡️
Check out this blog to discover more:
I’m a big DuckDB fan but this direction worries me. If the syntax diverges so far from standard SQL it’s going to be really hard to build tooling for DuckDB.
Do you wish your DB was more Pythonic? How about more fluent?
In @DuckDB, you can now chain functions together! No more reading from the inside out - you can read your code left to right!
SELECT ([1,2,3]).filter(x -> x<3).apply(x -> x*5)
Modern SQL!
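For contrast, here's a minimal plain-Python sketch of the same left-to-right idea. The `Chain` class is a hypothetical illustration of the chaining style, not DuckDB's API; DuckDB's actual SQL syntax is the `SELECT` above.

```python
class Chain:
    """Tiny wrapper that lets list operations chain left to right."""
    def __init__(self, items):
        self.items = list(items)

    def filter(self, fn):
        return Chain(x for x in self.items if fn(x))

    def apply(self, fn):
        return Chain(fn(x) for x in self.items)

# Inside-out: the last thing you read (filter) is the first thing that runs.
nested = list(map(lambda x: x * 5, filter(lambda x: x < 3, [1, 2, 3])))

# Left to right, mirroring SELECT ([1,2,3]).filter(x -> x<3).apply(x -> x*5)
chained = Chain([1, 2, 3]).filter(lambda x: x < 3).apply(lambda x: x * 5).items

print(nested, chained)  # both [5, 10]
```

Both forms compute the same result; the chained form just reads in execution order.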
Two years ago, Fivetran took on a unique type of debt from @generalcatalyst. Debt comes with covenants and payback, and these can be deadly to a mid-stage startup if things go wrong. But GC has created a unique offering that sits between traditional debt and equity. We’ve been
The top-secret truth of the whole ETL vs ELT thing is...ELT is really still ETL. We do a *ton* of transformation at @Fivetran, but it’s all automated and generic. The normalized schema you see in your DWH doesn’t just *happen*.
When Taylor and I started Fivetran, we each wrote a personal check to our new SVB bank account to purchase the initial shares. SVB had no branches so we went to their main office, rang the bell, and waited for someone to come out and take the checks.
Yikes! One of the best decisions we made at Fivetran was *not* using an ORM. Human-written SQL queries tend to be much simpler and avoid exploring these crazy corner cases of the performance space.
2 years ago Fivetran introduced objective, consumption-based pricing. Today, we’re making some adjustments based on what we’ve learned in the last 2 years. Why are we making these changes and how will they affect our customers? 🧵
@lpolovets @stripe Of the (many, many) investors who passed in Fivetran’s early days, two gave detailed feedback: you and @tonsing. I appreciated it greatly at the time, and all these years later I still haven’t forgotten!
New blog post by @cfwang1337 at @fivetran. The big 3 data platforms, Snowflake, BigQuery, and Databricks, are converging on the same 3 core capabilities:
1/ Vectorized SQL execution.
2/ Python dataframes.
3/ Lakehouse.
Let’s see if we can create a fake data trend. Headless data warehouse? Machine learning mesh? Reverse business intelligence? C’mon people, we can do this.
Fun fact: Fivetran still uses a single vertically-scaled Postgres database for our production workload. 26k transactions per second. To replicate it to our data warehouse takes about 10m every 15m, using Fivetran (obviously). We replicate off the primary.
It's very peculiar how people will spend $100ks on head count but freak out at spending $10ks on tools. Especially with @Fivetran, where 90% of what you're paying for is various forms of data cleanup and failure-recovery that you would otherwise have to manage yourself. 🤷‍♂️
3¢ per transaction for Stripe Data Pipeline is kinda nuts. By comparison, for a customer with 100m MAR, Fivetran is 0.01¢ / MAR, and Fivetran is not an especially cheap data pipeline.
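A back-of-the-envelope version of that comparison, assuming one replicated row per Stripe transaction and taking both quoted rates at face value (real pricing is tiered, so this is illustrative only):

```python
# Monthly cost at the volumes in the tweet: 100m rows/transactions.
rows = 100_000_000
stripe_cents_per_txn = 3        # 3 cents per transaction
fivetran_cents_per_mar = 0.01   # 0.01 cents per monthly active row

stripe_cost = rows * stripe_cents_per_txn / 100      # in dollars
fivetran_cost = rows * fivetran_cents_per_mar / 100  # in dollars

print(f"Stripe:   ${stripe_cost:,.0f}")              # $3,000,000
print(f"Fivetran: ${fivetran_cost:,.0f}")            # $10,000
print(f"ratio: {stripe_cost / fivetran_cost:.0f}x")  # 300x
```

Roughly a 300x gap at this volume, under those assumptions.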
We’re seeing a new type of “hybrid analytical” workload at @fivetran. Meanwhile there’s a new type of analytical database—the in-memory column store—that might be a perfect match for it.
@andy_pavlo Congrats to @getdbt, @jthandy, and @drewbanin on the series C! @fivetran is proud to be a member and supporter of the dbt community. dbt is a truly open-source, bottom-up phenomenon that empowers analysts, and that’s a great thing for the world.
My PhD advisor likes to point out that every new technology is analogized to the brain. Hydraulics was like the brain, radio was like the brain, now computers are like the brain.
Step 1: Mankind invents a new technology.
Step 2: "The brain works just like this new technology!" "The universe works just like this new technology!"
Step 3: Go to step 1.
I see the claim that “the log is the database” so often and it’s so, so wrong. It’s the “you only use 10% of your brain” of data infrastructure. Everybody reads “designing data intensive applications” and comes away with this dumb idea.
What does this even mean? You should never take a valuation above 10x revenue? I mean, I’m sure Bill Gurley would love it if founders followed that advice, but I don’t know why we would.
1) Previous "all-time" highs are completely irrelevant. It's not "cheap" because it is down 70%. Forget those prices happened.
2) Valuation multiples are always a hack proxy. Dangerous to use. If you insist, 10X should be considered AMAZING and an upper limit. Over that silly.
A harsh but directionally accurate* list of the things we can still get better at. The good news is we have big progress in the works for long-tail sources, outages, and reconciliation. (*The lost data claim is wrong)
I think this concept will succeed, but all the examples I have seen are extremely simple queries, and I’ve asked ChatGPT for things like net retention and it can’t figure it out. I suspect we need the AI to target a dimensional model rather than SQL.
Ask your data any question using plain English. @getLogicLoop uses AI to generate SQL queries you can run directly on your own unique data schema to find what you’re looking for. Manually writing SQL can be tedious. Business users can use AI to help query and analyze data faster and
@sc13ts I don’t disagree, but I’ve concluded that none of the problems with SQL are quite bad enough to motivate the ecosystem to move to something else. When we’re flying starships to Alpha Centauri, there will be a SQL database in there.
Many ETL tools rely on “visual programming languages,” but I had the good fortune to be a scientist before I was a startup founder and I…saw some things…in LabVIEW. And that’s how @Fivetran never had a clicky-flow-chart programming tool.
Everything moving to the cloud is scary for hardware innovation. How can you bring something new to the market when you have to get adoption by the hyperscalers to reach customers? Thinking of the failure of NVRAM/Optane but presumably applies across many domains.
Hyper might still be the fastest analytical database, despite having been discontinued as an independent product in 2016. Such a shame that it's sitting on the shelf at Salesforce, an open-source Hyper could really shake up the analytics ecosystem.
@muehlbau @mim_djo In the future, all humans will spend the first 50 years of their lives in school, the second 50 years working in either schools or hospitals, and the last 50 years in the hospital. All other needs and wants will be provided by robots.
The CA mask mandate will finally be lifted on Wednesday, even though cases are still higher than the peak of the delta wave. Masks are and always have been about public opinion.
YC has always had a lot of people implying that it sucks.
notably this almost always comes from other investors (who are not thrilled about founders being more empowered and having such a good option)
@MaterializeInc is a super interesting company to watch. They've started with the hardest problem in data warehousing, materialized views, which has never really been solved. Materialized views are useful in and of themselves, but look at their road map:
People who say “I believe in Science” and get mad at @NateSilver538 for having opinions about vaccine distribution don’t understand how science works. It’s not a belief system. Anyone can look at the evidence and form their own opinion about where the truth lies.
We ran Fivetran with that same primary bank account for 10 years. We must have run over a billion dollars through it. SVB is a very special institution and I hope they continue in some form.
300m rows per second is well within the capability of a single modest-sized columnar data warehouse. Ironically, the companies who spend a ton of engineering resources building custom Kafka/Data Lake/Query Engine infrastructure end up with worse results.
This is basically a dashboard of the effectiveness of every government. Everyone else on this list should be asking themselves, why is Israel’s government so much better than ours? Who should I vote for/against to make our government more like Israel’s?
It’s remarkable the war for talent that’s taking place at the intersection of data warehousing and machine learning. According to Econ 101 this will continue until 100% of the economic rent accrues to labor and the median wage for ML software engineers will be $100m / year 💸
@martinkl I apologize for that, I got a little overwrought in the heat wave last night...I have my issues with your work but nobody should ever be compared to Nassim Taleb 😉 You can take satisfaction that I’m getting in big trouble for that comment right now 🙈
We're far enough away from 2020 that we have good estimates of total mortality from @HMDatabase. What do the data tell us? First, in the 25 countries for which we have a complete estimate, mortality increased by about 10%, taking us back to the level of 2008.
I wrote in Forbes Tech Council about why data lakes are dead, and data warehouses/lakehouses are the future:
(I’m also a big fan of actual lake houses, but that’s a subject for another day)
2 years ago I posted about data validation in LocallyOptimistic, @jasonnochlin replied, it led to an acquisition, and now we have 500 customers using the novel SELECT-only sync method he invented. Lesson: I need to spend more time hanging around in data slack communities.
Big news! Fivetran created an SDK that allows third-party vendors to write their own source and destination connectors. With a small amount of code, any application or database provider can enable their customers to centralize data in any destination supported by Fivetran.
Introducing Fivetran SDKs! 🧰 Our new Software Development Kits allow third-party vendors to build their own connectors and destinations. Join the movement with @Convex_dev, @PlanetScale, and @MotherDuck, who have already started creating new possibilities.
Discover more:
Had a great convo with @mattturck about the evolution of the modern data stack, the frontier of putting data into action, and how to use Google Trends to name your startup:
@tayloramurphy Funny how many of these replies call out real time data. I’ve had a lot of conversations with Fivetran users who require very low latency pipelines, because someone in the business demands it, but they acknowledge it’s a kind of vanity metric. 🤷‍♂️
Regardless of the outcome, it is fantastic when we test government policies using randomized trials. We should do this every chance we get. Who cares if the results favor team red or team blue, the results favor TRUTH.
Devastating new results on the effects of state-funded pre-K programs.
In policy you rarely get stronger study designs than random assignment + multi-year longitudinal follow-up.
Yikes.
This is a big deal. This allows non-relational workloads to have a high-bandwidth interface to data stored in Snowflake. It’s an alternative to building a transactional data lake for customers that prefer the simplicity of storing all their data in SF.
Fivetran is an “ETL” tool -- extract, transform, and load. I don’t really know what that means but for us, it copies data from disparate sources into BigQuery (our data warehouse)
@jasonnochlin I wrote a query planner as a hobby project (naturally) and implemented this algorithm in it.
@mraasveldt The YouTube talk on the subject is the key resource; it covers a bunch of edge cases not described in the paper.
We got a lot of great feedback on the benchmark over the last few days, and it’s turned into a bit of a living document. Redshift numbers are still a little weird, I feel like I’m still missing something 🤔.
It’s not just AI; for example, the @SnowflakeDB founders are very immersed in the technical details to this day. Great technical leaders maintain the ability to “zoom in.”
It seems to be no coincidence that some of the strongest leaders in AI who manage large teams frequently do very low-level technical work.
Jeff Dean doing weekly IC (individual contributor) work while managing 3k+ people at Google Research is the canonical example, but I've
My favorite thing about ChatGPT is that it answers the question I asked, instead of giving me a 15-minute description of whatever it’s working on right now.
In the early years, I would cancel our SVB debit cards every year so all our subscriptions would lapse and I could be sure we were only paying for the things we really needed. When I would break the cards off the page they came on, a little ragged tab would always stick to them.
1/ A bit of news: last week I decided to stop working on Mighty after 3.5 years 😓. If anyone is interested in buying the IP, please reach out.
This week our team will begin work on making new kinds of creative tools using advances in AI. A new kind of Adobe Creative Suite.
@jthandy Your data is almost certainly more accurate. Idea! Let's pool our data and publish a combined data warehouse leaderboard. People would love it!
Interest-bearing bank accounts seem like an anachronism. It would make more sense for banks to hold everything in 100% safe liquid form and charge a fee for this service, and for everyone who wants interest to get it from money market funds.
Fivetran's Iceberg data lake implementation is basically a headless cloud data warehouse storage engine. We're looking for a principal engineer to lead development of it. Super cool opportunity for someone who loves DBMS! Open in many locations but here is the CA link:
It’s so funny to me that people try to dunk on Fukuyama. Dude literally named Donald Trump as an example of a megalothymic individual who would be dissatisfied with being a mere rich developer and become a threat to democracy. In 1993.
@0interestrates And surprisingly high performance. We use it to stage data for loading into data warehouses at @fivetran; we’ve tested Parquet, but CSV is faster for real-world datasets, or at least it was a couple years ago.
This is a great talk from the creators of @duckdb. Among other things, it argues persuasively that you can’t just shoehorn ML into your DBMS; the ecosystem is just too big, and you have to figure out a way to interact with what already exists.
Despite universal mask wearing, Japan has case rates similar to the US at the height of omicron. At this point, masking is a superstition, like astrology or chiropractic.
@vboykis Databricks is doing a lot of great work, but this idea that you have to have a data lake in front of your data warehouse is ridiculous. They’re really straw-manning the “just use a SQL data warehouse” point of view.
I had a great conversation with Frank at @Snowflake about how our products are complementary, consumption-based pricing, and where the partnership is going: