We are joined by the brilliant Jacob Matson, Developer Advocate at MotherDuck, to discuss the new frontier of data applications, the evolution of computing, and Brian’s taste in wine. Get nerdy with us and learn about the unlimited potential for customized consumer experiences powered by advanced databases and more tech forecasts – brought to you by a frontline data expert.
Have any questions or comments about the show? Let us know on futurecommerce.com, or reach out to us on Twitter, Facebook, Instagram, or LinkedIn. We love hearing from our listeners!
[00:01:53] Brian: Hello, and welcome to Future Commerce, the podcast at the intersection of culture and commerce. I'm Brian.
[00:01:58] Phillip: I'm Phillip. And, Brian, are you in a cabin in the woods today? Do I need to be worried about where are you?
[00:02:04] Brian: It's my parents' basement.
[00:02:06] Phillip: Oh, sorry. I didn't mean to call you out for podcasting from your parents' basement, although that seems a South Park-esque archetype of a podcast.
[00:02:14] Brian: No. 100%. I'm falling right into the trope right now.
[00:02:19] Phillip: And speaking of tropes, one of the things that we haven't done in a really long time is get a hair technical and a hair futuristic. And I had a conversation with a former podcast guest from six or seven years ago, someone I mix it up with on X slash Twitter all the time, someone who I think of as being really brilliant and who's in a new role. And we had this amazing conversation, and I was like, "Brian, we have to replay this on the podcast." I am so pumped. Cannot wait. So without any further ado, welcome, Jacob Matson, developer advocate these days at MotherDuck, who has done a tour of duty or two at some of the world's most recognizable brands. Glad to have you back on the show. It's been way too long, Jacob.
[00:03:06] Jacob: Yeah. It has been too long. Thank you, Phil. Thanks, Brian. It's good to be here, and it's always fun to, like, nerd it up a little bit, but also make it a little bit practical and not just talk about, you know, tech for the sake of tech. So I'm really excited about this.
[00:03:21] Phillip: Yeah. And I think that we will talk a little bit about that. Tell us about the new gig real quick.
[00:03:26] Jacob: Yeah. So I work at a company called MotherDuck, which is building a set of cloud services on top of the open source database, DuckDB. And so I get the pleasure of, you know, helping to reach out to our developer community and work on things like documentation, but also give talks and make sure that our customers are getting what they need out of all the things that we're building. And so it's a lot of fun.
[00:03:52] Phillip: And you're sharing a lot of that openly on social media, I'm sure LinkedIn, Twitter, that sort of thing. But it looks like the tooling that you guys have at MotherDuck is really next generation. Seems like it's really... and I'm speaking as a person who spent a decade of my career writing SQL every day.
[00:04:13] Jacob: Sure. Yeah.
[00:04:15] Phillip: It really looks like you guys are onto something. So congratulations on the new role. But, like, what's the differentiator, and what does this do for those who aren't initiated?
[00:04:24] Jacob: Yeah. It's a good question. So we are building what we kind of think about as, like, an aggressively serverless single node database. Well, I kinda have to back up. There's a little bit of a history lesson here.
[00:04:38] Brian: Alright. {laughter}
[00:04:39] Jacob: We gotta go back in time. If you look back around the time when we really started talking about big data, it was 2004, 2005 ish, and Google, you know, was running into lots of really challenging computer science problems around handling all of their data. And so we kind of built software in that vein, making a bunch of assumptions about things that were true back then. The main assumption we made was that, okay, computers are gonna keep getting faster. And the reality of what happened is they stopped getting faster. In fact, the CPU inside your MacBook today is about as fast as a CPU from 2005. The difference is you have 14 or 16 cores on your computer instead of one. That's the first thing that happened: the architecture changed away from single core, very fast, to flat speed but many cores. The next thing that happened is storage just got way, way cheaper. Just to give you a sense of what storage looked like in 2005, it was about $2 a gigabyte, and it's now about a cent per gigabyte. And if the folks that I know at AWS are to be believed, the S3 team achieves much cheaper than that. We're seeing massive decreases in cost. And then the last part is we've also had a lot of innovation around... you know, I don't know if you guys remember this, but back then, our hard drives had spinning disks in them. They were pretty slow. We had to defragment them. I don't know if you remember that.
[00:06:25] Phillip: Oh, I do. I love that.
[00:06:27] Brian: Of course.
[00:06:27] Phillip: I miss that actually.
[00:06:28] Jacob: Yeah. Yeah. So that's all gone away. We have these solid state drives, and now we have these next generation NVMe drives that are very fast, I think about 70 times faster than a spinning disk for sequential reads. So you combine all of those things together, and what you end up with is that the technology built by Google and others didn't really anticipate that these things were going to occur. And so you can apply a totally new paradigm and a new lens for building software, and that's what the folks at DuckDB recognized circa 2016. Around then, there was this guy, Hannes, working on his PhD at CWI in the Netherlands, working on something he called MonetDBLite. It was kind of like a single node, in process version of MonetDB, and that eventually turned into what is now DuckDB, which is a single node, meaning it runs on one machine, but massively parallel database. So it takes a single query, breaks it into a bunch of little parts, and then jams it through your CPU as fast as it can. And that's very different from the historical, massively parallel processing architectures of BigQuery and Spark and, you know, Snowflake, etcetera, which are basically designed to run at very high scale but across multiple machines. And so if we take that paradigm, it means that basically we can run analysis on hundreds of gigabytes very quickly on our laptop, which is kind of a brand new thing. And so what MotherDuck is doing is saying, "Okay, we'll take that. We'll let you use the compute available on your laptop or in the cloud, like an EC2 instance, but also we have our own kind of cloud version of DuckDB that you can run for even bigger compute." How much RAM and how many cores do you think is on the biggest single node that you can buy on EC2 today?
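The single-query parallelism Jacob describes can be sketched in miniature: split one aggregation into chunks, fan the chunks out across local workers, and merge the partial results. This is a toy illustration of the idea, not DuckDB's actual engine (which uses native threads over columnar data), and the function name is made up.

```python
# Toy sketch of single-node parallelism: one aggregation split into
# chunks, fanned out to local workers, then merged in a final step.
# (Threads stand in for the per-core workers a real engine would use.)
from concurrent.futures import ThreadPoolExecutor
import os

def parallel_sum(values, workers=None):
    workers = workers or os.cpu_count() or 4
    # Split the "query" into one slice per worker...
    size = max(1, len(values) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # ...each worker aggregates its slice independently...
        partials = pool.map(sum, chunks)
    # ...then merge the partial results, like a final aggregation node.
    return sum(partials)

print(parallel_sum(range(1_000_000)))  # same answer as sum(range(1_000_000))
```

The same shape (partition, aggregate locally, merge) is what lets one machine saturate all of its cores on a single query.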
[00:08:29] Phillip: Okay. I'm gonna say, like, it's probably a giga core. Are there 1,000 core instances? Does that exist? Did I overshoot it?
[00:08:36] Jacob: There are. Okay, so the biggest number of cores is 896 vCPUs.
[00:08:45] Phillip: That's so many. And RAM has to be crazy.
[00:08:48] Brian: That's crazy.
[00:08:48] Jacob: It's 32 terabytes of RAM.
[00:08:50] Phillip: Oh no!
[00:08:51] Brian: Oh my gosh.
[00:08:53] Jacob: Now to be fair, these were built to run SAP HANA, of course.
[00:08:58] Phillip: {great laughter}
[00:09:01] Brian: That's not lost on me.
[00:09:03] Phillip: That's a really funny joke. The kids won't understand it, but I think that's freaking funny.
[00:09:08] Brian: AS/400. If you'd said that, I wouldn't have been surprised.
[00:09:11] Jacob: I don't know if those are publicly available. You might have to, like, reserve them three years in advance or something, but you can see the price list at least on EC2. Now DuckDB doesn't scale to nodes that large yet. What I mean is you can, of course, run it, but you really don't see linear scaling after maybe 64 or 128 cores. I don't know the exact breakpoint right now.
[00:09:37] Phillip: Let me just jump in real quick because I wanna contextualize it for me because I'm..
[00:09:40] Jacob: Yeah. Yeah.
[00:09:41] Phillip: You know, I my knowledge gap starts around, like, 2010, 2011.
[00:09:46] Jacob: Sure.
[00:09:49] Phillip: In that intervening time... so I did start doing Internet scale things around 2005. I think the timing is really interesting. I believe there were a lot of folks at these sort of, like, digg.coms and reddit.coms that were trying to solve these, like, huge spikes in throughput. And also with Web2 it was like, "Oh, we're talking back." People are voting things up. There's a lot of parallelization. There's, like, all these issues there. We were, like, sharding databases. So at some point, we had this DIY mentality where we had a software layer that managed where data went, and then we grew up from there. And it became, let's just have a really large cluster that has a management layer that manages and balances where data goes. And now you're saying that we're kind of at the maxima of that. Now it's to a hilarious degree. And part of the reason that we have such exaggerated size and scale of cloud compute is because of the inefficiencies of the engines that run them and the software layers that manage how data is distributed either locally or globally. Is that correct, or am I oversimplifying?
[00:11:05] Jacob: I think that's fairly... I think that's on the right track. I think there's still really good reasons to have multiple nodes running behind a load balancer and things like that, especially if you think about it in the context of, like, failover and high availability. A single node is still a single node. So you can't just discard, you know, all of that really great work that a lot of folks have done to make things work across multiple nodes. But what it means is that for, especially, analytical workloads that tend to be, I don't wanna say unimportant, but maybe not on the critical path... If you can't make a recommendation for a product on an ecommerce website for five minutes, no one cares. It's like, okay, maybe that's bad if you're Amazon, but for most people running things on Shopify, for example, it's like, okay, my analytics piece is a nice thing, and I really only need it every once in a while, and maybe I'm using it to generate recommendations, for example. But it's not running at the same kind of SLAs that you need for a production service. So anyways, you put all of those things together, and what it means is we can build something that performs better and is actually cheaper and can take advantage of your local compute. I mean, there's so much compute available around us. One of my favorite things is, I remember as someone trying to learn how to manipulate larger and larger pieces of data, it was like you had to get access to a server. Someone had to give you credentials. Now it's as simple as going to shell.duckdb.org and you have a full SQL client in your browser. That's new, and it means that it's easier to learn and those gatekeepers are not there. I think there's a lot of really positive things that are downstream of that that we can get into. But that's really what's changing the game: all of those pieces coming together.
So anyway, there's the history lesson that kinda leads us here: all these things happened. Software was built making a certain set of assumptions. The assumptions changed. Now we need to build new software, and one of the things taking advantage of that is DuckDB. And so I'll talk a little bit about DuckDB, which is an embedded, in process database built for analytics. Really, what that means is it operates on things in columns instead of rows. There's a whole bunch of really technical things we could get into here. I'll spare the audience. There's a whole bunch of really good writing and talks about that over on duckdb.org. But, anyways, it's an open source project that came out of the CWI team in the Netherlands and now has its own thing, run by DuckDB Labs. And so, yeah, we're building a cloud service on that. MotherDuck is a separate entity, and we're building a cloud data warehouse to take advantage of all of that awesome tech that they've built and really make it production ready for customers.
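The columns-instead-of-rows point can be made concrete with a tiny sketch. This is a simplified illustration of the storage-layout idea only, not DuckDB's implementation, and the table and field names are invented: an aggregate over a column-oriented layout touches just the one array it needs, instead of visiting every record whole.

```python
# Toy comparison of row-oriented vs column-oriented layouts for an
# analytical aggregate. (Illustrative only; invented sample data.)
rows = [
    {"sku": "A1", "region": "US", "revenue": 120.0},
    {"sku": "B2", "region": "EU", "revenue": 75.5},
    {"sku": "C3", "region": "US", "revenue": 42.25},
]

# Row-oriented: every record is visited whole, even for one field.
row_total = sum(r["revenue"] for r in rows)

# Column-oriented: the same table pivoted so each column is one
# contiguous array; the aggregate scans just that array.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_total = sum(columns["revenue"])

print(row_total, col_total)  # same answer, different access pattern
```

At real scale the columnar layout also compresses better and keeps the CPU fed with contiguous data, which is a large part of why analytics engines adopt it.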
[00:14:21] Phillip: Is this like a commercial services and enterprise servicing arm to an open source product then?
[00:14:30] Jacob: The open source project is run by its own entity, which is DuckDB Labs, and we have a relationship with them. Obviously, we pay them. You can go on their website and see who else pays them. They have it listed. But we are not the only folks who are helping financially support the DuckDB Labs folks. No, they have a bunch of really great enterprise contracts that they're running on their site.
[00:17:29] Brian: This is exciting. So open source has been near and dear to Phillip and me for years and years. And a lot of the way we think about how commerce is shifting comes from some of that mindset and what we saw come out of it. And, obviously, we've seen a very sharp turn away from the open source mindset in the world of commerce tech in particular. Lots of roll ups in SaaS, lots of rented space, lots of... you know all this. I'm preaching to the choir, but we've been saying for the past couple years, a lot of this stuff is getting more and more expensive. You're having to add layers and layers and layers on. It's getting harder to control. Getting to the outcome that you're looking to achieve is just getting harder and harder. And not just from a database perspective; at every layer of the process it's getting harder to have the outcomes you wanna have, and the financial stack is just getting larger and larger, and you don't own any of it.
[00:18:32] Jacob: Correct. Yeah.
[00:18:34] Brian: I'm curious. I mean, obviously, we're talking to a very commerce focused audience. This sounds hard to the current generation of people coming into the commerce world. There's a whole new crop of people that are building brands on top of SaaS software, and they listen to you talk, Jacob, and they're like, "I don't think I can do that."
[00:18:59] Phillip: Yeah, like, "Why do I even need to know any of that?" Well, I think that we can cover that gap, because there's something happening in the world that I think we both see. But there is a really interesting point that he's making there, Jacob, about the distance between what was an ecommerce centered organization that was IT aligned a decade ago, and the one today that was raised and born and bred in SaaS services and doesn't understand the underpinnings and couldn't rebuild it themselves if they had to. Do you have anything parallel to that in your ecosystems, on the data science or data analysis side?
[00:19:35] Jacob: I mean, you know, I think there is a pretty flourishing data ecosystem built around open source tools. There's the Posit folks who are building Positron (previously that was RStudio), as one example. There's a ton of really awesome stuff happening in the Python ecosystem, things like Jupyter Notebooks and others. You're asking an interesting question, which is: why should I care about this stuff as a brand? And the answer is, I think, sort of complicated, but I'll talk about the previous company that I worked at, which was this IoT app platform, or kind of data platform, called Symmetric. There was a lot of stuff that we could have just bought off the shelf and bolted together to build what we were trying to build. But our position, as folks who came from the financial world, was: we want to build this ourselves, because we know that if we use these other components, the tide will not always be as high as it is right now. And we're gonna see intense margin pressure and competition on it. And when that happens, not owning your stack is a big disadvantage. I'm not saying that's gonna happen tomorrow, and economically things have been a little bit weird over the last couple years, but I think the era of zero interest rates is over. So what does it mean for how you build? And I think the answer is, well, you should consider using more open source software and less off the shelf SaaS. There's two reasons for that. The first reason is you wanna build a better experience than your competitors, and using technology that is not just canned opens up some doors. The second thing is, as it turns out, large language models, in particular things like ChatGPT, are really good at writing glue code.
So especially with more ubiquitous platforms, like, let's say, Python or SQL, they're really good for being able to say, "I have this problem. How do I solve it using Python?" for example. And you can get an answer that is pretty good. Is it perfect? No. You need to understand a little bit of how computers work, but you don't need to be, you know, a 10x engineer. That's the other thing that's really interesting: you can glue this stuff together. And because we have so much training data available for this open source software, it's a really big advantage versus using a closed platform. Just as an example, part of the reason why, in the database space, we're seeing things like Postgres win is because all the documentation is out in the open, and that means it goes into these large language models, and same with the code base. I know the counterexample is probably WordPress. Who knows what's going on there? But in that ecosystem, there's a plugin for literally everything.
[00:23:15] Phillip: Yeah.
[00:23:15] Jacob: It's almost all open source. LLMs understand it really well. The future of it is a little uncertain because of what's happening with the foundation, for sure, but, like, there's a huge advantage to that. And there's a reason it powered so much of the web, and, you know, hopefully they sort out whatever's going on and are able to continue building there. But, yeah, that's really what I think about: who are you competing with? How do you differentiate your experiences? If everyone's buying the same canned software, great, we're all just gonna converge on the same experience for everything. That sounds crappy. I don't want that.
[00:23:54] Phillip: It's funny, because there was a decade that I remember in the open source led ecommerce development arena where you chose software because of its extensibility, its customizability, and the community that was thriving around it and the way that it led with openness. And it's really interesting how what we have now is a more nuclear approach to solving certain problems, all of them transactional and commercial in nature. The integration effort is way, way easier. You just click a button and it just works. But that's because they created a walled garden around the hard things that were hard to customize to begin with. And so they stopped allowing customization in the areas that most people wanted to customize, and we just kind of accepted it. So businesses bent to software, where software was supposed to bend around the business.
[00:24:58] Jacob: Yep.
[00:24:59] Phillip: And, for better or worse, now you see this happening in B2B. Every company sort of had its own way of doing business. Every company had a really customized pricing scheme, a way that they managed orders and order flows with their customers. Well, software doesn't support that. It's too expensive to create software customization to support that. So instead, we're just gonna change the pricing in our business to fit what the software gives us. And I sense that ecommerce is just one area of the universe where we understand this. I believe it's true everywhere. It happened in document management. It's happened in online teaching and instruction. It happens in the classroom with my kids. We are shaping everything around the limitations of the software, because we have platonic forms, almost, that just seem to be the only way we can think of things. That's the only way to do them.
[00:26:04] Jacob: Yeah. Sure. Sure. That's really interesting. I mean, as I think about where we are today... if we go back to, like, 2005, if you were running a medium sized ecommerce business, let's say you were doing, I don't know, a $100,000 a week or something, you needed an army of servers. I don't know if you guys remember Rackspace. It was like, get all your shit in racks at Rackspace and, like, make sure it works.
[00:26:38] Phillip: I remember.
[00:26:40] Jacob: Yeah. Yeah.
[00:26:40] Phillip: Probably my the company I worked for last is probably still spending a bunch of money on r710s, like, Rack somewhere.
[00:26:48] Jacob: That's pretty funny. The point, though, is that you can run that entire website on a Mac Studio now. That's new. You could probably even run it on one of the new M4 Mac Minis. I need to buy one. Those are sweet.
[00:27:03] Brian: Run it on a MacBook Air, maybe.
[00:27:05] Jacob: Yeah. Maybe. Maybe. I don't know. But the point is that, like, we have all this power available that we didn't have 20 years ago. And we're starting to see people take advantage of the fact that you can run things in a much smaller, tighter way. But that only occurred because of all the things that happened in the past. We truly stand on the shoulders of giants. So we get to just sit on the pier and reflect and say, "Oh, wait, hang on. If all this stuff exists now, how do I package it in a way that makes the technology work for me instead of against me?" And I think that's the really cool part. And if we can capture that notion, what it really means, to make it practical for ecommerce, is that stuff becomes more human and less "Well, the computer lets me do this, so I do it this way."
[00:29:06] Brian: Yeah. To sum it up, maybe: it's not that we were necessarily doing things the wrong way in the past. It's that we came up against the limitations of the technology that was available to us, whether it be cost or speed. Right?
[00:29:23] Jacob: Yup. 100%. Totally agree with that. It's not that we were doing things the wrong way. It's that things have changed, and a lot of things had to happen for us to get to the point where we are. The fact that you can buy a 36 terabyte hard drive, a single disk... if you had known in 2005 that you'd be able to do that, you probably would have made different decisions, but no one knew that. It's hard to see the future. But we're here now, and I think what's really cool is, I saw a talk at a conference a couple of months ago about this notion that our hardware is now scaling faster than we can create data. And that's really the point of DuckDB and this new wave of software that will come, which is: what if I can remove a bunch of the complexity? That doesn't mean I won't run it in the cloud. There's lots of advantages to running things in the cloud; there's a whole bunch of really good abstractions there that I don't have to think about. But what it means is I maybe don't need as many nodes, or I can think about things in a much simpler way. People have talked about what will be the first single person $1,000,000,000 company.
[00:30:46] Phillip: Yeah.
[00:30:47] Jacob: All of this fits into that, which is like, well, now we can reason about our business problems in ways that are much more comprehensible, because our hardware is now so much simpler.
[00:30:59] Brian: Oddly enough, we're not thinking big enough. I think we came up with a lot of effectively workarounds to get to where we wanted to go. It was: we have to add these layers of software, compilers, and all kinds of things; we have to add all of this cloud infrastructure and all of these additional managed services around it. That's effectively what SaaS software is, basically. All those things were just ways to get around the fact that we hit limitations.
[00:31:32] Jacob: Yeah. Yeah. Totally. Right. I would argue that a lot of those limitations were that we had too much data for a single machine, whether to process or to store. And those constraints are not behind us yet, but they are going to be behind us soon.
[00:31:50] Brian: Right. And even in an early access article you handed to us, you were talking about how people would put up with so much pain with Excel because it was such a powerful tool. Back in the day, we had so much data. We were like, "Okay, we can put as much data as we can get into this thing, and we'll let this function run for five minutes." I literally remember this in the accounting department at my first job. They were like, "We've got these crazy Excel formulas, but they're the best way to do this, so we're willing to put up with a five minute refresh," and now that exact same formula would run in a millisecond.
[00:32:30] Jacob: I remember it being a celebratory moment when we got to, like, a million rows in our Excel spreadsheets. And then anyone who actually tried to put a million rows in their spreadsheet... it was horrible. It didn't work. It crashed all the time. You had weird workarounds. And that's still broadly true. I think we accepted a lot of that reality without realizing it's because the software was built for a very specific way of thinking about the problem set, and we can think about it differently now because the architecture has changed. I'm still waiting for Spreadsheets 2.0.
[00:33:07] Brian: But it is interesting. I remember going through ecommerce integrations back in the day, back in the open source world. And Phillip, you talked about how now it's just, like, click click and all the hard stuff is done. Right? But I remember we were at a point where APIs were so complex and so buggy and had so many problems that we were like, "We would rather just do a flat file integration and run this as a CSV, because of the transparency and the benefits of having that data available to us, and it's way easier to deal with and diagnose." I think there's something to that even now. You were talking about WordPress: it's all open and available for indexing, and our AI tools can understand it better. Where do you think AI fits into this? Because I actually think that's part of the unlock coming up here for open source.
[00:34:02] Jacob: Yeah. That's really interesting. I mean, as an open source, you know, maintainer and creator and documenter, I think certain things need to be top of mind, which is basically: the way you get distribution is you need to make sure your data is included in the training dataset. And you need to make sure the LLM can understand it and turn it into something that humans can prompt and get back the right answer. So that's a totally new paradigm that we're opening up into, which is very interesting. I think the other thing is a lot of the AI stuff has been very focused on this chat interface, which is kinda one at a time, you know, it's writing code, it's doing things like that. But I think, and we wrote about this on the MotherDuck blog as well, simply having good abstractions for prompting functions in SQL is actually very powerful. And there's two reasons for that. The first one is SQL is like a cockroach. There's even a database called CockroachDB. Andy Pavlo at CMU wrote a really interesting paper with Michael Stonebraker recently, revisiting all this stuff that people have worked on over the last 20 years that inevitably made its way back to SQL. And I think we are going to see that become more and more true. And the other thing is, because we have a 50 year corpus, meaning we've got 50 years of training data for SQL, it also means that from an AI perspective, I would bet heavily on SQL being the connective tissue between your business user, your data analyst, your data scientist, your data engineer, and your software engineer. In the past, each of those groups would buy tools that operated in different languages and at different levels of abstraction.
And I think because of AI, enabled by AI, we now potentially have a future where there can be this lingua franca shared between those users, which could be SQL, right, which is really just CSV generation, to your prior point, Brian. You're generating tabular data that can be reasoned about easily. There are lots of bad things about SQL. It's written in the wrong order. The syntax is very verbose. The query planners are not deterministic. There's a host of objections folks have with it. But I think we can deal with those trade offs if it's also the way that someone in the business can easily communicate the thing that they need. I will tell you that REST APIs are the wrong abstraction. And you mentioned that earlier also, Brian, which is like, "Hey, we should just integrate this stuff with the REST APIs." And it's like, "Good luck."
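The "prompting functions in SQL" idea above can be sketched as a scalar function applied per row of a table. In this sketch, Python's stdlib sqlite3 stands in for DuckDB, and a stub stands in for the model call; the prompt() function name and the table are illustrative, not MotherDuck's actual API.

```python
# Sketch: exposing an "LLM call" as a scalar SQL function so a prompt
# runs once per row. The model is stubbed; a real setup would call a
# hosted model API. (sqlite3 stands in for DuckDB here.)
import sqlite3

def fake_llm(prompt_text):
    # Stub: a real implementation would send prompt_text to a model.
    return f"review of {prompt_text}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wines (name TEXT, varietal TEXT)")
conn.executemany("INSERT INTO wines VALUES (?, ?)",
                 [("Ridge Monte Bello", "Cabernet Sauvignon"),
                  ("Domaine Leflaive", "Chardonnay")])

# Register the stub as a one-argument SQL function called prompt().
conn.create_function("prompt", 1, fake_llm)

# Now every row can be "prompted" directly from SQL.
for name, review in conn.execute(
        "SELECT name, prompt(name || ' (' || varietal || ')') FROM wines"):
    print(name, "->", review)
```

The appeal is that the per-row fan-out, batching, and result handling all stay inside the database; the business user just writes a SELECT.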
[00:37:07] Phillip: Oh, for sure. Yeah.
[00:37:10] Jacob: So if we can kind of think about things as tables, actually, that turns out maybe it might be a better way to move forward.
[00:37:19] Phillip: I think it's interesting that you call SQL the lingua franca. It is Turing complete, so there's that, but I had a sense that we would see emergent, low level languages come out of LLMs, where there's a new way of speaking to hardware that would recapture some of this lost compute. Something I wrote about a few years ago was: why do we need the human abstractions? It almost sounds like there are other ways to get at that lost compute without having to go there; like, that's not necessarily the solution. Maybe that's something that does happen in the future, but that's not the thing that recaptures it. There are other modes of thinking and building new solutions, in this case, like column based versus row based or what have you. I'm definitely out of my depth, but I do sense that there's a thing that we could have with LLMs that allows us to get at the solution that we're after without having to go through the layers of human abstraction. And I guess SQL is pretty low level. If you're talking directly to the data, it is getting past all of the abstractions, because you're not having to go through layers of software. I don't know.
[00:38:44] Jacob: I mean, technically I don't agree with that.
[00:38:49] Phillip: Okay.
[00:38:49] Jacob: But from a human perspective, I agree. Part of the challenge with SQL is that it's written for humans to reason about data. The folks at DuckDB have done the hard work of saying, "Okay. SQL seems like a pretty good abstraction, and we're gonna write a totally new thing in C++, which is, you know, a very low level abstraction, and then give you all these APIs that you can interact with, one of which is SQL." There's a Python API. There's all this other stuff too. So it's not just SQL, but as it turns out, maybe the abstraction we're talking about is more tables than it is SQL. SQL is just a way to define tables, and we can reason about lots of things in tables. And it seems to be one that we can use. I'm sure certain people would say, well, data frames are far superior for reasoning about tables, and they're probably not wrong, but there's not 50 years of history with those. So you guys were talking about integrating software earlier, all these parts. Really, the thing you want is a table of a certain shape. But expressing it is the hard part. It's like, "Okay. Well, to express the thing that I need, I need to get this data from Salesforce and this data from Shopify and this data from my internal app and this data from here, and now I can get the table that I need to make the decision I need." And I think the really interesting thing that's happening, particularly with DuckDB, is that it's making all of those things easier, because it has a really nice SQL dialect, but also it's really fast, and it's fast to the point that when software engineers use it, they don't hate it, which is like a record breaking thing for SQL.
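Jacob's "the thing you want is a table of a certain shape" point can be sketched with a small join: two sources combined into the one table a decision needs. The table and column names are made up, and sqlite3 stands in for DuckDB to keep the sketch self-contained.

```python
# Sketch: expressing the table you actually want (revenue per customer
# segment) from two separate sources. (Invented schema; sqlite3 stands
# in for DuckDB.)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shopify_orders (customer_id INTEGER, total REAL);
    CREATE TABLE crm_accounts (customer_id INTEGER, segment TEXT);
    INSERT INTO shopify_orders VALUES (1, 40.0), (1, 60.0), (2, 15.0);
    INSERT INTO crm_accounts VALUES (1, 'vip'), (2, 'new');
""")

# The "shape" we want: one row per segment, with its total revenue.
query = """
    SELECT c.segment, SUM(o.total) AS revenue
    FROM shopify_orders AS o
    JOIN crm_accounts AS c USING (customer_id)
    GROUP BY c.segment
    ORDER BY revenue DESC
"""
for segment, revenue in conn.execute(query):
    print(segment, revenue)
```

The hard part in practice is rarely the SQL; it's landing the Salesforce, Shopify, and internal-app data as queryable tables in the first place, which is exactly the friction DuckDB-style direct file querying reduces.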
[00:40:36] Phillip: {laughter} You did some live demo stuff with me before this. The kinds of things that you do in those demos are kind of incredible. The sets of data that you can just grab, ingest, and get right to work on. You don't need to write software to, you know, bake embeddings and start prompting against the dataset. Talk me through some of that in a practical use case, maybe the wine based one that you did with me.
[00:41:06] Jacob: So my buddy sent me a list of about 2,500 wines, just a list of SKUs, you know, region, wine name, year, varietal, just technical specifications for some wine.
[00:41:21] Brian: You had ratings as well?
[00:41:22] Jacob: I did have Wine Advocate ratings and, like, Vinous, I think, and maybe one more.
[00:41:25] Brian: This is great. I'm all in on this. So I need this.
[00:41:28] Jacob: That's all the information I had. So if my wife is like, "Oh, I love a Cab Sauv, like a darker wine," I'm like, "Oh, well, I've got some California wines." I can kind of squint and get there, but I was like, "Oh, hang on. This is actually an opportunity for me to use a little bit of large language model stuff." So I took that dataset and loaded it into my database, which is literally using DuckDB. It's as simple as doing select star from your file name, and now it's in the database, which for those of you who have had the pleasure of working with things like MySQL or SQL Server, etcetera, it's never that easy. But it is with DuckDB now, which is amazing. So we get the data in. We do a little bit of cleaning. The case sizes are variable depending on how big the bottle is. Some of them are half size, some of them are double, those are magnums or whatever. But I'm like, "Okay. I wanna see if I can take this wine data and generate an embedding," which is just a vector that kind of describes that text. So I did that, and I got a recommendation. And then I was able to say, "Okay. Show me, like, dark fruity wines." And it worked okay. That was the first step. You can do that all within MotherDuck SQL. We have a kind of LLM extension function. So that was the first thing I did, and I got some recommendations, and they were okay. But then I thought about it a little more deeply, and I said, "How do I make this recommendation better?" I wrote a prompt that said, "Pretend you're writing a wine review for Wine Advocate. Describe the wine that has these characteristics, and do it in less than 500 characters," or something. And so I was able to run that against these 2,500 wines. It took about 10 seconds to run. So it ran tons of prompts really quickly in parallel. Great way to spend money on MotherDuck, by the way.
We do have controls, so you can't spend your entire budget immediately. But it's a good way to make money for us. Anyways, you run this through. You do a prompt, and then you do an embedding on the prompt, which is basically a description of the wine, kind of Wine Advocate esque text. And so I was able to use that to drive the recommendation instead of using just a very basic description. And with that enriched description, I was actually able to get a few excellent recommendations and place an order that made sense and was coherent. And this took me probably 30 minutes of work to effectively research, "Hey, what's the best dark fruity Cab Sauv from California?" That's the input that I gave it. What's the best one out of this list of 2,500 wines? To do that manually would have taken me, I don't know, a few days probably. So now I have this kind of way to do this. And obviously that's one where precision and accuracy are pretty soft. I'm sure that a great human could probably beat the performance.
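The recommendation step Jacob describes boils down to cosine similarity between a query embedding and the precomputed embeddings of the enriched wine descriptions. Here is a minimal sketch in plain Python, with tiny hand-made three-dimensional vectors standing in for real model output (the wine names and every number are hypothetical; in MotherDuck this comparison would happen in SQL using its embedding functionality):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical precomputed embeddings of LLM-enriched wine descriptions.
# A real embedding model would produce hundreds of dimensions.
wine_embeddings = {
    "Napa Cab, dark fruit, oak": [0.9, 0.8, 0.1],
    "Loire Sauvignon Blanc, citrus": [0.1, 0.2, 0.9],
    "Sonoma Zinfandel, jammy": [0.8, 0.7, 0.2],
}

# Embedding for the query "dark fruity Cab Sauv from California"
# (made up to point toward the dark-fruit wines).
query = [0.9, 0.8, 0.1]

# Recommend the wine whose description vector is closest to the query.
best = max(wine_embeddings, key=lambda name: cosine_similarity(query, wine_embeddings[name]))
print(best)  # → Napa Cab, dark fruit, oak
```

Because the description vectors are computed once up front, answering a new query is just this cheap max-over-similarities scan, which is why a lookup can come back in well under a second.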
[00:44:45] Brian: I have looked through tens of thousands of bottles of wine in the past two years. That's not an exaggeration. Bottle by bottle, and done research on ones that I can identify quickly just based on my knowledge. But the thing is oh my gosh. Can I steal that from you? I'll pay.
[00:45:04] Jacob: I would love to. I'll send it to you.
[00:45:07] Brian: We can do a case study on this, and I will be a power user. Listen. Let's run this against Vivino. I'm sure we can get access to Vivino ratings. And Wine Spectator and Wine Enthusiast, which, by the way, is the best one to go with. They still do blind tastings, just FYI. But, man, you've actually inspired a lot of other thoughts here. Let's set wine aside for a minute, because I'm pretty stoked, and you're speaking my language, so it made it way easier to think about how things can get applied here. But does this mean that, in reality, data becomes almost flat in general? And hear me out on this. If all data is open and set in these tables that are crawlable by LLMs, it becomes the training data. Right? And it's all available, all out there, and maybe you'd have to pay to get access to certain components of that data. And AI can actually extrapolate on that data, which can then be made public to all of the other potential use cases for that data. And does this change the nature of the data process... I'm thinking about shopping, and wine buying is one form of that. But with any kind of shopping, there's so much really bad data out there and really underdeveloped data and also so much missing data. And so my brain is just twisting and turning right now about how this changes our relationship with data in general.
[00:46:46] Jacob: I think you're hitting on something, which is basically, if you think about things in the kind of way I just expressed it, an LLM is just a database. Right? And that means you can interact with it with database primitives, which means all of a sudden, instead of getting the last four years of whatever prompting we've learned to do, we get 50 years of SQL knowledge that we get to use too. {laughter} I think one of the things that's really cool about what I just walked you through is I've also worked with my buddy, Nate, and he was working on a classifier using the Snowflake Cortex functionality, which is very similar to what I walked through in MotherDuck. And the hard part is not getting an answer. The hard part is getting a right answer. And so I think the next step here is really the thing that AI and LLM researchers have probably been talking about for a few years, which is they see what you can do with these prompt machines. And if you treat them like a database, you can do a lot of really cool things. But the accuracy rate in the base case today is probably, I don't know, 60 to 80% on a lot of this stuff. So how do you improve that? And what loops do you need to build, and what affordances need to exist? The good news is we're gonna continue to build these out in MotherDuck, and, for example, I wrote a judge macro for a classifier. So I said, "Alright. Given this list of companies, which of these 10 industries does each of these companies fit into?" You can use one LLM to classify it, for example, and then use a second one to say, "Is this right?" Tuning those types of things and then building them in loops is one example of a pattern you can do. That's possible for someone who's not an LLM researcher.
Again, because now we can do it using SQL primitives, which are available to people more broadly. And so that means we can do a lot of these things at a, I guess, higher level of abstraction.
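The classify-then-judge loop Jacob describes can be sketched in a few lines of plain Python. The two "LLM" calls below are stubs with made-up company names and labels; in MotherDuck these would be prompt calls expressed in SQL, and Jacob's actual judge was a SQL macro, not this code:

```python
INDUSTRIES = ["Software", "Retail", "Finance"]

def classify_llm(company):
    # Stub classifier: a real version would prompt an LLM with the
    # company name and the list of candidate industries.
    guesses = {"Shopify": "Retail", "Stripe": "Finance", "MotherDuck": "Software"}
    return guesses.get(company, "Software")

def judge_llm(company, label):
    # Stub judge: a second model checks the first model's answer.
    # A real judge would reason about the company/label pairing.
    return label in INDUSTRIES

def classify_with_judge(companies, max_retries=2):
    """Classify each company, letting the judge reject and retry answers."""
    results = {}
    for company in companies:
        label = classify_llm(company)
        for _ in range(max_retries):
            if judge_llm(company, label):
                break
            label = classify_llm(company)  # retry on rejection
        results[company] = label
    return results

print(classify_with_judge(["Shopify", "Stripe", "MotherDuck"]))
```

The pattern is the point, not the stubs: one model proposes, a second model verifies, and the loop retries until the judge accepts or the retry budget runs out, which is one way to push past that 60 to 80% base-case accuracy.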
[00:48:53] Phillip: And o1 does some of that reasoning and looping on checking itself to some degree too, but it's expensive. And I think that in your example, it's so cheap. It's trivial to hit smaller, faster, more performant models to do the kind of thing you're doing, especially when it's not, like, critical...
[00:49:17] Jacob: Yeah. Correct. I mean, the thing that I didn't actually mention in my SQL wine example is that I could type in a string of text and get a recommendation back in half a second. So what that practically means is there's a lot of recommendation stuff built around this that is precalculated and precomputed that now you can just do on the fly, for example. Being able to take a workflow that historically ran on a bunch of big machines, simplify it down to running on one, and then have clever integrations and abstractions so that you can do things on the fly, I think, is really cool. It also means that you get a human in the loop. For example, I would expect Brian can write a way better prompt to get a description of a wine bottle than I can. And so imagine that you could now have an ecommerce shop that encodes all of the knowledge that you have about wine. Effectively, when I'm using that shop, I'm getting access to Brian's brain. That's really interesting. Especially because I think, and you guys have talked about this too, what really matters today when you're building things, whether it's a brand or software or whatever, comes down to taste. And so if you have taste, how do you turn that into an experience that you can share with others and make a little bit of money with? There's something really interesting to say there, which is like, "Hey, I don't need to hire a huge team to build this thing. I can actually encode the Brian sommelier bot." And now all of a sudden, all of this knowledge is available and other people can use it. Now are we there today? Maybe not. But we're getting closer every day to really, really interesting experiences that can only be driven by this augmentation of humans with AI.
[00:51:27] Brian: Yeah. And I think there are really simple expressions of this to start. Back to the sites: some websites don't have great data about their products, especially discounting sites or sites that are getting things not just from a wholesale business, but three steps down the supply chain.
[00:51:48] Jacob: Yeah. Correct.
[00:51:49] Brian: And so whatever data they have, it's inherited, maybe scraped. The data that could be distributed around those products, or whatever's being displayed, or whatever people are getting information on... I feel like there's a confidence level that LLMs could start to populate that with, as long as people understood that it wasn't necessarily the truth, but a perspective on the truth.
[00:52:19] Jacob: Sure.
[00:52:21] Brian: I actually think that that could be how a lot of this data starts to work its way into places. People are gonna have to get used to technology having a perspective on what things might look like and some level of confidence around that perspective.
[00:52:37] Jacob: Yeah. I think that's totally right. I mean, I think you're honestly describing the process, or how things exist today, with buying apparel on Amazon, for example. That reality is already here. And by the way, it really sucks.
[00:52:53] Brian: It really sucks.
[00:52:54] Jacob: There's no perspective, basically, searching for something like clothing on Amazon, which is great because it's easy to buy. There's no friction there, but the friction on listing is now so low that on the other side there's 8,000 options for everything. The brands are totally made up. You have no relationship to the brands. The descriptions, who knows if they're accurate? I think to some extent we're already there, but what that also means is we have an opportunity to swing the pendulum back the other way, which is to say we can build a curated set of experiences, which might just manifest as a really awesome ecommerce apparel store. I think that's okay.
[00:53:42] Brian: I think that's true. Although, I do think that people wanna sort through all of that because they can find a lot of value in those scenarios.
[00:53:49] Jacob: Yeah. Yeah.
[00:53:49] Brian: So maybe it's where consumers are gonna start to build their own protections, or they're just gonna have tools to help give their own perspective on the quality and the sourcing and the details that are in Amazon. It's like, bring your own...
[00:54:06] Jacob: Shopper bot. Yeah. Bring your own shopper bot. Yeah.
[00:54:08] Brian: Yeah. Yeah. Totally.
[00:54:09] Jacob: Encode your own taste. Yeah. Absolutely.
[00:54:11] Brian: Your own taste and a confidence level, based on the data that's available, about whether the product is what it says it is.
[00:54:18] Jacob: Yeah. Yeah. Yeah. I mean, this sounds like really bad for Stitch Fix.
[00:54:21] Phillip: I really like the big ideas in this conversation. If someone had to walk away with one big takeaway, it would be... I foresee a future where, if things like what you described become trivial for playful use cases, it also trivializes really hard to solve problems that are multilayered, that require lots of engineering and data exporting, massaging, and reimporting. You can compress those into a handful of steps. They can be done locally or semi locally, instead of having to reserve workloads or something. I see a lot of value in IT organizations having a greater hand in the ecommerce experience in the future, because they can do a lot more for themselves, as opposed to farming things out to SaaS companies that each solve one particular problem in the user experience and then stitching those together.
[00:55:25] Jacob: Yeah. I think that's a great takeaway. Totally agree that things are going to continue to get easier. And more importantly, I guess it comes back to encoding your taste in the thing you're building. We are getting very close to the point that the constraints that came from using canned software are potentially going to be blown away. And I think that's a really cool place to be at. It's exciting.
[00:55:55] Phillip: I think it's very, very exciting. MotherDuck.
[00:56:00] Jacob: Yeah.
[00:56:00] Phillip: Wow. Congrats on the new role. And where can people learn more? Where can people find you online?
[00:56:07] Jacob: Yeah. Sure. So, obviously, I'm doing some writing and docs and things on MotherDuck.com. And then feel free to reach out and ping me either on Twitter, now X, whatever, @MatsonJ. And, of course, over on LinkedIn, you can find me writing and posting about data related things.
[00:56:27] Phillip: Love it, and I appreciate it. Have you predicted the playoffs, the NFL playoffs yet? Are you still doing that sort of a thing?
[00:56:34] Jacob: I still am. You can still check that out on mdsinthebox.com. No NFL playoff predictions, but I do have NBA playoff predictions. A little sports simulation model there. Very cool, kinda single node maxing on that project.
[00:56:51] Phillip: Love that. Thank you so much, Jacob Matson from MotherDuck, and, hey, thank you for listening to this episode of the podcast. Give us your feedback. We wanna hear what you think about this. Where do you think these things are heading? [email protected]. And you can also get premium content just like this over at futurecommerce.com/plus. Join the membership. And there's so much more to come. Thank you so much for listening to this episode of Future Commerce.