ELI5 #4: Our API Problem
An attempt at some insights on developer infrastructure -- addressing the problem of inaccessible data through custom connectors.
hey :) I’m Anant and I’m another 20-something conjecturing about technology. My goal is to take complicated ideas, explain them simply, and develop a perspective. I’m writing to hold myself accountable to learning — hopefully you’ll find it interesting. Maybe you won’t. Thanks for reading either way.
Going shorter form (a la Tomasz Tunguz) to see if inspiration strikes.
We’ve talked a bit about the importance of data in AI training and inference, but like you and me, turns out businesses have a hard time consolidating their data. A business might have data spread across a bunch of different apps — in their CRM, their ERP, in emails, saved locally on computers, etc. Some subset of that data is important to the organization and should be stored centrally in a data store for easy access (i.e., in “your cloud” — more here another time).
Pulling together all of that data is kind of a… pain in the ass.
First, you have to extract your data from where it’s originally stored. In the case that you’re pulling data from an app (like your CRM), this is typically done through an API (Application Programming Interface). If the UX you use to click into the application is the front entrance, you can think of an API as being the back door for pulling data out of an application (see image below). You send the API a request that spells out what data you want (input), and it sends that data back to you (output).
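To make the input / output idea concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the URL `crm.example.com`, the `contacts` resource, the `updated_since` and `limit` parameters, and the shape of the JSON response are all made up, and the response is canned rather than fetched over the network.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; this is not a real service.
BASE_URL = "https://crm.example.com/api/v1"


def build_request_url(resource, **params):
    """Input side: describe the data we want as a URL plus query parameters."""
    query = urlencode(sorted(params.items()))
    url = f"{BASE_URL}/{resource}"
    return f"{url}?{query}" if query else url


def parse_response(body):
    """Output side: turn the JSON text the API returns into Python records."""
    payload = json.loads(body)
    return payload["results"]  # assumed response shape


url = build_request_url("contacts", updated_since="2024-01-01", limit=2)

# Canned text standing in for what the server would send back.
canned_body = '{"results": [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]}'
contacts = parse_response(canned_body)

print(url)
print([c["name"] for c in contacts])
```

In a real connector the canned string would be replaced by an authenticated HTTP call, which is where most of the pain described next comes from.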
But turns out extracting that data is also kind of a… pain in the ass. APIs are the wild west — given the varied purposes of applications, each has (necessarily) different types of requests, different inputs, and different outputs (in different formats). The documentation about how to access APIs is (as a high school teacher once commented on my contributions during a class) “of variable quality.” An API’s specifications can change (e.g., if the application is updated). And worst of all, many software companies don’t have publicly accessible APIs at all (or they hide them behind a paywall).
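Here is a rough sketch of what "different outputs in different formats" means in practice. Both apps below are hypothetical: an imagined CRM that returns nested JSON and an imagined ERP that returns CSV with its own column names (`CUST_NO`, `FULL_NAME`, `EMAIL_ADDR` are invented). Each one needs its own translation code before the records can sit in one table.

```python
import csv
import io
import json


def from_crm(body):
    """Hypothetical CRM: JSON with a nested name object."""
    return [
        {
            "id": str(r["id"]),
            "name": f'{r["name"]["first"]} {r["name"]["last"]}',
            "email": r["email"].lower(),
        }
        for r in json.loads(body)["contacts"]
    ]


def from_erp(body):
    """Hypothetical ERP: CSV with its own column names."""
    rows = csv.DictReader(io.StringIO(body))
    return [
        {"id": r["CUST_NO"], "name": r["FULL_NAME"], "email": r["EMAIL_ADDR"].lower()}
        for r in rows
    ]


crm_body = (
    '{"contacts": [{"id": 7, "name": {"first": "Ada", "last": "Lovelace"},'
    ' "email": "Ada@Example.com"}]}'
)
erp_body = "CUST_NO,FULL_NAME,EMAIL_ADDR\nC-19,Grace Hopper,grace@example.com\n"

# Two sources, two formats, one shared schema at the end.
records = from_crm(crm_body) + from_erp(erp_body)
for record in records:
    print(record)
```

Two apps means two translators. Multiply that by every app a business uses and the "wild west" framing starts to feel generous.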
Building the connectors to pull data from APIs can be expensive (and distracting to an engineering team), so big businesses have been built to create these connectors for you (e.g., Fivetran and Matillion). But it’s not all rosy: (1) this is a costly service (pricing is often tied to the volume of data extracted, which can add up quickly), (2) for these companies, it requires a ton of software engineering manpower to “babysit” the connectors, updating them whenever they break or the API changes, and (3) they only offer connectors for a pre-set menu of ~500 applications, out of the thousands of apps you might want to pull data from (e.g., no connectors for some of the biggest EHRs).
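A small sketch of what "babysitting" a connector can look like: checking whether the fields an API returns still match what the connector was built against. The field names and the rename are invented for illustration, and real vendors surely do something more involved than this.

```python
# Fields our hypothetical connector was originally built against.
EXPECTED_FIELDS = {"id", "name", "email"}


def check_schema(record, expected=EXPECTED_FIELDS):
    """Return (missing, unexpected) field names for one API record."""
    actual = set(record)
    return sorted(expected - actual), sorted(actual - expected)


old_record = {"id": 1, "name": "Ada", "email": "ada@example.com"}
# Imagine the app shipped an update that renamed "name" to "full_name".
new_record = {"id": 1, "full_name": "Ada", "email": "ada@example.com"}

print(check_schema(old_record))  # ([], [])
print(check_schema(new_record))  # (['name'], ['full_name'])
```

Someone still has to notice the alert and rewrite the mapping, which is the engineering time that gets baked into the price.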
This is an annoying (and costly) problem for devs. Some spaces to watch:
Better integrations require better documentation. Lots of startups are using gen AI to write docs that help explain how APIs work. Better docs are foundational to leveraging AI to generate connectors.
API explosion. On the demand side, there’s a desire to plug into more apps than ever before, each with new particularities around data formats.
One-to-many API integrations. Connecting with one API that allows you to plug into many more, though limited to more standardized data pulls.
Low-code / no-code integrations, a la Zapier. This is probably the future, but there’s a long way to go before we get there.
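The one-to-many idea from the list above can be sketched as a thin layer with one method and many apps behind it. This is a toy under stated assumptions: `UnifiedContacts`, the two fake apps, and their fields are all hypothetical, and no real vendor's API is being described.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Fetcher = Callable[[], List[Record]]


class UnifiedContacts:
    """Hypothetical one-to-many layer: one method, many apps behind it."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Fetcher] = {}

    def register(self, app: str, fetcher: Fetcher) -> None:
        self._adapters[app] = fetcher

    def list_contacts(self, app: str) -> List[Record]:
        if app not in self._adapters:
            raise KeyError(f"no adapter registered for {app!r}")
        # Only the standardized fields survive; app-specific extras are dropped.
        return [
            {"name": r["name"], "email": r["email"]}
            for r in self._adapters[app]()
        ]


def fake_crm() -> List[Record]:
    return [{"name": "Ada", "email": "ada@example.com", "lead_score": "87"}]


def fake_helpdesk() -> List[Record]:
    return [{"name": "Grace", "email": "grace@example.com", "open_tickets": "2"}]


client = UnifiedContacts()
client.register("crm", fake_crm)
client.register("helpdesk", fake_helpdesk)

all_contacts = client.list_contacts("crm") + client.list_contacts("helpdesk")
print(all_contacts)
```

Notice that `lead_score` and `open_tickets` never make it out. That is the "limited to more standardized data pulls" tradeoff in miniature.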
One of the tensions here is latency, though: the more middleware you introduce, the more you slow down API responses for models that are already often borderline intolerably slow. Waiting even 140-something milliseconds (https://thefastest.ai/) for a response is still unacceptably slow for many high-volume technical applications.
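A back-of-the-envelope sketch of how middleware stacks up. The 140 ms is the figure quoted above; every per-hop number below is made up purely to show the arithmetic, not measured from any real system.

```python
def end_to_end_ms(model_ms, hop_overheads_ms):
    """Total wait if each middleware hop runs in sequence before the model responds."""
    return model_ms + sum(hop_overheads_ms)


model_ms = 140  # figure quoted in the text

# Made-up overheads for illustration only.
hops = {
    "unified API gateway": 30,
    "connector": 45,
    "source app API": 120,
}

total = end_to_end_ms(model_ms, hops.values())
middleware_share = (total - model_ms) / total

print(f"total: {total} ms")  # total: 335 ms
print(f"middleware share: {middleware_share:.0%}")  # middleware share: 58%
```

With these invented numbers the plumbing costs more time than the model itself, which is the tension in a nutshell.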