When ChatGPT debuted over a year ago, internet users got an always-available AI assistant to chat and work with. It handled their day-to-day tasks, from producing natural language content (like essays) to reviewing and analyzing complex information. In no time, the meteoric rise of the chatbot drew the world’s attention to the technology sitting at its heart: the GPT series of large language models (LLMs).
Fast forward to the present day, and LLMs – the GPT series and others – are the driving force behind not just individual tasks but massive business operations. Enterprises are leveraging commercial model APIs and open-source offerings to automate repetitive tasks and drive efficiencies across key functions. Imagine conversing with an AI to generate ad campaigns for a marketing team, or accelerating customer support operations by surfacing the right data at the right time.
The impact has been profound. However, one area where the role of LLMs isn’t discussed as much is the modern data stack.
LLMs transforming the data stack
Data is the key to high-performing large language models. And when these models are trained correctly, they can help teams work with their data — whether that means experimenting with it or running complex analytics.
In fact, over the last year, as ChatGPT and competing tools grew in popularity, enterprises providing data tooling to businesses looped generative AI into their workflows to make things easier for their customers. The idea was simple: tap the power of language models so that end customers not only get a better experience when handling data but also save time and resources, letting them focus on other, more pressing tasks.
The first (and probably the most important) shift with LLMs came when vendors started debuting conversational querying capabilities — i.e., getting answers from structured data (data that fits into rows and columns) by talking with it. This eliminated the hassle of writing complex SQL (structured query language) queries and gave teams, including non-technical users, an easy-to-use text-to-SQL experience, where they could enter natural language prompts and get insights from their data. The underlying LLM converted the text into SQL, and the resulting query was then run against the targeted dataset to generate answers.
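To make the pattern concrete, here is a minimal text-to-SQL sketch in Python. It is illustrative only: the schema, prompt, model choice and helper functions are assumptions, not any vendor's actual implementation.

```python
# Minimal text-to-SQL sketch (illustrative; the schema, prompt and model
# choice are assumptions, not any vendor's actual implementation).
import sqlite3

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCHEMA = "CREATE TABLE sales (region TEXT, product TEXT, amount REAL, sold_on DATE);"

def text_to_sql(question: str) -> str:
    """Ask the LLM to translate a natural language question into SQL."""
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical choice; any capable model works
        messages=[
            {
                "role": "system",
                "content": (
                    "Translate the user's question into a single SQLite query "
                    f"for this schema:\n{SCHEMA}\nReturn only the SQL."
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

def answer(question: str, db_path: str = "sales.db") -> list:
    """Run the generated query against the target dataset and return rows."""
    sql = text_to_sql(question)
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

# Example: answer("What were total sales per region last month?")
```

In a production system, the generated SQL would typically be validated against an allow-list of tables and executed with read-only permissions before results reach the user.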
While many vendors have launched this capability, some notable movers in the space are Databricks, Snowflake, Dremio, Kinetica and ThoughtSpot. Kinetica initially tapped ChatGPT for the task but now uses its own native LLM. Snowflake, meanwhile, offers two tools: a copilot that works as a conversational assistant for tasks like asking questions about data in plain text, writing SQL queries, refining queries and filtering down insights; and a Document AI tool that extracts relevant information from unstructured datasets such as images and PDFs. Databricks also operates in this space with what it calls ‘LakehouseIQ’.
Notably, several startups have also emerged in the same area, targeting the AI-based analytics domain. California-based DataGPT, for instance, sells a dedicated AI analyst that runs thousands of queries against the lightning cache of its data store and returns results in a conversational tone.
Helping with data management and AI efforts
Beyond helping teams generate insights and answers from their data through text inputs, LLMs are also taking on traditionally manual data management work, as well as the data preparation efforts crucial to building robust AI products.
In May, Intelligent Data Management Cloud (IDMC) provider Informatica debuted Claire GPT, a multi-LLM-based conversational AI tool that allows users to discover, interact with and manage their IDMC data assets with natural language inputs. It handles multiple jobs within the IDMC platform, including data discovery, data pipeline creation and editing, metadata exploration, data quality and relationships exploration, and data quality rule generation.
Then, to help teams build AI offerings, California-based Refuel AI provides a purpose-built large language model that helps with data labeling and enrichment tasks. A paper published in October 2023 also shows that LLMs can do a good job of removing noise from datasets, a crucial step in building robust AI.
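As a rough illustration of how LLM-assisted labeling works under the hood, here is a generic sketch; the label set, prompt and model are assumptions, and this is not Refuel AI's actual API.

```python
# Generic LLM-assisted labeling sketch (an assumed pattern, not Refuel AI's
# actual API; the label set, prompt and model are hypothetical).
from openai import OpenAI

client = OpenAI()

LABELS = ["billing", "technical_issue", "feature_request", "other"]

def label_record(text: str) -> str:
    """Classify one raw text record into a predefined label set."""
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical choice
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the record into exactly one of these labels: "
                    + ", ".join(LABELS)
                    + ". Reply with the label only."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in LABELS else "other"  # guard against malformed output
```

The same pattern extends to noise removal: the model can be asked to flag records that look inconsistent or mislabeled, which are then reviewed or dropped.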
Other areas of data engineering where LLMs can come into play are data integration and orchestration. The models can generate the code needed for both, whether one has to convert diverse data types into a common format, connect to different data sources, or prompt for the YAML or Python code templates used to construct Airflow DAGs (directed acyclic graphs).
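For instance, given a prompt like "build a daily pipeline that extracts a CSV and loads it into the warehouse," an LLM can emit Airflow boilerplate along these lines (a minimal sketch; the DAG name, schedule and task bodies are placeholder assumptions):

```python
# The kind of Airflow DAG template an LLM can generate from a natural
# language prompt (a minimal sketch; names, schedule and task bodies are
# placeholder assumptions).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_csv():
    # Placeholder: pull the latest CSV export from the source system.
    print("extracting source data")

def load_warehouse():
    # Placeholder: transform the records and load them into the warehouse.
    print("loading into warehouse")

with DAG(
    dag_id="daily_sales_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_csv", python_callable=extract_csv)
    load = PythonOperator(task_id="load_warehouse", python_callable=load_warehouse)

    extract >> load  # run extract before load
```

A human engineer would still review and test such generated templates, but the boilerplate itself is exactly the kind of repetitive code LLMs handle well.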
Much more to come
It’s only been a year since LLMs started making waves, and we are already seeing significant changes across the enterprise domain. As these models improve in 2024 and teams continue to innovate, we’ll see more applications of language models in different areas of the enterprise data stack, including the gradually developing space of data observability.
Monte Carlo, a well-known vendor in the category, has already launched Fix with AI, a tool that detects problems in data pipelines and suggests the code needed to fix them. Acceldata, another player in the space, also recently acquired Bewgle to focus on LLM integration for data observability.
However, as these applications emerge, it will become more important than ever for teams to ensure that these language models, whether built from scratch or fine-tuned, perform exactly as expected. A slight error can affect downstream results and ultimately break the customer experience.