Treeverse, creators of the open-source lakeFS data version control system, today announced the release of lakeFS 1.0. This major update brings production-level stability, security and performance to the data lake version control software.
The lakeFS project got its start in 2020 and has steadily improved in the years since, providing an open-source technology that helps organizations apply version control to object storage-based data in data lakes.
Treeverse, the lead company behind the technology, raised $23 million in 2021 to build out the concept of bringing capabilities similar to the open-source Git version control system to data lakes. In 2022, Treeverse launched the lakeFS cloud offering, a managed cloud service for data version control. The lakeFS approach has found a receptive audience according to Treeverse, with large enterprises including Lockheed Martin, Volvo and Arm among the technology’s users.
The lakeFS 1.0 technology is now also able to integrate with other data lake technologies, including Databricks as well as the open-source technology Apache Iceberg, which is increasingly being adopted by cloud data vendors, including Cloudera and Snowflake among others.
“We have a large base of installations and really a product that reflects what people need for data version control over a data lake,” Einat Orr, Co-founder and CEO at Treeverse, told TechForgePulse in an exclusive interview.
What lakeFS data version control brings to the data lake market
Data version control allows users to track changes to data over time, similar to how version control systems like Git track changes to code.
The open-source Git version control system, which is at the heart of GitHub and much of modern application development, is built around the concept of having different versions of code and different branches. It’s a wildly popular approach to development that lakeFS has extended to the world of data stored in data lakes.
The idea of versioning in data lake deployments has a lot of nuance, as multiple vendors and technologies have varying degrees of versioning capabilities. Orr noted that while other technologies including Databricks and Apache Iceberg may allow creating versions of tables or schemas, that is different from a full data version control system.
Orr explained that lakeFS provides a full version control experience across an organization’s entire data lake, not just specific tables or schemas. This allows entire data pipelines and workflows to be versioned together. The lakeFS technology stores metadata about each version and its changes, which is important for reproducibility and integration.
Treeverse is not necessarily positioning lakeFS as a competitor to technologies like Databricks or Apache Iceberg but rather as a complementary technology that provides additional benefits to users. Orr also noted that lakeFS integrates with data orchestration tools including Apache Airflow, Prefect and Dagster, bringing the power of data version control to the data pipeline workflow.
The intersection of lakeFS and AI
There are a number of different data analytics and AI use cases for the lakeFS technology.
Looking at AI and machine learning (ML), Orr said that one interesting use case is that data scientists can use lakeFS to version data locally for model development and testing purposes, through a new lakeFS local capability.
Orr explained that data scientists and AI/ML model developers often deal with large volumes of data. For testing and development, however, she noted that developers will sometimes do their research on their own local systems, which is what the new lakeFS local capability helps to enable.
Looking forward, Orr said that her company is in the early stages of figuring out how to integrate and enable data version control capability for vector database technologies.
“Our vision is to be the version control tool that is running over all your data sources, and providing you the ability to version control your data pipelines, no matter where the data is,” she said.