Our Vision for the Future of Vision

January 29, 2024
Synativ Team

AI is reshaping how we live and interact with each other. Last year, this development accelerated with the successes of large language models, which, for the first time, made a lasting impression on a broad audience. They redefined how we search, learn, solve problems, and create.

This progress might give the impression that language models alone are the end point of AI. However, although their results push the boundaries of our imagination, text exists solely on paper and on our devices; there are many other exciting forms of AI that interact with the physical world and will benefit us.

Vision is essential to how humans build an understanding of the world. Babies learn many of the rules governing it, such as gravity, before they understand or speak any language. Indeed, mammals that have no structured language are still able to read their environment.

At Synativ, we are convinced that for many real-life applications the future of AI is multi-modal. Vision is an indispensable component, but for industrial applications (e.g. medical imaging, AgTech, geospatial monitoring) its development has fallen behind its language counterpart. Over the years, many tools have been built to optimise the supervised deep learning workflow popularised by the 2012 ImageNet challenge, but that workflow of collecting and labelling data has remained largely unchanged, and inefficient, to this day.

The world needs a better, more scalable way to unlock the value that vision applications could bring to our lives, and it is Synativ’s mission to build towards that.

Foundation models for vision

Last year, the releases of SAM and DINOv2 marked a pivotal point for computer vision. Although “large” (by the vision field’s standards) models had been released before, these two showed generalisation capabilities that had not been witnessed in a similar fashion. The sheer scale of their training data gives them a general and versatile visual understanding of the world, leading to impressive zero- and few-shot capabilities. Building on top of Visual Foundation Models (VFMs) has the potential to significantly decrease the development time of new vision applications by reducing the amount of required data collection, labelling, and training.

However, in contrast to LLMs, few teams have managed to fully realise this potential. Vision applications are diverse, and the long tail is generally not well covered by the web datasets often used to train VFMs. Moreover, the field is still nascent, and building up the expertise to fine-tune and utilise VFMs correctly can be daunting and distracting.

Nevertheless, many of our partners have collected huge amounts of industry-specific data over time by recording their camera feeds. Today, this data sits unused in their clouds: it has not been labelled and thus cannot be employed for supervised learning. We would like to change that by enabling them to train their own proprietary foundation model that can deliver on this potential.

Synativ today

Taking lessons from the language field, Synativ has been built around the notion of hierarchical fine-tuning. As a first step, VFMs trained on generic vision data are fine-tuned on large (unlabelled) industry-specific datasets. Then, these industry-specific VFMs can be adapted further to each use case with a small labelled training set. For example, clinicians might first fine-tune SAM on all (unlabelled) pathology slides that they recorded over time and then fine-tune it further on a small labelled dataset of a specific disease.
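
To make the two stages concrete, here is a minimal PyTorch sketch of the idea. The ResNet-50 backbone, the rotation-prediction pretext task, and the toy tensors standing in for real slides are all illustrative assumptions chosen for brevity; they are not Synativ's implementation (which builds on VFMs such as SAM).

    import torch
    import torch.nn as nn
    from torchvision import models

    # Generic pre-trained backbone standing in for a VFM.
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = nn.Identity()  # expose 2048-d features instead of ImageNet logits

    # Stage 1: industry-specific fine-tuning on unlabelled data, here via a
    # simple self-supervised pretext task (predicting the applied rotation).
    rot_head = nn.Linear(2048, 4)
    opt = torch.optim.AdamW(
        list(backbone.parameters()) + list(rot_head.parameters()), lr=1e-5)
    unlabelled = torch.randn(8, 3, 224, 224)  # stand-in for unlabelled slides
    for k in range(4):
        rotated = torch.rot90(unlabelled, k, dims=(2, 3))
        loss = nn.functional.cross_entropy(
            rot_head(backbone(rotated)), torch.full((8,), k))
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: use-case fine-tuning on a small labelled set, with the
    # domain-adapted backbone frozen and only a lightweight task head trained.
    for p in backbone.parameters():
        p.requires_grad = False
    task_head = nn.Linear(2048, 2)  # e.g. diseased vs. healthy tissue
    opt = torch.optim.AdamW(task_head.parameters(), lr=1e-3)
    images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))
    loss = nn.functional.cross_entropy(task_head(backbone(images)), labels)
    opt.zero_grad(); loss.backward(); opt.step()

The key design choice is that the expensive, data-hungry adaptation happens once per industry on unlabelled data, while each downstream use case only needs a small labelled set.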

When we started Synativ, we focussed on fine-tuning foundation models that were trained on a broad set of generic data. However, as we saw some of our users have great success with hierarchical fine-tuning, we wanted to bring a similar workflow to users who have not yet collected vast amounts of images. Synativ's pathology and geospatial foundation models are our first efforts in that direction: industry-specific foundation models provided as a starting point, to be fine-tuned to a wide variety of downstream tasks when only a small amount of data is available.

Currently, we provide our toolbox (including these models) as a Python SDK, through which users can upload their data, fine-tune a VFM in the cloud, and download their model for deployment.
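
As a hypothetical illustration of that workflow, the sketch below walks through the three steps (upload, fine-tune, download). The Client class is a local stub written for this post; none of the names reflect the SDK's actual API.

    from dataclasses import dataclass

    @dataclass
    class FineTuneJob:
        model_id: str
        def wait(self) -> None:
            print(f"waiting for {self.model_id} to finish training ...")

    class Client:
        """Local stub standing in for a cloud fine-tuning SDK."""
        def upload_dataset(self, path: str) -> str:
            print(f"uploading {path}")
            return "dataset-001"
        def fine_tune(self, base_model: str, dataset_id: str) -> FineTuneJob:
            print(f"fine-tuning {base_model} on {dataset_id} in the cloud")
            return FineTuneJob(model_id=f"{base_model}-finetuned")
        def download_model(self, model_id: str, out_path: str) -> None:
            print(f"downloading {model_id} to {out_path}")

    # Upload industry data, fine-tune a VFM remotely, download for deployment.
    client = Client()
    dataset_id = client.upload_dataset("./pathology_slides")
    job = client.fine_tune(base_model="pathology-vfm", dataset_id=dataset_id)
    job.wait()
    client.download_model(job.model_id, "./fine_tuned_model.pt")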

Synativ's future

Our ambition is to become the leading platform for everyone who wants to build vision applications that work in the real world. Our team members have worked at the frontier of computer vision for nearly a decade at top academic and industry research labs. Over that time, we have realised that two elements are critical for success at scale: a profound understanding of 1) the training input (data) and 2) the training method (algorithms).

We aim to productise and extend our shared expertise in those areas, so that engineering teams no longer need to hire PhDs, struggle with poorly documented public code, manually label thousands of images, or wait months between model iterations. Instead, they should be able to efficiently adapt their models as new use cases emerge.

We would love to hear from you if you would like to build this future together with us, either as a team member or as a partner.

Want to keep up to date with our development?
