The Challenges and Opportunities of Home-baked Open Source Multimodal Search

June 14, 2023
The Objective Team

In 2022, 65% of all internet traffic was video and images; data usage from video sites increased by 24% year-over-year. With the adoption of mobile devices and increasing bandwidth and connectivity, people are consuming more rich media than ever. We've come a long way from the days of the text-only web.


For a long time, rich media has been searched with engines built for text. To make that work, every image, video, and audio file is titled, tagged, and described, and it is that text metadata that gets searched. This approach has many limitations, since text cannot capture all the information an image or a song contains; after all, “a picture is worth a thousand words.”

People now consume so much non-text data that expectations for search are changing. They're starting to expect computers to understand images and video natively: not just what they contain, but also the sentiments and moods they evoke. And with the explosion of rich media, companies are struggling to maintain their old, text-based search infrastructure.

Companies like Google and Amazon are shifting to this new world of multimodal search with systems that understand images, video, and audio in addition to text data. Those companies hire armies of PhDs and deploy massive amounts of data and infrastructure to build this tech in house. But what can a company without Big Tech's deep pockets do?

For companies with an in-house data science/ML team, it's tempting to build their own solution on open-source software. The internet is full of tutorials showing how to combine amazing open-source models like CLIP with vector databases to create a basic image search engine.
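The basic recipe from those tutorials is easy to sketch. Assuming you already have CLIP embeddings for your catalog images and query text (for example, from an open-source package like `open_clip`), the search half reduces to normalized dot products. Here random arrays stand in for real embeddings:

```python
import numpy as np

def build_index(image_embeddings):
    # L2-normalize so a dot product equals cosine similarity.
    norms = np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    return image_embeddings / norms

def search(index, query_embedding, k=3):
    # Score every catalog item against the query, return the top k.
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))

# Stand-ins for CLIP embeddings (512-dim, as in CLIP ViT-B/32).
rng = np.random.default_rng(0)
catalog = rng.normal(size=(1000, 512))   # image embeddings
query = rng.normal(size=512)             # text embedding for e.g. "hoodie"

index = build_index(catalog)
results = search(index, query, k=3)      # [(item_id, cosine_score), ...]
```

A real deployment would swap the brute-force argsort for an approximate-nearest-neighbor index, but the ranking logic is the same, and that simplicity is exactly why these demos are so quick to build.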

We created a demo of CLIP on a fashion dataset. Here's an example of a simple search for “hoodie”:

CLIP can understand images at a high level and make them searchable using natural language. The nice thing is that creating this demo is straightforward. However, throwing CLIP embeddings into a vector database is hardly the end of the story for companies seeking to provide top-tier customer experiences that strengthen brand perception, delight customers, and increase conversions. We'll expand on that below.

In this next example, a basic search for "red top" yields a pair of pants in the top results:

Search quality quickly degrades as the query gets more descriptive. Here, a search for “sweater dress with belt” yields sweater dresses without belts, and one which isn’t a sweater dress at all:

For any business that depends on helping users find what they're looking for, this needs to be much better.


One of the recommended ways to improve a model's performance is to fine-tune it on the target domain. In this context, fine-tuning means taking a model that was pre-trained on a very large, diverse and task-independent dataset, and further training it on data that is more targeted to the model's desired application. A group of researchers from industry and academia created FashionCLIP: an instance of CLIP fine-tuned for fashion. We won't go into the details of the model here, but you can read more about it in the references section at the end of this article.
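Concretely, fine-tuning here means continuing to train CLIP's image and text encoders with the same symmetric contrastive objective used in pre-training, only now on in-domain fashion image–caption pairs. Below is a minimal numpy sketch of that loss; random arrays stand in for encoder outputs, and the 0.07 temperature is a common initialization, not a claim about FashionCLIP's exact training setup:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss used to train and fine-tune CLIP.
    Row i of img_emb and row i of txt_emb are a matching image-caption pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # pairwise cosine similarities
    n = len(logits)

    def cross_entropy(l):
        # Correct pairs sit on the diagonal; average their negative log-probs.
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# Sanity check with stand-in embeddings: aligned pairs yield a lower loss.
rng = np.random.default_rng(0)
images = rng.normal(size=(8, 64))
aligned_loss = clip_contrastive_loss(images, images)
random_loss = clip_contrastive_loss(images, rng.normal(size=(8, 64)))
```

In practice you would compute this loss in a framework like PyTorch and backpropagate through both encoders; the numpy version only illustrates what "further training on more targeted data" actually optimizes.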

We created a FashionCLIP demo over the same dataset we used for the CLIP demo, and FashionCLIP showed significant improvements.

It has greater accuracy. Basic semantic searches like “red top” now work well:

And it holds up well as queries get more descriptive. All top results for “sweater dress with belt” are relevant:

And it’s the same for “shoes with chunky heel”:

Still, despite being fine-tuned for fashion, FashionCLIP falls short of the accuracy a production-grade system needs.

Searching for "linen shirts" yields cotton shirts:

The problem here is that by relying only on visual vector similarity, the system failed to take metadata into account and so returned irrelevant items.

Searching for specific brands and product types (or combinations of them) yields completely unexpected results. Here, searching for “Elisha (brand) sandalette (product)” returns items with the wrong brand, the wrong product, or both:

The problem is that both the CLIP+VectorDB and FashionCLIP+VectorDB implementations fail to leverage all fields and all modalities (both images and text); they reduce the search problem to simple vector similarity. In practice, we have found the problem to be much more complex. Depending on their intent, users search with specific keywords or with descriptive queries, and they sometimes want exact matching while at other times they expect a more exploratory behavior that benefits from approximate matching. Finding the optimal ranking is a difficult problem that takes experience and often many iterations to get right.

We have argued that text-only search is limiting, but image-only search is limiting too. Images carry far richer visual information than text can ever capture, but they lack non-visual metadata such as brand name, price, popularity, and ratings. Combining modalities is key to giving search systems the information they need to drive the best customer experience.

To deal with the shortcomings of pure vector search, the popular solution is to combine vector and text search. But this adds an extra layer of complexity:

  • How much do you trust signals coming from different matchers?
  • How do you score and combine results from them?
  • Which additional signals do you add to create the best possible experience?
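There is no single right answer to these questions, but one common baseline for combining ranked lists from independent matchers is reciprocal rank fusion (RRF), which sidesteps score calibration entirely by working on ranks. This is a sketch of that baseline (the product IDs are invented for illustration), not a description of any particular production ranker:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked result lists from several matchers.
    rankings: list of lists of doc ids, best first.
    k: damping constant; 60 is the value suggested in the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists for a query like "marta sandal":
lexical = ["marta-01", "marta-02", "marta-03"]    # exact brand/name matches
vector  = ["sandal-77", "marta-01", "sandal-12"]  # visually similar items
fused = rrf_fuse([lexical, vector])
# "marta-01" appears in both lists, so it rises to the top of the fused ranking.
```

Because RRF is rank-based, a lexical matcher's BM25 scores and a vector matcher's cosine similarities never need to share a scale; items that multiple matchers agree on float upward. It is a reasonable starting point, but tuning the trust placed in each signal for a specific catalog and user base is exactly where the hard iteration happens.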

These are the nitty-gritty details of production-grade search systems that tutorials and articles don't cover. And adding multimodality takes the complexity even further.

At Objective, we've built a multimodal search and discovery platform that addresses all these questions, and helps companies provide a robust, next-generation search experience to their users.

Check out how our system performs compared to CLIP and FashionCLIP.

Our system has far higher accuracy for basic descriptive searches. Here, CLIP and FashionCLIP return irrelevant results for “pink children’s backpack” (mittens, hats, parkas) while our system (far-right) returns only matching items:

Our system stays robust even when small details count. Here, a search for “red stiletto heel with bow” has CLIP and FashionCLIP returning obviously wrong items (sandals, children’s shoes, shoes with no bow). Our system returns one imperfect result, but even that is a red adult heeled shoe with a bow (it just has a chunky heel instead of a stiletto):

Unlike a pure vector similarity engine, our system is optimized across the stack to provide the best results while leveraging all modalities, and both lexical and vector information. It works seamlessly when a user searches for a product type and a brand. Here, a search for “marta sandal” (a specific line of sandals) fails completely on CLIP and FashionCLIP, but thanks to combining vector and lexical search, it works perfectly on our system:

In a head-to-head matchup against CLIP and FashionCLIP, Objective provides similar or better results 78% of the time, and strictly superior results 44% of the time. And we continue to improve every day.

When your revenue depends on search quality, relevance is just one thing to get right. It is hard enough on its own, but it is also only the tip of the iceberg. You also need to address the complexities around infrastructure: data processing pipelines, indexing, sharding, serving, latency, monitoring, redundancy, data drift, business KPIs, and ongoing quality improvements of models in response to user interactions.

At Objective, we’re search experts who take care of all this so you can focus on building the UX that will make your product stand above your competition. We’ve worked hard to make our system easy to use. Just load your data, and we'll handle the rest.

See Objective Search for yourself: Schedule a demo.


Recommended reading