How to deploy AI and embeddings without breaking the bank

LLMs like GPT can give useful answers to many questions, but their output has well-known issues: responses may be outdated, inaccurate, or outright hallucinations, and it’s hard to know when you can trust them. They also know nothing about you or your organization’s private data (we hope). Retrieval-augmented generation (RAG) can reduce hallucinated answers and make responses more up-to-date, accurate, and personalized by injecting related knowledge, including non-public data, into the prompt. In this talk, we’ll go through ways you can implement RAG, including vector search, multi-vector retrieval, filtering, ranking, and hybrid search.
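As a taste of the vector-search step, here is a minimal sketch of the core retrieve-then-inject loop. Everything in it is illustrative rather than from the talk: the hashing `embed` function is a toy stand-in for a real embedding model, and the sample documents and prompt template are made up.

```python
# Minimal RAG retrieval sketch: embed documents, find the nearest ones to a
# query by cosine similarity, and inject them into the prompt.
import hashlib
import numpy as np

DIM = 64  # toy dimensionality; real embedders use hundreds to thousands

def embed(text: str) -> np.ndarray:
    """Deterministic toy embedding: hash tokens into a dense unit vector."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "Vespa supports hybrid search combining vector and text ranking.",
    "Binary quantization stores one bit per embedding dimension.",
    "Matryoshka embeddings allow truncating vectors to fewer dimensions.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Vector search: dot product equals cosine here (unit-normalized)."""
    scores = doc_vectors @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

query = "How does binary quantization work?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the LLM
```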

This talk will also cover the state of the art in quantization and dimensionality reduction, combining Matryoshka representation learning with binary quantization and Hamming distance (https://blog.vespa.ai/combining-matryoshka-with-binary-quantization-using-embedder/). This also makes the application economically viable: the embedding data is split into low-resolution vectors kept in RAM and high-resolution vectors kept on disk, with a two-phase ranking function for low-latency evaluation.
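The linked blog post describes Vespa’s implementation; the sketch below shows the general idea in plain NumPy, with all names, dimensions, and candidate counts made up for illustration: truncate the Matryoshka embedding to its leading dimensions, binarize it for a cheap in-RAM first phase scored by Hamming distance, then rerank the survivors with the full-precision vectors.

```python
# Two-phase retrieval sketch: phase 1 scores candidates with Hamming distance
# over binarized, Matryoshka-truncated vectors (small enough to keep in RAM);
# phase 2 reranks the best candidates with full-precision vectors (on disk).
import numpy as np

rng = np.random.default_rng(42)
FULL_DIM, SHORT_DIM = 1024, 256   # Matryoshka: leading dims carry most signal
N_DOCS, K1, K2 = 10_000, 100, 10  # corpus size; candidates kept per phase

# Full-precision embeddings (stand-ins for real model output), unit-normalized.
full = rng.standard_normal((N_DOCS, FULL_DIM)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Truncate to the leading SHORT_DIM dims and pack sign bits into bytes."""
    bits = (vectors[:, :SHORT_DIM] > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)  # SHORT_DIM/8 bytes per vector

packed = binarize(full)  # the low-resolution, in-RAM representation

# Popcount lookup table for fast Hamming distance on packed bytes.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming(query_packed: np.ndarray, corpus_packed: np.ndarray) -> np.ndarray:
    """Hamming distance = popcount of the XOR of the packed bit vectors."""
    return POPCOUNT[np.bitwise_xor(corpus_packed, query_packed)].sum(axis=1)

def search(query: np.ndarray) -> np.ndarray:
    # Phase 1: cheap Hamming-distance scan over the binary vectors.
    q_packed = binarize(query[None, :])[0]
    candidates = np.argsort(hamming(q_packed, packed))[:K1]
    # Phase 2: exact dot product on full-precision vectors, survivors only.
    exact = full[candidates] @ query
    return candidates[np.argsort(exact)[::-1][:K2]]

query = full[123] + 0.1 * rng.standard_normal(FULL_DIM).astype(np.float32)
query /= np.linalg.norm(query)
print(search(query))  # doc 123 should rank near the top
```

At one bit per retained dimension, the in-RAM footprint here is 32 bytes per document versus 4 KB for the full float32 vector, which is where the “without breaking the bank” part comes from.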

@GetSparked B
by Kristian Aune