How to deploy AI and embeddings without breaking the bank

LLMs like GPT can give useful answers to many questions, but their output has well-known issues: responses may be outdated, inaccurate, or outright hallucinations, and it’s hard to know when you can trust them. They also know nothing about you or your organization’s private data (we hope). Retrieval-augmented generation (RAG) can reduce hallucinated answers and make responses more up-to-date, accurate, and personalized by injecting related knowledge, including non-public data, into the prompt. In this talk, we’ll go through ways you can implement RAG, including vector search, multi-vector retrieval, filtering, ranking, and hybrid search.
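As a taste of the vector-search step, here is a minimal sketch of the core retrieve-then-inject loop. Everything in it is illustrative rather than from the talk: the hashing `embed` function is a toy stand-in for a real embedding model, and the sample documents and prompt template are made up.

```python
# Minimal RAG retrieval sketch: embed documents, find the nearest ones to a
# query by cosine similarity, and inject them into the prompt.
import hashlib
import numpy as np

DIM = 64  # toy dimensionality; real embedders use hundreds to thousands

def embed(text: str) -> np.ndarray:
    """Deterministic toy embedding: hash tokens into a dense unit vector."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "Vespa supports hybrid search combining vector and text ranking.",
    "Binary quantization stores one bit per embedding dimension.",
    "Matryoshka embeddings allow truncating vectors to fewer dimensions.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Vector search: dot product equals cosine here (unit-normalized)."""
    scores = doc_vectors @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

query = "How does binary quantization work?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the LLM
```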

This talk will also cover the state of the art in quantization and dimensionality reduction, combining Matryoshka representation learning with binary quantization and Hamming distance (https://blog.vespa.ai/combining-matryoshka-with-binary-quantization-using-embedder/). This also makes the application economically viable: the embedding data is split into low-resolution vectors kept in RAM and high-resolution vectors kept on disk, with a two-phase ranking function for low-latency evaluation.
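The linked blog post describes Vespa’s implementation; the sketch below shows the general idea in plain NumPy, with all names, dimensions, and candidate counts made up for illustration: truncate the Matryoshka embedding to its leading dimensions, binarize it for a cheap in-RAM first phase scored by Hamming distance, then rerank the survivors with the full-precision vectors.

```python
# Two-phase retrieval sketch: phase 1 scores candidates with Hamming distance
# over binarized, Matryoshka-truncated vectors (small enough to keep in RAM);
# phase 2 reranks the best candidates with full-precision vectors (on disk).
import numpy as np

rng = np.random.default_rng(42)
FULL_DIM, SHORT_DIM = 1024, 256   # Matryoshka: leading dims carry most signal
N_DOCS, K1, K2 = 10_000, 100, 10  # corpus size; candidates kept per phase

# Full-precision embeddings (stand-ins for real model output), unit-normalized.
full = rng.standard_normal((N_DOCS, FULL_DIM)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Truncate to the leading SHORT_DIM dims and pack sign bits into bytes."""
    bits = (vectors[:, :SHORT_DIM] > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)  # SHORT_DIM/8 bytes per vector

packed = binarize(full)  # the low-resolution, in-RAM representation

# Popcount lookup table for fast Hamming distance on packed bytes.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming(query_packed: np.ndarray, corpus_packed: np.ndarray) -> np.ndarray:
    """Hamming distance = popcount of the XOR of the packed bit vectors."""
    return POPCOUNT[np.bitwise_xor(corpus_packed, query_packed)].sum(axis=1)

def search(query: np.ndarray) -> np.ndarray:
    # Phase 1: cheap Hamming-distance scan over the binary vectors.
    q_packed = binarize(query[None, :])[0]
    candidates = np.argsort(hamming(q_packed, packed))[:K1]
    # Phase 2: exact dot product on full-precision vectors, survivors only.
    exact = full[candidates] @ query
    return candidates[np.argsort(exact)[::-1][:K2]]

query = full[123] + 0.1 * rng.standard_normal(FULL_DIM).astype(np.float32)
query /= np.linalg.norm(query)
print(search(query))  # doc 123 should rank near the top
```

At one bit per retained dimension, the in-RAM footprint here is 32 bytes per document versus 4 KB for the full float32 vector, which is where the “without breaking the bank” part comes from.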

@GetSparked B
by Kristian Aune