Semantic search with Node.js and LangChain

LangChain is an innovative library that has recently burst onto the AI scene. Available for both JavaScript and Python, it's a versatile tool designed to streamline the integration of Large Language Models (LLMs) with other utilities like hnswlib, facilitating the swift development of your AI applications.

In today's blog post, we're going to leverage the power of LangChain to build a compact movie search application.

The beauty of this application lies in its use of OpenAI embeddings stored within the hnswlib vector store. Comparing embeddings with LangChain becomes a breeze, and we’ll build a dynamic, interactive front-end with Next.js.

If this sounds like a lot, just hold on: we're going to go through everything step by step, and you'll realize how easy it is to build your own semantic search engine.

So let's get started, and you'll see just how LangChain can supercharge your AI application development journey with very little code.

What is Semantic Search?

Before we delve into the intricacies of implementing semantic search, it's crucial that we first understand its essence and significance in the realm of data retrieval.

Before Semantic Search

In the pre-semantic search era, our search capabilities were confined to exact matches or the utilization of regular expressions.

When users conducted a search, our systems returned results closely matching the entered phrase.

However, a significant drawback of this approach was the lack of comprehension. Our servers, oblivious to the semantics of the words, conducted searches based purely on textual resemblance.

This meant that even a minor spelling error could derail the entire search process, resulting in irrelevant results or, worse, none at all.

After Semantic Search

With the advent of semantic search, it's as if our servers have gained a newfound understanding.

They're no longer simply matching data based on text characters; instead, they comprehend the meaning of words and how those words relate to each other within their context.

Search results are now filtered based on the intent behind the search query.

This transformative approach to information retrieval is what we'll be delving into in this blog post.

We'll be exploring the power of semantic search and demonstrating how to implement this advanced tool in a series of simple steps.

What are Embeddings?

In the context of machine learning, embeddings are a form of representation for categorical data, such as words or phrases, in a way that preserves the semantic relationships between them.

They are vectors (or arrays of numbers) where each point corresponds to a particular category or word. These vectors sit in a high-dimensional space, and the 'distance' and 'direction' between vectors capture the relationships between the corresponding words or categories.

When it comes to natural language processing, embeddings allow us to convert words or phrases into a form that a machine learning algorithm can understand.

The beauty of embeddings is that they capture more than just the standalone meaning of words - they also encapsulate the context and semantic relationships that exist between different words.

This makes them invaluable when dealing with tasks like semantic search, where understanding context and relationships is crucial.
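
To make the idea of "distance" between vectors concrete, here's a tiny TypeScript sketch of cosine similarity, the measure most vector stores use to compare embeddings. The three-dimensional vectors below are made up purely for illustration; real OpenAI embeddings have over a thousand dimensions.

const cosineSimilarity = (a: number[], b: number[]) => {
  // Dot product divided by the product of the vector lengths
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0)
  const magnitudeA = Math.sqrt(a.reduce((sum, value) => sum + value * value, 0))
  const magnitudeB = Math.sqrt(b.reduce((sum, value) => sum + value * value, 0))

  return dot / (magnitudeA * magnitudeB)
}

// Hypothetical 3-dimensional "embeddings": related words end up close together
const dog = [0.9, 0.1, 0.05]
const puppy = [0.85, 0.15, 0.1]
const spreadsheet = [0.05, 0.1, 0.95]

console.log(cosineSimilarity(dog, puppy)) // close to 1 (very similar)
console.log(cosineSimilarity(dog, spreadsheet)) // close to 0 (unrelated)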

For the purposes of our movie search application, we will be utilizing OpenAI's embeddings - specifically, the Ada v2 model (text-embedding-ada-002).

This model comes highly recommended by OpenAI due to its efficiency and cost-effectiveness. It enables us to convert our search queries and movie titles into meaningful vectors that capture the semantics of the words involved.

Once we have these embeddings, we'll be storing them in the hnswlib vector store - a high-performance library for nearest neighbour search.

This task is made straightforward with the LangChain library, which provides seamless integration between LLMs and other tools like hnswlib. This powerful combination of technologies lays the foundation for the semantic search capabilities of our application.
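
We'll wire everything up properly in the next sections, but as a quick preview, this is roughly what asking LangChain for an embedding looks like. The snippet assumes the OpenAIEmbeddings wrapper we use later in the post and an OPENAI_API_KEY in your .env file; the vector length mentioned in the comment (1,536 for Ada v2) is included only as an illustration.

require("dotenv").config()
import { OpenAIEmbeddings } from "langchain/embeddings/openai"

const run = async () => {
  const embeddings = new OpenAIEmbeddings()

  // embedQuery turns a single string into an array of numbers
  const vector = await embeddings.embedQuery("a tom cruise movie")

  console.log(vector.length) // 1536 dimensions for the Ada v2 model
  console.log(vector.slice(0, 5)) // a peek at the first few numbers
}

run()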

What is HNSWLIB?

HNSWLIB, or Hierarchical Navigable Small World graphs library, is an open-source, highly efficient approximate nearest neighbour (ANN) search library written in C++.

It implements the Hierarchical Navigable Small World (HNSW) algorithm, which is renowned for its remarkable speed and accuracy in high-dimensional spaces.

ANN search, which lies at the core of the HNSW algorithm, is a problem of finding data points in a dataset that are closest to a given point.

Traditional methods of performing the nearest-neighbour searches can be computationally expensive, especially when dealing with high-dimensional datasets.

This is where approximate methods, like the one implemented in HNSWLIB, come in - they provide a good trade-off between speed, accuracy, and memory usage.

In our movie search application, we'll be utilizing HNSWLIB as a vector store. This means we'll be storing the OpenAI embeddings (which are essentially high-dimensional vectors representing search queries and movie titles) in an HNSWLIB index.

When a user performs a search, we can quickly find the movies whose embeddings are nearest to the embedding of the search query - thus, effectively implementing semantic search.

Furthermore, the integration of HNSWLIB with the LangChain library allows us to handle the storage and retrieval of embeddings in an efficient and streamlined manner, paving the way for effective and responsive semantic searches.
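
LangChain will hide the index details from us, but if you're curious what sits underneath, here's a rough sketch of using the hnswlib-node package directly, based on its documented API. The tiny 4-dimensional vectors and the parameter values are placeholders for illustration; in our app, LangChain manages all of this for us.

import { HierarchicalNSW } from "hnswlib-node"

// Toy 4-dimensional vectors; real Ada v2 embeddings have 1,536 dimensions
const dimensions = 4
const maxElements = 100

// "cosine" tells the index to compare vectors by cosine distance
const index = new HierarchicalNSW("cosine", dimensions)
index.initIndex(maxElements)

// Each vector is stored under a numeric label of our choosing
index.addPoint([0.9, 0.1, 0.0, 0.2], 0)
index.addPoint([0.85, 0.2, 0.1, 0.1], 1)
index.addPoint([0.0, 0.1, 0.9, 0.8], 2)

// Find the 2 nearest neighbours of a query vector
const { neighbors, distances } = index.searchKnn([0.88, 0.15, 0.05, 0.15], 2)

console.log(neighbors) // labels of the two closest vectors
console.log(distances) // how far away each of them is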

While HNSWLIB is a powerful choice for storing and retrieving embeddings, it's not the only option at your disposal when working with the LangChain library.

LangChain provides seamless integration with a multitude of vector store options, each with its unique strengths, allowing you to choose the one that best suits your specific needs.

Here are some of the alternatives you might consider:

  • Memory Vector Store
  • Chroma
  • Elasticsearch
  • FAISS
  • LanceDB
  • Milvus
  • MongoDB Atlas
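
Swapping stores usually only means changing a couple of lines. As an illustration (we won't use it later in this tutorial), here's roughly what the in-memory option looks like; the import path matches the LangChain version used in this post and may differ in newer releases.

import { MemoryVectorStore } from "langchain/vectorstores/memory"
import { OpenAIEmbeddings } from "langchain/embeddings/openai"

const run = async () => {
  // Same fromTexts signature we'll use with HNSWLib, but nothing is
  // persisted to disk - handy for quick prototyping
  const vectorStore = await MemoryVectorStore.fromTexts(
    ["Mission: Impossible", "The Wizard of Oz"],
    [{ id: 1 }, { id: 2 }],
    new OpenAIEmbeddings()
  )

  const results = await vectorStore.similaritySearch("a tom cruise movie", 1)

  console.log(results[0].pageContent)
}

run()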

How Does Semantic Search Work?

Semantic search represents a paradigm shift in information retrieval. It aims to improve search accuracy by understanding the searcher's intent and the contextual meaning of the terms as they appear in the searchable dataspace.

To understand how semantic search operates, let's take our movie search application as an example.

First, we generate embeddings for each movie title in our database. These embeddings are essentially numerical representations of the movie titles and other data related to the same movie.

For example: what happens in the movie? What does the main character do to succeed? The more information you include when generating the embedding, the more effective the semantic search will be, and users will be able to find the same movie with very different queries.

For example, if you make the entire movie story an embedding, then people could easily search for movies just by describing some part of the story and get the exact movie they were trying to find.

Because embeddings are high-dimensional vectors, they tend to be large, and traditional methods of handling such data (like searching, reading, and looping through a typical relational database such as PostgreSQL or a document database like MongoDB) can be resource-intensive.

Instead, we store these embeddings in a specialized vector store like HNSWLIB. Vector stores are designed to handle high-dimensional data efficiently. They leverage Approximate Nearest Neighbors (ANN) algorithms, which enable us to search through these embeddings rapidly and accurately.

Once we have generated and stored embeddings for every movie title in our database, the next step is handling the user's search queries.

When a user enters a search term, we generate an embedding for this term in the same way we did for the movie titles. This embedding, which again captures the semantics of the user's query, is then compared with the stored movie title embeddings.

By comparing the user's search query embedding with the movie title embeddings, we can gauge their similarity.

This comparison doesn't just involve matching the exact phrases but also takes into account the meaning and context of the words. The system then returns the most similar movie titles as search results.

In essence, it’s like our server is understanding the user’s intent, and not just matching character strings.

Through this process, semantic search offers a more nuanced, accurate, and efficient approach to information retrieval. It provides a richer, more intuitive search experience by interpreting the user's intent and the context behind the search query.

Getting Started

Before we kickstart our journey into semantic search with Node.js and LangChain, it's important to ensure your development environment is set up correctly.

First and foremost, Node.js should be installed on your PC. If you haven't installed it yet, you can download it from the official Node.js website.

Once Node.js is installed, we will set up a new Next.js application. To do this, open a terminal and run the following command:

npx create-next-app@latest

This will trigger a prompt with several options. For the purposes of our tutorial, we'll be using the App Router and enabling TypeScript, Tailwind CSS, and ESLint. Choose the options according to your preferences.

Next, we need to install the necessary libraries for our project. Open a terminal in the root directory of your Next.js application, and use the following commands to install each library:

yarn add langchain hnswlib-node gpt-3-encoder

Note: If you're using a Windows PC, you might need to install Visual Studio before you can properly build the hnswlib-node package. This is due to the native code contained within the package.

With the above steps, you'll have a new Next.js project set up and ready, with all the required libraries installed. In the following sections, we'll delve into how to leverage these tools to build our semantic search application.

Database

For the database in our tutorial, we will be using a JSON file for simplicity's sake. Let's call this file movies.json. I’ve gotten the movie data from this open-source repository (thanks to them) and restructured it for our needs.

The movies.json file will store an array of movies, each represented by an object. The structure of each movie object will be as follows:

[
  {
    "id": "a8c1479b-dd6b-488f-970b-ec395cf1798b",
    "name": "The Wizard of Oz",
    "year": 1925,
    "actors": [
      "Dorothy Dwan",
      "Mary Carr",
      "Virginia Pearson"
    ],
    "storyline": "Dorothy, heir to the Oz throne, must take it back from the wicked Prime Minister Kruel with the help of three farmhands."
  }
]

You can add as many movie objects as you want to the movies.json file. Just ensure that they follow the same structure as the above example.
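
Since we're using TypeScript, it can also help to write this structure down as a type. The Movie interface below is just a hypothetical helper for documentation purposes; the code later in the post imports the JSON directly and lets TypeScript infer its shape.

// Shape of each record in movies.json
interface Movie {
  id: string // a UUID we use to map search results back to the full record
  name: string
  year: number
  actors: string[]
  storyline: string
}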

As we proceed, we will generate and store embeddings for each movie property in our movies.json database. These embeddings will later enable us to implement our semantic search.

Generating the Embeddings

Before we can begin serving search requests in our application, we need to generate embeddings for every movie in our database and save them into a vector store. These embeddings will provide the foundation for our semantic search capabilities.

Estimate OpenAI embeddings price

If you'd like to know how much generating your embeddings would cost, here's a small snippet of code to calculate it.

Just remember that this calculates the price for the Ada-2 model only; if you use a different model, change the second argument's value:

import { encode } from "gpt-3-encoder"

/**
 * @param text
 * @param pricePerToken Price per token (For ada v2 it is $0.0001/1k tokens = 0.0001/1000)
 * @returns Price in usd
 */
const estimatePrice = (text: string, pricePerToken = 0.0001 / 1000) => {
  const encoded = encode(text)

  const price = encoded.length * pricePerToken

  return price
}

export default estimatePrice

Now that we have a function to estimate the price of encoding a given text, let's apply it to our movie data.

Here's how you can use the estimatePrice function to calculate the total cost of generating embeddings for all your movies:

import movies from "../data/movies.json"
import estimatePrice from "./estimatePrice"

const run = async () => {
  const textsToEmbed = movies.map(
    (movie) =>
      `Title:${movie.name}\n\nyear: ${
        movie.year
      }\n\nactors: ${movie.actors.join(", ")}\n\nstoryline: ${
        movie.storyline
      }\n\n`
  )

  const price = estimatePrice(textsToEmbed.join("\n\n"))

  console.log(price)
}

run()

Running this script gives us a result of 0.0040587 - that is, roughly 40,587 tokens at $0.0001 per 1,000 tokens, or less than half a cent for the entire dataset. This low cost makes the use of OpenAI's services extremely affordable for our application.

Let's get back to generating embeddings. Here's how you can generate them for the movie data:

// ./functions/generateEmbeddings.ts
require("dotenv").config()
import { HNSWLib } from "langchain/vectorstores/hnswlib"
import { OpenAIEmbeddings } from "langchain/embeddings/openai"
import movies from "../data/movies.json"

const generateEmbeddings = async () => {
  try {
    const start = performance.now() / 1000

    const textsToEmbed = movies.map(
      (movie) =>
        `Title:${movie.name}\n\nyear: ${
          movie.year
        }\n\nactors: ${movie.actors.join(", ")}\n\nstoryline: ${
          movie.storyline
        }\n\n`
    )

    const metadata = movies.map((movie) => ({ id: movie.id }))

    const embeddings = new OpenAIEmbeddings()

    const vectorStore = await HNSWLib.fromTexts(
      textsToEmbed,
      metadata,
      embeddings
    )

    // saves the embeddings to the ./movies directory at the project root
    await vectorStore.save("movies")

    const end = performance.now() / 1000

    console.log(`Took ${(end - start).toFixed(2)}s`)
  } catch (error) {
    console.error(error)
  }
}

generateEmbeddings()

This script does the following:

  • It reads the list of movies from our movies.json file.
  • For each movie, it generates a single string combining the title, year, actors, and storyline.
  • It creates an array of these strings (textsToEmbed) and an associated array of metadata (metadata). Each element in the metadata array is an object with an id field corresponding to the id of the movie.
  • It creates a new instance of OpenAIEmbeddings which we'll use to convert our texts to embeddings.
  • It uses the fromTexts method of the HNSWLib class to generate embeddings for each string in textsToEmbed and store them in a new vector store.
  • Finally, it saves this vector store in the "movies" directory.

Remember to set your OpenAI API key in your .env file before running this script. The key should be named exactly "OPENAI_API_KEY" and LangChain will automatically pick it up.
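
For reference, the .env file at the project root only needs a single line; the value below is a placeholder, so use your own secret key from the OpenAI dashboard:

OPENAI_API_KEY=sk-...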

By following these steps, every movie in your database gets an associated embedding stored in the vector store, ready to be used in semantic searches.

Now we have our vector store saved in the ./movies directory and ready to be used to search for movies.

Creating the Search Functionality

With the embeddings stored in the ./movies directory, we are now ready to create the semantic search functionality in our application.

LangChain provides a convenient way to perform similarity searches on the embeddings we've generated.

The similaritySearch method from the HNSWLib class allows us to compare the embeddings we've generated and find the most similar entries based on a given text.

This powerful feature forms the backbone of our semantic search engine.

Here's how you can create a function that performs a similarity search:

require("dotenv").config()
import { OpenAIEmbeddings } from "langchain/embeddings/openai"
import { HNSWLib } from "langchain/vectorstores/hnswlib"

const search = async (text: string) => {
  try {
    const vectorStore = await HNSWLib.load("movies", new OpenAIEmbeddings())

    const results = await vectorStore.similaritySearch(text, 2) // returns only 2 entries

    results.forEach((r) => {
      console.log(r.pageContent.match(/Title:(.*)/)?.[0]) // Use regex to extract the title from the result text
    })
  } catch (error) {
    console.error(error)
  }
}

search("a tom cruise movie")

This function takes a string of text as input, loads our stored embeddings from the ./movies directory, and performs a similarity search on the embeddings.

The second argument to the similaritySearch method indicates the number of similar entries we want returned - in this case, we're asking for the top 2 most similar entries.

We then log the titles of the movies that were found to be most similar to the input text. In this example, the input text is "a tom cruise movie".

Running this function will output:

Title:Mission: Impossible
Title:Mission Impossible III

This is exactly what we want! With this, we have successfully implemented semantic search in our application.

The search functionality now understands the meaning of the search query and finds the most relevant results, rather than just looking for exact matches in the text.
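
If you also want to see how close each match is, LangChain's vector stores offer a scored variant of the same call, similaritySearchWithScore, which returns document/score pairs instead of bare documents. Here's a minimal sketch; note that for HNSWLib the score is a distance, so smaller values generally mean closer matches.

require("dotenv").config()
import { OpenAIEmbeddings } from "langchain/embeddings/openai"
import { HNSWLib } from "langchain/vectorstores/hnswlib"

const searchWithScores = async (text: string) => {
  try {
    const vectorStore = await HNSWLib.load("movies", new OpenAIEmbeddings())

    // Each result is a [document, score] pair
    const results = await vectorStore.similaritySearchWithScore(text, 2)

    results.forEach(([doc, score]) => {
      console.log(doc.pageContent.match(/Title:(.*)/)?.[0], score.toFixed(3))
    })
  } catch (error) {
    console.error(error)
  }
}

searchWithScores("a tom cruise movie")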

Building the API

Having created the semantic search functionality, we are now ready to implement an API that will allow clients to use this feature. We will build this API using a Next.js route handler.

To do so, create a file at ./app/api/search/route.ts and add the following code:

import { OpenAIEmbeddings } from "langchain/embeddings/openai"
import { HNSWLib } from "langchain/vectorstores/hnswlib"
import { NextResponse } from "next/server"
import movies from "@/data/movies.json"

export async function GET(req: Request) {
  const { searchParams } = new URL(req.url)

  const q = searchParams.get("q")

  if (!q) {
    return new NextResponse(JSON.stringify({ message: "Missing query" }), {
      status: 400,
    })
  }

  const vectorStore = await HNSWLib.load("movies", new OpenAIEmbeddings())

  const searchResult = await vectorStore.similaritySearch(q, 5)

  const searchResultIds = searchResult.map((r) => r.metadata.id)

  const results = movies.filter((movie) => searchResultIds.includes(movie.id))

  return NextResponse.json({ results })
}

Here's what the code does:

  • We define a GET function that handles GET requests to the API.
  • The function extracts the query string q from the URL of the request.
  • The function loads our stored embeddings and performs a similarity search on them using the provided query string.
  • It then filters the list of movies to include only those whose IDs match the metadata of the search results.
  • Finally, the function returns a JSON response that contains these filtered movies.

This function forms the core of our API. It allows clients to search for movies based on the semantic meaning of a provided query string. We have limited the number of results to the top 5 most similar entries to keep our API responses manageable.

Here’s an example response:

{
  "results": [
    {
      "id": "a203e314-7ce0-429e-945e-81c1ea905d9f",
      "name": "Mad Max 3: Beyond Thunderdome",
      "year": 1985,
      "actors": [
        "Mel Gibson",
        "Tina Turner"
      ],
      "storyline": "Max is exiled into the desert by the corrupt ruler of Bartertown, Aunty Entity, and there encounters an isolated cargo cult centered on a crashed Boeing 747 and its deceased captain."
    }
  ]
}

The Frontend

Next.js and Tailwind CSS are great for rapidly developing sleek and modern web applications. The user interface (UI) for our movie search app is straightforward: we need a search input to enter queries, and a space to display the resulting movies.

First, install the react-use library. It provides a variety of helpful hooks, but we only need useAsyncFn, which handles asynchronous functions for us.

You can install it using Yarn with the following command:

yarn add react-use

Now, let's build our interface. Here's a basic front-end setup that connects to the API we previously created, leveraging the useAsyncFn hook from react-use to handle the asynchronous request:

"use client"
import { useState } from "react"
import { useAsyncFn } from "react-use"

interface SearchResults {
  results: {
    id: string
    name: string
    year: number
    actors: string[]
    storyline: string
  }[]
}

export default function Home() {
  const [query, setQuery] = useState("")

  const [{ value, loading }, search] = useAsyncFn<() => Promise<SearchResults>>(
    async () => {
      const response = await fetch("/api/search?q=" + encodeURIComponent(query))
      const data = await response.json()
      return data
    },
    [query]
  )

  return (
    <main className="flex min-h-screen flex-col items-center p-5 lg:p-24 w-full mx-auto">
      <h1 className="text-4xl font-bold text-center">Search Movies</h1>

      <form
        onSubmit={(e) => {
          e.preventDefault()
          search()
        }}
      >
        <input
          type="text"
          name="search"
          className="border-2 border-gray-300 bg-black mt-3 h-10 px-5 pr-16 rounded-lg text-sm focus:outline-none"
          placeholder="Search anything"
          value={query}
          onChange={(e) => setQuery(e.target.value)}
        />
      </form>

      <div className="mt-10">
        {loading ? (
          <div>Loading...</div>
        ) : (
          <div className="flex flex-wrap gap-5">
            {value?.results.map((movie) => (
              <div
                key={movie.id}
                className="flex flex-col bg-gray-800 rounded-lg shadow-lg p-5 w-full max-w-sm"
              >
                <h2 className="text-xl font-bold">{movie.name}</h2>
                <p className="text-sm">{movie.year}</p>
                <p className="text-sm">{movie.actors.join(", ")}</p>
                <p className="text-sm">{movie.storyline}</p>
              </div>
            ))}
          </div>
        )}
      </div>
    </main>
  )
}

This UI component comprises a search bar and a list of results. When a user types a query into the search bar and submits it, a request is made to our API, and the results are displayed on the screen.

It's important to note that the useAsyncFn hook automatically handles the loading state for us, showing a loading message until the results are fetched and ready to be displayed.


With that, our front-end works nicely with the API, and our semantic search functionality is fully set up and ready to go.

Conclusion

In this tutorial, we have explored a modern approach to building an effective and economical semantic search engine.

We used LangChain and OpenAI embeddings, along with HNSWLib to store the embeddings, allowing us to create a semantic search engine for a collection of movies.

This approach showcases how language models can be leveraged to provide powerful features with affordable costs, thanks to the efficiency of OpenAI's Ada v2 model and the convenience of the LangChain library.

The use of semantic search provides a more intuitive and user-friendly search experience, helping users find more relevant results even if the exact phrasing or keywords aren't used.

But we've only just scratched the surface of what's possible. There's an abundance of vector stores to explore for storing data, each offering its own unique capabilities and trade-offs.

Elasticsearch, for instance, is an option that would enable full-text search functionality and complex querying options.

As AI technologies continue to advance and evolve, so do the possibilities for what we can achieve with them. Whether it's building a semantic search engine, a recommendation system, or another application, the potential is limited only by our imagination.

Comments


Jim

July 18, 2023

Does the OpenAI cost apply just to creating the embeddings the first time? or every time a user searches?

July 23, 2023

@Jim Every time a user searches, a new request is sent to OpenAI to generate an embedding for the query.

But the cost is very low: if your average search query is 5 tokens, you can handle about 2 million searches for a single dollar at the current Ada-2 pricing ($0.0001 per 1,000 tokens).

There are workarounds: you could cache the search-query embeddings in your own database, but embeddings are large and take up a lot of space, so in practice it's simpler to send a request to OpenAI each time a user searches.