Revolutionizing Conversational AI with OpenAI Embeddings

A step-by-step guide to enhance your chatbot capabilities using OpenAI embeddings.

Recently, OpenAI launched ChatGPT, an AI-powered chatbot built on a large language model (LLM) that can answer an impressively wide range of questions. And as if that's not enough, the underlying models keep improving.

Have you ever wondered if it could be utilized for your business or project? Well, guess what? It can be! There are different approaches to utilizing ChatGPT’s capabilities to enhance traditional chatbots.

Before we start

Before we get started, let’s get some ideas about OpenAI. OpenAI is a research and development company that provides different APIs for text completion, code completion (Codex), image generation (DALL-E), embedding, fine-tuning, and more.

In this example, we are going to utilize text completion and embedding APIs.

Let’s also understand what semantic search and embedding are. Semantic search is a way of searching by understanding the searcher’s intent, query context, and the relationship between words to generate accurate answers.

[Figure: lexical search vs. semantic search. Credit: Seobility]

Embedding is the process of converting high-dimensional data (such as text) into a lower-dimensional vector of real-valued numbers, in such a way that semantically similar inputs end up close to each other in the vector space.

[Figure: text embeddings as vectors. Credit: OpenAI]
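To make this concrete, here is a toy sketch of what "semantically similar" means for vectors: the closer two embeddings point in the same direction, the higher their cosine similarity. The three-dimensional vectors below are made up purely for illustration; real OpenAI embeddings have thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration only.
king = [0.9, 0.8, 0.1]
queen = [0.88, 0.82, 0.15]
banana = [0.1, 0.2, 0.95]

# Related words land close together, unrelated ones far apart.
assert cosine_similarity(king, queen) > cosine_similarity(king, banana)
```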

Now that we understand these two concepts, let's dive into the interesting part!

Getting started

The main objective of this guide is to demonstrate how embeddings can be used to expand your bot's knowledge. Currently, there are three main ways to extend the GPT models with your own knowledge base:

Fine-tuning: a straightforward approach, but you have little control over the model's responses beyond the initial prompt engineering.

Embeddings: a better approach for extending the model's domain-specific knowledge, allowing more flexibility and control over the generated output.

Codex: this approach helps if you have a SQL database as a data source. Here, SQL queries are generated and run against the database based on the user's input.

In this article, we will go through the embedding approach.

The Dataset

The first step towards creating a chat assistant is to prepare the data that will be used as a knowledge base. For this, we are using an Amazon product dataset. You can get the CSV file from here as well.

[Figure: a preview of the Amazon product dataset]

From the available columns in the dataset, we are more interested in the product title, product description, category, brand, price, and availability.

We will create a new column titled “text,” which contains data from all the columns mentioned above. We can use this new column (text) to generate embeddings.

# Concatenate the relevant columns into a single "text" column.
# Cast non-string columns (e.g. Price) to str so concatenation works.
data["text"] = (
    "Category: " + data["Category"]
    + "; Product Title: " + data["Product Title"]
    + "; Product Description: " + data["Product Description"]
    + "; Brand: " + data["Brand"]
    + "; Price: " + data["Price"].astype(str)
    + "; Stock Availability: " + data["Stock Availibility"].astype(str)
)

Creating Embeddings

There are several models available for creating embeddings. The size of the embedding depends on the model that we choose. Below is a list of different OpenAI models, along with their respective embedding sizes.

Note that models with more dimensions cost more, and their embeddings generally produce more accurate results.


Model     Dimensions
Ada       1024
Babbage   2048
Curie     4096
Davinci   12288
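Since we will later have to tell the vector database how wide our vectors are, it helps to keep these sizes in one place. The mapping below simply restates the table above; `EMBEDDING_DIMS` is our own helper name, not an OpenAI constant.

```python
# Embedding width per model (from the table above). The vector database
# field that stores the embeddings must be declared with a matching size.
EMBEDDING_DIMS = {"ada": 1024, "babbage": 2048, "curie": 4096, "davinci": 12288}

model = "babbage"
vector_dim = EMBEDDING_DIMS[model]  # 2048
```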

To proceed further, we need an OpenAI API key. You can create an account and get the API key from here. Additionally, we need to install an OpenAI client to access the API.

# requirement
pip install openai

import openai
from openai.embeddings_utils import get_embedding

openai.api_key = "<API_KEY>"
model = "babbage"

data["embeddings"] = data.text.apply(
    lambda x: get_embedding(x, engine=f"text-search-{model}-doc-001")
)
data.head()

We can use any model to generate the embeddings. Here, we are choosing Babbage to balance between accuracy and cost.

We also need to generate a token count for each row. The following code snippet will add a column titled “n_tokens,” which contains the total count of tokens for each row (only for the “text” column).

# requirement
pip install transformers
from transformers import GPT2TokenizerFast


tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
data['n_tokens'] = data.text.apply(lambda x: len(tokenizer.encode(x)))
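The token counts matter because completion models have a fixed context window, so the product texts we later stuff into the prompt must fit a budget. Below is a minimal sketch of that selection step, assuming the rows are already sorted by relevance; the budget of 1500 tokens is an illustrative number, not an API constant.

```python
MAX_CONTEXT_TOKENS = 1500  # illustrative budget, not an API constant

def select_contexts(rows, max_tokens=MAX_CONTEXT_TOKENS):
    """rows: (text, n_tokens) pairs sorted by relevance; keep rows until the budget is full."""
    chosen, used = [], 0
    for text, n_tokens in rows:
        if used + n_tokens > max_tokens:
            break
        chosen.append(text)
        used += n_tokens
    return chosen
```

For example, with three rows of 1000, 400, and 300 tokens, only the first two fit in the budget.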

Here is a sample of the embedding column.

NOTE: If you hit a rate limit here, perform this operation in batches.

The snippet below will do the job.

from time import sleep

from openai.error import RateLimitError

# Create an embedding for each row in a loop.
# If the rate limit is hit, wait for 60 seconds and retry the row.
embed_list = []
for id, row in data.iterrows():
    try:
        embed = get_embedding(str(data.text[id]), engine=f'text-search-{model}-doc-001')
        embed_list.append(embed)
    except RateLimitError:
        print(id)
        sleep(60)
        embed = get_embedding(str(data.text[id]), engine=f'text-search-{model}-doc-001')
        embed_list.append(embed)

data['embeddings'] = embed_list

Choosing a vector database

Now, we need to choose a vector database to store the embeddings we have generated. But wait, what is a vector database, and why do we need one?

In this era of AI/ML, we need tools/technology to store collections of representations of words, sentences, paragraphs, images, or documents called embeddings. With the focus on deep learning models in AI-based applications, we need systems to store and retrieve such large-sized data (vectors) in considerable quantities in real-time.

This is where a vector database comes into the picture. These databases can store and index embeddings and help perform semantic search rather than an exact match with higher speed, accuracy, and flexibility.
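Conceptually, the core operation a vector database speeds up is simply "find the stored vectors closest to a query vector". A naive linear scan looks like the sketch below; databases like Milvus index the vectors (e.g. with IVF_FLAT) so that retrieval stays fast at millions of rows instead of costing O(n) per query.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, vectors, k=2):
    """Return the indices of the k stored vectors nearest to the query."""
    scored = sorted(enumerate(vectors), key=lambda iv: l2_distance(query, iv[1]))
    return [i for i, _ in scored[:k]]
```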

There are mainly two options to choose from:

  1. Self-hosted open-source database
  2. Managed cloud database

In the open-source category, we have options like Milvus, Weaviate, and Typesense. You may also use a managed service like Pinecone, or libraries such as Faiss (and Redis with its vector search module).

Here, we are going with the Milvus database, an open-source option. You can find the installation guide here.

# Requirements
pip install pymilvus python-dotenv

import os
from traceback import format_exc

from dotenv import load_dotenv
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections
from pymilvus.exceptions import (
    CollectionNotExistException,
    MilvusException,
    SchemaNotReadyException,
)

load_dotenv()

DEFAULT_INDEX_PARAMS = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 2048},
}

INDEX_DATA = {
    "field_name": "embeddings",
    "index_params": DEFAULT_INDEX_PARAMS,
    "index_name": "amzn_semantic_search",
}

# You need to define these variables in an environment (.env) file
def connect_db(
    alias: str = os.getenv("VECTOR_DB_ALIAS"),
    host: str = os.getenv("VECTOR_DB_HOST"),
    port: str = os.getenv("VECTOR_DB_PORT"),
    user: str = os.getenv("VECTOR_DB_USER"),
    password: str = os.getenv("VECTOR_DB_PASSWORD"),
):
    """Connect to the database.

    Args:
        alias (str): Connection alias.
        host (str): Database host.
        port (str): Database port.
        user (str): Database user.
        password (str, optional): Database password.
    """
    connections.connect(
        alias=alias,
        host=host,
        port=port,
        user=user,
        password=password,
    )

# Fields
id = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, description="ID")
embeddings = FieldSchema(
    # dim must match the embedding model's output size (babbage doc search: 2048)
    name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=2048, description="Embeddings"
)
metadata = FieldSchema(
    name="metadata", dtype=DataType.VARCHAR, max_length=20000, description="Metadata"
)
schema = CollectionSchema(
    fields=[id, embeddings, metadata], description="Amazon Product search"
)


# Create collection
def create_collection(name: str = "amzn_data"):
    collection = Collection(name=name, schema=schema, using="default", shards_num=2)
    return collection


def get_or_create_collection(
    name: str = "amzn_data",
    create_index: bool = True,
    index_data: dict = INDEX_DATA,
    load_data: bool = False,
):
    """Fetch the collection object, or create it if it does not exist.

    Args:
        name (str, optional): Collection name. Defaults to "amzn_data".
        create_index (bool, optional): If True, create an index. Defaults to True.
        index_data (dict, optional): Used to create the index when create_index=True.
        load_data (bool, optional): If True, insert data into the created collection. Defaults to False.

    Returns:
        Collection: Milvus collection
    """
    try:
        # Connect to the database
        connect_db()

        # Fetch the collection object by name
        collection = Collection(name)
    except Exception as exception:
        print(exception)
        print("Creating Collection...")

        # If the collection is not available, create it
        collection = create_collection(name=name)

        # Create the index if unavailable, using the INDEX_DATA defined above
        if create_index and index_data:
            collection.create_index(**index_data)
        if load_data:
            # Pass the latest dataframe with the n_tokens and embeddings columns available
            insert_data(collection_name=collection, dataframe=df)
    finally:
        collection.load()
        return collection


import json


def insert_data(collection_name: Collection, dataframe):
    """Insert data into the database."""
    try:
        final_values = []
        index_list = [i for i in range(len(dataframe["embeddings"]))]
        emb_list = dataframe["embeddings"].to_list()
        # The metadata field is a VARCHAR, so serialize each dict to a JSON string.
        metadata_list = [
            json.dumps(
                {
                    "Category": row["Category"],
                    "Price": row["Price"],
                    "Brand": row["Brand"],
                    "Stock Availibility": row["Stock Availibility"],
                    "Image Urls": row["Image Urls"],
                    "n_tokens": row["n_tokens"],
                }
            )
            for _, row in dataframe.iterrows()
        ]

        final_values.append(index_list)
        final_values.append(emb_list)
        final_values.append(metadata_list)

        collection_name.insert(final_values)
    except Exception as exception:
        print(exception)
        print(format_exc())

Please note that in the above code, most of the functions are customized for the dataset we are using. For any other dataset that has different columns, please make the necessary modifications.

Although the choice of distance function does not matter much for normalized embeddings, the official documentation suggests cosine similarity. Since OpenAI embeddings are normalized to unit length, the L2 metric used above produces the same ranking.
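The reason the choice matters so little is that for unit-length vectors, squared L2 distance and cosine similarity are monotonically related: ||a − b||² = 2 − 2·cos(a, b). The small check below demonstrates this identity on hand-made normalized vectors.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = normalize([3.0, 4.0]), normalize([1.0, 1.0])
cos = sum(x * y for x, y in zip(a, b))
l2_squared = sum((x - y) ** 2 for x, y in zip(a, b))

# For unit vectors, squared L2 distance and cosine similarity agree.
assert abs(l2_squared - (2 - 2 * cos)) < 1e-9
```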

Chatbot workflow

As visible in the image below, we will generate embeddings for the question and perform a semantic search in the vector database where we have stored embeddings for the dataset.

Once we get a feasible set of answers from the database, we pass them to OpenAI's Completion API along with some instructions to generate a well-formatted response in complete sentences.

[Figure: retrieval-augmented chatbot workflow. Credit: cohere.ai]
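The steps above can be sketched as a small pipeline. Here, `embed_question`, `search_vectors`, and `complete` are stand-ins for the OpenAI embeddings call, the Milvus `Collection.search()` call, and the OpenAI Completion call respectively; the prompt template is purely illustrative.

```python
def build_prompt(question, contexts):
    """Combine the retrieved product texts with the user's question."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question, embed_question, search_vectors, complete, top_k=3):
    query_vector = embed_question(question)            # OpenAI embeddings API
    contexts = search_vectors(query_vector, top_k)     # Milvus semantic search
    return complete(build_prompt(question, contexts))  # OpenAI Completion API
```

Wiring in the real calls means passing, for example, `lambda q: get_embedding(q, engine=...)` as `embed_question`.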

The result

Based on the question, the chatbot will provide answers in human-like language. If the answer is unavailable, it will ask to rephrase the question or mention that it is out of scope.

Limitations

  • Due to how contexts are retrieved, the bot can only carry a conversation about one topic at a time. Asking a question about a different topic mid-chat confuses it with the previous context, and it will no longer generate accurate results, although it can sound very convincing!
  • To overcome this, we have a reset chat option available.
  • Sometimes, it might generate answers that seem pretty convincing but could be incorrect.

TL;DR

OpenAI’s embedding chatbot uses advanced machine learning techniques to generate natural and contextually relevant responses to user input. The chatbot generates embeddings for user questions, performs a semantic search in a vector database, and uses OpenAI’s Completion method to generate responses.

While the chatbot’s responses are not yet indistinguishable from those of a human, they are advanced enough to provide value in many real-world scenarios. The technology has numerous applications in industries such as customer service, healthcare, and education, where having human-like conversations with machines can greatly improve efficiency and accessibility.

Conclusion

The development of OpenAI’s embedding chatbot is a significant advancement in the field of natural language processing and conversational AI, with clear potential across the industries mentioned above.

However, it is important to note that while the technology is highly advanced, there is still a long way to go in terms of making chatbots truly indistinguishable from humans in terms of conversational abilities. Nonetheless, OpenAI’s embedding chatbot represents a significant step forward in this direction and will continue to push the boundaries of what is possible with conversational AI.
