In this article, we’ll test the ability of SingleStoreDB to store and query the OpenAI Wikipedia Vector Database dataset. We’ll see that SingleStoreDB can manage this dataset with ease. SingleStoreDB has supported a range of vector functions for some time, and these functions are ideally suited for modern applications using GPT technology.
The notebook file used in this article is available on GitHub.
Introduction
In several previous articles, we have used some of the vector capabilities built into SingleStoreDB:
- Quick tip: SingleStoreDB’s EUCLIDEAN_DISTANCE and JSON_ARRAY_PACK functions
- Using SingleStoreDB, Spark, and Alternating Least Squares (ALS) to build a Movie Recommender System
In this article, we’ll test the JSON_ARRAY_PACK and DOT_PRODUCT vector functions with the OpenAI Wikipedia Vector Database dataset.
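As a quick illustration of how these two functions work together, the following query (using made-up three-element vectors) packs two JSON arrays into binary vectors and returns their dot product:

-- Illustrative example with made-up vectors:
-- JSON_ARRAY_PACK converts a JSON array into a packed binary vector,
-- and DOT_PRODUCT returns the dot product of two packed vectors
SELECT DOT_PRODUCT(
    JSON_ARRAY_PACK('[0.1, 0.2, 0.3]'),
    JSON_ARRAY_PACK('[0.4, 0.5, 0.6]')
) AS dot_product;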
There is an OpenAI notebook available on GitHub under an MIT License that tests several Vector Database systems. The tests can be run using local clients or in the cloud. In this article, we’ll use a local installation of SingleStoreDB.
Install SingleStoreDB
In this article, we’ll install SingleStoreDB in a Virtual Machine (VM) environment. It takes just a few minutes. A previous article described the steps. Alternatively, we could use a Docker image.
For this article, we’ll only need two tarball files for the VM installation:
- singlestoredb-toolbox
- singlestoredb-server
Assuming a two-node cluster was correctly deployed and using the same variable names from the previous article, we can connect to our cluster from a MySQL CLI Client as follows:
mysql -u root -h ubuntu -P 3306 --default-auth=mysql_native_password -p
Once connected to our cluster, we’ll create a new database as follows:
CREATE DATABASE IF NOT EXISTS openai_demo;
Install Jupyter
From the command line, we’ll install the classic Jupyter Notebook as follows:
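pip install notebook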
OpenAI API Key
Before launching Jupyter, we must create an account on the OpenAI website. This provides some free credits. Since we will use embeddings, the cost will be minimal. We’ll also need to create an OpenAI API Key. This can be created from USER > API keys in our OpenAI account.
From the command line, we’ll export the OPENAI_API_KEY variable in our environment, as follows:
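export OPENAI_API_KEY="<OpenAI API Key>"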
Replace <OpenAI API Key> with your key.
Next, we’ll launch Jupyter as follows:
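jupyter notebook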
Fill Out the Notebook
Let’s now create a new notebook. We’ll adhere to the flow and structure of the OpenAI notebook and use some small code sections directly from the notebook, where required.
Setup
First, some libraries:
!pip install matplotlib
!pip install openai
!pip install plotly.express
!pip install scikit-learn
!pip install singlestoredb
!pip install tabulate
!pip install wget
Next, some imports:
import openai
import pandas as pd
import os
import wget
from ast import literal_eval
Then, the embedding model:
EMBEDDING_MODEL = "text-embedding-ada-002"
Load Data
We’ll now obtain the Wikipedia dataset:
embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# The file is ~700 MB, so this will take some time
wget.download(embeddings_url)
And unpack it:
import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("data")
Next, we’ll load the file into a Pandas Dataframe:
article_df = pd.read_csv(
    "data/vector_database_wikipedia_articles_embedded.csv"
)
And we’ll take a look at the first few lines as follows:
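article_df.head()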
The next operation from the OpenAI notebook can take a while:
# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
Next, we’ll look at the dataframe info:
article_df.info(show_counts=True)
The result should show the seven columns, along with their non-null counts and dtypes.
Create Table
Let’s now create a connection to our local installation of SingleStoreDB:
import singlestoredb as s2

conn = s2.connect("root:@:3306/openai_demo")

cur = conn.cursor()
We’ll replace the password and host values in the connection string with the values that we used earlier at installation time.
We’ll now create a table as follows:
stmt = """ CREATE TABLE IF NOT EXISTS wikipedia ( id INT PRIMARY KEY, url VARCHAR(255), title VARCHAR(100), text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci, title_vector BLOB, content_vector BLOB, vector_id INT ) """ cur.execute(stmt)
Populate Table
We can populate our database table as follows:
# Prepare the statement
stmt = """
INSERT INTO wikipedia (
    id,
    url,
    title,
    text,
    title_vector,
    content_vector,
    vector_id
) VALUES (
    %s,
    %s,
    %s,
    %s,
    JSON_ARRAY_PACK_F64(%s),
    JSON_ARRAY_PACK_F64(%s),
    %s
)
"""

# Convert the DataFrame to a NumPy record array
record_arr = article_df.to_records(index=False)

# Set the batch size
batch_size = 1000

# Iterate over the rows of the record array in batches
for i in range(0, len(record_arr), batch_size):
    batch = record_arr[i:i+batch_size]
    values = [(
        row[0],
        row[1],
        row[2],
        row[3],
        str(row[4]),
        str(row[5]),
        int(row[6])
    ) for row in batch]
    cur.executemany(stmt, values)
We can also use JSON_ARRAY_PACK_F32 (32-bit, IEEE standard format) instead of JSON_ARRAY_PACK_F64 (64-bit, IEEE standard format).
Loading the data should take a short time. We can use other data loading methods, such as pipelines, for larger datasets.
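For example, a SingleStoreDB pipeline could load the CSV file directly from the file system. The following is only a sketch, not code from the OpenAI notebook: the pipeline name and file path are hypothetical, and the vector columns are packed with a SET clause:

-- Hypothetical pipeline sketch: adjust the name and path for your environment
CREATE PIPELINE wikipedia_pipeline AS
LOAD DATA FS '/path/to/data/vector_database_wikipedia_articles_embedded.csv'
INTO TABLE wikipedia
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES
(id, url, title, text, @title_vector, @content_vector, vector_id)
SET title_vector = JSON_ARRAY_PACK_F64(@title_vector),
    content_vector = JSON_ARRAY_PACK_F64(@content_vector);

START PIPELINE wikipedia_pipeline;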
Search Data
First, we’ll import the following:
from openai.embeddings_utils import get_embedding
Next, we’ll check that the OPENAI_API_KEY can be read, as follows:
if os.getenv("OPENAI_API_KEY") is not None:
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")
We’ll now define a Python function that will allow us to use either of the two vector columns in the database:
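Here is a minimal sketch of such a function, assuming the cursor created earlier. The name search_wikipedia and its parameters are chosen to match the calls below; it embeds the query, then ranks rows by DOT_PRODUCT similarity against the chosen vector column:

def search_wikipedia(query, column1, column2, num_rows = 10):
    # Get the embedding of the query text
    query_embedding = get_embedding(query, EMBEDDING_MODEL)

    # Rank rows by the dot product of the packed query vector
    # and the chosen vector column, highest scores first
    stmt = f"""
        SELECT {column1}, DOT_PRODUCT(
            JSON_ARRAY_PACK_F64(%s), {column2}
        ) AS score
        FROM wikipedia
        ORDER BY score DESC
        LIMIT %s
    """
    cur.execute(stmt, [str(query_embedding), num_rows])
    results = cur.fetchall()

    # Split the rows into separate lists of values and scores
    values = [row[0] for row in results]
    scores = [row[1] for row in results]
    return values, scores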
We can test SingleStoreDB using the two examples in the OpenAI notebook. First, we’ll use title and title_vector:
values1, scores1 = search_wikipedia(
    query = "modern art in Europe",
    column1 = "title",
    column2 = "title_vector",
    num_rows = 5
)
We’ll format the results using the following:
from tabulate import tabulate

# Combine the values and scores lists into a list of tuples
# Each tuple contains a value and its corresponding score
table_data1 = list(zip(values1, scores1))

# Add a rank column to the table data
table_data1 = [(i + 1,) + data for i, data in enumerate(table_data1)]

# Create the table
table1 = tabulate(table_data1, headers=["Rank", "Title", "Score"])

# Print the table
print(table1)
The output should be similar to the following:
  Rank  Title                    Score
------  --------------------  --------
     1  Museum of Modern Art  0.875081
     2  Western Europe        0.867523
     3  Renaissance art       0.864172
     4  Pop art               0.860346
     5  Northern Europe       0.854755
Next, we’ll use text and content_vector:
values2, scores2 = search_wikipedia(
    query = "Famous battles in Scottish history",
    column1 = "text",
    column2 = "content_vector",
    num_rows = 5
)
We’ll format the results using the following:
# Combine the values and scores lists into a list of tuples
# Each tuple contains a value and its corresponding score
table_data2 = list(zip([value[:50] for value in values2], scores2))

# Add a rank column to the table data
table_data2 = [(i + 1,) + data for i, data in enumerate(table_data2)]

# Create the table
table2 = tabulate(table_data2, headers=["Rank", "Text", "Score"])

# Print the table
print(table2)
The output should be similar to the following:
  Rank  Text                                                   Score
------  --------------------------------------------------  --------
     1  The Battle of Bannockburn, fought on 23 and 24 Jun   0.869338
     2  The Wars of Scottish Independence were a series of   0.86148
     3  Events                                                0.852533
        January 1 – Charles II crowned King of
     4  The First War of Scottish Independence lasted from   0.849642
     5  Robert I of Scotland (11 July 1274 – 7 June 1329)    0.846184
Summary
In this article, we’ve seen that SingleStoreDB can store vectors with ease, and that we can store other data types, such as numeric and text, in the same table. With its powerful SQL and multi-model support, SingleStoreDB provides a one-stop solution for modern applications, bringing both technical and business benefits through a single product.