The rapid growth of AI and ML applications, particularly those involving large-scale data analysis, has increased the demand for vector databases that can efficiently store, index, and query vector embeddings. Therefore, vector databases like MyScale, Pinecone, and Qdrant continue to be developed and expanded to meet these requirements.
At the same time, traditional databases continue to improve their vector data storage and retrieval capabilities. For instance, the well-known relational database PostgreSQL and its
pgvector extension provide similar functionality, albeit less effectively than a well-optimized vector database.
There are significant differences in performance, accuracy, and other aspects when using a general-purpose database like PostgreSQL. These differences can lead to bottlenecks in performance and data scale. To address these issues, upgrading to a more efficient vector database like MyScale is recommended.
What sets MyScale apart from other specialized vector databases is its ability to provide full SQL support without compromising on high performance. This makes the migration process from PostgreSQL to MyScale much smoother and simpler.
To add value to this statement, let’s consider a use case where we have vector data stored in a PostgreSQL database but have performance and data scale bottlenecks. Therefore, as a solution, we have decided to upgrade to MyScale.
A core part of upgrading to MyScale from PostgreSQL is to migrate the data from the old to the new database. Let’s look at how this is done.
Note: To demonstrate how to migrate data from PostgreSQL to MyScale, we must set up both databases even though our use case notes that we already have vector data in a PostgreSQL database.
Before we begin, it’s important to note that we will use the following environments and datasets:
We used the first 1M rows (1,000,448 rows exactly) from the LAION-400-MILLION OPEN DATASET for this exercise to demonstrate a scenario where the data scale continues to increase after migration.
Note: This dataset has 400 million entries, each consisting of a 512-dimensional vector.
Load the Data Into PostgreSQL
If you already know about PostgreSQL and pgvector, you can skip ahead to the migration process by clicking here.
The first step is to load the data into a PostgreSQL database by working through the following step-by-step guide:
Create PostgreSQL Database and Set Environment Variables
Create a PostgreSQL instance with
pgvector in Supabase as follows:
- Navigate to the Supabase website and log in.
- Create an organization and name it.
- Wait for the organization and its database to be created.
- Enable the
Once the PostgreSQL database has been created and
pgvector enabled, the next step is to configure the following three environment variables to establish the connection to PostgreSQL using psql:
export PGHOST=db.qwbcctzfbpmzmvdnqfmj.supabase.co export PGUSER=postgres export PGPASSWORD='********'
Additionally, run the following script to increase the memory and request duration limits, making sure any SQL queries we run are not interrupted:
$ psql -c "SET maintenance_work_mem='300MB';" SET $ psql -c "SET work_mem='350MB';" SET $ psql -c "SET statement_timeout=4800000;" SET $ psql -c "ALTER ROLE postgres SET statement_timeout=4800000;"; ALTER ROLE
Create a PostgreSQL Data Table
Execute the following SQL statement to create a PostgreSQL data table. Ensure both the vector columns (
image_embedding) are of type
Note: The vector columns must be consistent with our data’s vector dimensions.
$ psql -c "CREATE TABLE laion_dataset ( id serial, url character varying(2048) null, caption character varying(2048) null, similarity float null, image_embedding vector(512) not null, text_embedding vector(512) not null, constraint laion_dataset_not_null_pkey primary key (id) );"
Insert the Data Into the PostgreSQL Table
Insert the data in batches of 500K rows.
Note: We have only inserted the first set of 500K rows to test how successfully the data was inserted before adding the rest.
$ wget https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/laion_dataset.sql.tar.gz $ tar -zxvf laion_dataset.sql.tar.gz $ cd laion_dataset $ psql < laion_dataset_id_lt500K.sql INSERT 0 1000 INSERT 0 1000 INSERT 0 1000 INSERT 0 1000 INSERT 0 1000 ...
Build an ANN Index
The next step is to create an index that uses Approximate Nearest Neighbor search (ANN).
Note: Set the
lists parameter to 1000, as the expected number of rows in the data table is 1M rows.
$ psql -c "CREATE INDEX ON laion_dataset USING ivfflat (image_embedding vector_cosine_ops) WITH (lists = 1000);"; CREATE INDEX
Run a SQL Query
The last step is to test that everything works by executing the following SQL statement:
$ psql -c "SET ivfflat.probes = 100; SELECT id, url, caption, image_embedding <=> (SELECT image_embedding FROM laion_dataset WHERE id=14358) AS dist FROM laion_dataset WHERE id!=14358 ORDER BY dist ASC LIMIT 10;"
If your result set looks like ours, we can proceed.
Insert the Rest of the Data
Add the rest of the data to the PostgreSQL table. This is important to validate the performance of
pgvector at scale.
As you can see from this script's output (and the following image), inserting the rest of the data returns errors. Supabase sets the PostgreSQL instance to read-only mode, preventing storage consumption beyond its free tier limits.
Overflowed Supabase Store Size
Note: Only 500K rows have been inserted into the PostgreSQL table and not 1M rows.
Migrate to MyScale
Let's now migrate the data to MyScale by following the step-by-step guide described below:
Note: Not only is MyScale a high-performant, but benchmarking reports that MyScale can outperform other specialized vector databases with respect to. MyScale's proprietary vector indexing algorithm, Multi-Scale Tree Graph (MSTG), utilizes local NVMe SSD as a cache disk, significantly increasing the supported index scale when compared to in-memory situations.
Create a MyScale Cluster
The first step is to create and launch a new MyScale cluster.
- Navigate to the Clusters page (opens new window) and click the +New Cluster button to launch a new cluster.
- Name your cluster.
- Click Launch to run the cluster.
The following image describes this process further:
Creating MyScale Cluster
Create a Data Table
Once the cluster is running, execute the following SQL script to create a new database table:
When you have created the database table, use the
postgresql() method (developed by MyScale) to migrate the data from PostgreSQL easily.
Confirm the Total Number of Rows Inserted
SELECT count(*) statement to confirm whether all the data has migrated to MyScale from the PostgreSQL table.
Build Database Table Index
The next step is to build an index with the index type
metric_type (distance calculation method) as
Execute an ANN Query
Run the following script to execute an ANN query using the
MSTG index we have just created.
Insert the Rest of the Data Into the MyScale Table
Let's add the rest of the data to the MyScale table.
Import Data From CSV/Parquet
The good news is that we can import the data directly from the Amazon S3 bucket using the following MyScale method:
Confirm the Total Number of Rows Inserted
Once again, run the following
SELECT count(*) statement to confirm whether all the data from the S3 bucket has been imported into MyScale.
Note: The free pod of MyScale can store a total of 5M 768-dimensional vectors.
Execute an ANN Query
We can use the same SQL query as we used above to perform queries on a dataset of 1M rows.
To refresh your memory, here is the SQL statement again:
https://thumbs.dreamstime.com/t/pretty-ragdoll-cat-white-background-10310986.jpg (opens new window)
|Pretty Ragdoll cat on white background Royalty Free Stock Image||0.07492614|
https://us.123rf.com/450wm/photodeti/photodeti1402/photodeti140200064/25754632-cat-sitting (opens new window)
|cat sitting in front and looking at camera isolated on white background Stock Photo||0.09292877|
http://4.bp.blogspot.com/-ERZs22WUkq4/VfzZIRoa2MI/AAAAAAAAALU/8qQ4JHJYFEk/s640/bleu-russe.jpg (opens new window)
|Russian Blue cat, mysterious cat, friendly cat, Frances Simpson, cat breeds, cats breed||0.09316337|
https://thumbs.dreamstime.com/t/mixed-breed-cat-felis-catus-6-months-old-22629295.jpg (opens new window)
|Mixed-breed cat, Felis catus, 6 months old Royalty Free Stock Photo||0.09820753|
https://cdn.britannica.com/s:300x300/25/7125-004-CA2926E3.jpg (opens new window)
|Abyssinian Cat Pictures||0.10573125|
https://media.buzzle.com/media/images-en/photos/mammals/cats/1200-60808028-red-kitten.jpg (opens new window)
|Male Cat Names For Orange Cats||0.10775411|
https://thumbs.dreamstime.com/t/russian-blue-kitten-17465284.jpg (opens new window)
|Russian blue kitten Stock Images||0.10845572|
https://thumbs.dreamstime.com/t/norwegian-forest-cat-white-background-10365846.jpg (opens new window)
|Norwegian Forest Cat on white background. Show champion black and white...||0.1116029|
https://envato-shoebox-0.imgix.net/2e27/8d88-aee4-11e3-9936-b8ca3a6774f8/2013_254_008_0124.jpg (opens new window)
|Rear view of a Maine Coon kitten sitting, looking up, 4 months old, isolated on white||0.11200631|
https://us.123rf.com/450wm/satina/satina1003/satina100301562/6615526-cat-isolated-over-white (opens new window)
|himalayan cat: Cat isolated over white background||0.11550176|
This discussion describes how migrating vector data from PostgreSQL to MyScale is straightforward. And even as we increase the data volume in MyScale by migrating data from PostgreSQL and importing new data, MyScale continues to exhibit reliable performance irrespective of the size of the dataset. The colloquial phrase: The bigger, the better, rings true in this regard. MyScale's performance remains reliable even when querying super-large datasets.
Therefore, if your business experiences data scale or performance bottlenecks, we strongly recommend migrating your data to MyScale using the steps outlined in this article.