JaguarDB

The Most Scalable Vector Database

Example: Similarity Search

The Jaguar vector database can be used to measure the similarity between documents. Textual data is transformed into vector representations, such as word embeddings or document embeddings. By comparing the vectors of different documents, the database can identify similar content, cluster related documents, and enable efficient document search.

The following Python example shows how software engineers and data scientists can integrate JaguarDB into AI applications. It stores text data, creates embeddings, and runs similarity searches over the text corpus to find the texts that most closely match a given query text. The search relies solely on vector embeddings, so no explicit keywords or search cues are needed.

The following Python code connects to a JaguarDB instance:

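import sys
import jaguarpy

jag = jaguarpy.Jaguar()
host = "127.0.0.1"
port = sys.argv[1]   # port number passed as the first command-line argument
apikey = "myapikey"
database = "vdb"

# Connect to the JaguarDB server; rc holds the value returned by connect()
rc = jag.connect( host, port, apikey, database )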

Next, a table containing a vector column and other related data is created:

jag.execute("create store textvec ( v vector(1024, 'cosine_fraction_short'), text char(2048), source char(32))")

The "zid" field does not appear in this statement; it is a unique identifier that is generated automatically. The "v" field is a vector with two primary elements: an integer vector ID and an array of vector components. The vector dimension is set to 1024. In the string "cosine_fraction_short", "cosine" selects the cosine distance metric for similarity searches on the vector, "fraction" indicates that input data is expected in fractional format, and "short" selects one of JaguarDB's quantization levels, which stores vector data using 16-bit quantization. There is no limit on the number of vectors in a table. Multiple vectors can be created on the same table to capture different types of vectors for the same object. The "text" field stores text data for an object, up to 2048 bytes. The "source" field records where the text was imported from.
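To make the idea of 16-bit quantization concrete, the following is a minimal sketch of uniform scalar quantization. It is not JaguarDB's actual storage scheme, which is not described here. The function names quantize16 and dequantize16 and the assumed value range of [-1, 1] are illustrative assumptions.

```python
def quantize16(x, lo=-1.0, hi=1.0):
    """Map a float in [lo, hi] to an unsigned 16-bit code (0..65535).

    Assumption: components lie in [lo, hi]; values outside are clamped.
    """
    x = min(max(x, lo), hi)
    return round((x - lo) / (hi - lo) * 65535)


def dequantize16(code, lo=-1.0, hi=1.0):
    """Map a 16-bit code back to an approximate float in [lo, hi]."""
    return lo + code / 65535 * (hi - lo)


# Made-up vector components in fractional format
vec = [0.12345678, -0.5, 0.999, 0.0]
codes = [quantize16(x) for x in vec]
restored = [dequantize16(c) for c in codes]

# The round-trip error is bounded by one quantization step
max_err = max(abs(a - b) for a, b in zip(vec, restored))
print(codes, max_err)
```

Each component then needs 2 bytes instead of the 4 bytes of a 32-bit float, at the cost of a small rounding error.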

Next, we create an index that connects the integer vector ID of each vector to the unique "zid" field, so that the related attributes of an object can be looked up:

jag.execute("create index textvec_idx on textvec(v, zid)")

With JaguarDB, users can store various types of vectors, such as feature vectors and embedding vectors. An embedding vector, often simply referred to as an "embedding", is a mathematical representation of a discrete item, such as a word, phrase, image, or any other entity, in a continuous vector space. This technique is commonly used in various fields, including natural language processing (NLP), computer vision, recommendation systems, and more. The primary idea behind embedding vectors is to capture semantic relationships between items by placing similar items closer together in the vector space.
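As a small illustration of "similar items closer together", the sketch below computes cosine similarity between made-up three-dimensional vectors. The vectors and the labels attached to them are invented for illustration; real embeddings, such as those stored in the 1024-dimensional column above, come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (not produced by any real model)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
truck = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, truck)
print(sim_close > sim_far)
```

A value near 1 means the two vectors point in nearly the same direction, which is how related items end up being treated as similar.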

In this example, we use the "BAAI/bge-large-en" pre-trained embedding model to generate embeddings for the text data. A pre-trained embedding model is a machine learning model that has been trained on a large dataset to create meaningful representations (embeddings) of items in a continuous vector space. These embeddings capture semantic relationships and contextual information about the items. The packages needed to load the model should be installed first:

pip install -U FlagEmbedding
pip install -U sentence-transformers

The model is loaded in a Python program with:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-en')

Next, we store the text data in the table, one text at a time:

text = "Human impact on the environment (or anthropogenic environmental impact) refers to changes to biophysical environments and to ecosystems, biodiversity, and natural resources caused directly or indirectly by humans."
zuid1 = storeText( jag, model, text, 'wiki' )

The function storeText() is implemented as follows:

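def storeText(jag, model, text, src):
    # Encode the text into a single embedding vector
    sentences = [ text ]
    embeddings = model.encode(sentences, normalize_embeddings=False)
    # Serialize the vector components as a comma-separated string
    comma_str = ",".join( [str(x) for x in embeddings[0] ])

    # Values follow the column order of the table: v, text, source
    istr = "insert into textvec values ('" + comma_str + "', '" + text + "', '" + src + "')"
    jag.execute( istr )
    # Return the ID reported by getLastUuid() for this insert
    return jag.getLastUuid()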
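Because the insert statement is assembled by string concatenation, it can help to check the inputs against the table definition before sending it. The following is a minimal sketch; build_insert is a hypothetical helper, not part of the JaguarDB client. It assumes that the char(32) limit on "source" is counted in bytes like the "text" field, and it rejects single quotes rather than guessing at an escaping rule.

```python
VECTOR_DIM = 1024       # from vector(1024, ...)
MAX_TEXT_BYTES = 2048   # from text char(2048)
MAX_SOURCE_BYTES = 32   # from source char(32); byte limit is an assumption


def build_insert(embedding, text, src):
    """Hypothetical helper: validate inputs, then build the insert statement."""
    if len(embedding) != VECTOR_DIM:
        raise ValueError("expected %d components, got %d" % (VECTOR_DIM, len(embedding)))
    if len(text.encode("utf-8")) > MAX_TEXT_BYTES:
        raise ValueError("text exceeds %d bytes" % MAX_TEXT_BYTES)
    if len(src.encode("utf-8")) > MAX_SOURCE_BYTES:
        raise ValueError("source exceeds %d bytes" % MAX_SOURCE_BYTES)
    if "'" in text or "'" in src:
        # A single quote would end the quoted string early
        raise ValueError("single quotes are not supported by this sketch")
    comma_str = ",".join(str(x) for x in embedding)
    return "insert into textvec values ('" + comma_str + "', '" + text + "', '" + src + "')"


# Usage with a placeholder vector of zeros
stmt = build_insert([0.0] * VECTOR_DIM, "hello", "wiki")
print(stmt[:40])
```

The returned string could then be passed to jag.execute() in place of the statement built inside storeText().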

Now we take a query text and retrieve similar texts from the database:

queryText = "More recently, that focus has shifted eastward by 400 to 500 miles. In the past decade or so tornadoes have become prevalent in eastern Missouri and Arkansas, western Tennessee and Kentucky, and northern Mississippi and Alabama—a new region of concentrated storms. Tornado activity in early 2023 epitomized the trend."
K = 3
retrieveTopK( jag, model, queryText, K )
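To show what a top-K similarity search computes, here is a brute-force sketch in plain Python. It is not the implementation of retrieveTopK() and does not reflect how JaguarDB executes the search. The function names, the toy three-dimensional vectors, and the sample texts are all made up for illustration.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, corpus, k):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = ((cosine_distance(query_vec, vec), text) for text, vec in corpus)
    return heapq.nsmallest(k, scored)


# Hypothetical corpus of (text, embedding) pairs
corpus = [
    ("tornado season", [0.9, 0.1, 0.1]),
    ("storm warnings", [0.8, 0.3, 0.0]),
    ("bread recipes", [0.0, 0.1, 0.9]),
    ("river pollution", [0.2, 0.9, 0.1]),
]
query = [1.0, 0.2, 0.0]

results = top_k(query, corpus, 2)
for dist, text in results:
    print(round(dist, 4), text)
```

A brute-force scan compares the query against every stored vector, which is only practical for small collections.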

For the full listing of Python programs, please visit the following link:

Jaguar User Manual
