Example: Similarity Search
The Jaguar vector database can be used to measure the similarity between documents. Textual data is first transformed into vector representations, such as word embeddings or document embeddings. By comparing the vectors of different documents, the database can identify similar content, cluster related documents, and support efficient document search.
The following Python example shows how software engineers and data scientists can integrate JaguarDB into AI applications. It stores text data, creates embeddings for it, and runs similarity searches over the stored corpus to find the texts that most closely match a given query text. The search relies solely on vector embeddings, so no explicit keywords or search cues are needed.
The following Python code connects to a JaguarDB instance:
import sys
import jaguarpy

jag = jaguarpy.Jaguar()
host = "127.0.0.1"
port = sys.argv[1]  # port number passed as the first command-line argument
apikey = "myapikey"
database = "vdb"
rc = jag.connect(host, port, apikey, database)
Next, a table containing a vector column and other related data is created:
jag.execute("create store textvec ( v vector(1024, 'cosine_fraction_short'), text char(2048),source char(32))")
In this statement, the "v" field represents a vector, comprising two primary elements: an integer vector ID and an array of vector components. The dimension of the vector is set to 1024. The string "cosine_fraction_short" has three parts: "cosine" selects the cosine distance metric for similarity searches on the vector, "fraction" indicates that the input data is expected in fractional format, and "short" selects the quantization level. JaguarDB vector storage implements distinct quantization levels, and the short mode uses 16-bit quantization to store vector data efficiently.
The "zid" field, which is not declared in the statement, is an automatically generated unique identifier. The "text" field stores the text data of an object, up to 2048 bytes. The "source" field records where the text was imported from. There is no limit on the number of vectors in a table, and multiple vectors can be created on the same table to capture different types of vectors for the same object.
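To give a feel for what 16-bit quantization of fractional components costs in accuracy, here is a minimal sketch. It assumes components lie in [-1.0, 1.0] and are scaled linearly to signed 16-bit integers; the helper names quantize16 and dequantize16 are hypothetical, and JaguarDB's actual quantization scheme may differ.

```python
import math

SCALE = 32767  # largest magnitude of a signed 16-bit integer (assumed scale)

def quantize16(components):
    """Map fractional components in [-1.0, 1.0] to signed 16-bit integers."""
    quantized = []
    for x in components:
        clipped = max(-1.0, min(1.0, x))  # keep the value inside the assumed range
        quantized.append(round(clipped * SCALE))
    return quantized

def dequantize16(quantized):
    """Recover approximate fractional components from 16-bit integers."""
    return [q / SCALE for q in quantized]

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

original = [0.12, -0.57, 0.33, 0.91]
restored = dequantize16(quantize16(original))

# The round trip moves each component by at most half a quantization step.
max_error = max(abs(o - r) for o, r in zip(original, restored))
print(max_error <= 0.5 / SCALE + 1e-12)

# The direction of the vector, which is all cosine distance depends on,
# is almost unchanged.
print(cosine_distance(original, restored) < 1e-6)
```

Under these assumptions each component takes two bytes instead of the four of a 32-bit float, and the rounding error has very little effect on cosine distance.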
Then we create an index that connects the integer vector ID of a vector to the unique "zid" field, so that the related attributes of an object can be looked up:
jag.execute("create index textvec_idx on textvec(v, zid)")
With JaguarDB, users can store various types of vectors, such as feature vectors and embedding vectors. An embedding vector, often simply referred to as an "embedding", is a mathematical representation of a discrete item, such as a word, phrase, image, or any other entity, in a continuous vector space. This technique is commonly used in various fields, including natural language processing (NLP), computer vision, recommendation systems, and more. The primary idea behind embedding vectors is to capture semantic relationships between items by placing similar items closer together in the vector space.
In this example, we use the "BAAI/bge-large-en" pre-trained embedding model to generate embeddings for the text data. A pre-trained embedding model is a machine learning model that has been trained on a large dataset to create meaningful representations (embeddings) of items in a continuous vector space. These embeddings capture semantic relationships and contextual information about the items. The supporting Python packages should be installed first:
pip install -U FlagEmbedding
pip install -U sentence-transformers
The model is loaded in a Python program with:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-en')
Next, we store a group of text data in the table:
text = "Human impact on the environment (or anthropogenic environmental impact) refers to changes to biophysical environments and to ecosystems, biodiversity, and natural resources caused directly or indirectly by humans."
zuid1 = storeText( jag, model, text, 'wiki' )
The function storeText() is implemented as follows:
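def storeText(jag, model, text, src):
    # Encode the text into a single embedding vector
    sentences = [ text ]
    embeddings = model.encode(sentences, normalize_embeddings=False)
    # Join the vector components into a comma-separated string
    comma_str = ",".join( [str(x) for x in embeddings[0] ])
    # Build and run the insert statement, then return the generated ID
    istr = "insert into textvec values ('" + comma_str + "', '" + text + "','" + src + "')"
    jag.execute( istr )
    return jag.getLastUuid()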
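Because storeText() builds the insert statement by string concatenation, it can help to check a record against the table definition before sending it. The following is a minimal sketch of such a check. The helper check_record is hypothetical and not part of jaguarpy, the limits are taken from the create statement above, and measuring field sizes as UTF-8 bytes is an assumption.

```python
VECTOR_DIM = 1024       # from vector(1024, ...) in the create statement
MAX_TEXT_BYTES = 2048   # from text char(2048)
MAX_SOURCE_BYTES = 32   # from source char(32)

def check_record(embedding, text, src):
    """Raise ValueError if a record would not fit the textvec table."""
    if len(embedding) != VECTOR_DIM:
        raise ValueError(f"expected {VECTOR_DIM} components, got {len(embedding)}")
    # Assumption: field sizes are counted in UTF-8 encoded bytes.
    if len(text.encode("utf-8")) > MAX_TEXT_BYTES:
        raise ValueError(f"text exceeds {MAX_TEXT_BYTES} bytes")
    if len(src.encode("utf-8")) > MAX_SOURCE_BYTES:
        raise ValueError(f"source exceeds {MAX_SOURCE_BYTES} bytes")
    # A single quote would end the quoted value early in the concatenated
    # statement. How JaguarDB escapes quotes is not covered here, so this
    # sketch simply rejects them.
    if "'" in text or "'" in src:
        raise ValueError("single quotes are not handled by this sketch")

# A record that fits the schema passes silently.
check_record([0.0] * VECTOR_DIM, "Human impact on the environment", "wiki")
```

Calling check_record() at the top of storeText() would surface an oversized text or a stray quote as a Python exception instead of a malformed statement.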
Now we take a query text and retrieve similar texts from the database:
queryText = "More recently, that focus has shifted eastward by 400 to 500 miles. In the past decade or so tornadoes have become prevalent in eastern Missouri and Arkansas, western Tennessee and Kentucky, and northern Mississippi and Alabama—a new region of concentrated storms. Tornado activity in early 2023 epitomized the trend."
K = 3
retrieveTopK( jag, model, queryText, K )
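The retrieveTopK() function is not reproduced here. To illustrate what a top-K similarity search computes, the sketch below ranks a few toy vectors against a query vector by cosine similarity, entirely in memory. It is not the JaguarDB implementation: the function top_k_similar, the three-dimensional vectors, and the sample texts are all made up for illustration, and a vector database would normally use its index rather than scan every vector.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query_vec, stored, k):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = ((cosine_similarity(query_vec, vec), text) for text, vec in stored)
    return heapq.nlargest(k, scored)

# Toy 3-dimensional "embeddings"; real ones in this example have 1024 components.
stored = [
    ("tornado activity shifts east", [0.9, 0.1, 0.0]),
    ("human impact on ecosystems", [0.1, 0.9, 0.2]),
    ("storm season outlook", [0.8, 0.3, 0.1]),
    ("biodiversity loss", [0.0, 0.8, 0.6]),
]
query_vec = [1.0, 0.2, 0.0]

for score, text in top_k_similar(query_vec, stored, 2):
    print(f"{score:.3f}  {text}")
```

The two storm-related texts rank first because their vectors point in nearly the same direction as the query vector, even though no keywords are compared.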
For the full listing of Python programs, please visit the following link:
Jaguar User Manual