The SDK is still in alpha. If you have any feedback, join our Discord 🔥
Find it on GitHub.
Embedbase
Open-source API & SDK to connect any data to ChatGPT
Before you start, you need to get an API key at app.embedbase.xyz.
Note: we're working on a fully client-side SDK. In the meantime, you can use the hosted instance of Embedbase.
Design philosophy
- Simple
- Open-source
- Composable (integrates well with any AI providers, databases and LLM helpers)
What is it
These are the official clients for Embedbase, an open-source API & SDK to easily create, store and retrieve embeddings.
Who is it for
People who want to
- plug their own data into ChatGPT or any other LLM
- build recommendation systems
- build search engines
- build classification engines
- etc.
Installation
npm install embedbase-js
Initializing
import { createClient } from 'embedbase-js'
// you can find the api key at https://embedbase.xyz
const apiKey = 'your api key'
// this is using the hosted instance
const url = 'https://api.embedbase.xyz'
const embedbase = createClient(url, apiKey)
Searching datasets
// fetching data
const data = await embedbase
.dataset('test-amazon-product-reviews')
.search('best hot dogs accessories', { limit: 3 })
console.log(data)
// [
// {
// "similarity": 0.810843349,
// "data": "This nice little hot dog toaster is a great addition to our kitchen. It is easy to use and makes a great hot dog. It is also easy to clean. I would recommend this to anyone who likes hot dogs.",
// "metadata": {
// "path": "https://amazon.com/hotdogtoaster",
// "source": "amazon"
// },
// "embedding": [0.35332, 0.23423, ...]
// },
// {
// "similarity": 0.294602573,
// "data": "200 years ago, people would never have guessed that humans in the future would communicate by silently tapping on glass",
// "embedding": [0.76532, 0.23423, ...]
// },
// {
// "similarity": 0.192932034,
// "data": "The average car in space is nicer than the average car on Earth",
// "embedding": [0.52342, 0.23423, ...]
// },
// ]
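In the example output above, only the first result is related to the query; the other two have much lower similarity scores. The following is a minimal sketch of dropping such results on the client side, assuming results have the shape shown above. The `SearchResult` interface, the `filterBySimilarity` helper and the `0.5` threshold are illustrative assumptions, not part of the SDK.

```typescript
// Shape of one search result, mirroring the example output above.
// The interface name is an assumption for this sketch, not an SDK export.
interface SearchResult {
  similarity: number
  data: string
  metadata?: Record<string, unknown>
  embedding?: number[]
}

// Keep only results whose similarity is at or above `minSimilarity`.
// The threshold is a hypothetical parameter that you should tune for your data.
function filterBySimilarity(
  results: SearchResult[],
  minSimilarity: number
): SearchResult[] {
  return results.filter((r) => r.similarity >= minSimilarity)
}

// Example using the similarity scores from the output above.
const exampleResults: SearchResult[] = [
  { similarity: 0.810843349, data: 'This nice little hot dog toaster...' },
  { similarity: 0.294602573, data: '200 years ago, people would never have guessed...' },
  { similarity: 0.192932034, data: 'The average car in space is nicer...' },
]
const relevantResults = filterBySimilarity(exampleResults, 0.5)
// relevantResults contains only the hot dog toaster review
```

A suitable threshold depends on the embedding model and on your data, so it is worth checking a few real queries before hard-coding a value.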
You can also filter by metadata:
const data = await embedbase
.dataset('test-amazon-product-reviews')
.search('best hot dogs accessories')
.where('source', '==', "amazon")
Adding Data
// embeddings are extremely good for retrieving unstructured data
// in this example we store an unparsable html string
const data = await embedbase.dataset('test-amazon-product-reviews').add(`
<div>
<span>Lightweight. Telescopic. Easy zipper case for storage. Didn't put in dishwasher. Still perfect after many uses.</span>
`)
console.log(data)
//
// {
// "id": "eiew823",
// "data": "Lightweight. Telescopic. Easy zipper case for storage.
// Didn't put in dishwasher. Still perfect after many uses.",
// "embedding": [0.1, 0.2, 0.3, ...]
// }
If you have many documents to add, you should use batchAdd:
embedbase.dataset(datasetId).batchAdd([{
data: 'some text',
}])
For better performance, you can run these calls in parallel. For example, you can send batches with Promise.all:
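const batch = async <T, R>(myList: T[], fn: (chunk: T[]) => Promise<R>): Promise<R[]> => {
  const batchSize = 100;
  // split the list into consecutive batches of at most batchSize items
  const batches: T[][] = [];
  for (let i = 0; i < myList.length; i += batchSize) {
    batches.push(myList.slice(i, i + batchSize));
  }
  // start every batch at once and wait for all of them to finish
  return Promise.all(batches.map((b) => fn(b)));
}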
batch(chunks, (chunk) => embedbase.dataset(datasetId).batchAdd(chunk))
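To make the batching step concrete, here is a small standalone sketch of the splitting logic on its own, without any network calls. The `toBatches` helper and the placeholder documents are hypothetical names used only for illustration.

```typescript
// Split `items` into consecutive groups of at most `size` elements.
// `toBatches` is a hypothetical helper name used only in this sketch.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size <= 0 || Math.floor(size) !== size) {
    throw new Error('size must be a positive integer')
  }
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}

// 250 placeholder documents in the { data: string } shape passed to batchAdd above.
const exampleDocs: { data: string }[] = []
for (let i = 0; i < 250; i++) {
  exampleDocs.push({ data: `document ${i}` })
}
const exampleBatchSizes = toBatches(exampleDocs, 100).map((b) => b.length)
// exampleBatchSizes is [100, 100, 50]
```

Note that Promise.all rejects as soon as one batch fails, so with many batches you may want to catch errors per batch and retry only the failed ones.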
Splitting and chunking large texts
AI models are often limited in the amount of text they can process at once. Embedbase provides a utility function to split large texts into smaller chunks.
We highly recommend using this function.
To split and chunk large texts, use the splitText function:
import { splitText } from 'embedbase-js/dist/main/split';
const text = 'some very long text...';
// ⚠️ note that the value of maxTokens depends
// on the embedder used by embedbase.
// With models such as OpenAI's embeddings model, you can
// use a maxTokens of 500. With other models, you may need to
// use a lower maxTokens value.
// (embedbase cloud uses an OpenAI model at the moment) ⚠️
const maxTokens = 500
// chunkOverlap is the number of tokens that will overlap between chunks
// it is useful to have some overlap to ensure that the context is not
// cut off in the middle of a sentence.
const chunkOverlap = 200
splitText(text, { maxTokens: maxTokens, chunkOverlap: chunkOverlap }, async ({ chunk, start, end }) =>
embedbase.dataset('some-data-set').add(chunk)
)
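To build intuition for how maxTokens and chunkOverlap interact, here is a simplified sliding-window sketch. It is not the implementation of splitText: it assumes that a "token" is a whitespace-separated word, whereas the real function is expected to count model tokens, so its chunk boundaries will differ. The `slidingWindows` name is hypothetical.

```typescript
// Simplified illustration of maxTokens and chunkOverlap.
// Assumption: "tokens" are whitespace-separated words, not model tokens.
function slidingWindows(
  text: string,
  maxTokens: number,
  chunkOverlap: number
): string[] {
  if (maxTokens <= 0 || chunkOverlap < 0 || chunkOverlap >= maxTokens) {
    throw new Error('need maxTokens > 0 and 0 <= chunkOverlap < maxTokens')
  }
  const tokens = text.split(/\s+/).filter((t) => t.length > 0)
  // Each new window starts `step` tokens after the previous one,
  // so consecutive windows share `chunkOverlap` tokens.
  const step = maxTokens - chunkOverlap
  const chunks: string[] = []
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens).join(' '))
    // Stop once a window reaches the end, so no chunk is a pure repeat.
    if (start + maxTokens >= tokens.length) break
  }
  return chunks
}

const exampleChunks = slidingWindows('a b c d e f g h i j', 4, 2)
// exampleChunks is ['a b c d', 'c d e f', 'e f g h', 'g h i j']
```

A larger overlap preserves more context across chunk boundaries, but it also produces more chunks and therefore more embeddings to compute and store.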
Check how we send our documentation to Embedbase to let you ask it questions through GPT-4.
Creating a "context"
createContext is very similar to .search, but it returns strings instead of objects. This is useful if you want to easily feed the results to GPT.
// like .search, but the matching texts are returned as plain strings
const data = await embedbase
.dataset('my-documentation')
.createContext('my-context')
console.log(data)
// [
// "Embedbase API allows to store unstructured data...",
// "Embedbase API has 3 main functions a) provides a plug and play solution to store embeddings b) makes it easy to connect to get the right data into llms c)..",
// "Embedbase API is self-hostable...",
// ]
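Because the context is a plain array of strings, turning it into a prompt is mostly string formatting. Below is a minimal sketch of one way to do it; the `buildPrompt` helper and the prompt wording are assumptions made for illustration and are not part of the SDK.

```typescript
// Join context strings and a question into a single prompt.
// `buildPrompt` and the prompt wording are assumptions made for this sketch.
function buildPrompt(context: string[], question: string): string {
  // Number each context entry so the model can refer to it.
  const contextBlock = context.map((c, i) => `[${i + 1}] ${c}`).join('\n')
  return [
    'Answer the question using only the context below.',
    '',
    'Context:',
    contextBlock,
    '',
    `Question: ${question}`,
  ].join('\n')
}

const examplePrompt = buildPrompt(
  [
    'Embedbase API allows to store unstructured data...',
    'Embedbase API is self-hostable...',
  ],
  'Can I self-host Embedbase?'
)
```

The resulting string can then be sent as the user message to whichever LLM provider you use.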
Adding metadata
Adding metadata can be useful. For example, if you are feeding an LLM like ChatGPT, a typical best practice is to store the source of the text, such as a URL, as metadata. You can then ask the AI to add links or footnotes to its output.
const data = await embedbase.dataset('test-amazon-product-reviews').add(
  `
<div>
<span>Lightweight. Telescopic. Easy zipper case for storage. Didn't put in dishwasher. Still perfect after many uses.</span>
`,
  // metadata can be anything you want; it will appear in the search results later
  { path: 'https://www.amazon.com/dp/B00004OCNS' }
)
console.log(data)
//
// {
// "id": "eiew823",
// "data": "Lightweight. Telescopic. Easy zipper case for storage.
// Didn't put in dishwasher. Still perfect after many uses.",
// "metadata": {"path": "https://www.amazon.com/dp/B00004OCNS"},
// "embedding": [0.1, 0.2, 0.3, ...]
// }
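Once the source is stored as metadata, it can be carried into the text you give the LLM. The sketch below numbers each result and lists its source, assuming results expose `data` and an optional `metadata.path` as in the examples above. The `ResultWithSource` interface and the `withFootnotes` helper are hypothetical and not part of the SDK.

```typescript
// Minimal result shape for this sketch; only the fields used here.
interface ResultWithSource {
  data: string
  metadata?: { path?: string }
}

// Number each text and list the sources found in metadata.path.
// `withFootnotes` is a hypothetical helper, not part of the SDK.
function withFootnotes(results: ResultWithSource[]): string {
  const lines: string[] = []
  const sources: string[] = []
  results.forEach((r, i) => {
    lines.push(`${r.data} [${i + 1}]`)
    const path = r.metadata && r.metadata.path
    // Results stored without metadata get a placeholder source.
    sources.push(`[${i + 1}] ${path ? path : 'unknown source'}`)
  })
  return lines.concat(['', 'Sources:'], sources).join('\n')
}

const exampleContext = withFootnotes([
  {
    data: 'Lightweight. Telescopic. Easy zipper case for storage.',
    metadata: { path: 'https://www.amazon.com/dp/B00004OCNS' },
  },
  { data: 'A review stored without metadata.' },
])
```

Passing the numbered text together with the source list lets you ask the model to cite entries by number in its answer.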
Listing datasets
const data = await embedbase.datasets()
console.log(data)
// [{"datasetId": "test-amazon-product-reviews", "documentsCount": 2}]
Create a recommendation engine
Check out this tutorial.
Contributing
We welcome contributions to Embedbase.
If you have any feedback or suggestions, please open an issue or join our Discord to discuss your ideas.