
Matryoshka embeddings are not sparse. And SPLADE can scale to tens or hundreds of thousands of dimensions.




Yeah, the standard SPLADE model trained from BERT typically already has a vocabulary size, and therefore a vector size, of 30,522. If the SPLADE model is based on a multilingual model, such as mBERT or XLM-R, the vocabulary grows to 100,000 or more, and the vector size grows with it.

If you consider the actual latent space to be the full higher-dimensional representation, and you take the first principal component, the other components are zero. Pretty sparse. No, it's not a linked-list sparse matrix. Don't be a pedant.

When you truncate Matryoshka embeddings, you get the storage benefits of low-dimensional vectors with the limited expressiveness of low-dimensional vectors. Usually, what people look for in sparse vectors is to combine the storage benefits of low-dimensional vectors with the expressiveness of high-dimensional vectors. For that, you need the non-zero dimensions to be different for different vectors.
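
To make that distinction concrete, here is a minimal sketch in plain Python. All numbers are made up for illustration and do not come from any real model; the helper names (truncate, support, sparse_dot) are hypothetical. It assumes a SPLADE-style vector is stored as a {dimension: weight} map over a 30,522-entry BERT vocabulary.

```python
# Illustrative sketch only: toy values, not output from any real model.
VOCAB_SIZE = 30_522  # BERT-base WordPiece vocabulary size, used as the sparse dimensionality


def truncate(vec, k):
    """Matryoshka-style truncation: keep only the first k coordinates."""
    return vec[:k]


def support(sparse_vec):
    """Set of dimensions that hold a non-zero weight."""
    return {i for i, w in sparse_vec.items() if w != 0.0}


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter map
    return sum(w * b[i] for i, w in a.items() if i in b)


# Two hypothetical dense embeddings, truncated to their first 4 dimensions.
dense_a = [0.12, -0.40, 0.33, 0.08, -0.21, 0.05, 0.17, -0.02]
dense_b = [0.30, 0.11, -0.25, 0.42, 0.09, -0.14, 0.01, 0.27]
trunc_a = truncate(dense_a, 4)
trunc_b = truncate(dense_b, 4)
# Every truncated vector lives in the same 4 dimensions (0..3).

# Two hypothetical sparse vectors: 4 stored weights each, but the active
# dimensions are chosen per vector from the whole vocabulary-sized space.
sparse_a = {101: 1.4, 2054: 0.7, 17953: 2.1, 29000: 0.3}
sparse_b = {101: 0.9, 3000: 1.2, 8123: 0.5, 25111: 1.8}

print(len(trunc_a), len(trunc_b))            # same storage cost per vector
print(len(sparse_a), len(sparse_b))          # also 4 stored weights each
print(support(sparse_a) == support(sparse_b))  # but the supports differ
print(sparse_dot(sparse_a, sparse_b))        # only shared dimension 101 contributes
```

Both representations store four numbers per vector (plus indices in the sparse case), but the truncated vectors can only ever vary along the same four axes, while the sparse ones can each pick their own axes out of 30,522.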

No one means Matryoshka embeddings when they talk about sparse embeddings. This is not pedantic.

No one means wolves when they talk about dogs, obviously wolves and dogs are TOTALLY different things.

Why?


