structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor performance. Modern embedding models are based on the BERT architecture, which is essentially the encoder part of a Transformer, and they are trained on massive unstructured text datasets with the main goal of capturing semantic meaning. They can deliver incredible retrieval performance, but only on the kind of natural language they were optimized for. Consequently, even though embedding JSON may seem like an intuitively simple and elegant solution, using a generic embedding model on JSON objects yields results far from peak performance.
Deep dive
Tokenization
The first step is tokenization, which takes the text and splits it into tokens, typically sub-word units. Modern embedding models rely on Byte-Pair Encoding (BPE) or WordPiece tokenization algorithms. These algorithms are optimized for natural language, breaking words into common sub-components. When a tokenizer encounters raw JSON, it struggles with the high frequency of non-alphanumeric characters. For example, "usd": 10, is not seen as a key-value pair; instead, it is fragmented into:
- The quotes ("), colon (:), and comma (,)
- The tokens usd and 10
This creates a low signal-to-noise ratio. In natural language, almost all words contribute to the semantic "signal", whereas in JSON (and other structured formats) a significant share of tokens is "wasted" on structural syntax that carries zero semantic value.
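To see this fragmentation first-hand, here is a minimal sketch using the tokenizer of the embedding model applied later in this article (the exact split may vary between tokenizers):

from transformers import AutoTokenizer

# Tokenizer of the all-MiniLM-L6-v2 embedding model used in the experiment below
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# The quotes, colon and comma each come out as separate structural tokens
# around the sub-word pieces of "usd" and "10"
print(tokenizer.tokenize('"usd": 10,'))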
Attention calculation
The core power of Transformers lies in the attention mechanism, which allows the model to weigh the importance of tokens relative to one another.
In the sentence The price is 10 US dollars or 9 euros, attention can easily link the value 10 to the concept price, because these relationships are well represented in the model's pre-training data and the model has seen this linguistic pattern millions of times. In contrast, in the raw JSON:
"value": {
"usd": 10,
"eur": 9,
}
the model encounters structural syntax it was never primarily optimized to "read". Without the linguistic connectors, the resulting vector fails to capture the true intent of the data, because the relationships between keys and values are obscured by the format itself.
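A quick way to observe this is to compare how close each form lands to a natural-language query. Here is a minimal sketch assuming the same all-MiniLM-L6-v2 model used in the experiment below; exact scores will vary:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = model.encode("price in US dollars")
candidates = model.encode([
    "The price is 10 US dollars or 9 euros",  # natural-language form
    '"price": {"usd": 10, "eur": 9}',         # raw JSON form
])

# Cosine similarity of the query against both forms; the natural-language
# variant typically lands closer to the query vector
print(util.cos_sim(query, candidates))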
Mean Pooling
The final step in producing a single embedding representation of the document is mean pooling. Mathematically, the final embedding E is the centroid of all token vectors (e1, e2, ..., en) in the document:

E = (e1 + e2 + ... + en) / n
This is where the JSON tokens become a mathematical liability. If 25% of the tokens in the document are structural markers (braces, quotes, colons), the final vector is heavily influenced by the "meaning" of punctuation. As a result, these noise tokens effectively "pull" the vector away from its true semantic center in the vector space. When a user submits a natural-language query, the distance between the "clean" query vector and the "noisy" JSON vector increases, directly hurting the retrieval metrics.
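The effect can be illustrated with a toy numpy example, where made-up 2-D "token vectors" stand in for real embeddings:

import numpy as np

# Hypothetical content tokens (e.g. "price", "10") and structural tokens (e.g. '"', ':')
content = np.array([[1.0, 0.0], [0.9, 0.1]])
noise = np.array([[0.0, 1.0], [0.1, 0.9]])

# Mean pooling averages ALL token vectors, noise included
clean = content.mean(axis=0)
noisy = np.vstack([content, noise]).mean(axis=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Below 1.0: the pooled vector has drifted away from the semantic centroid
print(cos(clean, noisy))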
Flatten it
So now that we know about the JSON limitations, we need to figure out how to work around them. The most common and straightforward approach is to flatten the JSON and convert it into natural language.
Let's consider a typical product object:
{
    "skuId": "123",
    "description": "This is a test product used for demonstration purposes",
    "quantity": 5,
    "price": {
        "usd": 10,
        "eur": 9
    },
    "availableDiscounts": ["1", "2", "3"],
    "giftCardAvailable": "true",
    "category": "demo product"
    ...
}
This is a simple object with some attributes like description, etc. Let's apply tokenization to it and see what it looks like:

Now, let's convert it into text to make the embedding model's job easier. To do that, we can define a template and substitute the JSON values into it. For example, this template could be used to describe the product:
Product with SKU {skuId} belongs to the category "{category}"
Description: {description}
It has a quantity of {quantity} available
The price is {price.usd} US dollars or {price.eur} euros
Available discount ids include {availableDiscounts as comma-separated list}
Gift cards are {giftCardAvailable ? "available" : "not available"} for this product
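As a sketch, this substitution can be implemented with a small Python helper (the function name and details are illustrative, not taken from the original notebook):

def render_product(p: dict) -> str:
    """Fill the natural-language template above with values from the product JSON."""
    ids = p["availableDiscounts"]
    discounts = ", ".join(ids[:-1]) + ", and " + ids[-1] if len(ids) > 1 else "".join(ids)
    gift = "available" if p["giftCardAvailable"] == "true" else "not available"
    return (
        f'Product with SKU {p["skuId"]} belongs to the category "{p["category"]}"\n'
        f'Description: {p["description"]}\n'
        f'It has a quantity of {p["quantity"]} available\n'
        f'The price is {p["price"]["usd"]} US dollars or {p["price"]["eur"]} euros\n'
        f'Available discount ids include {discounts}\n'
        f'Gift cards are {gift} for this product'
    )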
So the final result will look like:
Product with SKU 123 belongs to the category "demo product"
Description: This is a test product used for demonstration purposes
It has a quantity of 5 available
The price is 10 US dollars or 9 euros
Available discount ids include 1, 2, and 3
Gift cards are available for this product
And apply the tokenizer to it:

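The original token visualization is an image; as a stand-in, here is a minimal sketch that compares the token counts of the two representations (raw_json and flat_text are hypothetical variables holding the strings shown above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

n_json = len(tokenizer.tokenize(raw_json))   # raw JSON product object as a string
n_flat = len(tokenizer.tokenize(flat_text))  # flattened natural-language version
print(n_json, n_flat, f"{(n_json - n_flat) / n_json:.0%} fewer tokens")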
Not only does it have 14% fewer tokens now, but it is also in a much clearer form that preserves the semantic meaning and the required context.
Let's measure the results
Note: full, reproducible code for this experiment is available in the Google Colab notebook [1]
Now let's measure retrieval performance for both options. To keep it simple, we will focus on the standard retrieval metrics Recall@k, Precision@k, and MRR, and use a generic embedding model (all-MiniLM-L6-v2) and the Amazon ESCI dataset with a random sample of 5,000 queries and 3,809 relevant products.
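For reference, here is a minimal sketch of how these metrics can be computed for a single query's ranked results (not the notebook's exact implementation); MRR is then simply the mean of the reciprocal ranks across all queries:

def metrics_for_query(ranked_ids, relevant_ids, k=10):
    """Precision@k, Recall@k and reciprocal rank for one query."""
    top_k = ranked_ids[:k]
    hits = sum(1 for pid in top_k if pid in relevant_ids)
    precision_at_k = hits / k
    recall_at_k = hits / len(relevant_ids)
    # Reciprocal rank: 1 / position of the first relevant result, 0 if none is found
    rr = next((1.0 / (i + 1) for i, pid in enumerate(ranked_ids) if pid in relevant_ids), 0.0)
    return precision_at_k, recall_at_k, rr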
all-MiniLM-L6-v2 is a popular choice: it is small (22.7M parameters) yet fast and accurate, which makes it a good fit for this experiment.
For the dataset, a version of Amazon ESCI is used, specifically milistu/amazon-esci-data, which is available on Hugging Face and contains a collection of Amazon product and search-query data.
The flattening function used for the text conversion is:
def flatten_product(product):
    return (
        f"Product {product['product_title']} from brand {product['product_brand']}"
        f" and product id {product['product_id']}"
        f" and description {product['product_description']}"
    )
A sample of the raw JSON data is:
{
"product_id": "B07NKPWJMG",
"title": "RoWood 3D Puzzles for Adults, Wooden Mechanical Gear Kits for Teens Kids Age 14+",
"description": " Specifications
Model Number: Rowood Treasure Box LK502
Average build time: 5 hours
Total Pieces: 123
Model weight: 0.69 kg
Box weight: 0.74 KG
Assembled size: 100*124*85 mm
Box size: 320*235*39 mm
Certificate: EN71,-1,-2,-3,ASTMF963
Recommended Age Range: 14+
Contents
Plywood sheets
Metal Spring
Illustrated instructions
Accessories
MADE FOR ASSEMBLY
-Follow the instructions provided in the booklet and assemble the 3D puzzle with some exciting and engaging fun. Feel the pleasure of self-creation getting this pretty wooden work like a pro.
GLORIFY YOUR LIVING SPACE
-Revive the enigmatic charm and cheer your parties and get-togethers with an experience that is unique and intriguing.
",
"brand": "RoWood",
"color": "Treasure Box"
}
For the vector search, two FAISS indexes are created: one for the flattened text and one for the JSON-formatted text. Both indexes are flat, which means they compute exact distances to every stored entry instead of relying on an Approximate Nearest Neighbour (ANN) index. This is important to ensure that the retrieval metrics are not affected by ANN approximation.
import faiss

D = 384  # output embedding dimension of all-MiniLM-L6-v2
index_json = faiss.IndexFlatIP(D)     # exact inner-product index for the JSON texts
index_flatten = faiss.IndexFlatIP(D)  # exact inner-product index for the flattened texts
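A condensed sketch of how the indexes are then populated and queried (json_texts, flat_texts, and queries are hypothetical variable names; the notebook's actual code differs in detail):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode both representations of the same products; normalizing makes the
# inner product in the flat indexes equivalent to cosine similarity
emb_json = model.encode(json_texts, normalize_embeddings=True)
emb_flat = model.encode(flat_texts, normalize_embeddings=True)
index_json.add(emb_json)
index_flatten.add(emb_flat)

# Retrieve the top-10 nearest products for each encoded query
query_emb = model.encode(queries, normalize_embeddings=True)
scores, ids = index_flatten.search(query_emb, 10)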
To reduce the dataset, a random sample of 5,000 queries was chosen, and all corresponding products were embedded and added to the indexes. The collected metrics are as follows:

(Figure: retrieval metrics for the all-MiniLM-L6-v2 embedding model on the Amazon ESCI dataset. The flattened approach consistently yields higher scores across all key retrieval metrics: Precision@10, Recall@10, and MRR. Image by author)

And the performance change of the flattened version is:

The analysis confirms that embedding raw structured data into a generic vector space is a suboptimal approach, and that adding a simple preprocessing step of flattening the structured data consistently delivers a significant improvement in retrieval metrics (boosting Recall@k and Precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is critical for achieving peak performance of a semantic retrieval/RAG system.
References
[1] Full experiment code: https://colab.research.google.com/drive/1dTgt6xwmA6CeIKE38lf2cZVahaJNbQB1?usp=sharing
[2] Model: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[3] Amazon ESCI dataset. Specific version used: https://huggingface.co/datasets/milistu/amazon-esci-data
The original dataset is available at https://www.amazon.science/code-and-datasets/shopping-queries-dataset-a-large-scale-esci-benchmark-for-improving-product-search
[4] FAISS: https://ai.meta.com/tools/faiss/
