Everything you'd want to know before you install.
A detailed look at how DataFirefly AI Semantic Internal Linking — Vector embeddings, cosine similarity and semi-automatic insertion with smart anchors for PrestaShop 8 & 9 (Mistral, OpenAI) works, why we built it the way we did, and the thinking behind the features above.
Why semantic internal linking is superior to keyword-based linking
Classic internal linking modules work on keyword-to-URL rules. You enter "berber rug" and associate it with the URL of the berber-rug category. The engine then does a find-and-replace in the HTML of your articles, products or pages. This approach has two major limitations. It's rigid: it only triggers a link when the exact keyword appears, which excludes all pages where the topic is covered with different wording (moroccan rug, kilim, traditional rug). And it's blind to semantic context: the engine doesn't know whether the target page is actually relevant to the source content, it's just doing string matching. Semantic linking works differently: each piece of content (product, category, CMS page, blog post) is represented by a vector of over a thousand dimensions that encodes its meaning. Two pieces of content are linked if they're close in this vector space, regardless of the words used. The module thus spots linking opportunities a keyword engine would never see, and avoids false positives where a keyword appears in an irrelevant context.
AI embeddings: how it works in practice
On first indexing, the module iterates over all active entities of your shop for the enabled types (products, categories, CMS pages). For each entity, the textual content is extracted and cleaned: title, meta_title, meta_description, short description and long description (HTML is stripped). The cleaned text is then sent in batches to the configured AI provider (Mistral or OpenAI), which returns one embedding vector per item: a list of floats that represents the text's semantics. This vector is stored in the database as a little-endian float32 BLOB, with its L2 norm pre-computed to speed up later similarity calculations. Similarity between two pieces of content is then computed in PHP as a normalized dot product (cosine similarity), a fast operation once norms are pre-computed. On a catalog of 5,000 entities, computing all pairs in one language takes only a few seconds.
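The storage and similarity steps described above can be sketched in a few lines. This is a minimal illustration in Python rather than the module's PHP code; the function names (`pack_embedding`, `unpack_embedding`, `cosine_similarity`) and the four-dimensional toy vectors are assumptions made for the example.

```python
import math
import struct


def pack_embedding(vector):
    """Pack floats as little-endian float32 and return (blob, l2_norm)."""
    blob = struct.pack(f"<{len(vector)}f", *vector)
    # Compute the norm from the stored float32 values so it matches the blob.
    stored = struct.unpack(f"<{len(vector)}f", blob)
    norm = math.sqrt(sum(x * x for x in stored))
    return blob, norm


def unpack_embedding(blob):
    """Inverse of pack_embedding: 4 bytes per float32 component."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def cosine_similarity(blob_a, norm_a, blob_b, norm_b):
    """Dot product divided by the product of the pre-computed norms."""
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    a = unpack_embedding(blob_a)
    b = unpack_embedding(blob_b)
    if len(a) != len(b):
        raise ValueError("embeddings have different dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real 1024- or 1536-dimensional embeddings.
rug_blob, rug_norm = pack_embedding([0.5, 0.5, 0.0, 0.0])
kilim_blob, kilim_norm = pack_embedding([0.5, 0.25, 0.0, 0.0])
lamp_blob, lamp_norm = pack_embedding([0.0, 0.0, 1.0, 0.0])

print(cosine_similarity(rug_blob, rug_norm, kilim_blob, kilim_norm))  # close to 1
print(cosine_similarity(rug_blob, rug_norm, lamp_blob, lamp_norm))    # 0.0
```

Storing the norm next to the blob means each comparison needs only one dot product and one division, which is where the speed-up comes from.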
Why two providers rather than just one
Each provider has its sweet spot. Mistral mistral-embed is the recommended default: 1024 dimensions, very low latency, European hosting (EU sovereignty for sensitive shops), cost around 10 cents per million tokens, which is well under one euro to index a multilingual catalog of several thousand entities. OpenAI text-embedding-3-small is the alternative: 1536 dimensions (richer vector space), excellent on non-European languages, cost around 2 cents per million tokens (five times cheaper than Mistral in USD). The module unifies both providers behind a common interface: same return format, same batch mechanism, same error handling with PrestaShopLogger. You can switch from one provider to the other from the configuration dropdown: the module will detect that dimensions changed and prompt a reindex (one click on Reindex All).
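A common interface with a dimension check might look like the sketch below. It is written in Python for illustration only; `EmbeddingProvider`, `FakeProvider` and `needs_reindex` are hypothetical names, and the fake provider returns zero vectors instead of calling any real API.

```python
from typing import List, Optional, Protocol, Sequence


class EmbeddingProvider(Protocol):
    """Hypothetical common interface; names are illustrative only."""

    name: str
    dimensions: int

    def embed_batch(self, texts: Sequence[str]) -> List[List[float]]: ...


class FakeProvider:
    """Stand-in that returns zero vectors instead of calling a real API."""

    def __init__(self, name: str, dimensions: int) -> None:
        self.name = name
        self.dimensions = dimensions

    def embed_batch(self, texts: Sequence[str]) -> List[List[float]]:
        return [[0.0] * self.dimensions for _ in texts]


def needs_reindex(stored_dimensions: Optional[int], provider: EmbeddingProvider) -> bool:
    """True when stored vectors cannot be compared with the provider's output."""
    return stored_dimensions is not None and stored_dimensions != provider.dimensions


mistral = FakeProvider("mistral-embed", 1024)
openai = FakeProvider("text-embedding-3-small", 1536)

print(needs_reindex(1024, mistral))  # False: same dimensions as the index
print(needs_reindex(1024, openai))   # True: 1024-d index, 1536-d provider
```

Vectors of different lengths cannot be compared with a dot product, so a provider switch has to invalidate the existing index rather than mix the two.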
The anchor generator, the real centerpiece of the module
This is what sets the module apart from a raw suggestion module. For each (source, target) pair above the similarity threshold, the anchor generator runs the following algorithm: it extracts the target's title, splits it into n-grams of 2 to 6 words, removes stopwords (French and English), then looks for each of these n-grams verbatim in the source body. Found n-grams are ranked by decreasing length (longer ones are more discriminative and more SEO-optimized) and presented in the back-office dropdown. The default selected anchor is the longest one found, typically a 3- or 4-word anchor including the main keywords of the target title. If no n-gram of the target title appears in the source, the module offers the raw target title (fallback mode). You always keep control: editable dropdown, Custom option to type any anchor text. At insertion time, the module picks the first occurrence of the anchor text in the source body that's not already inside an <a>, <code> or <pre> tag, so it does not break an existing link or re-link already-linked text.
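The candidate search can be illustrated with a short Python sketch. Several details are assumptions made for the example and not confirmed by the description above: the tiny stopword set, the rule that drops n-grams beginning or ending with a stopword, and the case-insensitive matching. The function name `anchor_candidates` is hypothetical.

```python
import re

# Tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "in", "with",
             "le", "la", "les", "de", "du", "des", "et", "en", "pour"}


def anchor_candidates(target_title, source_body, min_n=2, max_n=6):
    """Return title n-grams found verbatim in the source, longest first."""
    words = re.findall(r"\w+(?:[-']\w+)*", target_title.lower())
    body = source_body.lower()
    found = []
    # Walk from the longest n-gram size down to the shortest.
    for n in range(min(max_n, len(words)), min_n - 1, -1):
        for start in range(len(words) - n + 1):
            gram = words[start:start + n]
            # Assumption: drop n-grams that begin or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            # Word boundaries keep "rug" from matching inside "rugby".
            pattern = r"\b" + re.escape(phrase) + r"\b"
            if phrase not in found and re.search(pattern, body):
                found.append(phrase)
    # Fallback mode: no n-gram matched, so offer the raw target title.
    return found or [target_title]


title = "Handwoven Berber Rug of the Atlas"
source = "Our handwoven berber rug pairs well with kilim cushions."
print(anchor_candidates(title, source))
# ['handwoven berber rug', 'handwoven berber', 'berber rug']
```

The first element of the returned list plays the role of the default anchor in the dropdown; the rest are the shorter alternatives.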
Surgical rollback via unique marker
This is the feature that reassures every merchant who's cautious with their descriptions. Each inserted link receives an HTML data-dfasl attribute carrying a unique 36-character identifier generated randomly at insertion (UUID-like format). The identifier is also stored in the database in the dfasl_inserted_link table, along with the source entity, target, anchor, insertion date and the ID of the employee who validated it. To remove a link, you go to the Inserted Links tab and click Remove next to the row in question: the module runs a regex matching exactly the <a> tag whose data-dfasl attribute carries this unique identifier, removes the <a> tag while preserving the anchor text intact, and marks the link as removed in the database. No other tag in the description is touched, no manual link is at risk. With 500 links inserted by the module across 200 product pages, you can remove a single one in one click without touching any of the other 499.
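Marker-based insertion and removal can be sketched as follows. This Python version is an illustration under assumptions, not the module's PHP code: `insert_link` and `remove_link` are hypothetical names, the HTML is split with a simple regex rather than a full parser, and the database bookkeeping is left out.

```python
import re
import uuid
from html import escape

PROTECTED = {"a", "code", "pre"}


def insert_link(body, anchor, url):
    """Wrap the first eligible occurrence of anchor; return (html, marker)."""
    marker = str(uuid.uuid4())  # 36 characters, UUID format
    parts = re.split(r"(<[^>]+>)", body)  # odd indexes are tags, even are text
    depth = 0  # > 0 while inside an a, code or pre element
    for i, part in enumerate(parts):
        if i % 2 == 1:
            tag = re.match(r"<\s*(/?)\s*([a-zA-Z0-9]+)", part)
            if tag and tag.group(2).lower() in PROTECTED:
                depth = max(0, depth + (-1 if tag.group(1) else 1))
            continue
        pos = part.find(anchor)
        if depth == 0 and pos != -1:
            link = f'<a href="{escape(url, quote=True)}" data-dfasl="{marker}">{anchor}</a>'
            parts[i] = part[:pos] + link + part[pos + len(anchor):]
            return "".join(parts), marker
    return body, None  # no eligible occurrence found


def remove_link(body, marker):
    """Unwrap only the <a> carrying this marker, keeping its anchor text."""
    pattern = re.compile(
        r'<a\b[^>]*\bdata-dfasl="' + re.escape(marker) + r'"[^>]*>(.*?)</a>',
        re.DOTALL,
    )
    return pattern.sub(lambda m: m.group(1), body, count=1)


original = '<p>See our <a href="/x">berber rug</a> guide. A berber rug lasts decades.</p>'
linked, marker = insert_link(original, "berber rug", "/berber-rug")
print(linked)
print(remove_link(linked, marker) == original)  # True
```

In the example the first "berber rug" sits inside an existing link and is skipped, so the second occurrence is wrapped; removing by marker restores the original HTML and leaves the manual link alone.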
CLI worker and processing strategy for large catalogs
On a catalog of a few dozen products, everything can be done from the back office: Reindex All then Process Batch is enough. Beyond a few thousand entities, the interface becomes slow and the user doesn't want to keep their browser open for hours. The module exposes a CLI worker (bin/analyze.php) that runs from the PHP command line with five options: --shop targets a specific shop in a multishop environment; --enqueue-all requeues all active entities before processing, which is useful for a full re-indexing after a provider or model change; --loop keeps looping while items remain to process; --max-batches limits the number of batches processed in a single run (anti-runaway safety); --sleep inserts a pause between batches (useful to stay under API rate limits). The typical command for a cron every 15 minutes is: php modules/dfaisemanticlinks/bin/analyze.php --loop --max-batches=50 --sleep=1. The worker automatically resets entries stuck in Processing status for more than 30 minutes (the case where a previous worker crashed), handles API errors by marking affected items as Error with the message, and continues processing healthy items in the batch.
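The worker's control flow (stuck-item reset, batch cap, per-item error handling) can be sketched with an in-memory queue. This is a simplified Python illustration, not the contents of bin/analyze.php: `QueueItem`, `reset_stuck` and `run_worker` are hypothetical names, the "Done" status is an assumption, and `embed` is called per item here where the real module sends texts in batches.

```python
import time
from dataclasses import dataclass
from typing import Optional

STUCK_AFTER_SECONDS = 30 * 60


@dataclass
class QueueItem:
    entity_id: int
    status: str = "Pending"  # Pending, Processing, Done (assumed) or Error
    started_at: float = 0.0
    error: Optional[str] = None


def reset_stuck(queue, now):
    """Put items left in Processing for over 30 minutes back to Pending."""
    count = 0
    for item in queue:
        if item.status == "Processing" and now - item.started_at > STUCK_AFTER_SECONDS:
            item.status = "Pending"
            count += 1
    return count


def run_worker(queue, embed, batch_size=2, loop=False, max_batches=50, sleep=0.0):
    """Process pending items in batches; return the number of batches run."""
    reset_stuck(queue, time.time())
    batches = 0
    while batches < max_batches:
        batch = [item for item in queue if item.status == "Pending"][:batch_size]
        if not batch:
            break
        for item in batch:
            item.status, item.started_at = "Processing", time.time()
        for item in batch:
            try:
                embed(item.entity_id)
                item.status = "Done"
            except Exception as exc:  # one failure must not stop the batch
                item.status, item.error = "Error", str(exc)
        batches += 1
        if not loop:
            break  # without --loop, a run handles a single batch
        if sleep:
            time.sleep(sleep)  # pause between batches, like --sleep
    return batches


def fake_embed(entity_id):
    """Stand-in for the provider call; entity 2 simulates an API error."""
    if entity_id == 2:
        raise RuntimeError("rate limited")


demo = [QueueItem(1), QueueItem(2), QueueItem(3)]
print(run_worker(demo, fake_embed, loop=True))
print([(item.entity_id, item.status) for item in demo])
```

The batch cap bounds each cron run even if the queue keeps growing, and items that fail keep their error message while the rest of the batch completes.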
Auto-reindexing and index freshness
An index that drifts away from the catalog loses its value. The module handles freshness via native PrestaShop hooks. On every change of a product, category or CMS page (hooks actionObjectProductUpdateAfter, actionObjectCategoryUpdateAfter, actionObjectCmsUpdateAfter), the affected entity is placed back in the queue with Pending status, in all active languages, and the next worker run processes it automatically. On deletion (hooks actionObjectProductDeleteAfter, actionObjectCategoryDeleteAfter, actionObjectCmsDeleteAfter), associated embeddings and suggestions are purged in cascade. The module also stores a content hash (SHA-256 of the cleaned text): if an entity is requeued without its actual content having changed (e.g. because an employee only touched the stock), the indexing batch detects the unchanged hash and skips the API call, which saves tokens. Auto-reindexing can be toggled from Settings (DFASL_AUTO_INDEX option), which is useful for pausing it during a massive CSV import and resuming with a Reindex All when the import is done.
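The hash check can be illustrated with Python's standard library. The cleaning rules in `clean_text` (strip tags, decode entities, collapse whitespace) are an assumption about what "cleaned text" means, and the function names are hypothetical.

```python
import hashlib
import html
import re


def clean_text(raw_html):
    """Strip tags, decode entities and collapse whitespace (assumed rules)."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def content_hash(raw_html):
    """SHA-256 hex digest of the cleaned text."""
    return hashlib.sha256(clean_text(raw_html).encode("utf-8")).hexdigest()


def should_embed(stored_hash, raw_html):
    """Return (needs_api_call, new_hash) for a requeued entity."""
    new_hash = content_hash(raw_html)
    return new_hash != stored_hash, new_hash


stored = content_hash("<p>Berber  rug</p>")
print(should_embed(stored, "<div>Berber rug</div>")[0])  # False: same cleaned text
print(should_embed(stored, "<p>Kilim rug</p>")[0])       # True: content changed
```

Hashing the cleaned text rather than the raw HTML means that markup-only edits, like a stock update, do not trigger a new embedding call.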
Native multi-shop and multilingual
The module is natively multi-shop and multilingual. Embeddings are scoped by the triplet (entity, language, shop) — the same product in two shops will have two independent embeddings if descriptions differ, and the same product in French and English will have two different embeddings even if the product sheet is the same. Suggestions never cross language boundaries: a French product will never get a link suggested to an English product (which would make no SEO sense). Shop boundaries are respected the same way. Configuration can differ per shop (API key, similarity threshold, indexed types, display hook) — useful when you have a B2B shop with technical content and a B2C shop with general-public content in the same infrastructure.
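The scoping rule can be made concrete with a small sketch: embeddings are keyed by (entity, language, shop), and candidate pairs are only formed inside one (language, shop) scope. This is a Python illustration with a hypothetical `candidate_pairs` helper and made-up entity labels and IDs.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(embeddings):
    """Yield (scope, left, right) for entities sharing a (language, shop) scope.

    embeddings maps (entity, id_lang, id_shop) to a vector or any payload.
    """
    by_scope = defaultdict(list)
    for entity, id_lang, id_shop in embeddings:
        by_scope[(id_lang, id_shop)].append(entity)
    for scope, entities in by_scope.items():
        for left, right in combinations(sorted(entities), 2):
            yield scope, left, right


index = {
    ("product:1", 1, 1): [0.1, 0.2],   # language 1, shop 1
    ("product:2", 1, 1): [0.1, 0.3],
    ("product:1", 2, 1): [0.4, 0.2],   # same product, language 2
    ("category:5", 2, 1): [0.4, 0.1],
    ("product:1", 1, 2): [0.1, 0.2],   # same product, shop 2
}
for pair in candidate_pairs(index):
    print(pair)
```

Because pairs are built per scope, a product in one language or shop is never compared with content from another, and the number of comparisons stays bounded by the largest single scope.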
Typical use cases
Multilingual fashion shop with a catalog of 2,000 products: semantic linking spots pairs of visually or stylistically close products (e.g. two variations of a dress cut) that keyword rules would systematically miss.
B2B technical shop with dense descriptions: semantic linking connects products that share the same industrial use case without identical vocabulary.
E-commerce blog: each article can automatically reference the most semantically relevant products, categories and other articles, with anchors extracted verbatim from the article text (the opposite of mechanical find-replace).
Catalog overhaul: after a massive import or a reorganization, a Reindex All rebuilds the linking in a few minutes where a manual strategy would have taken weeks of editorial work.
Catalog with strong specialized vocabulary (organic cosmetics, medical equipment, technical products): the module detects semantic proximities that non-experts wouldn't see, letting the SEO team discover non-obvious linking opportunities.
Internal architecture and PrestaShop 8 and 9 compatibility
The module is built in PHP 8.1+ with strict types, readonly properties and modern features (match, enums). Autoload is PSR-4 under the namespace DataFirefly\AiSemanticLinks\ mapped to src/. Admin controllers use the legacy ModuleAdminController (not Symfony Grid), a deliberate choice to guarantee stable compatibility between PrestaShop 8.0 and 9.x without having to maintain two code variants. A tiny in-house service container (ServiceContainer) wires repositories and business services, isolating the module from Symfony container differences between PrestaShop versions and avoiding a dependency that would break on every major update. The module uses five SQL tables prefixed dfasl_: embedding (vectors and hashes), queue (work queue), suggestion (proposed pairs), inserted_link (active links) and job (bulk operations). Uninstall cleanly drops the five tables and purges all DFASL_* configuration variables. Source code is delivered unencrypted and PSR-compliant: you can override, audit, or extend it as you wish.