Semantic Audit — Documentation
Semantic SEO audit of your PrestaShop catalog via vector clustering. Installation, configuration of OpenAI / Mistral / local TF-IDF providers, reading the report and automation.
Installation
Requirements
- PrestaShop 8.0 to 9.x
- PHP 7.4 minimum (8.x recommended)
- MySQL 5.7+ or MariaDB 10.3+
- An OpenAI or Mistral API key (optional — a local mode without API is included)
Install the module
- Unzip the dfsemanticaudit.zip file downloaded from your customer account.
- Upload the dfsemanticaudit/ folder to /modules/ of your PrestaShop via FTP, or use the ZIP install from Modules → Module manager → Upload a module.
- Click Install.
Activate the module
The module automatically creates four SQL tables (ps_dfsa_content, ps_dfsa_audit, ps_dfsa_cluster, ps_dfsa_assignment, shown here with the default ps_ table prefix) and an admin tab accessible from the left menu.
Configuration
Before the first audit, go to Modules → DataFirefly → Semantic Audit → Configuration.
Choosing an embeddings provider
Three providers are available. The choice determines the quality of the clusters obtained.
OpenAI (recommended)
The default provider. Uses the text-embedding-3-small model (1,536 dimensions). Top quality, marginal cost: about €0.02 per 1,000 products on the first indexing.
- API key: create one at platform.openai.com/api-keys
- Model: keep text-embedding-3-small by default. text-embedding-3-large (3,072 dim) gives slightly higher quality but costs about 6.5× more.
Mistral
European alternative hosted in France. Uses mistral-embed (1,024 dimensions). The cost stays marginal: about €0.10 per 1,000 products on the first indexing.
- API key: create one at console.mistral.ai
- Model: mistral-embed
Local TF-IDF
Runs entirely on your server, with no API calls, no recurring cost. Uses classic statistical NLP principles (normalized TF-IDF) with a 384-dimension output.
- Quality sufficient for catalogs under 500 products.
- Supports FR, EN, ES, DE, IT (built-in stopwords).
- No API key required.
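As an illustration of what "normalized TF-IDF with a 384-dimension output" can look like, here is a minimal sketch. The module's actual implementation is not documented here, so several points are assumptions: the hashing trick used to fold terms into 384 buckets, the smoothed IDF formula, the tiny stopword list, and all function names.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 384  # output size stated in the documentation
STOPWORDS = {"the", "a", "of", "and", "le", "la", "de", "et"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on word characters, drop stopwords."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]

def bucket(term):
    """Map a term to one of DIM dimensions with a stable hash (assumed technique)."""
    return int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16) % DIM

def tfidf_vectors(texts):
    """Return one L2-normalized TF-IDF vector of length DIM per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    df = Counter(term for doc in docs for term in doc)  # document frequency
    n = len(docs)
    vectors = []
    for doc in docs:
        vec = [0.0] * DIM
        total = sum(doc.values()) or 1
        for term, count in doc.items():
            idf = math.log((1 + n) / (1 + df[term])) + 1  # smoothed IDF
            vec[bucket(term)] += (count / total) * idf
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vectors
```

Two texts that share words ("red cotton shirt", "blue cotton shirt") end up with a higher dot product than two texts with no word in common, which is all a purely statistical model can capture.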
Audit parameters
- k (number of clusters): 8 by default. Range 2–50.
- Off-topic threshold: cosine distance above which content is flagged. 0.55 by default. Range 0.1 to 1.5.
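To make the threshold concrete, the sketch below computes a cosine distance and compares it with the default value of 0.55. The function names are hypothetical, and treating an all-zero vector as distance 1 is a convention chosen for this example only.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention for this sketch: an empty vector is "unrelated"
    return 1.0 - dot / (norm_a * norm_b)

def is_off_topic(vector, centroid, threshold=0.55):
    """Flag content whose distance to its cluster centroid exceeds the threshold."""
    return cosine_distance(vector, centroid) > threshold
```

Raising the threshold flags fewer pages, lowering it flags more.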
Auto-reindex
Enabled by default. The module registers hooks on the creation, modification and deletion of products, categories and CMS pages. On every change, the content is flagged for re-embedding at the next run, with no manual effort.
Running your first audit
Three steps to run in order from the dashboard.
Step 1 — Reindex content
Click Reindex content. The module walks through your products, active categories, CMS pages and manufacturers, computes a SHA1 hash of title + excerpt, and only flags new or modified content for processing.
For 1,000 items, this step takes a few seconds.
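The hash-based change detection described above can be sketched as follows. This is an illustration only: the exact way the module concatenates title and excerpt, and the data structures used here, are assumptions.

```python
import hashlib

def content_hash(title, excerpt):
    """SHA1 of title + excerpt as a hex string (plain concatenation is assumed)."""
    return hashlib.sha1((title + excerpt).encode("utf-8")).hexdigest()

def find_dirty(items, stored_hashes):
    """Return the ids whose content is new or has changed.

    items: dict mapping id -> (title, excerpt)
    stored_hashes: dict mapping id -> hash recorded at the previous run
    """
    dirty = []
    for item_id, (title, excerpt) in items.items():
        if stored_hashes.get(item_id) != content_hash(title, excerpt):
            dirty.append(item_id)
    return dirty
```

Because unchanged content produces the same hash, only new or edited items need a fresh embedding.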
Step 2 — Generate embeddings
Click Generate embeddings. The module sends “dirty” content to the selected provider in batches of 50 (OpenAI/Mistral) or in a single local pass (TF-IDF). A progress bar shows how far the run has advanced.
For 1,000 items:
- OpenAI: ~30 seconds
- Mistral: ~40 seconds
- Local TF-IDF: <1 second
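Batching in groups of 50 can be pictured with the short sketch below. The embed_batch callable stands in for a provider call and is hypothetical; it is assumed to take a list of texts and return one vector per text.

```python
def batches(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(texts, embed_batch, size=50):
    """Call embed_batch once per batch and collect the vectors in order."""
    vectors = []
    for chunk in batches(texts, size):
        vectors.extend(embed_batch(chunk))
    return vectors
```

With 1,000 dirty items, this makes 20 provider calls instead of 1,000.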
Step 3 — Run the audit
Click Run audit. Cosine k-means clustering groups content into k clusters (k-means++ initialization, at most 50 iterations), labels each cluster with its top TF×IDF terms, computes each item’s distance to its centroid, and identifies outliers.
This step takes under a second, even for 5,000 items.
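For readers who want to see the clustering step spelled out, here is a minimal pure-Python sketch of cosine k-means with k-means++ initialization and a 50-iteration cap. It is not the module's code: function names, the fixed seed, and the handling of empty clusters (the old centroid is kept) are assumptions.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cos_dist(a, b):
    """Cosine distance between two unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++: pick each new centroid with probability proportional to squared distance."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(cos_dist(p, c) for c in centroids) ** 2 for p in points]
        if sum(weights) == 0:  # all points coincide with a centroid
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids

def cosine_kmeans(vectors, k, max_iter=50, seed=0):
    """Return (labels, centroids) after at most max_iter assignment/update rounds."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: cos_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels, centroids
```

Each item's distance to its assigned centroid, computed with cos_dist, is the value the report compares with the off-topic threshold.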
Understanding the report
Dashboard
Four KPIs at the top:
- Indexed content — total products, categories, CMS and manufacturers embedded.
- Off-topic pages — absolute count, percentage rate and breakdown by type.
- Thematic clusters — number of thematic groups identified.
- Median distance — median cosine distance to the centroid. Below 0.40 = highly coherent catalog. Above 0.60 = scattered catalog.
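The reading of the median distance can be expressed as a small helper. The 0.40 and 0.60 bounds come from the text above; the function name and the "intermediate" label for values in between are assumptions made for this sketch.

```python
import statistics

def coherence_verdict(distances):
    """Return (median, verdict) for a list of distances to the centroid."""
    median = statistics.median(distances)
    if median < 0.40:
        verdict = "highly coherent"
    elif median > 0.60:
        verdict = "scattered"
    else:
        verdict = "intermediate"  # label chosen for this sketch
    return median, verdict
```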
Clusters
Clusters view: detailed list sorted by size, with automatic label (top 5 TF×IDF terms), cohesion score (0 = scattered, 1 = identical), and size (number of items).
A cluster with cohesion < 0.40 is too heterogeneous — often a sign that the topic should be split into two sub-themes, or that k is too low.
2D semantic map
Projection of all content onto a 2D plane via the Johnson-Lindenstrauss technique (a random projection that approximately preserves distances).
Each dot is a piece of content, each color a cluster. Crosses mark the centroids. Dots with a red outline are outliers. The legend on the right lets you hide/show each cluster individually by clicking on it.
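A random projection onto two dimensions can be sketched in a few lines. This is an illustration of the general technique, not the module's code: the Gaussian directions, the fixed seed and the function name are assumptions.

```python
import math
import random

def random_projection_2d(vectors, seed=0):
    """Project each vector onto 2 random Gaussian directions.

    Dividing by sqrt(2) keeps the expected squared norm of a vector unchanged.
    """
    if not vectors:
        return []
    dim = len(vectors[0])
    rng = random.Random(seed)  # fixed seed: the map looks the same on every run
    axes = [[rng.gauss(0.0, 1.0) / math.sqrt(2) for _ in range(dim)] for _ in range(2)]
    return [tuple(sum(a * x for a, x in zip(axis, v)) for axis in axes) for v in vectors]
```

With only two output dimensions the distances are preserved very loosely, so the map is best read as a visual aid; the distances shown in the report tables are the ones computed in the full embedding space.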
Off-topic pages
Off-topic pages view: sortable table of content whose cosine distance to the centroid exceeds the configured threshold. For each row:
- Type, title, public URL and direct link to the edit form
- Current cluster (with its color)
- Distance to centroid (higher = further away)
- Suggested cluster (if relevant)
- Δ Gain: distance reduction if the content were moved
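One plausible way to derive the suggested cluster and the Δ Gain is sketched below: the suggestion is the nearest centroid, and the gain is the current distance minus the distance to that centroid. The module's exact rule is not documented here, so treat the function names and the "no suggestion if the current cluster is already the nearest" behaviour as assumptions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def suggest_move(vector, centroids, current):
    """Return (suggested_cluster, gain), or (None, 0.0) if no other cluster is closer."""
    distances = [cosine_distance(vector, c) for c in centroids]
    best = min(range(len(centroids)), key=lambda j: distances[j])
    if best == current:
        return None, 0.0
    return best, distances[current] - distances[best]
```

For example, a page assigned to cluster 0 at distance 0.90 whose nearest centroid is cluster 1 at distance 0.05 would get cluster 1 as its suggestion, with a gain of 0.85.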
Hopeless pages
At the bottom of the Off-topic view, a special section lists content far from all clusters. The module didn’t find a viable destination for them.
Three options to consider:
- Delete if the page has no SEO traffic or conversion.
- Noindex to preserve crawl budget without losing history.
- Rewrite to align the content with an existing cluster.
Restructuring suggestions
Suggestions view: same as Off-topic pages but action-focused. All proposed moves are listed with the expected coherence gain. Sort by descending gain to handle the most impactful cases first.
The module never modifies your tree structure automatically. Changes remain under your control via the standard PrestaShop back-office.
CSV export
From each report view, an Export CSV button lets you download raw data. Useful for:
- Sharing the report with an external SEO consultant
- Processing data in Excel/Sheets
- Archiving an audit’s state before modifying the tree
Cron automation
The module exposes a signed URL shown on the configuration page. It triggers the full indexing → embeddings → audit pipeline in headless mode.
Example weekly cron job (every Monday at 3am):
0 3 * * 1 wget -q -O /dev/null "https://your-store.com/modules/dfsemanticaudit/cron.php?token=YOUR_TOKEN"
The token is derived from your PrestaShop’s _COOKIE_KEY_ and changes only on reinstallation. Store it securely, since anyone holding the URL can trigger the pipeline.
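If you would rather trigger the pipeline from a script than from wget, for instance to log failures, a minimal Python sketch could look like this. The cron.php path and token parameter come from the example above; the function names and the 600-second timeout are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def build_cron_url(base_url, token):
    """Build the signed cron URL from the store base URL and the token."""
    query = urlencode({"token": token})
    return base_url.rstrip("/") + "/modules/dfsemanticaudit/cron.php?" + query

def trigger_audit(base_url, token, timeout=600):
    """Call the cron URL and return the HTTP status.

    urlopen raises an exception on network errors and on 4xx/5xx responses.
    """
    with urlopen(build_cron_url(base_url, token), timeout=timeout) as response:
        return response.status
```

Calling trigger_audit("https://your-store.com", "YOUR_TOKEN") from a scheduled job requests the same URL as the wget line.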
API costs
Estimate for an average catalog (1,000 products):
- OpenAI text-embedding-3-small: ~€0.02 the first time, then near-zero (only modified content is reprocessed)
- OpenAI text-embedding-3-large: ~€0.13 the first time
- Mistral mistral-embed: ~€0.10 the first time
- Local TF-IDF: €0
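To scale these figures to your own catalog, a back-of-the-envelope calculation is enough. The sketch below assumes the cost grows linearly with the number of items and that items have a text length similar to the 1,000-product estimate; the provider keys and function names are made up for the example.

```python
# First-run estimates from the list above, in euros per 1,000 items
FIRST_RUN_EUR_PER_1000 = {
    "openai-3-small": 0.02,
    "openai-3-large": 0.13,
    "mistral-embed": 0.10,
    "local-tfidf": 0.0,
}

def first_run_cost(provider, item_count):
    """Estimated cost in euros of embedding item_count items (linear scaling assumed)."""
    return FIRST_RUN_EUR_PER_1000[provider] * item_count / 1000

def rerun_cost(provider, modified_count):
    """Later runs only pay for items modified since the last run."""
    return first_run_cost(provider, modified_count)
```

Under these assumptions, a 5,000-product catalog on text-embedding-3-small costs about €0.10 for the first indexing, and a weekly run touching 50 modified products costs about €0.001.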
Multilingual and multi-store
The module is natively multilingual and multi-store. Each audit runs on a given language × store pair, using the back-office context language.
To audit your French store then your English store, change the language in PrestaShop’s top bar, then trigger a new audit.
Troubleshooting
The “Generate embeddings” step fails with a 401 error
Your API key is invalid or expired. Check it on the configuration page and reconfigure if necessary.
The “Generate embeddings” step fails with a 429 error
You’ve hit your provider’s rate limit. Wait a few minutes and retry: because embeddings are processed in batches, the module resumes where it left off.
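For context, clients of rate-limited APIs usually handle a 429 with retries and exponential backoff. The sketch below shows that general pattern; it is not taken from the module. RateLimitError and embed_batch are hypothetical stand-ins for a provider client.

```python
import time

class RateLimitError(Exception):
    """Hypothetical error raised by the provider call on HTTP 429."""

def embed_with_retry(embed_batch, chunk, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call embed_batch(chunk), waiting 1s, 2s, 4s, ... between rate-limited attempts."""
    for attempt in range(max_attempts):
        try:
            return embed_batch(chunk)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)
```

Because each batch is retried on its own, batches that already succeeded do not have to be sent again.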
No cluster seems relevant
Three avenues:
- Increase the number of clusters (k). If your catalog has 10 distinct themes but k=4, clustering can’t separate them.
- Switch from local TF-IDF mode to OpenAI or Mistral. On heterogeneous catalogs, semantic quality makes all the difference.
- Check that your content titles and descriptions are rich enough. A product with a 2-word title and no description won’t produce a good embedding.
Too many pages flagged as off-topic
Raise the off-topic threshold (e.g. from 0.55 to 0.70). A high number of flagged pages is expected if your catalog legitimately covers several broad themes.
No off-topic page flagged but the catalog feels incoherent
Lower the threshold (e.g. from 0.55 to 0.40) to tighten detection.
FAQ
Is an API key mandatory?
No. Local TF-IDF mode works without any external connection. It’s slightly less accurate than OpenAI/Mistral but enough to get started or for a homogeneous catalog.
Does the module modify my tree structure automatically?
No. The module only recommends. All content moves remain your responsibility via PrestaShop’s standard back-office.
How do I choose the number of clusters (k)?
Rule of thumb: k ≈ number of top-level categories. The default k=8 works well for catalogs of 100 to 5,000 products. If you’re unsure, run two or three audits with different k values and compare them; previous audits stay in history.
Is my content sent to a third-party server?
With OpenAI or Mistral: yes, your content’s titles and excerpts are sent to their embeddings API. With local TF-IDF mode: no, no data leaves your server.
How long are audits kept?
Indefinitely, until manual deletion from the dashboard. You can browse the full history to measure the evolution of your semantic coherence over time.
Does the module work in multi-store?
Yes. Each store in the multistore can have its own independent audits.