Semantic Audit — Documentation

Run a semantic SEO audit of your PrestaShop catalog via vector clustering. This guide covers installation, configuration of the OpenAI, Mistral and local TF-IDF providers, reading the report, and automation.

Module version 1.0.5

Installation

Requirements

  • PrestaShop 8.0 to 9.x
  • PHP 7.4 minimum (8.x recommended)
  • MySQL 5.7+ or MariaDB 10.3+
  • An OpenAI or Mistral API key (optional — a local mode without API is included)

Install the module

  1. Download the dfsemanticaudit.zip file from your customer account.
  2. Either upload the ZIP as-is from Modules → Module manager → Upload a module, or unzip it and upload the dfsemanticaudit/ folder to /modules/ of your PrestaShop via FTP.
  3. Click Install.

Activate the module

The module automatically creates four SQL tables (ps_dfsa_content, ps_dfsa_audit, ps_dfsa_cluster, ps_dfsa_assignment, shown here with the default ps_ table prefix) and an admin tab accessible from the left menu.

Configuration

Before the first audit, go to Modules → DataFirefly → Semantic Audit → Configuration.

Choosing an embeddings provider

Three providers are available. The choice determines the quality of the clusters obtained.

OpenAI

The default provider. Uses the text-embedding-3-small model (1,536 dimensions). Top quality, marginal cost: about €0.02 per 1,000 products on the first indexing.

  • API key: create one at platform.openai.com/api-keys
  • Model: keep text-embedding-3-small by default. text-embedding-3-large (3,072 dim) gives slightly higher quality but costs about 6.5× more.

Mistral

European alternative hosted in France. Uses mistral-embed (1,024 dimensions). Pricing comparable to OpenAI.

Local TF-IDF

Runs entirely on your server, with no API calls, no recurring cost. Uses classic statistical NLP principles (normalized TF-IDF) with a 384-dimension output.

  • Quality sufficient for catalogs under 500 products.
  • Supports FR, EN, ES, DE, IT (built-in stopwords).
  • No API key required.

Tip — You can switch providers at any time. On the next audit, all content will be automatically re-embedded.
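
To give an idea of how a local TF-IDF mode can produce fixed-size vectors without any API call, here is a minimal Python sketch. The hashing trick used to reach 384 dimensions, the tokenizer, the smoothed IDF formula and the tiny stopword list are all assumptions made for illustration; the module's internal implementation may differ.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 384  # output size stated in this documentation
STOPWORDS = {"the", "a", "of", "and", "for"}  # tiny illustrative list, not the module's

def tokenize(text):
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]

def bucket(term):
    # Stable hash so the same term always lands in the same dimension (assumed technique).
    return int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16) % DIM

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        vec = [0.0] * DIM
        for term, tf in Counter(toks).items():
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF
            vec[bucket(term)] += tf * idf
        norm = math.sqrt(sum(x * x for x in vec))
        # L2-normalize so cosine comparisons only depend on direction.
        vectors.append([x / norm for x in vec] if norm else vec)
    return vectors
```

Because such vectors only reflect shared words, two products described with different vocabulary look unrelated, which is consistent with the recommendation to reserve this mode for small or homogeneous catalogs.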

Audit parameters

  • k (number of clusters): 8 by default. Range 2–50.
  • Off-topic threshold: cosine distance above which content is flagged. 0.55 by default. Range 0.1 to 1.5.
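
For reference, the sketch below shows the standard definition of cosine distance (1 minus cosine similarity) and how a threshold comparison works. It assumes the module uses this usual definition; the function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # convention chosen for this sketch when a vector is empty
    return 1.0 - dot / (na * nb)

def is_off_topic(vector, centroid, threshold=0.55):
    # Flag the content when it sits further from its centroid than the threshold.
    return cosine_distance(vector, centroid) > threshold
```

With this definition the distance ranges from 0 (same direction) to 2 (opposite direction), which is why a threshold above 1 is a valid setting.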

Auto-reindex

Enabled by default. The module registers hooks on the creation, modification and deletion of products, categories and CMS pages. On every change, the content is flagged for re-embedding at the next run, with no manual effort.

Running your first audit

Three steps to run in order from the dashboard.

Step 1 — Reindex content

Click Reindex content. The module walks through your products, active categories, CMS pages and manufacturers, computes a SHA1 hash of title + excerpt, and only flags new or modified content for processing.

For 1,000 items, this step takes a few seconds.
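
The hash-based change detection described above can be sketched in a few lines of Python. The field separator, the identifiers and the in-memory dictionary standing in for the stored hashes are assumptions for illustration.

```python
import hashlib

def content_hash(title, excerpt):
    # The separator between title and excerpt is an assumption of this sketch.
    return hashlib.sha1(f"{title}\n{excerpt}".encode("utf-8")).hexdigest()

def find_dirty(items, stored_hashes):
    """Return the ids whose hash is new or differs from the stored one."""
    dirty = []
    for item_id, (title, excerpt) in items.items():
        h = content_hash(title, excerpt)
        if stored_hashes.get(item_id) != h:
            dirty.append(item_id)
            stored_hashes[item_id] = h  # remember the new state
    return dirty
```

Running the function twice on unchanged content returns an empty list the second time, which is why later runs only pay for what actually changed.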

Step 2 — Generate embeddings

Click Generate embeddings. The module sends “dirty” content (new or modified since the last run) to the selected provider in batches of 50 (OpenAI/Mistral) or in a single local pass (TF-IDF). A progress bar shows how far along the run is.

For 1,000 items:

  • OpenAI: ~30 seconds
  • Mistral: ~40 seconds
  • Local TF-IDF: <1 second
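
The batching described in this step can be pictured as follows. This is only a sketch: embed_batch is a hypothetical callable standing in for the provider request, not a function of the module or of any provider SDK.

```python
def batches(items, size=50):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(texts, embed_batch, size=50):
    """embed_batch is a hypothetical callable: list of texts -> list of vectors."""
    vectors = []
    for chunk in batches(texts, size):
        vectors.extend(embed_batch(chunk))
    return vectors
```

With batches of 50, a run over 1,000 items amounts to 20 provider requests.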

Step 3 — Run the audit

Click Run audit. The k-means cosine clustering groups content into k clusters (k-means++ initialization, 50 max iterations), labels each cluster with its top terms (TF×IDF), computes each content’s distance to its centroid, and identifies outliers.

This step takes under a second, even for 5,000 items.
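
For readers who want to see the idea behind this step, here is a compact Python sketch of cosine k-means with k-means++ seeding and a 50-iteration cap. It is a generic textbook version written for illustration (function names, the fixed random seed and the handling of empty clusters are assumptions), not the module's code.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cos_dist(a, b):
    # Inputs are unit vectors, so cosine distance is 1 - dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen seed.
        weights = [min(cos_dist(p, c) for c in centroids) ** 2 for p in points]
        if sum(weights) == 0:
            centroids.append(rng.choice(points))  # every point coincides with a seed
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def cosine_kmeans(vectors, k, max_iter=50, seed=0):
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = kmeans_pp_init(points, k, rng)
    labels = [-1] * len(points)
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: cos_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster empties
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    distances = [cos_dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    return labels, centroids, distances
```

The returned distances are the per-item distances to the assigned centroid, the quantity that the off-topic threshold is compared against.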

Understanding the report

Dashboard

Four KPIs at the top:

  • Indexed content — total products, categories, CMS and manufacturers embedded.
  • Off-topic pages — absolute count, percentage rate and breakdown by type.
  • Thematic clusters — number of thematic groups identified.
  • Median distance — median cosine distance to the centroid. Below 0.40 = highly coherent catalog. Above 0.60 = scattered catalog.
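
As a quick illustration of the median distance reading, the Python sketch below applies the two cut-offs given above to a list of distances. The "intermediate" label for values between 0.40 and 0.60 is an assumption; the dashboard does not name that band.

```python
from statistics import median

def coherence_reading(distances):
    m = median(distances)
    if m < 0.40:
        verdict = "highly coherent"
    elif m > 0.60:
        verdict = "scattered"
    else:
        verdict = "intermediate"  # label assumed for this sketch
    return m, verdict
```

The median is less sensitive than the mean to a handful of extreme outliers, so a few stray pages do not change the overall reading.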

Clusters

Clusters view: detailed list sorted by size, with automatic label (top 5 TF×IDF terms), cohesion score (0 = scattered, 1 = identical), and size (number of items).

A cluster with cohesion < 0.40 is too heterogeneous — often a sign that the topic should be split into two sub-themes, or that k is too low.
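
The automatic labels can be understood with the following Python sketch, which ranks a cluster's terms by frequency inside the cluster multiplied by rarity across the whole catalog. The tokenizer, the smoothed IDF formula and the alphabetical tie-break are assumptions, and no stopword filtering is applied here.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def top_terms(cluster_texts, all_texts, n=5):
    """Rank terms by (count inside the cluster) x (IDF over the whole corpus)."""
    df = Counter(term for text in all_texts for term in set(tokenize(text)))
    tf = Counter(term for text in cluster_texts for term in tokenize(text))
    total = len(all_texts)
    scores = {
        term: count * math.log((1 + total) / (1 + df[term]))  # smoothed IDF
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for a stable label.
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]
```

A term that appears in every cluster gets a low IDF and drops out of the label, which is what makes the top terms characteristic of one cluster rather than of the catalog as a whole.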

2D semantic map

Projection of all content onto a 2D plane via the Johnson-Lindenstrauss technique (a random projection that approximately preserves distances).
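
A Gaussian random projection to two dimensions can be sketched as below. This is a generic illustration of the technique with a hypothetical function name and a fixed seed; the module's exact projection matrix is not documented here.

```python
import math
import random

def random_projection_2d(vectors, seed=0):
    dim = len(vectors[0])
    rng = random.Random(seed)  # fixed seed so the map is reproducible
    # Two random Gaussian axes; the 1/sqrt(2) scale keeps squared lengths right on average.
    axes = [[rng.gauss(0.0, 1.0) / math.sqrt(2) for _ in range(dim)] for _ in range(2)]
    return [tuple(sum(a * x for a, x in zip(axis, v)) for axis in axes) for v in vectors]
```

With only two target dimensions the preservation of distances is rough, so the map is best read as a visual overview; the distances in the report tables are computed in the full embedding space.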

Each dot is a piece of content, each color a cluster. Crosses mark the centroids. Dots with a red outline are outliers. The legend on the right lets you hide/show each cluster individually by clicking on it.

Quick read — If you see dots of one color stranded far from their centroid, those are priority candidates for moving.

Off-topic pages

Off-topic pages view: sortable table of content whose cosine distance to the centroid exceeds the configured threshold. For each row:

  • Type, title, public URL and direct link to the edit form
  • Current cluster (with its color)
  • Distance to centroid (higher = further away)
  • Suggested cluster (if relevant)
  • Δ Gain: distance reduction if the content were moved
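
The columns above can be related to each other with a short Python sketch. It assumes that the suggested cluster is the nearest other centroid, that the gain is the difference between the current and the suggested distance, and that no suggestion is made when even the nearest centroid is beyond the threshold; these rules are inferred for illustration, not taken from the module's code.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def suggest_move(vector, centroids, current, threshold=0.55):
    distances = [cosine_distance(vector, c) for c in centroids]
    current_distance = distances[current]
    if current_distance <= threshold:
        return None  # within the threshold: not off-topic
    best = min(range(len(centroids)), key=distances.__getitem__)
    viable = best != current and distances[best] <= threshold
    return {
        "distance": current_distance,
        "suggested": best if viable else None,  # None: no viable destination
        "gain": current_distance - distances[best] if viable else 0.0,
    }
```

For example, with centroids [[1, 0], [0, 1]], an item at [0.1, 1] currently assigned to cluster 0 gets cluster 1 as its suggestion with a gain close to 0.9.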

Hopeless pages

At the bottom of the Off-topic view, a special section lists content far from all clusters. The module didn’t find a viable destination for them.

Three options to consider:

  1. Delete if the page has no SEO traffic or conversion.
  2. Noindex to preserve crawl budget without losing history.
  3. Rewrite to align the content with an existing cluster.

Restructuring suggestions

Suggestions view: same as Off-topic pages but action-focused. All proposed moves are listed with the expected coherence gain. Sort by descending gain to handle the most impactful cases first.

The module never modifies your tree structure automatically. Changes remain under your control via the standard PrestaShop back-office.

CSV export

From each report view, an Export CSV button lets you download raw data. Useful for:

  • Sharing the report with an external SEO consultant
  • Processing data in Excel/Sheets
  • Archiving an audit’s state before modifying the tree

Cron automation

The module exposes a signed URL shown on the configuration page. It triggers the full indexing → embeddings → audit pipeline in headless mode.

Example weekly cron job (every Monday at 3am):

0 3 * * 1 wget -q -O /dev/null "https://your-store.com/modules/dfsemanticaudit/cron.php?token=YOUR_TOKEN"

The token is derived from your PrestaShop’s _COOKIE_KEY_ and only changes on reinstallation. Save it carefully.
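
If you would rather log the outcome than discard it as the wget line does, the call can also be made from a small Python script. This is a sketch only: the 600-second timeout is an arbitrary assumption, and what the endpoint returns in its response body is not documented here, so only the HTTP status is read.

```python
import urllib.parse
import urllib.request

def cron_url(base_url, token):
    # Same URL as in the crontab example, with the token safely URL-encoded.
    query = urllib.parse.urlencode({"token": token})
    return f"{base_url.rstrip('/')}/modules/dfsemanticaudit/cron.php?{query}"

def trigger_audit(base_url, token, timeout=600):
    # Returns the HTTP status code; urlopen raises on network errors and 4xx/5xx responses.
    with urllib.request.urlopen(cron_url(base_url, token), timeout=timeout) as response:
        return response.status
```

Calling trigger_audit("https://your-store.com", "YOUR_TOKEN") from a scheduled script then gives you a status code to write to your own log.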

API costs

Estimate for an average catalog (1,000 products):

  • OpenAI text-embedding-3-small: ~€0.02 the first time, then near-zero (only modified content is reprocessed)
  • OpenAI text-embedding-3-large: ~€0.13 the first time
  • Mistral mistral-embed: ~€0.10 the first time
  • Local TF-IDF: €0
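
To scale these estimates to your own catalog, a back-of-the-envelope calculation is enough. The Python sketch below reuses the per-1,000-item figures listed above and assumes that cost grows linearly with the number of items; the provider keys are hypothetical labels, and actual billing depends on text length and current provider pricing.

```python
# First-run cost per 1,000 items in EUR, taken from the estimates listed above.
COST_PER_1000 = {
    "openai-small": 0.02,
    "openai-large": 0.13,
    "mistral-embed": 0.10,
    "tfidf": 0.0,
}

def estimate_cost(n_items, provider, changed_fraction=1.0):
    """changed_fraction=1.0 models the first indexing; later runs only re-embed changes."""
    return n_items / 1000 * COST_PER_1000[provider] * changed_fraction
```

For instance, a 10,000-item catalog on mistral-embed where 5% of the content changed since the last run comes out at about €0.05 for that run.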

Multilingual and multi-store

The module is natively multilingual and multi-store. Each audit runs on a given language × store pair, using the back-office context language.

To audit your French store then your English store, change the language in PrestaShop’s top bar, then trigger a new audit.

Note — The module respects each piece of content’s original language, with no automatic translation attempt. The clusters obtained will be different for each language, which is normal.

Troubleshooting

The “Generate embeddings” step fails with a 401 error

Your API key is invalid or expired. Check it on the configuration page and reconfigure if necessary.

The “Generate embeddings” step fails with a 429 error

You’ve hit your provider’s rate limit. Wait a few minutes and retry — the module will resume where it left off thanks to its batch processing.

No cluster seems relevant

Three avenues:

  1. Increase the number of clusters (k). If your catalog has 10 distinct themes but k=4, clustering can’t separate them.
  2. Switch from local TF-IDF mode to OpenAI or Mistral. On heterogeneous catalogs, semantic quality makes all the difference.
  3. Check that your content titles and descriptions are rich enough. A product with a 2-word title and no description won’t produce a good embedding.

Too many pages flagged as off-topic

Raise the off-topic threshold (e.g. from 0.55 to 0.70). A high number of flagged pages is expected if your catalog legitimately covers several broad themes.

No off-topic page flagged but the catalog feels incoherent

Lower the threshold (e.g. from 0.55 to 0.40) to tighten detection.
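
Before changing the setting in either direction, it can help to see how many items each candidate threshold would flag. The Python sketch below does this on a plain list of distances, for example values copied from an exported report (the exact layout of the CSV is not assumed here).

```python
def flagged_share(distances, thresholds=(0.40, 0.55, 0.70)):
    """Share of items whose distance exceeds each candidate threshold."""
    n = len(distances)
    return {t: sum(d > t for d in distances) / n for t in thresholds}
```

On the distances [0.1, 0.45, 0.6, 0.8], this gives 75% flagged at 0.40, 50% at 0.55 and 25% at 0.70.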

FAQ

Is an API key mandatory?

No. Local TF-IDF mode works without any external connection. It’s slightly less accurate than OpenAI/Mistral but enough to get started or for a homogeneous catalog.

Does the module modify my tree structure automatically?

No. The module only recommends. All content moves remain your responsibility via PrestaShop’s standard back-office.

How do I choose the number of clusters (k)?

Rule of thumb: k ≈ number of top-level main categories. Default k=8 works well between 100 and 5,000 products. Run 2 or 3 audits with different k values to compare if you’re unsure — previous audits stay in history.

Is my content sent to a third-party server?

With OpenAI or Mistral: yes, your content’s titles and excerpts are sent to their embeddings API. With local TF-IDF mode: no, no data leaves your server.

How long are audits kept?

Indefinitely, until manual deletion from the dashboard. You can browse the full history to measure the evolution of your semantic coherence over time.

Does the module work in multi-store?

Yes. Each store in a multistore setup can have its own independent audits.
