Architecture

This document describes the technical architecture of Tractor.

Tech Stack

Component	Technology
Frontend	Next.js 15 (React 19), Material-UI v7
Backend	Django 5.2, Django REST Framework
Database	PostgreSQL 15
Task Queue	django-q2
NLP	SpanCat (spaCy 3.8), GLiNER (HuggingFace), Microsoft Presidio
Authentication	NextAuth v5, JWT

Project Structure

tractor/
├── frontend/src/
│   ├── app/                    # Next.js App Router pages
│   │   └── (dashboard)/        # Grouped route with shared layout
│   ├── components/             # React components
│   ├── services/               # API client wrappers
│   └── api/apiClient.js        # Axios instance with auth interceptor
├── backend/                    # Django project settings
├── cases/                      # Django app: Case, Document, Redaction models
├── authentication/             # Django app: JWT + Microsoft Entra ID auth
└── training/                   # Django app: Model training, spaCy integration

Data Flow

Document Upload: User uploads document → stored in media/originals
Text Extraction: python-docx (DOCX) or pdfplumber (PDF) extracts text and structure
Entity Recognition (Three-model pipeline):
- SpanCat identifies OPERATIONAL and THIRD_PARTY spans using the custom trained model (optional; skipped gracefully if no model has been trained yet)
- GLiNER identifies THIRD_PARTY spans (names, orgs, locations, DOB, addresses) using a zero-shot model from HuggingFace
- Presidio identifies structured THIRD_PARTY PII (phone numbers, email addresses, NHS numbers, postcodes, National Insurance numbers) and structured OPERATIONAL references (crime references, collar numbers) via pattern recognisers
- Results are deduplicated with priority: SpanCat > GLiNER > Presidio
Data Subject Filtering: Entities matching the case's data subject name or DOB are excluded from suggestions
User Review: User accepts/rejects redactions in the UI. Adjacent same-type spans are automatically merged into compound display items for easier review. Merged items can be split and reviewed individually.
Export: WeasyPrint generates PDF exports with redactions applied
Training: Accepted redactions from completed documents feed into the SpanCat training pipeline

API Endpoints

All endpoints are prefixed with /api/.

Authentication (`/api/auth/`)

Method	Endpoint	Description
POST	`/login`	Login with username/password
POST	`/logout`	Logout current user
GET	`/user`	Get current user details
POST	`/token/verify`	Verify JWT token
POST	`/token/refresh`	Refresh JWT token
POST	`/microsoft`	Microsoft Entra ID callback
GET	`/api-keys`	List active API keys (admin only)
POST	`/api-keys`	Generate a new API key (admin only)
DELETE	`/api-keys/<id>`	Revoke an API key (admin only)

Cases (`/api/cases`)

Method	Endpoint	Description
GET	`/cases`	List all cases
POST	`/cases`	Create a new case
GET	`/cases/<case_id>`	Get case details
PATCH	`/cases/<case_id>`	Update case
DELETE	`/cases/<case_id>`	Delete case
POST	`/cases/<case_id>/export`	Generate disclosure package

Documents (`/api/cases/...`)

Method	Endpoint	Description
GET	`/cases/<case_id>/documents`	List documents in case
POST	`/cases/<case_id>/documents`	Upload document(s)
GET	`/cases/documents/<document_id>`	Get document details
PATCH	`/cases/documents/<document_id>`	Update document
DELETE	`/cases/documents/<document_id>`	Delete document
POST	`/cases/documents/<document_id>/resubmit`	Resubmit for processing
GET	`/cases/<case_id>/document/<document_id>/review`	Get document for review

Redactions (`/api/cases/document/...`)

Method	Endpoint	Description
GET	`/cases/document/<document_id>/redaction`	List redactions
POST	`/cases/document/<document_id>/redaction`	Create redaction
GET	`/cases/document/redaction/<id>`	Get redaction details
PATCH	`/cases/document/redaction/<id>`	Update redaction (accept/reject)
DELETE	`/cases/document/redaction/<id>`	Delete redaction
POST	`/cases/document/redaction/<id>/context`	Add/update context
PATCH	`/cases/document/<document_id>/redactions/bulk`	Bulk accept/reject/retype multiple redactions

Models & Training (`/api/...`)

Method	Endpoint	Description
GET	`/models`	List trained models
GET	`/models/<id>`	Get model details
POST	`/models/<id>/set-active`	Activate a model
POST	`/training/run-now`	Trigger manual training
GET	`/training-docs`	List training documents
POST	`/training-docs`	Upload training document
GET	`/schedules`	List training schedules
POST	`/schedules`	Create training schedule
GET	`/training-runs`	List training run history

Authentication Flow

Tractor supports two authentication methods, both enforced at the DRF layer.

JWT (Interactive Users)

The Next.js frontend authenticates via NextAuth v5 using either username/password credentials or Microsoft Entra ID OAuth2.
On login, NextAuth calls POST /api/auth/login (or POST /api/auth/microsoft) and stores the Django-issued JWT access and refresh tokens inside the NextAuth session JWT (server-side only).
All frontend API calls include Authorization: Bearer <access_token>.
The access token is valid for 60 minutes. Before it expires, NextAuth transparently refreshes it via POST /api/auth/token/refresh using the refresh token, which is valid for 7 days.
DRF authenticates the request via rest_framework_simplejwt.authentication.JWTAuthentication.

API Key (External Services / Machine-to-Machine)

An administrator generates an API key via the Settings page (admin-only card) or Django Admin. The key is stored as a SHA-256 hash — the raw value is shown once and never persisted.
External services include the key as Authorization: Api-Key <key>.
authentication.authentication.APIKeyAuthentication hashes the incoming key and looks it up in the APIKey table. On match, request.user is set to the api_service system account.
CaseListCreateView.perform_create attributes the case to api_service, providing stable authorship independent of staff changes.
API keys are permanent until revoked (is_active = False). Only is_staff users can create or revoke keys; API keys themselves authenticate as api_service (non-staff) and therefore cannot manage other keys.
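
The hash-only storage described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's code: the function names `generate_api_key` and `is_valid_key` and the in-memory set standing in for the APIKey table are assumptions made for the example.

```python
import hashlib
import secrets


def generate_api_key():
    """Return (raw_key, sha256_hex). Only the hash would be persisted."""
    raw_key = secrets.token_urlsafe(32)  # shown to the administrator once
    hashed = hashlib.sha256(raw_key.encode("utf-8")).hexdigest()
    return raw_key, hashed


def is_valid_key(raw_key, active_hashes):
    """Hash the incoming key and look it up among the stored hashes."""
    hashed = hashlib.sha256(raw_key.encode("utf-8")).hexdigest()
    return hashed in active_hashes


# Hypothetical stand-in for the APIKey table (active keys only).
raw, stored_hash = generate_api_key()
active_hashes = {stored_hash}
```

Because only the digest is stored, a leaked database row cannot be replayed as a key, and revoking a key amounts to removing its hash from the active set.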

Authentication Class Order

Both classes are listed in DEFAULT_AUTHENTICATION_CLASSES in backend/settings/base.py. DRF tries them in order:

Priority	Class	Triggers on
1	`APIKeyAuthentication`	`Authorization: Api-Key …`
2	`JWTAuthentication`	`Authorization: Bearer …`

A class that returns None passes control to the next in line. A class that raises AuthenticationFailed short-circuits all subsequent classes and returns a 401 response.
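
The pass-through and short-circuit behaviour can be illustrated without DRF. The sketch below is a simplified stand-in: the `AuthenticationFailed` class, the two authenticator functions, and the hard-coded credentials are all hypothetical and only mimic the ordering rules described above.

```python
class AuthenticationFailed(Exception):
    """Stand-in for rest_framework.exceptions.AuthenticationFailed."""


VALID_API_KEYS = {"demo-key"}          # hypothetical credentials
VALID_TOKENS = {"demo-token": "alice"}


def api_key_auth(header):
    if not header.startswith("Api-Key "):
        return None  # not our scheme: let the next class try
    if header[len("Api-Key "):] not in VALID_API_KEYS:
        raise AuthenticationFailed("Invalid API key")  # stops the chain
    return "api_service"


def jwt_auth(header):
    if not header.startswith("Bearer "):
        return None
    user = VALID_TOKENS.get(header[len("Bearer "):])
    if user is None:
        raise AuthenticationFailed("Invalid token")
    return user


AUTH_CHAIN = (api_key_auth, jwt_auth)  # same order as the table above


def authenticate(header):
    """Try each authenticator in order; the first non-None result wins."""
    for authenticator in AUTH_CHAIN:
        user = authenticator(header)
        if user is not None:
            return user
    return None  # no class recognised the header: request is anonymous
```

Note that a malformed `Api-Key` header never reaches the JWT class, which is why the order of the two classes matters.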

Entity Recognition

Tractor uses a three-model hybrid pipeline. Every document is passed through each available model (SpanCat is skipped until a model has been trained), and the results are merged and deduplicated.

SpanCat (Trained Model)

A custom SpanCat (Span Categorisation) model trained on your organisation's accepted redactions. It can identify both:

OPERATIONAL — reference numbers, case IDs, and other domain-specific operational patterns
THIRD_PARTY — domain-specific PII patterns learned from training data

SpanCat is loaded as a singleton (SpanCatModelManager) and takes the highest priority in deduplication. If no SpanCat model has been trained yet, this step is skipped and the system falls back to GLiNER + Presidio. Trained models are stored in nlp_models/.

GLiNER (Third-Party PII — Zero-Shot)

GLiNER is a zero-shot generalist NER model downloaded from HuggingFace and registered in the database via the download_model management command. It identifies THIRD_PARTY entities:

person names, organisations, locations, dates of birth, addresses

GLiNER is loaded as a singleton (GLiNERModelManager). The model ID stored in the database is the HuggingFace model identifier (e.g. urchade/gliner_medium-v2.1); HuggingFace handles local caching automatically. Long texts are split into chunks of roughly 1,500 characters so that each chunk stays within the model's input limit.
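
One way to chunk long text while keeping character offsets is sketched below. It is an illustration under assumptions: the function name `chunk_text` and the choice to break on the last space before the limit are not taken from the codebase. Each chunk is returned with its starting offset so that span positions found inside a chunk can be mapped back to the full document.

```python
def chunk_text(text, max_chars=1500):
    """Split text into (offset, chunk) pairs of at most max_chars characters.

    Breaks on the last space before the limit where possible, so that
    words (and most names) are not cut in half.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            split = text.rfind(" ", start, end)
            if split > start:
                end = split + 1  # keep the space with the earlier chunk
        chunks.append((start, text[start:end]))
        start = end
    return chunks
```

A span reported at positions `(s, e)` within a chunk then corresponds to `(offset + s, offset + e)` in the original text.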

Presidio (Pattern-Based)

Microsoft Presidio is a rule-based detection framework using custom pattern recognisers. It runs two separate analyzers:

THIRD_PARTY analyzer:

Recogniser	Entities detected
Built-in (spaCy `en_core_web_sm`)	PHONE_NUMBER, EMAIL_ADDRESS, UK_NHS
Custom pattern	UK postcodes
Custom pattern	National Insurance numbers

OPERATIONAL analyzer:

Recogniser	Entities detected
Custom pattern	Crime reference numbers (e.g. `42/12345/24`)
Custom pattern	Police collar numbers (e.g. `PC 1234`)

Both analyzers are instantiated lazily and cached as module-level singletons.
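
The lazy, cached pattern matching can be approximated with the standard library alone. The sketch below does not use Presidio; it uses plain `re` and `functools.lru_cache` to show the idea. The two regular expressions are guesses derived only from the example values above (`42/12345/24` and `PC 1234`), and the names `get_operational_patterns` and `find_operational_refs` are hypothetical.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def get_operational_patterns():
    """Compile the patterns on first use; later calls reuse the same dict."""
    return {
        # Assumed shapes, inferred from the examples in the table above.
        "CRIME_REFERENCE": re.compile(r"\b\d{2}/\d{1,6}/\d{2}\b"),
        "COLLAR_NUMBER": re.compile(r"\bPC\s?\d{3,5}\b"),
    }


def find_operational_refs(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in get_operational_patterns().items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group()))
    return sorted(hits)
```

Real reference formats vary between forces, so production patterns would likely need to be broader than these.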

Deduplication

After all three models run, overlapping spans are deduplicated with this priority order:

SpanCat results are kept in full
GLiNER results are added where they don't overlap SpanCat spans
Presidio results are added where they don't overlap either of the above
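
The priority rule above can be sketched as follows. This is a minimal illustration, assuming spans are represented by character offsets; the `Span` dataclass and the `deduplicate` function are hypothetical names, not the project's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # e.g. "THIRD_PARTY" or "OPERATIONAL"
    source: str  # "spancat", "gliner" or "presidio"


def overlaps(a, b):
    """True if the two half-open ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def deduplicate(spancat, gliner, presidio):
    """Keep SpanCat in full, then add lower tiers where they do not overlap."""
    kept = list(spancat)
    for candidates in (gliner, presidio):
        higher = list(kept)  # only spans from higher-priority tiers block
        kept.extend(
            span for span in candidates
            if not any(overlaps(span, h) for h in higher)
        )
    return sorted(kept, key=lambda s: s.start)
```

For example, a GLiNER span at 5-12 is dropped when SpanCat already covers 0-10, while a Presidio span at 40-50 survives if nothing above it touches that range.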

Merged Display Items

Adjacent or near-adjacent spans of the same type (within a 2-character gap by default) are automatically merged into a single compound display item in the review sidebar. This reduces noise when, for example, a first name and surname are detected as separate spans.

Merged items show all underlying span IDs and can be split back into individual items by the user from the sidebar.
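
A rough sketch of the merge step is shown below. It assumes each span is a dict with `id`, `start`, `end`, and `type` keys and that only consecutive spans (in document order) are considered for merging; the function name `merge_adjacent` and the output shape are illustrative, not the actual implementation.

```python
def merge_adjacent(spans, max_gap=2):
    """Group same-type spans separated by at most max_gap characters."""
    groups = []
    for span in sorted(spans, key=lambda s: s["start"]):
        last = groups[-1] if groups else None
        if (
            last is not None
            and last["type"] == span["type"]
            and span["start"] - last["end"] <= max_gap
        ):
            # Extend the current display item and remember the member span.
            last["end"] = max(last["end"], span["end"])
            last["span_ids"].append(span["id"])
        else:
            groups.append({
                "type": span["type"],
                "start": span["start"],
                "end": span["end"],
                "span_ids": [span["id"]],
            })
    return groups
```

Keeping the underlying `span_ids` on each group is what makes splitting cheap: the UI can expand a merged item back into its individual spans without re-running detection.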

Data Subject Filtering

Entities matching the case's data_subject_name or data_subject_dob are automatically excluded from redaction suggestions. This includes:

Full name matches (case-insensitive)
Individual name parts (e.g., "John" or "Doe" from "John Doe")
DOB in common date formats (DD/MM/YYYY, YYYY-MM-DD, D Month YYYY, etc.)

The data subject's own information should remain visible in the document. Users can still manually mark text as DS_INFORMATION, which propagates across all documents in the case via find_and_flag_matching_text_in_case().
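
The matching rules listed above might look roughly like this. The sketch is an assumption-laden illustration: the helper names `data_subject_terms` and `is_data_subject` are hypothetical, only three date formats are generated, and matching is a simple case-insensitive comparison on whitespace-normalised text.

```python
from datetime import date

MONTHS = (
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
)


def data_subject_terms(name, dob=None):
    """Build the set of lowercase strings that identify the data subject."""
    terms = set()
    full = " ".join(name.split()).lower()
    if full:
        terms.add(full)                # full name
        terms.update(full.split(" "))  # individual name parts
    if dob is not None:
        terms.add(dob.strftime("%d/%m/%Y"))                         # DD/MM/YYYY
        terms.add(dob.isoformat())                                  # YYYY-MM-DD
        terms.add(f"{dob.day} {MONTHS[dob.month - 1]} {dob.year}")  # D Month YYYY
    return terms


def is_data_subject(entity_text, terms):
    """True if a detected entity should be excluded from suggestions."""
    return " ".join(entity_text.split()).lower() in terms
```

Matching on individual name parts is deliberately broad; it trades a few missed suggestions for other people who share a name with the data subject against not redacting the subject's own name.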

Export Font

The export font controls the typeface used in the HTML body of every WeasyPrint-generated PDF. Fonts are defined as a curated list of web-safe choices on the DocumentExportSettings singleton model in cases/models.py.

How It Works

DocumentExportSettings.FontFamily is a TextChoices enum — the database stores the short key (e.g. arial) and the model exposes a font_family_css property that returns the full CSS font stack string (e.g. Arial, sans-serif).
_generate_pdf_from_document() in cases/services.py reads export_settings.font_family_css and injects it into the HTML <body style="font-family: ..."> tag before passing the HTML to WeasyPrint.
The DocumentExportSettingsSerializer includes font_family so it is readable and writable via the settings API endpoint.
The frontend DocumentExportSettingsCard renders a MUI Select dropdown populated with the same choices.
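
The key-to-CSS lookup and body injection can be sketched in plain Python. This is not the Django model: a standard `Enum` stands in for `TextChoices`, the `wrap_body` helper is hypothetical, and the Georgia stack is an assumed value (only the Arial stack appears in this document).

```python
from enum import Enum
from html import escape


class FontFamily(str, Enum):
    """Plain-Python stand-in for the TextChoices enum on the model."""
    ARIAL = "arial"
    GEORGIA = "georgia"


_FONT_CSS = {
    "arial": "Arial, sans-serif",
    "georgia": "Georgia, serif",  # assumed stack, for illustration only
}


def font_family_css(key):
    """Map the stored short key to its CSS font stack."""
    return _FONT_CSS[FontFamily(key).value]  # ValueError on unknown keys


def wrap_body(inner_html, key):
    """Inject the font stack into the <body> tag handed to the PDF renderer."""
    css = escape(font_family_css(key), quote=True)
    return f'<body style="font-family: {css};">{inner_html}</body>'
```

Storing only the short key keeps the database value stable even if the CSS stack for a font is later adjusted.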

Adding a New Font

Add the choice to DocumentExportSettings.FontFamily in cases/models.py:
```
ROBOTO = "roboto", "Roboto"
```
Add the CSS stack to DocumentExportSettings._FONT_CSS:
```
"roboto": "Roboto, sans-serif",
```
Ensure the font is available on the server. WeasyPrint renders PDFs server-side, so the font must be installed on the host OS (e.g. via a system package). Web-safe fonts (Arial, Georgia, etc.) are already present on most Linux distributions. For custom fonts, install the font files and verify WeasyPrint can find them via fontconfig.

Generate and apply a migration:

python manage.py makemigrations cases
python manage.py migrate

Add a MenuItem to the Select in DocumentExportSettingsCard.js:
```
<MenuItem value="roboto">Roboto</MenuItem>
```
Update the Cypress test mockSettings and defaultSettings objects and add a test case if needed.