1. Introduction
The successful deployment of an LLM in a highly technical and regulated domain like European digital identity (eIDAS2, OpenID4VC, ISO/IEC 18013-5) depends critically on the accuracy and timeliness of its knowledge base. Given the evolving nature of standards and implementation frameworks (such as the ARF), the model must not only internalise stable information but also recognise its limitations and know when to query live sources.
This report provides the analysis required to inform the model architecture, dataset preparation, and training strategy. Recent research in legal and technical NLP offers valuable guidance. Papers on fine-tuning for legal document drafting, reducing hallucinations, building multilingual retrieval corpora, and two-stage NDA analysis provide concrete methodologies and benchmarks.
2. Data Source Categorisation
The provided links can be broadly categorised into three types: (1) finalised technical specifications, (2) active draft specifications and evolving reference frameworks, and (3) static legal texts.
| Data Source | Type | Classification | Justification |
|---|---|---|---|
| OpenID4VCI [1] | Technical Spec | Dynamic (Low) | Final specification, but subject to errata. Version changes infrequent but must be checked. |
| OpenID4VP [2] | Technical Spec | Dynamic (Low) | Stable core specification. Version changes infrequent. |
| HAIP [3] | Technical Profile | Dynamic (Medium) | Implementation profile that may be updated to align with evolving ecosystem requirements. |
| PAR (RFC9126) [4] | IETF Standard | Static | Published IETF RFC. Content is fixed. |
| OpenID Connect Core [5] | Technical Spec | Static | Foundational, finalised OpenID Foundation specification. |
| ISO/IEC 18013-5:2021 [6] | International Standard | Static | Published ISO standard. Fixed for the 2021 version. |
| eIDAS2 [7, 8] | EU Regulation | Dynamic (Low) | Primary text is static, but subject to amendments, delegated acts, and defined timelines. |
| SD-JWT (RFC9901) [9] | IETF Standard | Static | Published IETF RFC. Content is fixed. |
| SD-JWT VC [10] | IETF Draft | Highly Dynamic | Active IETF draft. Content can change significantly with each revision. |
| ARF [11] | Framework / Repository | Highly Dynamic | Living document and code repositories under active development. |
| PSD2 (Directive (EU) 2015/2366) [12] | EU Directive | Static | Published EU directive. Core text is fixed. |
| eudi-nexus [13] | Community Resource | Dynamic (Medium) | Independent overview site, updated periodically. |
| walt.id Repository [14] | Open-Source Code | Highly Dynamic | Active code repository with frequent updates and releases. |
2.1 Detailed Analysis of Key Sources
OpenID4VCI and OpenID4VP [1, 2]: The OpenID for Verifiable Credential Issuance and Presentation specifications are foundational. While stable, the presence of an errata mechanism and the potential for new versions necessitate a strategy where the model can confirm the current status. The model should be aware of errata URLs and, for critical tasks, query a trusted source to ensure it is not relying on outdated information.
ARF and EUDI Wallet Repositories [11]: The Architecture and Reference Framework and associated GitHub repositories are the "source of truth" for implementers but are constantly evolving. The model must interact with the repositories (e.g., via an MCP server) to fetch the latest documents, READMEs, or specific code sections.
ISO/IEC 18013-5:2021 [6] and RFCs [4, 9]: These published standards are definitive examples of static data. An LLM can be trained on these texts with high confidence that the core information will not change.
eIDAS2 [7]: The regulation text itself is static. However, the broader context, including dates of application, implementing acts, and national transposition, is dynamic. The model must understand this distinction.
3. Data Scraping and Calibration Strategy
To transform the identified sources into a high-quality dataset suitable for LLM training and fine-tuning, we employ a two-phase strategy: scraping (acquisition and conversion) and calibration (validation, version tracking, and synthetic data generation).
3.1 Scraping Phase
- PDF Documents: Standards such as ISO/IEC 18013-5, eIDAS regulations, and IETF RFCs are distributed as PDFs. We use a robust PDF-to-text conversion pipeline inspired by the LEMUR paper [18], employing tools like olmOCR or docling. Output is stored as JSONL with metadata: title, source URL, version/date, and unique identifier.
- HTML Specifications: For OpenID Foundation specifications and IETF drafts available in HTML, we extract normative content using appropriate parsers (e.g., BeautifulSoup), preserving structural elements. Stored in JSONL format with metadata.
- GitHub Repositories: For highly dynamic sources like the ARF and walt.id repositories, we clone at a specific commit and extract relevant files: READMEs, documentation, and key source code. Commit hash serves as version identifier.
- Scheduling and Automation: Dynamic sources are periodically re-scraped (weekly for IETF drafts, daily for active GitHub repos) using CI/CD pipelines. Each snapshot is versioned and archived.
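A minimal sketch of the JSONL record format described in the bullets above; the field names are assumptions rather than a fixed schema:

```python
import hashlib
import json

def make_record(title: str, source_url: str, version: str, text: str) -> dict:
    """Build one JSONL record carrying the metadata fields listed above.
    The ID is a content hash, so re-scrapes of unchanged text dedupe cleanly."""
    return {
        "id": hashlib.sha256(text.encode("utf-8")).hexdigest()[:16],
        "title": title,
        "source_url": source_url,
        "version": version,  # RFC number, ISO edition, git commit, or date
        "text": text,
    }

record = make_record(
    title="OAuth 2.0 Pushed Authorization Requests",
    source_url="https://datatracker.ietf.org/doc/html/rfc9126",
    version="RFC 9126",
    text="The pushed authorization request endpoint ...",
)
line = json.dumps(record, ensure_ascii=False)  # one document per JSONL line
```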
3.2 Calibration Phase
- Quality Assurance: Evaluate extraction fidelity using the Lexical Content Score (LCS), as in LEMUR [18]. Documents below the 95% threshold are flagged for reprocessing.
- Version Tracking: Each snapshot is tagged with its version identifier (RFC number, ISO edition, Git commit hash, or date of retrieval).
- Synthetic Data Generation: Use `docling-sdg` to generate Q&A pairs via a three-stage process:
  - Sample: Extract and sample diverse passages from documents.
  - Generate: An LLM generates three Q&A types: simple fact, reasoning, and summary.
  - Critique: A second LLM evaluates each pair on groundedness, correctness, and usefulness. Only passing pairs are retained.
- Handling Dynamic Content: Generate "abstention" examples for highly dynamic queries. The model is trained to request external tool use (via MCP) rather than providing outdated answers.
- Cross-Document Calibration: Create combined contexts for queries requiring synthesis across multiple documents (e.g., comparing OpenID4VCI and SD-JWT VC).
- Iterative Refinement: After initial training, evaluate on expert-curated queries. Refine scraping and generation pipelines based on errors. Continuous improvement feedback loop.
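The Critique stage of the synthetic-data pipeline can be sketched as a simple filter gate. The judge below is a stub standing in for the second LLM, and the 1-5 scoring scale is an assumption:

```python
def critique_gate(pairs: list[dict], judge, min_score: int = 4) -> list[dict]:
    """Keep only Q&A pairs the judge scores highly on every criterion.
    `judge` is any callable returning per-criterion scores on a 1-5 scale."""
    kept = []
    for qa in pairs:
        scores = judge(qa)  # e.g. {"groundedness": 5, "correctness": 4, ...}
        if all(v >= min_score for v in scores.values()):
            kept.append(qa)
    return kept

# Stub judge for illustration only: accept pairs whose answer is literally
# grounded in the sampled passage. A real judge would be a second LLM.
def stub_judge(qa: dict) -> dict:
    s = 5 if qa["answer"] in qa["passage"] else 1
    return {"groundedness": s, "correctness": s, "usefulness": s}

pairs = [
    {"passage": "PAR is defined in RFC 9126.",
     "question": "Where is PAR defined?", "answer": "RFC 9126."},
    {"passage": "PAR is defined in RFC 9126.",
     "question": "Where is PAR defined?", "answer": "RFC 9700."},
]
kept = critique_gate(pairs, stub_judge)  # the ungrounded pair is dropped
```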
4. Data Preparation and Generation: Research Insights
4.1 PDF-to-Text Conversion and Quality Assessment
The LEMUR paper [18] demonstrates a robust pipeline: olmOCR converts PDFs to structured JSONL, then fidelity is evaluated with a Lexical Content Score (LCS) comparing against authoritative HTML versions. For high-resource languages, conversion can exceed 95% lexical similarity, but older documents and low-resource languages may drop to 80-90%.
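A simple approximation of such a lexical similarity check, implemented as token-multiset overlap against the authoritative reference text; note that LEMUR's exact LCS definition may differ:

```python
import re
from collections import Counter

def lexical_content_score(extracted: str, reference: str) -> float:
    """Approximate lexical similarity in [0, 1] as multiset token overlap.
    Illustrative stand-in; the paper's exact LCS metric may differ."""
    def tokens(s: str) -> Counter:
        return Counter(re.findall(r"\w+", s.lower()))
    a, b = tokens(extracted), tokens(reference)
    overlap = sum((a & b).values())          # shared tokens, with multiplicity
    return overlap / max(sum(b.values()), 1)  # normalise by reference length

score = lexical_content_score(
    "Article 5 requires qualified trust service providers.",
    "Article 5 requires qualified trust service providers to be audited.",
)
needs_reprocessing = score < 0.95  # threshold from Section 3.2
```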
4.2 Creating High-Quality Training Pairs
LEMUR uses metadata blocks (title, subject, publication info) as queries with the remaining legislative text as the retrieval target. The NDA analysis paper [19] uses manual annotation by legal experts (322 NDAs, 3,714 clauses) with structured markup. For our domain, we extract preambles or abstracts of standards as queries, treating full specifications as target documents.
4.3 Decomposition of Legal/Technical Elements
The legal drafting paper [15] decomposes fraud judgments into constituent elements (Subject of Crime, Act, Victim, etc.). For our purposes, we decompose each standard into logical parts. For example, OpenID4VCI: protocol name, endpoints, request parameters, security requirements. This aids both training and completeness evaluation.
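For illustration, one such decomposition expressed as a schema, with element values that are examples rather than an exhaustive reading of OpenID4VCI:

```python
# Illustrative decomposition of one specification into logical elements,
# mirroring the element-based approach of the legal drafting paper [15].
# The listed values are examples, not a complete inventory of the spec.
OPENID4VCI_ELEMENTS = {
    "protocol_name": "OpenID for Verifiable Credential Issuance",
    "endpoints": ["credential", "deferred_credential"],
    "request_parameters": ["credential_configuration_id", "proof"],
    "security_requirements": ["sender-constrained tokens",
                              "proof of possession"],
}

def completeness(answer_elements: dict, schema: dict) -> float:
    """Fraction of schema elements a generated answer covers,
    usable as a simple completeness metric during evaluation."""
    covered = sum(1 for key in schema if key in answer_elements)
    return covered / len(schema)
```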
4.4 Handling Dynamic Data and Hallucinations
The hallucinations paper [17] shows fine-tuning dramatically reduces hallucinations: LLaMA-2-7b accuracy on segments without the target entity improved from near 0% to almost 48% after fine-tuning. Crucially, cleaned datasets (ensuring entities were actually present) yielded significantly better performance than raw extractions.
4.5 Estimating Corpus Size and Token Requirements
| Source | Corpus Size | Tokens | Model |
|---|---|---|---|
| Legal drafting [15] | 60,000 judgments | ~468M | BLOOM 560M (4h on RTX 3090) |
| LEMUR [18] | 24,953 PDF documents | ~194M | 0.6B, 1.1B, 4B models |
| NDA paper [19] | 322 NDAs, 3,714 clauses | Small | Legal-RoBERTa |
| MLEB [16] | 10 datasets, 100s-1000s queries | Evaluation only | Retrieval benchmarks |
For our domain (a few dozen core standards and frameworks), a corpus on the order of 50-200 million tokens would be sufficient for effective fine-tuning of a model in the 1-7 billion parameter range.
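A back-of-the-envelope check of this budget, where the document count, per-document size, and synthetic expansion factor are all assumptions:

```python
# Rough token budget: ~40 core documents at ~100k tokens each, expanded
# roughly 20x by synthetic Q&A pairs, decompositions, and cross-document
# contexts. All figures are illustrative assumptions.
core_docs = 40
tokens_per_doc = 100_000
synthetic_multiplier = 20

total = core_docs * tokens_per_doc * (1 + synthetic_multiplier)
# 84M tokens, comfortably inside the 50-200M range estimated above.
```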
5. Architectural Recommendations
A monolithic model trained on a static dataset is insufficient. The following hybrid architecture is recommended, incorporating Retrieval-Augmented Generation (RAG) and dynamic tool use.
- Core Foundational Model: Utilise a sparse Mixture-of-Experts (MoE) architecture with specialised "experts" for different domains (e.g., OAuth flows, mdoc data structures). Pre-train and fine-tune on "static" data: RFCs, ISO standards, stable specifications. Expected corpus: 100-200 million tokens.
- Retrieval-Augmented Generation (RAG) Layer: Fine-tune a dedicated embedding model on domain-specific query-document pairs. Given a user question, the retriever fetches the most relevant sections from both static and dynamic sources. Retrieved context is fed to the generative model.
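The retrieval step can be sketched as follows; the bag-of-words `embed` function is a toy stand-in for the fine-tuned dense embedding model described above:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the fine-tuned dense model
    replaces this in the actual RAG layer."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Tiny illustrative corpus keyed by document ID.
corpus = {
    "rfc9126": "Pushed authorization requests let clients push request payloads.",
    "18013-5": "The mobile driving licence mDL data model and device retrieval.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k corpus IDs most similar to the query; their text is
    then placed in the generator's context window."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(corpus[d])),
                    reverse=True)
    return ranked[:k]
```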
- Dynamic Knowledge Layer via MCP: Implement tool-calling using the Model Context Protocol (MCP). For queries touching "highly dynamic" data, the model formulates requests to external tools instead of generating from internal parameters alone. Examples:
  - Check the latest commit on the ARF repository via `get_github_latest()`
  - Check SD-JWT VC draft status via `check_ietf_draft_status()`
  - Verify OpenID4VCI errata via `fetch_openid_errata()`
- Training for Uncertainty and Abstention: Include training examples where the retriever returns no relevant context or where the context indicates the information is dynamic or outdated. The model responds with appropriate caveats:
"Based on the static text of eIDAS2, Article X states [quote].
However, the specific technical implementation details are
defined in the ARF, which is actively updated. The latest
ARF requirement as of [current date] is [result from MCP call]."
// Or when no information is found:
"I could not find information on this in the provided standards.
This topic may be covered in a more recent document; please
check the official sources."
6. Conclusion
The proposed data corpus presents a clear mix of stable, versioned information and rapidly evolving content. A successful LLM in this domain cannot be a static knowledge base.
The recommended architecture is a hybrid system: a powerful, sparsely-activated MoE model trained on the static core (100-200M tokens), augmented by a RAG pipeline with a fine-tuned retriever and a dynamic tool-calling layer (via MCP servers) to fetch and verify information from active repositories, draft specifications, and legal updates.
Recent research, particularly on data preparation (LEMUR), decomposition (legal drafting), hallucination reduction, and two-stage analysis, provides concrete methodologies and benchmarks that directly inform this design. This approach ensures the model provides not only deep, contextual knowledge but also timely and accurate advice, which is critical for legal and technical implementation support.
References
- [1] T. Lodderstedt, K. Yasuda, T. Looker, P. Bastian. OpenID for Verifiable Credential Issuance 1.0. OpenID Foundation. openid.net/specs/openid-4-verifiable-credential-issuance-1_0.html
- [2] OpenID Foundation. OpenID for Verifiable Presentations 1.0. openid.net/specs/openid-4-verifiable-presentations-1_0.html
- [3] OpenID Foundation. OpenID4VC High Assurance Interoperability Profile 1.0. openid.net/specs/openid4vc-high-assurance-interoperability-profile-1_0-ID1.html
- [4] T. Lodderstedt, et al. OAuth 2.0 Pushed Authorization Requests. IETF RFC 9126. datatracker.ietf.org/doc/html/rfc9126
- [5] OpenID Foundation. OpenID Connect Core 1.0. openid.net/specs/openid-connect-core-1_0.html
- [6] ISO/IEC. ISO/IEC 18013-5:2021 Personal identification. Mobile driving licence (mDL) application. iso.org/standard/69084.html
- [7] Digitaliseringsstyrelsen. eIDAS2 and the Digital Identity Wallet (in Danish: eIDAS2 og den digitale identitetstegnebog). digst.dk/it-loesninger/eid-og-single-digital-gateway/eidas2...
- [8] EU. Regulation (EU) No 910/2014 (eIDAS). eur-lex.europa.eu/legal-content/EN/TXT/...
- [9] D. Fett, et al. Selective Disclosure JWT (SD-JWT). IETF RFC 9901. datatracker.ietf.org/doc/rfc9901/
- [10] O. Terbu, et al. SD-JWT-based Verifiable Credentials (SD-JWT VC). IETF Draft. datatracker.ietf.org/doc/draft-ietf-oauth-sd-jwt-vc/
- [11] EU Digital Identity Wallet. Architecture and Reference Framework. GitHub. github.com/eu-digital-identity-wallet/eudi-doc-architecture-and-reference-framework
- [12] EU. Directive (EU) 2015/2366 (PSD2). eur-lex.europa.eu/legal-content/EN/TXT/...
- [13] Overview of EUDI Standards (eudi-nexus). cre8.github.io/eudi-nexus/
- [14] walt.id. walt.id Identity Repository. GitHub. github.com/walt-id/waltid-identity
- [15] C.-H. Lin, P.-J. Cheng. "Legal Documents Drafting with Fine-Tuned Pre-Trained Large Language Model." CS & IT 2024.
- [16] U. Butler, A.-R. Butler, A. L. Malec. "The Massive Legal Embedding Benchmark (MLEB)." 2025.
- [17] F. Vargas, et al. "The Impact of LLaMA Fine-Tuning on Hallucinations in Named Entity Extraction from Legal Documents" (in Spanish). 2025.
- [18] N. Baba Ahmadi, J. Strich, M. Semmann, C. Biemann. "LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval." 2025.
- [19] A. Begnini, M. Vicente, L. Souza. "A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification." 2025.