Technical Report

Analysis of Data Sources for a Specialized LLM on European Digital Identity Standards

Architectural and Dataset Analysis for LLM Tuning. Classification of static vs. dynamic data sources, with recommendations for a hybrid RAG architecture.

Prepared for: Dewa-AI
Date: 16 March 2026
Origin: DTU Research
Classification: Confidential
Abstract
This report analyses the corpus of technical specifications, legal regulations, and reference frameworks provided for the development of a specialised Large Language Model (LLM). The objective is to classify each data source as either "static" or "dynamic" and to outline the architectural implications for an LLM intended to provide expert-level support on European legal standards for identity, e-wallets, and associated repositories. The analysis recommends a hybrid architecture that combines a robustly trained core model with dynamic, verifiable data-fetching capabilities.

Contents

  1. Introduction
  2. Data Source Categorisation
  3. Data Scraping and Calibration Strategy
  4. Data Preparation and Generation: Research Insights
  5. Architectural Recommendations
  6. Conclusion
  7. References

1. Introduction

The successful deployment of an LLM in a highly technical and regulated domain like European digital identity (eIDAS2, OpenID4VC, ISO/IEC 18013-5) depends critically on the accuracy and timeliness of its knowledge base. Given the evolving nature of standards and implementation frameworks (such as the ARF), the model must not only internalise stable information but also recognise its limitations and know when to query live sources.

This report provides the analysis required to inform the model architecture, dataset preparation, and training strategy. Recent research in legal and technical NLP offers valuable guidance. Papers on fine-tuning for legal document drafting, reducing hallucinations, building multilingual retrieval corpora, and two-stage NDA analysis provide concrete methodologies and benchmarks.

2. Data Source Categorisation

The provided links can be broadly categorised into three types: (1) finalised technical specifications, (2) active draft specifications and evolving reference frameworks, and (3) static legal texts.

| Data Source | Type | Classification | Justification |
|---|---|---|---|
| OpenID4VCI [1] | Technical Spec | Dynamic (Low) | Final specification, but subject to errata. Version changes are infrequent but must be checked. |
| OpenID4VP [2] | Technical Spec | Dynamic (Low) | Stable core specification. Version changes are infrequent. |
| HAIP [3] | Technical Profile | Dynamic (Medium) | Implementation profile that may be updated to align with evolving ecosystem requirements. |
| PAR (RFC 9126) [4] | IETF Standard | Static | Published IETF RFC. Content is fixed. |
| OpenID Connect Core [5] | Technical Spec | Static | Foundational, finalised OpenID Foundation specification. |
| ISO/IEC 18013-5:2021 [6] | International Standard | Static | Published ISO standard. Fixed for the 2021 version. |
| eIDAS2 [7, 8] | EU Regulation | Dynamic (Low) | Primary text is static, but subject to amendments, delegated acts, and defined timelines. |
| SD-JWT (RFC 9901) [9] | IETF Standard | Static | Published IETF RFC. Content is fixed. |
| SD-JWT VC [10] | IETF Draft | Highly Dynamic | Active IETF draft. Content can change significantly with each revision. |
| ARF [11] | Framework / Repository | Highly Dynamic | Living document and code repositories under active development. |
| PSD2 [12] | EU Directive | Static | Published EU directive. Core text is fixed. |
| eudi-nexus [13] | Community Resource | Dynamic (Medium) | Independent overview site, updated periodically. |
| walt.id Repository [14] | Open-Source Code | Highly Dynamic | Active code repository with frequent updates and releases. |
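The classification above can be encoded as a machine-readable source registry that the ingestion pipeline consults. A minimal sketch; the volatility classes mirror the table, while the re-check intervals are illustrative assumptions, not values from the standards themselves:

```python
from dataclasses import dataclass
from enum import Enum

class Volatility(Enum):
    STATIC = "static"
    DYNAMIC_LOW = "dynamic-low"
    DYNAMIC_MEDIUM = "dynamic-medium"
    HIGHLY_DYNAMIC = "highly-dynamic"

# Suggested re-check intervals in days per volatility class (illustrative).
RECHECK_DAYS = {
    Volatility.STATIC: None,          # never re-fetched automatically
    Volatility.DYNAMIC_LOW: 90,
    Volatility.DYNAMIC_MEDIUM: 30,
    Volatility.HIGHLY_DYNAMIC: 1,
}

@dataclass(frozen=True)
class Source:
    name: str
    kind: str
    volatility: Volatility

REGISTRY = [
    Source("OpenID4VCI", "Technical Spec", Volatility.DYNAMIC_LOW),
    Source("PAR (RFC 9126)", "IETF Standard", Volatility.STATIC),
    Source("SD-JWT VC", "IETF Draft", Volatility.HIGHLY_DYNAMIC),
    Source("ARF", "Framework / Repository", Volatility.HIGHLY_DYNAMIC),
]

def needs_recheck(source: Source, days_since_fetch: int) -> bool:
    """A source needs re-fetching once its volatility interval has elapsed."""
    interval = RECHECK_DAYS[source.volatility]
    return interval is not None and days_since_fetch >= interval
```

Static sources are fetched once and frozen into the training corpus; the other classes drive the scheduling of the scraping phase described in Section 3.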

2.1 Detailed Analysis of Key Sources

OpenID4VCI and OpenID4VP [1, 2]: The OpenID for Verifiable Credential Issuance and Presentation specifications are foundational. While stable, the presence of an errata mechanism and the potential for new versions necessitate a strategy where the model can confirm the current status. The model should be aware of errata URLs and, for critical tasks, query a trusted source to ensure it is not relying on outdated information.

Highly Dynamic Warning
SD-JWT VC [10] is an active Internet-Draft whose content is subject to change until published as an RFC. An LLM trained on a specific version could provide incorrect advice if the specification has moved. The model must treat such sources with high scepticism and always verify the latest version.

ARF and EUDI Wallet Repositories [11]: The Architecture and Reference Framework and associated GitHub repositories are the "source of truth" for implementers but are constantly evolving. The model must be able to interact with these repositories (e.g., via an MCP server) to fetch the latest documents, READMEs, or specific code sections.

ISO/IEC 18013-5:2021 [6] and RFCs [4, 9]: These published standards are definitive examples of static data. An LLM can be trained on these texts with high confidence that the core information will not change.

eIDAS2 [7]: The regulation text itself is static. However, the broader context, including dates of application, implementing acts, and national transposition, is dynamic. The model must understand this distinction.

3. Data Scraping and Calibration Strategy

To transform the identified sources into a high-quality dataset suitable for LLM training and fine-tuning, we employ a two-phase strategy: scraping (acquisition and conversion) and calibration (validation, version tracking, and synthetic data generation).

3.1 Scraping Phase

In the scraping phase, each source is acquired in its native format (HTML specification pages, PDF standards, or Git repositories) and converted to clean text, with provenance metadata (source URL, retrieval date, and version or commit identifier) recorded for every document.
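For HTML sources such as the OpenID specification pages, conversion can start from a minimal visible-text extractor; PDF standards would instead go through an OCR pipeline such as olmOCR (Section 4.1). A sketch using only the standard library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-whitespace text outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

A production pipeline would add boilerplate removal (navigation, footers) and section-aware chunking, but this captures the acquisition-and-conversion step.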

3.2 Calibration Phase

In the calibration phase, converted documents are validated against authoritative versions, tracked for changes across releases, and supplemented with synthetic training examples, including the negative examples used for abstention training (Section 4.4).
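Version tracking for the dynamic sources can be done with content fingerprints: each fetch is hashed, and a change in hash flags the document for re-validation and dataset refresh. A minimal sketch:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash used to detect when a fetched source has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_change(store: dict, source: str, new_text: str) -> bool:
    """Update the stored fingerprint; True only if the content differs
    from the previous fetch (a first fetch is not counted as a change)."""
    new_hash = fingerprint(new_text)
    old_hash = store.get(source)
    store[source] = new_hash
    return old_hash is not None and old_hash != new_hash
```

In practice the store would be persistent and keyed by source plus version identifier, so that errata and draft revisions are tracked separately.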

4. Data Preparation and Generation: Research Insights

4.1 PDF-to-Text Conversion and Quality Assessment

The LEMUR paper [18] demonstrates a robust pipeline: olmOCR converts PDFs to structured JSONL, then fidelity is evaluated with a Lexical Content Score (LCS) comparing against authoritative HTML versions. For high-resource languages, conversion can exceed 95% lexical similarity, but older documents and low-resource languages may drop to 80-90%.
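LEMUR's exact LCS formula is not reproduced here; as a stand-in, a simple bag-of-words overlap between the converted text and the authoritative HTML version gives a first-pass fidelity score in the same spirit:

```python
from collections import Counter

def lexical_overlap(converted: str, reference: str) -> float:
    """Fraction of reference tokens recovered in the converted text
    (multiset overlap, a rough proxy for a lexical content score)."""
    conv = Counter(converted.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 1.0
    recovered = sum(min(conv[token], count) for token, count in ref.items())
    return recovered / sum(ref.values())
```

Documents scoring below a chosen threshold (e.g., the 80-90% band reported for harder inputs) would be routed to manual review rather than straight into the training set.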

4.2 Creating High-Quality Training Pairs

LEMUR uses metadata blocks (title, subject, publication info) as queries with the remaining legislative text as the retrieval target. The NDA analysis paper [19] uses manual annotation by legal experts (322 NDAs, 3,714 clauses) with structured markup. For our domain, we extract preambles or abstracts of standards as queries, treating full specifications as target documents.
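The pair-construction step can be sketched as follows; the field names (`title`, `abstract`, `body`) are assumptions about the converted document schema, not part of any cited pipeline:

```python
def make_pairs(specs):
    """Build (query, document) retrieval-training pairs: the preamble or
    abstract serves as the query, the full specification as the target."""
    pairs = []
    for spec in specs:
        query = f"{spec['title']}. {spec['abstract']}"
        pairs.append((query, spec["body"]))
    return pairs
```

Each pair then feeds the retriever fine-tuning described in Section 5, with hard negatives drawn from sections of other specifications.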

4.3 Decomposition of Legal/Technical Elements

The legal drafting paper [15] decomposes fraud judgments into constituent elements (Subject of Crime, Act, Victim, etc.). For our purposes, we decompose each standard into logical parts. For example, OpenID4VCI: protocol name, endpoints, request parameters, security requirements. This aids both training and completeness evaluation.
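Such a decomposition doubles as a completeness check: a generated or extracted description of a standard can be scored by how many required elements it covers. A sketch, where the element names and values for OpenID4VCI are illustrative rather than normative:

```python
# Illustrative decomposition template for one specification (OpenID4VCI);
# the element names and values are examples, not an exhaustive list.
SPEC_ELEMENTS = {
    "protocol_name": "OpenID for Verifiable Credential Issuance",
    "endpoints": ["credential endpoint", "token endpoint"],
    "request_parameters": ["credential_configuration_id", "proof"],
    "security_requirements": ["holder binding", "sender-constrained tokens"],
}

def completeness(decomposed: dict, required: set) -> float:
    """Share of required elements the decomposition actually covers."""
    present = {key for key, value in decomposed.items() if value}
    return len(present & required) / len(required)
```

During calibration, any specification whose decomposition scores below 1.0 against its template is flagged for re-extraction.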

4.4 Handling Dynamic Data and Hallucinations

The hallucinations paper [17] shows fine-tuning dramatically reduces hallucinations: LLaMA-2-7b accuracy on segments without the target entity improved from near 0% to almost 48% after fine-tuning. Crucially, cleaned datasets (ensuring entities were actually present) yielded significantly better performance than raw extractions.

Key Insight
For dynamic sources (SD-JWT VC draft, ARF), create training examples where the model is explicitly told "the information may be outdated" or "check the latest version." Include negative examples (queries with no answer in context) to teach the model to say "I don't know" or trigger a retrieval call.
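The example-construction logic above can be sketched as two small helpers; the prompt format and caveat wording are illustrative choices, not prescribed by the cited papers:

```python
ABSTAIN = ("I could not find information on this in the provided standards. "
           "This topic may be covered in a more recent document; please check "
           "the official sources.")

def build_prompt(query: str, context: str, dynamic: bool) -> str:
    """Format a training input; dynamic sources carry an explicit staleness caveat."""
    note = "[NOTE: this source is actively updated; verify the latest version]\n" if dynamic else ""
    return f"{note}Context:\n{context}\n\nQuestion: {query}"

def build_target(context: str, answer: str) -> str:
    """An empty retrieval context maps to the abstention response, not a guess."""
    return answer if context.strip() else ABSTAIN
```

Mixing a controlled fraction of these empty-context examples into fine-tuning is what teaches the model that abstaining (or triggering a tool call) is the rewarded behaviour.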

4.5 Estimating Corpus Size and Token Requirements

| Source | Corpus Size | Tokens | Model |
|---|---|---|---|
| Legal drafting [15] | 60,000 judgments | ~468M | BLOOM 560M (4h on RTX 3090) |
| LEMUR [18] | 24,953 PDF documents | ~194M | 0.6B, 1.1B, 4B models |
| NDA paper [19] | 322 NDAs, 3,714 clauses | Small | Legal-RoBERTa |
| MLEB [16] | 10 datasets, 100s-1000s queries | Evaluation only | Retrieval benchmarks |

For our domain (a few dozen core standards and frameworks), a corpus on the order of 50-200 million tokens would be sufficient for effective fine-tuning of a model in the 1-7 billion parameter range.

5. Architectural Recommendations

A monolithic model trained on a static dataset is insufficient. The following hybrid architecture is recommended, incorporating Retrieval-Augmented Generation (RAG) and dynamic tool use.

The recommended stack, from foundation to top:

  Layer 1: Core MoE Foundation Model (100-200M tokens)
  Layer 2: RAG Layer (Fine-tuned Retriever)
  Layer 3: Dynamic Knowledge via MCP
  Layer 4: Uncertainty and Abstention Training
  1. Core Foundational Model: Utilise a sparse Mixture-of-Experts (MoE) architecture with specialised "experts" for different domains (e.g., OAuth flows, mdoc data structures). Pre-train and fine-tune on "static" data: RFCs, ISO standards, stable specifications. Expected corpus: 100-200 million tokens.
  2. Retrieval-Augmented Generation (RAG) Layer: Fine-tune a dedicated embedding model on domain-specific query-document pairs. Given a user question, the retriever fetches the most relevant sections from both static and dynamic sources. Retrieved context is fed to the generative model.
  3. Dynamic Knowledge Layer via MCP: Implement tool-calling using the Model Context Protocol (MCP). For queries touching "highly dynamic" data, the model formulates requests to external tools instead of generating from internal parameters alone. Examples:
    • Check latest commit on ARF repository via get_github_latest()
    • Check SD-JWT VC draft status via check_ietf_draft_status()
    • Verify OpenID4VCI errata via fetch_openid_errata()
  4. Training for Uncertainty and Abstention: Include training examples where the retriever returns no relevant context or where context indicates information is dynamic/outdated. The model responds with appropriate caveats:
// Example model response pattern:

"Based on the static text of eIDAS2, Article X states [quote].
 However, the specific technical implementation details are
 defined in the ARF, which is actively updated. The latest
 ARF requirement as of [current date] is [result from MCP call]."

// Or when no information is found:

"I could not find information on this in the provided standards.
 This topic may be covered in a more recent document; please
 check the official sources."

6. Conclusion

The proposed data corpus presents a clear mix of stable, versioned information and rapidly evolving content. A successful LLM in this domain cannot be a static knowledge base.

The recommended architecture is a hybrid system: a powerful, sparsely-activated MoE model trained on the static core (100-200M tokens), augmented by a RAG pipeline with a fine-tuned retriever and a dynamic tool-calling layer (via MCP servers) to fetch and verify information from active repositories, draft specifications, and legal updates.

Recent research, particularly on data preparation (LEMUR), decomposition (legal drafting), hallucination reduction, and two-stage analysis, provides concrete methodologies and benchmarks that directly inform this design. This approach ensures the model provides not only deep, contextual knowledge but also timely and accurate advice, which is critical for legal and technical implementation support.

7. References