Technical Report

Analysis of Data Sources for a Specialized LLM on European Digital Identity Standards

Architectural and Dataset Analysis for LLM Tuning. Classification of static vs. dynamic data sources, with recommendations for a hybrid RAG architecture.

Prepared for: Dewa-AI
Date: 16 March 2026
Origin: DTU Research
Classification: Confidential
Abstract
This report analyses the corpus of technical specifications, legal regulations, and reference frameworks provided for the development of a specialised Large Language Model (LLM). The objective is to classify each data source as either "static" or "dynamic" and to outline the architectural implications for an LLM intended to provide expert-level support on European legal standards for identity, e-wallets, and associated repositories. The analysis recommends a hybrid architecture that combines a robustly trained core model with dynamic, verifiable data-fetching capabilities.

Contents

  1. Introduction
  2. Data Source Categorisation
  3. Data Scraping and Calibration Strategy
  4. Data Preparation and Generation: Research Insights
  5. Architectural Recommendations
  6. Conclusion
  7. References

1. Introduction

The successful deployment of an LLM in a highly technical and regulated domain like European digital identity (eIDAS2, OpenID4VC, ISO/IEC 18013-5) depends critically on the accuracy and timeliness of its knowledge base. Given the evolving nature of standards and implementation frameworks (such as the ARF), the model must not only internalise stable information but also recognise its limitations and know when to query live sources.

This report provides the analysis required to inform the model architecture, dataset preparation, and training strategy. Recent research in legal and technical NLP offers valuable guidance. Papers on fine-tuning for legal document drafting, reducing hallucinations, building multilingual retrieval corpora, and two-stage NDA analysis provide concrete methodologies and benchmarks.

2. Data Source Categorisation

The provided links can be broadly categorised into three types: (1) finalised technical specifications, (2) active draft specifications and evolving reference frameworks, and (3) static legal texts.

| Data Source | Type | Classification | Justification |
|---|---|---|---|
| OpenID4VCI [1] | Technical Spec | Dynamic (Low) | Final specification, but subject to errata. Version changes are infrequent but must be checked. |
| OpenID4VP [2] | Technical Spec | Dynamic (Low) | Stable core specification. Version changes are infrequent. |
| HAIP [3] | Technical Profile | Dynamic (Medium) | Implementation profile that may be updated to align with evolving ecosystem requirements. |
| PAR (RFC 9126) [4] | IETF Standard | Static | Published IETF RFC. Content is fixed. |
| OpenID Connect Core [5] | Technical Spec | Static | Foundational, finalised OpenID Foundation specification. |
| ISO/IEC 18013-5:2021 [6] | International Standard | Static | Published ISO standard. Fixed for the 2021 version. |
| eIDAS2 [7, 8] | EU Regulation | Dynamic (Low) | Primary text is static, but subject to amendments, delegated acts, and defined timelines. |
| SD-JWT (RFC 9901) [9] | IETF Standard | Static | Published IETF RFC. Content is fixed. |
| SD-JWT VC [10] | IETF Draft | Highly Dynamic | Active IETF draft. Content can change significantly with each revision. |
| ARF [11] | Framework / Repository | Highly Dynamic | Living document and code repositories under active development. |
| PSD2 [12] | EU Directive | Static | Published EU directive. Core text is fixed. |
| eudi-nexus [13] | Community Resource | Dynamic (Medium) | Independent overview site, updated periodically. |
| walt.id Repository [14] | Open-Source Code | Highly Dynamic | Active code repository with frequent updates and releases. |
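The classification above can be encoded as a machine-readable source registry that the ingestion pipeline consults. A minimal sketch; the volatility classes mirror the table, while the re-check intervals are illustrative assumptions, not values from the standards themselves:

```python
from dataclasses import dataclass
from enum import Enum

class Volatility(Enum):
    STATIC = "static"
    DYNAMIC_LOW = "dynamic-low"
    DYNAMIC_MEDIUM = "dynamic-medium"
    HIGHLY_DYNAMIC = "highly-dynamic"

# Suggested re-check intervals in days per volatility class (illustrative).
RECHECK_DAYS = {
    Volatility.STATIC: None,          # never re-fetched automatically
    Volatility.DYNAMIC_LOW: 90,
    Volatility.DYNAMIC_MEDIUM: 30,
    Volatility.HIGHLY_DYNAMIC: 1,
}

@dataclass(frozen=True)
class Source:
    name: str
    kind: str
    volatility: Volatility

REGISTRY = [
    Source("OpenID4VCI", "Technical Spec", Volatility.DYNAMIC_LOW),
    Source("PAR (RFC 9126)", "IETF Standard", Volatility.STATIC),
    Source("SD-JWT VC", "IETF Draft", Volatility.HIGHLY_DYNAMIC),
    Source("ARF", "Framework / Repository", Volatility.HIGHLY_DYNAMIC),
]

def needs_recheck(source: Source, days_since_fetch: int) -> bool:
    """A source needs re-fetching once its volatility interval has elapsed."""
    interval = RECHECK_DAYS[source.volatility]
    return interval is not None and days_since_fetch >= interval
```

Static sources are fetched once and frozen into the training corpus; the other classes drive the scheduling of the scraping phase described in Section 3.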

2.1 Detailed Analysis of Key Sources

OpenID4VCI and OpenID4VP [1, 2]: The OpenID for Verifiable Credential Issuance and Presentation specifications are foundational. While stable, the presence of an errata mechanism and the potential for new versions necessitate a strategy where the model can confirm the current status. The model should be aware of errata URLs and, for critical tasks, query a trusted source to ensure it is not relying on outdated information.

Highly Dynamic Warning
SD-JWT VC [10] is an active Internet-Draft whose content is subject to change until published as an RFC. An LLM trained on a specific version could provide incorrect advice if the specification has moved. The model must treat such sources with high scepticism and always verify the latest version.

ARF and EUDI Wallet Repositories [11]: The Architecture and Reference Framework and associated GitHub repositories are the "source of truth" for implementers but are constantly evolving. The model must be able to interact with these repositories (e.g., via an MCP server) to fetch the latest documents, READMEs, or specific code sections.

ISO/IEC 18013-5:2021 [6] and RFCs [4, 9]: These published standards are definitive examples of static data. An LLM can be trained on these texts with high confidence that the core information will not change.

eIDAS2 [7]: The regulation text itself is static. However, the broader context, including dates of application, implementing acts, and national transposition, is dynamic. The model must understand this distinction.

3. Data Scraping and Calibration Strategy

To transform the identified sources into a high-quality dataset suitable for LLM training and fine-tuning, we employ a two-phase strategy: scraping (acquisition and conversion) and calibration (validation, version tracking, and synthetic data generation).

3.1 Scraping Phase

In the scraping phase, each source is acquired in its native format (HTML specification pages, PDF standards, or Git repositories) and converted to clean text, with provenance metadata (source URL, retrieval date, and version or commit identifier) recorded for every document.
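For HTML sources such as the OpenID specification pages, conversion can start from a minimal visible-text extractor; PDF standards would instead go through an OCR pipeline such as olmOCR (Section 4.1). A sketch using only the standard library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-whitespace text outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

A production pipeline would add boilerplate removal (navigation, footers) and section-aware chunking, but this captures the acquisition-and-conversion step.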

3.2 Calibration Phase

In the calibration phase, converted documents are validated against authoritative versions, tracked for changes across releases, and supplemented with synthetic training examples, including the negative examples used for abstention training (Section 4.4).
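Version tracking for the dynamic sources can be done with content fingerprints: each fetch is hashed, and a change in hash flags the document for re-validation and dataset refresh. A minimal sketch:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash used to detect when a fetched source has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_change(store: dict, source: str, new_text: str) -> bool:
    """Update the stored fingerprint; True only if the content differs
    from the previous fetch (a first fetch is not counted as a change)."""
    new_hash = fingerprint(new_text)
    old_hash = store.get(source)
    store[source] = new_hash
    return old_hash is not None and old_hash != new_hash
```

In practice the store would be persistent and keyed by source plus version identifier, so that errata and draft revisions are tracked separately.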

4. Data Preparation and Generation: Research Insights

4.1 PDF-to-Text Conversion and Quality Assessment

The LEMUR paper [18] demonstrates a robust pipeline: olmOCR converts PDFs to structured JSONL, then fidelity is evaluated with a Lexical Content Score (LCS) comparing against authoritative HTML versions. For high-resource languages, conversion can exceed 95% lexical similarity, but older documents and low-resource languages may drop to 80-90%.
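LEMUR's exact LCS formula is not reproduced here; as a stand-in, a simple bag-of-words overlap between the converted text and the authoritative HTML version gives a first-pass fidelity score in the same spirit:

```python
from collections import Counter

def lexical_overlap(converted: str, reference: str) -> float:
    """Fraction of reference tokens recovered in the converted text
    (multiset overlap, a rough proxy for a lexical content score)."""
    conv = Counter(converted.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 1.0
    recovered = sum(min(conv[token], count) for token, count in ref.items())
    return recovered / sum(ref.values())
```

Documents scoring below a chosen threshold (e.g., the 80-90% band reported for harder inputs) would be routed to manual review rather than straight into the training set.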

4.2 Creating High-Quality Training Pairs

LEMUR uses metadata blocks (title, subject, publication info) as queries with the remaining legislative text as the retrieval target. The NDA analysis paper [19] uses manual annotation by legal experts (322 NDAs, 3,714 clauses) with structured markup. For our domain, we extract preambles or abstracts of standards as queries, treating full specifications as target documents.
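The pair-construction step can be sketched as follows; the field names (`title`, `abstract`, `body`) are assumptions about the converted document schema, not part of any cited pipeline:

```python
def make_pairs(specs):
    """Build (query, document) retrieval-training pairs: the preamble or
    abstract serves as the query, the full specification as the target."""
    pairs = []
    for spec in specs:
        query = f"{spec['title']}. {spec['abstract']}"
        pairs.append((query, spec["body"]))
    return pairs
```

Each pair then feeds the retriever fine-tuning described in Section 5, with hard negatives drawn from sections of other specifications.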

4.3 Decomposition of Legal/Technical Elements

The legal drafting paper [15] decomposes fraud judgments into constituent elements (Subject of Crime, Act, Victim, etc.). For our purposes, we decompose each standard into logical parts. For example, OpenID4VCI: protocol name, endpoints, request parameters, security requirements. This aids both training and completeness evaluation.
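Such a decomposition doubles as a completeness check: a generated or extracted description of a standard can be scored by how many required elements it covers. A sketch, where the element names and values for OpenID4VCI are illustrative rather than normative:

```python
# Illustrative decomposition template for one specification (OpenID4VCI);
# the element names and values are examples, not an exhaustive list.
SPEC_ELEMENTS = {
    "protocol_name": "OpenID for Verifiable Credential Issuance",
    "endpoints": ["credential endpoint", "token endpoint"],
    "request_parameters": ["credential_configuration_id", "proof"],
    "security_requirements": ["holder binding", "sender-constrained tokens"],
}

def completeness(decomposed: dict, required: set) -> float:
    """Share of required elements the decomposition actually covers."""
    present = {key for key, value in decomposed.items() if value}
    return len(present & required) / len(required)
```

During calibration, any specification whose decomposition scores below 1.0 against its template is flagged for re-extraction.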

4.4 Handling Dynamic Data and Hallucinations

The hallucinations paper [17] shows fine-tuning dramatically reduces hallucinations: LLaMA-2-7b accuracy on segments without the target entity improved from near 0% to almost 48% after fine-tuning. Crucially, cleaned datasets (ensuring entities were actually present) yielded significantly better performance than raw extractions.

Key Insight
For dynamic sources (SD-JWT VC draft, ARF), create training examples where the model is explicitly told "the information may be outdated" or "check the latest version." Include negative examples (queries with no answer in context) to teach the model to say "I don't know" or trigger a retrieval call.
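The example-construction logic above can be sketched as two small helpers; the prompt format and caveat wording are illustrative choices, not prescribed by the cited papers:

```python
ABSTAIN = ("I could not find information on this in the provided standards. "
           "This topic may be covered in a more recent document; please check "
           "the official sources.")

def build_prompt(query: str, context: str, dynamic: bool) -> str:
    """Format a training input; dynamic sources carry an explicit staleness caveat."""
    note = "[NOTE: this source is actively updated; verify the latest version]\n" if dynamic else ""
    return f"{note}Context:\n{context}\n\nQuestion: {query}"

def build_target(context: str, answer: str) -> str:
    """An empty retrieval context maps to the abstention response, not a guess."""
    return answer if context.strip() else ABSTAIN
```

Mixing a controlled fraction of these empty-context examples into fine-tuning is what teaches the model that abstaining (or triggering a tool call) is the rewarded behaviour.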

4.5 Estimating Corpus Size and Token Requirements

| Source | Corpus Size | Tokens | Model |
|---|---|---|---|
| Legal drafting [15] | 60,000 judgments | ~468M | BLOOM 560M (4h on RTX 3090) |
| LEMUR [18] | 24,953 PDF documents | ~194M | 0.6B, 1.1B, 4B models |
| NDA paper [19] | 322 NDAs, 3,714 clauses | Small | Legal-RoBERTa |
| MLEB [16] | 10 datasets, 100s-1000s queries | Evaluation only | Retrieval benchmarks |

For our domain (a few dozen core standards and frameworks), a corpus on the order of 50-200 million tokens would be sufficient for effective fine-tuning of a model in the 1-7 billion parameter range.

5. Architectural Recommendations

A monolithic model trained on a static dataset is insufficient. The following hybrid architecture is recommended, incorporating Retrieval-Augmented Generation (RAG) and dynamic tool use.

The recommended stack, from foundation to top:

  Layer 1: Core MoE Foundation Model (100-200M tokens)
  Layer 2: RAG Layer (Fine-tuned Retriever)
  Layer 3: Dynamic Knowledge via MCP
  Layer 4: Uncertainty and Abstention Training
  1. Core Foundational Model: Utilise a sparse Mixture-of-Experts (MoE) architecture with specialised "experts" for different domains (e.g., OAuth flows, mdoc data structures). Pre-train and fine-tune on "static" data: RFCs, ISO standards, stable specifications. Expected corpus: 100-200 million tokens.
  2. Retrieval-Augmented Generation (RAG) Layer: Fine-tune a dedicated embedding model on domain-specific query-document pairs. Given a user question, the retriever fetches the most relevant sections from both static and dynamic sources. Retrieved context is fed to the generative model.
  3. Dynamic Knowledge Layer via MCP: Implement tool-calling using the Model Context Protocol (MCP). For queries touching "highly dynamic" data, the model formulates requests to external tools instead of generating from internal parameters alone. Examples:
    • Check latest commit on ARF repository via get_github_latest()
    • Check SD-JWT VC draft status via check_ietf_draft_status()
    • Verify OpenID4VCI errata via fetch_openid_errata()
  4. Training for Uncertainty and Abstention: Include training examples where the retriever returns no relevant context or where context indicates information is dynamic/outdated. The model responds with appropriate caveats:
// Example model response pattern:

"Based on the static text of eIDAS2, Article X states [quote].
 However, the specific technical implementation details are
 defined in the ARF, which is actively updated. The latest
 ARF requirement as of [current date] is [result from MCP call]."

// Or when no information is found:

"I could not find information on this in the provided standards.
 This topic may be covered in a more recent document; please
 check the official sources."

6. Conclusion

The proposed data corpus presents a clear mix of stable, versioned information and rapidly evolving content. A successful LLM in this domain cannot be a static knowledge base.

The recommended architecture is a hybrid system: a powerful, sparsely-activated MoE model trained on the static core (100-200M tokens), augmented by a RAG pipeline with a fine-tuned retriever and a dynamic tool-calling layer (via MCP servers) to fetch and verify information from active repositories, draft specifications, and legal updates.

Recent research, particularly on data preparation (LEMUR), decomposition (legal drafting), hallucination reduction, and two-stage analysis, provides concrete methodologies and benchmarks that directly inform this design. This approach ensures the model provides not only deep, contextual knowledge but also timely and accurate advice, which is critical for legal and technical implementation support.

7. References