18 KiB
+++ date = '2026-04-04T13:38:29Z' draft = true title = 'Plan to Learn and Test RAG Concepts' tags = ['RAG', 'VyOS', 'AI-reviewed'] +++
Background
I run a VyOS router at home. I am happy with the product, but I waste time remembering CLI commands for non-trivial tasks. It is CLI-only. I previously used OPNsense and the UI-only configuration drove me crazy. I gave up after wasting three combined days trying to set up a site-to-site VPN -- it still did not work.
VyOS changed their CLI API at some point (I think), so AI output (Claude, ChatGPT) often needs tweaking. I am not sure how much the CLI actually changed, whether the AI training data is outdated, or whether the models are just making things up. I have hit many instances of invalid commands from commercial AI. At the time of writing, I could not find an official statement on breaking CLI changes in VyOS -- I did not look that hard.
Part of this project would be to explore and document these changes if they are not well covered. For instance, an AI should be capable of ingesting two large documentation sets (VyOS 1.2, released January 2019, vs VyOS 1.3, released December 2021 -- source) and producing a list of breaking changes. This would also help characterize AI failures into two buckets:
- Hallucinations -- the AI produces something that was never valid.
- Outdated training -- the AI produces something that was correct in older documentation.
This may look like a side goal, but it is a good exercise for a RAG project. You use the different concepts, combine them, and learn.
Prompts I Want to Test Retrieval For
These are example prompts I want to evaluate retrieval quality and AI accuracy on. I may try some of them and tweak them as I make progress -- they are not a fixed list.
Each prompt includes a reworded version for AI consumption. The reworded prompt is what I would actually send to the model during evaluation.
The blockquoted prompts below were written by Claude Opus 4.6 (GitHub Copilot, April 2026). They rephrase the original intent using more precise network terminology.
-
Static IP assignment -- assign a static IP on a specific network to a MAC address. Optionally generate the MAC from the desired IP. Given the IP, the AI should figure out which network it belongs to, generate the MAC, and apply the config.
Prompt: On a VyOS router, generate a locally-administered MAC address derived from IP
192.168.10.125. Determine which network192.168.10.125belongs to by inspecting the router's existing interface configuration. Assign a DHCP static mapping for hostjm-ragbinding that MAC to192.168.10.125on the correct subnet. Output the full sequence of VyOSsetcommands. -
VPN interfaces -- WireGuard, OpenVPN.
Prompt: Given the following WireGuard configuration file, create a VyOS
wg10tunnel interface with the correct private key, listen port, peer public key, allowed IPs, and endpoint. After applying the config, provide the VyOS operational command to verify the interface is up and the handshake succeeded. -
Network creation -- apply routing and firewall rules. Example: create a network, assign a VLAN tag, block all traffic to other local networks, and use it for my work computer.
- Networks with different DNS configs.
- Networks with VPN gateways.
Prompt: On a VyOS router, provision a new Layer 2 segment
JM-WORKas a VLAN sub-interface (VID 11) on the trunk-facing parent interface. Assign the subnet192.168.11.0/24with the router as the default gateway at.1. Configure a DHCP pool for the subnet with an appropriate lease range and DNS forwarder. Apply zone-based firewall policies so that: (1) all ingress from other local zones toJM-WORKis dropped (deny inter-VLAN routing inbound), (2) all egress fromJM-WORKto other RFC 1918 destinations is dropped (deny inter-VLAN routing outbound), and (3) egress to non-RFC 1918 destinations (i.e., the internet) via the default route is permitted with stateful connection tracking (established/relatedreturn traffic allowed). Output everysetcommand required, covering the VLAN sub-interface, DHCP server scope, and firewall rule-set assignments to the zone or interface direction. -
Port forwarding -- ports 22, 80, 443, etc. routed to specific VMs in the LAN. Bonus points for reading my specs from elsewhere and figuring it out automatically.
Prompt: On a VyOS router with WAN interface
eth0(public-facing, DHCP or static WAN address), configure a DNAT (destination NAT) rule to translate inbound TCP SYN packets on port 22 arriving oneth0to the internal hostsshbastionat192.168.10.50:22. Add a corresponding stateful firewall rule on theWAN_IN(or equivalentindirection oneth0) rule-set to acceptestablished/relatedreturn traffic as well as new connections matching the DNAT translation. If a source-NAT masquerade rule already covers outbound traffic, confirm that hairpin NAT is not required for this case. Output allsetcommands for the NAT destination rule and the firewall rule entry. -
Advanced DNS -- example:
coolservice.mydomain.compoints to my public IP externally, butcoolserviceis a LAN host. I wantcoolservice.mydomain.comto resolve to a reverse proxy on port 443 when queried from inside the LAN. This is a hack, but the AI should either handle it or suggest a better approach.Prompt: On a VyOS router acting as the LAN's recursive DNS forwarder, the A record for
coolservice.mydomain.comresolves to my WAN IP via public DNS. Internally,coolserviceis a host at192.168.10.30running a TLS reverse proxy on port 443. Configure a DNS split-horizon override so that queries forcoolservice.mydomain.comoriginating from any LAN zone return192.168.10.30instead of the public address (e.g., a static host mapping or an authoritative local zone entry). If split-horizon is suboptimal for this topology, evaluate alternatives -- such as destination NAT hairpin (NAT reflection) or a dedicated internal authoritative zone -- and compare their tradeoffs in terms of TTL handling, certificate validation, and operational complexity.
Stretch Goals
These are harder, multi-stage projects where AI with accurate documentation would help significantly. Again, just examples -- these may evolve.
-
Config audit -- feed the entire VyOS config from my router (maybe 10--20 pages of commands) into the AI. Have it evaluate the config: generate firewall tests for external security problems, identify unused interfaces, flag dead configuration.
Prompt: I am providing my full VyOS running configuration (approximately 15 pages). Analyze it for: (1) firewall rules that leave external-facing ports unintentionally open, (2) interfaces or address assignments that are defined but never referenced by any routing, NAT, or firewall rule, and (3) any deprecated or redundant configuration stanzas. For each finding, cite the specific config lines involved and recommend a fix using VyOS
setordeletecommands. -
CI/CD on configs -- run a virtualized cluster (router, hosts, etc.), apply new configs in staging, and run a security audit against the virtualized network.
Prompt: Given the attached VyOS configuration and a set of proposed changes (adding the
JM-WORKnetwork from previous examples), produce adocker-compose.ymlor equivalent setup that spins up a virtualized VyOS instance and two stub hosts. Apply the base config, then apply the proposed changes. Write a test suite (shell scripts or pytest) that validates: the new VLAN interface is reachable, inter-VLAN traffic is blocked as expected, and internet access from the new network works. Execute the tests and report results.
Test Prompts
I want to define prompts at varying difficulty levels to benchmark AI performance with and without RAG.
Easy
- Generate a random MAC address for host
jm-ragand assign it to192.168.10.125on the LAN. - Forward port 22 to host
sshbastion.domain.com(wheresshbastionis a host on the LAN).
Medium
- Create a new network
JM-WORKwith CIDR192.168.11.0/24. Follow up with:- Block all traffic between it and other networks. No host outside
JM-WORKcan reach any host inside it. - Assign VLAN tag 11 to network
JM-WORK.
- Block all traffic between it and other networks. No host outside
- Given a WireGuard config, create a
wg10tunnel. Provide a command to verify the interface is up.
Hard
- Create a new network
REMOTE-NETwith CIDR192.168.12.0/24usingwg10as the gateway. Configure all routes and firewall rules to allow traffic to and from the remote network. Assume the remote end is a simple LAN with CIDR192.168.2.0/24. - Use the supplied config and apply changes (e.g., from the examples above) in a staging environment -- VMs, docker-compose, k3s, whatever works. Write tests to validate the new functionality and execute them in staging. Without RAG, the AI presumably would not have correct data to do any of this. This is fundamentally an integration problem.
RAG Pipeline Plan
VyOS publishes their documentation on GitHub as RST files. From a quick glance it is mostly well structured.
I want to build a RAG pipeline using local and commercial models to:
- Identify prompts the AI fails to answer without help.
- Check whether providing VyOS documents along with the prompt improves the response -- whether or not the baseline response was already correct.
On initially correct responses (without RAG), I do not want to turn a valid answer into an invalid one by supplying wrong, irrelevant, or badly processed information. Checking for regressions matters. If the AI answers a prompt correctly on its own and then fails when given additional context, that invalidates your methodology (at least partially).
Chunking and Indexing
- Chunk the entire VyOS documentation into block documents.
- Store them in a vector database -- or any database that supports retrieval of relevant blocks given a query.
- Defining "relevant" is expected to be hard, but the requirement is loose to start. Early on, I can separate the task of retrieving documents (rules, heuristics, etc.) from the task of using retrieved information in the model. I can even have a human or the AI in the loop for triage.
Retrieval Strategy
- Define a framework for querying the database in a way that lets me assess results before automating.
- I probably cannot do a single vector search on a long compound prompt like "define a VPN interface using WireGuard with a DNS endpoint for the remote peer, and use it as a gateway for a new net with these properties" -- it will latch on to too many things. Instead, I can split the query into focused search terms: "WireGuard interface," "DNS address for WireGuard," "set up new network," "configure gateway of network." The usage would have to be intentional when prompting. I do not think that is a bad thing.
- Ranking problems -- I doubt the vector DB results will come back well ordered. A chunk that happens to be a header with the exact text "set up new network" would always rank as the best match for that query, but lower-ranked documents might be more relevant. I want to find mitigating solutions. One idea: the AI keeps probing the document DB until the matches are clearly drifting from the original question. For instance, when searching for "configuring WireGuard," iteratively evaluate retrieved documents until the first one the AI judges as "not relevant." Tell the AI to use exactly that criterion.
- Embedding models -- what models do I need for embeddings? How much does the choice matter? Performance tradeoffs, memory usage.
- Ranking validation -- for lack of a better idea, send the top 10--20 retrieved documents to a strong model (at the time of writing, Claude Opus 4.6) and ask whether the ranking is valid. How exactly to frame that question is TBD.
The Pipeline
A Python program that:
- Receives a prompt, possibly with some expected structure.
- In a loop or simple sequence, retrieves documents from the vector DB.
- Constructs a final prompt to feed to the AI -- including back-and-forth with the user if the AI needs to clarify.
- Returns the response to the user.
Existing Frameworks
This section was written by Claude Opus 4.6 (GitHub Copilot, April 2026). It surveys open-source RAG frameworks that are free to use for learning and production.
Before building the entire pipeline from scratch, it is worth knowing what already exists. The RAG framework space has matured significantly. Some of these are low-level toolkits where you assemble your own pipeline from primitives; others are batteries-included platforms with a web UI and built-in document processing. The tradeoff is control vs. time-to-first-result.
I grouped these into three tiers based on how much they do for you. For this project -- local Ollama, VyOS RST docs, Python with uv, ChromaDB already in use -- the most relevant question is whether a framework adds value over the hand-rolled chunking and retrieval pipeline already under development.
| Framework | Type | Pros | Cons | Fit for this project |
|---|---|---|---|---|
| LangChain | Orchestration | Massive ecosystem, swappable LLMs and vector stores, large community, good for prototyping many configurations. | Abstraction-heavy; frequent breaking changes historically; "chain" paradigm can obscure what is actually happening. | Viable but heavier than needed. The abstraction layers would hide the retrieval mechanics you are trying to learn. |
| LlamaIndex | Orchestration | Data-centric design, strong indexing and retrieval strategies, handles heterogeneous inputs (PDFs, APIs, databases), active community. | Complexity scales fast; documentation sometimes lags new features. | Strong fit if you want pre-built retrieval strategies (e.g., sentence-window, auto-merging) without reinventing them. |
| Haystack | Orchestration | Clean pipeline model (nodes connected explicitly), production-grade, YAML or Python config, Apache 2.0. | Smaller ecosystem than LangChain; fewer tutorials and community examples. | Excellent match. Pipelines are easy to reason about and expose the retrieval math without too much magic. |
| RAGFlow | Full-stack platform | Built-in UI, deep document understanding (tables, images, complex layouts), Docker self-hosting, supports Ollama as backend, agentic workflows, MCP support. | Opinionated architecture; heavier resource footprint; less control over internals. | Good for getting a working system fast. Pairs well with Ollama. Overkill if the goal is to understand every step of the pipeline. |
| Ragbits | Modular toolkit | Pythonic API with type-safe LLM calls (Pydantic), modular install (pip install only what you need), supports local LLMs via LiteLLM, built-in evaluation framework, MIT licensed, supports uv. |
Smaller community; relatively new; fewer production case studies. | Interesting middle ground. The type-safe prompt system and modular design fit a Python-first workflow well. The uv support is a plus. |
| RAGatouille | Retrieval add-on | ColBERT late interaction retrieval (token-level scoring instead of pooled embeddings), lightweight, plugs into LangChain or LlamaIndex, can be used as a reranker without reindexing. | Not a full pipeline -- retrieval only. Requires understanding ColBERT's scoring model. | Worth experimenting with. Late interaction scoring is a fundamentally different statistical approach to similarity vs. cosine on pooled embeddings. Relevant to the ranking problems described above. |
| LightRAG | Graph + vector hybrid | Combines knowledge graphs with vector retrieval for relational queries, incremental updates (no full rebuild), open source. | More complex conceptually; graph construction adds a pipeline stage; newer project. | Interesting for relational queries ("what config depends on what") but adds complexity beyond what the initial pipeline needs. |
| txtai | All-in-one | Built-in vector DB and semantic search, knowledge graphs, pipelines, agents -- no external vector store needed. | Less mainstream; documentation quality varies; monolithic feel. | Self-contained option if you want to avoid managing a separate vector store. Not needed here since ChromaDB is already set up. |
| Verba | Full-stack platform | Easy setup, transparent chat showing sources and highlighted chunks, supports Ollama. | Tied to Weaviate as vector store; not suited for multi-user deployments. | Good for quick solo prototyping. Less useful long-term since it locks you into Weaviate. |
Notes
- For learning the mechanics (embeddings, chunking, retrieval ranking): Haystack or LlamaIndex expose the most without hiding things behind abstractions. The hand-rolled pipeline already started in this project is also a valid path -- frameworks are not mandatory.
- For a fast working demo: RAGFlow gives you the most out of the box. Docker, UI, document parsing, Ollama integration.
- For retrieval experiments: RAGatouille is worth a look specifically because ColBERT's late interaction model addresses ranking problems differently than standard cosine similarity -- relevant to the ranking concerns described in the Retrieval Strategy section.
- Ragbits stood out for its Pythonic design and modular installation. It also ships a built-in evaluation framework (
ragbits-evaluate), which could be useful for the regression testing described in the RAG Pipeline Plan section.