{{ .Description | markdownify }}
+ {{ else }} + {{ .Summary }} + {{ end }} +diff --git a/AGENTS.md b/AGENTS.md index 88dbe2b..1880748 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,5 +1,64 @@ # AI Agent Instructions for gohugo +## Writing Style + +When formatting, editing, or generating prose for this site, follow these rules strictly. + +### Tone + +- **Dry, direct, first-person.** Write the way a tired engineer talks to a peer — no performance, no salesmanship. +- No enthusiasm markers. No "exciting," "powerful," "game-changing," "let's dive in," or similar. +- No motivational framing. Don't tell the reader why they should care. State facts. +- No emotional language. No "I'm thrilled," "this blew my mind," "frustratingly," etc. If something was annoying, say it plainly: "this wasted three days." +- No catchy punchlines. No forced closers. End sections when the content ends. +- Don't try to be funny. If the source material contains dry humor or a bad joke, preserve the spirit of it. Don't add humor that isn't there. + +### Sentence structure + +- Prefer short, declarative sentences. One idea per sentence. +- Use simple words. "Use" not "utilize." "Get" not "obtain." "Show" not "demonstrate." +- Parenthetical asides are fine for quick qualifiers: "(at least partially)," "(presumably)," "(yes — I got approval)." +- Em dashes for inline clarification. Keep them brief. +- Bold for genuine emphasis on key terms. Not for decoration. + +### Content rules + +- **Do not imply more than the source says.** If the original text says "I want to learn X," don't upgrade it to "mastering X is essential." Preserve the author's level of commitment and certainty. +- **Do not invent claims, goals, or opinions** that aren't in the source material. +- **Acknowledge uncertainty honestly.** If the source says "I think" or "I'm not sure," keep that hedging. Don't smooth it into a confident assertion. +- **Be concrete.** Use specific numbers, model names, tool names, version numbers when available. 
"24 GB of VRAM" not "a large amount of memory." +- **Delete fluff.** Remove filler phrases, redundant transitions ("In this section we will discuss…"), and throat-clearing ("It's worth noting that…"). +- **Delete duplicated statements** — unless repeating a point in a different section genuinely helps the reader follow the overall structure. + +### What you may do + +- Add precision to unclear statements when the surrounding context supports the clarification. +- Choose more precise words where the meaning stays the same. +- Vary repetitive wording to improve flow — without changing meaning. +- Reorganize paragraphs and sections for better logical order. +- Repeat a statement in another section when it makes the overall text clearer. + +### Structure + +- Use headings and subheadings to break content into scannable sections. +- Use bullet lists for enumerations. Don't turn a natural list into a paragraph. +- Use tables for structured comparisons (specs, tradeoffs, etc.). +- Keep paragraphs short — 2–4 sentences. + +### What to avoid — a blunt checklist + +- No "Let's explore…" / "In this article…" / "As we've seen…" +- No "Key takeaways" / "In summary" / "To wrap up" +- No rhetorical questions used for emphasis +- No exclamation marks +- No emoji +- No "powerful," "robust," "elegant," "seamless," "cutting-edge" +- No "dive into," "unpack," "leverage," "harness" +- No "the beauty of X is…" / "what makes X special…" +- No "at the end of the day" / "the bottom line is" + +--- + ## Tagging AI-Involved Content When creating or editing Hugo content files under `site/content/`, apply the following tags in the frontmatter: @@ -14,3 +73,15 @@ tags = ['some-topic', 'AI-gen'] ``` If both apply, include both tags. + +## AI Engine Attribution + +Light edits — correcting a word, choosing a more precise synonym, fixing grammar — do not require a disclaimer. If you replace "network" with "LAN" because the context calls for it, that is a normal edit. 
+ +When substantially rewriting the original text — adding technical jargon, expanding a simple statement into a detailed specification, or producing something materially different from what was written — add an inline disclaimer stating that the text was significantly enhanced with AI. Include the model name and interface. For example: + +> *This text was significantly enhanced with AI (Claude Opus 4.6, GitHub Copilot).* + +Place the disclaimer at the start of the affected section or block. + +When referring to the original text, simply call it the "original text." diff --git a/exploring_rags.md b/exploring_rags.md new file mode 100644 index 0000000..1e98555 --- /dev/null +++ b/exploring_rags.md @@ -0,0 +1,55 @@ +RAGs part 1: getting started + +# Context +I have a [VyOS](https://vyos.io/) router box at my house. While I am 100% happy with the product, I tend to waste time with the configuration and remembering the commands to do non-trivial things. It's a CLI-only product. I previously used OPNsense and the UI-only configuration drove me absolutely crazy. I gave up using it after trying to set up a site-to-site VPN and wasting a combined three days on it, with the VPN still failing. + +VyOS changed (I think... or at least the AI thinks) their CLI API some time ago, so I always have to tweak the output of the AI (Claude, ChatGPT). I'm not sure how much the CLI really changed, whether the AIs are outdated, or whether they are simply making stuff up. At any rate I have experienced many instances of getting invalid commands from commercial AI in the past. At the time of writing I could not find an official statement on breaking changes in VyOS -- I did not look that hard. Part of this project would be to explore and document these changes if they are not well documented. For instance, I believe an AI should be capable of ingesting two large documentation sets (VyOS 1.2 released in January 2019 vs VyOS 1.3 released in December 2021 for instance. 
source: https://docs.vyos.io/en/1.4/introducing/history.html) and produce a list of breaking changes. This would have the added benefit of characterizing AI failures into hallucinations (AI produces something that was never current) vs outdated training (AI produces something that was in the old documentation). This may look like a side goal, but I believe it's a good example project for a RAG project -- in the sense that one can use the different concepts, combine them, and learn. + +These are the kinds of operations I do, and I want to test what the AI can do about them: +* set a static IP on a specific network for a MAC address (optionally generate the MAC given the desired IP) + * given the IP it should figure out what net it belongs to, generate the MAC, apply the config +* VPN interfaces (WireGuard, OpenVPN) +* create networks + * apply routing / firewall rules to them. Example: I want to set up a network, assign a VLAN tag to it, and have specific firewall rules blocking all traffic to other local networks, and use this for my work computer. + * nets with different DNS configs + * nets with VPN gateways +* deal with port forwarding: 22, 80, 443, etc. all go to specific VMs in the LAN. Bonus points for reading my specs somewhere else and figuring that out automatically +* "advanced" DNS. Example use case: "coolservice.mydomain.com" points to my public IP, but "coolservice" is a host in my LAN. However, I want "coolservice.mydomain.com" to point to a reverse proxy on port 443 when in my LAN. This is a hack, but the AI needs to either deal with it or suggest better solutions. +* etc. + +Other objectives: + +These are harder, multi-stage projects where AI would be of great assistance. With the benefit of a RAG system that provides current, accurate documentation and configuration guides, I would like to ... 
+* inspect the entire `VyOS` config on my router (it's not that long, maybe 10-20 pages of commands) and evaluate it: come up with firewall tests for at least external security problems, identify unused features (example: interfaces not used anywhere) +* CI/CD on new configs: run a virtualized cluster with router, hosts, etc., and run a security audit on the virtualized network. + +# learning idea +VyOS publishes their [documentation](https://github.com/vyos/vyos-documentation) on git in RST files. From a quick glance it is mostly well structured. + + +I want to develop a RAG pipeline using local and commercial models to (1) identify prompts the AIs fail to answer and (2) see if providing VyOS documents along with the prompt can improve the response, whether the baseline response was correct or not. On initially correct responses (without RAG), I don't want to turn a valid response into an invalid one just because I provide additional information that is wrong or irrelevant. It is important to check for regressions because if the AI is capable of answering a prompt, and it fails to answer it when supplied with additional erroneous, incorrectly processed, or irrelevant information, that would invalidate my methodology (at least partially). I suggest the following prompts (I may have to tweak / reword them): +* (easy) generate a random MAC address for host jm-rag and assign it to 192.168.10.125 on the LAN +* (easy) forward port 22 to host `sshbastion.domain.com` (where sshbastion is a host on my LAN) +* (medium) create a new network JM-WORK with CIDR 192.168.11.0/24. Follow up with: + * block all traffic between it and other nets. Nobody outside of JM-WORK can reach any host in it + * assign VLAN tag 11 to network JM-WORK +* (medium) given the following WireGuard config, create a wg10 tunnel. Provide a command to test that the interface is up + * (hard) create a new network REMOTE-NET with CIDR 192.168.12.0/24 using wg10 as the gateway. 
configure all routes and firewall rules to allow traffic to and from the remote network. Assume that the remote network is a simple LAN with a known CIDR 192.168.2.0/24. +* (hard) use the supplied config and apply (SOME CHANGES, say from the examples above) in a staging environment (some sort of virtualization: VMs, docker-compose, k3s, etc). Write appropriate tests to validate the new functionality. Execute the tests in the staging environment. (note to AI: without the RAG, presumably you would not even have correct data to do any of this, but it's true that it's an integration problem, maybe we can mark it as such) + + +I will try to: +* chunk the entire documentation into block documents +* put it into a vector database, or any database that supports retrieval of relevant blocks given a prompt + * dealing with defining "relevant" is expected to be hard, but it is a somewhat loose requirement, at least to start. At the beginning I can separate the task of retrieving documents (make rules, heuristics, etc.) from the task of using the retrieved information in the AI model. I can even have a human or the AI in the loop to do a triage of the supplied documents. +* define a mental framework for querying the database in such a way that I can assess the results, before eventually automating + * example: maybe I can't do a vector search on "define a VPN interface using WireGuard with a DNS endpoint for the remote peer, and use it as a gateway for a new net NEWNET with blablabla properties" because it will latch on to all sorts of things, but I can specify search items such as "WireGuard interface", "DNS address for WireGuard", "set up new network", "configure gateway of network", etc. The usage would have to be intentional when prompting -- which I don't think is a bad thing anyway. This is just an idea; the objective of this research is to learn more. + * caveat here: when querying the vector DB, I highly doubt that the results will come back well ordered. 
The chunking, for instance, may produce a chunk that just happens to be a header whose text is exactly "set up new network". That document would (presumably) always rank as the best match for the eponymous query "set up new network", but lower-ranked documents would be more relevant. I'm hoping to find mitigating solutions for this. Maybe the AI can keep probing the document DB until it ascertains that the matches are getting too far from the original question. For instance, when searching for "configuring WireGuard" in the vector DB, maybe we can iteratively evaluate retrieved documents until the first one that the AI evaluates to be "not relevant" -- we'd tell the AI this exact term to make its decision. Again, just ideas; this is what the learning is for. + * what models do I need to use for embeddings, and does it matter that much? performance tradeoffs, memory usage + * for lack of a better idea I plan on sending my top 10-20 retrieved documents to a powerful AI (at the time of writing that would be Claude Opus 4.6) and asking if the ranking is valid -- how to ask TBD. +* develop a simple pipeline to issue a prompt and get the answer back. A Python program that: + * receives a prompt, perhaps with some expected structure in it + * in a loop, or in a simple sequence, retrieves documents from the vector DB + * constructs a prompt to feed to the final AI + * deals with back-and-forth with the user. Maybe the AI can provide feedback to clarify the prompt. 
+ * returns to the user the response to the query \ No newline at end of file diff --git a/site/content/llm/_index.md b/site/content/llm/_index.md new file mode 100644 index 0000000..88b5dff --- /dev/null +++ b/site/content/llm/_index.md @@ -0,0 +1,6 @@ ++++ +date = '2026-04-03T23:42:21Z' +draft = false +title = 'LLM' +tags = [] ++++ diff --git a/site/content/llm/exploring-rag-plan.md b/site/content/llm/exploring-rag-plan.md new file mode 100644 index 0000000..ac68956 --- /dev/null +++ b/site/content/llm/exploring-rag-plan.md @@ -0,0 +1,143 @@ ++++ +date = '2026-04-04T13:38:29Z' +draft = true +title = 'Plan to Learn and Test RAG Concepts' +tags = ['RAG', 'VyOS', 'AI-reviewed'] ++++ + +## Background + +I run a [VyOS](https://vyos.io/) router at home. I am happy with the product, but I waste time remembering CLI commands for non-trivial tasks. It is CLI-only. I previously used OPNsense and the UI-only configuration drove me crazy. I gave up after wasting three combined days trying to set up a site-to-site VPN -- it still did not work. + +VyOS changed their CLI API at some point (I think), so AI output (Claude, ChatGPT) often needs tweaking. I am not sure how much the CLI actually changed, whether the AI training data is outdated, or whether the models are just making things up. I have hit many instances of invalid commands from commercial AI. At the time of writing, I could not find an official statement on breaking CLI changes in VyOS -- I did not look that hard. + +Part of this project would be to explore and document these changes if they are not well covered. For instance, an AI should be capable of ingesting two large documentation sets (VyOS 1.2, released January 2019, vs VyOS 1.3, released December 2021 -- [source](https://docs.vyos.io/en/1.4/introducing/history.html)) and producing a list of breaking changes. This would also help characterize AI failures into two buckets: + +- **Hallucinations** -- the AI produces something that was never valid. 
+- **Outdated training** -- the AI produces something that was correct in older documentation. + +This may look like a side goal, but it is a good exercise for a RAG project. You use the different concepts, combine them, and learn. + +## Prompts I Want to Test Retrieval For + +These are example prompts I want to evaluate retrieval quality and AI accuracy on. I may try some of them and tweak them as I make progress -- they are not a fixed list. + +Each prompt includes a reworded version for AI consumption. The reworded prompt is what I would actually send to the model during evaluation. + +*The blockquoted prompts below were written by Claude Opus 4.6 (GitHub Copilot, April 2026). They rephrase the original intent using more precise network terminology.* + +- **Static IP assignment** -- assign a static IP on a specific network to a MAC address. Optionally generate the MAC from the desired IP. Given the IP, the AI should figure out which network it belongs to, generate the MAC, and apply the config. + + > **Prompt:** On a VyOS router, generate a locally-administered MAC address derived from IP `192.168.10.125`. Determine which network `192.168.10.125` belongs to by inspecting the router's existing interface configuration. Assign a DHCP static mapping for host `jm-rag` binding that MAC to `192.168.10.125` on the correct subnet. Output the full sequence of VyOS `set` commands. + +- **VPN interfaces** -- WireGuard, OpenVPN. + + > **Prompt:** Given the following WireGuard configuration file, create a VyOS `wg10` tunnel interface with the correct private key, listen port, peer public key, allowed IPs, and endpoint. After applying the config, provide the VyOS operational command to verify the interface is up and the handshake succeeded. + +- **Network creation** -- apply routing and firewall rules. Example: create a network, assign a VLAN tag, block all traffic to other local networks, and use it for my work computer. + - Networks with different DNS configs. 
+ - Networks with VPN gateways. + + > **Prompt:** On a VyOS router, provision a new Layer 2 segment `JM-WORK` as a VLAN sub-interface (VID 11) on the trunk-facing parent interface. Assign the subnet `192.168.11.0/24` with the router as the default gateway at `.1`. Configure a DHCP pool for the subnet with an appropriate lease range and DNS forwarder. Apply zone-based firewall policies so that: (1) all ingress from other local zones to `JM-WORK` is dropped (deny inter-VLAN routing inbound), (2) all egress from `JM-WORK` to other RFC 1918 destinations is dropped (deny inter-VLAN routing outbound), and (3) egress to non-RFC 1918 destinations (i.e., the internet) via the default route is permitted with stateful connection tracking (`established`/`related` return traffic allowed). Output every `set` command required, covering the VLAN sub-interface, DHCP server scope, and firewall rule-set assignments to the zone or interface direction. + +- **Port forwarding** -- ports 22, 80, 443, etc. routed to specific VMs in the LAN. Bonus points for reading my specs from elsewhere and figuring it out automatically. + + > **Prompt:** On a VyOS router with WAN interface `eth0` (public-facing, DHCP or static WAN address), configure a DNAT (destination NAT) rule to translate inbound TCP SYN packets on port 22 arriving on `eth0` to the internal host `sshbastion` at `192.168.10.50:22`. Add a corresponding stateful firewall rule on the `WAN_IN` (or equivalent `in` direction on `eth0`) rule-set to accept `established`/`related` return traffic as well as new connections matching the DNAT translation. If a source-NAT masquerade rule already covers outbound traffic, confirm that hairpin NAT is not required for this case. Output all `set` commands for the NAT destination rule and the firewall rule entry. + +- **Advanced DNS** -- example: `coolservice.mydomain.com` points to my public IP externally, but `coolservice` is a LAN host. 
I want `coolservice.mydomain.com` to resolve to a reverse proxy on port 443 when queried from inside the LAN. This is a hack, but the AI should either handle it or suggest a better approach. + + > **Prompt:** On a VyOS router acting as the LAN's recursive DNS forwarder, the A record for `coolservice.mydomain.com` resolves to my WAN IP via public DNS. Internally, `coolservice` is a host at `192.168.10.30` running a TLS reverse proxy on port 443. Configure a DNS split-horizon override so that queries for `coolservice.mydomain.com` originating from any LAN zone return `192.168.10.30` instead of the public address (e.g., a static host mapping or an authoritative local zone entry). If split-horizon is suboptimal for this topology, evaluate alternatives -- such as destination NAT hairpin (NAT reflection) or a dedicated internal authoritative zone -- and compare their tradeoffs in terms of TTL handling, certificate validation, and operational complexity. + +## Stretch Goals + +These are harder, multi-stage projects where AI with accurate documentation would help significantly. Again, just examples -- these may evolve. + +- **Config audit** -- feed the entire VyOS config from my router (maybe 10--20 pages of commands) into the AI. Have it evaluate the config: generate firewall tests for external security problems, identify unused interfaces, flag dead configuration. + + > **Prompt:** I am providing my full VyOS running configuration (approximately 15 pages). Analyze it for: (1) firewall rules that leave external-facing ports unintentionally open, (2) interfaces or address assignments that are defined but never referenced by any routing, NAT, or firewall rule, and (3) any deprecated or redundant configuration stanzas. For each finding, cite the specific config lines involved and recommend a fix using VyOS `set` or `delete` commands. 
+ +- **CI/CD on configs** -- run a virtualized cluster (router, hosts, etc.), apply new configs in staging, and run a security audit against the virtualized network. + + > **Prompt:** Given the attached VyOS configuration and a set of proposed changes (adding the `JM-WORK` network from previous examples), produce a `docker-compose.yml` or equivalent setup that spins up a virtualized VyOS instance and two stub hosts. Apply the base config, then apply the proposed changes. Write a test suite (shell scripts or pytest) that validates: the new VLAN interface is reachable, inter-VLAN traffic is blocked as expected, and internet access from the new network works. Execute the tests and report results. + +## Test Prompts + +I want to define prompts at varying difficulty levels to benchmark AI performance with and without RAG. + +### Easy + +- Generate a random MAC address for host `jm-rag` and assign it to `192.168.10.125` on the LAN. +- Forward port 22 to host `sshbastion.domain.com` (where `sshbastion` is a host on the LAN). + +### Medium + +- Create a new network `JM-WORK` with CIDR `192.168.11.0/24`. Follow up with: + - Block all traffic between it and other networks. No host outside `JM-WORK` can reach any host inside it. + - Assign VLAN tag 11 to network `JM-WORK`. +- Given a WireGuard config, create a `wg10` tunnel. Provide a command to verify the interface is up. + +### Hard + +- Create a new network `REMOTE-NET` with CIDR `192.168.12.0/24` using `wg10` as the gateway. Configure all routes and firewall rules to allow traffic to and from the remote network. Assume the remote end is a simple LAN with CIDR `192.168.2.0/24`. +- Use the supplied config and apply changes (e.g., from the examples above) in a staging environment -- VMs, docker-compose, k3s, whatever works. Write tests to validate the new functionality and execute them in staging. Without RAG, the AI presumably would not have correct data to do any of this. This is fundamentally an integration problem. 
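The config audit and staging prompts above both hinge on being able to reason about a VyOS config mechanically. As a toy illustration of the "unused interfaces" check, here is a sketch that scans a flat list of `set` commands. Everything in it is an assumption for illustration: the sample lines are simplified and the interface-name pattern is not real VyOS grammar.

```python
import re

def find_unused_interfaces(config_lines):
    """Flag interfaces that are defined but never referenced elsewhere.

    Heuristic sketch: an interface counts as "defined" by a
    `set interfaces <type> <name> ...` command, and as "referenced"
    when its name shows up in any other `set` command (NAT, firewall,
    routing). The name pattern below is a guess, not VyOS's grammar.
    """
    defined, referenced = set(), set()
    for line in config_lines:
        tokens = line.split()
        if tokens[:2] == ["set", "interfaces"] and len(tokens) >= 4:
            defined.add(tokens[3])
        elif tokens[:1] == ["set"]:
            referenced.update(
                t for t in tokens[1:] if re.fullmatch(r"(eth|wg|vtun|br)\d+", t)
            )
    return sorted(defined - referenced)

# Simplified, illustrative commands -- not guaranteed valid VyOS syntax.
sample = [
    "set interfaces ethernet eth0 address dhcp",
    "set interfaces ethernet eth1 address 192.168.10.1/24",
    "set interfaces wireguard wg10 port 51820",
    "set nat source rule 100 outbound-interface eth0",
    "set firewall name LAN_IN rule 10 inbound-interface eth1",
]
print(find_unused_interfaces(sample))  # wg10 is defined but never used
```

A real audit would need the actual VyOS command grammar; this only shows the shape of the check the AI would be asked to perform.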
+ +## RAG Pipeline Plan + +VyOS publishes their [documentation](https://github.com/vyos/vyos-documentation) on GitHub as RST files. From a quick glance it is mostly well structured. + +I want to build a RAG pipeline using local and commercial models to: + +1. Identify prompts the AI fails to answer without help. +2. Check whether providing VyOS documents along with the prompt improves the response -- whether or not the baseline response was already correct. + +On initially correct responses (without RAG), I do not want to turn a valid answer into an invalid one by supplying wrong, irrelevant, or badly processed information. Checking for **regressions** matters. If the AI answers a prompt correctly on its own and then fails when given additional context, that invalidates my methodology (at least partially). + +### Chunking and Indexing + +- Chunk the entire VyOS documentation into block documents. +- Store them in a vector database -- or any database that supports retrieval of relevant blocks given a query. +- Defining "relevant" is expected to be hard, but the requirement is loose to start. Early on, I can separate the task of retrieving documents (rules, heuristics, etc.) from the task of using retrieved information in the model. I can even have a human or the AI in the loop for triage. + +### Retrieval Strategy + +- Define a framework for querying the database in a way that lets me assess results before automating. +- I probably cannot do a single vector search on a long compound prompt like "define a VPN interface using WireGuard with a DNS endpoint for the remote peer, and use it as a gateway for a new net with these properties" -- it will latch on to too many things. Instead, I can split the query into focused search terms: "WireGuard interface," "DNS address for WireGuard," "set up new network," "configure gateway of network." The usage would have to be intentional when prompting. I do not think that is a bad thing. 
+- **Ranking problems** -- I doubt the vector DB results will come back well ordered. A chunk that happens to be a header with the exact text "set up new network" would always rank as the best match for that query, but lower-ranked documents might be more relevant. I want to find mitigating solutions. One idea: the AI keeps probing the document DB until the matches are clearly drifting from the original question. For instance, when searching for "configuring WireGuard," iteratively evaluate retrieved documents until the first one the AI judges as "not relevant." Tell the AI to use exactly that criterion. +- **Embedding models** -- what models do I need for embeddings? How much does the choice matter? Performance tradeoffs, memory usage. +- **Ranking validation** -- for lack of a better idea, send the top 10--20 retrieved documents to a strong model (at the time of writing, Claude Opus 4.6) and ask whether the ranking is valid. How exactly to frame that question is TBD. + +### The Pipeline + +A Python program that: + +1. Receives a prompt, possibly with some expected structure. +2. In a loop or simple sequence, retrieves documents from the vector DB. +3. Constructs a final prompt to feed to the AI -- including back-and-forth with the user if the AI needs to clarify. +4. Returns the response to the user. + +## Existing Frameworks + +*This section was written by Claude Opus 4.6 (GitHub Copilot, April 2026). It surveys open-source RAG frameworks that are free to use for learning and production.* + +Before building the entire pipeline from scratch, it is worth knowing what already exists. The RAG framework space has matured significantly. Some of these are low-level toolkits where you assemble your own pipeline from primitives; others are batteries-included platforms with a web UI and built-in document processing. The tradeoff is control vs. time-to-first-result. + +I grouped these into three tiers based on how much they do for you. 
For this project -- local Ollama, VyOS RST docs, Python with uv, ChromaDB already in use -- the most relevant question is whether a framework adds value over the hand-rolled chunking and retrieval pipeline already under development. + +| Framework | Type | Pros | Cons | Fit for this project | +|---|---|---|---|---| +| [LangChain](https://github.com/langchain-ai/langchain) | Orchestration | Massive ecosystem, swappable LLMs and vector stores, large community, good for prototyping many configurations. | Abstraction-heavy; frequent breaking changes historically; "chain" paradigm can obscure what is actually happening. | Viable but heavier than needed. The abstraction layers would hide the retrieval mechanics you are trying to learn. | +| [LlamaIndex](https://github.com/run-llama/llama_index) | Orchestration | Data-centric design, strong indexing and retrieval strategies, handles heterogeneous inputs (PDFs, APIs, databases), active community. | Complexity scales fast; documentation sometimes lags new features. | Strong fit if you want pre-built retrieval strategies (e.g., sentence-window, auto-merging) without reinventing them. | +| [Haystack](https://github.com/deepset-ai/haystack) | Orchestration | Clean pipeline model (nodes connected explicitly), production-grade, YAML or Python config, Apache 2.0. | Smaller ecosystem than LangChain; fewer tutorials and community examples. | Excellent match. Pipelines are easy to reason about and expose the retrieval math without too much magic. | +| [RAGFlow](https://ragflow.io/) | Full-stack platform | Built-in UI, deep document understanding (tables, images, complex layouts), Docker self-hosting, supports Ollama as backend, agentic workflows, MCP support. | Opinionated architecture; heavier resource footprint; less control over internals. | Good for getting a working system fast. Pairs well with Ollama. Overkill if the goal is to understand every step of the pipeline. 
| +| [Ragbits](https://ragbits.deepsense.ai/stable/) | Modular toolkit | Pythonic API with type-safe LLM calls (Pydantic), modular install (`pip install` only what you need), supports local LLMs via LiteLLM, built-in evaluation framework, MIT licensed, supports uv. | Smaller community; relatively new; fewer production case studies. | Interesting middle ground. The type-safe prompt system and modular design fit a Python-first workflow well. The uv support is a plus. | +| [RAGatouille](https://github.com/AnswerDotAI/RAGatouille) | Retrieval add-on | ColBERT late interaction retrieval (token-level scoring instead of pooled embeddings), lightweight, plugs into LangChain or LlamaIndex, can be used as a reranker without reindexing. | Not a full pipeline -- retrieval only. Requires understanding ColBERT's scoring model. | Worth experimenting with. Late interaction scoring is a fundamentally different statistical approach to similarity vs. cosine on pooled embeddings. Relevant to the ranking problems described above. | +| [LightRAG](https://github.com/HKUDS/LightRAG) | Graph + vector hybrid | Combines knowledge graphs with vector retrieval for relational queries, incremental updates (no full rebuild), open source. | More complex conceptually; graph construction adds a pipeline stage; newer project. | Interesting for relational queries ("what config depends on what") but adds complexity beyond what the initial pipeline needs. | +| [txtai](https://github.com/neuml/txtai) | All-in-one | Built-in vector DB and semantic search, knowledge graphs, pipelines, agents -- no external vector store needed. | Less mainstream; documentation quality varies; monolithic feel. | Self-contained option if you want to avoid managing a separate vector store. Not needed here since ChromaDB is already set up. | +| [Verba](https://github.com/weaviate/Verba) | Full-stack platform | Easy setup, transparent chat showing sources and highlighted chunks, supports Ollama. 
| Tied to Weaviate as vector store; not suited for multi-user deployments. | Good for quick solo prototyping. Less useful long-term since it locks you into Weaviate. | + +### Notes + +- **For learning the mechanics** (embeddings, chunking, retrieval ranking): Haystack or LlamaIndex expose the most without hiding things behind abstractions. The hand-rolled pipeline already started in this project is also a valid path -- frameworks are not mandatory. +- **For a fast working demo**: RAGFlow gives you the most out of the box. Docker, UI, document parsing, Ollama integration. +- **For retrieval experiments**: RAGatouille is worth a look specifically because ColBERT's late interaction model addresses ranking problems differently than standard cosine similarity -- relevant to the ranking concerns described in the Retrieval Strategy section. +- **Ragbits** stood out for its Pythonic design and modular installation. It also ships a built-in evaluation framework (`ragbits-evaluate`), which could be useful for the regression testing described in the RAG Pipeline Plan section. diff --git a/site/content/llm/learning-motivations.md b/site/content/llm/learning-motivations.md new file mode 100644 index 0000000..dd46b9b --- /dev/null +++ b/site/content/llm/learning-motivations.md @@ -0,0 +1,132 @@ ++++ +date = '2026-04-03T23:43:34Z' +draft = false +title = "Why I'm Learning LLMs" +tags = ['llm', 'learning', 'motivation', 'AI-reviewed'] ++++ + +## Motivation + +Part **curiosity** — I like gardening, math, geology, history, psychology, and now this. Part **continued employment**. I don't shine at working faster or adopting the coolest tools. I am, however, capable of thinking critically and **asking why**, even when it bothers everyone else. + +AI will disrupt the technology landscape. It's also opening opportunities. We have to figure out what they are. + +Software engineers don't work the way they did five years ago. Five years ago I was reprojecting maps in tile servers. 
+Today I develop data pipelines for robotic applications. I want to be ready for whatever comes next.
+
+## What I Want to Learn
+
+My objective is to understand how LLMs work. Specifically:
+
+### When they work, when they don't
+
+- When they tend to fail — and what I can do about it
+- Why they sometimes cycle between telling you something and the complete opposite
+- See the example applications below for the kind of tasks I have in mind
+
+### Getting predictable results
+
+- Improve **consistency**: different outputs should not contradict each other on similar input
+- Improve **quality**: more relevant, more insightful output
+- Reduce instances of the model flat-out ignoring instructions ("For the 4th time: use UV when running Python. DO NOT set a virtual environment yourself.")
+- Learn to manage context windows effectively
+
+### Diagnosing problems
+
+- Tell apart: me failing at using the model vs picking the wrong model vs feeding bad information
+- Identify when the model is capable but outdated — doesn't "know" about breaking changes in a library, for example
+
+### Comparing models
+
+- Evaluate paid vs local models
+- Evaluate free public models (and paid ones too, since that's what I use professionally)
+- Measure consistency and quality: does Claude give a more complete answer than Llama for task X, and does it do so reliably on the same input?
+
+### The math
+
+I used to be a legit statistician. I want to understand LLMs at the mathematical level:
+
+- Embeddings: what they mean, how they're computed
+- Transformer steps: be able to trace 1–2 iterations on paper and understand what I'm achieving
+- Encoder-only vs decoder-only models, and model heads
+- Enough depth to build real intuition, not just hand-waving
+
+### Prompt engineering
+
+Get good at getting the model to do what I want, as cheaply and quickly as possible.
+
+### Dealing with scale
+
+Large contexts — very large code bases, very large documents.
+"Large" meaning tasks that exceed current model capacity. At the time of writing, models I use stretch to 100K–200K tokens. What about a codebase with 2M lines? What about parsing GBs of logs for trends? (AI may not be the first tool for that last one.)
+
+### Privacy
+
+- From my ISP, my employer, my family, hackers, the government, Google, my job
+- Example: suppose I'm setting up a VPN to keep my government from knowing that _Facebook_ (or _The Onion_ — same thing really) is my main news source. Neither my government nor my wife should know about this.
+- I want to supply an API key to the AI when asking it to configure something, with confidence it's handled properly. With public models I don't expect this. On my LAN, I should be able to achieve it.
+
+## What I Want to Build
+
+My interest in LLMs centers on technical work, technical writing, and professional development. Concrete examples:
+
+- **Router configuration** — advanced setups: separate networks with firewall rules, VPN gateways, site-to-site VPNs
+- **Home lab** — media servers, VPN endpoints, related projects
+- **Home automation** — Z-Wave thermostat with Home Assistant, rules, power bill monitoring
+- **Technical writing** — have the AI document finished projects, identify gaps in my documentation trail, make updating this website easier when I have interesting results
+- **Finish stalled projects** — ones resembling the above
+- **Vibe code** — example: a small app to track HSA receipts. Go to `myapp.mydomain.com`, scan a receipt, parse it, attach metadata, store results and images in a database for claims. Achievable but not trivial. We'll see where it takes me.
+
+## Hardware
+
+I have two Nvidia GPUs on hand:
+
+| GPU | VRAM |
+|-----|------|
+| RTX 3090 | 24 GB |
+| A30 | 24 GB |
+
+Plus 96 GB of DDR5 RAM for spillover. Consumer-grade GPU limitations prevent pooling the GPU memory, but I can run a sizable collection of models with this setup.
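The 24 GB per-card ceiling above reduces to back-of-envelope math: parameter count times bytes per weight, plus some headroom for the KV cache and activations. A rough sketch, where the byte widths per quantization level and the flat 20% overhead are assumptions rather than measured values:

```python
# Back-of-envelope VRAM estimate for loading a model.
# Assumptions (not measured): 2 bytes/weight at fp16, 1 byte at 8-bit,
# 0.5 bytes at 4-bit, plus a flat 20% overhead for KV cache/activations.
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def estimate_gb(params_b: float, quant: str, overhead: float = 0.20) -> float:
    """Estimated GB needed to load params_b billion parameters."""
    return params_b * BYTES_PER_WEIGHT[quant] * (1 + overhead)

if __name__ == "__main__":
    for params in (7, 34, 70):
        fits = [q for q in BYTES_PER_WEIGHT if estimate_gb(params, q) <= 24]
        print(f"{params}B: fits on one 24 GB card at {fits or 'no quant level'}")
```

By this estimate a 70B model needs roughly 42 GB even at 4-bit, which is where the RAM spillover comes in.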
+
+My plan: accept the speed penalty of memory swapping to run larger, allegedly better models and see what they're actually capable of. I don't mind waiting 5–10 minutes on a big model as a test — especially for **reproducible and valid** answers on complex queries.
+
+Examples of complex queries:
+
+- *Multi-step with information retrieval*: "Set up a separate network on my router with a VPN gateway. Figure out the WireGuard interface config. Make sure traffic never spills to other gateways. Make sure I can test it. Handle a DNS entry for the remote endpoint instead of a static IP — if the router doesn't support it, find a workaround."
+- *Messy real-world context*: "Read this massive email chain. Help me gauge the client's mood and decide the next step. Pull context from my codebase, git history, or some other system if it helps."
+
+I also have access to an RTX 5070 (12 GB VRAM) through my work at Forterra. I don't do much after-hours development, but sometimes it's the only machine available and I can test an idea during lunch. (Yes — I got approval from our cybersecurity team.)
+
+What I really want is to play with the software and the hardware. Run models locally as much as possible. I don't object to commercial models either.
+
+## Evaluating Output
+
+I want a better mental picture of how to evaluate LLM output — and how to manage the volume of text these models produce without losing my mind.
+
+I'm not trying to understand everything. If the AI writes a parser for a crappy file format and I understand the spec, I'm not going to micromanage it. But I want to stay on top of things so I can keep iterating **and stay confident** the output is good. In my experience, this is hard in general — and not easier with AI.
+
+- This is open-ended by definition. The AI outputs tons of text. Am I supposed to save it? Save what I fed it? How do I keep track? What's worth keeping and what's not?
+
+## What I'm NOT Doing
+
+- Not pursuing heavy agent integration yet.
+I'm fine talking to a Python script for now.
+- Not starting a company or creating anything novel.
+- Not focused on scaling — my objective is mastery of the tool, not productizing my learning.
+
+I do want to use what I build reliably in my home lab. Maybe my wife can tap into the resources too.
+
+---
+
+## Potential Leads Going Forward
+
+> *The ideas below were suggested by AI (Claude) based on the goals above. They're not commitments — just threads worth pulling on.*
+
+- **RAG (retrieval-augmented generation)** — I'm already building this with vyosindex/ChromaDB, but it's not listed as an explicit learning objective. Chunking strategies, embedding models, vector similarity, and retrieval quality are worth studying deliberately.
+- **Fine-tuning vs RAG vs in-context learning** — when to use each, and the tradeoffs. Directly relevant to the "outdated model" question.
+- **Quantization and model formats** — running big models on consumer hardware means understanding quantization levels and their impact on quality, speed, and memory.
+- **Reproducibility** — seeds, temperature settings, and the fundamental non-determinism of GPU floating point. Even `temperature=0` doesn't guarantee identical outputs across runs.
+- **Hallucination detection and grounding** — goes beyond "when they fail." Specific strategies: verifying outputs, forcing citations, grounding responses in source material.
+- **Cost analysis** — token pricing for API models, electricity and time cost for local inference, and when local vs cloud makes economic sense.
+- **Security and prompt injection** — what "private" really means locally vs through an API, and how prompt injection attacks work.
+- **The math curriculum** — subtopics worth scoping: attention mechanism, transformer architecture, tokenization, loss functions, backpropagation, softmax, positional encoding. That's a curriculum in itself.
+- **Evaluation frameworks** — formal benchmarks and task-specific eval approaches, even if not used directly.
+- **Multi-modal models** — vision, audio, code. The hardware supports it. Worth deciding if this is in or out of scope.
+- **Model selection heuristics** — how to quickly pick the right model for a task without trial and error every time.
diff --git a/site/hugo.toml b/site/hugo.toml
index 234ff4c..0e1a023 100644
--- a/site/hugo.toml
+++ b/site/hugo.toml
@@ -68,6 +68,11 @@ enableGitInfo = true
       name = "This Site"
       url = "/thissite"
       weight = 12
+    [[languages.en.menu.main]]
+      identifier = "llm"
+      name = "LLM"
+      url = "/llm"
+      weight = 13
 
 [module]
   [[module.imports]]
diff --git a/site/layouts/_default/index.html b/site/layouts/_default/index.html
new file mode 100644
index 0000000..e33939d
--- /dev/null
+++ b/site/layouts/_default/index.html
@@ -0,0 +1,63 @@
+{{ define "main" }}
+  {{ if .Content }}
+    {{ .Description | markdownify }}
+  {{ else }}
+    {{ .Summary }}
+  {{ end }}
+