AI Systems Researcher
Benchmark Operator

Help test, benchmark, document, and improve Source’s AI tutors, agent harnesses, RAG corpora, model workflows, dashboards, local inference labs, custom Pi harnesses, and proprietary AI systems.

AI Systems Benchmark Console Active

research.source/ai-systems

System Vetting

Evaluate
The System

AI tutors, RAG pipelines, agent harnesses, local models, dashboards, run ledgers, validation gates, and proof-of-work outputs.

Active Programs: 10

Access Mode: Scoped

Primary Output: Reports

Current Mode: Research Review

Gate: Qualified Only

Evidence: Proof-of-Work

Measuring whether custom RAG vector recall, agent pipeline reliability, local quantizations, and feedback loops meet business guidelines.

Lane Definition

The AI Systems
Research Lane.

The AI Systems Researcher lane is the applied AI systems research and benchmark lane inside Source Research Lab. Its job is to evaluate whether Source’s AI systems are useful, reliable, grounded, measurable, auditable, and scalable.

RESEARCH THESIS

This Lane Measures Whether AI Systems Actually Work

Rather than accepting marketing claims of LLM capabilities, we stress-test models, prompts, corpora, and pipelines to establish empirical thresholds.

• Do AI tutors improve real learning outcomes for student operators?
• Do custom agent harnesses improve execution reliability over direct prompting?
• Do RAG vector setups retrieve grounded, citeable material consistently?
• Do local models perform well enough to replace expensive frontier model calls?

TARGET SYSTEM INTEGRATIONS

Systems Under Evaluation

Researchers validate the components that form the foundation of Source’s training and execution stack:

• Source University Tutors
• Custom Pi.dev Harnesses
• Claude Code Workflows
• OpenClaw Runtimes
• Vertical RAG Corpora
• Model Cost Router
• BuildGraph Telemetry

Testing focus is locked to metrics, diagnostics, error logs, and verification runbooks rather than simple chat prompts.

Research vs Training

Not The Training Track.
The Research Layer Behind It.

Do not confuse this with enrollment. The training program is about learning how to use the systems. The research lane is about verifying whether the systems themselves function as intended.

SOURCE UNIVERSITY AI SYSTEMS PROGRAM

AI Systems Program

Capability Development Layer

Complete training runbooks and assignments
Learn multi-agent frameworks, cloud API loops, and web design
Focuses on developing operational capability

SOURCE RESEARCH LAB RESEARCH LANE

AI Systems Researcher Lane

Benchmark & Evaluation Layer

Audit LLM tutor performance and hallucination rates
Compare vector embeddings, rerankers, and local GGUF models
Focuses on validating system limits and compiling memos

Active Research Lane Focus

Source University asks: Can this person learn the system? • Source Research Lab asks: Can this person help prove whether the system works?

"The training program is about becoming capable inside the system. The research lane is about testing whether the system itself is capable."

Candidate Profile

Who This
Is For.

We require capable operators with technical discipline. While a PhD is not required, applicants must show structured reasoning and technical comfort under uncertainty.

Applied AI Researchers

Research Assistants

ML/Systems Engineers

RAG Specialists

STRONG FIT INDICATORS

• Detail-oriented, data-aware, and heavy on written documentation
• Capable of writing reproducible test sandboxes or validation scripts
• Comfortable testing tedious edge cases and cataloging model failure logs

WEAK FIT INDICATORS

• Wanting compute credentials or model access without producing memos
• Looking for guaranteed employment, structured classes, or simple prompts
• Unwilling to document research methodologies or trace execution paths

Research Surface

What You May
Help Test.

Selected researchers are assigned specific testing scopes inside our sandboxes. These tasks focus on auditing pipeline performance, measuring cost routing, and identifying anomalies.

1. AI Tutors

Validate tutor explanation consistency, feedback loops, and capability progression metrics.

2. Agent Harnesses

Stress-test custom Pi harnesses, Codex task runs, and Claude Code execution pipelines.

3. RAG Retrieval

Audit citation precision, semantic search chunking configs, and vector recall anomalies.

4. Vertical Corpora

Evaluate specialized industry corpora used for local business diagnostics and reports.

5. QA Control Gates

Test approval queues, validation scripts, and human-in-the-loop safety frameworks.

6. Consoles

Audit whether dashboards communicate operational trace logs, drawdowns, and latency paths.

7. Outreach Tech

Evaluate reply classification agents, campaign suppression rules, and delivery setups.

8. BuildGraphs

Test multi-agent build convergence, layer locking, and repository branch tracking.

9. Model Routing

Compare local vLLM performance with frontier APIs to balance routing cost and latency.

10. SOP RAG Context

Audit runbooks, decision logs, and mistake tracking directories for ingestion safety.

Research Programs

Top 10 Active AI Systems
Research Programs.

The AI Systems Researcher lane is structured around ten active research programs. Each program is designed to turn vague AI capability claims into measurable research evidence.

Research Question: Can AI tutors, structured assignments, and review loops accelerate real capability development compared to passive courses?

Why It Matters: Verifying whether AI-assisted mentoring reduces student confusion logs and improves follow-through.

Tools Involved: AI tutor prompts, RAG over curriculum markdown, progress dashboards, review rubrics.

Benchmark Method: Before/after skill testing, rubric-based output grading, time-to-understanding latency analysis.

Expected Outputs: AI Tutor Evaluation Report, Rubric Effectiveness Scorecard, Confusion Taxonomy.

Access Gate: Private curriculum files, student review logs, sandboxed tutor environments. Gated for privacy.

Research Question: Do custom Pi harnesses, Codex task configurations, and Claude Code loops improve task-completion safety?

Why It Matters: Preventing infinite agent loops, hallucinated file actions, or dependency breakage.

Tools Involved: Pi.dev harnesses, Codex workflows, Claude Code workspaces, OpenClaw, WebSocket MCP servers, task manifests.

Benchmark Method: Task completion percentage, retry count tracking, regression tests, validation gate pass rate.

Expected Outputs: Agent Harness Benchmark Report, Failure-Mode Taxonomy, Harness-vs-Direct Scorecard.
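
As an illustration of the benchmark method above, the sketch below computes task completion percentage, mean retry count, and validation gate pass rate from a list of run records. The `RunRecord` fields, the `summarize` helper, and the sample runs are hypothetical placeholders; they do not describe Source's actual run ledger format or any measured result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One agent run. All field names are assumptions made for this sketch."""
    task_id: str
    completed: bool    # the task reached its intended end state
    retries: int       # extra attempts made before the final outcome
    gate_passed: bool  # the output passed the validation gate


def summarize(runs):
    """Return completion %, mean retries, and gate pass % for a list of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "completion_pct": 100.0 * sum(r.completed for r in runs) / n,
        "mean_retries": sum(r.retries for r in runs) / n,
        "gate_pass_pct": 100.0 * sum(r.gate_passed for r in runs) / n,
    }


# Made-up sample data, used only to show the shape of the calculation.
harness_runs = [
    RunRecord("t1", completed=True, retries=0, gate_passed=True),
    RunRecord("t2", completed=True, retries=2, gate_passed=True),
    RunRecord("t3", completed=False, retries=3, gate_passed=False),
    RunRecord("t4", completed=True, retries=1, gate_passed=False),
]

harness_summary = summarize(harness_runs)
print(harness_summary)
```

Running the same `summarize` call over a second list of direct-prompting runs on the same tasks would give the two columns of a harness-vs-direct scorecard.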

Research Question: Can vector chunking parameters and hybrid search filters eliminate factual retrieval errors?

Why It Matters: Grounding system outputs in real source material to secure citation integrity.

Tools Involved: LangChain/LlamaIndex chunking configs, pgvector, Qdrant search streams, embedding models, rerankers.

Benchmark Method: Retrieval precision audits, semantic recall mapping, source faithfulness verification.

Expected Outputs: Retrieval Quality Scorecard, Citation Faithfulness Log, RAG Benchmark Dataset.
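
A retrieval precision audit of the kind described above usually starts from two standard metrics, precision@k and recall@k. The sketch below is a minimal version that assumes each query has a hand-labeled set of relevant chunk IDs and that retrieved IDs are unique; the function names and the sample IDs are illustrative only.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of the labeled relevant chunk IDs that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        raise ValueError("recall is undefined without labeled relevant chunks")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)


# Hypothetical query result: ranked chunk IDs from the retriever,
# plus the chunk IDs a human reviewer marked as relevant.
retrieved_ids = ["c3", "c7", "c1", "c9"]
relevant_ids = {"c1", "c3", "c4"}

p3 = precision_at_k(retrieved_ids, relevant_ids, k=3)
r3 = recall_at_k(retrieved_ids, relevant_ids, k=3)
print(f"precision@3={p3:.2f} recall@3={r3:.2f}")  # precision@3=0.67 recall@3=0.67
```

Neither metric says anything about whether the generated answer stays faithful to the retrieved chunks, so source faithfulness still needs its own check.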

Research Question: Can industry-specific service business datasets power reliable automated diagnostics?

Why It Matters: Safeguarding business analysis tools from generating generic or false diagnostics.

Tools Involved: Local service vertical corpora, review data, audit scripts, diagnostic generator prompts.

Benchmark Method: Fact-checking report details, comparing AI outputs with manual expert reviews.

Expected Outputs: Vertical Corpus Audit Memo, Diagnostic Accuracy Scorecard, Gap Registry.

Research Question: Which validation gates and human approval workflows best contain autonomous agent errors?

Why It Matters: Securing software and database runs from executing unverified actions.

Tools Involved: Human approval dashboards, validation test scripts, automated critique agents, run logs.

Benchmark Method: Incident frequency tracking, human review time-cost logs, regression checks.

Expected Outputs: Safety Control Framework, Human Review Queue Protocol, Incident Audit Memo.

Research Question: Do admin consoles expose alerts, queue states, and processing anomalies clearly to operators?

Why It Matters: Reducing operator response delay and ensuring quick identification of system locks.

Tools Involved: Admin panels, operator telemetry displays, queue tables, event logging views.

Benchmark Method: Operator task-resolution latency, UI information density scorecards.

Expected Outputs: Dashboard Utility Report, Console UX Audit Scorecard, Alert Taxonomy.

Research Question: Do email setups and response classification models run safely without spam or routing errors?

Why It Matters: Protecting deliverability reputation and ensuring compliant email loops.

Tools Involved: Inbox networks, DNS records (SPF/DKIM/DMARC), campaign reply log scripts, suppression databases.

Benchmark Method: Deliverability score checks, email routing log accuracy audits.

Expected Outputs: Outreach Delivery Checklist, Reply Classification Accuracy Scorecard.

Research Question: Can branch locking and multi-agent critique loops improve code generation reliability?

Why It Matters: Enabling structured code generation without manual code conflict resolution.

Tools Involved: Git repo graphs, BuildGraph trace systems, agent task packets, CI/CD verification scripts.

Benchmark Method: Build success percentage, code convergence speed, syntax defect density.

Expected Outputs: BuildGraph Execution Memo, Layer-Lock Protocol Guide, Convergence Analysis.

Research Question: Can model fallbacks and routing middleware cut API costs without sacrificing accuracy?

Why It Matters: Optimizing token costs by routing simple tasks to cheap local open models.

Tools Involved: Model routing middleware, local serving endpoints, cost telemetry metrics, latency logs.

Benchmark Method: Cost-accuracy trade-off curves, routing latency benchmarks.

Expected Outputs: Model Routing Matrix, Cost-vs-Quality Trade-off Report.
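
One way to read a cost-accuracy trade-off curve is to keep only the routing policies that are not dominated, meaning no other policy is both cheaper and at least as accurate. The sketch below computes that Pareto frontier. The `Policy` fields, the policy names, and every number are invented placeholders, not measurements from any Source system.

```python
from typing import NamedTuple


class Policy(NamedTuple):
    """A candidate routing policy. Field names are assumptions for this sketch."""
    name: str
    cost_usd_per_1k_tasks: float
    accuracy: float  # fraction of benchmark tasks answered correctly


def pareto_frontier(policies):
    """Return the non-dominated policies, ordered from cheapest to most expensive."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; within equal cost, highest accuracy first.
    ordered = sorted(policies, key=lambda p: (p.cost_usd_per_1k_tasks, -p.accuracy))
    for policy in ordered:
        # Keep a policy only if it beats every cheaper policy on accuracy.
        if policy.accuracy > best_accuracy:
            frontier.append(policy)
            best_accuracy = policy.accuracy
    return frontier


# Placeholder numbers chosen only to exercise the function.
candidates = [
    Policy("local_only", 2.0, 0.78),
    Policy("frontier_only", 40.0, 0.93),
    Policy("route_simple_to_local", 12.0, 0.91),
    Policy("route_random", 21.0, 0.85),
]

frontier_names = [p.name for p in pareto_frontier(candidates)]
print(frontier_names)
# ['local_only', 'route_simple_to_local', 'frontier_only']
```

In this made-up example `route_random` drops out because `route_simple_to_local` is both cheaper and more accurate. The remaining policies are the real choices a routing matrix would have to weigh.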

Research Question: Do markdown SOP files and decision registers improve agentic search accuracy?

Why It Matters: Designing contexts so that agents extract structural logic instantly.

Tools Involved: SOP files, mistake logs, decision history, repository contexts, prompt template files.

Benchmark Method: Context retrieval scores, prompt execution accuracy comparisons.

Expected Outputs: Context Engineering Guide, Documentation Quality Rubric, Mistake Logging Protocol.

Infrastructure Layer

AI Systems Infrastructure
And Proprietary Research Environment.

We provide controlled, qualification-gated access to custom harnesses, private databases, and serving networks. Access is staged and tied directly to project workloads.

Encrypted model routing pipelines and token rate limiter configurations.
Confidential prompt setups, benchmark configurations, and evaluation environments.
Source-built task ledgers, internal QA protocols, and custom system context packages.

Oh-My-Pi-style execution sandboxes for logging agent operations.
Pi.dev agent orchestrations, Claude Code trace scripts, and Codex task registers.
WebSocket MCP servers, validation gates, and agent loop breaker scripts.

Frontier model API integrations for Claude Code and OpenAI Codex.
Hermes-style multi-terminal scripts and OpenClaw repo agent loops.
Custom critique agents and regression validation tool calls.

Local runtimes serving Llama-3, Qwen, and quantized GGUF models via llama.cpp.
vLLM serving structures, Ollama interfaces, and local embedding models.
Private API gateways comparing local model latency against frontier providers.

Fine-tuned models customized on Source University training data and system runbooks.
Specialized diagnostic models tuned to parse organic visibility and GBP metrics.
Model steering scripts, context evaluation tests, and custom log output validators.

Source is interested in neuron-level hallucination mitigation research, including hallucination-associated neuron detection, activation steering, local model intervention experiments, groundedness evaluation, and reliability benchmarking.

*Safety Boundary: We make no claim that hallucinations are solved, that we possess non-hallucinating models, or that perfect factuality is guaranteed.

pgvector, Qdrant, Pinecone, and Weaviate index setups for vertical service data.
Chunking systems, embeddings, hybrid search, and semantic similarity rerankers.
Hallucination check scripts verifying retrieved chunks against generated answers.
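
A hallucination check script of this kind can be sketched, in its simplest form, as a lexical support test: flag every answer sentence whose words barely overlap with any retrieved chunk. The version below is an assumption-laden illustration, not Source's method. The function name, the 0.6 threshold, and the sample text are all hypothetical, and word overlap is a crude proxy that a real audit would likely supplement with an entailment model or human review.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric word set for a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(answer, chunks, min_overlap=0.6):
    """Return answer sentences whose best word overlap with any chunk is too low."""
    chunk_tokens = [_tokens(chunk) for chunk in chunks]
    flagged = []
    # Naive sentence split on ., !, or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        best = max(
            (len(words & tokens) / len(words) for tokens in chunk_tokens),
            default=0.0,
        )
        if best < min_overlap:
            flagged.append(sentence)
    return flagged


# Hypothetical retrieved chunk and generated answer.
retrieved_chunks = ["The clinic is open Monday to Friday from 9am to 5pm."]
generated_answer = (
    "The clinic is open Monday to Friday. "
    "It also offers free parking on weekends."
)

flagged = unsupported_sentences(generated_answer, retrieved_chunks)
print(flagged)  # ['It also offers free parking on weekends.']
```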

Routing middleware verifying query complexity prior to model selection.
Multi-model committee consensus scripts and automated grading runbooks.
Model latency/cost dashboards mapping routing policy effectiveness.

Automated evaluations, regression tests, and version control check routines.
Model prompt tracking logs, evaluation dataset registries, and run manifests.
Continuous integration pipelines testing code generation safety gates.

Grafana dashboards, Prometheus metrics, and OpenTelemetry instrumentation.
Loki logs and Tempo traces mapping agent tool-execution paths.
Token expenditure tracking dashboards and API latency monitors.

Private virtual servers (VPS) and isolated Linux VMs.
High-core CPU and high-RAM workstations with remote SSH/RDP targets.
Docker/Podman environments and remote JupyterLab server endpoints.

Model Context Protocol (MCP) servers and WebSocket communication bridges.
Agent file-system tools, command execution guards, and network filters.
Tool-call logs and automated schema validation scripts.

Markdown document manifests and system SOP repositories.
Mistake tracking databases, decision logs, and context packaging tools.
Grounded-prompt templates and retrieval-ready runbook formatting guides.

RAG + tuned models integrated with custom Pi harnesses.
Codex execution + validation scripts linked to telemetry dashboards.
Ensemble routing configs + Grafana logs + human approval queues.

Proof-of-Work

Research Output
Comes First.

The AI Systems Researcher lane is strictly output-driven. We measure value by the quality and reproducibility of the scorecards, datasets, and memos submitted.

01

AI Tutor Evaluation

Reports measuring whether AI tutors improve learning, assignment quality, proof-of-work output, and operator readiness.

02

Harness Benchmarks

Reports measuring task completion, validation pass rate, restartability, auditability, model behavior, and tool-use reliability.

03

RAG Scorecards

Reports measuring retrieval precision, recall, source grounding, citation faithfulness, chunking quality, and hallucination behavior.

04

Hallucination Audits

Reports identifying unsupported claims, fabricated citations, missing sources, overconfidence, and grounding failures.

05

Comparison Matrices

Side-by-side comparisons of frontier models, local models, coding agents, judge models, embedding models, and rerankers.

06

Model Routing Maps

Reports identifying which models should handle which tasks, with cost, latency, quality, and risk considerations.

07

Local Inference Memos

Reports measuring latency, throughput, quality, cost, privacy, local-vs-frontier performance, and deployment tradeoffs.

08

Dashboard Utility

Reports reviewing whether dashboards expose useful state, alerts, queues, failures, readiness, and next actions.

09

Workflow QA

Reports identifying failure modes, validation gaps, review burden, false approvals, and human-in-the-loop improvements.

10

BuildGraph Memos

Reports reviewing multi-agent build runs, branch variants, layer locks, critique loops, telemetry, and convergence patterns.

11

Documentation Quality

Reports evaluating canonical artifacts, SOPs, handoff packets, mistake logs, decision logs, and retrieval-ready documentation.

12

Proof-of-Work

Public or private artifacts demonstrating real research contribution, where mutually approved and appropriate.

Every formal research memo submitted to the lab is structured to include:

1. Research Question
2. Target Hypothesis
3. Systems Tested
4. Tools Used
5. Data or Corpus Used
6. Methodology Steps
7. Benchmark Criteria
8. Results Summary
9. Failure Cases
10. System Limitations
11. Screenshots or Logs
12. Reproducibility Notes
13. Confidence Level
14. Recommendations
15. Next Experiments
16. Reviewer Notes
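
The sixteen memo sections listed above lend themselves to a simple completeness check before submission. The sketch below assumes a memo is held as a plain dictionary keyed by section title; that representation, the `missing_sections` helper, and the sample draft are hypothetical and do not reflect any official Source submission format.

```python
REQUIRED_SECTIONS = (
    "Research Question",
    "Target Hypothesis",
    "Systems Tested",
    "Tools Used",
    "Data or Corpus Used",
    "Methodology Steps",
    "Benchmark Criteria",
    "Results Summary",
    "Failure Cases",
    "System Limitations",
    "Screenshots or Logs",
    "Reproducibility Notes",
    "Confidence Level",
    "Recommendations",
    "Next Experiments",
    "Reviewer Notes",
)


def missing_sections(memo):
    """Return required section titles that are absent or blank, in memo order."""
    missing = []
    for title in REQUIRED_SECTIONS:
        value = memo.get(title)
        if value is None or not str(value).strip():
            missing.append(title)
    return missing


# Hypothetical draft: every section filled except two.
draft = {title: "filled in" for title in REQUIRED_SECTIONS}
draft["Failure Cases"] = "   "   # whitespace only counts as blank
del draft["Reviewer Notes"]      # section left out entirely

gaps = missing_sections(draft)
print(gaps)  # ['Failure Cases', 'Reviewer Notes']
```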

Access Model

Access Is
Qualification-Gated.

Not every researcher receives access to every system. Source Research Lab access is qualification-based, project-scoped, permissioned, staged, revocable, and confidentiality-bound.

01

Public Review

Candidate reviews research lanes, program summaries, and boundaries.

02

Application

Candidate submits background, proof-of-work, research interests, and availability.

03

Vetting

Source evaluates technical capability, logical clarity, and confidentiality readiness.

04

Match

Candidate matched to specific, scoped AI Systems programs.

05

Access

Researcher receives scoped access keys for target compute tools.

06

Output

Researcher submits structured memos, code runs, and scorecards.

07

Review

Strong candidates evaluated for paid collaboration or deeper tasks.

Selected researchers may receive controlled, qualification-based access to Source research environments, proprietary configurations, custom harnesses, private corpora, model workflows, and dashboards depending on project fit, approval, licensing, confidentiality, and availability.

Certain configurations, internal benchmarks, and experimental systems are held back from public visibility and may only be discussed with qualified candidates after review. Access to compute, terminals, paid APIs, local inference setups, or frontier models is not guaranteed.

Boundaries

What This Research Lane
Is Not.

We operate a serious applied research space. Review these non-guarantees before submitting an application.

No Passive Video Course

This is not a prompt bootcamp or slide presentation track. Value is measured strictly by active test sandbox outputs.

No Open Compute

We do not provide free API tokens or server access for personal research projects. Resources are locked to matched scopes.

No Job Guarantees

Participating in model vetting does not guarantee employment, paid fellowships, or contract engagements with Source.

Research Output Comes First

This page is not offering unrestricted access to models, servers, tools, corpora, or proprietary Source systems. Access is tied strictly to qualification, trust, project fit, confidentiality, availability, resource cost, clear scope, and useful research output.

Application Preview

What Applicants Should Be
Ready To Answer.

We vet candidates systematically. Review these targeted questions to prepare your application.

General Questions

• Why apply to Source Research Lab as an AI Researcher?
• What code proof-of-work can you showcase?
• How many hours per week can you commit?
• Are you comfortable writing detailed memos?

AI Systems Questions

• Which model classes (OpenAI, Claude, Llama) have you deployed?
• Have you configured LangChain, LlamaIndex, or pgvector?
• Have you tested local runtimes via llama.cpp or Ollama?
• Have you built or evaluated tool-using AI agents?

Research Judgment

• What RAG chunking failure mode concerns you most?
• How would you systematically audit LLM hallucinations?
• How would you compare two model runtimes fairly?
• How would you verify whether a dashboard is useful?

Boundary Questions

• Are you willing to document failure logs?
• Do you understand that compute resources are not guaranteed?
• Are you comfortable working with staged database access keys?

Support Desk

Frequently Asked
Questions.

Clarifications on testing scopes, requirements, and benchmark routes.

No. This is a selective research and benchmark lane inside Source Research Lab. Strong candidates may be considered for deeper involvement, paid collaboration, or future Source tasks, but nothing is guaranteed.

No. The Source University AI Systems training program is designed to train candidates. This page is for the research lane that evaluates the systems, workflows, tutors, corpora, dashboards, and infrastructure behind that training.

No. A PhD is not required. Strong technical ability, research discipline, structured documentation, proof-of-work, and seriousness matter more than titles.

Coding ability is strongly useful, especially for agent harnesses, RAG systems, dashboards, local inference labs, benchmark scripts, and evaluation workflows. However, some research tasks may emphasize evaluation, documentation, QA, benchmarking, or system analysis.

Depending on fit, you may help evaluate AI tutors, RAG systems, agent harnesses, local inference labs, custom model workflows, dashboards, vertical intelligence corpora, model routing, hallucination mitigation experiments, BuildGraph-style workflows, documentation systems, or proprietary Source configurations.

Possibly. Access is qualification-based, project-scoped, and subject to approval, confidentiality, availability, and need. Not all systems are publicly described or available to every researcher.

Possibly. Selected researchers may work with frontier model workflows, local models, private inference endpoints, custom-tuned models, or evaluation infrastructure where available and approved.

Possibly. Selected researchers may work inside custom Pi-style harness environments where AI agents are assigned, observed, logged, benchmarked, corrected, and compared across controlled execution runs.

Possibly. The lane includes RAG/retrieval quality evaluation, corpus audits, citation faithfulness testing, chunking comparisons, embedding/reranking comparisons, and source-grounded answer review.

Possibly. Source is interested in hallucination-associated neuron detection, activation steering, local model intervention experiments, groundedness evaluation, hallucination audits, and reliability benchmarking. We make no claim that hallucination has been solved.

Possibly. Public proof-of-work, case studies, testimonials, or public research summaries may happen only where mutually approved and appropriate. Public exposure is not guaranteed.

Possibly. Strong candidates may be considered for paid collaboration, deeper research involvement, or future Source tasks if qualified. This is not guaranteed.

No. Access is controlled, project-scoped, and subject to availability, cost, licensing, approval, qualification, and confidentiality boundaries.

Expected outputs may include AI tutor evaluation reports, agent harness benchmarks, RAG scorecards, hallucination audits, model comparison matrices, model-routing recommendations, local inference benchmarks, dashboard utility reports, workflow QA reports, BuildGraph memos, documentation quality reports, and proof-of-work artifacts.

Applicants should apply for AI Systems Research review and provide background, proof-of-work, research interests, technical experience, availability, and willingness to produce structured outputs.

AI Systems Research Review

Apply For
AI Systems Research Review.

If you are serious about helping test, benchmark, document, and improve advanced AI learning systems, agent harnesses, RAG corpora, model workflows, dashboards, local inference labs, custom Pi harnesses, and proprietary Source research environments, you may apply for AI Systems Research review.

AI Systems Research • Qualification-Gated • Outputs Matter