Knowledge Management with RAG

BioMaster employs a dual Retrieval-Augmented Generation (RAG) system to retrieve relevant domain knowledge at planning time and at execution time:

PLAN RAG – Guides high-level workflow decomposition
EXECUTE RAG – Provides detailed tool/script usage for task execution

PLAN RAG

The PLAN Agent retrieves step-by-step analysis workflows using PLAN RAG. To add a new workflow:

Collect a reference. Find a reliable source or protocol (e.g., from nf-core, published papers, or existing pipelines).

Describe the workflow. Use a standardized, concise format that includes:

Step description
Input required (with data format)
Expected output
Tools used

Example:

Step 2: Alignment – Align reads to the reference genome.
Input: Cleaned FASTQ files and the reference genome
Output: Sorted BAM file
Tools: BWA-MEM, STAR
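
A minimal sketch of how the four-field step format above could be produced consistently. The helper name `format_step` and its parameters are hypothetical and are not part of BioMaster; the sketch only reproduces the layout of the example.

```python
def format_step(number, name, description, inputs, outputs, tools):
    """Render one workflow step in the four-field layout shown above.

    Hypothetical helper for illustration; BioMaster does not ship it.
    """
    return "\n".join([
        f"Step {number}: {name} – {description}",
        f"Input: {inputs}",
        f"Output: {outputs}",
        f"Tools: {', '.join(tools)}",
    ])


step_text = format_step(
    2,
    "Alignment",
    "Align reads to the reference genome.",
    "Cleaned FASTQ files and the reference genome",
    "Sorted BAM file",
    ["BWA-MEM", "STAR"],
)
print(step_text)
```

Joining several such step strings with blank lines gives plain text that can serve as the `content` value of a PLAN RAG entry.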

Add an entry to `doc/Plan_Knowledge.json`. Use the following JSON format:

{
  "content": "Full workflow steps in plain text...",
  "metadata": {
    "source": "workflow",
    "page": 1
  }
}
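
Adding an entry by hand works, but a small script avoids JSON syntax slips. The sketch below assumes that `doc/Plan_Knowledge.json` holds a top-level JSON array of entries in the format above; this section only shows the shape of a single entry, so check the actual file first. The function name `append_entry` is hypothetical.

```python
import json
from pathlib import Path


def append_entry(path, content, source, page=1):
    """Append one knowledge entry and return the new number of entries.

    ASSUMPTION: the file is a top-level JSON array of entries. The README
    only documents a single entry, so verify this against the real file.
    """
    path = Path(path)
    # Start from an empty list when the file does not exist yet.
    entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    if not isinstance(entries, list):
        raise ValueError(f"{path} does not contain a JSON array")
    entries.append({
        "content": content,
        "metadata": {"source": source, "page": page},
    })
    path.write_text(
        json.dumps(entries, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return len(entries)


# Example call (commented out so that importing this file has no side effects):
# append_entry("doc/Plan_Knowledge.json", "Full workflow steps in plain text...", "workflow")
```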

EXECUTE RAG

The TASK Agent uses EXECUTE RAG to generate shell scripts for each step. To contribute:

Document the usage of each script, tool, or function:
- Include input/output specifications
- Provide example commands
- Note usage location (e.g., ./scripts/, functions.py)

Add an entry to `doc/Task_Knowledge.json`. Example:

{
  "content": "run-sort-bam.sh:\nSorts BAM file by coordinate...\nUsage:\nbash ./scripts/run-sort-bam.sh <input.bam> <output_prefix>",
  "metadata": {
    "source": "run-sort-bam.sh",
    "page": 6
  }
}
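
The `content` string in the example follows a simple pattern: script name, short description, then a `Usage:` line with the command. A minimal sketch of building such an entry from separate fields is shown here. The helper `make_task_entry` is hypothetical, and using the script name as `metadata.source` simply mirrors the example rather than a documented requirement.

```python
import json


def make_task_entry(name, description, usage, page):
    """Build an EXECUTE RAG entry dict in the layout of the example above.

    Hypothetical helper for illustration; BioMaster does not ship it.
    """
    content = f"{name}:\n{description}\nUsage:\n{usage}"
    return {"content": content, "metadata": {"source": name, "page": page}}


entry = make_task_entry(
    "run-sort-bam.sh",
    "Sorts BAM file by coordinate...",
    "bash ./scripts/run-sort-bam.sh <input.bam> <output_prefix>",
    page=6,
)
# json.dumps escapes the newlines as \n, matching the JSON shown above.
print(json.dumps(entry, indent=2))
```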

Best Practices

✅ Be specific and concise
✅ Mention file formats where applicable
🚫 Avoid redundant or vague entries
🔁 After any change, delete the local vector store:
```
rm -rf ./chroma_db
```
📌 Use the metadata.source field to tag by tool/script/workflow name for better retrieval
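
Some of these practices can be checked mechanically before committing. The sketch below flags empty content, a missing `metadata.source`, and duplicate content (compared case-insensitively with whitespace collapsed). The function `lint_entries` and its rules are illustrative assumptions, not a check that BioMaster performs.

```python
def lint_entries(entries):
    """Return a list of human-readable problems found in knowledge entries.

    Illustrative checks only; `entries` is a list of dicts in the
    {"content": ..., "metadata": {...}} format used by both JSON files.
    """
    problems = []
    seen = {}  # normalized content -> index of first occurrence
    for i, entry in enumerate(entries):
        content = entry.get("content", "")
        metadata = entry.get("metadata") or {}
        if not isinstance(content, str) or not content.strip():
            problems.append(f"entry {i}: empty content")
            continue
        if not metadata.get("source"):
            problems.append(f"entry {i}: missing metadata.source")
        # Collapse whitespace and case so trivially different copies match.
        key = " ".join(content.split()).lower()
        if key in seen:
            problems.append(f"entry {i}: duplicates entry {seen[key]}")
        else:
            seen[key] = i
    return problems
```

An empty result means none of the three checks fired; it does not say anything about whether an entry is specific enough to retrieve well.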

Updating Knowledge

To update or remove knowledge:

Edit the corresponding JSON file (`doc/Plan_Knowledge.json` or `doc/Task_Knowledge.json`).
Then delete `./chroma_db/` to force regeneration of the knowledge embeddings.
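
For setups where the reset is scripted from Python instead of the shell, the following sketch is a rough equivalent of `rm -rf ./chroma_db`. The function name `reset_vector_store` is hypothetical, and the default path is taken from this section.

```python
import shutil
from pathlib import Path


def reset_vector_store(store_dir="./chroma_db"):
    """Delete the vector store directory; return True if it was removed.

    Hypothetical helper. Returns False when the directory does not exist,
    so it is safe to call repeatedly.
    """
    store = Path(store_dir)
    if not store.is_dir():
        return False
    shutil.rmtree(store)
    return True
```

Call it after editing either JSON file and before the next BioMaster run, so the run starts without a stale store.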