Knowledge Management with RAG
BioMaster uses a dual Retrieval-Augmented Generation (RAG) system to retrieve relevant domain knowledge during planning and execution:
PLAN RAG – Guides high-level workflow decomposition
EXECUTE RAG – Provides detailed tool/script usage for task execution
PLAN RAG
The PLAN Agent retrieves step-by-step analysis workflows using PLAN RAG. To add a new workflow:
Collect a reference. Find a reliable source or protocol (e.g., from nf-core, published papers, or existing pipelines).
Describe the workflow. Use a standardized, concise format that includes:
Step description
Input required (with data format)
Expected output
Tools used
Example:
Step 2: Alignment – Align reads to the reference genome.
Input: Cleaned FASTQ files and the reference genome
Output: Sorted BAM file
Tools: BWA-MEM, STAR
Add an entry to `doc/Plan_Knowledge.json`. Use the following JSON format:
{ "content": "Full workflow steps in plain text...", "metadata": { "source": "workflow", "page": 1 } }
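The entry format above can also be written programmatically. The following is a minimal Python sketch, not part of BioMaster itself: the helper name `add_knowledge_entry` is hypothetical, and it assumes the knowledge file holds a top-level JSON array of entries in the format shown.

```python
import json
from pathlib import Path


def add_knowledge_entry(path, content, source, page=1):
    """Append one entry to a knowledge file and return the new entry count.

    Assumption: the file holds a top-level JSON array of entries.
    A missing file is treated as an empty array.
    """
    path = Path(path)
    if path.exists():
        entries = json.loads(path.read_text(encoding="utf-8"))
    else:
        entries = []
    if not isinstance(entries, list):
        raise ValueError(f"{path} does not contain a JSON array")

    # Same shape as the documented entry: content plus source/page metadata.
    entries.append({"content": content, "metadata": {"source": source, "page": page}})
    path.write_text(
        json.dumps(entries, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return len(entries)
```

A call such as `add_knowledge_entry("doc/Plan_Knowledge.json", workflow_text, "workflow")` would then append the workflow description, where `workflow_text` is the plain-text list of steps.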
EXECUTE RAG
The TASK Agent uses EXECUTE RAG to generate shell scripts for each step. To contribute:
Document script/tool/function usage
Include input/output specifications
Provide example commands
Note usage location (e.g., ./scripts/, functions.py)
Add an entry to `doc/Task_Knowledge.json`. Example:
{ "content": "run-sort-bam.sh:\nSorts BAM file by coordinate...\nUsage:\nbash ./scripts/run-sort-bam.sh <input.bam> <output_prefix>", "metadata": { "source": "run-sort-bam.sh", "page": 6 } }
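The `content` string in the example follows a simple layout: name, description, then a usage line. As a rough sketch, an entry with that layout could be assembled as follows. The helper names `make_task_entry` and `entry_to_json` are hypothetical, and reusing the script name as `metadata.source` mirrors the example rather than a documented requirement.

```python
import json


def make_task_entry(name, description, usage, page=1):
    """Build one entry in the layout of the example: name, description, usage."""
    content = f"{name}:\n{description}\nUsage:\n{usage}"
    # Assumption: the script/tool name doubles as the metadata.source tag.
    return {"content": content, "metadata": {"source": name, "page": page}}


def entry_to_json(entry):
    """Serialize an entry for pasting into the knowledge file."""
    return json.dumps(entry, indent=2, ensure_ascii=False)
```

Calling `make_task_entry("run-sort-bam.sh", "Sorts BAM file by coordinate...", "bash ./scripts/run-sort-bam.sh <input.bam> <output_prefix>", page=6)` reproduces the example entry above.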
Best Practices
✅ Be specific and concise
✅ Mention file formats where applicable
🚫 Avoid redundant or vague entries
🔁 After any change, delete the local vector store:
rm -rf ./chroma_db
📌 Use the metadata.source field to tag by tool/script/workflow name for better retrieval
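Some of these practices can be checked mechanically before committing a change. The sketch below is one possible lint pass, not a BioMaster feature: `check_entry` and `duplicate_indices` are hypothetical helpers, and the rules they apply (non-empty `content`, non-empty `metadata.source`, no repeated `content`) are a reading of the list above.

```python
def check_entry(entry):
    """Return a list of problems found in one knowledge entry (empty if none)."""
    problems = []
    content = entry.get("content")
    if not isinstance(content, str) or not content.strip():
        problems.append("content is missing or empty")
    metadata = entry.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("metadata is missing")
    elif not str(metadata.get("source", "")).strip():
        problems.append("metadata.source is missing or empty")
    return problems


def duplicate_indices(entries):
    """Return indices of entries whose content repeats an earlier entry.

    Comparison ignores case and whitespace differences.
    """
    seen = set()
    duplicates = []
    for index, entry in enumerate(entries):
        key = " ".join(str(entry.get("content", "")).split()).lower()
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates
```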
Updating Knowledge
To update or remove knowledge:
Edit the corresponding JSON file (`doc/Plan_Knowledge.json` or `doc/Task_Knowledge.json`)
Then delete ./chroma_db/ to force regeneration of knowledge embeddings
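The two steps can be scripted together. The following is a minimal sketch under stated assumptions: the knowledge file is a top-level JSON array, entries are selected by their `metadata.source` tag, and the function names are hypothetical. `shutil.rmtree` stands in for `rm -rf ./chroma_db`.

```python
import json
import shutil
from pathlib import Path


def remove_entries_by_source(knowledge_path, source):
    """Drop every entry tagged with `source`; return how many were removed.

    Assumption: the file holds a top-level JSON array of entries.
    """
    path = Path(knowledge_path)
    entries = json.loads(path.read_text(encoding="utf-8"))
    kept = [e for e in entries if e.get("metadata", {}).get("source") != source]
    path.write_text(
        json.dumps(kept, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return len(entries) - len(kept)


def reset_vector_store(store_dir="./chroma_db"):
    """Delete the local vector store directory; a missing directory is ignored."""
    shutil.rmtree(store_dir, ignore_errors=True)
```

For example, `remove_entries_by_source("doc/Task_Knowledge.json", "run-sort-bam.sh")` followed by `reset_vector_store()` removes the script's entries and clears the store so the embeddings are regenerated.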