feat: initial ResearchOwl

2026-04-27 13:49:07 +00:00
commit ba08536337
37 changed files with 2431 additions and 0 deletions
@@ -0,0 +1,108 @@
+# 🦉 ResearchOwl
+
+**Exhaustive research engine with Telegram interface.**
+
+Recursively discovers, scrapes, and processes sources from across the web,
+then generates podcast scripts, blog posts, reports, or social threads using Ollama.
+
+## Architecture
+
+```
+Telegram (/research <topic>)
+    ↓
+ExhaustiveScraper
+    ├── DuckDuckGo (8 queries × 5 results)
+    ├── Wikipedia + recursive internal links
+    ├── Reddit (top posts + top comments)
+    ├── YouTube (transcripts)
+    ├── PDFs (public documents)
+    └── Web scraping (trafilatura)
+         ↓ recursive expansion (depth 1-3)
+ContentProcessor (Ollama qwen2.5:3b)
+    ├── Chunking (800-token chunks, 100-token overlap)
+    ├── Quality scoring (0-10 per chunk)
+    ├── Embeddings (cosine similarity RAG)
+    └── Deduplication
+         ↓
+OutputGenerator (Ollama)
+    ├── 🎙️ Podcast script (20-30 min)
+    ├── 📝 Blog post (1500-2500 words)
+    ├── 📊 Research report (structured)
+    └── 🐦 Social thread (15-25 tweets)
+```
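
The chunking and similarity steps in the diagram can be sketched as follows. This is a minimal illustration, not the project's actual `ContentProcessor` code: `chunk_tokens` and `cosine_similarity` are hypothetical names, and the input is assumed to be an already-tokenized list (the diagram does not say which tokenizer is used).

```python
import math
from typing import List, Sequence


def chunk_tokens(tokens: List[str], chunk_size: int = 800, overlap: int = 100) -> List[List[str]]:
    """Split tokens into windows of `chunk_size` where neighbours share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each window starts this many tokens after the previous one
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With the defaults, consecutive chunks start 700 tokens apart, so the last 100 tokens of one chunk reappear at the start of the next.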
+
+## Telegram Commands
+
+| Command | Description |
+|---------|-------------|
+| `/research <topic>` | Start exhaustive research |
+| `/status` | Check progress |
+| `/finish` | Stop early, proceed to generation |
+| `/generate podcast\|blog\|report\|thread` | Generate output |
+| `/sources` | List all sources found |
+| `/cancel` | Cancel current research |
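
One way the commands in this table could be validated before dispatch is sketched below. It is an assumption-laden illustration that does not use any Telegram library: `parse_command`, `COMMANDS`, and `OUTPUT_FORMATS` are hypothetical names, and the real bot may parse messages differently.

```python
from typing import Optional, Tuple

# Command names and output formats copied from the table above.
COMMANDS = {"research", "status", "finish", "generate", "sources", "cancel"}
OUTPUT_FORMATS = {"podcast", "blog", "report", "thread"}


def parse_command(text: str) -> Tuple[str, Optional[str]]:
    """Return (command, argument) for a message, or raise ValueError if it is invalid."""
    if not text.startswith("/"):
        raise ValueError("not a command")
    head, _, arg = text[1:].strip().partition(" ")
    # In group chats Telegram may append "@botname" to the command; drop it.
    command = head.split("@", 1)[0].lower()
    argument = arg.strip() or None
    if command not in COMMANDS:
        raise ValueError(f"unknown command: /{command}")
    if command == "research" and argument is None:
        raise ValueError("usage: /research <topic>")
    if command == "generate":
        if argument is None or argument.lower() not in OUTPUT_FORMATS:
            raise ValueError("usage: /generate podcast|blog|report|thread")
        argument = argument.lower()  # normalize the format name
    return command, argument
```

For example, `parse_command("/generate Blog")` yields `("generate", "blog")`, while `/generate video` is rejected with a usage message.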
+
+## Local Development
+
+```bash
+# 1. Clone and set up
+git clone https://git.chemavx.xyz/chemavx/researchowl
+cd researchowl
+
+# 2. Create virtualenv
+python3 -m venv venv && source venv/bin/activate
+pip install -r requirements.txt
+
+# 3. Configure
+cp .env.example .env
+# Edit .env with your values
+
+# 4. Run
+python main.py
+```
+
+## Deploy to k3s
+
+```bash
+# 1. Create namespace and secrets
+kubectl create namespace researchowl
+kubectl create secret generic researchowl-secrets \
+  --from-literal=telegram-bot-token=YOUR_TOKEN \
+  --from-literal=telegram-allowed-users=YOUR_USER_ID \
+  -n researchowl
+
+# 2. Copy manifests to your k8s-manifests repo
+cp k8s/*.yaml /path/to/k8s-manifests/researchowl/
+
+# 3. Apply ArgoCD app
+kubectl apply -f k8s/argocd-app.yaml
+
+# 4. Push to Gitea → Gitea Actions builds → ArgoCD deploys
+git add . && git commit -m "feat: add researchowl" && git push
+```
+
+## Tuning
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `MAX_SOURCES` | 150 | Hard cap on sources |
+| `MAX_DEPTH` | 3 | Link recursion depth |
+| `QUALITY_THRESHOLD` | 0.4 | Min chunk quality (0-1) |
+| `REQUEST_DELAY` | 1.0s | Delay between requests |
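
A minimal sketch of how these variables might be loaded, assuming they are read from the environment (for example via `.env`) and fall back to the defaults in the table. `Settings` and `load_settings` are hypothetical names; accepting an optional trailing `s` on `REQUEST_DELAY` is also an assumption, made only because the table writes the default as `1.0s`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    max_sources: int = 150           # hard cap on sources
    max_depth: int = 3               # link recursion depth
    quality_threshold: float = 0.4   # minimum chunk quality, on a 0-1 scale
    request_delay: float = 1.0       # seconds between requests


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment-style variables, using the table defaults."""
    settings = Settings(
        max_sources=int(env.get("MAX_SOURCES", "150")),
        max_depth=int(env.get("MAX_DEPTH", "3")),
        quality_threshold=float(env.get("QUALITY_THRESHOLD", "0.4")),
        request_delay=float(env.get("REQUEST_DELAY", "1.0").rstrip("s")),
    )
    if settings.max_sources < 1 or settings.max_depth < 1:
        raise ValueError("MAX_SOURCES and MAX_DEPTH must be at least 1")
    if not 0.0 <= settings.quality_threshold <= 1.0:
        raise ValueError("QUALITY_THRESHOLD must be between 0 and 1")
    if settings.request_delay < 0:
        raise ValueError("REQUEST_DELAY must not be negative")
    return settings
```

Validating at load time means a typo such as `QUALITY_THRESHOLD=4` fails at startup instead of silently filtering out every chunk.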
+
+**Want more thoroughness?**
+- Increase `MAX_SOURCES` to 300+
+- Increase `MAX_DEPTH` to 4-5
+- Lower `QUALITY_THRESHOLD` to 0.3
+
+**Want faster results?**
+- Lower `MAX_SOURCES` to 50
+- Set `MAX_DEPTH` to 1-2
+- Raise `QUALITY_THRESHOLD` to 0.6
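
To see why `MAX_DEPTH` and `MAX_SOURCES` dominate run time, here is a rough sketch of depth-limited, breadth-first link discovery. It is not the project's `ExhaustiveScraper`: `discover` and `get_links` are hypothetical, seeds are assumed to sit at depth 0, and a plain dictionary stands in for real page fetching.

```python
from collections import deque
from typing import Callable, Iterable, List


def discover(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 3,
    max_sources: int = 150,
) -> List[str]:
    """Breadth-first expansion from the seeds, bounded by depth and total source count."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append((seed, 0))  # seeds are depth 0
    found: List[str] = []
    while queue and len(found) < max_sources:
        url, depth = queue.popleft()
        found.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:  # skip sources that are already queued or visited
                seen.add(link)
                queue.append((link, depth + 1))
    return found


if __name__ == "__main__":
    # Toy link graph used in place of real HTTP requests.
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
    print(discover(["a"], lambda u: graph.get(u, []), max_depth=2))
```

Each extra level of depth can multiply the number of candidate links, so `MAX_SOURCES` acts as the backstop that keeps a deep crawl bounded.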
+
+## Notes
+
+- Uses **qwen2.5:3b** on your existing Ollama instance for all AI tasks, so there is no API cost
+- Optionally set `ANTHROPIC_API_KEY` to enable Claude as a fallback for generation
+- The SQLite database is stored at `/data/researchowl.db`
+- All outputs saved to DB and available via `/outputs`