ChemaVX 0c7176dd0b
fix: add /process command, log quality filtering, improve Reddit headers
- bot.py: add cmd_process handler to manually trigger chunk processing
  on the last session; register CommandHandler("process")
- processor.py: log exceptions from asyncio.gather instead of silently
  dropping them; add per-chunk quality score debug logging; warn when
  all chunks filtered by quality threshold with actionable hint;
  raise fallback score to 0.6 so Ollama failures don't filter chunks
- exhaustive.py: replace bot User-Agent with full browser UA + headers
  for REDDIT_HEADERS; downgrade Reddit 403 from warning to info since
  server IPs are routinely blocked; use content_type=None on json()
  to avoid aiohttp content-type mismatch errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 20:37:39 +00:00
2026-04-27 13:49:07 +00:00

🦉 ResearchOwl

Exhaustive research engine with Telegram interface.

Recursively discovers, scrapes, and processes sources from across the web, then generates podcast scripts, blog posts, reports, or social threads using Ollama.

Architecture

Telegram (/research <topic>)
    ↓
ExhaustiveScraper
    ├── DuckDuckGo (8 queries × 5 results)
    ├── Wikipedia + recursive internal links
    ├── Reddit (top posts + top comments)
    ├── YouTube (transcripts)
    ├── PDFs (public documents)
    └── Web scraping (trafilatura)
         ↓ recursive expansion (depth 1-3)
ContentProcessor (Ollama qwen2.5:3b)
    ├── Chunking (800-token chunks, 100-token overlap)
    ├── Quality scoring (0-10 per chunk)
    ├── Embeddings (cosine similarity RAG)
    └── Deduplication
         ↓
OutputGenerator (Ollama)
    ├── 🎙️ Podcast script (20-30 min)
    ├── 📝 Blog post (1500-2500 words)
    ├── 📊 Research report (structured)
    └── 🐦 Social thread (15-25 tweets)
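
The chunking step in the diagram (800-token chunks with a 100-token overlap) can be sketched as a sliding window. This is a minimal illustration, not the code in processor.py: the function name `chunk_tokens` is hypothetical, and the placeholder token list stands in for whatever tokenizer the project actually uses.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 800, overlap: int = 100) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # with the defaults, each chunk starts 700 tokens after the previous one
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Placeholder tokens (assumption): a real pipeline would tokenize scraped text instead.
tokens = [f"t{i}" for i in range(2000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

The overlap means the last 100 tokens of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.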

Telegram Commands

Command                               Description
/research <topic>                     Start exhaustive research
/status                               Check progress
/finish                               Stop early, proceed to generation
/generate podcast|blog|report|thread  Generate output
/sources                              List all sources found
/cancel                               Cancel current research
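
As an illustration of how the `/generate` argument could be validated before any generation work starts, here is a small sketch in plain Python. The names `parse_generate` and `OUTPUT_TYPES` are hypothetical and not taken from bot.py; the real handler is registered through the Telegram library and may behave differently (for example, this sketch does not handle a `/generate@botname` suffix).

```python
from typing import Optional, Tuple

# The four output types listed in the command table.
OUTPUT_TYPES = ("podcast", "blog", "report", "thread")


def parse_generate(text: str) -> Tuple[Optional[str], str]:
    """Return (output_type, error). output_type is None when the message is invalid."""
    parts = text.strip().split()
    if not parts or parts[0] != "/generate":
        return None, "not a /generate command"
    if len(parts) != 2 or parts[1].lower() not in OUTPUT_TYPES:
        return None, "usage: /generate " + "|".join(OUTPUT_TYPES)
    return parts[1].lower(), ""


print(parse_generate("/generate blog"))
print(parse_generate("/generate essay"))
```

Rejecting unknown types up front lets the bot reply with a usage hint instead of starting a long Ollama run that cannot succeed.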

Local Development

# 1. Clone and set up
git clone https://git.chemavx.xyz/chemavx/researchowl
cd researchowl

# 2. Create virtualenv
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 3. Configure
cp .env.example .env
# Edit .env with your values

# 4. Run
python main.py

Deploy to k3s

# 1. Create namespace and secrets
kubectl create namespace researchowl
kubectl create secret generic researchowl-secrets \
  --from-literal=telegram-bot-token=YOUR_TOKEN \
  --from-literal=telegram-allowed-users=YOUR_USER_ID \
  -n researchowl

# 2. Copy manifests to your k8s-manifests repo
cp k8s/*.yaml /path/to/k8s-manifests/researchowl/

# 3. Apply ArgoCD app
kubectl apply -f k8s/argocd-app.yaml

# 4. Push to Gitea → Gitea Actions builds → ArgoCD deploys
git add . && git commit -m "feat: add researchowl" && git push

Tuning

Variable           Default  Description
MAX_SOURCES        150      Hard cap on sources
MAX_DEPTH          3        Link recursion depth
QUALITY_THRESHOLD  0.4      Minimum chunk quality (0-1)
REQUEST_DELAY      1.0s     Delay between requests
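
One way these variables might be read from the environment, with the defaults from the table, is sketched below. The `Tuning` dataclass and `load_tuning` function are hypothetical names for illustration, and the sketch assumes REQUEST_DELAY is supplied as a plain number of seconds (for example `1.0`), which the table does not confirm.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Tuning:
    max_sources: int = 150          # hard cap on sources
    max_depth: int = 3              # link recursion depth
    quality_threshold: float = 0.4  # minimum chunk quality, 0-1
    request_delay: float = 1.0      # seconds between requests (assumed unit)


def load_tuning(env: Mapping[str, str] = os.environ) -> Tuning:
    """Build a Tuning from environment-style variables, falling back to the defaults."""
    d = Tuning()
    tuning = Tuning(
        max_sources=int(env.get("MAX_SOURCES", d.max_sources)),
        max_depth=int(env.get("MAX_DEPTH", d.max_depth)),
        quality_threshold=float(env.get("QUALITY_THRESHOLD", d.quality_threshold)),
        request_delay=float(env.get("REQUEST_DELAY", d.request_delay)),
    )
    if not 0.0 <= tuning.quality_threshold <= 1.0:
        raise ValueError("QUALITY_THRESHOLD must be between 0 and 1")
    return tuning


# Passing a dict instead of os.environ keeps the example deterministic.
print(load_tuning({"MAX_SOURCES": "300", "QUALITY_THRESHOLD": "0.3"}))
```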

Want more thoroughness?

  • Increase MAX_SOURCES to 300+
  • Increase MAX_DEPTH to 4-5
  • Lower QUALITY_THRESHOLD to 0.3

Want faster results?

  • Lower MAX_SOURCES to 50
  • Set MAX_DEPTH to 1-2
  • Raise QUALITY_THRESHOLD to 0.6
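
To see what moving QUALITY_THRESHOLD does in practice, the sketch below filters a handful of scored chunks at different thresholds. It is an assumption-laden illustration rather than the logic in processor.py: `filter_chunks` is a hypothetical name, the comparison is assumed to be inclusive (`>=`), and a score of `None` stands for a chunk whose scoring call failed and therefore receives the 0.6 fallback score mentioned in the commit message.

```python
from typing import List, Optional, Tuple

FALLBACK_SCORE = 0.6  # per the commit message: applied when scoring fails


def filter_chunks(scored: List[Tuple[str, Optional[float]]], threshold: float = 0.4) -> List[str]:
    """Keep chunks whose score (or the fallback, if scoring failed) meets the threshold."""
    kept = []
    for text, score in scored:
        effective = FALLBACK_SCORE if score is None else score
        if effective >= threshold:  # inclusive comparison is an assumption
            kept.append(text)
    return kept


# Hypothetical scores; "c" simulates a failed scoring call.
scored = [("a", 0.9), ("b", 0.35), ("c", None), ("d", 0.5)]
for threshold in (0.3, 0.4, 0.6, 0.7):
    print(threshold, filter_chunks(scored, threshold))
```

Under these assumptions, any threshold above 0.6 would also discard chunks that fell back to the default score, which is worth keeping in mind when tuning for speed.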

Notes

  • Uses qwen2.5:3b on your existing Ollama instance for all AI tasks, so there is no API cost
  • Optionally add ANTHROPIC_API_KEY for Claude fallback on generation
  • SQLite database stored in /data/researchowl.db
  • All outputs saved to DB and available via /outputs