# ResearchOwl: Instructions for Claude Code

## Project context

You are the build and implementation agent for **ResearchOwl**, a Telegram bot that performs exhaustive research on any topic using recursive scraping and Ollama (qwen2.5:3b) for content processing and generation.

The homelab where it will be deployed has:
- **k3s** with Traefik + cert-manager + Cloudflare DNS
- **ArgoCD** for GitOps (repo: `k8s-manifests` on Gitea)
- **Gitea** at `git.chemavx.xyz` + Container Registry
- **Ollama** at `http://ollama.chemavx.xyz` with the `qwen2.5:3b` model
- **Telegram bot** that already exists as `@chemavx_bot`
- Base domain: `chemavx.xyz`

---

## Objective

Build the complete project, fix all bugs, and leave it ready to deploy on k3s.

---

## Tasks to perform, in order

### 1. Create the project structure

```
researchowl/
├── src/
│   ├── __init__.py
│   ├── config.py
│   ├── scraper/
│   │   ├── __init__.py
│   │   └── exhaustive.py
│   ├── processor/
│   │   ├── __init__.py
│   │   └── processor.py
│   ├── generator/
│   │   ├── __init__.py
│   │   └── generator.py
│   ├── bot/
│   │   ├── __init__.py
│   │   └── bot.py
│   └── db/
│       ├── __init__.py
│       └── database.py
├── k8s/
│   ├── deployment.yaml
│   └── argocd-app.yaml
├── .gitea/
│   └── workflows/
│       └── build.yml
├── tests/
│   └── test_scraper.py
├── main.py
├── requirements.txt
├── Dockerfile
├── .env.example
└── README.md
```

### 2. Fix the critical bug in database.py

The `source_contents` table is referenced in `processor.py` but does not exist in the schema.

**Add to the SCHEMA in `database.py`:**

```sql
CREATE TABLE IF NOT EXISTS source_contents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER NOT NULL UNIQUE REFERENCES sources(id),
    content TEXT NOT NULL,
    created_at REAL NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_source_contents ON source_contents(source_id);
```

**Add these methods to the `ResearchDB` class:**

```python
# Requires at module level: `import time` and `from typing import Optional`.

async def save_source_content(self, source_id: int, content: str):
    """Store (or overwrite) the raw scraped content of a source."""
    await self.db.execute(
        """INSERT OR REPLACE INTO source_contents (source_id, content, created_at)
           VALUES (?, ?, ?)""",
        (source_id, content, time.time())
    )
    await self.db.commit()

async def get_source_content(self, source_id: int) -> Optional[str]:
    """Return the stored raw content, or None if nothing was saved."""
    cursor = await self.db.execute(
        "SELECT content FROM source_contents WHERE source_id = ?", (source_id,)
    )
    row = await cursor.fetchone()
    return row[0] if row else None
```

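
The overwrite behaviour of `INSERT OR REPLACE` combined with the `UNIQUE` constraint on `source_id` can be checked in isolation. The following is a minimal sketch that uses the synchronous standard-library `sqlite3` module rather than the project's async connection; the reduced `sources` table and the free-function form of the two helpers are assumptions made only for this illustration.

```python
import sqlite3
import time
from typing import Optional

# Reduced schema for illustration only: the real `sources` table
# is assumed to have more columns than shown here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sources (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS source_contents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER NOT NULL UNIQUE REFERENCES sources(id),
    content TEXT NOT NULL,
    created_at REAL NOT NULL
);
"""


def save_source_content(conn: sqlite3.Connection, source_id: int, content: str) -> None:
    # A second save for the same source_id replaces the earlier row.
    conn.execute(
        """INSERT OR REPLACE INTO source_contents (source_id, content, created_at)
           VALUES (?, ?, ?)""",
        (source_id, content, time.time()),
    )
    conn.commit()


def get_source_content(conn: sqlite3.Connection, source_id: int) -> Optional[str]:
    row = conn.execute(
        "SELECT content FROM source_contents WHERE source_id = ?", (source_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO sources (url) VALUES (?)", ("https://example.com/article",))
save_source_content(conn, 1, "first version")
save_source_content(conn, 1, "second version")
```

After the two saves, the table holds a single row for source 1 containing the second version. Because `source_id` is declared `UNIQUE`, SQLite already maintains an index on that column, so the explicit `idx_source_contents` index is redundant but harmless.
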
### 3. Fix the bug in exhaustive.py: save content

In the `_mark_scraped` method of `ExhaustiveScraper`, the content must be saved to `source_contents` after it has been validated. Change the method to:

```python
async def _mark_scraped(self, source_id: int, content: Optional[str],
                        title: Optional[str], url: str):
    # Skip sources whose content is missing or shorter than the configured minimum
    if not content or len(content) < settings.min_content_length:
        await self.db.update_source(source_id, status="skipped",
                                    error="Content too short or empty")
        return

    word_count = len(content.split())

    # Save the raw content so the processor can read it later
    await self.db.save_source_content(source_id, content)

    await self.db.update_source(
        source_id,
        status="scraped",
        title=title or url,
        word_count=word_count,
        scraped_at=time.time(),
        # Length heuristic: grows linearly and saturates at 1.0 from 1000 words
        quality_score=min(1.0, word_count / 1000)
    )
```

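
The `quality_score` stored here is a pure length heuristic. As a small worked sketch, the same formula is pulled out into a standalone function; the name `initial_quality_score` and the `saturation_words` parameter are hypothetical and are not part of the project.

```python
def initial_quality_score(word_count: int, saturation_words: int = 1000) -> float:
    """Linear score in [0, 1] that saturates once `saturation_words` is reached."""
    return min(1.0, word_count / saturation_words)
```

With the default of 1000 words, a 250-word page scores 0.25, and any page of 1000 words or more scores 1.0.
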
### 4. Fix the bug in processor.py: use save/get content

In `_process_source`, the query against `source_contents` calls `self.db.db.execute` directly; it should now use the DB method instead:

```python
async def _process_source(self, session_id: int, topic: str, source: dict) -> int:
    source_id = source["id"]

    # Use the DB accessor instead of querying source_contents directly
    content = await self.db.get_source_content(source_id)
    if not content:
        return 0

    chunks = simple_chunk(content, settings.chunk_size, settings.chunk_overlap)
    stored = 0

    for i, chunk in enumerate(chunks):
        # Ignore chunks with fewer than 30 words
        if len(chunk.split()) < 30:
            continue

        # Drop chunks that score below the configured quality threshold
        quality = await self._score_quality(chunk, topic)
        if quality < settings.quality_threshold:
            continue

        # Embed only the first 1000 characters of the chunk
        embedding = await self.ollama.embed(chunk[:1000])

        await self.db.add_chunk(
            session_id=session_id,
            source_id=source_id,
            content=chunk,
            chunk_index=i,
            token_count=len(chunk.split()),
            quality_score=quality,
            embedding=embedding
        )
        stored += 1

    # Number of chunks actually stored for this source
    return stored
```

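
The method relies on `simple_chunk(content, chunk_size, overlap)`, whose body is not shown in this document. For orientation only, here is one possible sliding-window chunker with the same call shape. It is a sketch under stated assumptions: the name `simple_chunk_sketch` is hypothetical, and measuring `chunk_size` and `overlap` in words is a guess, since the real function may count characters or respect paragraph boundaries.

```python
from typing import List


def simple_chunk_sketch(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split `text` into windows of `chunk_size` words sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks
```

For a 250-word text with `chunk_size=100` and `overlap=20`, the window advances 80 words at a time and produces three chunks, each sharing its first 20 words with the end of the previous one.
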
### 5. Add the /outputs command to the bot

In `bot.py`, add this handler:

```python
from datetime import datetime, timezone  # place with the other module-level imports


async def cmd_outputs(update: Update, ctx: ContextTypes.DEFAULT_TYPE):
    """List the outputs generated for this chat's most recent research session."""
    if not is_authorized(update.effective_user.id):
        return

    chat_id = update.effective_chat.id
    db_conn = await get_db()
    db = ResearchDB(db_conn)

    try:
        # Most recent session for this chat
        cursor = await db_conn.execute(
            "SELECT * FROM research_sessions WHERE telegram_chat_id = ? ORDER BY created_at DESC LIMIT 1",
            (chat_id,)
        )
        row = await cursor.fetchone()
        if not row:
            await update.message.reply_text("No sessions found.")
            return

        outputs = await db.get_outputs(row["id"])
        if not outputs:
            await update.message.reply_text(
                "No outputs generated yet. Use `/generate podcast|blog|report|thread`",
                parse_mode=ParseMode.MARKDOWN
            )
            return

        lines = [f"📄 *Outputs for: {row['topic']}*\n"]
        for o in outputs:
            # Timezone-aware UTC conversion; utcfromtimestamp() is deprecated since Python 3.12
            dt = datetime.fromtimestamp(o['created_at'], tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
            lines.append(f"• `{o['output_type']}` — {dt} ({len(o['content'])} chars)")

        await update.message.reply_text(
            "\n".join(lines),
            parse_mode=ParseMode.MARKDOWN
        )
    finally:
        # Always release the connection, including on the early returns above
        await db_conn.close()
```

And register it in `create_bot()`:
```python
app.add_handler(CommandHandler("outputs", cmd_outputs))
```

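
The handler renders each `created_at` epoch value as a UTC timestamp. Below is a standalone sketch of that conversion using a timezone-aware call, since `datetime.utcfromtimestamp` is deprecated as of Python 3.12. The helper name `format_utc` is hypothetical.

```python
from datetime import datetime, timezone


def format_utc(epoch_seconds: float) -> str:
    """Format a Unix timestamp as 'YYYY-MM-DD HH:MM' in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
```

For example, `format_utc(0)` gives `1970-01-01 00:00` regardless of the server's local timezone.
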
### 6. Install dependencies and verify that the imports work

```bash
pip install -r requirements.txt
python -c "from src.bot.bot import create_bot; print('OK')"
python -c "from src.scraper.exhaustive import ExhaustiveScraper; print('OK')"
python -c "from src.processor.processor import ContentProcessor; print('OK')"
python -c "from src.generator.generator import OutputGenerator; print('OK')"
```

If there are import errors, fix them.

### 7. Write a basic test

In `tests/test_scraper.py`:

```python
import pytest

# simple_chunk is deliberately not imported here: it lives in processor.py
from src.scraper.exhaustive import (
    detect_source_type, is_blacklisted, normalize_url
)


def test_detect_source_type():
    assert detect_source_type("https://youtube.com/watch?v=abc123") == "youtube"
    assert detect_source_type("https://reddit.com/r/test/comments/abc") == "reddit"
    assert detect_source_type("https://en.wikipedia.org/wiki/Roswell") == "wikipedia"
    assert detect_source_type("https://example.com/doc.pdf") == "pdf"
    assert detect_source_type("https://example.com/article") == "web"


def test_is_blacklisted():
    assert is_blacklisted("https://facebook.com/something")
    assert not is_blacklisted("https://en.wikipedia.org/wiki/Test")


def test_normalize_url():
    # Fragments and trailing slashes are stripped
    assert normalize_url("https://example.com/page#section") == "https://example.com/page"
    assert normalize_url("https://example.com/page/") == "https://example.com/page"
```

Note: import `simple_chunk` from `processor.py`, not from the scraper module:

```python
from src.processor.processor import simple_chunk


def test_simple_chunk():
    # 50 short paragraphs must be split into more than one chunk
    text = "\n\n".join([f"Paragraph {i} with some content here." for i in range(50)])
    chunks = simple_chunk(text, chunk_size=100, overlap=20)
    assert len(chunks) > 1
    assert all(isinstance(c, str) for c in chunks)
```

Run: `pytest tests/ -v`

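
The tests above pin down the expected behaviour of the three URL helpers without showing them. The following is a minimal sketch of implementations that would satisfy those assertions, using only `urllib.parse`. The function bodies, the `BLACKLISTED_DOMAINS` set, and the `_host` helper are assumptions for illustration; the real code in `exhaustive.py` may differ.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical blacklist: the real list in the project is not shown here.
BLACKLISTED_DOMAINS = {"facebook.com"}


def _host(url: str) -> str:
    """Lower-cased hostname without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def _matches(host: str, domain: str) -> bool:
    """True for the domain itself and for any of its subdomains."""
    return host == domain or host.endswith("." + domain)


def normalize_url(url: str) -> str:
    """Drop the fragment and any trailing slash from the path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), parts.query, ""))


def is_blacklisted(url: str) -> bool:
    host = _host(url)
    return any(_matches(host, domain) for domain in BLACKLISTED_DOMAINS)


def detect_source_type(url: str) -> str:
    host = _host(url)
    if _matches(host, "youtube.com") or host == "youtu.be":
        return "youtube"
    if _matches(host, "reddit.com"):
        return "reddit"
    if _matches(host, "wikipedia.org"):
        return "wikipedia"
    if urlsplit(url).path.lower().endswith(".pdf"):
        return "pdf"
    return "web"
```

Matching on the parsed hostname rather than on a substring of the whole URL avoids misclassifying a page such as `https://example.com/about-youtube.com`.
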
### 8. Build the Docker image and verify

```bash
docker build -t researchowl:test .
docker run --rm researchowl:test python -c "from src.bot.bot import create_bot; print('Docker OK')"
```

### 9. Prepare for deployment

Verify that these files are correct and complete:
- `k8s/deployment.yaml`: Deployment + PVC + Secret template
- `k8s/argocd-app.yaml`: ArgoCD Application pointing to `k8s-manifests`
- `.gitea/workflows/build.yml`: CI with build → push → Telegram notification

### 10. Final instructions for the user

When finished, display:

```
✅ ResearchOwl ready to deploy.

Deployment steps:
1. Create the secret in k3s:
   kubectl create namespace researchowl
   kubectl create secret generic researchowl-secrets \
     --from-literal=telegram-bot-token=YOUR_TOKEN \
     --from-literal=telegram-allowed-users=YOUR_USER_ID \
     -n researchowl

2. Push the code to Gitea:
   git init && git remote add origin https://git.chemavx.xyz/chemavx/researchowl
   git add . && git commit -m "feat: initial ResearchOwl"
   git push -u origin main

3. Gitea Actions will build the image automatically.

4. Copy the k8s/ manifests into k8s-manifests/researchowl/ in your repo
   and apply the ArgoCD app:
   kubectl apply -f k8s/argocd-app.yaml

5. ArgoCD will deploy automatically.

Usage from Telegram:
/research Roswell Incident
/status
/finish
/generate podcast
```

---

## Important notes

- **Do not create a new Telegram bot**: the user already has `@chemavx_bot`. Only the token needs to be configured in the k3s secret.
- **Do not modify** the k8s manifests to add an Ingress: the bot uses Telegram polling and does not need to expose any port.
- **Ollama** is already running in the cluster. The URL `http://ollama.chemavx.xyz` is correct.
- If `qwen2.5:3b` is too slow for quality scoring, scoring can be disabled with `QUALITY_THRESHOLD=0`, and all chunks pass directly.
- The project uses **SQLite** (consistent with the rest of the homelab).
- Respect `REQUEST_DELAY=1.0` to avoid getting banned by the sources.
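
The two environment knobs in these notes can be sketched as follows. In `_process_source` as written in task 4, `_score_quality` is still awaited for every chunk even when the threshold is 0, unless that method short-circuits internally, so a guard such as `scoring_enabled` is one way to actually skip the slow model call. Everything here is illustrative: the `Settings` dataclass, `load_settings`, `scoring_enabled`, the `Throttle` class, and the `0.5` default threshold are assumptions, not the project's `config.py`.

```python
import asyncio
import os
import time
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    quality_threshold: float
    request_delay: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read both knobs from the environment (the defaults are assumptions)."""
    env = os.environ if env is None else env
    return Settings(
        quality_threshold=float(env.get("QUALITY_THRESHOLD", "0.5")),
        request_delay=float(env.get("REQUEST_DELAY", "1.0")),
    )


def scoring_enabled(settings: Settings) -> bool:
    """With a threshold of 0 no non-negative score is rejected, so scoring can be skipped."""
    return settings.quality_threshold > 0


class Throttle:
    """Keeps at least `delay` seconds between consecutive `wait()` returns.

    Intended for a single scraping task; it is not safe for concurrent callers.
    """

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._last: Optional[float] = None

    async def wait(self) -> None:
        if self._last is not None:
            remaining = self.delay - (time.monotonic() - self._last)
            if remaining > 0:
                await asyncio.sleep(remaining)
        self._last = time.monotonic()
```

A scraper loop would call `await throttle.wait()` before each HTTP request, with `throttle = Throttle(settings.request_delay)` created once per session.
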