chemavx/researchowl

ChemaVX ba08536337

feat: initial ResearchOwl

2026-04-27 13:49:07 +00:00

ResearchOwl: Instructions for Claude Code

Project context

You are the build and implementation agent for ResearchOwl, a Telegram bot that researches any topic exhaustively, using recursive scraping and Ollama (qwen2.5:3b) for processing and content generation.

The homelab where it will be deployed has:

k3s with Traefik + cert-manager + Cloudflare DNS
ArgoCD for GitOps (repo: k8s-manifests on Gitea)
Gitea at git.chemavx.xyz + Container Registry
Ollama at http://ollama.chemavx.xyz with the qwen2.5:3b model
An existing Telegram bot at @chemavx_bot
Base domain: chemavx.xyz

Goal

Build the complete project, fix all the bugs, and leave it ready to deploy on k3s.

Tasks to perform, in order

1. Create the project structure

researchowl/
├── src/
│   ├── __init__.py
│   ├── config.py
│   ├── scraper/
│   │   ├── __init__.py
│   │   └── exhaustive.py
│   ├── processor/
│   │   ├── __init__.py
│   │   └── processor.py
│   ├── generator/
│   │   ├── __init__.py
│   │   └── generator.py
│   ├── bot/
│   │   ├── __init__.py
│   │   └── bot.py
│   └── db/
│       ├── __init__.py
│       └── database.py
├── k8s/
│   ├── deployment.yaml
│   └── argocd-app.yaml
├── .gitea/
│   └── workflows/
│       └── build.yml
├── tests/
│   └── test_scraper.py
├── main.py
├── requirements.txt
├── Dockerfile
├── .env.example
└── README.md
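
The tree above can be created by hand; as one possible shortcut, the sketch below creates any missing files with pathlib. The scaffold function and the FILES list are hypothetical helpers that are not part of the project, and the files are created empty.

```python
from pathlib import Path

# Paths copied from the tree above; parent directories are created as needed.
FILES = [
    "src/__init__.py",
    "src/config.py",
    "src/scraper/__init__.py",
    "src/scraper/exhaustive.py",
    "src/processor/__init__.py",
    "src/processor/processor.py",
    "src/generator/__init__.py",
    "src/generator/generator.py",
    "src/bot/__init__.py",
    "src/bot/bot.py",
    "src/db/__init__.py",
    "src/db/database.py",
    "k8s/deployment.yaml",
    "k8s/argocd-app.yaml",
    ".gitea/workflows/build.yml",
    "tests/test_scraper.py",
    "main.py",
    "requirements.txt",
    "Dockerfile",
    ".env.example",
    "README.md",
]


def scaffold(root: Path) -> list[Path]:
    """Create every missing file under root and return the paths that were created."""
    created = []
    for rel in FILES:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():  # never overwrite existing work
            path.touch()
            created.append(path)
    return created
```

Calling scaffold(Path("researchowl")) a second time returns an empty list, so it is safe to re-run on a partially built project.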

2. Fix the critical bug in database.py

The source_contents table is referenced in processor.py but does not exist in the schema.

Add to SCHEMA in database.py:

CREATE TABLE IF NOT EXISTS source_contents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER NOT NULL UNIQUE REFERENCES sources(id),
    content TEXT NOT NULL,
    created_at REAL NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_source_contents ON source_contents(source_id);

Add these methods to the ResearchDB class (database.py needs import time and from typing import Optional at module level):

async def save_source_content(self, source_id: int, content: str):
    """Store the raw scraped text for a source, replacing any earlier copy."""
    await self.db.execute(
        """INSERT OR REPLACE INTO source_contents (source_id, content, created_at)
           VALUES (?, ?, ?)""",
        (source_id, content, time.time())
    )
    await self.db.commit()

async def get_source_content(self, source_id: int) -> Optional[str]:
    """Return the stored text for a source, or None if nothing was saved."""
    cursor = await self.db.execute(
        "SELECT content FROM source_contents WHERE source_id = ?", (source_id,)
    )
    row = await cursor.fetchone()
    return row[0] if row else None
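
To check the SQL in isolation, the sketch below runs the same statements against an in-memory database using the synchronous sqlite3 module from the standard library. The sources table here is a minimal stand-in whose url column is an assumption; the real ResearchDB methods are async.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS sources (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL  -- hypothetical stand-in for the real sources table
);
CREATE TABLE IF NOT EXISTS source_contents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER NOT NULL UNIQUE REFERENCES sources(id),
    content TEXT NOT NULL,
    created_at REAL NOT NULL
);
"""


def save_source_content(db: sqlite3.Connection, source_id: int, content: str) -> None:
    # The UNIQUE constraint on source_id is what makes OR REPLACE overwrite the old row.
    db.execute(
        "INSERT OR REPLACE INTO source_contents (source_id, content, created_at) "
        "VALUES (?, ?, ?)",
        (source_id, content, time.time()),
    )
    db.commit()


def get_source_content(db: sqlite3.Connection, source_id: int):
    row = db.execute(
        "SELECT content FROM source_contents WHERE source_id = ?", (source_id,)
    ).fetchone()
    return row[0] if row else None


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.execute("INSERT INTO sources (url) VALUES (?)", ("https://example.com/article",))
save_source_content(db, 1, "first version")
save_source_content(db, 1, "second version")  # overwrites instead of adding a second row
```

INSERT OR REPLACE deletes the conflicting row and inserts a new one, so the autoincrement id of a source_contents row changes on every re-save. Nothing should hold on to that id; look rows up by source_id.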

3. Fix the bug in exhaustive.py: store the content

In ExhaustiveScraper's _mark_scraped method, the content must be saved to source_contents once it has been validated. Change the method to:

async def _mark_scraped(self, source_id: int, content: Optional[str],
                         title: Optional[str], url: str):
    if not content or len(content) < settings.min_content_length:
        await self.db.update_source(source_id, status="skipped",
                                    error="Content too short or empty")
        return

    word_count = len(content.split())
    
    # Store the raw content so the processor can read it back later
    await self.db.save_source_content(source_id, content)
    
    await self.db.update_source(
        source_id,
        status="scraped",
        title=title or url,
        word_count=word_count,
        scraped_at=time.time(),
        quality_score=min(1.0, word_count / 1000)
    )
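
The quality_score passed to update_source is a pure length heuristic. Pulled out as a standalone function it looks like the sketch below; length_quality_score and its target_words parameter are hypothetical names used only for illustration.

```python
def length_quality_score(content: str, target_words: int = 1000) -> float:
    """Linear score in [0, 1]: word count divided by target_words, capped at 1.0."""
    word_count = len(content.split())
    return min(1.0, word_count / target_words)
```

The skip check in _mark_scraped compares len(content), a character count, against settings.min_content_length, while the score is based on words. The two thresholds therefore use different units.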

4. Fix the bug in processor.py: use the save/get content methods

In _process_source, the query against source_contents calls self.db.db.execute directly; it should now go through the DB method:

async def _process_source(self, session_id: int, topic: str, source: dict) -> int:
    source_id = source["id"]
    
    # Read the raw content through the ResearchDB method instead of self.db.db.execute
    content = await self.db.get_source_content(source_id)
    if not content:
        return 0

    chunks = simple_chunk(content, settings.chunk_size, settings.chunk_overlap)
    stored = 0

    for i, chunk in enumerate(chunks):
        if len(chunk.split()) < 30:
            continue

        quality = await self._score_quality(chunk, topic)
        if quality < settings.quality_threshold:
            continue

        embedding = await self.ollama.embed(chunk[:1000])

        await self.db.add_chunk(
            session_id=session_id,
            source_id=source_id,
            content=chunk,
            chunk_index=i,
            token_count=len(chunk.split()),
            quality_score=quality,
            embedding=embedding
        )
        stored += 1

    return stored
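
In the loop above, _score_quality is awaited for every chunk before the threshold comparison. Unless _score_quality itself returns early, a threshold of 0 lets every chunk through but still pays for the model call. The sketch below shows one way to skip the call entirely; filter_chunks, the score callable, and the placeholder score of 1.0 are assumptions and not the project's code.

```python
import asyncio
from typing import Awaitable, Callable


async def filter_chunks(
    chunks: list[str],
    score: Callable[[str], Awaitable[float]],
    threshold: float,
    min_words: int = 30,
) -> list[tuple[int, str, float]]:
    """Return (index, chunk, quality) for every chunk that passes both filters."""
    kept = []
    for i, chunk in enumerate(chunks):
        if len(chunk.split()) < min_words:
            continue  # same minimum-length filter as _process_source
        if threshold <= 0:
            quality = 1.0  # assumption: placeholder score when scoring is disabled
        else:
            quality = await score(chunk)  # the slow model call happens only here
            if quality < threshold:
                continue
        kept.append((i, chunk, quality))
    return kept
```

The original chunk index is kept in the result so that chunk_index stays stable even when earlier chunks are filtered out, as in the loop above.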

5. Add the /outputs command to the bot

In bot.py, add this handler:

async def cmd_outputs(update: Update, ctx: ContextTypes.DEFAULT_TYPE):
    if not is_authorized(update.effective_user.id):
        return

    chat_id = update.effective_chat.id
    db_conn = await get_db()
    db = ResearchDB(db_conn)

    try:
        cursor = await db_conn.execute(
            "SELECT * FROM research_sessions WHERE telegram_chat_id = ? ORDER BY created_at DESC LIMIT 1",
            (chat_id,)
        )
        row = await cursor.fetchone()
        if not row:
            await update.message.reply_text("No sessions found.")
            return

        outputs = await db.get_outputs(row["id"])
        if not outputs:
            await update.message.reply_text(
                "No outputs generated yet. Use `/generate podcast|blog|report|thread`",
                parse_mode=ParseMode.MARKDOWN
            )
            return

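        # Imported once, outside the loop; this can also move to the top of bot.py
        from datetime import datetime, timezone

        lines = [f"📄 *Outputs for: {row['topic']}*\n"]
        for o in outputs:
            # datetime.utcfromtimestamp() is deprecated since Python 3.12; use an aware UTC datetime
            dt = datetime.fromtimestamp(o['created_at'], tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
            lines.append(f"• `{o['output_type']}` — {dt} ({len(o['content'])} chars)")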

        await update.message.reply_text(
            "\n".join(lines),
            parse_mode=ParseMode.MARKDOWN
        )
    finally:
        await db_conn.close()

And register it in create_bot():

app.add_handler(CommandHandler("outputs", cmd_outputs))
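
The handler interpolates row['topic'] into a message sent with ParseMode.MARKDOWN, so a topic containing _, *, ` or [ could make Telegram misrender or reject the reply. The sketch below escapes those four characters with the standard library only. escape_legacy_markdown and format_output_line are hypothetical helper names, and the line format differs slightly from the handler's.

```python
import re
from datetime import datetime, timezone

# Characters with special meaning in Telegram's legacy Markdown mode.
_MD_SPECIALS = re.compile(r"([_*`\[])")


def escape_legacy_markdown(text: str) -> str:
    """Prefix each special character with a backslash."""
    return _MD_SPECIALS.sub(r"\\\1", text)


def format_output_line(output_type: str, created_at: float, content: str) -> str:
    """Build one bullet line from an output's type, UTC timestamp, and size."""
    dt = datetime.fromtimestamp(created_at, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
    return f"• `{output_type}` ({dt} UTC, {len(content)} chars)"
```

In the handler this would be used as escape_legacy_markdown(row['topic']) when building the header line. Recent python-telegram-bot versions also provide telegram.helpers.escape_markdown, which may be preferable if the installed version includes it.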

6. Install the dependencies and verify that everything imports correctly

pip install -r requirements.txt
python -c "from src.bot.bot import create_bot; print('OK')"
python -c "from src.scraper.exhaustive import ExhaustiveScraper; print('OK')"
python -c "from src.processor.processor import ContentProcessor; print('OK')"
python -c "from src.generator.generator import OutputGenerator; print('OK')"

If there are import errors, fix them.
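
The four one-liners above stop at the first failure when chained. As an alternative, the sketch below tries every module and collects the errors so they can be fixed in one pass; check_imports and TARGETS are hypothetical names.

```python
import importlib
import traceback


def check_imports(targets: dict[str, str]) -> dict[str, str]:
    """Import each module and look up its attribute; return {module: error text}."""
    failures = {}
    for module_name, attr in targets.items():
        try:
            module = importlib.import_module(module_name)
            getattr(module, attr)  # raises AttributeError if the name is missing
        except Exception:
            failures[module_name] = traceback.format_exc(limit=1)
    return failures


# Module and attribute names taken from the commands above.
TARGETS = {
    "src.bot.bot": "create_bot",
    "src.scraper.exhaustive": "ExhaustiveScraper",
    "src.processor.processor": "ContentProcessor",
    "src.generator.generator": "OutputGenerator",
}
```

Run from the project root, check_imports(TARGETS) returns an empty dict when all four imports succeed.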

7. Write a basic test

In tests/test_scraper.py:

import pytest
from src.scraper.exhaustive import (
    detect_source_type, is_blacklisted, normalize_url
)

def test_detect_source_type():
    assert detect_source_type("https://youtube.com/watch?v=abc123") == "youtube"
    assert detect_source_type("https://reddit.com/r/test/comments/abc") == "reddit"
    assert detect_source_type("https://en.wikipedia.org/wiki/Roswell") == "wikipedia"
    assert detect_source_type("https://example.com/doc.pdf") == "pdf"
    assert detect_source_type("https://example.com/article") == "web"

def test_is_blacklisted():
    assert is_blacklisted("https://facebook.com/something")
    assert not is_blacklisted("https://en.wikipedia.org/wiki/Test")

def test_normalize_url():
    assert normalize_url("https://example.com/page#section") == "https://example.com/page"
    assert normalize_url("https://example.com/page/") == "https://example.com/page"
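
For reference, the sketch below shows helper implementations that would satisfy the three tests above. It is a sketch only: the real functions live in src/scraper/exhaustive.py and may behave differently, and the BLACKLIST contents and the _host and _matches helpers are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

BLACKLIST = {"facebook.com"}  # assumption: the real list is defined in exhaustive.py


def _host(url: str) -> str:
    """Lower-cased hostname without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def _matches(host: str, domain: str) -> bool:
    """True for the domain itself or any of its subdomains."""
    return host == domain or host.endswith("." + domain)


def normalize_url(url: str) -> str:
    """Drop the fragment and any trailing slash from the path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), parts.query, ""))


def is_blacklisted(url: str) -> bool:
    host = _host(url)
    return any(_matches(host, domain) for domain in BLACKLIST)


def detect_source_type(url: str) -> str:
    host = _host(url)
    if _matches(host, "youtube.com") or host == "youtu.be":
        return "youtube"
    if _matches(host, "reddit.com"):
        return "reddit"
    if _matches(host, "wikipedia.org"):
        return "wikipedia"
    if urlsplit(url).path.lower().endswith(".pdf"):
        return "pdf"
    return "web"
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as a path that merely contains "reddit.com".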

Note: simple_chunk lives in processor.py, so import it from there:

from src.processor.processor import simple_chunk

def test_simple_chunk():
    text = "\n\n".join([f"Paragraph {i} with some content here." for i in range(50)])
    chunks = simple_chunk(text, chunk_size=100, overlap=20)
    assert len(chunks) > 1
    assert all(isinstance(c, str) for c in chunks)
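
The test only pins down the signature and the return type. One chunker that would pass it is sketched below, assuming chunk_size and overlap are measured in words and that paragraphs are separated by blank lines. The real simple_chunk in processor.py may use different units or splitting rules.

```python
def simple_chunk(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Pack paragraphs into chunks of about chunk_size words.

    Each new chunk starts with the last `overlap` words of the previous one.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built

    for para in paragraphs:
        words = para.split()
        # Flush before a paragraph that would push the chunk over the limit.
        if current and len(current) + len(words) > chunk_size:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Paragraphs are never split in this sketch, so a single paragraph longer than chunk_size produces an oversized chunk.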

Run: pytest tests/ -v

8. Build the Docker image and verify

docker build -t researchowl:test .
docker run --rm researchowl:test python -c "from src.bot.bot import create_bot; print('Docker OK')"

9. Prepare for deployment

Verify that these files are correct and complete:

k8s/deployment.yaml: Deployment + PVC + Secret template
k8s/argocd-app.yaml: ArgoCD Application pointing at k8s-manifests
.gitea/workflows/build.yml: CI with build → push → Telegram notification

10. Final instructions for the user

When finished, display:

✅ ResearchOwl is ready to deploy.

Deployment steps:
1. Create the secret in k3s:
   kubectl create namespace researchowl
   kubectl create secret generic researchowl-secrets \
     --from-literal=telegram-bot-token=YOUR_TOKEN \
     --from-literal=telegram-allowed-users=YOUR_USER_ID \
     -n researchowl

2. Push the code to Gitea:
   git init && git remote add origin https://git.chemavx.xyz/chemavx/researchowl
   git add . && git commit -m "feat: initial ResearchOwl"
   git branch -M main
   git push -u origin main

3. Gitea Actions will build the image automatically.

4. Copy the k8s/ manifests to your k8s-manifests/researchowl/ repo
   and apply the ArgoCD app:
   kubectl apply -f k8s/argocd-app.yaml

5. ArgoCD will deploy automatically.

Usage from Telegram:
  /research Roswell incident
  /status
  /finish
  /generate podcast

Important notes

Do not create a new Telegram bot: the user already has @chemavx_bot. Only the token needs to be configured in the k3s secret.
Do not modify the k8s manifests to add an Ingress: the bot uses Telegram polling and does not need to expose any port.
Ollama is already running in the cluster. The URL http://ollama.chemavx.xyz is correct.
If qwen2.5:3b is too slow for quality scoring, scoring can be disabled with QUALITY_THRESHOLD=0, in which case every chunk passes straight through.
The project uses SQLite (consistent with the rest of the homelab).
Respect REQUEST_DELAY=1.0 to avoid getting banned by the sources.
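
The notes above refer to QUALITY_THRESHOLD and REQUEST_DELAY as environment settings. The sketch below shows how config.py might read them, with several assumptions: the Settings dataclass, load_settings, the OLLAMA_URL and OLLAMA_MODEL variable names, and the default threshold of 0.5 do not come from the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    ollama_url: str = "http://ollama.chemavx.xyz"
    ollama_model: str = "qwen2.5:3b"
    quality_threshold: float = 0.5  # assumption: the real default is not stated
    request_delay: float = 1.0      # seconds between requests to the same source


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to the defaults."""
    env = os.environ if env is None else env
    defaults = Settings()
    return Settings(
        ollama_url=env.get("OLLAMA_URL", defaults.ollama_url),
        ollama_model=env.get("OLLAMA_MODEL", defaults.ollama_model),
        quality_threshold=float(env.get("QUALITY_THRESHOLD", defaults.quality_threshold)),
        request_delay=float(env.get("REQUEST_DELAY", defaults.request_delay)),
    )
```

Passing a plain dict instead of os.environ, for example load_settings({"QUALITY_THRESHOLD": "0"}), makes the parsing easy to exercise in tests without touching the process environment.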
