
PR Duplicate Detection

Simili Bot v0.2.0 can detect duplicate pull requests by indexing PR content in a dedicated Qdrant collection and performing semantic search at PR open time.

How it works

  1. PR content is embedded: title + body + changed file paths
  2. Qdrant is searched for similar items in both the issues collection and the PR collection
  3. Candidates are ranked by similarity score
  4. Optionally, an LLM gives a duplicate verdict on the top candidates
  5. Results are returned as JSON (or posted as a comment, when run via GitHub Actions)
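
Steps 3 and 4 come down to sorting scored hits and keeping those above a cutoff. The following is a minimal Python sketch of that idea, not Simili Bot's actual implementation: the function name `rank_candidates`, the inclusive threshold comparison, and the default values are all assumptions made for illustration.

```python
from typing import Dict, List


def rank_candidates(hits: List[Dict], threshold: float = 0.80, top_k: int = 10) -> List[Dict]:
    """Keep hits scoring at or above threshold, best first, at most top_k.

    Hypothetical helper: the real bot may filter and rank differently.
    """
    kept = [h for h in hits if h["score"] >= threshold]
    kept.sort(key=lambda h: h["score"], reverse=True)
    return kept[:top_k]


# Hits as they might come back from the issue and PR collections combined.
hits = [
    {"type": "issue", "number": 123, "score": 0.88},
    {"type": "pull_request", "number": 38, "score": 0.93},
    {"type": "issue", "number": 7, "score": 0.41},
]

ranked = rank_candidates(hits, threshold=0.80, top_k=10)
print([h["number"] for h in ranked])  # [38, 123]
```

Only the ranked survivors would then be passed to the optional LLM verdict step, which keeps the number of LLM calls bounded by `top_k`.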

Setup

1. Configure a PR collection

In .github/simili.yaml:
qdrant:
  url: "${QDRANT_URL}"
  api_key: "${QDRANT_API_KEY}"
  collection: "my-issues"
  pr_collection: "my-prs"   # Dedicated PR collection
If pr_collection is omitted, PRs are stored alongside issues in the main collection.
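
The fallback can be pictured as a one-line lookup. This Python sketch assumes the `qdrant` section has already been parsed into a dictionary; `resolve_pr_collection` is a hypothetical name, not a Simili Bot function.

```python
def resolve_pr_collection(qdrant_cfg: dict) -> str:
    """Return the collection PRs are stored in (illustrative only).

    Uses pr_collection when set, otherwise falls back to the main collection.
    """
    return qdrant_cfg.get("pr_collection") or qdrant_cfg["collection"]


with_pr = {"collection": "my-issues", "pr_collection": "my-prs"}
without_pr = {"collection": "my-issues"}

print(resolve_pr_collection(with_pr))     # my-prs
print(resolve_pr_collection(without_pr))  # my-issues
```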

2. Index existing PRs

Before the bot can find duplicates, your PRs must be indexed:
simili index --repo owner/repo --include-prs=true
Alternatively, run the bulk index workflow via GitHub Actions.

3. Create a PR triage workflow

Create .github/workflows/simili-pr.yml:
name: Simili PR Triage

on:
  pull_request:
    types: [opened, edited, reopened]

jobs:
  pr-triage:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - uses: similigh/simili-bot@v0.2.0
        with:
          command: "pr-duplicate"
          config_path: ".github/simili.yaml"
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          QDRANT_URL: ${{ secrets.QDRANT_URL }}
          QDRANT_API_KEY: ${{ secrets.QDRANT_API_KEY }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

Running from the CLI

Check a specific PR for duplicates:
simili pr-duplicate --repo owner/repo --number 42
With a custom similarity threshold and candidate limit:
simili pr-duplicate --repo owner/repo --number 42 --threshold 0.80 --top-k 10

Example output

{
  "pr": {
    "repo": "owner/repo",
    "number": 42,
    "title": "Fix authentication timeout in middleware"
  },
  "candidates": [
    {
      "type": "pull_request",
      "number": 38,
      "title": "Fix session expiry in auth middleware",
      "score": 0.93,
      "url": "https://github.com/owner/repo/pull/38"
    },
    {
      "type": "issue",
      "number": 123,
      "title": "Login session expires unexpectedly",
      "score": 0.88,
      "url": "https://github.com/owner/repo/issues/123"
    }
  ],
  "duplicate_detected": true,
  "duplicate_of": 38,
  "confidence": 0.91,
  "reasoning": "PR #38 addresses the exact same authentication timeout issue with an overlapping fix."
}
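
Because the result is plain JSON, it is easy to consume from a script. The Python sketch below parses a trimmed copy of the example output and prints a one-line summary. The `summarize` helper is hypothetical, and the use of `.get` reflects an assumption that the verdict fields may be missing when the LLM step is not enabled.

```python
import json

# Trimmed copy of the example output above.
raw = """
{
  "pr": {"repo": "owner/repo", "number": 42},
  "candidates": [
    {"type": "pull_request", "number": 38, "score": 0.93},
    {"type": "issue", "number": 123, "score": 0.88}
  ],
  "duplicate_detected": true,
  "duplicate_of": 38,
  "confidence": 0.91
}
"""


def summarize(result: dict) -> str:
    """Return a one-line summary of a pr-duplicate result (illustrative)."""
    pr = result["pr"]
    if not result.get("duplicate_detected"):
        return f"{pr['repo']}#{pr['number']}: no duplicate detected"
    return (
        f"{pr['repo']}#{pr['number']}: likely duplicate of "
        f"#{result['duplicate_of']} (confidence {result['confidence']:.2f})"
    )


summary = summarize(json.loads(raw))
print(summary)  # owner/repo#42: likely duplicate of #38 (confidence 0.91)
```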

What gets embedded

The following content is combined for the PR embedding:
Title: Fix authentication timeout in middleware

Body: This PR fixes the issue where sessions expire after 30 seconds...

Changed Files:
- internal/middleware/auth.go
- internal/middleware/session.go
- tests/middleware_test.go
Including changed file paths improves matching accuracy for code-level duplicate detection.
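
As a rough illustration, the combined text could be assembled like this. The sketch mirrors the layout shown above, but `build_pr_text` is a hypothetical helper and the exact separators the bot uses are an assumption.

```python
from typing import List


def build_pr_text(title: str, body: str, changed_files: List[str]) -> str:
    """Join title, body, and changed file paths into one string to embed.

    Illustrative only: the real bot's formatting may differ.
    """
    files = "\n".join(f"- {path}" for path in changed_files)
    return f"Title: {title}\n\nBody: {body}\n\nChanged Files:\n{files}"


text = build_pr_text(
    "Fix authentication timeout in middleware",
    "This PR fixes the issue where sessions expire after 30 seconds...",
    [
        "internal/middleware/auth.go",
        "internal/middleware/session.go",
        "tests/middleware_test.go",
    ],
)
print(text)
```

Two PRs that touch the same files share part of this text even when their titles and descriptions are worded differently, which is why the file paths help matching.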

Tips

  • Index PRs regularly — run simili index --include-prs=true on a schedule to keep the PR collection fresh.
  • Set a dedicated pr_collection — this keeps PR and issue search results cleanly separated.
  • Tune the threshold — for stricter duplicate detection, raise --threshold to 0.80 or higher.
  • Use LLM reasoning — the bot’s LLM verdict provides human-readable reasoning about why two PRs are considered duplicates.

Next steps