
Semantic Search

Simili Bot uses semantic search to find related issues across your repositories based on meaning, not just keywords. This approach allows the bot to identify relationships that traditional keyword-based search would miss.

How It Works

Theory vs Reality

In a traditional search system, an issue titled “Login button doesn’t work” might be missed if you search for “authentication failures”. Semantic search bridges this gap.
  • Traditional Search: Matches only issues that contain the same words as the query.
  • Semantic Search: Compares meaning rather than wording, so “Can’t authenticate” is recognized as related to “Sign-in issues” and “Login broken”.
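
To make the gap concrete, here is a minimal sketch of a naive keyword matcher. It is an illustrative stand-in, not Simili Bot's code, and the helper names `tokenize` and `keyword_overlap` are hypothetical. It shows that the example query shares no words with the issue title, so an exact-word search has nothing to match on.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_overlap(query: str, title: str) -> set[str]:
    """Return the words that appear in both the query and the title."""
    return tokenize(query) & tokenize(title)


title = "Login button doesn't work"
query = "authentication failures"

# No shared tokens, so a purely keyword-based search would miss this issue.
shared = keyword_overlap(query, title)
```

A query such as "login broken" would still match here through the shared word "login"; the failure case is specifically when the two phrasings have no vocabulary in common.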

The Process

Simili Bot follows a three-step process to enable semantic discovery:
1. Embedding

Text from the issue title, body, and comments is converted into a 768-dimensional vector using Google’s text-embedding-004 model.

2. Indexing

These vectors are stored in the Qdrant vector database along with the issue’s metadata (repository, labels, author).

3. Querying

When a new issue arrives, its embedding is compared against the entire database using cosine similarity to find the most relevant historical issues in milliseconds.
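
The querying step can be sketched in plain Python. This is a brute-force illustration under stated assumptions, not the bot's implementation: the three-dimensional vectors and issue identifiers are hypothetical stand-ins for real 768-dimensional embeddings, and a vector database would normally use an index rather than scanning every stored vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: list[float], index: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Score every stored vector against the query, highest similarity first."""
    scored = [(issue_id, cosine_similarity(query, vec)) for issue_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy index: issue identifier -> embedding.
index = {
    "backend#12": [0.9, 0.1, 0.0],
    "frontend#7": [0.0, 1.0, 0.0],
    "docs#3": [0.1, 0.0, 0.9],
}
new_issue = [1.0, 0.2, 0.0]

ranked = rank_by_similarity(new_issue, index)
```

Because cosine similarity depends only on the angle between vectors, two issues score as similar when their embeddings point in nearly the same direction, regardless of vector length.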

Configuration

Tuning the search sensitivity is crucial for balancing noise and discovery.

Tuning Thresholds

The similarity_threshold determines how strict the bot is when suggesting related issues.
Level         Value  Effect
Conservative  0.80   Only returns issues that are nearly identical in meaning.
Recommended   0.70   Provides a good balance of accuracy and broad discovery.
Permissive    0.60   Returns loosely related issues; higher chance of false positives.
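
The effect of each level is easiest to see on a fixed set of scores. The sketch below uses hypothetical scores and issue names, and assumes a candidate is kept when its score is greater than or equal to the threshold; the `LEVELS` mapping and `passing` helper are illustrative, not part of the bot.

```python
# The three levels from the table above.
LEVELS = {"conservative": 0.80, "recommended": 0.70, "permissive": 0.60}


def passing(scores: dict[str, float], threshold: float) -> list[str]:
    """Return the issues whose score meets the threshold (assumed inclusive)."""
    return [issue for issue, score in scores.items() if score >= threshold]


# Hypothetical similarity scores for four candidate issues.
scores = {"issue-a": 0.91, "issue-b": 0.74, "issue-c": 0.63, "issue-d": 0.41}

by_level = {name: passing(scores, value) for name, value in LEVELS.items()}
```

Lowering the threshold only ever adds candidates, so each level returns a superset of the stricter one above it.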

Configuration Example

Add these settings to your simili.yaml or workflow environment:
defaults:
  similarity_threshold: 0.70      # Minimum confidence score (0.0 - 1.0)
  max_similar_to_show: 5          # Number of results to display in the comment
  cross_repo_search: true         # Enable discovery across all configured repositories
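
As a rough illustration of how these three settings might interact, the sketch below applies them to a list of scored candidates. Everything except the setting names is an assumption: the `Defaults` and `Match` classes and the `select_matches` function are hypothetical, and treating `cross_repo_search: false` as "restrict results to the current repository" is a guess at the intended behavior rather than something the configuration above states.

```python
from dataclasses import dataclass


@dataclass
class Defaults:
    """Mirrors the keys under `defaults:` in the example above."""
    similarity_threshold: float = 0.70
    max_similar_to_show: int = 5
    cross_repo_search: bool = True


@dataclass
class Match:
    """Hypothetical scored candidate returned by the search step."""
    repo: str
    number: int
    score: float


def select_matches(matches: list[Match], current_repo: str, cfg: Defaults) -> list[Match]:
    """Filter by threshold and repository scope, then keep the top results."""
    kept = [
        m
        for m in matches
        if m.score >= cfg.similarity_threshold
        # Assumption: disabling cross-repo search limits results to the current repo.
        and (cfg.cross_repo_search or m.repo == current_repo)
    ]
    kept.sort(key=lambda m: m.score, reverse=True)
    return kept[: cfg.max_similar_to_show]


# Hypothetical candidates from two repositories.
candidates = [
    Match("org/api", 1, 0.92),
    Match("org/web", 2, 0.81),
    Match("org/api", 3, 0.65),
    Match("org/web", 4, 0.75),
]

shown = select_matches(candidates, "org/api", Defaults())
```

With the defaults, the candidate scored 0.65 is dropped and the rest are shown in descending score order; setting `cross_repo_search` to false in this sketch would leave only the `org/api` match above the threshold.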

Use Cases

Finding Duplicate Issues

Instantly identify if a new bug report has already been discussed in another repo, even if the wording is completely different. This prevents engineers from working on the same problem in isolation.

Pattern Recognition

Detect when multiple teams are hitting related infrastructure failures by reviewing cross-repo search results in real time. For example, a “Database timeout” in the backend and a “Slow login” in the frontend can be linked semantically.
