How to find similar records in PostgreSQL with Trigram similarity search in your Ruby on Rails App?

Have you ever tried building a search feature that needed to find records even when users make typos? Or maybe you wanted to detect potential duplicates in your database? Regular exact-match queries won't cut it here, but PostgreSQL's trigram similarity search might be exactly what you need. This powerful feature lets you find records that are "kinda like" what you're looking for, and it's surprisingly easy to implement in your Rails app.

What the heck are trigrams anyway?

Before diving into code, let's understand what we're working with. A trigram is simply a group of three consecutive characters taken from a string. For example, the word "hello" would be broken down into the trigrams: " h", " he", "hel", "ell", "llo", "lo ". (Notice the padding spaces at the beginning and end).

PostgreSQL uses these trigrams to calculate how similar two strings are. The more trigrams they share, the higher the similarity score, which ranges from 0.0 (completely different) to 1.0 (identical).

Setting up trigram similarity in your Rails app

PostgreSQL doesn't have the trigram module enabled by default, so our first step is to create a migration to enable it:

class EnablePgTrgmExtension < ActiveRecord::Migration[7.1]
  def up 
    execute "CREATE EXTENSION IF NOT EXISTS pg_trgm;"
  end

  def down
    execute "DROP EXTENSION IF EXISTS pg_trgm;"
  end
end

After running this migration, you're all set to start using similarity searches in your application. No gems, no extra libraries, just pure PostgreSQL goodness!

Basic similarity searches

Let's say we have an User model with first_name and last_name columns, and we want to find users whose names are similar to "Kamil". Here's how you'd do it:

User.where("similarity(first_name, ?) > 0.3 OR similarity(last_name, ?) > 0.3", "Kamil", "Kamil")

This query finds all users where either the first or last name has a similarity score above 0.3 compared to Kamil. But what's the deal with that 0.3 value?

Finding the right similarity threshold

Similarity scores range from 0.0 to 1.0, where:

  • 1.0 means an exact match
  • 0.0 means completely different

The threshold you choose determines how strict or loose your matching will be:

  • Higher thresholds (like 0.7+) require very close matches
  • Medium thresholds (around 0.3-0.6) catch common misspellings and variations
  • Lower thresholds (below 0.3) might return too many false positives

I typically start with 0.3 or 0.4 and adjust based on testing. For names, 0.3 usually catches things like "Kamil" vs "Kamill" or "Kameel" while avoiding completely unrelated names.

Ordering results by similarity

Often you'll want to show the closest matches first. You can easily order by similarity score:

User.order(Arel.sql("similarity(first_name, 'Kamil') DESC, similarity(last_name, 'Kamil') DESC"))

Notice we're using Arel.sql here, which is important for security reasons when passing raw SQL to ActiveRecord's order method.

Displaying and debugging similarity scores

When implementing a similarity feature, it's super helpful to see the actual scores to understand why certain records are being included or excluded. You can select the similarity scores alongside your regular columns:

User.select("id, first_name, last_name, similarity(first_name, 'Kamil') as first_name_similarity, similarity(last_name, 'Kamil') as last_name_similarity")
     .order(Arel.sql("similarity(first_name, 'Kamil') DESC, similarity(last_name, 'Kamil') DESC"))
     .limit(10)
     .each do |user|
  puts "ID: #{user.id}, Name: #{user.first_name} #{user.last_name}"
  puts "First Name Similarity: #{user.first_name_similarity}, Last Name Similarity: #{user.last_name_similarity}"
end

This gives you output like:

ID: 1, Name: Kamil Dz
First Name Similarity: 0.42857143, Last Name Similarity: 0.0
ID: 3, Name: John Smith
First Name Similarity: 0.0, Last Name Similarity: 0.0
...

Creating a reusable search scope

Instead of repeating this logic throughout your app, consider adding a reusable scope to your model:

class User < ApplicationRecord
  scope :similar_to_name, ->(name, threshold = 0.3) {
    where("similarity(first_name, ?) > ? OR similarity(last_name, ?) > ?", 
          name, threshold, name, threshold)
    .order(Arel.sql("GREATEST(similarity(first_name, '#{ActiveRecord::Base.connection.quote_string(name)}'), " +
                    "similarity(last_name, '#{ActiveRecord::Base.connection.quote_string(name)}')) DESC"))
  }
end

Now you can simply call:

User.similar_to_name("Kamil")

Notice how we added extra security by using quote_string to prevent SQL injection. Always be careful when including user input in SQL queries!
Please remember to add indexes/or caching for this such actions:

class AddTrigramIndexToUsers < ActiveRecord::Migration[7.1]
  def up
    execute "CREATE INDEX index_users_on_first_name_trigram ON users USING gin (first_name gin_trgm_ops)"
    execute "CREATE INDEX index_users_on_last_name_trigram ON users USING gin (last_name gin_trgm_ops)"
  end
  
  def down
    execute "DROP INDEX IF EXISTS index_users_on_first_name_trigram"
    execute "DROP INDEX IF EXISTS index_users_on_last_name_trigram"
  end
end

Real-world applications

Trigram similarity is incredibly versatile. Here are some practical ways I've used it:

  1. User-friendly search: Allow users to find records even when they misspell terms.
  2. Duplicate detection: Find potential duplicate customer records during data import.
  3. Autocomplete suggestions: Offer "did you mean" functionality when no exact matches exist.
  4. Fuzzy joins: Match records from different sources that might have slightly different naming conventions.

Summary

PostgreSQL's trigram similarity search provides a powerful, database-native way to perform fuzzy string matching in your Rails applications. By enabling the pg_trgm extension, you can easily find, order, and filter records based on string similarity without needing external services or complex algorithms.

We've covered how to enable the extension, perform basic and advanced searches, order results by similarity, and even optimize performance with GIN indexes. Next time you need to build a search feature that's forgiving of typos or you want to find potential duplicate records, give trigram similarity a try!

If you want something more powerful - check full-text search functionality in Postgres.

Happy fuzzying!