Event sourcing vs GDPR, or how to forget in a world that remembers everything

Designing event-driven systems is great fun, until the legal department enters the room with GDPR on their lips. As experienced engineers, we know that Event Sourcing is built on immutable data: if something happened, it stays in the log forever. But what do you do when a user exercises their right to be forgotten (the right to erasure, Article 17 GDPR) and your database is one big ‘append-only’ log? This is not just a technical problem. It is a real architectural paradox that requires us to be clever instead of mindlessly typing ‘DELETE’ in the console.

The problem with data collected like Pokemon

Many of us tend to log everything we can get our hands on. We collect user data in the hope that it will be useful someday, and create a huge legal risk in the process. In distributed systems, this data is replicated to read models, backups, and external analytics systems. The traditional approach of deleting a record from an SQL database with DELETE FROM users WHERE id = 1 is therefore insufficient in the world of events. Even in relational databases, the data often lingers in transaction logs (such as MySQL binlogs), so physically erasing it is harder than we think.

Crypto-shredding as an elegant solution

One of the most effective techniques for dealing with this problem is ‘crypto-shredding’. Instead of fighting the immutability of the event log, we make the personal data unreadable. We encrypt sensitive data with a unique key assigned to a specific user and keep that key outside the event log. When we receive a request to delete data, we destroy the key. Without it, that user's events in Kafka or any other event store become useless noise that no one, not even you, can read.

Let's look at how we could implement a simple payload encryption mechanism in Ruby. The example uses AES-256-CBC through the openssl library that ships with Ruby, so no extra gems are needed.

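require 'openssl'
require 'digest'
require 'json'
require 'base64'

class EventEncryptor
  # Encrypts any JSON-serialisable object. Returns the Base64-encoded
  # ciphertext together with the random IV needed to decrypt it later.
  def self.encrypt(data, key)
    cipher = OpenSSL::Cipher.new("AES-256-CBC")
    cipher.encrypt
    iv = cipher.random_iv
    # Demo only: hashing a password is a quick way to get 32 bytes.
    # For real crypto-shredding, generate a random 32-byte key per user.
    cipher.key = Digest::SHA256.digest(key)

    encrypted_data = cipher.update(data.to_json) + cipher.final

    {
      payload: Base64.strict_encode64(encrypted_data),
      iv: Base64.strict_encode64(iv)
    }
  end

  # Reverses encrypt. Fails with an exception when the key is missing
  # or does not match the one used for encryption.
  def self.decrypt(encrypted_hash, key)
    decipher = OpenSSL::Cipher.new("AES-256-CBC")
    decipher.decrypt
    decipher.iv = Base64.strict_decode64(encrypted_hash[:iv])
    decipher.key = Digest::SHA256.digest(key)

    decrypted_data = decipher.update(Base64.strict_decode64(encrypted_hash[:payload])) + decipher.final
    JSON.parse(decrypted_data)
  end
end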

data_to_encrypt = { my_key: 'my_value', user_id: 123, active: true }
# Demo secret only; in a real system each user gets their own key.
secret_key = 'my_password'
p data_to_encrypt

encrypted_result = EventEncryptor.encrypt(data_to_encrypt, secret_key)

puts "Payload: #{encrypted_result[:payload]}"
puts "IV:      #{encrypted_result[:iv]}"

# Note: JSON.parse returns string keys, so symbols come back as strings.
decrypted_result = EventEncryptor.decrypt(encrypted_result, secret_key)
p decrypted_result

The decrypt method is the crucial part of the code above. Once we remove a user's key from our secure storage (e.g. Vault), there is nothing left to pass to it: as written, decrypt raises an error instead of returning data, and the calling code can rescue that error and return nil. Either way, the data has been successfully ‘forgotten’. The encrypt method shows how we prepare data for storage in the event log, which is by nature readable by other system modules.
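A read path that treats a missing key as ‘forgotten’ could look roughly like the sketch below. It is a minimal illustration, not a hardened implementation: the USER_KEYS hash, encrypt_payload and read_payload are hypothetical names invented for this example, the hash merely stands in for Vault or a KMS, and the keys here are random 32-byte strings rather than hashed passwords.

```ruby
require 'openssl'
require 'json'
require 'base64'

# Hypothetical in-memory key lookup; a stand-in for Vault, a KMS,
# or a dedicated table. One random 256-bit key per user.
USER_KEYS = { 123 => OpenSSL::Random.random_bytes(32) }

def encrypt_payload(data, key)
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv
  ciphertext = cipher.update(JSON.generate(data)) + cipher.final
  { payload: Base64.strict_encode64(ciphertext), iv: Base64.strict_encode64(iv) }
end

# Returns the decrypted Hash, or nil when the user's key has been
# destroyed or the ciphertext cannot be decrypted with the key we hold.
def read_payload(encrypted, user_id, keys = USER_KEYS)
  key = keys[user_id]
  return nil if key.nil?

  decipher = OpenSSL::Cipher.new('aes-256-cbc')
  decipher.decrypt
  decipher.key = key
  decipher.iv = Base64.strict_decode64(encrypted[:iv])
  plaintext = decipher.update(Base64.strict_decode64(encrypted[:payload])) + decipher.final
  JSON.parse(plaintext)
rescue OpenSSL::Cipher::CipherError, JSON::ParserError
  nil
end

event = encrypt_payload({ 'email' => 'jane@example.com' }, USER_KEYS[123])
p read_payload(event, 123) # readable while the key exists

USER_KEYS.delete(123)      # the "forget" request
p read_payload(event, 123) # nil: the event is still stored but unreadable
```

Returning nil keeps projections and read models simple: they can skip or blank out forgotten users without special error handling.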

Deleting keys in the database

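Encryption keys should be stored in a dedicated table or an external KMS, never in the event log itself. If you decide to use your own database, the forgetting operation boils down to a single DELETE on the row holding that user's key, which destroys the ability to read their historical events. Keep in mind that the key also lives in backups of that table, so it is only truly gone once those backups have expired or been purged.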
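As a rough sketch of such a key store, assuming a hypothetical UserKeyStore class where a plain Hash stands in for the dedicated table or KMS (the table name encryption_keys in the comment is likewise an assumption):

```ruby
require 'openssl'

# Hypothetical key store. The Hash stands in for a dedicated table such as
# encryption_keys(user_id, key) or for an external KMS.
class UserKeyStore
  KeyShredded = Class.new(StandardError)

  def initialize
    @keys = {}
  end

  # Returns the user's key, creating a random 256-bit key on first use.
  def fetch_or_create(user_id)
    @keys[user_id] ||= OpenSSL::Random.random_bytes(32)
  end

  # Returns the key, or raises if it never existed or was destroyed.
  def fetch(user_id)
    @keys.fetch(user_id) { raise KeyShredded, "no key for user #{user_id}" }
  end

  # The forgetting operation. With a SQL-backed store this would be roughly:
  #   DELETE FROM encryption_keys WHERE user_id = ?
  # Returns true if a key was actually destroyed.
  def forget!(user_id)
    !@keys.delete(user_id).nil?
  end
end

store = UserKeyStore.new
store.fetch_or_create(42) # key used when writing events for user 42
store.forget!(42)         # from now on store.fetch(42) raises KeyShredded
```

Write paths would call fetch_or_create, while read paths would call fetch and treat KeyShredded as ‘this user has been forgotten’.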

System boundaries and data

An important aspect is where personal data ends and business data begins. We can often anonymise a person's name, but we must retain the fact that the transaction took place in order to maintain the consistency of financial reports. Do not use sensitive data such as email addresses in stream names (Stream ID). If a stream is named after the user's email address, the very existence of that stream reveals the user's identity, even if you encrypt the payload. Always use UUIDs, which are neutral and do not carry any personal information.
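The split between personal and business data, and the neutral stream name, can be sketched as follows. The PERSONAL_FIELDS list, the field names and the "user-" prefix are assumptions made for this example; which attributes count as personal data is a decision for your domain and your lawyers.

```ruby
require 'securerandom'

# Hypothetical list of attributes treated as personal data.
PERSONAL_FIELDS = %i[name email].freeze

# Splits an event payload into personal data (to be encrypted with the
# user's key) and business data (kept readable for reporting).
def split_payload(payload)
  personal, business = payload.partition { |field, _| PERSONAL_FIELDS.include?(field) }
  { personal: personal.to_h, business: business.to_h }
end

# Stream IDs are built from a random UUID, never from an email or a name.
def new_stream_id
  "user-#{SecureRandom.uuid}"
end

payload = { name: 'Jane Doe', email: 'jane@example.com', amount_cents: 4_999, currency: 'EUR' }
parts = split_payload(payload)

p parts[:personal] # name and email: encrypt these before appending the event
p parts[:business] # amount and currency: safe to keep in the clear
puts new_stream_id
```

After the user's key is destroyed, the business part still supports financial reports, while the personal part can no longer be read.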

Summary and work hygiene

Remember that technology is only half the battle. GDPR is primarily about processes and data governance. As engineers, we need to talk to lawyers, because they are the ones who ultimately decide whether crypto-shredding is sufficient in your case. Sometimes it may be necessary to physically delete data by rewriting the stream (Copy and Replace), which is technically difficult but gives the strongest guarantee, provided that read models and backups are cleaned up as well. The most important lesson is simple: don't be a Pokemon collector and design your system so that privacy is its default state, not a patch stuck on at the last minute before an audit.
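For a feel of what Copy and Replace involves, here is a deliberately simplified sketch. Arrays stand in for streams, and the event shape (type, user_id, personal, business) and the copy_and_replace method are assumptions made for this example; a real event store needs its own copy, swap and delete steps, and subscribers must be able to cope with the rewritten stream.

```ruby
# Builds a new stream in which the personal data of one user is dropped,
# while the business facts and the order of events are preserved.
def copy_and_replace(old_stream, forgotten_user_id)
  old_stream.map do |event|
    if event[:user_id] == forgotten_user_id
      event.merge(personal: nil)
    else
      event
    end
  end
end

old_stream = [
  { type: 'OrderPlaced', user_id: 1, personal: { name: 'Jane Doe' }, business: { amount_cents: 4_999 } },
  { type: 'OrderPlaced', user_id: 2, personal: { name: 'John Roe' }, business: { amount_cents: 1_200 } }
]

new_stream = copy_and_replace(old_stream, 1)
p new_stream.first # user 1's order remains, without the personal part
p new_stream.last  # user 2's event is copied unchanged

# In a real system, old_stream would now be deleted and readers
# switched over to new_stream.
```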

Happy encrypting!