Solving LPN Problem with Machine Learning: An Introduction

Cryptography Jan 3, 2021

Ever since the series finale of HBO's Silicon Valley premiered, a lot of people (including me) started asking the question:

"Is it actually possible for an AI to break an encrypted message?"
Lorenz SZ42 – used to generate the encrypted traffic.
Photo by Alex Motoc / Unsplash

There is a tremendous amount of research going on in this domain, including attempts to use machine learning against reduced versions of modern ciphers such as AES. As an introduction to this domain, we will take a look at the Learning Parity with Noise (LPN) problem to help us understand how machine learning can be used to attack an encryption scheme.

Basics of Encryption and Encryption Keys:

What are encryption keys and how do they work?
In cryptography, a 'key' is a piece of information used in combination with an algorithm (a 'cipher') to transform plaintext into ciphertext (encryption) and vice versa (decryption).
Please go through the link above in case you want to understand the basics of encryption and encryption keys.

So, What is the LPN Problem?

To understand the LPN problem, we must first understand what parity learning is. (Parity: a parity bit, or check bit, is a bit added to a string of binary code. Parity bits are the simplest form of error-detecting code.)

Parity learning is a machine-learning problem in which an algorithm must guess a function ƒ, given some samples (x, ƒ(x)), where ƒ computes the parity of the bits at some fixed locations. The samples are generated using some distribution over the input. Provided the algorithm receives a sufficient number of samples (from a distribution that is not too skewed), the problem is easy to solve with Gaussian elimination, which recovers the secret key.

$$As = b$$

where A is the matrix of samples, s is the secret key and b is the vector of responses (all arithmetic is modulo 2).
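To see why the noise-free problem is easy, here is a minimal sketch of the Gaussian-elimination attack over GF(2). The function name `solve_gf2` is my own for illustration, not from any library:

```python
import numpy as np

def solve_gf2(A, b):
    # Gaussian elimination over GF(2): reduce [A | b] and read off s.
    A = A.copy() % 2
    b = b.copy() % 2
    n, d = A.shape
    row = 0
    pivots = []
    for col in range(d):
        # Find a row at or below `row` with a 1 in this column.
        hits = np.nonzero(A[row:, col])[0]
        if hits.size == 0:
            continue
        pivot = row + hits[0]
        A[[row, pivot]] = A[[pivot, row]]
        b[[row, pivot]] = b[[pivot, row]]
        # Eliminate this column from every other row (XOR = addition mod 2).
        for r in range(n):
            if r != row and A[r, col] == 1:
                A[r] ^= A[row]
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    s = np.zeros(d, dtype=int)
    for r, col in enumerate(pivots):
        s[col] = b[r]
    return s

rng = np.random.default_rng(0)
s_true = rng.integers(0, 2, 8)
A = rng.integers(0, 2, size=(50, 8))
b = (A @ s_true) % 2                    # no noise: exactly solvable
print(np.array_equal(solve_gf2(A, b), s_true))  # True (A has full rank w.h.p.)
```

With enough random samples, A has full rank with overwhelming probability and the secret is recovered exactly.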

In practice, this attack is thwarted by adding "noise", which makes things interesting, as the problem now transforms to:

$$As \approx b$$

This is where the LPN problem comes in. Instead of sending the response b = As as it is, the sender adds a noise bit e with some probability p (p < 0.5). Simply put, each response bit is flipped from 0 to 1 or from 1 to 0 with probability p, and sent without any noise with probability 1 − p. Systems making use of this construction are far more resistant to man-in-the-middle attacks. The computations involved are simple, yet the problem is hard enough that LPN is believed to be resistant even to quantum computers. This makes it a good alternative to the number-theoretic problems (e.g. factorization and the discrete logarithm), which can be solved efficiently with quantum algorithms. Its simplicity also makes it a nice candidate for lightweight devices (door locks, hotel room keys, etc.).

Now, let's have a look at a code snippet that generates samples according to a random secret key. (We will be using scikit-learn's ExtraTreesClassifier for the scope of this tutorial; DecisionTreeClassifier and RandomForestClassifier work the same way.)

import numpy as np

class LPN:
    def __init__(self, secret_key, error_rate):
        self.secret_key = secret_key
        self.dimension = len(secret_key)
        self.error_rate = error_rate

    def sample(self, n_amount):
        # Create a random binary matrix of challenges.
        A = np.random.randint(0, 2, size=(n_amount, self.dimension))
        # Add Bernoulli errors.
        e = np.random.binomial(1, self.error_rate, n_amount)
        # Compute the (noisy) labels modulo 2.
        b = np.mod(A @ self.secret_key + e, 2)
        return A, b
Code snippet to generate samples

Let's generate some samples. (We will use a very small value of p for the scope of this tutorial.)

p = 0.125
dim = 16
s_original = np.random.randint(0, 2, dim)
lpn = LPN(s_original, p)

A, b = lpn.sample(100000)

We will now fit these samples with an ExtraTreesClassifier:

from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier()
et.fit(A, b)

# Predicting on the standard basis vectors reads off the candidate key bits.
s_candidate_et = et.predict(np.eye(dim))

# Hamming weight of A @ s_candidate + b (mod 2): low only for the correct key.
et_ef = np.mod(A @ s_candidate_et + b, 2).sum()
print(et_ef)
if et_ef < 14000:
    print(s_candidate_et, s_original)
else:
    print('Wrong candidate. Try again!')
Fitting the model and predicting a secret key with dimension dim
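A note on the np.eye trick above: for a parity function f(x) = ⟨x, s⟩ mod 2, evaluating f on the standard basis vectors reads off the secret one bit at a time, since ⟨eᵢ, s⟩ = sᵢ. A tiny sketch of the idea (the classifier is asked to do the same thing with its learned function):

```python
import numpy as np

s = np.array([1, 0, 1, 1])        # a toy 4-bit secret
f = lambda X: (X @ s) % 2         # the (noise-free) parity function
print(f(np.eye(4, dtype=int)))    # [1 0 1 1] -- the secret itself
```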

The calculation for a Valid Candidate is done as follows:

The learning algorithm might fail to capture the ground truth and learn some other function. In that case, the Hamming weight of the vector A @ s_candidate + b (mod 2) is quite high (around 50000 for our vector of length 100000): a wrong key answers only about half of the challenges correctly. If s_candidate = s_original, the vector equals the error vector e, so its Hamming weight will be around 0.125 * 100000 = 12500. A threshold of 14000 in this example is therefore a good tradeoff between recognizing the correct secret and not outputting a wrong candidate as the solution.
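As a quick sanity check of these numbers, here is a self-contained sketch (not part of the original tutorial) comparing the Hamming weight of A @ s + b for the true secret and for a candidate that is off by a single bit:

```python
import numpy as np

rng = np.random.default_rng(7)
p, dim, n = 0.125, 16, 100_000

s = rng.integers(0, 2, dim)
A = rng.integers(0, 2, size=(n, dim))
e = rng.binomial(1, p, n)
b = (A @ s + e) % 2

# Correct secret: A @ s + b = e (mod 2), so the weight is close to p * n = 12500.
w_correct = ((A @ s + b) % 2).sum()

# Wrong candidate (one bit flipped): about half the responses disagree (~50000).
s_wrong = s.copy()
s_wrong[0] ^= 1
w_wrong = ((A @ s_wrong + b) % 2).sum()

print(w_correct, w_wrong)
assert w_correct < 14_000 < w_wrong
```

The two weights are separated by dozens of standard deviations, which is why a fixed threshold works so reliably here.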


[Extra]

If we tweak the dimensions, increase the number of samples by 10x and hope that most of the generated responses are noise-free (the reason why we kept the probability so small: 0.125), we can sometimes actually recover the secret. You can check this here:

LPN Problem Tutorial

Now, before anyone goes ahead and sniffs hotel room key and lock traffic to collect enough samples, keep in mind that those keys change at an interval, use higher values of p and most probably use variable dimensions for the secret.

[Bonus]

This blog is a boiled-down version of an awesome post by Dr. Robert Kübler:

Where Machine Learning meets Cryptography
The LPN problem stems from learning theory and is used in cryptography for building secure encryption. Learn about this connection in my article!

Cheers!

P.S. - Happy New Year!

