Breaking RSA Using Shor's Algorithm
Table of contents
- Abstract
- Introduction
- Our construction
- Quantum algorithms
- Reference implementation
- The quantum Fourier transform
- The coset representation of modular integers
- Windowed arithmetic
- Oblivious carry runways
- Interactions between optimizations
- Abstract circuit model cost estimate
- Approximation error
- Spacetime layout
- Runtime
- Distillation error
- Topological error
- Physical qubit count
- Construction summary
- Other physical gate error rates
- Cryptographic implications of our construction
- Future work
- Conclusion
Abstract
Use this article to see a quantum implementation capable of breaking RSA. As in classical complexity theory, quantum complexity is measured by the number of gates in a quantum circuit and by time complexity. Try to gain an understanding of the complexity measures presented in this article.
We significantly reduce the cost of factoring integers and computing discrete logarithms in finite fields on a quantum computer by combining techniques from Shor 1994, Griffiths-Niu 1996, Zalka 2006, Fowler 2012, Ekerå-Håstad 2017, Ekerå 2017, Ekerå 2018, Gidney-Fowler 2019, Gidney 2019. We estimate the approximate cost of our construction using plausible physical assumptions for large-scale superconducting qubit platforms: a planar grid of qubits with nearest-neighbor connectivity, a characteristic physical gate error rate of 10^-3, a surface code cycle time of 1 microsecond, and a reaction time of 10 microseconds. We account for factors that are normally ignored such as noise, the need to make repeated attempts, and the spacetime layout of the computation. When factoring 2048 bit RSA integers, our construction's spacetime volume is a hundredfold less than comparable estimates from earlier works. In the abstract circuit model (which ignores overheads from distillation, routing, and error correction) our construction uses 3n + 0.002n lg n logical qubits, 0.3n^3 + 0.0005n^3 lg n Toffolis, and 500n^2 + n^2 lg n measurement depth to factor n-bit RSA integers. We quantify the cryptographic implications of our work, both for RSA and for schemes based on the DLP in finite fields.
Source: Craig Gidney and Martin Ekerå, https://quantum-journal.org/papers/q-2021-04-15-433/pdf/
This work is licensed under a Creative Commons Attribution 4.0 License.
Introduction
Peter Shor's introduction in 1994 of polynomial time quantum algorithms for factoring integers and computing discrete logarithms was a historic milestone that greatly increased interest in quantum computing. Shor's algorithms were the first quantum algorithms that achieved a superpolynomial speedup over classical algorithms, applied to problems outside the field of quantum mechanics, and had obvious applications. In particular, Shor's algorithms may be used to break the RSA cryptosystem based on the hardness of factoring integers that are the product of two similarly-sized primes (hereafter "RSA integers"), and cryptosystems based on the discrete logarithm problem (DLP), such as the Diffie-Hellman key agreement protocol and the Digital Signature Algorithm.
The most expensive operation performed by Shor's factoring algorithm is a modular exponentiation. Modern classical computers can perform modular exponentiations on numbers with thousands of bits in under a second. These two facts may at first glance appear to suggest that factoring a thousand bit number with Shor's algorithm should only take seconds, but unfortunately (or perhaps fortunately), that is not the case. The modular exponentiation in Shor's algorithm is performed over a superposition of exponents, meaning a quantum computer is required, and quantum hardware is expected to be many orders of magnitude noisier than classical hardware. This noise necessitates the use of error correction, which introduces overheads that ultimately make performing reliable arithmetic on a quantum computer many orders of magnitude more expensive than on classical computers.
Although Shor's algorithms run in polynomial time, and although there has been significant
historical work focusing on reducing the cost of Shor's algorithms and large scale quantum computing architectures, the
constant factors hidden by the asymptotic notation remain substantial. These constant factors
must be overcome, by heavy optimization at all levels, in order to make the algorithms practical. Current quantum computers are far from being capable of executing Shor's algorithms for
cryptographically relevant problem sizes.
Our contributions and a summary of our results
In this work, we combine several novel and existing optimizations to reduce the cost of implementing Shor's algorithms. The main hurdle to overcome is to implement one or more modular exponentiations efficiently, as these exponentiations dominate the overall cost of Shor's algorithms.
We use the standard square-and-multiply approach to reduce the exponentiations into a sequence of modular multiplications. We apply optimizations to reduce the number of multiplications and the cost of each multiplication.
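To make the reduction concrete, here is a minimal classical sketch (our own illustration, not code from the paper) of square-and-multiply: the exponentiation becomes one modular multiplication by a classically precomputable constant g^(2^j) mod N per exponent bit. In the quantum setting each such multiplication is controlled by an exponent qubit and acts on a superposition of values.

```python
# Minimal classical sketch of square-and-multiply: one modular multiplication
# by the precomputable constant g^(2^j) mod N per exponent bit j.
def modexp_via_multiplications(g, e, N):
    result = 1
    constant = g % N              # holds g^(2^j) mod N for the current bit j
    for j in range(e.bit_length()):
        if (e >> j) & 1:          # on a quantum computer: controlled by qubit e_j
            result = (result * constant) % N
        constant = (constant * constant) % N
    return result

assert modexp_via_multiplications(7, 123456, 10007) == pow(7, 123456, 10007)
```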
The number of multiplications is reduced by using windowed arithmetic, which uses small table lookups to fuse several multiplications together. It is also reduced by using Ekerå and Håstad's derivatives of Shor's algorithms, that require fewer multiplications to be performed compared to Shor's original algorithms.
The cost of each multiplication is reduced by combining several optimizations. We use Zalka's coset representation of modular integers, which allows the use of cheaper non-modular arithmetic circuits to perform modular additions. We use oblivious carry runways to split registers into independent pieces that can be worked on in parallel when performing additions. We bound the approximation error introduced by using oblivious carry runways and the coset representation of modular integers by analyzing them as approximate encoded permutations. We use windowed arithmetic (again) to fuse multiple additions into individual lookup additions. We use a layout of the core lookup addition operation where carry propagation is limited by the reaction time of the classical control system, and where the lookups are nearly reaction limited. Finally, we optimize the overall computation by trying many possible parameter sets (e.g. window sizes and code distances) for each problem size and selecting the best parameter set.
We estimate the approximate cost of our construction, both in the abstract circuit model, and in terms of its runtime and physical qubit usage in an error corrected implementation under plausible physical assumptions for large-scale superconducting qubit platforms with nearest neighbor connectivity (see Figure 1). Our physical estimates are based on the surface code. We assume distance-d logical qubits are stored using square patches of physical qubits (see Figure 8) and are operated on via lattice surgery. We provide concrete cost estimates for several cryptographically relevant problems, such as the RSA integer factoring problem, and various parameterizations of the DLP in finite fields. These cost estimates may be used to inform decisions on when to mandate migration from currently deployed vulnerable cryptosystems to post-quantum secure systems or hybrid systems.
We caution the reader that, although we report two significant figures in our tables, there are large systemic uncertainties in our estimates. For example, doubling (or halving) the physical gate error rate would increase (or decrease) the number of qubits required by more than 10%. The estimates presented are intended to be ballpark figures that the cryptographic community can use to understand the potential impact of quantum computers; not exacting predictions of the future.
Compared to previous works, we reduce the Toffoli count when factoring RSA integers by over 10x (see Table 1). To only compare the Toffoli counts as in Table 1 may prove misleading, however, as it ignores the cost of routing, the benefits of parallelization, etc. Ideally, we would like to compare our runtime and physical qubit usage to previous works in the literature. However, this is only possible when such estimates are actually reported and use physical assumptions similar to our own. The number of works for which this requirement is met is limited.
The works by Jones et al., Fowler et al., and Gheorghiu et al. stand out in that they use the same basic cost model as we use in this paper, enabling reasonable comparisons to be made. We improve on their estimates by over 100x (see Table 2) when accounting for slight remaining differences in the cost model. We also include other historical cost estimates, although in these cases the differences in physical assumptions are substantial.
Figure 1: Log-log plot of estimated space and expected-time costs, using our parallel construction, for various problems and problem sizes. See Section 3 for additional details. The jumps in space occur as the algorithm exceeds the error budget of the CCZ factory and switches to the T factory. Generated by ancillary file "estimate costs.py".
| Construction (factoring RSA integers) | Abstract qubits | Measurement depth | Toffoli + T/2 count | Toffoli + T/2 count (billions), n = 1024 | n = 2048 | n = 3072 | Min volume (megaqubitdays), n = 1024 | n = 2048 | n = 3072 |
|---|---|---|---|---|---|---|---|---|---|
| Vedral et al. 1996 | 7n + 1 | 80n^3 + O(n^2) | 80n^3 + O(n^2) | 86 | 690 | 2300 | 240 | 4100 | 23000 |
| Zalka 1998 (basic) | 3n + O(1) | 12n^3 + O(n) | 12n^3 + O(n^2) | 13 | 100 | 350 | 16 | 250 | 1400 |
| Zalka 1998 (log add) | 5n + O(1) | 600n^2 + O(n) | 52n^3 + O(n^2) | 56 | 450 | 1500 | 16 | 160 | 540 |
| Zalka 1998 (fft mult) | ≈96n | ≈2^17 n^1.2 | ≈2^17 n^2 | 140 | 550 | 1200 | 62 | 260 | 710 |
| Beauregard 2002 | 2n + 3 | 144n^3 lg n + O(n^2 lg n) | 576n^3 lg^2 n + O(n^3 lg n) | 62000 | 600000 | 2200000 | 32000 | 380000 | 1700000 |
| Fowler et al. 2012 | 3n + O(1) | 40n^3 + O(n^2) | 40n^3 + O(n^2) | 43 | 340 | 1200 | 53 | 850 | 4600 |
| Häner et al. 2016 | 2n + 2 | 52n^3 + O(n^2) | 64n^3 lg n + O(n^3) | 580 | 5200 | 19000 | 230 | 2800 | 13000 |
| (ours) 2019 | 3n + 0.002n lg n | 500n^2 + n^2 lg n | 0.3n^3 + 0.0005n^3 lg n | 0.4 | 2.7 | 9.9 | 0.5 | 5.9 | 21 |

| Construction (solving elliptic curve DLPs) | Abstract qubits | Measurement depth | Toffoli + T/2 count | Toffoli + T/2 count (billions), n = 160 | n = 224 | n = 256 | Min volume (megaqubitdays), n = 160 | n = 224 | n = 256 |
|---|---|---|---|---|---|---|---|---|---|
| Roetteler et al. 2017 | 9n + O(lg n) | 448n^3 lg n + 4090n^3 | 448n^3 lg n + 4090n^3 | 30 | 84 | 130 | 13 | 52 | 83 |
Table 1: Expected costs of factoring n bit RSA integers using various constructions proposed in the literature. For comparison, we include a single construction for solving the DLP in n bit prime order elliptic curve groups with comparable classical security levels. The estimated minimum spacetime volumes assume modern surface code constructions, even for older papers.
| Historical cost estimate at n = 2048 | Physical gate error rate | Cycle time (microseconds) | Reaction time (microseconds) | Physical connectivity | Distillation strategy | Execution strategy | Physical qubits (millions) | Expected runtime (days) | Expected volume (megaqubitdays) |
|---|---|---|---|---|---|---|---|---|---|
| Van Meter et al. 2009 | 0.20% | 49 | N/A | planar | 1000 + T and S | distillation limited carry lookahead | 6500 | 410 | 2600000 |
| Jones et al. 2010 | 0.10% | 0.25 | N/A | planar | 7000 T | distillation limited carry lookahead | 620 | 10 | 6200 |
| Fowler et al. 2012 | 0.10% | 1 | 0.1 | planar | 1200 T | reaction limited ripple carry | 1000 | 1.1 | 1100 |
| O'Gorman et al. 2017 | 0.10% | 10 | 1 | arbitrary | block CCZ | reaction limited ripple carry | 230 | 3.7 | 850 |
| Gheorghiu et al. 2019 | 0.10% | 0.2 | 0.1 | planar | 1100 T | reaction limited ripple carry | 170 | 1 | 170 |
| (ours) 2019 (1 factory) | 0.10% | 1 | 10 | planar | 1 CCZ | distillation limited ripple carry | 16 | 6 | 90 |
| (ours) 2019 (1 thread) | 0.10% | 1 | 10 | planar | 14 CCZ | reaction limited ripple carry | 19 | 0.36 | 6.6 |
| (ours) 2019 (parallel) | 0.10% | 1 | 10 | planar | 28 CCZ | reaction limited carry runways | 20 | 0.31 | 5.9 |

Table 2: Historical estimates of the expected costs of factoring n = 2048 bit RSA integers, and the assumptions they used. We caution the reader to carefully consider the physical assumptions of each paper when comparing their expected volumes. For example, assuming a cycle time that is 5x more optimistic will reduce the expected volume by a corresponding factor of 5. See Appendix B for details on each entry in this table.
As most previous works focus either on factoring, or on the DLP in elliptic curve groups, it is hard to find references against which to compare our cost estimates for solving the DLP in multiplicative groups of finite fields. In general, the improvements we achieve for factoring RSA integers are comparable to the improvements we achieve for solving the general DLP in finite fields. As may be seen in Figure 1, the choice of parameterization has a significant impact on the costs. For the short DLP, and the DLP in Schnorr groups, we achieve significant improvements. These are primarily due to our use of derivatives of Shor's algorithm that are optimized for these parameterizations.
Notation and conventions
Throughout this paper, we refer to the modulus as N. The modulus N is the composite integer to be factored, or the prime characteristic of the finite field when computing discrete logarithms. The number of bits in N is n = ⌈lg N⌉, where lg is the base-2 logarithm. The number of modular multiplication operations to perform (i.e. the combined exponent length in the modular exponentiations) is denoted ne. Our construction has a few adjustable parameters, which we refer to as cexp (the exponent window length), cmul (the multiplication window length), csep (the runway separation), and cpad (the padding length used in approximate representations).
In the examples and figures, we consider moduli of length n = 2048 bits when n needs to be explicitly specified. This is because 2048 bits is the default modulus length in several widely used software programs. Our optimizations are not specific to this choice of n. Section 3 provides cost estimates for a range of cryptographically relevant modulus lengths n.
We often quote costs as a function of both the number of exponent qubits ne and the problem size n. We do this because the relationship between ne and n changes from problem to problem, and optimizations that improve ne are orthogonal to optimizations that improve the cost of individual operations.
On the structure of this paper
Our construction
Quantum algorithms
In Shor's original algorithm, composite integers N that are not pure powers are factored by computing the order r of a randomly selected element g of the multiplicative group of integers modulo N. Specifically, period finding is performed with respect to the function f(e) = g^e mod N. This involves computing a modular exponentiation with about 2n qubits in the exponent e.
If r is even and g^(r/2) is not congruent to -1 (mod N), which holds with probability at least 1/2, this yields nontrivial factors of N. To see why, lift g^(r/2) to an integer x. As x^2 - 1 = (x - 1)(x + 1) is congruent to 0 (mod N), it suffices to compute gcd(x ± 1, N) to factor N.
Ekerå and Håstad explain how to factor RSA integers N = pq in a different way; namely by computing a short discrete logarithm. This algorithm proceeds as follows: First x = g^(N+1) mod N is computed classically, where as before g is randomly selected from the multiplicative group modulo N and of unknown order r. Then the discrete logarithm d = log_g x is computed on the quantum computer. To see why d = p + q, note that g^φ(N) = 1 by Euler's totient theorem, so x = g^(N+1) = g^(N+1-φ(N)) = g^(p+q) as φ(N) = (p - 1)(q - 1) = N - p - q + 1, and d is at most p + q, with equality if r > p + q. For large random RSA integers, the order r exceeds p + q with overwhelming probability. Hence, we may assume d = p + q. By using that N = pq and d = p + q, where N and d are both known, it is trivial to deterministically recover the factors p and q as the roots of the quadratic equation z^2 - dz + N = 0.
The quantum part of Ekerå and Håstad's algorithm is similar to the quantum part of Shor's algorithm, except for the following important differences: there are two exponents e1 and e2, of lengths 2m and m qubits respectively. The value m is a positive integer that must satisfy 2^m > d. We set m = 0.5n + O(1). Period finding is performed against the function f(e1, e2) = g^(e1) x^(-e2) mod N rather than the function f(e) = g^e mod N. The total exponent length is hence ne = 3m ≈ 1.5n qubits, compared to roughly 2n qubits in Shor's algorithm. It is this reduction in the exponent length that translates into a reduction in the overall number of multiplications that are performed quantumly.
The two exponent registers are initialized to uniform superpositions of all 2^(2m) and 2^m values, respectively, and two quantum Fourier transforms are applied independently to the two registers. This implies that standard optimization techniques, such as the square-and-multiply technique, the semi-classical Fourier transform of Griffiths and Niu, recycling of qubits in the exponent registers, and so forth, are directly applicable.
The classical post-processing algorithm used to extract the logarithm from the observed frequencies is by necessity different from Shor's. It uses lattice-based techniques, and critically does not require the order r to be known. Ekerå shows that the post-processing algorithm has probability above 99% of recovering d. As the factors p and q are recovered deterministically from d and N, this implies that it in general suffices to run the quantum algorithm once.
In summary, Ekerå and Håstad's algorithm is similar to Shor's algorithm from an implementation perspective: a sequence of multiplications is computed, where one operand is in a quantum register and one operand is a classical constant. The multiplications are interleaved with the Fourier transform. It is of no significance to our construction what sequence of classical constants are used, or if one or more independent Fourier transforms need to be applied. This fact, coupled with the fact that the algorithms of Ekerå and Håstad may be used to solve other relevant problems where ne is a different function of n, leads us to describe our construction in terms of n and ne.
Note that the above description of the algorithm has been somewhat simplified, compared to the algorithm as originally described by Ekerå and Håstad, in the interest of improved legibility. Furthermore, there are technical conditions that need to be respected for the analysis to be applicable.
Reference implementation
To avoid overloading the reader, we will describe our factoring construction by starting from a simple reference implementation of the quantum part of Shor's original algorithm and then apply optimizations one by one.
The reference implementation works the way most implementations of
Shor's algorithm do, by
decomposing exponentiation into iterative controlled modular
multiplication.
A register x is initialized to the |1⟩ state, then a controlled modular multiplication of the classical constant g^(2^j) mod N into x is performed, controlled by the qubit e_j from the exponent e, for each integer j from ne - 1 down to 0. After the multiplications are done, x is storing g^e mod N and measuring x completes the hard part of Shor's algorithm.
Controlled modular multiplication is still not a primitive operation, so it must also be decomposed. It can be performed by introducing a work register y initialized to |0⟩ and then performing the following two controlled scaled additions: y += kx (mod N) and then x -= k^(-1) y (mod N), where k is the classical constant being multiplied into x. After these two operations, y is storing the result kx mod N and x has been cleared to the |0⟩ state. Swapping the two registers, so that the result is in x, completes the multiplication.
Performing a controlled scaled addition with classical scale factor k can be done with a series of n controlled modular additions. For each qubit q_j at offset j in the input register, you add k·2^j (mod N) into the target, controlled by q_j. Controlled modular addition in turn is performed via a series of non-modular additions and comparisons. Finally, using the Cuccaro adder, uncontrolled non-modular additions can be performed with a single qubit of additional workspace using roughly 2n Toffolis.
Combining the numbers in the preceding two paragraphs implies a Toffoli count that scales as n^2 · ne for the reference implementation. It constitutes the baseline for the optimizations that we apply in the following sections.
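The decomposition of a single multiplication into two scaled additions can be mirrored classically. The sketch below is our own illustration of the bookkeeping; on a quantum computer each step is a reversible, possibly controlled, adder acting on superpositions, and k^(-1) mod N is computed classically ahead of time.

```python
# One modular multiplication by a classical constant k, realized with a work
# register y via: y += k*x (mod N), then x -= k^(-1)*y (mod N), then swap.
def times_constant_via_two_scaled_additions(x, k, N):
    y = 0
    y = (y + k * x) % N          # first controlled scaled addition
    k_inv = pow(k, -1, N)        # classical precomputation (k must be invertible)
    x = (x - k_inv * y) % N      # second scaled addition clears x to zero
    x, y = y, x                  # result ends up back in x, y is reusable
    assert y == 0
    return x

N, k, x0 = 101 * 113, 1234, 5678
assert times_constant_via_two_scaled_additions(x0, k, N) == (x0 * k) % N
```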
The quantum Fourier transform
Before proceeding, note that we focus exclusively on costing the controlled modular multiplications, both in the reference implementation and throughout the remainder of this paper. This is because they constitute the hard part of Shor's algorithm from an implementation perspective. The other part of the algorithm, the quantum Fourier transform (QFT), can be implemented semi-classically and interleaved with the controlled modular multiplications.
When the QFT is implemented in this manner (see Figure 2) it suffices to have control qubits
for the window of controlled modular multiplications currently being processed, as these control
qubits may be recycled. In the QFT, a sequence of classically controlled gates that shift the phase by 2πk/D for various integers k up to the QFT dimension D are applied to the control qubit for each modular multiplication. This sequence can be combined into a single phase shift, and implemented with standard phase-rotation techniques, to a level of precision far below other sources of error in the algorithm without incurring a noticeable cost. This implies that the cost of implementing the QFT is negligible.
Note furthermore that when we analyze the success probabilities of Shor's algorithms, and the
various derivatives, we assume the use of an ideal QFT even though the implemented QFT is
technically an approximation.
The coset representation of modular integers
Following Zalka, the first improvement we make over the reference implementation is to switch
from the usual representation of integers to the coset representation of modular integers. The usual
representation of an integer k in a quantum computer is the computational basis state |k⟩.
The coset representation is different: It uses a periodic superposition with period N and offset k. In ideal conditions, the integer k is represented by the uniform superposition over the basis states |k + jN⟩ that fit in the register, i.e. approximately 2^(-cpad/2) Σ_j |k + jN⟩, where cpad is the number of additional padding qubits placed at the end of the register after the high order bits. The key idea is that the periodicity of the superposition causes a non-modular adder to perform approximate modular addition on this representation, and the error of the approximation can be exponentially suppressed by adding more padding to the register.
As we will discuss later, the amount of padding required is logarithmic in the number of operations. This small cost enables the large benefit of using non-modular adders instead of modular adders. It is possible to perform a controlled non-modular addition in O(n) Toffolis with a small constant factor, significantly cheaper than the controlled modular adder we assumed for the reference implementation. Therefore switching to the coset representation reduces the leading asymptotic term of the Toffoli count of the reference implementation by a constant factor.
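A small classical experiment (our own illustration, with toy parameters) shows why the padding works: for a basis value stored in the coset representation, a plain addition only damages the branches near the top of the register, and the damaged fraction shrinks roughly as 2^-cpad.

```python
# For the coset state of k (a uniform superposition over k + j*N within an
# (n + pad)-qubit register), count the fraction of branches that a plain,
# non-modular addition of t pushes past the top of the register. This damaged
# fraction is the approximation error, and it falls off roughly as 2**-pad.
def leakage_fraction(n, pad, N, k, t):
    size = 2 ** (n + pad)
    branches = [k + j * N for j in range((size - k + N - 1) // N)]
    good = sum(1 for b in branches if b + t < size)
    return 1 - good / len(branches)

N = 251                                   # toy 8-bit modulus, so n = 8
for pad in (2, 4, 8, 12):
    print(pad, leakage_fraction(8, pad, N, k=3, t=250))
# Prints leakages of roughly 0.2, 0.06, 0.004, 0.0002.
```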
Windowed arithmetic
The next optimization we use is windowed arithmetic. Specifically, we follow the modular exponentiation construction and use windowing at two levels.
First, at the level of a multiplication, we window over the controlled additions. We fuse groups of controlled additions into single lookup additions. A lookup addition is an addition where the value to add into a register is the result of a table lookup. Small windows of the qubits that would have been used as controls for the additions are instead treated as addresses into classically precomputed tables of values to unconditionally add into the target.
Second, at the level of the exponentiation, we window over the controlled multiplications. This is done by including exponent qubits in the addresses of all lookups being performed within the multiplications.
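As a concrete classical illustration of the exponent-level windowing, the sketch below fuses a window of exponent bits into a single multiplication by a table entry g^(w·2^i) mod N; the table can be precomputed classically exactly because the multiplicand is a classical constant. The code is our own illustration, with parameter names following the text.

```python
# Windowed modular exponentiation: each window of c_exp exponent bits selects
# one of 2**c_exp precomputed constants, replacing c_exp controlled
# multiplications with a single multiplication by a looked-up value.
def exponent_window_table(g, N, i, c_exp):
    return [pow(g, w << i, N) for w in range(2 ** c_exp)]

def windowed_modexp(g, e, N, c_exp=4):
    acc = 1
    for i in range(0, e.bit_length(), c_exp):
        w = (e >> i) & (2 ** c_exp - 1)       # value of this exponent window
        acc = (acc * exponent_window_table(g, N, i, c_exp)[w]) % N
    return acc

assert windowed_modexp(7, 123456789, 10007) == pow(7, 123456789, 10007)
```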
The cost of windowed arithmetic depends on the size of the windows. Let cexp be the size of the window over exponent qubits that are being used to control multiplications. Increasing cexp proportionally decreases the number of multiplications needed for the modular exponentiation, since more exponent qubits are handled by each multiplication. Let cmul be the size of the window over factor qubits being used to control additions. Increasing cmul proportionally decreases the number of additions needed within each multiplication. By using windowed arithmetic, the ne controlled multiplications we needed to perform become ne/cexp uncontrolled multiplications, while the roughly n controlled additions we needed to perform within each multiplication become roughly n/cmul uncontrolled additions. The cost of increasing cexp and cmul is that each addition is now accompanied by a lookup, and the Toffoli count of the lookup operation is 2^(cexp+cmul).
Figure 2: Working through the qubits representing an exponent in Shor's algorithm with a window size of four, while using a semi-classical Fourier transform with recycling. The notation e[a:b] refers to a slice of (qu)bits in e from (qu)bit index a (inclusive) to (qu)bit index b (exclusive). Each phase gate rotates by an amount determined by previous measurements. All 16 possible values of the expression g^(2^a · e[a:a+4]) mod N (and similar expressions) can be precomputed classically, and looked up on demand within the multiplication circuit. This reduces the number of multiplications by a factor of the window size, at the cost of some additional lookup work within the multiplication.
Using Cuccaro et al.'s adder, each n-bit addition has a Toffoli count and measurement depth of roughly 2n. Using Babbush et al.'s QROM read, each table lookup has a Toffoli count and measurement depth of roughly 2^(cexp+cmul) and uses a negligible number of ancillae. Using measurement-based uncomputation, uncomputing each table lookup has a Toffoli count and measurement depth on the order of the square root of 2^(cexp+cmul), which is negligible compared to the lookup for the parameters we will be using. The measurement-based uncomputation needs a number of ancillae that is also on the order of this square root, but it starts by measuring away n logical qubits of the lookup output register, so its net ancillae count is always zero for the reasonable values of cexp and cmul, and classically intractable values of n, that we consider in this paper. The overhead due to the logarithmic padding introduced by the coset representation of modular integers is logarithmic in the number of operations being performed.
Thus, by using windowed arithmetic, the leading term of the Toffoli count of the reference implementation has been reduced to roughly (2 · ne · n / (cexp · cmul)) · (2n + 2^(cexp+cmul)). If we set cexp and cmul to roughly half of lg n, so that the lookup cost is comparable to the addition cost, then we achieve a Toffoli count with leading term proportional to n^2 · ne / lg^2 n. In terms of space, we are using roughly 3n logical qubits. The qubits are distributed as follows: roughly n for the accumulation register storing the product, roughly n for a workspace register during the multiplication, roughly n for the lookup output register, and cexp qubits to hold the part of the exponent needed for the current stage of the semi-classical Fourier transform.
Oblivious carry runways
The last major algorithmic optimization we apply is the use of oblivious carry runways. The basic problem addressed by runways is that, normally, a carry signal generated at the bottom of a register must propagate all the way to the top of the register. This process can be short-circuited by instead terminating the carry propagation into an appropriately constructed runway. Runways allow large additions to be performed piecewise, with each piece being worked on in parallel, by terminating carries into appropriately placed runways at the end of each piece.
As with the coset representation of modular integers, circuits using oblivious carry runways
approximate the original circuit instead of perfectly reproducing it. But, as we will discuss later,
increasing the runway length cpad exponentially suppresses the approximation error.
Figure 3: How to temporarily reduce oblivious carry runways to a single qubit, in preparation for being used as
the input factor in a multiply-add operation that must iterate over all qubits in the register. The multiply-add
should occur during the ". . . " section in the middle.
That being said, we will point out one optimization not mentioned in that paper which applies here. Note that, ultimately, each register we add carry runways to is going to either be measured or discarded. The last thing that would happen to these registers, before they were measured or discarded, is carry runway removal. However, the carry runway removal process is classical; it can be constructed out of Toffoli gates. As a result, the carry runway removal process does not need to be performed under superposition. The register qubits and runway qubits can simply be measured just before the runways would have been removed, and the runway removal process can be performed by classical post-processing of the measurements.
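The sketch below is our own classical illustration of this point, tracking a single basis-state branch: a runway value R added at the top of the low piece and subtracted from the bottom of the high piece leaves the recombined value unchanged, additions can then proceed piecewise, and "removing" the runway is nothing more than classically recombining the measured pieces. The exact conventions in the paper's construction may differ.

```python
import random

def encode(v, split, pad, R):
    """Split v at bit `split` and install a runway value R obliviously."""
    assert 0 <= R < (1 << pad)
    low, high = v & ((1 << split) - 1), v >> split
    return low + (R << split), high - R          # piece0 has split + pad bits

def piecewise_add(p0, p1, t, split):
    """Add t without any carry crossing the split (carries end in the runway)."""
    return p0 + (t & ((1 << split) - 1)), p1 + (t >> split)

def decode(p0, p1, split, n):
    """Classical post-processing that replaces quantum runway removal."""
    return (p0 + (p1 << split)) % (1 << n)

n, split, pad = 16, 8, 4
v, t = 51234, 7777
R = random.randrange(1 << pad)                   # runway starts in a superposition
p0, p1 = piecewise_add(*encode(v, split, pad, R), t, split)
assert decode(p0, p1, split, n) == (v + t) % (1 << n)
```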
Note that we use cpad for both the runway length of oblivious carry runways and the padding length of the coset representation. We do this because they have identical error mechanisms, namely unwanted carries into the extra qubits. There is negligible benefit in suppressing one more than the other.
The major benefit of oblivious carry runways, compared to previous techniques for reducing the depth of addition such as Draper et al.'s logarithmic depth carry-lookahead adder, is that oblivious carry runways can be introduced gradually without incurring large overheads. The Toffoli count and workspace overheads are linear in the number of pieces n/csep (where csep is the runway spacing), while the overhead per piece (the runway length cpad) is only logarithmic in the number of operations being performed.
For example, if you place a single runway at the midpoint of a 2048-qubit register, then the
number of qubits and the number of Toffolis needed to perform an addition will increase by a
couple percent but the depth of the additions is nearly halved.
Interactions between optimizations
A major benefit of the set of optimizations we have chosen to use in this paper is that they complement each other. They make orthogonal improvements that compound or reinforce, instead of conflicting.
For example, when using the coset representation of modular integers, it is important that additions only use offsets less than N. Larger offsets cause larger approximation error. However, because we are using windowed arithmetic, every addition we perform has an offset that is being returned by a table lookup. Since the entries in the tables are classically precomputed, we can classically ensure all offsets are canonicalized into the [0, N ) range.
There are two cases where the optimizations do not mesh perfectly.
First, using oblivious runways reduces the depth of addition but not the depth of lookups. This changes their relative costs, which affects the optimal window sizes to use in windowed arithmetic. When addition is suddenly four times faster than it used to be, the optimal window sizes cexp and cmul decrease by 1 (approximately).
Second, when iterating over input qubits during a multiply-add, it is necessary to also iterate
over runway qubits and padding qubits. This increases the number of lookup additions that need
to be performed in order to complete a windowed multiplication. This issue is partially solved
by temporarily adding the runways back into the main register before performing a multiply-add
where that register is used as one of the factors (instead of the target). This temporarily reduces
each runway to a single carry qubit (see Figure 3). We could reduce the runways to zero qubits,
but this would require propagating the carry qubits all the way across the register, so we do not.
Abstract circuit model cost estimate
We have now described all the details necessary to estimate the cost of our implementation in the abstract circuit model. The cost depends on the two parameters specified by the problem (the input size n and the number of exponent qubits ne) and also the four controllable parameters we have discussed (the exponentiation window size cexp, the multiplication window size cmul, the runway separation csep, and the padding/runway length cpad).
Although we do consider alternate values when producing tables and figures, in general we have found that the settings cexp = cmul = 5 and csep = 1024, with a padding/runway length cpad of a few dozen qubits, work well. In this overview we will focus on these simple, though slightly suboptimal, values.
Recall that an exponentiation may be reduced to a sequence of multiplications, which we process in groups of size cexp. For each multiplication, we do two multiply-adds. Each multiply-add will use several small additions to temporarily reduce the runway registers to single qubits. The multiply-add then needs to perform a sequence of additions controlled by the n main register qubits, the cpad coset padding qubits, and the roughly n/csep reduced runway qubits. Using windowed arithmetic, these additions are done in groups of size cmul with each group handled by one lookup addition. Therefore the total number of lookup additions we need to perform is approximately
2 · (ne / cexp) · (n + cpad + n/csep) / cmul.
The lookup additions make up essentially the entire cost of the algorithm, i.e. the total Toffoli count is approximately equal to the Toffoli count of a lookup addition times the lookup addition count. This works similarly for the measurement depth. Ignoring the negligible cost of uncomputing the lookup, the Toffoli count of a lookup addition is approximately 2^(cexp+cmul) + 2(n + cpad + n/csep) and the measurement depth is approximately 2^(cexp+cmul) + 2(csep + cpad). Therefore the total Toffoli count and measurement depth are approximately these per-lookup-addition costs multiplied by the lookup addition count. These approximate upper bounds, with ne set to roughly 1.5n as in Ekerå and Håstad's algorithm, are the formulae we report in the abstract and in Table 1 for the cost of factoring RSA integers.
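The counting above is easy to mirror in a few lines. The snippet below is a back-of-the-envelope version in the spirit of the text, not the paper's "estimate costs.py": the per-lookup-addition cost expressions and the padding value cpad = 45 are illustrative guesses.

```python
from math import ceil

def lookup_addition_count(n, n_e, c_exp, c_mul, c_sep, c_pad):
    adds_per_multiply_add = ceil((n + c_pad + ceil(n / c_sep)) / c_mul)
    return ceil(n_e / c_exp) * 2 * adds_per_multiply_add

def toffolis_per_lookup_addition(n, n_e, c_exp, c_mul, c_sep, c_pad):
    lookup = 2 ** (c_exp + c_mul)                  # QROM read over the window
    addition = 2 * (n + c_pad + ceil(n / c_sep))   # ripple-carry over the register
    return lookup + addition

n = 2048
params = dict(n=n, n_e=3 * (n // 2 - 1) - 40, c_exp=5, c_mul=5, c_sep=1024, c_pad=45)
count = lookup_addition_count(**params)
print(count, count * toffolis_per_lookup_addition(**params))
```

With these illustrative parameters the product lands in the low billions of Toffolis, the same ballpark as the n = 2048 entry in Table 1.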
Approximation error
Because we are using oblivious carry runways and the coset representation of modular integers, the computations we perform are not exact. Rather, they are approximations. This is not a problem in practice, as we may bound the approximation error using the concepts and results for approximate encoded permutations.
The oblivious runways and the coset representation of modular integers are both examples of "approximate encoded permutations" which have a "deviation". When using a padding/runway length of cpad, and a runway separation of csep, the deviation per addition is exponentially small in cpad (roughly the number of register pieces times 2^(-cpad)). Deviation is subadditive under composition, so the deviation of the entire modular exponentiation process is at most the number of lookup additions times the deviation per addition.
Figure 4: Space usage during the key lookup-add-unlookup inner loop of our construction. During lookup,
the target register is spread out to make room for the temporary register that will hold the lookup's output.
During addition the target register and lookup output register are squeezed through a moving operating area
that sweeps up then down, applying the MAJ and UMA sweeps of Cuccaro's adder. Uncomputing the lookup
is done with measurement-based uncomputation, which is overlapped slightly with the UMA sweep (this is
why the yellow rows disappear early). "CCZ factory" space is being used to produce CCZ states. "CCZ fixup"
space is being used to convert the CCZ states into AutoCCZ states. "MAJ/UMA" space is being used to
perform a ripple carry addition, fueled by AutoCCZ states, of the temporary register (the result of the lookup)
into the target register.
We can use this to check how much deviation results from the padding/runway length we have chosen, which we so far have merely claimed would be sufficient. When an approximate encoded permutation has a deviation of ε, the trace distance between its output and the ideal output is bounded by a simple function of ε. Therefore the approximation error (i.e. the trace distance due to approximations), using the parameter assignments we have described, is roughly 0.1%. This is significantly lower than other error rates we will calculate.
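One practical use of the deviation bound is to solve for the smallest padding/runway length that meets a target error budget. The snippet below is our own rough version of that calculation; the constant in the per-addition bound is illustrative rather than the paper's exact expression.

```python
from math import ceil

def smallest_sufficient_padding(n, n_e, c_exp, c_mul, c_sep, target=1e-3):
    # Per-addition deviation ~ (register pieces + 1) * 2**-c_pad, summed over
    # all lookup additions (deviation is subadditive under composition).
    for c_pad in range(1, 64):
        adds = ceil(n_e / c_exp) * 2 * ceil((n + c_pad + ceil(n / c_sep)) / c_mul)
        if adds * (n / c_sep + 1) * 2.0 ** -c_pad <= target:
            return c_pad
    return None

print(smallest_sufficient_padding(n=2048, n_e=3 * 1023 - 40,
                                  c_exp=5, c_mul=5, c_sep=1024))  # ~30 qubits
```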
Spacetime layout
Our implementation of Shor's algorithm uses a layout that is derived from the (nearly) reaction limited layouts. In these layouts, a lookup addition is performed as follows (see Figure 4). All data qubits are arranged into rows. The rows of qubits making up the target register of the addition are spread out into row pairs with gaps five rows wide in between. In these gaps there will be two rows for the lookup output register, and three rows for access hallways. This arrangement allows the lookup computation two ways to access each output row, doubling the speed at which it can run, while simultaneously ensuring the target register and the lookup output register are interleaved in a convenient fashion.
After the lookup is completed, the rows of the lookup output register and the addition target register are packed tightly against the top (or equivalently bottom) of the available area. An operating area is prepared below them, and the data is gradually streamed through the operating area while the operating area gradually shifts upward. This performs the MAJ sweep of Cuccaro's ripple carry adder. Everything then begins moving in the opposite direction in order to perform the UMA sweep on the return stroke.
The lookup register is quickly uncomputed using measurement-based uncomputation, and the system is returned to a state where another lookup addition can be performed.
In order to perform piecewise additions separated by oblivious carry runways, we simply partition the computation horizontally as shown in Figure 5 and Figure 7. The lookup before each addition prepares registers across the pieces, as shown in Figure 6, but the additions themselves stick to their own piece.
The width of each piece is determined by the number of CCZ factories needed to run at the
reaction limited rate. This number is 14, assuming a code distance of 27 and using the CCZ
factories. Also, assuming a level 1 code distance of 17, the footprint of the factory
is 15 × 8. The factories are laid out into 2 rows of 7, with single logical qubit gaps to allow
space for data qubits to be routed in and out. The total width of a piece is 15 · 7 + 7 = 113 logical
qubits. The height of the operating area is 33 rows of logical qubits (2 · 8 for the CCZ factories,
three for the ripple-carry operating area, six for AutoCCZ fixup boxes, and eight for routing data
qubits). The qubits from each register add another 30 rows (approximately). So, overall, when working on a problem of size n, the computation covers a grid of logical qubits whose width is a multiple of the 113 logical qubit piece width and whose height grows roughly linearly with n; at n = 2048 the grid is approximately 225 × 63 logical qubits (see Figure 7).
Runtime
Because our implementation is dominated almost entirely by the cost of performing lookup additions, its runtime is approximately equal to the number of lookup additions times the runtime of a single lookup addition.
During the lookup phase of a lookup addition, the computation is code depth limited. Assuming a code depth of 27 surface code cycles per logical operation and a surface code cycle time of 1 microsecond, it takes roughly 14 milliseconds to perform the lookup using double-access hallways.
During the addition phase of a lookup addition, the computation is reaction limited. Given a reaction time of 10 microseconds, it takes roughly 21 milliseconds to perform the addition. The remaining bits of a lookup addition, such as uncomputing the lookup and rearranging the rows, take approximately 1 millisecond. Thus one lookup addition takes approximately 37 milliseconds. Given this fact, i.e. that we perform quantum lookup additions slower than most video games render entire frames, we can approximate the total runtime as the number of lookup additions times 37 milliseconds. Though we caution the reader that this estimate ignores the fact that, at larger problem sizes, lookups become slower due to the minimum code distance increasing.
This estimate implies that factoring a 2048 bit integer will take approximately 7 hours, assuming only one run of the quantum part of the algorithm is needed. Note that in our reported numerical estimates we achieve lower per-run numbers by using more precise intermediate values and by more carefully selecting parameters.
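The runtime arithmetic can also be reproduced roughly: multiply the lookup addition count by about 37 milliseconds. The snippet below uses the same illustrative parameters as the earlier sketches and lands in the same few-hour range as the estimate above; the paper's ancillary script is more careful.

```python
from math import ceil

def estimated_runtime_hours(n, n_e, c_exp, c_mul, c_sep, c_pad,
                            seconds_per_lookup_addition=0.037):
    adds = ceil(n_e / c_exp) * 2 * ceil((n + c_pad + ceil(n / c_sep)) / c_mul)
    return adds * seconds_per_lookup_addition / 3600

print(estimated_runtime_hours(n=2048, n_e=3 * 1023 - 40,
                              c_exp=5, c_mul=5, c_sep=1024, c_pad=31))
```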
Figure 5: Data layout during a parallel addition. Corresponds to one of the "addition" columns from Figure 4.
To scale, assuming a 2048 bit number is being factored. The left and right halves of the computer run completely
independently and in parallel. The factories (red boxes and pink boxes) are feeding AutoCCZ states into the
blue area, which is rippling the carry bit back and forth as the offset and target register data (green and yellow
rows) is routed through gaps between the factories to the other side.
Figure 6: Data layout during a table lookup. To scale, assuming a 2048 bit number is being factored. Corresponds to the center "lookup" column from Figure 4. The factories (red boxes and pink boxes) are feeding AutoCCZ states into the dark gray region, which is performing the unary iteration part of a table lookup computation. There are enough factories, and enough work space, to run the lookup at double speed, made possible by the fact that every qubit in the lookup output register (yellow) is adjacent to two access hallways. The target register (green rows) and factor register (blue rows) are idle, except that a few qubits from the factor register are being used as address bits in the table lookup.
Figure 7: Spacetime layout during part of an addition. Time moves upward and has been truncated. Corresponds
to one of the "addition" columns from Figure 4. Tiny gray squares are the size of one logical qubit. Yellow and
green rows are qubits from the target and input registers of the addition. Dark blue rows are qubits from an idle
register. Red boxes are CCZ magic state factories, turned into AutoCCZ states by the pink boxes. Light blue
boxes are the MAJ operation of Cuccaro's adder, arranged into a spacelike sweep to keep ahead of the control software. The full adder is formed by repeating this pattern, with the operating area gradually sweeping up and then down through the green/yellow data. The diagram is to scale for n = 2048 with a level 2 code distance of 27 and a level 1 code distance of 17, and fits on a 225 × 63 rectangular grid of logical qubits.
Distillation error
In order to perform the Toffoli gates our implementation uses to factor an n = 2048 bit RSA integer, we need to distill approximately 3 billion CCZ states. Using a level 1 code distance of 17 and a level 2 code distance of 27, this corresponds to a total distillation error of 6.4%.
This quantity is computed by considering the initial error rate of injecting physical T states,
topological error within the factory, and the likelihood of the various stages of distillation producing
a false negative.
Topological error
Elsewhere in this paper we frame our hardware assumptions in terms of a physical gate error rate. For a physical gate error rate of p = 10^-3, the probability of error in a logical qubit of distance d, per surface code cycle, is approximately 0.1 · (p/p_th)^((d+1)/2), where p_th ≈ 1% is the surface code threshold. We base our cost estimates on the assumption that this scaling relationship holds. Each time the code distance is increased by two, the logical error suppression must jump by at least a factor of 10. We believe a physical error rate of 10^-3 is sufficient to achieve this scaling relationship.
Now that we know the number of logical qubits, the runtime of the algorithm, and the relationship between code distance and logical error rates, we can approximate the probability of a topological error occurring within the surface code during the execution of the algorithm. This will allow us to verify our initial assumption that a code distance of 27 is sufficient in the case where n = 2048. Larger computations will require larger code distances.
When factoring an n = 2048 bit RSA integer we are using a board of approximately 225 × 63 ≈ 14000 logical qubits. Approximately 25% of these qubits are being used for distillation, which we already accounted for, and so we do not count them in this calculation. The remaining qubits are kept through roughly 20 billion surface code cycles, which implies that the probability of a topological error arising is on the order of 25%.
This is a large error rate. Using a code distance of 27 is pushing the limits of feasibility. We would need to repeat the computation roughly 1.4 times, on average, to factor a number. If our goal is to minimize the expected spacetime volume of the computation, perhaps we should increase the code distance to 29. Doing so would increase the physical qubit count by 15%, but the error rate would drop by approximately a factor of 10 and so the expected number of runs would be much closer to 1. Ultimately the choice comes down to one's preferences for using more space versus taking more time.
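The topological error estimate is a product of three factors: logical qubits kept alive, surface code cycles, and logical error per qubit per cycle. The snippet below is our own rough recomputation with round numbers; the suppression formula and the counts are illustrative assumptions consistent with the text.

```python
def logical_error_per_qubit_per_cycle(d, p=1e-3, threshold=1e-2):
    # Heuristic surface code scaling: roughly a factor 10 of extra suppression
    # for every increase of the code distance by 2 (at p = 10^-3).
    return 0.1 * (p / threshold) ** ((d + 1) / 2)

logical_qubits = 14000 * 0.75      # ~25% of the board is doing distillation
cycles = 20e9                      # a multi-hour run at 1 microsecond per cycle
print(logical_qubits * cycles * logical_error_per_qubit_per_cycle(27))
# Roughly 0.2, i.e. a topological error on a significant fraction of runs.
```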
Physical qubit count
In lattice surgery, a logical qubit covers 2(d + 1)^2 physical qubits, where d is the code distance (see Figure 8). Assuming we push the limits and use a code distance of 27 at n = 2048, each logical qubit will cover 1568 physical qubits. Therefore the total physical qubit count is the number of logical qubits times 1568; approximately 23 million qubits.
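The final multiplication is simple enough to show directly; in this sketch the board size is taken from the Figure 7 caption and the code distance from above.

```python
def physical_qubits(logical_qubits, d):
    return logical_qubits * 2 * (d + 1) ** 2   # lattice surgery footprint

print(physical_qubits(225 * 63, d=27))         # ~22 million physical qubits
```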
Attentive readers will note that this number disagrees with the estimate in the title of the
paper. This is because, throughout this section, we have been sloppily rounding quantities up and
choosing fixed parameters in order to keep things simple. The estimate in the title is produced by
the ancillary file "estimate costs.py", which does not make these simplifications. (In particular, "estimate costs.py" realizes that the level 1 code distance used during distillation can be reduced
from 17 to 15 when n = 2048 and this adjusts the layout in several fortuitous ways).
Construction summary
We now review the basic flow of our implementation of Shor's algorithm, including some details we would have otherwise left unstated.
The quantum part of the algorithm starts by preparing two registers, storing zero and one respectively, into the coset representation of modular integers with oblivious carry runways. The cost of this preparation is negligible compared to the cost of the rest of the algorithm. The register storing one is the accumulation register that will ultimately store the modular exponentiation, while the zero register is a workspace register.

Figure 8: Stabilizer configuration diagrams for lattice surgery qubits in the rotated surface code with code
distances d of (from left to right) 3, 5, and 11. Each small yellow circle is a data qubit. Each filled shape is
a stabilizer that is measured each surface code cycle. Black squares are four body Z stabilizers, gray squares
are four body X stabilizers, black semi-circles are two body Z stabilizers, and gray semi-circles are two body
X stabilizers. Not all physical qubits are shown; in particular there is a measurement qubit present for each
stabilizer. There are 2(d + 1)^2 physical qubits per logical qubit, including the spacing between logical qubits.
We assume that each increase of the code distance by 2 suppresses logical errors per surface code cycle by a
factor of 10 and that this is achieved with a physical error rate of 10−3.
The bulk of the execution time is then spent performing the modular exponentiation(s). Exponent qubits are iteratively introduced in groups of size cexp. We are performing a semi-classical Fourier transform, so the exponent qubits must be phased according to measurements on previous exponent qubits. Performing the phasing operations requires using T factories instead of CCZ factories, but there is so little phasing work compared to CCZ work that we ignore this cost in our estimates. The phased exponent qubits are used as address qubits during a windowed modular multiplication, then measured in the frequency basis and discarded. See Figure 2. Within this step, almost all of the computation time is spent on lookup additions.
After all of the modular multiplications have been completed, the accumulation register is storing the result of the modular exponentiation. Note that this register still has oblivious carry runways and is still using the coset representation of modular integers. We could decode the register before measuring it but, because the decoding operations are all classical, it is more efficient to simply measure the entire register and perform the decoding classically. As with the cost of initialization of the registers, this cost is completely negligible.
This completes the quantum part of the algorithm. The exponent qubit measurement results, and the decoded accumulator measurement result, are fed into the classical post-processing code which uses them to derive the solution. If this fails, which can occur e.g. due to a distillation error during quantum execution, the algorithm is restarted.
Note that there are many minor implementation details that we have not discussed. For ex-
ample, during a windowed multiplication, one of the registers is essentially sitting idle except that
its qubits are being used (cmul at a time) as address qubits for the lookup additions. We have not
discussed how these qubits are routed into and out of the lookup computation as they are needed.
It is simply clear that there is enough leeway, in the packing of the computation into spacetime,
for this routing task to be feasible and contribute negligible cost.
Other physical gate error rates
Throughout this paper, we focus primarily on a physical gate error rate of 0.1%, because we consider this error rate to be plausibly achievable by quantum hardware, yet sufficiently low to enable error correction with the surface code.
This being said, it is of course important to be conservative when estimating the security of cryptographic schemes. The reader may therefore also be interested in estimates for lower error rates. With this in mind, in addition to including a 0.01% curve in Figure 1, we present the following rule of thumb for translating to other physical gate error rates p:

V(p) ≈ V(10^-3) / (log10(0.01 / p))^3

Here V(p) is the expected spacetime volume, assuming a physical gate error rate of p. The rule of thumb is based on the following observation:
Each time the code distance of a construction is increased by 2, logical errors are suppressed by a factor roughly equal to the ratio between the code's threshold error rate and the hardware's physical error rate. The surface code has a threshold error rate of around 1%. For example, this means that at a physical error rate of 0.01% you suppress logical errors by 100x each time you increase the code distance by 2, whereas at a physical error rate of 0.1% the same suppression factor would require increasing the code distance by 4. Consequently, code distances tend to be half as large at a physical error rate of 0.01%. The code distance is a linear measure, and we are packing the computation into a three-dimensional space (one of the dimensions being time), so proportional improvements to the code distance get cubed.
Hence the idea is that at a physical error rate of 0.01% code distances are half as big, meaning the computational pieces being packed into spacetime are 23 = 8 times smaller, meaning in turn that the optimal packing will also be roughly eight times smaller. Similarly, at an error rate of 0.001%, code distances are a third as big, so spacetime volumes should be roughly 27 times smaller.
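Expressed as code, one reading of this rule of thumb (anchored to the stated 8x and 27x reductions relative to a 0.1% physical error rate, assuming a ~1% threshold) is:

```python
from math import log10

def volume_scaling_relative_to_0p1_percent(p, threshold=1e-2):
    # Code distances scale like 1 / log10(threshold / p); volume scales cubically.
    return 1.0 / log10(threshold / p) ** 3

for p in (1e-3, 1e-4, 1e-5):
    print(p, volume_scaling_relative_to_0p1_percent(p))   # 1.0, 0.125, ~0.037
```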
Of course, this rule of thumb is imperfect. For example, it does not account for the fact that
code distances have to be rounded to integer values, or for the fact that the spacetime tradeoffs
being made have to change due to the reaction time of the control system staying constant as the
code distance changes, or for the fact that at really low error rates you would switch to different
error correcting codes. Regardless, for order of magnitude estimates involving physical error rates
between 0.3% and 0.001%, we consider this rule of thumb to be a sufficient guide.
Cryptographic implications of our construction
In this section, we consider how the security of RSA, and of cryptographic schemes based on the intractability of the DLP in finite fields, is affected when the optimized construction previously described is used to implement the currently most efficient derivatives of Shor's algorithms.
Our goal throughout this section is to minimize the overall expected spacetime volume, includ-
ing expected repetitions, when factoring RSA integers or computing discrete logarithms. That is
to say, we minimize the spacetime volume in each run of the quantum algorithm times the expected
number of runs required.
Methodology
To minimize the spacetime volume for a given choice of n and ne, we consider all combinations of the level 1 and level 2 surface code distances d1 and d2 used during distillation and computation, window sizes cexp and cmul, runway spacings csep, and padding offsets δoff, from which the padding/runway length cpad is derived. Furthermore, we consider two different magic state distillation strategies: the CCZ factory and the T factory. For each combination of parameters we estimate the execution time t and physical qubit count s, and upper bound the overall probability of errors occurring (to obtain the "retry risk" ε).
To derive an upper bound on the overall probability of errors occurring, we separately estimate
the probabilities of topological errors occurring due to a failure of the surface code to correct errors,
of approximation errors occurring due to using oblivious carry runways and the coset representation
of modular integers, of magic state distillation errors, and of the classical post-processing algorithm
failing to recover the solution from a correct run of the quantum algorithm. We combine these
error probabilities, assuming independence, to derive an upper bound on the overall probability of
errors occurring.
We chose to optimize the quantity s^1.2 · t / (1 - ε), which we refer to as the "skewed expected spacetime volume". The t / (1 - ε) factor is the expected runtime, and the s^1.2 factor grows slightly faster than the space usage. We skew the space usage when optimizing because we have a slight preference for decreasing space usage over decreasing runtime. We consider all combinations of parameters (d1, d2, cexp, cmul, csep, δoff), choose the set that minimizes the skewed expected spacetime volume, and report the corresponding estimated costs.
Implications for RSA
Today, the arguably most commonly used modulus size for RSA is n = 2048 bits. Larger moduli are however in widespread use and smaller moduli have been used historically. The best published academic record is the factorization of an 829 bit RSA modulus in 2020; the previous record was the factorization of a 795 bit modulus in 2019.
In Table 3 and Figure 1, we provide estimates for the resource and time requirements for attacking RSA for various cryptographically relevant modulus lengths n. The estimates are for factoring RSA integers with Ekerå-Håstad's algorithm that computes a short discrete logarithm. A single correct run of this quantum algorithm suffices for the RSA integer to be factored with at least 99% success probability in the classical post-processing.
| n | ne | d1 | d2 | δoff | cmul | cexp | csep | Retry risk | Volume per run (megaqubitdays) | Volume expected (megaqubitdays) | Qubits per run (megaqubits) | Runtime per run (hours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1024 | 3(n/2 - 1) - 40 | 15 | 27 | 5 | 5 | 5 | 1024 | 6% | 0.5 | 0.5 | 9.7 | 1.3 |
| 2048 | | 15 | 27 | 4 | 5 | 5 | 1024 | 31% | 4.1 | 5.9 | 20 | 5.1 |
| 3072 | | 17 | 29 | 6 | 4 | 5 | 1024 | 9% | 19 | 21 | 38 | 12 |
| 4096 | | 17 | 31 | 9 | 4 | 5 | 1024 | 5% | 48 | 51 | 55 | 22 |
| 8192 | | 19 | 33 | 4 | 4 | 5 | 1024 | 5% | 480 | 510 | 140 | 86 |
| 12288 | | 19 | 33 | 3 | 4 | 5 | 1024 | 12% | 1700 | 1900 | 200 | 200 |
| 16384 | | 19 | 33 | 4 | 4 | 5 | 1024 | 24% | 3900 | 5100 | 270 | 350 |

Table 3: Factoring an n bit RSA integer by computing a short discrete logarithm. This table was produced by the script in the ancillary file "estimate costs.py".
Implications for finite field discrete logarithms
Given a generator g of an order r subgroup of the multiplicative group of integers modulo N, where the modulus N is prime, and an element x = g^d, the finite field discrete logarithm problem is to compute d = log_g x. In what follows, we assume r to be prime. If r is composite, the discrete logarithm problem may be decomposed into problems in subgroups of orders dividing r, as shown by Pohlig and Hellman. For this reason, prime order subgroups are used in cryptographic applications.
As g has order r, it must be that r divides N - 1, so N = kr + 1 for some integer k. The asymptotically best currently known classical algorithms for computing discrete logarithms in subgroups of this form are generic cycle-finding algorithms, such as Pollard's ρ- and λ-algorithms, that run in time O(√r) and O(√d), respectively, and the general number field sieve (GNFS), that runs in subexponential time in the bit length n of N.
The idea of factoring via algebraic number fields was originally introduced by Pollard for integers of special forms. It was over time generalized, by amongst others Buhler et al., Lenstra et al. and Pollard, to factor general integers, and modified by amongst others Gordon and Schirokauer to compute discrete logarithms in finite fields. Much research effort has since been devoted to optimizing the GNFS in various respects; we refer the reader to the literature for an in-depth historical account of the development of the GNFS and for the best academic records.
Let z be the number of bits of security provided by the modulus with respect to classical attacks using the GNFS. Let lr and ld be the lengths in bits of r and d, respectively. It then suffices to pick lr = ld = 2z to achieve z bits of classical security, as the generic cycle-finding algorithms are then not more efficient than the GNFS.
When instantiating schemes based on the intractability of the finite field discrete logarithm problem, one may hence choose between using a Schnorr group, for which lr = 2z, or a safe-prime group, for which r = (N - 1)/2. In the latter case, one may in turn choose between using a short exponent, such that ld = 2z, or a full length exponent, such that ld is approximately n. All three parameterization options provide z bits of classical security.
What finite field groups are used in practice?
In practice, Schnorr groups or safe-prime groups with short exponents are often preferred over safe- prime groups with full length exponents, as the comparatively short exponents yield considerable performance improvements.
The discrete logarithm problem in Schnorr groups is of standard form, unlike the short discrete
logarithm problem in safe-prime groups, and Schnorr groups are faster to generate than safe-prime
groups. A downside to using Schnorr groups is that group elements received from untrusted parties
must be tested for membership of the order r subgroup. This typically involves exponentiating the element to the power of r, which is computationally expensive. Safe-prime groups are more
flexible than Schnorr groups, in that the exponent length may be adaptively selected depending
on the performance requirements. In more recent years, the use of safe-prime groups would appear
to have become increasingly prevalent. Some cryptographic schemes, such as the Diffie-Hellman
key agreement protocol, are agnostic to the choice of group, whereas other schemes, such as DSA,
use Schnorr groups for efficiency reasons.
The National Institute of Standards and Technology (NIST) in the United States standardizes
the use of cryptography in unclassified applications within the federal government. Up until April
of 2018, NIST recommended the use of randomly selected Schnorr groups with moduli of length
2048 bits for Diffie-Hellman key agreement. NIST changed this recommendation in the 3rd revision
of SP 800-56A, and is now advocating the use of a fixed set of safe-prime groups with moduli of
length up to 8192 bits, with short or full length exponents. These groups were originally developed
for TLS and IKE where, again, they are used either with short or full length exponents.
Complexity estimates
To estimate the resource and time requirements for computing discrete logarithms in finite fields for various modulus lengths n, and for the aforementioned parameterization options, we need to decide on what model to use for estimating z as a function of n. Various models have been proposed over the years. For simplicity, we use the same model that NIST uses in SP 800-56A. Note that NIST rounds z to the closest multiple of eight bits.
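For concreteness, the following sketch implements a GNFS-based security-strength model of this kind: the widely used heuristic strength formula for an n-bit modulus, rounded to the nearest multiple of eight bits as described above. It reproduces the z values appearing in Tables 4 and 5 (80, 112, 128, 152, 200, 240 and 272 bits), though the ancillary script remains the authoritative source for the exact model used in this paper.

```python
# Sketch of a GNFS-based security-strength model: heuristic classical security z(n)
# for an n-bit modulus, rounded to the nearest multiple of 8 bits. Reproduces the
# z column of Tables 4 and 5 (e.g. z(2048) = 112, z(3072) = 128).
import math

def gnfs_security_bits(n: int) -> int:
    ln_N = n * math.log(2)   # natural log of an n-bit modulus
    strength = (1.923 * ln_N ** (1 / 3) * math.log(ln_N) ** (2 / 3) - 4.69) / math.log(2)
    return int(8 * round(strength / 8))   # round to a multiple of 8 bits

for n in (1024, 2048, 3072, 4096, 8192, 12288, 16384):
    print(n, gnfs_security_bits(n))   # 80, 112, 128, 152, 200, 240, 272
```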
To compute short discrete logarithms in safe-prime groups, the best option is to use Ekerå-Håstad's algorithm, which is specialized for this purpose. To compute general discrete logarithms in safe-prime or Schnorr groups, one option is to use Ekerå's algorithm. A single correct run of these quantum algorithms suffices for the logarithm to be recovered with ≥ 99% success probability in the classical post-processing. These algorithms do not require the order of the group to be known. See Table 4 and Figure 1 for complexity estimates.
If the group order r is known, a better option for computing general discrete logarithms in safe-prime groups and Schnorr groups when not making tradeoffs is to use Shor's original algorithm, modified to work in the order r subgroup rather than in the whole multiplicative group Z_p^*, and to start with a uniform superposition of all values in each exponent register, as opposed to superpositions of r values. Note that the latter modification is necessary to enable the use of the semi-classical Fourier transform, qubit recycling and the windowing technique.

When using this modified version of Shor's algorithm to compute discrete logarithms, the heuristic analysis shows that a single correct run suffices to compute the logarithm with ≥ 99% success probability, assuming that each of the two exponent registers is padded with 5 bits, and that a small search is performed in the classical post-processing. This implies that Shor's algorithm outperforms Ekerå's algorithm, as Shor's algorithm performs only approximately 2(n_r + 5) group operations per run, compared to approximately 3n_r operations in Ekerå's algorithm; see Table 5 for complexity estimates. This is because Ekerå's algorithm does not require r to be known. In fact, it computes both d and r.
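The comparison can be made concrete with the exponent lengths listed in Tables 4 and 5. The sketch below encodes the approximate per-run group-operation counts stated above; the function names are ours, and the example uses the Schnorr-group parameterization at the 112-bit security level.

```python
# Sketch comparing the approximate number of group operations per run for the two
# general-DLP algorithms discussed above, using the exponent lengths from Tables 4 and 5.
def ops_ekera(n_r: int) -> int:          # Ekerå: n_e = 3 * n_r (r need not be known)
    return 3 * n_r

def ops_shor_modified(n_r: int) -> int:  # modified Shor: two registers of n_r + 5 bits
    return 2 * (n_r + 5)

# Example: a Schnorr group at the 112-bit security level, so n_r = 2z = 224.
n_r = 224
print(ops_ekera(n_r), ops_shor_modified(n_r))   # 672 vs 458 group operations
```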
| Group | n | ne | nd | nr | z | d1 | d2 | δoff | cmul | cexp | csep | Retry Risk | Volume per run (megaqubitdays) | Expected volume (megaqubitdays) | Qubits (megaqubits) | Runtime per run (hours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Schnorr | 1024 | 3nr = 6z | 2z | 2z | 80 | 15 | 25 | 4 | 5 | 5 | 1024 | 10% | 0.2 | 0.2 | 9.2 | 0.4 |
| | 2048 | | | | 112 | 15 | 27 | 3 | 5 | 5 | 1024 | 9% | 0.9 | 1 | 20 | 1.2 |
| | 3072 | | | | 128 | 15 | 27 | 9 | 5 | 5 | 1024 | 18% | 2.4 | 2.9 | 29 | 2 |
| | 4096 | | | | 152 | 17 | 29 | 7 | 4 | 5 | 1024 | 4% | 6.5 | 6.8 | 51 | 3.1 |
| | 8192 | | | | 200 | 17 | 31 | 3 | 4 | 5 | 1024 | 5% | 38 | 40 | 110 | 8.3 |
| | 12288 | | | | 240 | 17 | 31 | 9 | 4 | 5 | 1024 | 9% | 110 | 120 | 170 | 15 |
| | 16384 | | | | 272 | 17 | 31 | 5 | 4 | 5 | 1024 | 17% | 210 | 250 | 220 | 23 |
| Safe-prime (short exponent) | 1024 | 3nd = 6z | 2z | n-1 | 80 | 15 | 25 | 4 | 5 | 5 | 1024 | 10% | 0.2 | 0.2 | 9.2 | 0.4 |
| | 2048 | | | | 112 | 15 | 27 | 3 | 5 | 5 | 1024 | 9% | 0.9 | 1 | 20 | 1.2 |
| | 3072 | | | | 128 | 15 | 27 | 9 | 5 | 5 | 1024 | 18% | 2.4 | 2.9 | 29 | 2 |
| | 4096 | | | | 152 | 17 | 29 | 7 | 4 | 5 | 1024 | 4% | 6.5 | 6.8 | 51 | 3.1 |
| | 8192 | | | | 200 | 17 | 31 | 3 | 4 | 5 | 1024 | 5% | 38 | 40 | 110 | 8.3 |
| | 12288 | | | | 240 | 17 | 31 | 9 | 4 | 5 | 1024 | 9% | 110 | 120 | 170 | 15 |
| | 16384 | | | | 272 | 17 | 31 | 5 | 4 | 5 | 1024 | 17% | 210 | 250 | 220 | 23 |
| Safe-prime (full length exponent) | 1024 | 3nr = 3(n-1) | n-1 | n-1 | 80 | 15 | 27 | 9 | 5 | 5 | 1024 | 10% | 1.1 | 1.2 | 9.7 | 2.7 |
| | 2048 | | | | 112 | 17 | 29 | 6 | 4 | 5 | 1024 | 6% | 12 | 12 | 26 | 11 |
| | 3072 | | | | 128 | 17 | 31 | 5 | 4 | 5 | 1024 | 5% | 41 | 43 | 41 | 24 |
| | 4096 | | | | 152 | 19 | 31 | 7 | 4 | 5 | 1024 | 9% | 97 | 110 | 55 | 43 |
| | 8192 | | | | 200 | 19 | 33 | 4 | 4 | 5 | 1024 | 8% | 960 | 1100 | 140 | 180 |
| | 12288 | | | | 240 | 19 | 33 | 3 | 4 | 5 | 1024 | 21% | 3300 | 4100 | 200 | 390 |
| | 16384 | | | | 272 | 21 | 35 | 4 | 4 | 5 | 1024 | 16% | 9100 | 11000 | 320 | 700 |
Table 4: Computing discrete logarithms using Ekerå-Håstad's and Ekerå's algorithms. This table was produced by the script in the ancillary file "estimate costs.py".
| Group | n | ne | nd | nr | z | d1 | d2 | δoff | cmul | cexp | csep | Retry Risk | Volume per run (megaqubitdays) | Expected volume (megaqubitdays) | Qubits (megaqubits) | Runtime per run (hours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Schnorr | 1024 | 2(nr+5) | 2z | 2z | 80 | 15 | 25 | 6 | 5 | 5 | 1024 | 8% | 0.1 | 0.1 | 9.2 | 0.3 |
| | 2048 | | | | 112 | 15 | 27 | 8 | 5 | 5 | 1024 | 6% | 0.6 | 0.7 | 20 | 0.8 |
| | 3072 | | | | 128 | 15 | 27 | 6 | 5 | 5 | 1024 | 13% | 1.6 | 1.8 | 29 | 1.3 |
| | 4096 | | | | 152 | 17 | 27 | 3 | 5 | 5 | 1024 | 25% | 3.3 | 4.4 | 39 | 2.1 |
| | 8192 | | | | 200 | 17 | 29 | 5 | 4 | 5 | 1024 | 11% | 23 | 26 | 110 | 5.5 |
| | 12288 | | | | 240 | 17 | 31 | 4 | 4 | 5 | 1024 | 8% | 68 | 74 | 170 | 10 |
| | 16384 | | | | 272 | 17 | 31 | 5 | 4 | 5 | 1024 | 12% | 140 | 160 | 220 | 16 |
| Safe-prime | 1024 | 2(nr+5) | n-1 | n-1 | 80 | 15 | 27 | 3 | 5 | 5 | 1024 | 8% | 0.7 | 0.8 | 9.7 | 1.8 |
| | 2048 | | | | 112 | 17 | 29 | 3 | 4 | 5 | 1024 | 5% | 7.4 | 7.8 | 26 | 7 |
| | 3072 | | | | 128 | 17 | 29 | 4 | 4 | 5 | 1024 | 12% | 25 | 29 | 38 | 16 |
| | 4096 | | | | 152 | 17 | 31 | 4 | 4 | 5 | 1024 | 8% | 65 | 70 | 55 | 29 |
| | 8192 | | | | 200 | 19 | 33 | 3 | 4 | 5 | 1024 | 6% | 640 | 680 | 140 | 120 |
| | 12288 | | | | 240 | 19 | 33 | 4 | 4 | 5 | 1024 | 15% | 2200 | 2600 | 200 | 260 |
| | 16384 | | | | 272 | 21 | 35 | 2 | 4 | 5 | 1024 | 12% | 6100 | 6900 | 320 | 470 |
Table 5: Computing discrete logarithms using the modified version of Shor's algorithm described above. This table was produced by the script in the ancillary file "estimate costs.py".
Note that for safe-prime groups, r = (p - 1)/2, so when p is known to the adversary then so is r. For Schnorr groups, it may be that r is unknown to the adversary, especially if the group is randomly selected. It may be hard to compute r classically, as it amounts to finding a 2z-bit prime factor of p - 1.
Implications for elliptic curve discrete logarithms
Over the past decades cryptographic schemes based on the intractability of the DLP in finite fields and the RSA integer factoring problem have gradually been replaced by cryptography based on the intractability of the DLP in elliptic curve groups. This is reflected in standards issued by organizations such as NIST.
Not all optimizations developed in this paper are directly applicable to arithmetic in elliptic
curve groups. It is an interesting topic for future research to study to what extent the optimizations
developed in this paper may be adapted to optimize such arithmetic operations.
This paper should not be perceived to indicate that the RSA integer factoring problem and the DLP in finite fields are in themselves less complex than the DLP in elliptic curve groups on quantum computers. The feasibility of similarly optimizing quantum algorithms for the latter problem must first be properly studied.
On the importance of complexity estimates
It is important to estimate the complexity of attacking widely deployed asymmetric cryptographic schemes using future large-scale quantum computers. Such estimates enable informed decisions to be made on when to mandate migration from existing schemes to post-quantum secure schemes.
For cryptographic schemes that are used to protect confidentiality, such as encryption and key agreement schemes, a sufficiently long period must elapse between the point in time when the scheme ceases to be used, and the point in time when the scheme is projected to become susceptible to practical attacks. This is necessary so as to ensure that the information that has been afforded protection with the scheme is no longer sensitive once the scheme becomes susceptible to practical attacks. This is because one must assume that encrypted information may be recorded and archived for decryption in the future.
If the information you seek to protect is to remain confidential for 25 years, you must hence
stop using asymmetric schemes such as RSA and Diffie-Hellman at least 25 years before quantum
computers capable of breaking these schemes become available to the adversary. For cryptographic
schemes that are used to protect authenticity, or for authentication, such as signature schemes, it
suffices to migrate to post-quantum secure schemes only right before the schemes become susceptible to practical attacks. This is an important distinction.
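The timing rule for confidentiality protection amounts to simple arithmetic, sketched below purely for illustration; the year used in the example is hypothetical.

```python
# Minimal sketch of the timing rule above for confidentiality protection: a scheme
# must be retired at least `confidentiality_years` before the year in which attacks
# against it are projected to become practical. The example year is hypothetical.
def latest_safe_usage_year(attack_feasible_year: int, confidentiality_years: int) -> int:
    return attack_feasible_year - confidentiality_years

print(latest_safe_usage_year(2045, 25))   # 2020 (illustrative only)
```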
On early adoption of post-quantum secure schemes
The process of transitioning to post-quantum secure schemes has already begun. However, no
established or universally recognized standards are as yet available. Early adopters may therefore
wish to consider implementing schemes conjectured to be post-quantum secure alongside existing
classically secure schemes, in such a fashion that both schemes must be broken for the combined
hybrid scheme to be broken.
Future work
Investigate asymptotically efficient multiplication
The multiplication circuits that we are using have a Toffoli count that scales quadratically (up to polylog factors). There are multiplication circuits with asymptotically better Toffoli counts. For example, the Karatsuba algorithm has a Toffoli count of O(n^(lg 3)) and the Schönhage–Strassen algorithm has a Toffoli count of O(n lg n lg lg n). However, there are difficulties when attempting to use these asymptotically efficient algorithms in the context of Shor's algorithm.
The first difficulty is that efficient multiplication algorithms are typically classical, implemented with non-reversible computation in mind. They need to be translated into a reversible form, which is not trivial. For example, a naive translation of Karatsuba multiplication will result in twice as many recursive calls at each level (due to the need to uncompute intermediate values), increasing the asymptotic Toffoli count from O(n^(lg 3)) to O(n^(lg 6)). Attempting to fix this problem can result in the space complexity increasing, though it is possible to solve this problem.
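The effect of the extra recursive calls can be seen directly from the recursion tree. The sketch below counts base-case multiplications for the standard three-call Karatsuba recursion and for a naive reversible translation with twice as many calls per level; the branching factors follow the argument above, and the code is only illustrative.

```python
# Sketch of the recursion-count argument above. Standard Karatsuba makes 3 half-size
# calls per level, giving ~n^(lg 3) leaf multiplications; a naive reversible version
# that also uncomputes makes twice as many calls per level, giving ~n^(lg 6).
import math

def base_case_multiplications(n: int, branching: int) -> int:
    """Count leaf multiplications in a recursion that halves n at each level."""
    if n <= 1:
        return 1
    return branching * base_case_multiplications(n // 2, branching)

n = 2048
print(base_case_multiplications(n, 3), round(n ** math.log2(3)))  # ~177147 = n^1.585
print(base_case_multiplications(n, 6), round(n ** math.log2(6)))  # ~3.6e8  = n^2.585
```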
The second difficulty is constant factors, both in workspace and in Toffoli count. Clever multiplication circuits have better Toffoli counts for sufficiently large n but, when we do back-of-the-envelope estimates, "sufficiently large" is beyond n = 2048. This difficulty is made worse if the
multiplier is incompatible with optimizations that work on naive multipliers, such as windowed
arithmetic and the coset representation of modular integers. Clever multiplication circuits also
tend to use additional workspace, and it is necessary to contrast using the better multiplier against
the opportunity cost of using the space for other purposes (such as distillation of magic states).
For example, the first step of the Schönhage–Strassen algorithm is to pad the input up to double length, then split the padded register into pieces and pad each piece up to double length. The target register quadruples in size before even getting into the details of performing the number theoretic transform! This large increase in space means that the quantum variant of the Schönhage–Strassen algorithm is competing with the many alternative opportunities one has when given 6000 more logical qubits of workspace (at n = 2048). For example, that's enough space for fifty additional CCZ factories.
Can multipliers with asymptotically lower Toffoli counts help at practical sizes, such as n = 2048 bits? We believe that they do not, but only a careful investigation can properly determine the answer to this question.
Optimize distillation
To produce our magic states, we use slightly-modified copies of the CCZ factory. We have many ideas for improving on this approach. First, the factory we are using is optimized for the case where it is working in isolation. But, generally speaking, error correcting codes get better when working over many objects instead of one object. It is likely that a factory using a block code to produce multiple CCZ states at once would perform better than the factory we used; for example, a distillation protocol that produces good CCZ states ten at a time has previously been described.
Second, we have realized there are two obvious-in-hindsight techniques that could be used to reduce the volume of the factories in that paper. First, we assumed that topological errors that can occur within the factories were undetected, but actually many of them are heralded as distillation failures. By taking advantage of this heralding, it should be possible to reduce the code distance used in many parts of the factories. Second, when considering the set of S gates to apply to correct T gate teleportations performed by the factories, there are redundancies that we were not previously aware of. In particular, for each measured X stabilizer, one can toggle whether or not every qubit in that stabilizer has an S gate applied to it. This freedom makes it possible to e.g. apply dynamically chosen corrections while guaranteeing that the number of S gate fixups in the 15-to-1 T factory is at most 5 (instead of 15), or to ensure a particular qubit will never need an S gate fixup (removing some packing constraints).
Third, we believe it should be possible to use detection events produced by the surface code to estimate how likely it is that an error occurred, and that this information can be used to discard "risky factory runs" in a way that increases reliability. This would allow us to trade the increased reliability for a decreased code distance. As an extreme example, suppose that instead of attempting to correct errors during a factory run we simply discarded the run if there were any detection events where a local stabilizer flipped. Then the probability that a factory run would complete without being discarded would be approximately zero, but when a run did pass the chance of error would be amazingly low. We believe that by picking the right metric (e.g. number of detections or diameter of alternating tree during matching), then interpolating a rejection threshold between the extreme no-detections-anywhere rule and the implicit hope-all-errors-were-corrected rule that is used today, there will be a middle ground with lower expected volume per magic state. (Even more promisingly, this thresholded error estimation technique should work on any small state production task, and almost all quantum computation can be reformulated as a series of small state production tasks.)
Reducing the volume of distillation would improve the space and time estimates in this paper.
But turning some combination of the above ideas into a concrete factory layout, with understood
error behavior, is too large of a task for one paper. Of course, the problem of finding small factories
is a problem that the field has been exploring for some time. In fact, in the week before we first released this paper, Daniel Litinski made a paper on this topic available. He independently arrived at the idea of using distillation to herald errors within the factory, and provided rough numerics indicating this may reduce the volume of distillation by as much as a factor of 10. Therefore we leave optimizing distillation not as future work, but as ongoing work.
Optimize qubit encoding
Most of the logical qubits in our construction spend most of their time sitting still, waiting for other qubits to be processed. It should be possible to store these qubits in a more compact, but less computationally convenient, form. A simple example is that resting qubits do not need the "padding layer" between qubits implicit in Figure 8, because this layer is only there to make surgery easier. Another example is that there may be ways to pack lattice surgery qubits that use less area while preserving the code distance (e.g. the "bulbs" shown in Figure 9).
Figure 9: A possible dense packing for idle qubits, using fewer physical qubits per logical qubit than the standard layout. In the diagram, there is one distance 13 logical qubit per "bulb". The X and Z observables of one of the logical qubits are shown in red and blue respectively. The pattern continues past the cut-off at the left and right sides.
One could also imagine encoding the resting qubits into a block code. The difficulty is in finding a block code that a) works with small groups of qubits, b) can be encoded and decoded fast enough and compactly enough to fit into the existing computation, and c) is sufficiently better than the surface code that the benefits (reduced code distance within the surface code) outweigh the costs (redundant additional qubits).
Finally, there are completely different ways of storing information in the surface code. For example, qubits stored using dislocations could be denser than qubits stored using lattice surgery. However, dislocations also create runs of stabilizers over not-quite-adjacent physical qubits. Measuring these stabilizers requires additional physical operations, creating more opportunities for errors, and so errors will propagate more quickly along these runs.
Do the "qubit houses" in Figure 9 actually work, or is there some unexpected error mechanism?
Can dislocation qubits be packed more tightly than lattice surgery qubits while achieving an
equivalent logical error rate? Is there a block code that is sufficiently beneficial when layered over
surface code qubits? Until careful simulations are done it will be unclear what the answers to these questions are, and so we leave them to future work.
Distribute the computation
An interesting consequence of how we parallelize addition, by segmenting registers into pieces terminated by carry runways, is that there is limited interaction between the different pieces. In fact, the additions themselves require no communication between the pieces; it is only the lookup procedure preparing the input of the additions that requires communication. If we place each piece on a separate quantum computer, only a small amount of communication is needed to seed the lookups and keep the computation progressing.
Given the parameters we chose for our construction, each lookup is controlled by ten address qubits. Five of those qubits come from the exponent and are re-used hundreds of times. The other five come from the factor register, which is being iterated over during multiplication. If the computation was to be distributed over multiple machines, the communication cost would be dominated by broadcasting the successive groups of five qubits from the factor register.
Recall from Section 2 that each lookup addition takes approximately 37 milliseconds. Five qubits must be broadcast per lookup addition. Therefore, if each quantum computer had a 150 qb/s quantum channel, the necessary qubits could be communicated quickly enough to keep up with the lookup additions. This would distribute the factoring computation.
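A quick sanity check of the bandwidth figure, using only the numbers quoted above (five broadcast qubits per lookup addition, roughly 37 milliseconds per lookup addition):

```python
# Back-of-the-envelope check of the quoted bandwidth: five factor-register qubits
# must be broadcast per lookup addition, and each lookup addition takes ~37 ms,
# so the required rate is comfortably under a 150 qb/s channel.
qubits_per_lookup = 5
lookup_addition_seconds = 0.037
required_rate = qubits_per_lookup / lookup_addition_seconds
print(f"{required_rate:.0f} qb/s required")   # ~135 qb/s
```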
For example, a network topology where the computers were arranged into a linear chain and
were constantly forwarding the next set of factor qubits to each other would be sufficient. Each
quantum computer in the distributed computation would have a computational layout basically
equivalent to the left half (or right half) of Figure 7. Factoring a 2048 bit number could be performed with two machines each having perhaps 11 million physical qubits, instead of one machine with 20 million physical qubits.
We can decrease the size of each individual quantum computer by decreasing the piece size
we use for the additions. For example, suppose we used a smaller addition piece size and distributed an n = 2048 bit RSA factoring computation over 8 machines. Normally using smaller
pieces would proportionally accelerate the addition, but the number of magic state factories has not
changed (they have simply been distributed) so each lookup addition will still take approximately
the same amount of time. So a 150 qb/s quantum channel per computer should still be sufficient.
One caveat here is that each of these much smaller machines has to run its own copy of the
lookup operation, and because there are so few factories per machine the lookup will take much
longer. The optimal window sizes of the lookups will decrease, and the total computation time
will approximately double.
Based on this surface-level analysis it seems that, instead of using 1 machine with 20 million qubits, we could use 8 machines each with perhaps 4 million qubits, as long as they were connected by quantum channels with bandwidths of 150 qb/s. But we have not carefully explored the question
of how to distribute a quantum computation structured in this way and, although there has been
significant previous work done on distributing quantum computations, we think
the piecewise nature of carry runway adders makes them well suited to a specialized analysis.
Revisit elliptic curves
Many of the optimization techniques that we use in this paper generalize to other contexts where arithmetic is performed. In particular, consider the Toffoli count for computing discrete logarithms over elliptic curves. It can likely be improved substantially by using windowed arithmetic. On the other hand, because of the need to compute modular inverses, it is not clear if the coset representation of modular integers is applicable.
Which of our optimizations can be ported over, and which ones cannot? How much of an
improvement would result? These are interesting questions for future research work.
Conclusion
In this paper, we combined several techniques and optimizations into an efficient construction for factoring integers and computing discrete logarithms over finite fields. We estimated the approximate cost of our construction, both in the abstract circuit model and under plausible physical assumptions for large-scale quantum computers based on superconducting qubits. We presented concrete cost estimates for several cryptographically relevant problems. Our estimated costs are orders of magnitude lower than in previous works with comparable physical assumptions.
Mosca poses the rhetorical question: "How many physical qubits will we need to break RSA-2048? Current estimates range from tens of millions to a billion physical qubits". Our physical assumptions are more pessimistic than the physical assumptions used in that paper (see Table 2) so it is reasonable to say that, in the four years since 2015, the upper end of the estimate of how many qubits will be needed to factor 2048 bit RSA integers has dropped nearly two orders of magnitude; from a billion to twenty million.
Clearly the low end of Mosca's estimate should also drop. However, the low end of the estimate is highly sensitive to advances in the design of quantum error correcting codes, the engineering of physical qubits, and the construction of quantum circuits. Predicting such advances is beyond the scope of this paper.
Post-quantum cryptosystems are in the process of being standardized, and small-scale ex-
periments with deploying such systems on the internet have been performed. However, a
considerable amount of work remains to be done to enable large-scale deployment of post-quantum
cryptosystems. We hope that this paper informs the rate at which this work needs to proceed.