Naively computing sine

Suppose you need to write software to compute the sine function. You were told in a calculus class that this is done using Taylor series—it’s not, but that’s another story—and so you start writing code to implement Taylor series.

How many terms do you need? Uh, …, 20? Let’s go with that.

from math import sin, factorial

def sine(x):
    s = 0
    N = 20
    for n in range(N//2):
        m = 2*n + 1
        s += (-1)**n * x**m / factorial(m)
    return s

You test your code against the built-in sin function and things look good.

    >>> sin(1)
    0.8414709848078965
    >>> sine(1)
    0.8414709848078965

Larger arguments

Next, just for grins, you try x = 100.

    >>> sin(100)
    -0.5063656411097588
    >>> sine(100)
    -7.94697857233433e+20

Hmm. Off by 20 orders of magnitude.

So you do a little pencil and paper calculation and determine that the code should work if you set N = 300 in the code above. You try again, and now you’re off by 25 orders of magnitude. And when you try entering 100.0 rather than 100, the program crashes.

What’s going on? In theory, setting N = 300 should work, according to the remainder term in the Taylor series for sine. But the largest representable floating point number is on the order of 10^308, and so the x**m terms in the code overflow. The reason the program didn’t crash when you typed in 100 with no decimal point is that Python evaluated x**m in integer arithmetic.

Even if you were patient and willing to use a huge number of terms in your Taylor series, you couldn’t include enough terms to get accurate results for even moderately large arguments: overflow would kill you before you could get the series truncation error down. Unless you used extended precision floating point. But that would be using heroic measures to rescue a doomed approach.

Taylor series are good local approximations. They’re only valid within a radius of convergence; outside of that radius they’re worthless even in theory. Inside the radius they may be valid in theory but not in practice, as is the case above.

Range reduction

Taylor series isn’t the best approach for evaluating sine numerically, but you could make it work if you limited yourself to a small range of angles, say 0 to π/4, and used trig identities to reduce the general problem to computing sine in the range [0, π/4]. So how do you do that?

Let’s look at reducing arguments to live in the interval [0, 2π]. In practice you’d want to reduce the range further, but that requires keeping up with what quadrant you’re in etc. We’ll just consider [0, 2π] for simplicity.

The natural thing to do is subtract off the largest multiple of 2π you can.

    >>> from math import pi
    >>> k = int(100/(2*pi))
    >>> x = 100 - k*2*pi
    >>> sine(x)
    -0.5065317636696883
    >>> sin(100)
    -0.5063656411097588

That’s better. Now we don’t overflow, and we get three correct decimal places. We could increase N and get more. However, there’s a limit to how well we could do, as the calculation below shows.

    >>> sin(x)
    -0.5063656411097495
    >>> sin(100)
    -0.5063656411097588

When we reduce 100 mod 2π and call sin(x), we get a result that differs from sin(100) in the last three decimal places. That’s not terrible, depending on the application, but the loss of precision increases with the size of x and would be a complete loss for large enough values of x.

Let’s look back at where we divided 100 by 2π:

    >>> 100/(2*pi)
    15.915494309189533

When we throw away the integer part (15) and keep the rest (.915494309189533) we lose two figures of precision. The larger the integer part, the more precision we lose.

Our naive attempt at range reduction doesn’t work so well, but there are clever ways to make range reduction work. It is possible, for example, to compute the sine of a googol (i.e. sin 10^100).
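To see what careful range reduction looks like, here’s a minimal sketch that reduces x = 100 mod 2π while carrying 50 significant digits, using Python’s standard decimal module. The 50-digit value of 2π below is a constant entered by hand.

```python
# A sketch of range reduction carried out in extended precision, using the
# standard-library decimal module, so the reduced argument is accurate to
# full double precision.
from decimal import Decimal, getcontext, ROUND_FLOOR
from math import sin

getcontext().prec = 50
TWO_PI = Decimal("6.2831853071795864769252867665590057683943387987502")

x = Decimal(100)
k = (x / TWO_PI).to_integral_value(rounding=ROUND_FLOOR)
r = float(x - k * TWO_PI)  # reduced argument in [0, 2*pi)

print(sin(r))    # matches sin(100) to nearly full precision
print(sin(100))
```

Because the division and subtraction are done with 50 digits, the figures lost to the integer part cost us nothing.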

At first it seems simple to reduce the range. Then when you become aware of the precision problems and think about it a while, it seems precise range reduction is impossible. But if you keep thinking about it long enough you might find a way to make it work. A lot of effort went into range reduction algorithms years ago, and now they’re part of the computational infrastructure that we take for granted.

Computing the Euler-Mascheroni Constant

The Euler-Mascheroni constant is defined as the limit

\gamma = \lim_{n\to\infty} \left(\sum_{k=1}^n \frac{1}{k} \right) - \log n

So an obvious way to try to calculate γ would be to evaluate the right-hand side above for large n. This turns out to not be a very good approach. Convergence is slow and rounding error accumulates.

A much better approach is to compute

\gamma = \lim_{n\to\infty} \left(\sum_{k=1}^n \frac{1}{k} \right) - \log\left( n + \tfrac{1}{2}\right)

It’s not obvious that you can add the extra ½ inside the logarithm, but you can: the right-hand sides of both equations above converge to γ, but the latter converges faster.

Here’s a comparison of the two methods.

The error in the first method with n = 4000 is comparable to the error in the second method with n = 16.
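The comparison is easy to run. Here’s a sketch, with a reference value of γ hard-coded to double precision:

```python
from math import log

GAMMA = 0.5772156649015329   # reference value of the Euler-Mascheroni constant

def gamma_naive(n):
    # partial sum of the harmonic series minus log n
    return sum(1/k for k in range(1, n+1)) - log(n)

def gamma_better(n):
    # same, but with log(n + 1/2); error falls like 1/n^2 instead of 1/n
    return sum(1/k for k in range(1, n+1)) - log(n + 0.5)

print(abs(gamma_naive(4000) - GAMMA))   # on the order of 1e-4
print(abs(gamma_better(16) - GAMMA))    # also on the order of 1e-4
```

With n = 16 the naive method is still off in the second decimal place, while the modified method has already matched the accuracy the naive method needs n = 4000 to reach.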

Even though the second equation above is better for numerical evaluation, there are much more sophisticated approaches. And this brings us to y-cruncher, software that has been used to set numerous world records for computing various constants. A new record for computing the digits of π was set a few weeks ago using y-cruncher. From the y-cruncher documentation:

Of all the things that y-cruncher supports, the Euler-Mascheroni Constant is the slowest to compute and by far the most difficult to implement. …

The Euler-Mascheroni Constant has a special place in y-cruncher. It is one of the first constants to be supported (even before Pi) and it is among the first for which a world record was set using y-cruncher. As such, y-cruncher derives its name from the gamma symbol for the Euler-Mascheroni Constant.


The bad version of a good test

Ever since 1952, the largest known primes have all had the form 2^n − 1, with one exception in 1989. The reason the largest known primes have this form is that it is easier to test numbers of this form for primality than arbitrary numbers.

A number of the form 2^n − 1 is called a Mersenne number, denoted M_n. A Mersenne prime is a Mersenne number that is also prime. There is a theorem that says that if M_n is prime then n must be prime.

Lehmer’s theorem

Derrick Henry Lehmer proved in 1930 that M_p is prime if p is an odd prime and M_p divides s_(p−2), where the s terms are defined recursively as follows: s_0 = 4 and s_n = s_(n−1)² − 2 for n > 0. This is the Lucas-Lehmer primality test for Mersenne numbers, the test alluded to above.

Let’s try this out on M_7 = 2^7 − 1 = 127. We need to test whether 127 divides s_5. The first six values of s_n are

4, 14, 194, 37634, 1416317954, 2005956546822746114

and indeed 127 does divide 2005956546822746114. As you may have noticed, s_5 is a big number, and s_6 will be a lot bigger. It doesn’t look like Lehmer’s theorem is going to be very useful.

The missing piece

Here’s the idea that makes the Lucas-Lehmer [1] test practical. We don’t need to know s_(p−2) per se; we only need to know whether it is divisible by M_p. So we can carry out all our calculations mod M_p. In our example, we only need to calculate the values of s_n mod 127:

4, 14, 67, 42, 111, 0

Much better.

The practical version of the Lucas-Lehmer test takes the remainder by M_p at every step along the way when computing the recursion for the s terms. The naive version does not, but computes the s terms directly.

To make the two versions of the test explicit, here are Python implementations.

Naive code

def lucas_lehmer_naive(p):
    M = 2**p - 1
    s = 4
    for _ in range(p-2):
        s = s*s - 2
    return s % M == 0

Good code

def lucas_lehmer(p):
    M = 2**p - 1
    s = 4
    for _ in range(p-2):
        s = (s*s - 2) % M
    return s == 0
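As a quick sanity check, here is the good version run over the odd primes up to 31. The Mersenne primes among them are known to be exactly those with exponents 3, 5, 7, 13, 17, 19, and 31.

```python
def lucas_lehmer(p):
    # M_p = 2^p - 1 is prime iff s_(p-2) is divisible by M_p,
    # where s_0 = 4 and s_n = s_(n-1)^2 - 2, reduced mod M_p at each step
    M = 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s*s - 2) % M
    return s == 0

mersenne_exponents = [p for p in [3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
                      if lucas_lehmer(p)]
print(mersenne_exponents)  # [3, 5, 7, 13, 17, 19, 31]
```

The test correctly rejects p = 11, 23, and 29, whose Mersenne numbers are composite.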

The bad version

How bad is the naive version of the Lucas-Lehmer test?

The OEIS sequence A003010 consists of the sn. OEIS gives the following formula:

s_n = \left\lceil (2 + \sqrt{3})^{2^n}\right\rceil

This shows that the s_n grow extremely rapidly, which suggests the naive algorithm will hit a wall fairly quickly.

When I used the two versions of the test above to verify that the first few Mersenne primes M_p really are prime, the naive version got stuck after p = 19. The better version immediately worked through p = 9689 before slowing down noticeably.

Historical example

Let’s look at M_521. This was the first Mersenne number verified by a computer to be prime. This was done in 1952 on the vacuum tube-based SWAC computer, using the smart version of the Lucas-Lehmer test.

Carrying out the Lucas-Lehmer test to show that this number is prime requires doing modular arithmetic with 521-bit numbers, which was well within the SWAC’s 9472 bits of memory.

But it would not be possible now, or ever, to verify that M_521 is prime using the naive version of the Lucas-Lehmer test, because we could not store, much less compute, s_519.

Since

\log_{2} s_{519} = 2^{519} \log_{2} (2 + \sqrt{3}) \approx 2^{520}

we’d need around 2^520 bits to store s_519. The number of bits we’d need is itself a 157-digit number. We’d need more bits than there are particles in the universe, which has been estimated to be on the order of 10^80.

The largest Mersenne prime that the SWAC could verify using the naive Lucas-Lehmer test would have been M_13 = 8191, which was found to be prime in the Middle Ages.


[1] Édouard Lucas conjectured in 1878 the theorem that Lehmer proved in 1930.

Multiplying a matrix by its transpose

An earlier post claimed that there are practical advantages to partitioning a matrix, that is, thinking of the matrix as a matrix of matrices. This post will give an example.

Let M be a square matrix and suppose we need to multiply M by its transpose M^T. We can compute this product faster than multiplying two arbitrary matrices of the same size by exploiting the fact that MM^T is a symmetric matrix.

We start by partitioning M into four blocks

M = \begin{bmatrix}A & B \\ C & D \end{bmatrix}

Then

M^\intercal = \begin{bmatrix}A^\intercal & C^\intercal \\ B^\intercal & D^\intercal \end{bmatrix}

and

MM^\intercal = \begin{bmatrix} AA^\intercal  + BB^\intercal & AC^\intercal  + BD^\intercal \\CA^\intercal + DB^\intercal & CC^\intercal + DD^\intercal \end{bmatrix}

Now for the first clever part: We don’t have to compute both of the off-diagonal blocks because each is the transpose of the other. So we can reduce our calculation by 25% by not calculating one of the blocks, say the lower left block.

And now for the second clever part: apply the same procedure recursively. The diagonal blocks in MM^T involve a matrix times its transpose. That is, we can partition A and use the same idea to compute AA^T, and do the same for BB^T, CC^T, and DD^T. The off-diagonal blocks require general matrix multiplications.

The net result is that we can compute MM^T in about 2/3 the time it would take to multiply two arbitrary matrices of the same size.

Recently a group of researchers found a way to take this idea even further, partitioning a matrix into a 4 by 4 grid of 16 blocks and applying some clever tricks. Their RXTX algorithm can compute MM^T in about 26/41 the time required to multiply arbitrary matrices, a savings of about 5% over the 2/3 above. A 5% improvement may be significant if it appears in the inner loop of a heavy computation. According to the authors, “The algorithm was discovered by combining Machine Learning-based search methods with Combinatorial Optimization.”


A bit-twiddling marvel

Pop quiz: what does the following code do?

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

It determines whether the year y is a leap year in the Gregorian calendar, of course. :)

It’s very efficient, though I don’t imagine that would ever matter. But it’s very clever.

Gregorian vs Julian calendar

A year is a leap year in the Julian calendar if and only if it is a multiple of 4. But the Julian year is a bit too long on average to match the earth’s orbit around the sun. You get a much better fit if you add the complication that years divisible by 100 but not by 400 are not leap years.

Presumably no one reading this recalls 1900 not being a leap year, but some of you may need to know some day that 2100 is not a leap year either.

C code

The cryptic function above comes from the recent post A leap year check in three instructions by Falk Hüffner. The function is correct for the next hundred thousand years (more precisely, through the year 102499), and also correct if we anachronistically extend the Gregorian calendar back to the year 0.

The following C code shows empirically that Hüffner’s code is correct, but you’ll have to read his post to understand why it is correct.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

bool is_leap_year(uint32_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 100) != 0) return true;
    if ((y % 400) == 0) return true;
    return false;
}

int main() {
    for (uint32_t y = 0; y <= 102499; y++)
        if (is_leap_year_fast(y) != is_leap_year(y))
            printf("Exception: %u\n", y);
    return 0;
}


Decimal Separator and Internationalization

This morning I ran across the following tip from Joost Helberg on Mastodon:

TIL don’t report numbers with three digits after the decimal point. People may interpret the decimal point as a thousands separator. Using 2 or 4 digits, although wrong, avoids off by a factor thousand errors.

I usually report four decimal places, but I hadn’t thought about that in relation to the decimal separator problem.

In software development, it’s best to let a library handle numeric input and output according to local conventions. Failure to do so can lead to problems, as I learned in a way I’ll never forget.
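Python’s standard locale module is one such library. A minimal sketch; the French locale name "fr_FR.UTF-8" is an assumption about what’s installed on the system, so the code falls back to the portable "C" locale.

```python
# Locale-aware numeric formatting with the standard-library locale module.
import locale

try:
    locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")
except locale.Error:
    locale.setlocale(locale.LC_NUMERIC, "C")  # fall back if not installed

# format_string and atof honor the locale's decimal and grouping separators,
# so the same code handles both 3.14-style and 3,14-style conventions.
s = locale.format_string("%.2f", 1234.56, grouping=True)
print(s)  # "1 234,56" under French conventions, "1234.56" under C
```

The point is that both formatting and parsing go through the same locale-aware layer, so validation and parsing can’t disagree the way they did in the story below.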

Embarrassed in Bordeaux

In 2006, Peter Thall and I gave a week-long course on Bayesian clinical trial design in Bordeaux, France. Part of the course was presenting software that my team at MD Anderson Cancer Center had developed.

A few minutes into my first presentation I realized the software wasn’t working for the course attendees. The problem had to do with the US and France using opposite conventions for decimal separator and thousands separator. I had tested our software on a French version of Windows, but I had entered integers in my testing and decimals in my presentation.

I apologized and asked my French audience to enter decimals in the American style, such as 3.14 rather than 3,14. But that didn’t work either!

We were using a Windows API for parsing input, which correctly handles input and output per local conventions. But we had written custom validation code. We checked that the input fields contained only valid numeric characters, i.e. digits and periods. Oops!

Users were between a rock and a hard place. The input validation would not accept French notation, and the parsing code would not accept American notation.

The solution was for the attendees to set their operating system locale to the US. They were gracious about having to apply the hack and said that it was a common problem. It was a humiliating way to start the course, but the rest of the week went well.

Formulating eight queens as a SAT problem

The Boolean satisfiability problem is to determine whether there is a way to assign values to variables in a set of Boolean formulas to make the formulas hold [1]. If there is a solution, the next task would be to enumerate the solutions.

You can solve the famous eight queens problem, or its generalization to n-queens, by formulating the problem as a Boolean formula then using a SAT solver to find the solutions.

It’s pretty obvious how to start. A chessboard is an 8 by 8 grid, so you have a variable for each square on the board, representing whether or not that square holds a queen. Call the variables b_ij, where i and j run from 1 to 8.

The requirement that every row contains exactly one queen can be turned into two subrequirements:

  1. Each row contains at least one queen.
  2. Each row contains at most one queen.

The first requirement is easy. For each row i, we have a clause

b_i1 ∨ b_i2 ∨ b_i3 ∨ … ∨ b_i8

The second requirement is harder. How do you express in terms of our Boolean variables that there is no more than one queen in each row? This is the key difficulty. If we can solve this problem, then we’re essentially done. We just need to do the analogous thing for columns and diagonals. (We don’t require a queen on every diagonal, but we require that there be at most one queen on each diagonal.)

First approach

There are two ways to encode the requirement that every row contain at most one queen. The first is to use implication. If there’s a queen in the first column, then there is not a queen in the remaining columns. If there’s a queen in the second column, then there is not a queen in any column but the second, etc. We have an implication for each column in each row. Let’s just look at the first row and first column.

b_11 ⇒ ¬(b_12 ∨ b_13 ∨ … ∨ b_18)

We can turn an implication of the form a ⇒ b into the clause ¬a ∨ b.

Second approach

The second way to encode the requirement that every row contain at most one queen is to say that for every pair of squares (a, b) in a row, either a has no queen or b has no queen. So for the first row we would have C(8, 2) = 28 clauses, because there are 28 ways to choose pairs from a set of 8 things.

(¬b_11 ∨ ¬b_12) ∧ (¬b_11 ∨ ¬b_13) ∧ … ∧ (¬b_17 ∨ ¬b_18)

An advantage of this approach is that it directly puts the problem into conjunctive normal form (CNF). That is, our formula is a conjunction of terms that contain only disjunctions, an AND of ORs.
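Here’s a sketch of generating the row clauses in code, using the pairwise encoding and the DIMACS-style convention of signed integers for literals. The square-numbering scheme is my own choice, not anything mandated by the problem.

```python
# Generate CNF clauses for the row constraints of eight queens.
# Variables are numbered 1..64, with variable 8*(i-1) + j standing for
# "a queen is on square (i, j)"; a negative literal is a negation.
from itertools import combinations

def var(i, j):
    return 8 * (i - 1) + j

clauses = []
for i in range(1, 9):
    row = [var(i, j) for j in range(1, 9)]
    clauses.append(row)  # at least one queen in row i
    # at most one queen in row i: for every pair of squares, not both
    clauses.extend([-a, -b] for a, b in combinations(row, 2))

print(len(clauses))  # 8 rows x (1 + 28) clauses = 232
```

The column and diagonal constraints would be generated the same way, dropping the at-least-one clause for diagonals, and the whole list handed to any CNF SAT solver.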


[1] You’ll see the SAT problem described as finding a solution to a single Boolean formula. If you have multiple formulas, the requirement is that the first holds, and the second, and so on, so you can AND them all together to make one formula.

Square root of a small number

The recent post Converting between quaternions and rotation matrices describes an algorithm as “a mathematically correct but numerically suboptimal method.”

Let’s just look at a small piece of that post. The post explains how to find a rotation matrix to describe the same rotation as pre- and post- multiplying by a unit quaternion q, and how to recover q from a rotation. The latter involves

q_0 = ½ √(1 + r_11 + r_22 + r_33).

If you look at the definition of the r terms, you’ll see that the quantity under the square root equals 4q_0², and so it is positive. In theory. In floating point arithmetic, the term you’re taking the square root of could be small or even slightly negative.

I mentioned at the bottom of the earlier post that I came up with a unit quaternion q that caused the code to throw an exception because it attempted to take the square root of a negative number.

There’s an easy way to keep the code from throwing an exception: check whether the argument is positive before taking the square root. If it’s positive, forge ahead. If it’s negative, then round it up to zero.
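A minimal sketch of that guard, with hypothetical names for the diagonal entries:

```python
from math import sqrt

def q0_from_trace(r11, r22, r33):
    # 1 + r11 + r22 + r33 is positive in theory, but floating point
    # cancellation can push it slightly negative; clamp at zero first.
    t = 1.0 + r11 + r22 + r33
    return 0.5 * sqrt(max(t, 0.0))

print(q0_from_trace(1.0, 1.0, 1.0))  # identity rotation: prints 1.0
```

The clamp prevents the exception, but as explained below, it does nothing about the accuracy of the result when t is small.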

This makes sense. The argument is theoretically positive, so rounding it up to zero can’t do any harm, and in fact it would do a tiny bit of good. But this is missing something.

As a general rule of thumb, if an algorithm has trouble when a quantity is negative, it might have trouble when the quantity is small but positive too.

How could the term

1 + r_11 + r_22 + r_33

which is theoretically positive, be computed as a negative number? When all precision is lost in the sum. And this happens when

r_11 + r_22 + r_33

is nearly −1, i.e. when you’re subtracting nearly equal numbers. In general, if two positive numbers a and b are the same up to n bits, you can lose n bits of precision when subtracting them. You could lose all of your precision if two numbers agree to n bits and all you have are n bits.

Putting in a “breaker” to prevent taking the square root of a negative number keeps the code from throwing an exception, but it doesn’t address the problem of returning an inaccurate result. Maybe you’re subtracting two numbers that don’t agree to all the bits of precision you have, but they agree to more bits than you’d care to lose.

The solution to this problem is to come up with another expression for q0 and the components, another expression that is equal in theory but numerically less sensitive, i.e. one that avoids subtracting nearly equal numbers. That’s what the authors do in the paper cited in the previous post.

A simple way to generate random points on a sphere

I recently ran across a simple way to generate random points uniformly distributed on a sphere: Generate random points in a cube until you get one inside the sphere, then push it to the surface of the sphere by normalizing it [1].

In more detail, to generate random points on the unit sphere, generate triples

(u1, u2, u3)

of independent random values uniformly generated on [−1, 1] and compute the sum of their squares

S² = u1² + u2² + u3²

If S² > 1 then reject the sample and try again. Otherwise return

(u1/S, u2/S, u3/S).

There are a couple things I like about this approach. First, it’s intuitively plausible that it works. Second, it has minimal dependencies. You could code it up quickly in a new environment, say in an embedded system, though you do need a square root function.
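The method described above can be sketched in a few lines. The only wrinkle added here is rejecting the measure-zero case S² = 0 so the division is safe.

```python
# Accept-reject sampling of uniform points on the unit sphere:
# draw from the cube [-1, 1]^3, keep draws inside the unit ball,
# and push the accepted point out to the sphere by normalizing.
import random
from math import sqrt

def random_point_on_sphere():
    while True:
        u1, u2, u3 = (random.uniform(-1, 1) for _ in range(3))
        s_squared = u1*u1 + u2*u2 + u3*u3
        if 0 < s_squared <= 1:   # reject points outside the unit ball
            s = sqrt(s_squared)
            return (u1/s, u2/s, u3/s)

p = random_point_on_sphere()
print(sum(t*t for t in p))  # ≈ 1: the point lies on the unit sphere
```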

Comparison with other methods

It’s not the most common approach. The most common approach to generate points on a sphere in n dimensions is to generate n independent values from a standard normal distribution, then normalize. The advantage of this approach is that it generalizes efficiently to any number of dimensions.

It’s not obvious that the common approach works unless you’ve studied a fair amount of probability, and it raises the question of how to generate samples from a normal random variable. The usual method, the Box-Muller algorithm, requires a logarithm and a sine in addition to a square root.

The method starting with points in a cube is more efficient (when n = 3, though not for large n) than the standard approach. About half the samples you generate from the cube will be thrown out, so on average you’ll need to generate six values from [−1, 1] to produce one point on the sphere. It’s not the most efficient approach—Marsaglia gives a faster algorithm in the paper cited above—but it’s good enough.

Accept-reject methods

One disadvantage to accept-reject methods in general is that although they often have good average efficiency, their worst-case efficiency is theoretically infinitely bad: it’s conceivable that you’ll reject every sample, never getting one inside the sphere. This is not a problem in practice. However, it may be a problem in practice that the runtime is variable.

When computing in parallel, a task may have to wait on the last thread to finish. The runtime for the task is the runtime of the slowest thread. In that case you may want to use an algorithm that is less efficient on average but that has constant runtime.

It also matters in cryptography whether processes have a constant runtime. An attacker may be able to infer something about the path taken through code by observing how long it took for the code to execute.


[1] George Marsaglia. Choosing a point from the surface of a sphere. The Annals of Mathematical Statistics. 1972, Vol. 43, No. 2, 645–646.

Distribution of run times for Euclidean algorithm

The worst case run time for the Euclidean algorithm occurs when you give the algorithm a pair of consecutive Fibonacci numbers. The algorithm takes n steps to compute the greatest common divisor of F_(n+1) and F_(n+2).

The nth Fibonacci number is the nearest integer to φ^n/√5, where φ = (1 + √5)/2 is the golden ratio. You can take logs and solve for the largest n such that F_n is less than x, subtract 2, and have an upper bound on the number of steps necessary to find the gcd of two numbers less than x.

But what about the average run time, or the distribution of run times?

For large x, the distribution of the number of steps needed to calculate the gcd of two numbers 0 < a ≤ b ≤ x using the Euclidean algorithm is approximately normal with mean

12 log(2) log(x) / π².

See, for example, [1].

Let’s illustrate this with a simulation. Here’s Python code to return the number of steps used in computing gcd(ab).

    def euclid(a, b):
        steps = 0
        while a != 0:
            a, b = b%a, a
            steps += 1
        return steps # gcd = b

I generated 10,000 pairs of random integers less than 2^63 (because the maximum signed 64 bit integer is 2^63 − 1) and plotted a histogram of the run times.

I got a nice bell curve with mean 36.3736. The theoretical mean is 0.8428 log(2^63) = 36.8021.
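The simulation is a short script. Here’s a sketch (with a seed of my own choosing for reproducibility, so the sample mean will differ slightly from the one quoted above):

```python
# Mean number of Euclidean-algorithm steps for random pairs below 2^63,
# compared with the asymptotic mean 12 log(2) log(x) / pi^2.
import random
from math import log, pi

def euclid_steps(a, b):
    steps = 0
    while a != 0:
        a, b = b % a, a
        steps += 1
    return steps

random.seed(20250101)  # arbitrary seed, for reproducibility
N = 10_000
mean = sum(euclid_steps(random.randrange(1, 2**63),
                        random.randrange(1, 2**63)) for _ in range(N)) / N
theory = 12 * log(2) * log(2**63) / pi**2

print(mean, theory)  # both near 36.8
```

As a spot check on the step counter, the consecutive Fibonacci numbers 13 and 21 take exactly 6 steps, the worst case for their size.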

[1] Doug Hensley. The Number of Steps in the Euclidean Algorithm. Journal of Number Theory 49 (1994), 142–182.