The need for asymptotic notation: a case study

Consider the following problem. In Python, the numerical computation library numpy provides numpy arrays — a memory- and CPU-efficient alternative to ordinary Python lists. Numpy arrays have one problem, however: unlike Python lists, they cannot be resized (akin to arrays in many other languages like Java and C). Your program is provided with a long list of integers coming from a file, one integer per line. You don't know the number of lines beforehand. You want to make a function to efficiently read the data into a numpy array. How would you do it?¹⁾

The simplest solution would be reading the data into an ordinary Python list and then converting the list into a numpy array; however, with big inputs that would make the program consume multiple times more memory than the constructed array, as Python lists are rather inefficient. Another idea would be to read the file twice — count the number of lines on the first reading, then create the array of appropriate size and fill the array on the second reading; however, we would like our function to work also in case the input file cannot be scrolled back (e. g. if the file is actually the standard input stream). So let's try to find a solution that doesn't use Python lists and reads the file only once.

Here's a first try: we grow the array one element at a time — after reading a line from standard input, we create a new array which contains one more element than the old one, copy the contents of the old array into the new one and put the integer we just read at the last position of the array.

from numpy import empty
 
def readNumbers1(inputFile):
  a = empty(0, dtype=int)     # create a numpy array of length 0 whose element type is int
  for line in inputFile:      # iterate over all lines of the file
    k = len(a)                # k is the length of the array
    b = empty(k+1, dtype=int) # create a new uninitialized numpy array of length k+1
    for i in xrange(k):       # i ranges from 0 to k-1
      b[i] = a[i]             # copy the contents of the old array into the new one
    b[k] = int(line)          # parse the line into an int and put the int into the array
    a = b                     # make variable a point to the new array
  return a
 
#Testing:
with f = open("numbers.txt", "r"):
  a = readNumbers1(f)
  print("The sum of the numbers is {0}.".format(sum(a)))

The function readNumbers1 solves the problem correctly. But what can one tell about its performance? Will it be fast enough if the file contains many numbers?

We notice that the amount of time spent in each step of the outer for cycle becomes the larger, the longer the array a is (i. e. the more integers we have already read). Indeed, the time spent in the inner for cycle is proportional to $k$, the length of the array. We may suspect that the time spent by empty(k+1, dtype=int) could also be linear in $k+1$ (the size of the allocated array). The running time of other statements in the body of the outer for cycle is a constant which does not depend on k. So, all in all, the time spent in a step of the outer for cycle is linear in $k$, i. e. of the form $ck + d$ where $c$ and $d$ are some constants.

So, for reading the first line, the function readNumbers1 spends $c \cdot 0 + d = d$ units of time; for reading the second line, readNumbers1 spends $c \cdot 1 + d = c + d$ units of time, for reading the third line, its spends $2c + d$ units of time and so on. All in all, if the file contains $n$ lines then the for cycle takes \[\begin{align*} &d + (c + d) + (2c + d) + (3c + d) + \ldots + ( (n-2)c + d) + ( (n-1)c + d) =\\ &= nd + (1 + 2 + 3 + \ldots + (n-2) + (n-1) )c = nd + \frac{n(n-1)}{2}c \end{align*}\] units of time to run. Here we used the formula for the sum of arithmetic progression²⁾. The function also spends a constant amount of the time $e$ outside the for cycle, so the total running time of readNumbers1 is $$T_1(n) = nd + \frac{n(n-1)}{2}c + e$$. For large $n$, we can approximate the expression by leaving only the quadratic term: $$T_1(n) = nd + \frac{n(n-1)}{2}c + e = \frac{c}2n^2 + (d - \frac{c}{2})n + e \approx \frac{c}2n^2.$$ Indeed, for large $n$, the constant $e$ will be small compared to the other terms, which grow to infinity, and the term $\frac{c}2n^2$ eventually also becomes much bigger than the term $(d - \frac{c}{2})n$. This is obvious if you think of the graphs $y = ax^2$ and $y = bx$ — see the figure below.

Figure (parabola and line). For any positive constants $a$ and $b$, the function $ax^2$ will eventually become (much) larger than $bx$. Geometrically: given an upwards-oriented parabola through the origin and any rising straight line through the origin, the parabola will meet the line at some point to the right of the origin.

Real running times of this program with the python 2.7 interpreter on an Intel Core i3 380 processor are like this:

n	$T_1(n)$ (milliseconds)
1	0.010863
10	0.059700
100	1.646996
1000	130.244970
10000	12799.756050

Figure (running times of readNumbers1) Running times of readNumbers1 for $n$ up to 500, in milliseconds.

The times indeed grow quadratically, as expected: if n grows $10$ times then the running time of the algorithm grows approximately $10^2 = 100$ times. For $n = 10,000$ the function already takes multiple seconds to run. From the last two rows of table it appears that for large inputs, the running time of the algorithm is $T_1(n) \approx 1.3 \cdot 10^{-4} n^2$ milliseconds. This can also be seen from Figure (running times of readNumbers1).

Can we do better? Noting that the quadratic part in the time of readNumbers came from resizing the array, let's try to reduce the number of resizings:

def readNumbers2(inputFile):
  a = empty(1, dtype=int)
  m = 0
  for line in inputFile:
    k = len(a)
    if m >= k:
      b = empty(2*k, dtype=int)
      for i in xrange(k):
        b[i] = a[i]
      a = b
    a[m] = int(line)
    m = m + 1
  b = empty(m, dtype=int)
  for i in xrange(m):
    b[i] = a[i]
  return b

Unlike readNumbers1, readNumbers2 does not increase the array size by $1$ but doubles it. Therefore creation of a new array and copying of the contents of the old array into the new one is needed more rarely, at the cost of some memory being wasted on unused positions at the end of the array. The difference between the two algorithms is illustrated by the drawing below.

Does this pay off? Let's calculate. Note that the size of array a in readNumbers2 is always a power of $2$. Let $p$ be such a natural number that $2^{p-1} < n \leq 2^p$. Then readNumbers2 resizes the array $p$ times: $1 \to 2 \to 4=2^2 \to 8=2^3 \to \ldots \to 2^{p-1} \to 2^p$. Suppose creating a new array of size $x$ and copying the contents of the old array into it takes time $Cx$ for some constant $C$. Then the time spent on the resizings is $$(1 + 2 + 2^2 + 2^3 + \ldots + 2^{p-1} + 2^p)C = (2^{p+1} - 1)C < 2^{p+1} C = 4C \cdot 2^{p-1} < 4Cn.$$ Here we used the formula for the sum of a geometric progression³⁾. Apart from resizing the array, the function readNumbers2 spends only a constant amount of time D in each step of the for loop and a constant amount of time $A$ outside the for loop. Therefore, its total running time is $T_2(n) < (4C+D)n + A$. For large $n$, the constant $A$ becomes irrelevantly small compared to $(4C+D)n$, so for large $n$ we have $T_2(n) < dn$ for some constant $d$ (for example, if we only consider $n>A$, we may take $d = 4C + D + 1$). Therefore, our theoretical analysis indicates our effort has paid off: the running time of readNumbers1 grew quadratically with n, whereas readNumbers2 is only at most linear in n.

A benchmark gives running times like this:

n	$T_2(n)$ (milliseconds)
1	0.011342
10	0.042400
100	0.209093
1000	1.776934
10000	18.935204
100000	181.163073
$2^{18}$=262144	460.182905
$2^{18}$+1=262145	523.746967
$2^{19}$=524288	916.903019
$2^{19}$+1=524289	1050.997019
1000000	1763.423920

Figure (running times of readNumbers2) Running times of readNumbers2 for $n$ up to 600, in milliseconds.

Looking at the rows $n=10000, 100000, 10000000$ in the table, it seems that the running time is proportional to $n$, being approximately equal to $0.0018n$. Looking at the graph, one sees that $T_2(n)$ actually oscillates between two linear functions, with jumps at numbers of the form $n=2^p+1$ — points where the array is resized. We see that although $T_2(n)$ is not exactly proportional to $n$ itself, for large $n$, there is indeed an upper bound to the running time of this function which is proportional to $n$, just as our theoretical analysis predicted — namely it seems from the data that for sufficiently large $n$, $T_2(n) < 0.0021n$. Thanks to this fact, readNumbers2 has running time growing more slowly than that of readNumbers1 (which had quadratic time) and can even be used in case of millions of elements. Even if the constants $c$ and $d$ in the estimates $T_1(n) \approx \frac{c}2n^2$ and $T_2(n) < dn$ of the two algorithms were different, the second one would still beat the first one for large enough $n$ (remember Figure (parabola and line)!). For example, if the first program were run on a fast computer and the second on an old slow one, the first one would probably be better than the second on small inputs, but inevitably the second one would finish faster for large enough inputs. Changing the constants does not change much the order of magnitude of the inputs, for which the algorithms can be used, either: even if the constant $c$ were a hundred times smaller than we measured in the benchmark, the first algorithm still would still take a long time to handle a million elements ($1.3 \cdot 10^{-4}/100 \cdot (10^6)^2$ milliseconds = 1300 seconds ≈ 22 minutes) — something which the second algorithm accomplishes in less than two seconds.

Our conclusions from these two examples are:

if your algorithm's running time is a polynomial like $c n^2 + b n + a$ (where $n$ is the input size), it's only the highest term $c n^2$ that matters for large inputs;
when assessing the order of magnitude of inputs an algorithm is good for and whether one algorithm is faster than another for large inputs, the constants before the highest terms are not so important either, but rather the order of the highest terms. In particular, a linear term $dn$ is always much smaller than a quadratic term $cn^2$ for sufficiently large $n$;
therefore, for example, readNumbers2 is qualitatively faster than readNumbers1.

These observations lead us to the next lessons, where we will define asymptotic notations: the Big Oh and the Big Omega.

As an aside: in real life, to solve the problem of reading a text file of integers, one would use an appropriate function in numpy like numpy.fromfile; such functions use the same idea as we implemented in readNumbers2. The same idea is also used by Python lists (a Python list uses an underlying array, which has some spare elements at the end, so that the time of adding $n$ elements one-by-one to a list by calling list.add is linear, not quadratic in $n$).

Big Oh

¹⁾

Well, there's actually a function in numpy for reading arrays from text files, but let's not use it. It is still worth knowing how to write such a function, to be prepared for similar situations where there are no special functions at hand.

²⁾

To recall, an arithmetic progression is a sequence of equally spaced numbers (like $7, 9, 11, 13$) and its sum can be calculated as follows: the number of elements is multiplied by the arithmetic average of the first and the last element, e. g. $7 + 9 + 11 + 13 = 4 \cdot \frac{7 + 13}2 = 40$.

³⁾

To recall, a geometric progression is a sequence of numbers where the ratio of consecutive numbers is constant (like $0.4, 2, 10, 50$, where each element is obtained by multiplying the previous one by $5$) and its sum can be calculated by the formula $a\frac{1-r^n}{1-r}$ where $a$ is the first element, $r$ is the ratio of an element to the previous one and $n$ is the number of elements, e. g. $0.4 + 2 + 10 + 50 = 0.4 \cdot \frac{1-5^4}{1-5} = 62.4$. In case $a = 1$ and $r = 2$, the sum formula is especially simple: $1 + 2 + 2^2 + 2^3 + \ldots + 2^n = 2^{n+1} - 1$.

MathWiki

The need for asymptotic notation: a case study