====== #1 The need for asymptotic notation: a case study ======

Consider the following problem. In Python, the numerical computation library ''numpy'' provides //numpy arrays// — a memory- and CPU-efficient alternative to ordinary Python lists. Numpy arrays have one problem, however: unlike Python lists, they cannot be resized (akin to arrays in many other languages like Java and C). Your program is provided with a long list of integers coming from a file, one integer per line. You don't know the number of lines beforehand. You want to write a function that efficiently reads the data into a numpy array. How would you do it?((Well, there's actually a function in ''numpy'' for reading arrays from text files, but let's not use it. It is still worth knowing how to write such a function, to be prepared for similar situations where there are no special functions at hand.))

The simplest solution would be to read the data into an ordinary Python list and then convert the list into a numpy array; however, with big inputs that would make the program consume several times more memory than the constructed array, as Python lists are rather inefficient. Another idea would be to read the file twice — count the number of lines on the first reading, then create an array of the appropriate size and fill it on the second reading; however, we would like our function to work also in case the input file cannot be rewound (e. g. if the file is actually the standard input stream). So let's try to find a solution that doesn't use Python lists and reads the file only once.

Here's a first try: we grow the array one element at a time — after reading a line from the input file, we create a new array which contains one more element than the old one, copy the contents of the old array into the new one and put the integer we just read at the last position of the array.

<code python>
from numpy import empty

def readNumbers1(inputFile):
    a = empty(0, dtype=int)        # create a numpy array of length 0 whose element type is int
    for line in inputFile:         # iterate over all lines of the file
        k = len(a)                 # k is the length of the array
        b = empty(k+1, dtype=int)  # create a new uninitialized numpy array of length k+1
        for i in xrange(k):        # i ranges from 0 to k-1
            b[i] = a[i]            # copy the contents of the old array into the new one
        b[k] = int(line)           # parse the line into an int and put the int into the array
        a = b                      # make variable a point to the new array
    return a

# Testing:
with open("numbers.txt", "r") as f:
    a = readNumbers1(f)
print("The sum of the numbers is {0}.".format(sum(a)))
</code>

The function ''readNumbers1'' solves the problem correctly. But what can one tell about its performance? Will it be fast enough if the file contains many numbers? We notice that the time spent in each step of the outer ''for'' loop grows with the length of the array ''a'' (i. e. with the number of integers we have already read). Indeed, the time spent in the inner ''for'' loop is proportional to $k$, the length of the array. We may suspect that the time spent by ''empty(k+1, dtype=int)'' could also be linear in $k+1$ (the size of the allocated array). The running time of the other statements in the body of the outer ''for'' loop is a constant which does not depend on $k$. So, all in all, the time spent in one step of the outer ''for'' loop is linear in $k$, i. e. of the form $ck + d$ where $c$ and $d$ are some constants.
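Before adding these per-step costs up, we can check this claim empirically. The sketch below is not part of the solution: it times one grow-by-one step in isolation using ''timeit.default_timer''; the helper name ''growByOne'', the array lengths and the repetition count are arbitrary choices made for this illustration. If the claim is right, the measured cost should roughly double whenever $k$ doubles.

<code python>
from numpy import empty
from timeit import default_timer as timer

def growByOne(a):
    # One step of readNumbers1: allocate an array with one extra slot
    # and copy the old contents into it element by element.
    k = len(a)
    b = empty(k + 1, dtype=int)
    for i in range(k):
        b[i] = a[i]
    return b

# Time one grow-by-one step for a few array lengths k.
for k in (1000, 2000, 4000, 8000):
    a = empty(k, dtype=int)
    reps = 20
    start = timer()
    for _ in range(reps):
        growByOne(a)
    print("k = {0:5d}: {1:.4f} ms per step".format(k, (timer() - start) / reps * 1000))
</code>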
Now let us add these costs up. For reading the first line (when the array is still empty, $k = 0$), the function ''readNumbers1'' spends $c \cdot 0 + d = d$ units of time; for reading the second line, it spends $c \cdot 1 + d = c + d$ units of time; for reading the third line, it spends $2c + d$ units of time, and so on. All in all, if the file contains $n$ lines then the ''for'' loop takes
\[\begin{align*}
&d + (c + d) + (2c + d) + (3c + d) + \ldots + ((n-2)c + d) + ((n-1)c + d) =\\
&= nd + (1 + 2 + 3 + \ldots + (n-2) + (n-1))c = nd + \frac{n(n-1)}{2}c
\end{align*}\]
units of time to run. Here we used the formula for the sum of an arithmetic progression((To recall, an arithmetic progression is a sequence of equally spaced numbers (like $7, 9, 11, 13$) and its sum can be calculated as follows: the number of elements is multiplied by the arithmetic average of the first and the last element, e. g. $7 + 9 + 11 + 13 = 4 \cdot \frac{7 + 13}2 = 40$.)). The function also spends a constant amount of time $e$ outside the ''for'' loop, so the total running time of ''readNumbers1'' is
$$T_1(n) = nd + \frac{n(n-1)}{2}c + e.$$

For large $n$, we can approximate this expression by keeping only the quadratic term:
$$T_1(n) = nd + \frac{n(n-1)}{2}c + e = \frac{c}2n^2 + \left(d - \frac{c}{2}\right)n + e \approx \frac{c}2n^2.$$
Indeed, for large $n$, the constant $e$ is small compared to the other terms, which grow to infinity, and the term $\frac{c}2n^2$ eventually also becomes much bigger than the term $\left(d - \frac{c}{2}\right)n$. This becomes obvious if you think of the graphs $y = ax^2$ and $y = bx$ — see the figure below.

{{:asymptotics:parabola_and_line.png}}
**Figure (parabola and line).** For any positive constants $a$ and $b$, the function $ax^2$ will eventually become (much) larger than $bx$. Geometrically: given an upwards-oriented parabola through the origin and any rising straight line through the origin, the parabola will meet the line at some point to the right of the origin and stay above it from there on.

Real running times of this program with the ''python'' 2.7 interpreter on an Intel Core i3 380 processor are as follows:

^ n     ^ $T_1(n)$ (milliseconds) ^
| 1     | 0.010863     |
| 10    | 0.059700     |
| 100   | 1.646996     |
| 1000  | 130.244970   |
| 10000 | 12799.756050 |

{{:asymptotics:readNumbers1.png|size=200}}
**Figure (running times of ''readNumbers1'').** Running times of ''readNumbers1'' for $n$ up to 500, in milliseconds.

The times indeed grow quadratically, as expected: if $n$ grows $10$ times then the running time of the algorithm grows approximately $10^2 = 100$ times. For $n = 10000$ the function already takes multiple seconds to run. From the last two rows of the table it appears that for large inputs, the running time of the algorithm is $T_1(n) \approx 1.3 \cdot 10^{-4} n^2$ milliseconds. This can also be seen from [[#readNumbers1_times|Figure (running times of readNumbers1)]].

Can we do better? Noting that the quadratic part of the running time of ''readNumbers1'' came from resizing the array, let's try to reduce the number of resizings:

<code python>
def readNumbers2(inputFile):
    a = empty(1, dtype=int)        # the array starts with room for one element
    m = 0                          # m is the number of elements actually stored in a
    for line in inputFile:
        k = len(a)
        if m >= k:                 # the array is full: double its size
            b = empty(2*k, dtype=int)
            for i in xrange(k):
                b[i] = a[i]
            a = b
        a[m] = int(line)
        m = m + 1
    b = empty(m, dtype=int)        # finally, trim the array to exactly m elements
    for i in xrange(m):
        b[i] = a[i]
    return b
</code>

Unlike ''readNumbers1'', ''readNumbers2'' does not grow the array by one element at a time: whenever the array becomes full, its size is doubled. Therefore creating a new array and copying the contents of the old array into the new one is needed much more rarely, at the cost of some memory being wasted on unused positions at the end of the array. (The variable ''m'' keeps track of how many positions are actually in use, and at the very end the array is copied once more into an array of exactly $m$ elements, so that the result contains no unused positions.)
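How rarely does the resizing actually happen? Before calculating this precisely, here is a small simulation written for this illustration only (the helper name ''countResizes'' and the chosen values of $n$ are arbitrary); it replays the resizing pattern of ''readNumbers2'' and counts how many elements the doubling steps copy in total. The total always stays below $2n$.

<code python>
def countResizes(n):
    # Replay the resizing pattern of readNumbers2 for n input numbers:
    # count how many times the array is doubled and how many elements
    # those doublings copy in total (the final trim copy is not counted).
    size, m, resizes, copied = 1, 0, 0, 0
    for _ in range(n):
        if m >= size:       # the array is full: double it
            copied += size  # the inner copy loop moves `size` elements
            size *= 2
            resizes += 1
        m += 1              # store the next number
    return resizes, copied

for n in (10, 1000, 10**6):
    resizes, copied = countResizes(n)
    print("n = {0:7d}: {1:2d} doublings, {2:7d} elements copied (2n = {3})"
          .format(n, resizes, copied, 2 * n))
</code>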
The difference between the two algorithms is illustrated by the drawing below.

{{:asymptotics:readNumbers_algorithm.png}}

Does this pay off? Let's calculate. Note that the size of the array ''a'' in ''readNumbers2'' is always a power of $2$. Let $p$ be the natural number for which $2^{p-1} < n \leq 2^p$. Then ''readNumbers2'' resizes the array $p$ times: $1 \to 2 \to 4 = 2^2 \to 8 = 2^3 \to \ldots \to 2^{p-1} \to 2^p$. Suppose creating a new array of size $x$ and copying the contents of the old array into it takes time $Cx$ for some constant $C$. Then the time spent on the resizings is
$$(1 + 2 + 2^2 + 2^3 + \ldots + 2^{p-1} + 2^p)C = (2^{p+1} - 1)C < 2^{p+1} C = 4C \cdot 2^{p-1} < 4Cn.$$
Here we used the formula for the sum of a geometric progression((To recall, a geometric progression is a sequence of numbers where the ratio of consecutive numbers is constant (like $0.4, 2, 10, 50$, where each element is obtained by multiplying the previous one by $5$) and its sum can be calculated by the formula $a\frac{1-r^n}{1-r}$ where $a$ is the first element, $r$ is the ratio of an element to the previous one and $n$ is the number of elements, e. g. $0.4 + 2 + 10 + 50 = 0.4 \cdot \frac{1-5^4}{1-5} = 62.4$. In case $a = 1$ and $r = 2$, the sum formula is especially simple: $1 + 2 + 2^2 + 2^3 + \ldots + 2^n = 2^{n+1} - 1$.)) and, in the last step, the inequality $2^{p-1} < n$.

Apart from resizing the array, the function ''readNumbers2'' spends only a constant amount of time $D$ in each step of the ''for'' loop and a constant amount of time $A$ outside the ''for'' loop. Therefore, its total running time is $T_2(n) < (4C+D)n + A$. For large $n$, the constant $A$ becomes negligible compared to $(4C+D)n$, so for large $n$ we have $T_2(n) < dn$ for some constant $d$ (for example, if we only consider $n > A$, we may take $d = 4C + D + 1$). Therefore, our theoretical analysis indicates that our effort has paid off: the running time of ''readNumbers1'' grew quadratically with $n$, whereas the running time of ''readNumbers2'' is at most linear in $n$.

A benchmark gives running times like this:

^ n                    ^ $T_2(n)$ (milliseconds) ^
| 1                    | 0.011342    |
| 10                   | 0.042400    |
| 100                  | 0.209093    |
| 1000                 | 1.776934    |
| 10000                | 18.935204   |
| 100000               | 181.163073  |
| $2^{18}$ = 262144    | 460.182905  |
| $2^{18}$+1 = 262145  | 523.746967  |
| $2^{19}$ = 524288    | 916.903019  |
| $2^{19}$+1 = 524289  | 1050.997019 |
| 1000000              | 1763.423920 |

{{:asymptotics:readNumbers2.png|size=200}}
**Figure (running times of ''readNumbers2'').** Running times of ''readNumbers2'' for $n$ up to 600, in milliseconds.

Looking at the rows $n = 10000, 100000, 1000000$ in the table, it seems that the running time is proportional to $n$, being approximately equal to $0.0018n$ milliseconds. Looking at the graph, one sees that $T_2(n)$ actually oscillates between two linear functions, with jumps at numbers of the form $n = 2^p + 1$, the points where the array is resized. We see that although $T_2(n)$ is not exactly proportional to $n$, for large $n$ there is indeed an upper bound on the running time of this function which is proportional to $n$, just as our theoretical analysis predicted: from the data it seems that for sufficiently large $n$, we have $T_2(n) < 0.0021n$ milliseconds. Thanks to this, the running time of ''readNumbers2'' grows more slowly than that of ''readNumbers1'' (which was quadratic) and the function can be used even for millions of elements.
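If you want to reproduce measurements like the ones in the tables above, the following sketch shows one possible way. It is not the script that produced the tables: the use of ''io.StringIO'' as an in-memory input file, the helper name ''benchmark'', the chosen values of $n$ and the output format are assumptions made for this illustration, and the absolute numbers will depend on your hardware. It assumes ''readNumbers1'' and ''readNumbers2'' are defined as above in the same script (the measurements in this lesson were made under Python 2.7; under Python 3, replace ''xrange'' with ''range'' in those functions). Note that the last ''readNumbers1'' call may take on the order of ten seconds.

<code python>
from io import StringIO
from timeit import default_timer as timer

def benchmark(readFunc, n):
    # Build an in-memory "file" with n lines, one integer per line,
    # and measure how long a single call of readFunc on it takes.
    data = StringIO(u"\n".join(str(i) for i in range(n)) + u"\n")
    start = timer()
    readFunc(data)
    return (timer() - start) * 1000   # milliseconds

for n in (100, 1000, 10000):
    t1 = benchmark(readNumbers1, n)
    t2 = benchmark(readNumbers2, n)
    print("n = {0:6d}: readNumbers1 {1:12.3f} ms, readNumbers2 {2:8.3f} ms".format(n, t1, t2))
</code>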
Even if the constants $c$ and $d$ in the estimates $T_1(n) \approx \frac{c}2n^2$ and $T_2(n) < dn$ of the two algorithms were different, the second one would still beat the first one for large enough $n$ (remember [[#parabola_and_line|Figure (parabola and line)]]!). For example, if the first program were run on a fast computer and the second on an old, slow one, the first one would probably be faster than the second on small inputs, but inevitably the second one would finish faster for large enough inputs. Changing the constants also does not change much the order of magnitude of the inputs for which the algorithms are usable: even if the constant $c$ were a hundred times smaller than we measured in the benchmark, the first algorithm would still take a long time to handle a million elements ($1.3 \cdot 10^{-4}/100 \cdot (10^6)^2$ milliseconds = 1300 seconds ≈ 22 minutes), something which the second algorithm accomplishes in less than two seconds.

Our conclusions from these two examples are:

  * if your algorithm's running time is a polynomial like $c n^2 + b n + a$ (where $n$ is the input size), it is only the highest-order term $c n^2$ that matters for large inputs;
  * when assessing the order of magnitude of inputs an algorithm is good for, or whether one algorithm is faster than another for large inputs, what matters is not so much the constants in front of the highest-order terms as the order of those terms themselves. In particular, a linear term $dn$ is much smaller than a quadratic term $cn^2$ for all sufficiently large $n$, whatever the constants $c$ and $d$ are;
  * therefore, for example, ''readNumbers2'' is qualitatively faster than ''readNumbers1''.

These observations lead us to the next lessons, where we will define asymptotic notations: the Big Oh and the Big Omega.

As an aside: in real life, to solve the problem of reading a text file of integers, one would use an appropriate function in ''numpy'' like ''numpy.fromfile''; such functions use the same idea as we implemented in ''readNumbers2''. The same idea is also used by Python lists (a Python list uses an underlying array which has some spare positions at the end, so that the time of adding $n$ elements one by one to a list by calling ''list.append'' is linear, not quadratic, in $n$).

[[asymptotics:02_big_oh]]

----