Refactor your code with Numpy

Kirill Bondarenko
7 min read · Jun 4, 2019

How to make Python code more efficient?


Introduction

Hello everyone !

In this article I want to share some of my experience with refactoring Python code. As we know, Python has one important disadvantage: it is a bit slow (or we may see its slowness as the cost of its great usability), and Python "eats" too much RAM while doing huge tasks like list creation. There are many good libraries that make code faster and more efficient, but in this article I will show you the most popular package: Numpy.

You may also read my other article on how to become more efficient in data science.

Content

  • Lists: create and append
  • Operations with arrays
  • Data generation
  • Data types changing
  • Save and load your arrays

Lists: create and append

First of all, we need to understand why a Python list is slow compared to a numpy array.

Python is a dynamic, interpreted language. To create a variable in Python, we may just write:

a = 10 # python object

In this way, Python doesn't create just an integer variable: our a is a full Python object, carrying a reference count and type information along with the value itself.

Imagine creating a list with 100 such variables. Will it work as fast as a simple C array of 100 integers? No.

That is where numpy comes in. Let's take a code example and measure its running time and memory usage.

Key points: list creation, for loop with appending.

import time
import os
import psutil

ts = time.time()
custom_list = [x for x in range(0, 1000000)]  # list creation
other_empty_list = []
pid = os.getpid()
py = psutil.Process(pid)
for number in custom_list:
    other_empty_list.append(number * 2)  # appending a new value

memoryUse = py.memory_info()[0] * 1e-9  # resident set size in GB
print('memory use: %s GB' % round(memoryUse, 3))
print(round(time.time() - ts, 3), 'Seconds spent')
#================================================
We got:
memory use: 0.094 GB
0.098 Seconds spent

First of all, let's rewrite the list creation with the numpy.arange() function.

This function takes one argument and works like Python's range(). So numpy.arange(3) is equivalent to numpy.array([0, 1, 2]).
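A quick sketch of arange() in action (it also accepts optional start and step arguments, like range()):

```python
import numpy as np

# Single argument: works like range(stop)
print(np.arange(3))         # [0 1 2]

# Optional start and step, like range(start, stop, step)
print(np.arange(2, 10, 2))  # [2 4 6 8]
```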

And we need to rewrite appending in numpy syntax. See next:

destination_list = numpy.append(destination_list, value, axis)

The axis parameter defaults to None, which flattens the inputs before appending; if you have a multidimensional array, you may choose an axis along which to append. But how to create an empty numpy array?

empty_array = numpy.empty(shape=(0, ))

When you create an empty array for appending, it's important to make the first dimension equal to 0.
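Putting both pieces together, a minimal sketch (the variable names are illustrative):

```python
import numpy as np

empty_array = np.empty(shape=(0,))

# With the default axis=None the inputs are flattened before appending
a = np.append(empty_array, [1, 2, 3])
print(a)  # [1. 2. 3.]

# With an explicit axis, shapes must match along the other dimensions
b = np.empty(shape=(0, 3))
b = np.append(b, [[1, 2, 3]], axis=0)
print(b)  # [[1. 2. 3.]]
```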

Let’s assemble them all into a new code:

import time
import os
import psutil
import numpy as np

ts = time.time()
custom_list = np.arange(1000000)
other_empty_list = np.empty(shape=(0,))
pid = os.getpid()
py = psutil.Process(pid)
for number in custom_list:
    other_empty_list = np.append(other_empty_list, (number * 2))

memoryUse = py.memory_info()[0] * 1e-9
print('memory use: %s GB' % round(memoryUse, 3))
print(round(time.time() - ts, 3), 'Seconds spent')
#=========================================================
memory use: 0.053 GB
870.415 Seconds spent

It took 14 minutes (!!!) to finish the code. Crazy, I know, but look at the memory usage: it was cut nearly in half. The slowdown comes from np.append, which copies the whole array on every call.

And one more note about appending. Suppose you have to create a 2D array (an array of arrays), and you have a function get_array() that returns a 1D array of 100 numbers. How to write this code?

destination_array = np.empty(shape=(0, 100))
for i in range(0, 100):
    array = get_array()  # 1D array of 100 numbers like: [0, 1, 2, ...]
    # append as a 2D array, with extra square brackets [array], along the first axis
    destination_array = np.append(destination_array, [array], axis=0)

This pattern works with any structure. Try to experiment with it.

Conclusion about lists: numpy appending uses your memory more efficiently, but it takes much more time to finish.
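If you know the final size in advance, a common alternative worth knowing (not used above) is to preallocate the array and fill it by index, which avoids the repeated copying inside np.append; the fully vectorized form is faster still:

```python
import numpy as np

custom_list = np.arange(1000000)

# Preallocate the result instead of growing it with np.append
result = np.empty(custom_list.shape[0])
for i, number in enumerate(custom_list):
    result[i] = number * 2

# Vectorized form: the same values, computed in one pointwise operation
vectorized = custom_list * 2
```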

Operations with arrays

In this part I want to show the differences between lists and numpy arrays in basic operations.

Concatenation

To concatenate two Python lists we may just use the "+" operator.

a = [1, 2, 3]
b = [4, 5, 6]
print(a+b)
#============================
Output: [1, 2, 3, 4, 5, 6]

But what will happen if we do the same with two numpy arrays ?

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a+b)
#============================
Output: [5 7 9]

We see that numpy treats "+" not as concatenation but as pointwise addition between pairs of elements. So how to concatenate numpy arrays?

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.concatenate([a, b]))
#============================
# Output: [1 2 3 4 5 6]

And we may concatenate 2D arrays with 2 approaches:

import numpy as np

a = np.array([[1, 2, 3]])
b = np.array([[4, 5, 6]])
print(np.vstack([a, b]))
#============================
# Output: [[1 2 3]
#          [4 5 6]]
# OR:
print(np.concatenate([a, b], axis=0))
#============================
# Output: [[1 2 3]
#          [4 5 6]]

Pointwise operations

Numpy has a big advantage. Its arrays support any pointwise operation:

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7])
print(a * 2)
#[ 2 4 6 8 10 12 14]
print(a - 2)
#[-1 0 1 2 3 4 5]
print(a * a)
#[ 1 4 9 16 25 36 49]
print(a + a)
#[ 2 4 6 8 10 12 14]

Even with custom functions:

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7])

def my_function(value):
    return value * 10


print(my_function(a))
#[10 20 30 40 50 60 70]

But what happens if the argument is just a Python list?

def my_function(value):
    return value * 10


python_list = [1, 2, 3]
print(my_function(python_list))
#[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]

Python lists have no pointwise operations: multiplying a list just repeats the initial list.
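If you do receive a plain list, a simple fix is to convert it to a numpy array first, so the operation becomes pointwise (reusing the my_function from above):

```python
import numpy as np

def my_function(value):
    return value * 10

python_list = [1, 2, 3]

# Convert the list to an ndarray before applying the function
as_array = np.asarray(python_list)
print(my_function(as_array))  # [10 20 30]
```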

Data generation

There are many situations when you need to generate some data to test your code. Numpy has a few good instruments for it.

Random arrays

numpy.random has a few good tools, like:

  • random integers: numpy.random.randint(low, high, size)
  • random values from 0 to 1: numpy.random.rand(d0, d1, ...)

import numpy as np

a = np.random.randint(0, 10, size=(10, ))
print(a)
#[9 7 7 7 8 6 4 2 1 1]
a = np.random.rand(10, )
print(a)
#[0.96129224 0.33143719 0.71664022 0.30985344 0.30896639 0.75746138
# 0.47165612 0.87208496 0.28826863 0.17739896]

Numpy's rand() is more powerful than you may think: it may create an array of any dimensionality (1D, 2D, 100D, etc.):

import numpy as np

a = np.random.rand(2, 2, 2, 2)
print(a)
#=================================
[[[[0.05124643 0.41544372]
   [0.55786205 0.52363237]]

  [[0.5845776  0.42107395]
   [0.05805909 0.23181747]]]


 [[[0.20071307 0.05955022]
   [0.60296557 0.07028826]]

  [[0.38031588 0.8651122 ]
   [0.26330836 0.71948145]]]]

Zeros and ones

  • zero array — numpy.zeros(shape)
  • ones array — numpy.ones(shape)
import numpy as np

a = np.zeros((3, 3))
print(a)
#=================================
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
a = np.ones((3, 3))
print(a)
#=================================
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]

Data types changing

Be careful when you work with numpy or other complex libraries: there are many points in the code where you can accidentally destroy your data.

For example: you have an array of floats from 0 to 1, and you append a few integers to it:

import numpy as np

a = np.array([0.1, 0.2, 0.7])
a = np.append(a, [1, 2, 3, 4, 5])
print(a)
# ==================================
# [0.1 0.2 0.7 1. 2. 3. 4. 5. ]

The result is an array of floats: the integers were transformed into floats.

Now imagine we use some function from another library and forget to look at its return type. We know that this function divides each element of our array by 10, so we expect this result:

import numpy as np

a = np.array([0.1, 0.2, 0.7])
a = np.append(a, [1, 2, 3, 4, 5])
print(a/10)
#=====================================
[0.01 0.02 0.07 0.1 0.2 0.3 0.4 0.5 ]

But the function we actually call looks like this:

import numpy as np

a = np.array([0.1, 0.2, 0.7])
a = np.append(a, [1, 2, 3, 4, 5])
print(a)


def my_function(array):
    return np.array(array, dtype='int8') / 10


a = my_function(a)
print(a)
# ==================================
# [0. 0. 0. 0.1 0.2 0.3 0.4 0.5]

We lost three values because of the type conversion to "int8".

It's not a complex thing; just be careful with it and always look at your return types.

For instance, the keras (ML) function pad_sequences returns "int32" by default. So if you pad a float sequence, you will lose your data unless you change the dtype to "float32".

Save and load your arrays

Numpy has good tools to save your arrays to disk and load them back without losing information.

Saving:

import numpy as np

a = np.arange(10)
np.save('path/to/your/directory/name_of_the_array', a)

You will get a file with the extension ".npy"; it is a saved array.

Loading:

import numpy as np

a = np.load('path/to/your/directory/name_of_the_array.npy')
print(a)
# =======================================
# [0 1 2 3 4 5 6 7 8 9] returns numpy array

When loading, be careful to add the ".npy" extension to the file name.

You may save and load even Python lists this way. But there is one important thing: if you have a variable x = [1, 2, 3] and call np.save(path, x), numpy first converts x to an array, duplicating it, which may cause memory overload (in the case of huge arrays, trust me). Just create a numpy array from the start, as at the beginning of the article; then saving won't duplicate your data.
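A small runnable sketch of this round trip (using a temporary directory, so the path is only illustrative):

```python
import tempfile
import os
import numpy as np

python_list = [1, 2, 3]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'my_array.npy')
    # np.save converts the list to an ndarray before writing it
    np.save(path, python_list)
    loaded = np.load(path)

print(type(loaded))  # <class 'numpy.ndarray'>
print(loaded)        # [1 2 3]
```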

Conclusion

This was a brief demonstration of the main things you can do with the help of numpy. There are, of course, many other tools, good and bad, which you may discover yourself. The main thing to understand now: numpy is powerful and may do operations faster than Python, but appending to arrays is much slower, though it uses RAM better. Decide what is more important in your task.

Good luck! I hope this article has helped you find what you were looking for.

Bondarenko K. , machine learning engineer
