Python vs Copy on Write

I recently found myself awake at 4 AM trying to science myself out of a memory problem in Python on Linux. If you’re trying to share a read only cache with a number of Linux subprocesses and find yourself running out of memory and time, this blog is for you. The tl;dr of the story is down below.

This issue was the first serious whack I’ve had at tuning memory issues in Python. I’m by no means an expert on the subject, so feel free to fill in some of the gaps in the comments below.

For the purposes of this blog, I’ll try to keep the toolset simple. No fancy valgrind, just stock Python 2.7 scripts running on Ubuntu 15.04, with some PyPy thrown in for bonus points. I’ll be allocating a bunch of memory and hoping it shows up in the stock Ubuntu System Monitor.

There are most certainly better tools available. If you’re running in production, you’ll want to log the memory consumption of your process and have a way of associating it with what is happening in your process. Be sure to keep some Linux performance tools tucked safely under your pillow, as well as some grep, awk, sed and gnuplot goodness for log file mining.
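
If all you need is a number you can log over time, the standard library will get you surprisingly far. Here’s a minimal sketch of the kind of logging I mean (the tag is made up; on Linux, ru_maxrss is reported in kilobytes):

import logging
import resource

logging.basicConfig(level=logging.INFO)

def log_memory(tag):
    # ru_maxrss is the peak resident set size of this process so far.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logging.info('%s: peak RSS %s kB', tag, peak)

log_memory('after cache load')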

However, instead of tooling I’d like to talk basic concepts. It’s important to understand the mechanics of a problem before trying to treat it, and that’s what I hope to accomplish with this blog. With that said, let’s dive into the problem.


Measure First

The bit of code I was trying to tune was a basic multiprocessing worker pool running on Linux (i.e. using fork). The code will fork, do some processing and join the main process when it’s complete. Each worker in the pool requires a cache of data to do its job. Those familiar with copy on write issues in Python will immediately spot the problem. I’ll go over the details of this problem in the section below.

However, I should underscore that when you’re dealing with performance issues, a large part of the effort is properly diagnosing the problem. As with any performance issue, the size of n matters just as much as the code that processes it. While I had suspicions that my problem was a copy on write issue, I did not pursue a solution until I had clear evidence supporting my hypothesis.

Like unit testing, this sort of rigour grows more important as timelines shrink, which raises the question: how do you know you have a memory issue in the first place?


On Linux, oom-killer (the out of memory killer) is lurking in the background, waiting to defend your system from processes that request too much memory. Unfortunately, you will lose your traceback, because the process is halted before it gets the opportunity to generate one. If you ever find yourself wondering why your Python process abruptly halted, check your syslog for something like:

Jan 24 00:00:23 host kernel: python invoked oom-killer: gfp_mask=0x100bc, order=0, oom_adj=0, oom_score_adj=0
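
If you want to check for this quickly, a grep over syslog (assuming the standard Ubuntu location) will do:

$ grep -i oom-killer /var/log/syslog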

So, hopefully you’ve done your due diligence and come to the conclusion that you have a memory issue associated somehow with multiprocessing. What’s next?

The Problem

I was fortunate to be sitting in the audience when Brandon Rhodes gave his talk on Python, Linkers, and Virtual Memory at PyCon. He does a much better job of explaining Copy on Write than I do in this blog, so go watch the video. I’ll try to do a quick summary of his talk here, so that you can understand all of the working pieces.

A very simple mental model for memory is a list of data blocks that each have an address. “How much memory does a process use?” is a loaded question for the simple reason that processes share memory.

If two processes want to load the same shared library, instead of loading it into memory twice, the operating system can load it once and point both processes at the same physical blocks. As a result, there is a distinction between the physical memory a process actually occupies (called the Resident Set Size, or RSS) and the virtual memory the process has mapped (called the Virtual Memory Size, or VMS).
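
On Linux you can see both numbers for a running process by peeking at /proc. Here’s a rough sketch for the current process; the VmRSS and VmSize fields come straight from /proc/self/status:

def report_memory():
    # Print the resident (VmRSS) and virtual (VmSize) sizes of this process.
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith(('VmRSS', 'VmSize')):
                print line.rstrip()

report_memory()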

In addition to sharing code, processes can also share data. The classic example on Unix is forking a child process. When you fork a child process, the operating system doesn’t re-run the program from scratch. Instead, it creates a new process and points the child’s virtual memory at the same physical blocks the parent is using.

This works well until either process needs to modify that memory. When that happens, the operating system copies the affected block and then performs the write; hence the term copy on write.
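
Here’s a minimal sketch of the fork behaviour described above (Unix only, and it only shows the sharing half of the story, not the copying):

import os

data = range(5)

pid = os.fork()
if pid == 0:
    # Child: until something writes to these pages, they are literally the
    # parent's pages.
    print 'child sees', data
    os._exit(0)
else:
    os.waitpid(pid, 0)
    print 'parent still sees', data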

Copy on write is handy for creating caches that subprocesses can reference. Read only memory need only be referenced, not duplicated, which significantly reduces the memory overhead of each subprocess.

Copy on write is a problem for Python because its objects are reference counted. To keep track of when an object should be deleted, Python stores a count of how many references to the object currently exist; when that count reaches 0, the object can be reclaimed. Benjamin Peterson has an interesting talk on the subject.
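
You can watch the count yourself with sys.getrefcount. Note that the number it reports is always one higher than you might expect, because the argument passed to getrefcount is itself a temporary reference:

import sys

cache = ['some', 'cached', 'data']
print sys.getrefcount(cache)  # the cache name above, plus getrefcount's argument

alias = cache                 # a second name for the same list
print sys.getrefcount(cache)  # one higher than before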

The unfortunate implication is that when your subprocess comes along, even the act of reading an object results in the underlying memory being copied, because Python has to update the reference count, and that count is stored alongside the object itself.

As a result, every new subprocess that reads the cache will also create a full copy of the cached data.

The Benchmark

WARNING: Running some of these tests might run your computer out of memory and possibly crash it or worse. Here be dragons. Use proper precautions.

Before we dive into a full blown solution, we’ll need a benchmark for comparison. If we can address the behaviour we see in the benchmark, we’ve got a candidate solution.

The benchmark is relatively simple. It first creates a large data structure to serve as our cache using the range function with a very large number. It will then spawn a number of subprocesses that attempt to read from the cache. The expectation is that our memory usage will be multiplied by the number of subprocesses that we spawn.

In order to run these examples, you’ll have to tune them a bit to get them to show up in your system monitor. Three parameters control the behaviour of the benchmark script below. The most important is CACHE, which specifies the number of records to be created (i.e. the very large number above). PROCESSES is the number of subprocesses that we’ll spawn. To differentiate between the cache allocation part of the benchmark and the subprocess part, we’ll inject a number of pauses using PAUSE, specified in seconds.

Tuning CACHE can be a bit tricky. In order to reproduce my experiments, you’ll need the process to show up distinctly in your system monitor. Again, take caution, because running your machine out of memory is easier than you think. I’ve intentionally made the CACHE size small enough for most machines. I usually add 0s to the CACHE size until it consumes about 5% of the memory on the machine. Setting the PAUSE value to something low will also speed up this tuning.

Here’s the benchmark:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Pool
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# process is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that reads the cache and then sleeps so that its memory usage is visible in the system monitor.
    '''
    # Read all of the data in the cache. Note that in order to see the effect
    # of the copy on write problem, we need to actually read the cache. Reading
    # the cache updates the reference count of every object we touch, which
    # copies the memory even though all we're doing is reading.
    for item in cache:
        pass

    # Make sure the allocated memory is visible in the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create the cache. Note that allocating a large amount of memory takes
    # time. After the memory is allocated, we want to make sure the new memory
    # level is visible in the system monitor, so we pause.
    cache = range(CACHE)
    sleep(PAUSE)

    # Run the jobs.
    pool = Pool(PROCESSES)
    pool.map(job, [cache]*PROCESSES)

if __name__ == '__main__':
    main()

Here’s what happens when we run the benchmark:

[Figure: system monitor memory graph for the benchmark run]

Time proceeds from left (the start) to right (the end). You can see the first long PAUSE after the cache is allocated, then the cache being duplicated by each subprocess, and finally the second long PAUSE while the parallel processes finish.

The Solution?

The core of our solution is to move the cache into shared memory that Python does not reference count. Fortunately, multiprocessing has a number of primitives that can help. Unfortunately, there are a few ways to use these primitives and some of them have pitfalls.

Note that some of this section will be conjecture on my part. I’ll try to probe as deep as I can, but some of this material will have to be left for later blogs.

The first consideration is how we launch and manage our subprocesses; in particular, whether we should use multiprocessing.Pool.map or multiprocessing.Process to launch them. Sadly, it matters.

The second consideration is speed. If we’re not careful and use the wrong primitives, we may pay a runtime cost. As we are attempting to produce a read only cache, we’re going to try to optimize both the memory consumption and runtime of our code.

Something That Works

Our first priority is to get something that works as painlessly as possible, to confirm the core premise: that shared memory will fix our problem. We can bypass a lot of complexity by using a global object.

Note that to address our second concern, we need to tell the array not to lock. Once launched, all of our jobs will start reading immediately, and if the cache is guarded by a lock our subprocesses will all contend for read access; we won’t get any benefit from parallelism.

Setting lock to False is generally a bad idea if you have writes into your cache from a separate subprocess. However if you prepare the cache in advance and only read from it in your subprocesses, you should be okay.

Feel free to set lock=True if you like, but be prepared for a long runtime and keep the kill command handy.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Pool, Array
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# process is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

# A global object that contains the generated cache.
cache = None

def job(arg):
    '''
    An artificial job that reads the cache and then sleeps so that its memory usage is visible in the system monitor.
    '''
    # Read all of the data in the cache. Note that in order to see the effect
    # of the copy on write problem, we need to actually read the cache. In this
    # case the problem doesn't happen because cache is an Array object, which
    # lives in shared memory.
    global cache
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a global shared memory cache to address the copy on write issue.
    # We need to tell the array not to lock. All of our jobs will start reading
    # immediately from the cache and if the array locks, our subprocesses will
    # all contend for read access.
    global cache
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs.
    pool = Pool(PROCESSES)
    pool.map(job, [None]*PROCESSES)

if __name__ == '__main__':
    main()

Running the script, it looks like we have a winner:

[Figure: system monitor memory graph for the global shared Array version]

Now that we have something that works, let’s see if we can make it any better. Using a global variable is bad because of the large potential for import side effects that come with it. Next, we’ll see if we can pass the cache as a parameter.

Consideration the First: Pool vs Process

Why is process management a consideration? Intuitively, when I started hacking on this problem, I expected both Pool.map and Process to work in approximately the same way. Let’s see what happens if we take the example above and pass the shared memory cache as a parameter to map.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Pool, Array
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# process is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that reads the cache and then sleeps so that its memory usage is visible in the system monitor.
    '''
    # Read all of the data in the cache.
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a shared memory cache.
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs.
    pool = Pool(PROCESSES)
    # At this point, cache is expected to break the map function, because map
    # attempts to pickle the iterable parameter.
    pool.map(job, [cache]*PROCESSES)

if __name__ == '__main__':
    main()

When we run the above, we get the following traceback:

Traceback (most recent call last):
  File "./map.py", line 52, in <module>
    main()
  File "./map.py", line 48, in main
    pool.map(job, [cache]*PROCESSES)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
cPickle.PicklingError: Can't pickle <class 'multiprocessing.sharedctypes.c_int_Array_10000000'>: attribute lookup multiprocessing.sharedctypes.c_int_Array_10000000 failed

It seems like map is trying to pickle our Array primitive. If I had to guess, it’s trying to store the elements of the iterable in a queue and failing. Let’s see what happens if we try to use multiprocessing.Process (multiprocessing.Manager is discussed below):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process, Array
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# process is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that reads the cache and then sleeps so that its memory usage is visible in the system monitor.
    '''
    # Read all of the data in the cache.
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a shared memory cache.
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs using Process and not map.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

When we run the script:

[Figure: system monitor memory graph for the Process + Array version]

That works, but by switching to Process we’ve given up a lot of the nice functionality that we get with map. Fortunately, multiprocessing has a solution for this problem in the form of multiprocessing.Manager. Manager is a class that wraps data structures in proxies that can then be pickled.

On paper, the manager seems like it should work. However, it does not appear to respect the lock parameter, so if you do dare to run the script below, prepare yourself for a long runtime.

WARNING: I do not recommend running the script below. It has a long runtime.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Pool, Manager
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# process is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that reads the cache and then sleeps so that its memory usage is visible in the system monitor.
    '''
    # Read all of the data in the cache.
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a shared memory cache using a memory manager. Unfortunately this
    # manager doesn't seem to respect the lock=False argument, causing a lot
    # of reading contention and therefore a long runtime.
    manager = Manager()
    cache = manager.Array('i', range(CACHE), lock=False)

    # Run the jobs.
    pool = Pool(PROCESSES)
    pool.map(job, [cache]*PROCESSES)

if __name__ == '__main__':
    main()

As it stands, I’m not sure whether this is a bug. multiprocessing does allow for some extensibility; however, I’ve yet to figure out the right song and dance to make it all work. If there’s interest, I can do a deep dive into the guts of multiprocessing in another blog.

No Read?

As an aside, what happens if we comment out the following lines?

 for item in cache:
     pass

If our theory is correct, then Python will not have to build references to the individual records in the cache, and the memory will not be copied. Note that this trick will only work in the Process version because the Pool.map version will always pickle the memory.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# process is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that just sleeps so that its memory usage is visible in the system monitor; the cache read is commented out below.
    '''
    # With these lines commented out, the copy on write problem won't be
    # triggered, because Python never creates references to the cached objects.
    '''
    for item in cache:
        pass
    '''

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create the cache.
    cache = range(CACHE)
    sleep(PAUSE)

    # Run the jobs using Process and not map.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

As expected, no cache duplication:

[Figure: system monitor memory graph for the no-read version]

Consideration the Second: Speed

If our cache is large, then accessing it needs to be fast. We’ve already tackled part of this problem by setting lock=False above, on the assumption that the cache is read only. That takes care of subprocesses contending for read access. Now we’ll look at the process of getting the data from the Array into the Python subprocess.

Time for more speculation on my part. Under the hood, Array is a primitive C array sitting in a block of shared memory rather than in Python’s own heap. When the Python subprocess iterates over it, it creates a Python integer object for every element it reads, which takes time.

Wouldn’t it be handy if we could read straight out of that buffer? numpy to the rescue with the frombuffer function, which wraps our Array and reads directly from the underlying blocks of memory. First, let’s install numpy in a virtualenv:

 $ mkvirtualenv numpy
 $ pip install numpy

Then let’s run the example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process, Array
from time import sleep
import numpy

# The size of the cache created when calling range(CACHE). To ensure that your
# process is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that reads the cache and then sleeps so that its memory usage is visible in the system monitor.
    '''
    # Read all of the data in the cache.
    for item in numpy.frombuffer(cache, 'i'):
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a shared memory cache.
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

When compared to the vanilla approach, we get a marginal improvement on the subprocess runtime. As n grows, this margin should grow as well. As always, your mileage may vary:

[Figure: system monitor memory graph for the numpy.frombuffer version]
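
If you want to measure the difference yourself, here’s a rough sketch that times the two read paths in a single process (no subprocesses involved; the CACHE value is illustrative and the timings will vary with your hardware):

from multiprocessing import Array
from time import time
import numpy

CACHE = 1000000
cache = Array('i', range(CACHE), lock=False)

start = time()
for item in cache:
    pass
print 'plain iteration: %.3fs' % (time() - start)

start = time()
for item in numpy.frombuffer(cache, 'i'):
    pass
print 'numpy.frombuffer: %.3fs' % (time() - start)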

It is also possible to do all of this magic with strings; however, be cautioned that the Array must be both written and read with a fixed character length per record. Brush up on your malloc skills and you should be just fine.
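
Here’s a rough sketch of what I mean; the record width and the sample words are made up, and the 'c' type code gives a raw character array:

from multiprocessing import Array

RECORD = 8  # the fixed width, in bytes, of every record in the cache
words = ['alpha', 'bravo', 'charlie']

# Pad every record to the same width before packing it into the array.
packed = ''.join(word.ljust(RECORD) for word in words)
cache = Array('c', packed, lock=False)

# Reading a record back means slicing at a fixed offset and stripping the pad.
print cache[RECORD:2 * RECORD].rstrip()  # bravo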

tl;dr

If you’re looking for a vanilla example, look here:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process, Array
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# process is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that reads the cache and then sleeps so that its memory usage is visible in the system monitor.
    '''
    # Read all of the data in the cache.
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a shared memory cache.
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs using Process and not map.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

If you’re using numpy, look here:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process, Array
from time import sleep
import numpy

# The size of the cache created when calling range(CACHE). To ensure that your
# process is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that reads the cache and then sleeps so that its memory usage is visible in the system monitor.
    '''
    # Read all of the data in the cache.
    for item in numpy.frombuffer(cache, 'i'):
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a shared memory cache.
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

PyPy

Time for some bonus points. pypy is awesome and weird in all of the ways that make my mad scientist senses tingle. One of those ways is its pluggable garbage collection system.

The default collector is called minimark. I know very little about pypy, but I’ll try to give a quick overview. See Benjamin Peterson’s talk for more detail.

The underlying algorithm is called mark and sweep, and it cleans memory in two phases. In the first phase, it walks the reference tree and marks objects that are still alive. In the second phase, it sweeps away any objects that weren’t marked. As a result there is no reference counting, which means copy on write should work out of the box.

Let’s test it! First, let’s install pypy and create our virtualenv:

$ sudo apt-get install pypy
$ mkvirtualenv -p /usr/bin/pypy pypy

Now let’s rerun the benchmark with no changes and see what happens:

[Figure: system monitor memory graph for the benchmark under PyPy]

So, that’s not exactly what we expected. I suspect pypy doesn’t like the pickling that needs to happen to get the cache into Pool.map. Let’s tweak the code to use Process instead, this time without Array:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# process is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that reads the cache and then sleeps so that its memory usage is visible in the system monitor.
    '''
    # Read all of the data in the cache.
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create the cache.
    cache = range(CACHE)
    sleep(PAUSE)

    # Run the jobs using Process and not map.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

[Figure: system monitor memory graph for the Process version under PyPy]

Now that’s the kind of result that makes me smile 🙂 And yes, I did check to make sure that the process actually ran :p

Results


After running the patched code in production, I’m happy to report that it worked. Huzzah for science!

As a general rule, it’s hard to gauge the performance of a patch until you run it in production. In this particular case, I had good evidence pointing to the root cause of the problem, and memory consumption was reduced to the new bottleneck I predicted.

We also got a bit of runtime back, but that’s a story for a future blog. I’d also like to do a deeper dive into Python’s memory allocation and see if we can break out strace and valgrind.

Until then, if you’re stuck on a hard problem just remember, try science!



Responses to Python vs Copy on Write

  1. siyandai says:

    Why do you need [cache]*PROCESSES?

    Doesn’t this explain the extra memory usage?

    • llvllatrix says:

      You need it to provide the cache as a parameter to map, which applies the function to all of the elements in the list.

      cache is just a reference; calling [cache]*PROCESSES doesn’t actually allocate new memory except for the additional references. [cache[:]]*PROCESSES, in contrast, would allocate new memory.

      I can write an example script to prove the above if you like (assuming I’m correct of course 🙂 ).

      • Avish says:

        [cache[:]]*PROCESSES copies `cache` once (allocating new memory), then creates extra references to the copy according to the number of PROCESSES. This takes 2 x CACHE the memory, but not PROCESSES x CACHE.

        To create a copy per process, you’d need to say something like:
        [cache[:] for _ in xrange(PROCESSES)]
        Here the list comprehension would evaluate `cache[:]` once per iteration, copying it each time, and you’ll end up with PROCESSES x CACHE memory used.


  2. petroslamb says:

    Hi, great article.

    A quick question on the pypy solution:

    Why do we not see the bump on the graph, that marks the initial allocation of the cache (before multiprocessing)? Is it also some type of pypy magic?

    Thanks, keep it up!

