The Human Rules of Acquisition

There’s very little else as satisfying as calling someone a pataQ, questioning their honour and claiming their D’k tahg. Quoting an obscure rule of acquisition during a business negotiation comes close.

I’ve always found it mildly irritating that humans don’t have an equivalent. Then I read The Intelligent Investor by Benjamin Graham and thought, “by the Nagus, I’ve found the rules!”

So I highlighted, summarised, summarised again, sorted and then curated the remaining list. I’m a nerd who likes to read textbooks; what do you want from me?

[Image: summary]

So, in accordance with the 74th rule of acquisition, “knowledge equals profit”, I present version 0.1.0 of The Rules of Investment:

https://github.com/RishiRamraj/The-Rules-of-Investment/tree/master

Sadly, there are only 237 rules. 😦 It should go without saying that this is, more or less, an elaborate joke and that if you rely on this summary, you’re insane.

[Image: jim]

Feel free to shoot me a pull request if you have any other useful rules. For my next trick, I think I’ll translate a Shakespearean play from its original Klingon.

– Qapla’


How To: Make a Dive Knife

Between pizza-fuelled coding sessions, I’m occasionally let out to spend time on things that are not programming related. Today I thought, “it’s about time I made a dive knife”, followed by “I may as well blog about it”.

Why Make a Dive Knife?

Diving is a technical field filled with a large number of novices, groups of self-taught gurus who tend to do careless things, and a small handful of experts who are dogmatic about the right way of doing things. As a programmer, I fit right in.

With my level of experience I fall well within the novice camp, but as per my usual approach I try to understand and adopt the practices of the expert camp when it’s appropriate. Hopefully I don’t end up falling into the guru camp of divers.

…which brings us to knives. ISE did a recent video about knives and it got me thinking about my own setup. Many divers have a misconception about why they carry a knife. Most of the public seems to think it’s to fend off sharks.

[Image: shark]

If a shark wants to attack you, about the most effective thing you can do is to blow bubbles; they’re loud and scary. Some divers will carry a bayonet strapped to the inside of their calf thinking they might need to pry open a treasure chest.

The reason you carry a dive knife is to cut yourself free from entanglements. Here are two of the most common offenders:

[Image]

On the left we have monofilament line, often in the form of nets. On the right we have cave line, which divers lay in caves so they can feel their way out if they accidentally kick up silt.

So what makes a good dive knife? Here’s a checklist:

  • It cuts line.
  • It doesn’t cut you, your dry suit, your gloves or your hoses.
  • You can see it when you wear it.
  • You can use it with thick gloves on.
  • You don’t need to take a yoga class to be able to reach it.

A tiny, blunt knife with a serrated edge fits the bill perfectly. You’d think the diving industry would be able to supply something like that; I suspect they’re hard to sell to the uninitiated.

To be fair, Dive Rite did introduce the Econo knife, which is effectively (if not actually) a ground-down steak knife. You can also buy line cutters, which is what I’ve done in the past, but as you’ll see they have limitations. It’s not uncommon for cave and wreck divers to make their own ground-down steak-knife dive knives.

Step 0: The Premise

I usually wear two trilobite line cutters near my shoulders (sorry GUE divers, I know they should be on my belt), in two little nylon pouches. I’m going to take a little stainless steel serrated knife (above) and turn it into a drop-in replacement for a trilobite (below).

[Image]

Why won’t the trilobite fit all emergency situations? Well, what if a bolt snap gets caught somewhere and I need to cut it free?

[Image]

Unless I break the plastic guard, it’ll be very hard to get the trilobite to fit in between the bolt snap and the hose. Not a problem for the steak knife:

[Image]

In addition to the knife, we’ll need some nylon webbing (1 inch), some Velcro and some nylon thread:

[Image]

A note about thread: it’s common for DIYers to use great materials for everything except the thread, only to find their creations unravelling when used. Here’s the stuff that I use, via eBay:

[Image]

Step 1: Grind Down the Blade

For want of a bench grinder, I opted to use my Dremel with a diamond cutoff wheel to cut the profile of the blade.

[Image]

Knives have a temper: the steel has been heated and cooled in a way that gives it a particular hardness (I’m not a metallurgist). If you overheat the steel (say, by attacking it with a diamond cutoff wheel), you may end up ruining the temper. So I occasionally dunked the blade into a cup of water. I’m not sure it actually helped.

[Image]

After some coaxing with some grinding stones, I finally had the blade profile I wanted.

[Image]

Step 2: Cut the Handle

Next, it was time to introduce the knife to the band saw in order to cut the handle down to size.

[Image]

Fortunately this knife wasn’t a full tang knife (you can check with a strong light). Here’s the knife cut down to size:

[Image]

I then drilled a hole to attach the Velcro strip, and shaped the hole using a grinding attachment.

[Image]

Step 3: Create the Velcro Strap

With the knife done, it’s time to focus on the Velcro strap. The strap serves two purposes. It allows you to store the knife securely in its nylon sheath. It also acts as an extended handle. First, I started with a loop:

[Image]

Then I sandwiched the loop in between two strips of Velcro.

[Image]

Finally, I attached the knife to the strap using a bit of cave line. ISE has a video on tying off gear.

[Image]

All Done

I present the finished knife:

[Image]

A quick neoprene glove test:

[Image]

And here’s the knife mounted on my back plate, well within eyeshot and deployable with a single hand.

[Image]


Python vs Copy on Write

I recently found myself awake at 4 AM trying to science myself out of a memory problem in Python on Linux. If you’re trying to share a read only cache to a number of Linux subprocesses and find yourself running out of memory and time, this blog is for you. The tl;dr of the story is down below.

This issue was the first serious whack I’ve had at tuning memory issues in Python. I’m by no means an expert on the subject, so feel free to fill in some of the gaps in the comments below.

For the purposes of this blog, I’ll try to keep the toolset simple. No fancy valgrind, just stock Python 2.7 scripts running on Ubuntu 15.04, with some PyPy thrown in for bonus points. I’ll be allocating a bunch of memory and hoping it shows up in the stock Ubuntu System Monitor.

There are most certainly better tools available. If you’re running in production, you’ll want to log the memory consumption of your process and have a way of associating it with what is happening in your process. Be sure to keep some Linux performance tools tucked safely under your pillow, as well as some grep, awk, sed and gnuplot goodness for log file mining.
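If all you need is a rough number to log, the resource module can give you one without leaving Python. Here is a minimal sketch of my own (not the tooling used for the measurements in this post):

import resource

# On Linux, ru_maxrss reports the peak resident set size in kilobytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak RSS: %d kB' % peak_kb)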

However, instead of tooling I’d like to talk basic concepts. It’s important to understand the mechanics of a problem before trying to treat it, and that’s what I hope to accomplish with this blog. With that said, let’s dive into the problem.

[Image: no_diving]

Measure First

The bit of code I was trying to tune was a basic multiprocessing worker pool running on Linux (i.e. using fork). The code will fork, do some processing and join the main process when it’s complete. Each worker in the pool requires a cache of data to do its job. Those familiar with copy on write issues in Python will immediately spot the problem. I’ll go over the details of this problem in the section below.

However, I should underscore that when you’re dealing with performance issues, a large part of the effort is properly diagnosing the problem. As with any performance issue, the size of n matters just as much as the code that processes it. While I had suspicions that my problem was a copy on write issue, I did not pursue a solution until I had clear evidence that supported my hypothesis.

Like unit testing, this sort of rigour grows more important as timelines shrink…which raises the question: how do you know you have a memory issue? Or more specifically:

[Image: traceback]

On Linux, the oom-killer (out of memory killer) is lurking in the background, waiting to defend your system from processes that request too much memory. Unfortunately you will lose your traceback, because the process is killed before it gets the opportunity to generate one. If you ever find yourself wondering why your Python process abruptly halted, check your syslog for something like:

Jan 24 00:00:23 host kernel: python invoked oom-killer: gfp_mask=0x100bc, order=0, oom_adj=0, oom_score_adj=0

So, hopefully you’ve done your diligence and come to the conclusion that you have a memory issue associated somehow with multiprocessing. What’s next?

The Problem

I was fortunate to be sitting in the audience when Brandon Rhodes gave his talk on Python, Linkers, and Virtual Memory at PyCon. He does a much better job of explaining Copy on Write than I do in this blog, so go watch the video. I’ll try to do a quick summary of his talk here, so that you can understand all of the working pieces.

A very simple mental model for memory is a list of data blocks that each have an address. “How much memory does a process use?” is a loaded question for the simple reason that processes share memory.

If two processes want to load the same shared library, instead of loading the library into two separate addresses, the operating system can load it into one address and point both processes to the same address in memory. As a result there is a distinction between the total memory actually used by a process (called the Resident Set Size or RSS) and the virtual memory used by a process (called the Virtual Memory Size or VMS).
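If you want to see those two numbers for your own process, Linux exposes them in /proc. Here is a quick sketch of my own, not something from the original post:

def memory_usage():
    '''
    Return the current process's VmRSS and VmSize as reported by the kernel.
    '''
    usage = {}
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith(('VmRSS', 'VmSize')):
                key, value = line.split(':', 1)
                usage[key] = value.strip()  # e.g. '123456 kB'
    return usage

print(memory_usage())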

In addition to sharing code, processes can also share data. The classic example on Unix is forking a child process. When you fork a child process on Unix, it doesn’t re-run the process from scratch. Instead the operating system creates a new process and points the new process’s virtual memory at the same physical blocks the parent is using.

This works well until either process needs to modify the memory. When that happens, the operating system creates a copy of the affected memory block and then writes the data; hence the term copy on write.
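As a toy illustration (my own, and Linux only), a forked child can see a list the parent allocated without rebuilding it; the kernel only duplicates the physical pages when one side writes to them:

import os

data = range(1000000)  # allocated once, in the parent

pid = os.fork()
if pid == 0:
    # Child: the list is visible here without being copied up front.
    assert len(data) == 1000000
    os._exit(0)
else:
    os.waitpid(pid, 0)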

Copy on write is handy for creating caches that subprocesses can reference. Read-only memory need only be referenced, not duplicated, which significantly reduces the memory overhead of a subprocess.

Copy on write is a problem for Python because its objects are reference counted. To keep track of when an object should be deleted, Python stores a count of how many references to the object currently exist. When that count reaches 0 the object can be collected. Benjamin Peterson has an interesting talk on the subject.

The unfortunate implication is that when your subprocess comes along, even the act of reading a block of memory will result in a duplicate copy of the data being created as Python needs to update the reference count.

As a result, every new subprocess that reads the cache will also create a full copy of the cached data.
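You can watch the counts yourself with sys.getrefcount. A small illustration of my own: the count is stored inside the object, so even taking a new reference means writing to the object’s memory, which is exactly what defeats copy on write after a fork:

import sys

cache = ['some', 'cached', 'records']

# One reference from the `cache` name, plus the temporary reference that
# getrefcount itself holds while it runs.
print(sys.getrefcount(cache))

# "Reading" the object into another name bumps the count stored in the
# object's own header.
alias = cache
print(sys.getrefcount(cache))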

The Benchmark

WARNING: Running some of these tests might run your computer out of memory and possibly crash it or worse. Here be dragons. Use proper precautions.

Before we dive into a full blown solution, we’ll need a benchmark for comparison. If we can address the behaviour we see in the benchmark, we’ve got a candidate solution.

The benchmark is relatively simple. It first creates a large data structure to serve as our cache using the range function with a very large number. It will then spawn a number of subprocesses that attempt to read from the cache. The expectation is that our memory usage will be multiplied by the number of subprocesses that we spawn.

In order to run these examples, you’ll have to tune them a bit to get them to show up in your system monitor. Three parameters control the behaviour of the benchmark script below. The most important is the CACHE, which specifies the number of records to be created (i.e. the very large number above). PROCESSES is the number of subprocesses that we’ll spawn. To differentiate between the cache allocation part of the benchmark and the subprocess part, we’ll inject a number of pauses using PAUSE, in seconds.

Tuning CACHE can be a bit tricky. In order to reproduce my experiments you’ll need the process to show up distinctly in your system monitor. Again take caution because running your machine out of memory is easier than you think. I’ve intentionally made the CACHE size small enough for most machines. I usually add 0s to the CACHE size until it consumes about 5% of the memory on the machine. Setting the PAUSE value to something low will also speed up this process.

Here’s the benchmark:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Pool
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# processes is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that sleeps for a second and then reports memory usage.
    '''
    # Read all of the data in the cache. Note that in order to see the effect
    # of the copy on write problem, we need to actually read the cache. By
    # reading the cache, we create a new reference to the memory being tracked
    # and therefore copy the memory, even though all we're doing is reading.
    for item in cache:
        pass

    # Make sure the allocated memory is visible in the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create the cache. Note that allocating a large amount of memory takes
    # time. After the memory is allocated, we want to make sure the new memory
    # level is visible in the system monitor, so we pause.
    cache = range(CACHE)
    sleep(PAUSE)

    # Run the jobs.
    pool = Pool(PROCESSES)
    pool.map(job, [cache]*PROCESSES)

if __name__ == '__main__':
    main()

Here’s what happens when we run the benchmark:

[Image: benchmark]

Time proceeds from left (the start) to right (the end). You can see the first long PAUSE after the cache is allocated. Then the cache is multiplied by the number of subprocesses, followed by the second long PAUSE as the parallel processes all finish.

The Solution?

The core of our solution is to tell the operating system that it should be using shared memory that isn’t reference counted. Fortunately, multiprocessing has a number of primitives that can help. Unfortunately, there are a few ways to use these primitives and some of them have pitfalls.

Note that some of this section will be conjecture on my part. I’ll try to probe as deep as I can, but some of this material will have to be left for later blogs.

The first consideration is how we launch and manage our subprocesses; in particular, whether we should use multiprocessing.Pool.map or multiprocessing.Process to launch them. Sadly, it matters.

The second consideration is speed. If we’re not careful and use the wrong primitives, we may pay a runtime cost. As we are attempting to produce a read only cache, we’re going to try to optimize both the memory consumption and runtime of our code.

Something That Works

Our first priority is to get something that works as painlessly as possible to confirm the core premise; that shared memory will fix our problem. We can bypass a lot of complexity by using a global object.

Note that to address our second concern, we need to tell the array not to lock. Once launched, all of our jobs will start reading immediately and if the cache locks, our subprocesses will all contend for read access; we won’t get any benefit from parallelism.

Setting lock to False is generally a bad idea if you have writes into your cache from a separate subprocess. However if you prepare the cache in advance and only read from it in your subprocesses, you should be okay.

Feel free to set lock=True if you like, but be prepared for a long runtime and keep the kill command handy.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Pool, Array
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# processes is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

# A global object that contains the generated cache.
cache = None

def job(arg):
    '''
    An artificial job that sleeps for a second and then reports memory usage.
    '''
    # Read all of the data in the cache. Note that in order to see the effect
    # of the copy on write problem, we need to actually read the cache. In this
    # case the problem doesn't happen because cache is an Array object, which
    # lives in shared memory.
    global cache
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a global shared memory cache to address the copy on write issue.
    # We need to tell the array not to lock. All of our jobs will start reading
    # immediately from the cache and if the array locks, our subprocesses will
    # all contend for read access.
    global cache
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs.
    pool = Pool(PROCESSES)
    pool.map(job, [None]*PROCESSES)

if __name__ == '__main__':
    main()

Running the script, it looks like we have a winner:

[Image: global]

Now that we have something that works, let’s see if we can make it any better. Using a global variable is bad because of the large potential for import side effects that come with it. Next, we’ll see if we can pass the cache as a parameter.

Consideration the First: Pool vs Process

Why is process management a consideration? Intuitively, when I started hacking on this problem, I expected both Pool.map and Process to work in approximately the same way. Let’s see what happens if we attempt to take the example above and pass the shared memory cache as a parameter to map.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Pool, Array
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# processes is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that sleeps for a second and then reports memory usage.
    '''
    # Read all of the data in the cache.
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a shared memory cache.
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs.
    pool = Pool(PROCESSES)
    # At this point, cache is expected to break the map function, because map
    # attempts to pickle the iterable parameter.
    pool.map(job, [cache]*PROCESSES)

if __name__ == '__main__':
    main()

When we run the above we get the following traceback:

Traceback (most recent call last):
 File "./map.py", line 52, in <module>
 main()
 File "./map.py", line 48, in main
 pool.map(job, [cache]*PROCESSES)
 File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map 
 return self.map_async(func, iterable, chunksize).get()
 File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get 
 raise self._value
cPickle.PicklingError: Can't pickle <class 'multiprocessing.sharedctypes.c_int_Array_10000000'>: attribute lookup
 multiprocessing.sharedctypes.c_int_Array_10000000 failed

It seems like map is trying to pickle our Array primitive. If I had to guess, it’s trying to store the elements of the iterable in a queue and failing. Let’s see what happens if we try to use multiprocessing.Process (multiprocessing.Manager is discussed below):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process, Array
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# processes is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that sleeps for a second and then reports memory usage.
    '''
    # Read all of the data in the cache.
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a global shared memory cache.
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs using Process and not map.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

When we run the script:

[Image: process]

That works, but by switching to Process we’ve given up a lot of the nice functionality that we get with map. Fortunately, multiprocessing has a solution for this problem in the form of multiprocessing.Manager. Manager is a class that wraps data structures in proxies that can then be pickled.

At a glance it seems as though the manager should work. However, the manager does not respect the lock parameter, so if you do dare to run the script below, prepare yourself for a long runtime.

WARNING: I do not recommend running the script below. It has a long runtime.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Pool, Manager
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# processes is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that sleeps for a second and then reports memory usage.
    '''
    # Read all of the data in the cache.
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a shared memory cache using a memory manager. Unfortunately this
    # manager doesn't seem to respect the lock=False argument, causing a lot
    # of reading contention and therefore a long runtime.
    manager = Manager()
    cache = manager.Array('i', range(CACHE), lock=False)

    # Run the jobs.
    pool = Pool(PROCESSES)
    pool.map(job, [cache]*PROCESSES)

if __name__ == '__main__':
    main()

As it stands, I’m not sure if this is a bug. multiprocessing does allow for some extensibility, however I’ve yet to figure out the right song and dance to make it all work. If there’s interest, I can do a deep dive into the guts of multiprocessing in another blog.

No Read?

As an aside, what happens if we comment out the following lines?

 for item in cache:
     pass

If our theory is correct, then Python will not have to build references to the individual records in the cache, and the memory will not be copied. Note that this trick will only work in the Process version because the Pool.map version will always pickle the memory.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# processes is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that sleeps for a second and then reports memory usage.
    '''
    # With these lines omitted, the copy on read issue won't occur because
    # python will not build references as a result of reading the memory.
    '''
    for item in cache:
        pass
    '''

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a global shared memory cache.
    cache = range(CACHE)
    sleep(PAUSE)

    # Run the jobs using Process and not map.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

As expected, no cache duplication:

[Image: no_read]

Consideration the Second: Speed

If our cache is large, then accessing it needs to be fast. We’ve already tackled part of this problem by setting lock=False above, with the assumption that the cache will be read only. That takes care of the issues associated with subprocesses contending for read access. Now we’ll look at the process of getting the data from the Array into the Python subprocess.

Time for more speculation on my part. Under the hood, Array is actually a primitive C array allocated with malloc. When the Python subprocess iterates over this array, it creates a Python integer object for every element it reads, which takes time.

Wouldn’t it be handy if we could just read directly from the buffer? numpy to the rescue with the frombuffer function. This function takes our Array and reads directly from its blocks in memory. First let’s install numpy in a virtualenv:

 $ mkvirtualenv numpy
 $ pip install numpy

Then let’s run the example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process, Array
from time import sleep
import numpy

# The size of the cache created when calling range(CACHE). To ensure that your
# processes is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that sleeps for a second and then reports memory usage.
    '''
    # Read all of the data in the cache.
    for item in numpy.frombuffer(cache, 'i'):
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a global shared memory cache.
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

When compared to the vanilla approach, we get a marginal improvement on the subprocess runtime. As n grows, this margin should grow as well. As always, your mileage may vary:

[Image: frombuffer]
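If you want to quantify the difference on your own machine, here is a rough timing sketch of my own (not the measurement behind the graph above):

from multiprocessing import Array
from timeit import timeit
import numpy

# A million ints is enough to see a difference without risking your memory.
cache = Array('i', range(1000000), lock=False)

def read_plain():
    for item in cache:
        pass

def read_buffered():
    for item in numpy.frombuffer(cache, 'i'):
        pass

print('plain:      %.3fs' % timeit(read_plain, number=1))
print('frombuffer: %.3fs' % timeit(read_buffered, number=1))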

It is also possible to do all of this magic with strings; however, be cautioned that the Array object must be both written and read with a fixed character length. Brush up on your malloc skills and you should be just fine.
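Here is a minimal sketch of what that might look like; the record width and the packing scheme are my own assumptions, so treat it as a starting point rather than a recipe:

from multiprocessing import Array

# Every record gets padded (or truncated) to the same fixed width.
RECORD = 16
words = ['monofilament', 'cave line', 'bolt snap']

# Pack the records into one flat character Array, with the lock disabled
# because we only read from it.
packed = ''.join(word.ljust(RECORD)[:RECORD] for word in words)
cache = Array('c', packed, lock=False)

# Reading a record back means slicing the same fixed-size window.
print(cache[1 * RECORD:2 * RECORD].rstrip())  # 'cave line'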

tl;dr

If you’re looking for a vanilla example, look here:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process, Array
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# processes is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that sleeps for a second and then reports memory usage.
    '''
    # Read all of the data in the cache.
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a global shared memory cache.
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs using Process and not map.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

If you’re using numpy, look here:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process, Array
from time import sleep
import numpy

# The size of the cache created when calling range(CACHE). To ensure that your
# processes is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that sleeps for a second and then reports memory usage.
    '''
    # Read all of the data in the cache.
    for item in numpy.frombuffer(cache, 'i'):
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a global shared memory cache.
    cache = Array('i', range(CACHE), lock=False)
    sleep(PAUSE)

    # Run the jobs.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

PyPy

Time for some bonus points. pypy is awesome and weird in all of the ways that make my mad scientist senses tingle. One of those ways is its pluggable garbage collection system.

The default collector is called minimark. I know very little about pypy, but I’ll try to give a quick overview. See Benjamin Peterson’s talk for more detail.

The underlying algorithm is mark and sweep. It cleans memory in two phases: in the first phase, it walks the reference tree and marks objects that are still alive; in the second phase it sweeps away any objects that aren’t marked. As a result there is no reference counting, which means copy on write should work out of the box.

Let’s test it! First, let’s install pypy and create our virtualenv:

$ sudo apt-get install pypy
$ mkvirtualenv -p /usr/bin/pypy pypy

Now let’s rerun the benchmark with no changes and see what happens:

[Image: pypy_benchmark]

So, that’s not exactly what we expected. I suspect pypy doesn’t like the pickling that needs to happen to get the cache into Pool.map. Let’s tweak the code to use Process instead, except without Array:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process
from time import sleep

# The size of the cache created when calling range(CACHE). To ensure that your
# processes is visible on your system, you will need to adjust this value.
# Start with the CACHE value below and keep adding 0s until anywhere between
# 5-10% of your memory is consumed. You may also want to reduce the PAUSE
# value below to make this tuning easier. Note that IT IS VERY EASY TO RUN
# YOUR SYSTEM OUT OF MEMORY THIS WAY, so save everything before attempting.
CACHE = 10000

# The number of subprocesses forked.
PROCESSES = 5

# The number of seconds to pause in order to display the allocated memory in
# the system monitor.
PAUSE = 3

def job(cache):
    '''
    An artificial job that sleeps for a second and then reports memory usage.
    '''
    # Read all of the data in the cache.
    for item in cache:
        pass

    # Wait for the system monitor.
    sleep(PAUSE)

def main():
    '''
    Entry point.
    '''
    # Create a global shared memory cache.
    cache = range(CACHE)
    sleep(PAUSE)

    # Run the jobs using Process and not map.
    for index in range(PROCESSES):
        process = Process(target=job, args=(cache, ))
        process.start()

if __name__ == '__main__':
    main()

[Image: pypy_process]

Now that’s the kind of result that makes me smile 🙂 And yes, I did check to make sure that the process actually ran :p

Results

[Image: good_news]

After running the patched code in production, I’m happy to report that it worked. Huzzah for science!

As a general rule, it’s hard to gauge the performance of a patch until you run it in production. In this particular case, I had good evidence pointing to the root cause of the problem, and memory usage dropped to the bottleneck I predicted.

We also got a bit of runtime back, but that’s a story for a future blog. I’d also like to do a deeper dive into Python’s memory allocation and see if we can break out strace and valgrind.

Until then, if you’re stuck on a hard problem just remember, try science!


Explaining Technical Debt

I’ve been thinking about technical debt and how it builds in software projects. Brace yourself, this is a depressing topic so I’ve included a picture of a turtle. Hopefully that makes you feel better.

[Image: turtle]

Good? Good.

It’s very easy to come up with a specific example of technical debt and then say: if we fixed that problem, the project would be much easier to modify and therefore cheaper to maintain. The real trouble with technical debt is explaining how important it is to someone without a technical background. As a rule, it’s very easy to justify the next great feature in terms of sales; it’s a measurable win and makes everyone very happy very fast.

Technical debt, true to its name, is another beast entirely. It builds up slowly over time. Everyone knows it’s there and can see the disease spreading, but it isn’t until it’s too late that you can convince people to take action. It’s too late when you need to rewrite your software to add the next essential feature. For the economically literate, this is when your marginal cost exceeds your marginal revenue.

The end result is a long and gruelling recovery in the best case and shutting down the business in the worst. There’s nothing much that can be done about the temptation to trade long term cost savings for short term revenue. But with a useful analogy we may be able to shed some light on the dragon lurking in the darkness of the treasure cave.

Technical debt is very much like debt, but with a different currency. In the world of software we trade time for features. In an ideal world, the cost of a feature is static. It doesn’t matter if you implement the feature this cycle or five years from now, it costs the same amount of time. This is probably how management thinks software development works. They’ll probably also get confused when that’s no longer the case.

Now let’s add a bit of complexity. Your boss asks for a feature that is kinda pricey for this cycle. Normally you’d tell them no, but instead you decide to borrow some time from the next release by cutting corners. You’re hailed as a technical hero for meeting the deadline. You think to yourself, things will be okay if you spend a bit of time next release to pay back the time you borrowed.

However, now you’ve set a precedent and the credit genie is out of the bottle. The next release comes along and again you’re strapped for time and so you borrow more time. This time it’s only moderately worse so again you figure you’ll spend an all nighter or two to get it done right in the next release.

Unfortunately, debt compounds over time. When the debt is low the problem is small but when the debt is high, no amount of all nighters during a release can help you. What you need to do is spend some serious time paying down the debt by fixing the small compromises you’ve made over past releases.

Up to this point, you’ve hit all of your deadlines. How do you tell management that you need to spend a whole cycle rewriting code that already works? The reality is that it’s already getting too late, because if you try to add features this release, it will get worse. Instead let’s look at what we can do before things spin out of control.

Do you necessarily have to borrow? If you put together a killer team and invest heavily in testing, you’ve solved the problem, right? Well, not really. Any sufficiently complex project has unexplored problems, and your team needs to learn the best ways to solve them. That often means borrowing refactoring time from the future to try something new today.

But having a killer team helps, right? It does, but not in the way you might think. It doesn’t reduce your overall debt as that is really part of your source code. Having a good team is like having a credit card with a cheap interest rate. You can borrow less time from future releases with clever solutions today, but you’re still borrowing. If you’ve already accumulated debt in your source code, you still need to pay that debt no matter which team you have.

Sometimes you get lucky and a good developer will find a clever solution to a core problem that clears a bunch of debt. Sometimes you get unlucky and you’ll find that you’ve accidentally built your software on a mud bank that is washing away.

Anomalies aside, the place to be is where you can service your debt. That is, you are clearing enough of it to keep the amount of time between releases to an acceptable level. If the time between releases starts to go up quickly over every release, you’ve got a problem. It’s time to stop and refactor.

Good teams refactor constantly and are always assessing whether their designs fit their requirements. The rule of thumb I’ve seen thrown around is about one developer on one refactoring task per cycle per 5-10 developers. This number will depend very much on your industry and quality requirements.

So let’s summarise the analogy. Building software is like spending time for features. Sometimes you have to borrow time from a future release to buy a feature. If you do, your team determines the interest rate you have to pay and your source code is your balance sheet. If you don’t pay back your debt, you will no longer be able to modify your software and bad things will happen.

The scariest part is that most management teams are often spending credit as though there were no interest rates. The only people that can warn them are the engineers, so it’s time to start complaining. Your weekends depend on it.


sqlpyzer: IPython + sqlite in html5

What is a sqlpyzer?

sqlpyzer is an experimental plugin I wrote for the IPython Notebook. It’s much easier to describe what it is in a video, so here goes:

You can find the source and the installation instructions here:

https://github.com/RishiRamraj/sqlpyzer

The Name

Sadly all of the good SQL names were taken. I was going for sqlizer but things fell apart quickly when I tried to add py.

Installation

There are installation instructions on the github page, but I’ll go into a bit more detail here. I haven’t tested these steps, so please comment with any corrections. I’m currently running Ubuntu 12.04 32-bit desktop. I’ve tested the plugin using Python 2.7 and Chromium. I’m going to assume that you have your github repository located in ~/Repo. You’ll need a specific build of IPython, one that supports jsplugins. You can find such a build at:

https://github.com/RishiRamraj/ipython

Be sure to check out the “json” branch. Once you have ipython checked out you can then run the notebook by using the ipython.py script in the root directory. This script will then launch the notebook and open a browser.

user@computer:~/Repo/ipython$ python ipython.py notebook

Next, we’ll install sqlpyzer (again, sorry for the name). Close the notebook for now, we’ll get back to it later. I recommend you use pip and virtualenv. If you’re not familiar with these tools, check this out:

http://jontourage.com/2011/02/09/virtualenv-pip-basics/

Install sqlpyzer:

user@computer:~/Repo/sqlpyzer$ pip install -e .

By using a develop egg above, you’ll be able to modify sqlpyzer without having to reinstall it. Next, we’ll have to install the sqlpyzer javascript. You’ll need to find the ipython local configuration folder. In Ubuntu:

user@computer:~$ cd .config/ipython/profile_default/static/jsplugins/

Symlink the Javascript files so that you can modify the Javascript in your repository:

user@computer:~/.config/ipython/profile_default/static/jsplugins% ln -s /home/user/Repo/sqlpyzer/js/* .

Finally, run the notebook again and open the notebook located here to test the plugin:

~/Repo/sqlpyzer/sqlpyzer.ipynb

Repo Structure

There are three things of interest in the sqlpyzer repo. The sqlpyzer folder contains all of the Python responsible for getting the data out of Python and marshalling it into json. The js folder contains all of the Javascript that creates the sqlite database and marshals the json into the database. The third is the sqlpyzer.ipynb notebook used above to test the plugin.

Python

load.py

The load.py module contains one function. It is called by the load_ext function in IPython. Its job is to tell IPython how to take a data object and display it. Everything comes together on this line:

formatter.for_type_by_name(module, name, serializer)

The serializer function is called every time you issue the display command on a sqlpyzer.Data object.

data.py

This module contains a simple validator and data type. Besides validation, the only real purpose this data type serves is to associate Data objects with a specific serializer function that tells IPython which Javascript handler to use.
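I won’t reproduce data.py here, but a hedged guess at its shape looks something like this; the validate helper and the tables attribute are my assumptions, not the real code:

def validate(tables):
    '''
    Hypothetical check: the cache is a dict of table name to list of rows,
    where each row is a dict of column name to value.
    '''
    assert isinstance(tables, dict)
    for name, rows in tables.items():
        assert all(isinstance(row, dict) for row in rows)

class Data(object):
    '''
    A thin wrapper that marks a dict of tables as something sqlpyzer
    should render.
    '''
    def __init__(self, tables):
        validate(tables)
        self.tables = tables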

serialize.py

The function in this module turns the Data object into a json string. The resulting object has two properties. First is the data object. Note that the “data” key can really be called anything. Second is the handler. The handler is special as it tells IPython which Javascript function to use to interpret the json being passed in.
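As a hedged sketch only, the serializer might look something like this; the handler name comes from sqlpyzer.js below, and the tables attribute matches the guess at data.py above:

import json

def serializer(data):
    '''
    Turn a Data object into the json string handed to the Javascript side.
    '''
    return json.dumps({
        # Tells IPython which Javascript callback to invoke.
        'handler': 'sqlpyzer_data',
        # As noted above, the "data" key could be called anything.
        'data': data.tables,
    })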

__init__.py

In order for IPython to be able to load a module as an extension, the module needs to have a load_ipython_extension function. The __all__ list declares the package’s public names, so load_ipython_extension is available as an attribute of the module when it is loaded. If you’re not familiar with __all__, take a look at:

http://docs.python.org/2/tutorial/modules.html#importing-from-a-package
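Assuming load_ipython_extension lives in load.py, a plausible __init__.py (my sketch, not the repo’s actual file) is just:

from load import load_ipython_extension

__all__ = ['load_ipython_extension']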

Everything Else

constants.py has some constants like the Javascript callback name. test.py contains tests. To run the tests use:

user@computer:~/Repo/sqlpyzer/sqlpyzer$ python test.py

Javascript

Disclaimer: I know more about Python than I do Javascript.

console.js

This script takes the id of a div as an argument and creates the database object with the UI. The UI should automatically resize its controls to fit whatever container you put it in. The display function is the interesting part of this module; it demonstrates how to retrieve data from a sqlite query. Once the console is created, it returns an object containing the database, input and output fields.

sqlpyzer.js

Possibly the largest single piece of source in this project. If you’re looking for the Javascript that associates the handler name with the Javascript callback, look all the way at the end of the file:

IPython.json_handlers.register_handler('sqlpyzer_data', sqlpyzer.data);

The sqlpyzer.data callback takes two parameters: the json object that we serialised in from Python, and the new element created under the display function. It then creates a console and uses the json data to populate the database.

When reading the create_table code, it’s important to remember that Javascript doesn’t block when executing SQL. Instead, you need to specify callbacks that are called when a query succeeds or fails. If your queries are independent, as with this for loop, you can get away with issuing them in parallel in a single transaction. In the background, Javascript queues the SQL execution to happen as a set of callbacks once the for loop is complete.

// Create all the tables.
console.database.transaction(function (tx) {
     for (table in json.data) {
         sqlpyzer.create_table(tx, table, json.data[table]);
     } 
});

If your queries depend on other queries like row insertion depending on table creation, things start getting interesting. You need to specify a callback that will insert all of the rows once the table has been created. You’ll notice that the create_table function has two functions called create_schema and insert_row for this reason. create_schema exposes a callback called next that then inserts all of the rows in parallel:

// Pull it all together to create the table.
create_schema(tx, function(tx) {
    for (var index = 0; index < table.length; index++) {
        var row = table[index];
        insert_row(tx, row);
    }
});

Closures make this sort of work easy. They basically set the context for the callbacks used to create the schema and the individual rows. Otherwise, we would have to pass all of the table level parameters into both callbacks, which would be messy. If we were doing this sort of work sequentially with a language like Python, our context containers would be our for loops:

# Table level context.
for name, table in tables.items():
    # Create the schema.
    schema = get_schema(table)
    create_schema(schema)

    # Row level context.
    for row in table:
        # Create the row.
        insert_row(schema, row)

Once you insert a row, how do you get the id of the new row? I’m glad you asked! My initial thought was to use the last_insert_rowid function:

select last_insert_rowid() as id;

However in Javascript you cannot control the order of your callbacks and so you may end up with the wrong ID. Fortunately, the result set has a handy property called insertId. You can see it at work here:

// Generate the insert query.
var query = 'INSERT INTO '+name+'('+columns.join(', ')+') ';
query += 'VALUES ('+params.join(', ')+')';
tx.executeSql(query, values, function (tx, data){
    // The inserted row id.
    var new_id = data.insertId;

Debugging

Unfortunately, IPython has a nasty tendency to eat exceptions that happen in both the serializer function in Python and the sqlpyzer.data function in Javascript. In Python, be sure to completely test your serializers. Unfortunately, I wrote no unit tests in Javascript. Instead, I found modifying this file:

~/Repo/ipython/IPython/frontend/html/notebook/static/js/jsonhandlers.js

to this effect helped a bit:

JSONHandlers.prototype.call_handler = function(key, json, element) {
    var handler = this.handlers[key]
    if (handler !== undefined) {
        /*
        try {
            handler(json, element);
        } catch(err) {
        };
        */
        handler(json, element);
    };
};

Conclusion

That’s about it! You know about as much as I do about making IPython talk with sqlite via html5. I’m not entirely sure what that means, but I had fun. If you end up doing anything interesting with this experiment please let me know! Again, sorry about the name…
