python

A 29-post collection

Dealing with Big Vars in Python

I have encountered a memory leak in Python, and it's not fun at all. Its obscurity consumed a lot of my energy and time.

I have come to a policy of mine: when dealing with big vars whose memory footprint we really want to minimize, deallocate the memory as soon as the value is no longer used in that context.

import gc

A = <some big value>
... use A ...
A = None        # drop the reference
gc.collect()    # and collect right away

Mainly, do the above! But does it really make sense? It sure does. Consider this case:

def get_a_big_val():  
    return something big

A = get_a_big_val()  
... use A ...
A = get_a_big_val()  
... use A ...

How much memory do we need in this case? Actually, the peak is more like 2 * A: while the second get_a_big_val() is building its result, the old value is still bound to A, so both exist at once. Since even a single A is big, this is something we should avoid.

import gc  
A = get_a_big_val()  
... use A ...
A = None  
gc.collect()  
A = get_a_big_val()  
... use A ...
A = None  
gc.collect()  

A better memory footprint now.
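
To see the difference in numbers, here is a minimal sketch using the standard tracemalloc module. The get_a_big_val below is just a stand-in of mine for whatever really produces the big value, not code from this post.

import gc
import tracemalloc

def get_a_big_val():
    return list(range(10**6))   # stand-in for a genuinely big value

def without_release():
    A = get_a_big_val()
    A = get_a_big_val()         # old A is still alive while the new one is built
    return A

def with_release():
    A = get_a_big_val()
    A = None                    # drop the old value first
    gc.collect()
    A = get_a_big_val()
    return A

for fn in (without_release, with_release):
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(fn.__name__, 'peak bytes:', peak)

The first version should report roughly twice the peak of the second.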

In the memory-leak case I encountered, doing this actually solved the problem; even if it doesn't solve yours, it still cuts the chance of a memory leak by a good margin.


A Python File Should Contain "main()" Function

I think I found a recipe for a new Python file:

def ...(): ...  
def ...(): ...

def main():  
    ...
    ...

if __name__ == '__main__':  
    main()

Instead of putting everything plainly under the if statement without a main().

Why? I have a list:
  1. You should not be able to create and mutate global vars at will during runtime! Any var declared under the if statement will be global.

  2. It's just harder for the IDE to help us by auto-suggesting code. The IDE cannot know whether a var or a function declared under the if statement will be available at runtime, so the auto-suggestion list contains a bunch of choices that don't work.

  3. And it's also easier for unit testing: instead of indirectly running a .py file, we can just call main() as a function.
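
To make the recipe concrete, here is a small hypothetical example of mine (the file and function names are made up), together with a test that calls the code directly:

# example_script.py -- a hypothetical file following the recipe

def load_numbers(path):
    # one integer per line
    with open(path) as f:
        return [int(line) for line in f]

def summarize(numbers):
    return {'count': len(numbers), 'total': sum(numbers)}

def main():
    numbers = load_numbers('numbers.txt')
    print(summarize(numbers))

if __name__ == '__main__':
    main()

# test_example_script.py -- the unit test imports and calls the function directly
from example_script import summarize

def test_summarize():
    assert summarize([1, 2, 3]) == {'count': 3, 'total': 6}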


Python's Multiprocess Pool Memory Leak

Update: I investigated the problem further and found a better solution. I will write a post on that.

Actually, I don't know how this happens; I just can't narrow it down to a pinpoint.

The situation can be described as follows:

from multiprocessing import Pool

# no globals

def memory_consuming_fn(item):
    ...

# ... a few other helper functions ...

def trigger():
    ...
    pool = Pool()
    for each in pool.imap_unordered(memory_consuming_fn, some_list):
        ...
    pool.close()
    pool.join()
    ...

# run the trigger() function many times

Note: In my case, memory_consuming_fn involves opening many large files, which could be the cause (but I personally don't think so, because the running context is separate from the main process and is essentially freed up every time without fail).

Note 2: For those who aren't familiar with multiprocessing, it is essentially important to make sure the global variables are not huge, because they will be forked and duplicated into every worker, especially under Python 2. Moreover, it's much better if you keep the globals immutable.
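
As an illustration (my sketch, not the original code), prefer passing per-task data through the pool instead of leaving it in a big global:

from multiprocessing import Pool

def work(chunk):
    return sum(chunk)

def run():
    # built locally and sent to the workers per task, instead of sitting in a global
    chunks = [list(range(1000)) for _ in range(8)]
    pool = Pool()
    try:
        return sum(pool.map(work, chunks))
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    print(run())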

The point is to run trigger() many times, maybe in this fashion:

for i in range(10):  
    trigger()

After each run, not all memory is freed, more or less, so I assume a memory leak is happening.

I strayed into making small changes to the code, putting gc.collect() in multiple places, still with no luck.

Here are some of my unfruitful experiments:

  1. Making sure that a new process is spawned for each job: pool = Pool(maxtasksperchild=1).
  2. Putting gc.collect() at the beginning of memory_consuming_fn.
  3. Trying to deallocate large vars with <var> = None or del <var> and then gc.collect().

And, here are some working experiments:

A. Declaring an initializer:

def pool_init():  
    import gc
    gc.collect()

pool = Pool(initializer=pool_init)  

B. Put gc.collect() before declaring pool:

import gc

gc.collect()
pool = Pool()
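
Putting the two working pieces together, a minimal sketch might look like this (memory_consuming_fn and the input list are placeholders of mine, not the original code):

import gc
from multiprocessing import Pool

def pool_init():
    # working experiment A: collect inside each freshly started worker
    gc.collect()

def memory_consuming_fn(path):
    # placeholder for the real heavy work
    with open(path) as f:
        return len(f.read())

def trigger(some_list):
    gc.collect()                          # working experiment B: collect before declaring the pool
    pool = Pool(initializer=pool_init)    # working experiment A
    results = []
    for each in pool.imap_unordered(memory_consuming_fn, some_list):
        results.append(each)
    pool.close()
    pool.join()
    return results

if __name__ == '__main__':
    some_list = ['a.txt', 'b.txt']        # placeholder inputs
    for i in range(10):
        trigger(some_list)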

Fast Pythonic Way to Calculate Average Distance Between Points

Finding the average distance between two sets of points (vectors) in an n-dimensional Euclidean space can be slow, with O(a * b * n) time complexity, where a and b denote the number of points in each set.

A normal way to do this is something like:

import numpy as np  
def dist(a, b):  
    return np.linalg.norm(a - b)

def dist_cluster_avg(points_A, points_B):  
    from itertools import product
    s = sum(dist(a, b) for a, b in product(points_A, points_B))
    return s / len(points_A) / len(points_B)

Even though dist(a, b) is arguably a very fast distance function, looping over each pair of points can be inexorably slow, to some extent because of Python itself.

There are many ways to improve this by means of optimization. It is important to note that the time complexity stays the same; the implementation is just much better.

I found this, using cdist from scipy and sum from numpy:

def dist_cluster_avg_fast(points_A, points_B):  
    from scipy.spatial.distance import cdist
    arr = cdist(points_A, points_B, metric='euclidean')
    return np.sum(arr) / (len(points_A) * len(points_B))

How much is the gain, you might ask? Here it is.

points_A = np.random.rand(500, 100)  
points_B = np.random.rand(500, 100)

dist_cluster_avg(points_A, points_B) => 2.569478988647461 seconds  
dist_cluster_avg_fast(points_A, points_B) => 0.15470504760742188 seconds  

Much faster huh?
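
For reference, here is how such a comparison can be timed. This is my sketch (assuming the two functions and the random arrays above), not the exact harness behind the numbers, and results will vary by machine:

import time

def timed(fn, *args):
    # run once and report wall-clock seconds
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

print('loop version :', timed(dist_cluster_avg, points_A, points_B), 'seconds')
print('cdist version:', timed(dist_cluster_avg_fast, points_A, points_B), 'seconds')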


Bye ... PyCharm, Hello LiClipse

PyCharm has been my favorite Python IDE.

I made the switch for one reason. I've been using Dvorak for quite some time, and PyCharm just doesn't support it properly: PyCharm's key mapping doesn't work with alternative keyboard layouts (even though they claimed it wasn't their fault).

I had been using PyCharm with a workaround (Karabiner) since the beginning. I wasn't quite satisfied, but at least I could get my job done. Not anymore: since I updated OS X to macOS Sierra, Karabiner has stopped working, and my only workaround is gone. Any attempt I made at other alternatives just didn't seem fruitful. So, I decided to abandon PyCharm.

Now I'm on LiClipse. Actually, I wanted to go with Eclipse + PyDev, but it recommends this IDE, so I will just go with that.
