Showing posts with label python. Show all posts

Wednesday, September 17, 2014


Building and packaging a python application for distribution

I find it messy to build a Python application package that is easy to distribute, though Python tooling has come a long way.
"pip/wheel" helps in installing, managing, and distributing individual packages. "virtualenv" provides an approachable way to create an isolated environment and avoid polluting the main distribution. "zc.buildout" lets you reproduce, assemble, and deploy a Python application through configuration. Together, they provide a powerful framework for building and distributing Python applications. However, it is not as simple as the build-once, distribute-everywhere model of executable or jar distribution. In all likelihood, you will be creating an isolated virtual environment and installing the dependencies and the application into it.

Wednesday, August 14, 2013


Centralized logging for distributed applications with pyzmq

Simpler distributed applications can take advantage of centralized logging. PyZMQ, the Python binding for ØMQ, provides log handlers for the Python logging module and can easily be used for this purpose. The log handler uses the ØMQ PUB/SUB pattern and broadcasts log messages through a PUB socket. It is then quite easy to construct a message collector that writes the messages to a central location.
         +-------------+
         |Machine1:App1+-------------------------
         +-------------+                        |
                                         +---------------+
         +-------------..................|Machine3:Logger|
         |Machine1:App2|                 +---------------+
         +-------------+                          |
                                                  |
                   +-------------+                |
                   |Machine2:App1|-----------------
                   +-------------+
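A minimal sketch of both sides, assuming pyzmq is installed (the host, port, and log file name are all illustrative):

```python
import logging

import zmq
from zmq.log.handlers import PUBHandler

# Publisher side: each application attaches a PUBHandler so that
# log records are broadcast over a PUB socket.
ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.connect("tcp://logger-host:5558")

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(PUBHandler(pub))
logger.info("application started")

# Collector side, on the central machine: a SUB socket that
# subscribes to everything and appends messages to one file.
# sub = ctx.socket(zmq.SUB)
# sub.bind("tcp://*:5558")
# sub.setsockopt_string(zmq.SUBSCRIBE, "")
# while True:
#     topic, message = sub.recv_multipart()
#     with open("central.log", "ab") as f:
#         f.write(message + b"\n")
```

The collector loop is shown commented out because it blocks forever; in practice it runs as its own process on the logging machine.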

Wednesday, May 16, 2012


Hadoop Map-Reduce with mrjob

With Hadoop, you have the most flexibility in accessing files and running map-reduce jobs with Java. All other languages need to use Hadoop streaming, which can feel like being a second-class citizen in Hadoop programming.

For those who like to write map-reduce programs in Python, there are good toolkits available out there, such as mrjob and dumbo.
Internally, they still use Hadoop streaming to submit map-reduce jobs, but they simplify the process of job submission. My own experience with mrjob has been good so far. Installing and using mrjob is easy.

Installing mrjob

First, ensure that you have installed a newer version of Python than the default that ships with Linux (2.4.x, which yum depends on). Make sure you don't replace the existing Python distribution, as that breaks yum.

Install mrjob on one of the machines in your Hadoop cluster. It is nicer to use virtualenv to create an isolated environment.
wget -O virtualenv.py http://bit.ly/virtualenv
/usr/bin/python26 virtualenv.py hadoopenv
hadoopenv/bin/easy_install pip
hadoopenv/bin/pip install mrjob
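Once installed, here is a taste of the API: the canonical word-count job (a sketch, assuming mrjob is installed; it runs locally with "python wordcount.py input.txt" and on the cluster by adding "-r hadoop"):

```python
from mrjob.job import MRJob


class MRWordCount(MRJob):
    """Classic word count: the mapper emits (word, 1), the reducer sums."""

    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == "__main__":
    MRWordCount.run()
```

mrjob generates the Hadoop streaming invocation for you; the same class runs unchanged in local, Hadoop, and EMR modes.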

Tuesday, March 13, 2012


Introducing ØMQ and pyzmq through examples

ØMQ is a messaging library that has the capability to revolutionize distributed software development.

Unlike full-fledged messaging systems, it provides just the right set of abstractions to incorporate various messaging patterns. It also provides the concept of devices, which allows the creation of complex network topologies.

To get a quick overview, you can read the introduction to ØMQ by Nicholas Piël.

ØMQ sockets are a light abstraction on top of native sockets.
This allows ØMQ to remove certain constraints and add new ones that make writing messaging infrastructure a breeze.
  • ØMQ sockets adhere to predefined messaging patterns; the pattern has to be chosen at socket creation time.
  • An ØMQ socket can connect to many other ØMQ sockets, unlike a native socket.
  • There are constraints on which types of ØMQ sockets can connect to each other.

ØMQ has bindings for many languages, including Python (pyzmq), and that makes it very interesting.

It has been fun learning the basics, and I hope soon to create some real-world examples to sharpen my knowledge of ØMQ. Till then, I hope this mini tutorial on ØMQ and pyzmq serves as a good introduction to its capabilities.

Check out: http://readthedocs.org/docs/learning-0mq-with-pyzmq/en/latest/index.html

It is quite easy to get started. Use virtualenv and pip.
pip install pyzmq-static
pip install tornado

Check out the code at https://github.com/ashishrv/pyzmqnotes
Follow some of the annotated examples to see the awesomeness of ØMQ.
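As a flavor of those socket-type constraints, a REQ socket pairs with a REP socket and the two must alternate send and receive (a sketch, assuming pyzmq is installed; the port is illustrative):

```python
import zmq

ctx = zmq.Context()

# REP (reply) socket: binds and waits for a request.
server = ctx.socket(zmq.REP)
server.bind("tcp://127.0.0.1:5555")

# REQ (request) socket: connects and must alternate send/recv.
client = ctx.socket(zmq.REQ)
client.connect("tcp://127.0.0.1:5555")

client.send(b"ping")
print(server.recv())   # the request arrives at the REP socket
server.send(b"pong")
print(client.recv())   # the reply arrives back at the REQ socket
```

Trying to, say, send twice in a row on the REQ socket raises an error; the pattern is enforced by the socket type.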

Do post your feedback on the mini tutorial here as comments.

Thursday, December 1, 2011


My experiences with Fabric based deployment automation

Many good tools are available for configuration management and application deployment.
Puppet and Chef have attained cult status among dev-ops teams. There are good tools available in Python too: Salt may soon become a viable alternative and definitely looks promising to me. The push/pull distinction is commonly used to classify the various tools in this eco-system.

Fabric is an excellent tool that lets you weave operations locally and remotely on a cluster of machines, allowing you to deploy applications, start/stop services, and perform other operations across the cluster. There are a few good tutorials available to help you get familiar with Fabric. If you haven't read them already, you should.
  1. An example on deploying django using Fabric
  2. A presentation on using Fabric
  3. A video on Fabric usage
I have used Fabric to automate the deployment of a Hadoop/Hive application and of Nagios on clusters of machines on EC2, on a private cloud based on CloudStack, and on commodity machines.

The code grew from nifty little commands and functions, such as setting up the fully qualified domain name (FQDN), creating users and groups on Linux, and installing yum packages, to a complete system of commands that installs and brings up a Kerberos-enabled secure Hadoop cluster using Cloudera Hadoop packages.

The code soon became unwieldy.

There are a few practices that help contain the complexity that grows when Fabric is used enthusiastically.

Installing funkload on Mac

Funkload is a useful tool for understanding the characteristics of an application server under stress and load conditions.

Installing Funkload is very straightforward using virtualenv and MacPorts on a Mac.
If you aren't using it already, you should think about checking it out.

Create an isolated environment for installing Funkload.

virtualenv --no-site-packages loadtest
source loadtest/bin/activate
pip install yolk

Tuesday, November 2, 2010


Learning Twisted (part 8) - Anatomy of deferreds in Twisted

There are numerous posts and documents giving conceptual explanations of one of the central concepts of the Twisted framework: the deferred.

The book on Twisted network programming offers an analogy of deferreds as buzzers handed to a visitor by a restaurant owner. The buzzer notifies the visitor that the table is ready, so he can set aside whatever he has been doing and come over to occupy the table meant for him.

Others describe a deferred as a placeholder, or a promise that is yet to be fulfilled. We can attach actions that should follow when the promise is fulfilled or broken. These actions form callback chains that are triggered when the deferred fires.

Deferreds allow you to set up follow-up actions for something that will take time to complete. This in turn frees Twisted to attend to other tasks and come back to execute the follow-up actions when the condition is met.

I will keep to code commentary and the current behavior of deferreds.
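Before the commentary, the buzzer analogy in a few lines of code: callbacks attached to a deferred fire in order once callback() delivers the promised result (a sketch, assuming Twisted is installed; the restaurant names are illustrative):

```python
from twisted.internet.defer import Deferred

results = []


def on_table_ready(table):
    # First follow-up action: the "buzzer" went off.
    results.append("seated at %s" % table)
    return table


def on_seated(table):
    # Callbacks chain: each receives the previous one's return value.
    results.append("ordering at %s" % table)


d = Deferred()
d.addCallback(on_table_ready)
d.addCallback(on_seated)

# Later, the promise is fulfilled; both callbacks run in order.
d.callback("table 5")
print(results)  # ['seated at table 5', 'ordering at table 5']
```

Note that no reactor is needed here: firing a deferred runs its callback chain synchronously.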

Friday, October 29, 2010


Tracing call flows in Python

Python decorators come in handy when you want to intercept a piece of call flow and profiling seems just too verbose.
I use this quite often to analyze a Python program and understand it better.

Consider the following contrived piece of Python code, which illustrates this approach to tracing Python call flows.

def f():
    f1('some value')


def f1(result): 
    print result
    f2("f1 result")
    

def f2(result): 
    print result
    f3("f2 result")
    fe("f2 result")
    return "f2 result"

def f3(result): 
    print result
    return "f3 result"

def fe(result): 
    print result

f()

Output:
some value
f1 result
f2 result
f2 result
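The decorator-based interception mentioned above can be sketched as follows; the trace decorator here is my own illustrative helper, not from any library:

```python
import functools


def trace(func):
    """Print entry and exit of each decorated call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print("-> %s%r" % (func.__name__, args))
        result = func(*args, **kwargs)
        print("<- %s returned %r" % (func.__name__, result))
        return result
    return wrapper


@trace
def f3(result):
    return "f3 result"


f3("f2 result")
```

Decorating each function of interest this way prints the call flow as the program runs, without touching the function bodies.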

Wednesday, October 20, 2010


Buildbot - Issue with svn poller

The SVN poller may miss check-ins, depending on the poll interval.

The current behavior of the poller is as follows.

The poller polls the version control system and stores the last change (version number). Subsequent changes are reported as log entries, and these entries are marked with the timestamp at which the changes were noticed. The log entries are used to create change objects that are then passed to the scheduler to trigger builds. The scheduler sees these change objects with the same timestamp and picks only the latest one to trigger a build.

The issue with this model is that if there are multiple changes within a single polling interval, the poller will trigger a build only for the last one.

Wednesday, October 13, 2010


Setting up buildbot - customizing configuration file

The crux of BuildBot involves a master and multiple build slaves that can be distributed across many computers.

Each Builder is configured with a list of BuildSlaves that it will use for its builds. Within a single BuildSlave, each Builder creates its own SlaveBuilder instance.
Once a SlaveBuilder is available, the Builder pulls one or more BuildRequests off its incoming queue. These requests are merged into a single Build instance, which includes the SourceStamp describing the exact version of the source code to be used for the build. The Build is then randomly assigned to a free SlaveBuilder and the build begins.

All this is configured via a single file called master.cfg, which defines a dictionary of keys used to configure the buildbot when it starts up.
Open up the sample master.cfg that comes with the buildbot distribution, drop it into the master directory you have created, and start hacking on it.

I have listed a few important configuration settings that should get you started.
Below is the dictionary instance that is populated in the configuration file:
c = BuildmasterConfig = {}
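As a sketch of the keys I found most important (the names and values here are illustrative, for a 0.8-era BuildBot, and not a working configuration):

```python
# master.cfg sketch -- illustrative values only
from buildbot.buildslave import BuildSlave
from buildbot.process.factory import BuildFactory
from buildbot.steps.source import SVN
from buildbot.steps.shell import ShellCommand

c = BuildmasterConfig = {}

# The slaves this master knows about, and the port they connect on.
c['slaves'] = [BuildSlave("slave1", "password")]
c['slavePortnum'] = 9989

# A build factory: check out the source, then run the tests.
f = BuildFactory()
f.addStep(SVN(svnurl="svn://example.com/trunk"))
f.addStep(ShellCommand(command=["make", "test"]))

# A builder ties the factory to a slave and a build directory.
c['builders'] = [{
    'name': "trunk-builder",
    'slavename': "slave1",
    'builddir': "trunk",
    'factory': f,
}]
```

Schedulers and status targets go into the same dictionary under their own keys; the sample file shows them all.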

Wednesday, October 6, 2010


Learning Twisted (part 7) : Understanding protocol class implementation

In my last post, I focused on the protocol factory class, the various methods it needs to provide, and the code flow within which these methods get invoked.
Here we will look at the structure of the protocol class, the various methods it needs to provide, and the context in which they are called.

There are two ways to look this up and learn it:

  • Look at the interface definition: IProtocol(Interface) in interfaces.py
  • Like I did in my previous post, supply a protocol class with no methods and look at the traceback to understand the code flow

So, the usual imports for writing a custom protocol:

from twisted.web import proxy
from twisted.internet import reactor
from twisted.internet import protocol
from twisted.python import log
import sys
log.startLogging(sys.stdout)

It is much better to derive from protocol.Protocol to build a custom protocol; it does a few things for you.
Any intricate logic should be built using the connect, disconnect, and data-received event handlers, together with the methods for writing data to the connection.
The makeConnection method sets the transport attribute and also calls the connectionMade method. You can use this to start communicating once the connection has been established.
The dataReceived method is called when there is data to be read off the connection, and connectionLost is called when the transport connection is lost for some reason. To write data to the connection, you use self.transport.write. This adds the data to a buffer which will be sent across the connection. To make Twisted send the buffered data immediately, you can call self.transport.doWrite.
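Putting those hooks together, a minimal echo protocol might look like this (a sketch, assuming Twisted is installed; the port is illustrative, and the reactor calls are left commented so the snippet doesn't block):

```python
from twisted.internet import protocol, reactor


class Echo(protocol.Protocol):
    """Write back whatever the peer sends."""

    def connectionMade(self):
        # transport was set by makeConnection just before this hook ran.
        print("connected to", self.transport.getPeer())

    def dataReceived(self, data):
        # Buffered write; Twisted flushes it for us.
        self.transport.write(data)

    def connectionLost(self, reason):
        print("connection lost:", reason.getErrorMessage())


factory = protocol.Factory()
factory.protocol = Echo

# reactor.listenTCP(8000, factory)
# reactor.run()
```

The factory instantiates one Echo per incoming connection; each instance gets its own transport.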

Monday, September 27, 2010


Using buildbot for continuous integration development

Continuous integration, in its simplicity, embodies certain agile tenets: frequent integration of code and automated verification of the integrated code, providing continuous feedback to the team during development and reducing the heartburn of large integrations. It also prevents broken builds from silently creeping into the code repository. At the heart of this process is a tool that can be integrated into the code check-in workflow to trigger automated testing of frequently checked-in development artifacts.

This gives the developer immediate feedback and assurance that things are moving in a positive direction.
Buildbot is one such continuous integration tool.
BuildBot can automate the compile/test cycle required by most software projects to validate code changes.

I had a chance to set it up some time back. What follows is a snippet of that experience on getting it up and running quickly.

Wednesday, September 22, 2010


Python and binary data - Part 3

Normal file operations that we use are line oriented:
FILE = open(filename, "w")
FILE.writelines(linelist)
FILE.close()
FILE = open(filename, "r")
for line in FILE.readlines(): print line
FILE.close()

We can also use byte-oriented I/O operations on these files.
FILE = open(filename, "r")
FILE.read(numBytes)  # reads up to numBytes bytes from the file
But if the file does not contain textual data, the contents may not be meaningful.

It is much better to open the file in binary mode:
FILE = open(filename,"rb")
FILE.read(numBytes)
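The struct module is the usual companion for binary mode: it packs Python values into bytes with an explicit layout and unpacks them back. For example:

```python
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")

# Pack a 32-bit int and a 16-bit int, little-endian, and write them out.
with open(path, "wb") as f:
    f.write(struct.pack("<ih", 1024, -5))

# Read the 6 bytes back and recover the original values.
with open(path, "rb") as f:
    values = struct.unpack("<ih", f.read(6))

print(values)  # (1024, -5)
```

The format string makes byte order and field widths explicit, which is exactly the interpretation step that raw binary data needs.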

Python and binary data - Part 2

In the previous post, I discussed the representation of numbers (floating point, signed, and whole) on a computer. In C, the bits used for representation are limited. This means that there is an inherent limit on what can be represented. It also means there is a danger of overflow when the result of an operation exceeds the bit limits.

How about Python? How does it represent these numbers?

Even though Python's underlying implementation is in C or C++ types, Python integers are not like typical C integers.
Python integers have arbitrary precision.
Python creates a higher level of abstraction for number representation.
It represents an integer by allocating the memory required to hold its value. Initially the size is set to 32 or 64 bits and increased when required. Integers can therefore store very large, even astronomical, values. Arbitrary-precision operations on such integers can, however, be slow.
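A quick demonstration of this abstraction: the value that would overflow a fixed-width C integer is handled transparently in Python.

```python
# A 32-bit signed C int tops out at 2**31 - 1; Python keeps going.
c_int_max = 2 ** 31 - 1
big = c_int_max + 1          # no overflow, just a bigger integer
huge = 2 ** 100             # far beyond any native word size

print(big)                   # 2147483648
print(huge > c_int_max)      # True
print(len(str(huge)))        # 31
```

The same addition in C would silently wrap around (or be undefined behavior); here it simply produces a larger integer object.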

Tuesday, September 21, 2010


Algorithms in Python - Smallest free number

I have recently started reading the book "Pearls of Functional Algorithm Design" by Richard Bird. The book details various problems and functional algorithmic solutions.
I thought it would be a good exercise to solve them in Python, and to attempt them in a functional programming language in the future.
Task: find the smallest natural number not in a finite set of numbers X.
This first problem models a setting in which objects are named by numbers and X represents the set of objects currently in use. The task is to find the object with the smallest name not in use, i.e. not in X.

Of course, there are multiple ways to solve this problem, and I am not even trying to find the most elegant solution.
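Here is the straightforward (not necessarily elegant) Python take: collect the used names in a set and walk up from zero.

```python
def smallest_free(xs):
    """Return the smallest natural number not present in xs."""
    used = set(xs)           # O(1) membership tests
    candidate = 0
    while candidate in used:
        candidate += 1
    return candidate


print(smallest_free([0, 1, 2, 4, 5]))   # 3
print(smallest_free([1, 2, 3]))         # 0
print(smallest_free([]))                # 0
```

Since a set of n numbers cannot cover all of 0..n, the loop terminates after at most n + 1 steps, giving linear time overall.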

Python and binary data - Part 1

All data is represented by ones and zeroes. However, a stream of binary data (ones and zeroes) can represent anything. Practically anything can be represented by stringing together ones and zeroes, along with a means of interpretation. The most common interpretation is textual, or ASCII, data.
If the representation format is not known, we simply refer to it as binary data.
The interpretation process is called decoding, and the reverse transformation to binary data is called encoding.
If the binary data is not an ASCII representation, you can't manipulate it in a text editor.
Python has a specific module called 'binascii' for transforming binary data to an ASCII representation and back.
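For example, hexlify and unhexlify round-trip binary data through an ASCII hex representation:

```python
import binascii

data = b"\x00\xff\x10"

# Encode: binary -> ASCII hex representation you can read and edit.
hexed = binascii.hexlify(data)
print(hexed)  # b'00ff10'

# Decode: ASCII hex -> the original binary data.
assert binascii.unhexlify(hexed) == data
```

Each byte becomes two ASCII hex characters, so the encoded form is exactly twice as long as the raw data.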

Friday, September 17, 2010


Learning Twisted (part 6) : Understanding protocol factory

Since most of us will be reusing the transport implementations already provided by Twisted, our focus will be on creating protocols and a protocol factory that ties a transport to a protocol instance.

from twisted.web import proxy
from twisted.internet import reactor
from twisted.python import log
import sys
log.startLogging(sys.stdout)

class myProtocolFactory():
    protocol = proxy.Proxy

reactor.connectTCP('localhost', 80, myProtocolFactory())
reactor.run()

This barebones example throws a whole lot of tracebacks, which makes it a little easier to understand the code flow. You can keep supplying the missing functions and rerunning it to discover all the required methods of a protocol factory and how the code flows and is structured.

Friday, August 27, 2010


Learning Twisted (part 5) : Low-Level Networking

So far we have explored the basic pieces of the Twisted core: reactors, event handlers, and scheduled & looping calls. Twisted also provides support for cooperative multitasking, or loop parallelism.
In this post, I will explore the framework that implements low-level protocol and networking support in Twisted. This one, for me, was really twisted. No wonder it is named as such!

In retrospect, the generality of Twisted as a networking framework comes from the underlying complexity of its implementation. However, tackling this complexity does pay off in the awesome power this tool provides for building networking applications.

I am a picture guy; I can't hold all this dense information without some sort of picture in place. So I will put the picture first and then delve into the details.

Sorry folks, no code in this post.

+----------------+       connectSSL       +-----------+
|PosixReactorBase|.......listenTCP .......|IReactorTCP|
+----------------+       connectTCP       +-----------+
                             |
                             |
    +----------+             |
    |IConnector|________connector obj <.......You pass a factory
    +----------+        obj -> connect        This is a protocol
     |   |                                    factory.
     |   |                                 ,-'
     |  stopConnecting                  ,-'
     |  getDestination                ,'
     |   ..............> factory.doStart()
  connect..............> _makeTransport()
                               |
        +-------------+    connection obj
        |ISystemHandle|________|   |__calls factory.buildProtocol(addr)
        |             |                               |
        |ITCPTransport|.......>+----------+           |
        +-------------+        |ITransport|    +-------------+
                               +----------+    |IProtocol obj|
                                    |          +-------------+
                       .............:................
                       |                            |
                     Client..___                __Server
                                `'`---..----''''
                               These are initialized differently

Tuesday, August 24, 2010


Learning Twisted (part 4) : Cooperative multitasking

The event loop dispatches events to be handled by event handlers. If an event handler misbehaves and does not return immediately, the reactor loop cannot service other events. This means that all our event handlers need to return quickly for this model to work; in particular, functions with substantial waiting periods, typically network or file I/O, need to be non-blocking.

Twisted provides facility to interleave execution of various functions, where the period of waiting can be used to do something useful for other functions that are waiting to be serviced.

+--------------------------++--------------------------+
|task1 | wait period | comp||task2 | wait period | comp|
+--------------------------++--------------------------+
+--------------------------+
|task1 | wait period | comp|
+--------------------------+
       +--------------------------+
       |task2 | wait period | comp|
       +--------------------------+


Learning Twisted (part 3) : Scheduling with reactor

In previous installments, we saw that Twisted provides methods to schedule a call for some time later, after the reactor loop starts running. It also provides a method for registering a call that will be invoked repeatedly with a given delay.

The facility is provided in twisted.internet.task.

from twisted.internet import reactor
from twisted.python import log
import sys, time
from twisted.internet.task import LoopingCall, deferLater

log.startLogging(sys.stdout)
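With those imports in place, a looping call is typically set up like this (a sketch; the two-second interval and the heartbeat function are illustrative):

```python
from twisted.internet import reactor
from twisted.internet.task import LoopingCall


def heartbeat():
    print("still alive at", reactor.seconds())


# Call heartbeat() every 2 seconds once the reactor is running;
# start() fires the first call immediately by default.
loop = LoopingCall(heartbeat)
loop.start(2.0)

# Stop the loop, and then the reactor, after 10 seconds.
reactor.callLater(10, loop.stop)
reactor.callLater(10, reactor.stop)
reactor.run()
```

reactor.callLater is the one-shot scheduling primitive; LoopingCall builds the repeated case on top of it.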