By Francesc Alted, Continuum Analytics and Luis Osa, Telefónica I+D
Francesc Alted and Yves Hilpisch from Continuum Analytics, Inc. recently visited the Telefónica site in Madrid to talk about the merits of Python in Big Data analytics. You can find the slides of the talk here.
5 March 2014: The advent of distributed computing platforms that are comparatively cheap and reliable has ushered in the rise of so-called “Big Data” applications. These platforms are based on Google’s MapReduce paradigm, first published in 2004. The Hadoop project implemented the same ideas in 2005 and has since become the core component of the Big Data ecosystem. Because Hadoop is written in Java, Java and other JVM-based languages have become the first choice for writing Big Data applications.
However, the key insight of Google’s paper was a way of conceiving distributed computations that frees the application programmer from the fuss of managing locking and coordination. This paradigm is not bound to any particular programming language; indeed, Google’s own MapReduce implementation, as described in the original paper, is written in C++.
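To make the paradigm concrete, here is a minimal single-process sketch of a MapReduce word count (the function names are ours, for illustration; a real framework runs the map and reduce phases across many machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine the values of each key -- here, by summing counts.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data", "big python"]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 1, 'python': 1}
```

Because neither phase shares mutable state, mappers and reducers can be spread over any number of machines — exactly the property that spares the programmer from locking and coordination.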
Python is a language with a long history of use in the scientific and engineering communities, and it powered the original version of Google’s web crawler. There is a full stack of libraries for analysing data (pandas), performing numerical computations (NumPy), applying well-known scientific algorithms (SciPy), and visualising the results of all of the above (matplotlib).
Wouldn’t it be great if analysts could rely on these proven tools, while scaling them out to even larger datasets?
Continuum Analytics was founded in 2011 to accomplish precisely that goal by generalizing existing tools, specifically NumPy, to work in a distributed setup. This improved NumPy, called Blaze, would automatically enable all the other Python libraries that rely on NumPy for efficient array computations to perform the same operations on datasets that are not local to them. This includes the aforementioned pandas and SciPy, but also friendly interfaces such as the IPython notebook or QGIS.
The main argument against using Python is usually its low performance in computation-heavy scenarios. While this is true of pure-Python code, Python is an excellent ‘glue’ language, and the community has typically tackled this limitation by linking with extension libraries written in C or Fortran.
In recent years, several projects have appeared that compile Python code to native code, making it much faster. Cython, for example, translates Python code into its C equivalent and compiles the result automatically, achieving speeds very close to those of hand-written C. Cython also supports parallelism via OpenMP, which makes it possible to write high-performance multi-threaded applications in Python syntax.
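As a sketch of our own (not from the talk), Cython’s “pure Python” mode lets you attach C type declarations as annotations: the cython compiler uses them to generate fast C, while the very same file still runs under plain CPython. A small stand-in lets this snippet run even where cython is not installed:

```python
try:
    import cython
except ImportError:
    # Minimal stand-in so the sketch runs without cython installed;
    # under real Cython these annotations become C types.
    class cython:
        int = int
        double = float

def sum_of_squares(n: cython.int) -> cython.double:
    total: cython.double = 0.0
    i: cython.int
    for i in range(n):
        total += i * i
    return total

print(sum_of_squares(10))  # 285.0
```

When compiled with cython, the typed loop runs at C speed; the OpenMP support mentioned above is exposed through `cython.parallel.prange`, which additionally requires building with OpenMP flags enabled.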
Another approach to higher performance is JIT compilation, which does essentially what the JVM does: compile those parts of the bytecode that are detected as hotspots during execution. PyPy is an example of this approach, providing an alternative interpreter that takes your Python code as-is and executes it with a 6.3x speedup on average. However, PyPy does not yet fully support NumPy arrays, which is a showstopper for any numerical application. Instead, numexpr can be used as a JIT compiler for extremely fast evaluation of potentially complex numerical expressions.
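For instance (an illustrative sketch; NumPy and numexpr are assumed to be installed, with a plain-Python fallback otherwise), numexpr compiles a whole expression string into a single vectorised kernel instead of creating one temporary array per operator:

```python
try:
    import numpy as np
    import numexpr as ne

    a = np.arange(5.0)
    b = np.arange(5.0)
    # numexpr compiles the expression once and evaluates it in a single
    # pass over the operands, using multiple threads for large arrays.
    result = ne.evaluate("2*a + 3*b").tolist()
except ImportError:
    # Plain-Python fallback so the sketch runs without the libraries.
    a = [float(i) for i in range(5)]
    b = [float(i) for i in range(5)]
    result = [2 * x + 3 * y for x, y in zip(a, b)]

print(result)  # [0.0, 5.0, 10.0, 15.0, 20.0]
```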
Continuum Analytics has also developed its own alternative in this area: Numba uses LLVM to compile hotspots on the fly to the machine’s assembly language. It can reach speedups of up to 400x on tight numerical code. The main advantage of Numba is that it can be applied through a simple decorator in your code, without the separate compilation cycle that Cython requires, which speeds up application development.
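In practice the decorator looks like this (a minimal sketch of our own; the fallback makes the snippet run even where numba is not installed):

```python
try:
    from numba import jit
except ImportError:
    # No-op fallback: the function simply runs as interpreted Python.
    def jit(func=None, **kwargs):
        if func is None:
            return lambda f: f
        return func

@jit  # Numba compiles this loop to machine code via LLVM on first call
def count_collatz(n):
    # Count the steps the Collatz sequence takes to reach 1 -- a tight
    # integer loop of the kind Numba accelerates well.
    steps = 0
    while n > 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

print(count_collatz(27))  # 111
```

No build step, no separate source file: removing the decorator gives back ordinary interpreted Python, which keeps the edit-run cycle short.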
Last, but not least, Continuum Analytics also leads the Anaconda distribution. Anaconda is meant to make deploying the rich Python ecosystem (it includes more than 125 packages, and the list is growing quickly) easy and portable across Windows, Linux and Mac OS X. In addition, with Anaconda Server, corporations can manage third-party dependencies on their own terms, streamlining the release cycle of their code. This is undoubtedly one of the strongest projects pushing much wider Python adoption among individuals as well as in academic and corporate environments.
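For example (an illustrative sketch; the environment and script names are hypothetical), the conda tool that ships with Anaconda can build an isolated, reproducible analytics environment:

```shell
# Create an isolated environment with the scientific Python stack.
conda create -n analytics python numpy scipy pandas matplotlib numba

# Activate it (on Windows: `activate analytics`).
source activate analytics

# Run your analysis inside the environment (hypothetical script name).
python my_analysis.py
```

The same commands work unchanged on Windows, Linux and Mac OS X, which is what makes the distribution attractive for deployment.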
Using Blaze to distribute computations, Numba to accelerate them on each node, and Anaconda to manage deployment, Continuum seeks to build an alternative to the Hadoop-based Big Data stack. At Telefónica, we believe this interesting development may mark the beginning of a new breed of Big Data applications, using Python’s flexibility and mature libraries to gain insights and deliver results to the market even faster.
Continuum Analytics, Inc. <http://www.continuum.io>
Francesc Alted (Spain) <email@example.com>
Dr. Yves Hilpisch (Germany) <firstname.lastname@example.org>