We needed a classifier to assign job data into buckets for trend analysis. We had 400,000 records to assign, and we knew the difference between the categories could be subtle (even for humans with domain expertise). A naive Bayes classifier seemed appropriate for the task. While there are some open source naive Bayes classifiers available in Python (I’m most comfortable programming in Python), e.g., NLTK and Orange, I decided to use the algorithm in Toby Segaran’s Programming Collective Intelligence (O’Reilly, 2007) so I could tweak the code and play with different feature arrangements. Toby does a great job of tying the code to the principles behind the naive Bayes algorithm, and I thought that would help with modding and tuning the classifier for our purposes.
Our data was stored on a multi-node Greenplum Massively Parallel Processing (MPP) database cluster running a Postgres engine. Having an MPP database has been critical for analyzing the large data set we work with (1.4 billion records) - we can generally run investigative queries against the full data set in minutes.
We wanted the classifier fast enough to iterate through the data many times so we could spend enough time training and tuning the algorithm to optimize classifier accuracy. Starting with a training set of 1,800 categorized jobs (phew...) and a random sample of 1,850 jobs, we set to work trying and reviewing different sets of feature combinations.
A combination of a large set of words per record and an algorithm written to explain how naive Bayes works, not for maximum efficiency, made for slow going. With training and classifying the sample data running for more than six hours, we needed to do something to speed up the process before tackling all 400,000 records we wanted to classify.
Note: We ran into a Python-related problem early on that I think is worth sharing. Due to the large number of words in a job description, the probabilities used by the naive Bayes algorithm get exceedingly small - so small that Python turned them into zero, making for suspiciously strange results. Luckily, I complained about this problem to a friend with a doctorate in math, who suggested taking the log of the probabilities, since the logs of very small numbers are not so small. Worked like a charm.
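The underflow and the log fix can be sketched in a few lines (the numbers here are illustrative, not from our data):

```python
import math

# Multiplying many small per-word probabilities underflows to zero:
probs = [1e-5] * 100          # e.g., one small probability per word
product = 1.0
for p in probs:
    product *= p
print(product)                # 0.0 -- float underflow, the "suspiciously strange" result

# Summing logs keeps the numbers in a comfortable range:
log_score = sum(math.log(p) for p in probs)
print(log_score)              # roughly -1151.3, safely representable
```

To compare categories you compare log scores directly: the largest log score corresponds to the largest (underflowed) product, so the classification decision is unchanged.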
I contacted Daisy Zhe Wang, an EECS doctoral student at UC Berkeley and a consultant at Bayes Informatics, because of her focus on scaling in-database natural language and machine learning algorithms for advice.
Daisy, together with Milenko Petrovic, founder of Bayes Informatics, developed a set-oriented approach to implementing the naive Bayes algorithm that treats the data derived from the training set (features (words) and counts) as a single entity. They converted the naive Bayes algorithm into Python User Defined Functions (UDFs) for Postgres which, since Greenplum is a distributed database platform based on Postgres, let us parallelize the classification process.
The result: the training set was processed and the sample data set classified in six seconds, and we were able to classify the entire 400,000 record data set in under six minutes - a more than four-orders-of-magnitude (26,000-fold) improvement in records processed per minute. A process that would have run for days in its initial implementation now ran in minutes! The performance boost let us try out different feature options and thresholds to optimize the classifier. On the latest run, a random sample showed the classifier working with 92% accuracy.
My simple understanding of their algorithm: the training set results are treated like a model, stored as a single row/column in the database, and parsed into a permanent Python data structure once, while each job description is parsed into a temporary data structure whose words the Python UDFs compare against the model. The result: one database read for each job description and one write once the probabilities are compared and the classification assignment is made - quite a contrast from reading and writing each word in the training set and the unassigned job.
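As I understand it, classification then reduces to in-memory dictionary lookups. A minimal sketch of that idea - the model layout, names, and smoothing here are my own illustration, not their actual code:

```python
import math

# Hypothetical model, as if parsed once from its single database row:
# per-category word counts.
model = {
    "engineering": {"python": 40, "sql": 25, "cluster": 10},
    "marketing":   {"brand": 30, "campaign": 22, "sql": 3},
}
totals = {cat: sum(counts.values()) for cat, counts in model.items()}
vocab = {w for counts in model.values() for w in counts}

def classify(words):
    """Pick the category with the highest summed log probability."""
    best_cat, best_score = None, float("-inf")
    for cat, counts in model.items():
        score = 0.0
        for w in words:
            # Laplace smoothing so unseen words don't zero out a category.
            score += math.log((counts.get(w, 0) + 1) / (totals[cat] + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

print(classify(["python", "sql", "cluster"]))   # engineering
```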
What follows is an explanation of the approach Milenko and Daisy took to speed up the algorithm - strategies that can help you scale up your machine learning algorithms to handle large, unstructured and other fringe data sets you may need to classify.
Note: The set-oriented approach could also have been implemented using Greenplum’s map/reduce, or, in a Hadoop cluster, using a map/reduce implementation of a naive Bayes classifier.
Example of a set-oriented vs. iterative-style approach
To illustrate the difference between the two styles, we describe three ways of computing and using a Naive Bayes model with a database. Naive Bayes is a simple computation that counts occurrences of various "features" in the input data and constructs a simple classification model based on those counts. The computed model is then used to classify new items. For the purposes of this example, an understanding of Naive Bayes is not necessary, and any other algorithm could be used in its place for all three patterns.
(1) The iterative approach uses the database as a "lookup table," similar to how a main-memory data structure would be used:
Computing the model (from training set data):
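The original pseudocode isn't reproduced here, so the sketch below illustrates the pattern using Python's built-in sqlite3 as a stand-in for the database (table and column names are invented). The point is the shape of the access pattern: one round trip per feature occurrence.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE model (word TEXT, category TEXT, count INTEGER, "
           "PRIMARY KEY (word, category))")

training_set = [("python sql python cluster", "engineering"),
                ("brand campaign sql", "marketing")]

# Iterative pattern: one lookup and one write per feature occurrence.
for text, category in training_set:
    for word in text.split():
        row = db.execute("SELECT count FROM model WHERE word=? AND category=?",
                         (word, category)).fetchone()          # per-word lookup
        if row is None:
            db.execute("INSERT INTO model VALUES (?, ?, 1)", (word, category))
        else:
            db.execute("UPDATE model SET count=? WHERE word=? AND category=?",
                       (row[0] + 1, word, category))           # per-word write
db.commit()
```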
Classifying new items:
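Classification in this style has the same flaw: each word of each new item triggers its own query. A sketch, again with sqlite3 standing in for the database and invented names (the un-normalized log-count scoring is a simplification):

```python
import math
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE model (word TEXT, category TEXT, count INTEGER)")
db.executemany("INSERT INTO model VALUES (?, ?, ?)",
               [("python", "engineering", 40), ("sql", "engineering", 25),
                ("sql", "marketing", 3), ("brand", "marketing", 30)])
categories = ["engineering", "marketing"]

def classify_iterative(text):
    scores = {}
    for cat in categories:
        score = 0.0
        for word in text.split():
            # One query per (word, category) pair -- the bottleneck.
            row = db.execute("SELECT count FROM model WHERE word=? AND category=?",
                             (word, cat)).fetchone()
            count = row[0] if row else 0
            score += math.log(count + 1)   # crude smoothing; normalization omitted
        scores[cat] = score
    return max(scores, key=scores.get)

print(classify_iterative("python sql"))   # engineering
```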
The main problem here is the overhead of each lookup, which may involve network communication, parsing of the database queries, and query setup and cleanup. This is a significant bottleneck for large numbers of features and data items.
(2) A significantly better approach is to process all data items as a stream.
Computing the model:
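A sketch of the streaming pattern, again using sqlite3 as a stand-in with invented names: a single query fetches every training row, counts accumulate as rows stream off the cursor, and the finished model is written back in one bulk operation.

```python
import sqlite3
from collections import Counter

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE training (description TEXT, category TEXT)")
db.executemany("INSERT INTO training VALUES (?, ?)",
               [("python sql cluster", "engineering"),
                ("brand campaign sql", "marketing")])

# Stream pattern: ONE query fetches all training rows; each row is
# processed as the cursor yields it, never materialized wholesale.
counts = Counter()
for description, category in db.execute("SELECT description, category FROM training"):
    for word in description.split():
        counts[(word, category)] += 1

# One bulk write of the finished model instead of per-word updates.
db.execute("CREATE TABLE model (word TEXT, category TEXT, count INTEGER)")
db.executemany("INSERT INTO model VALUES (?, ?, ?)",
               [(w, c, n) for (w, c), n in counts.items()])
db.commit()
```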
Classifying new items:
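Classification follows the same shape: load the model once up front, then stream every item to be classified from a single query, with no per-item lookups. A sketch under the same stand-in assumptions:

```python
import math
import sqlite3
from collections import defaultdict

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE model (word TEXT, category TEXT, count INTEGER)")
db.executemany("INSERT INTO model VALUES (?, ?, ?)",
               [("python", "engineering", 40), ("sql", "engineering", 25),
                ("sql", "marketing", 3), ("brand", "marketing", 30)])
db.execute("CREATE TABLE jobs (id INTEGER, description TEXT)")
db.executemany("INSERT INTO jobs VALUES (?, ?)",
               [(1, "python sql"), (2, "brand sql")])

# Load the whole model with one query, once.
model = defaultdict(dict)
for word, cat, count in db.execute("SELECT word, category, count FROM model"):
    model[cat][word] = count

def score(words, counts):
    return sum(math.log(counts.get(w, 0) + 1) for w in words)

# ONE query streams every item to classify; scoring is all in memory.
results = {}
for job_id, description in db.execute("SELECT id, description FROM jobs"):
    words = description.split()
    results[job_id] = max(model, key=lambda cat: score(words, model[cat]))

print(results)
```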
The main thing to notice is the single query to the database that fetches all the data items to be processed, and the stream-style processing of features (i.e., data items are not first fetched into memory, but stream from the database and are processed as they come in).
As a result, query overhead is avoided and the computation scales to datasets much larger than the available main memory.
(3) The second approach still suffers from a lack of parallel processing and from network communication overhead, both of which may become more noticeable with large datasets.
The network overhead can be avoided by moving the model computation into the database as an aggregate, and the classification as a UDF. If the computation is organized as stream processing, as in (2), this is straightforward to do: it involves "wrapping" UDF declarations around the application code and specifying a partitioning key for the data items in the database. Parallel databases are able to process UDFs over different partitions in parallel. If the application is not stream-processing oriented, the computation will first need to be restructured as in (2). The application side becomes:
Computing the model:
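To show the shape of the in-database aggregate, the sketch below uses sqlite3's custom-aggregate support as a small-scale stand-in for a Postgres/Greenplum aggregate (names and the serialized model format are invented). The application issues one declarative query; the counting happens inside the database engine:

```python
import sqlite3
from collections import Counter

class FeatureCounts:
    """Custom aggregate: accumulates per-word counts inside the database."""
    def __init__(self):
        self.counts = Counter()

    def step(self, description):
        for word in description.split():
            self.counts[word] += 1

    def finalize(self):
        # Serialize the counts as "word:count" pairs in one result value.
        return " ".join(f"{w}:{n}" for w, n in sorted(self.counts.items()))

db = sqlite3.connect(":memory:")
db.create_aggregate("feature_counts", 1, FeatureCounts)
db.execute("CREATE TABLE training (description TEXT, category TEXT)")
db.executemany("INSERT INTO training VALUES (?, ?)",
               [("python sql", "engineering"),
                ("python cluster", "engineering"),
                ("brand campaign", "marketing")])

# One declarative query; only the finished per-category model comes back.
rows = db.execute("SELECT category, feature_counts(description) "
                  "FROM training GROUP BY category").fetchall()
```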
Classifying new items:
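Classification becomes a UDF applied inside a query. The sketch below registers a Python function with sqlite3 as a stand-in for a Postgres Python UDF (the model and scoring are my own simplified illustration); on a parallel database like Greenplum, the same declaration would let the UDF run over each partition in parallel:

```python
import math
import sqlite3

# Hypothetical per-category word counts, standing in for the trained model.
MODEL = {"engineering": {"python": 40, "sql": 25},
         "marketing":   {"brand": 30, "sql": 3}}

def classify(description):
    """Python UDF: score each category by summed log counts, in-database."""
    words = description.split()
    return max(MODEL, key=lambda cat:
               sum(math.log(MODEL[cat].get(w, 0) + 1) for w in words))

db = sqlite3.connect(":memory:")
db.create_function("classify", 1, classify)
db.execute("CREATE TABLE jobs (id INTEGER, description TEXT)")
db.executemany("INSERT INTO jobs VALUES (?, ?)",
               [(1, "python sql cluster"), (2, "brand campaign")])

# The application retrieves only the classification results, which are
# many times smaller than the data used to compute them.
results = db.execute("SELECT id, classify(description) FROM jobs").fetchall()
```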
In this approach, the application declaratively specifies the set of items over which the functions should be applied, allowing classification to be parallelized on a parallel database, and retrieves only the result of the computation, which can be many times smaller than the data used to calculate it.
For more information about in-database NLP and machine learning, extending SQL with procedural languages, UDFs and aggregates for external libraries, or the map/reduce pattern in SQL, please feel free to contact Roger or Milenko.
The upcoming Strata Conference covers many similar topics related to big data, machine learning, analytics and visualization.