Picarus

Visit https://github.com/bwhite/picarus/ for the source.

About

Picarus is a web-scale machine learning library with a focus on computer vision applications. It is still _highly_ experimental, and we are working out in the open. If you are interested in working with us or trying it out, send us an email.

Who

Picarus is developed by Dapper Vision, Inc. (Brandyn White and Andrew Miller). We are PhD students at UMD and UCF respectively and are interested in Computer Vision, Web-Scale Machine Learning, HCI, and Social Networks.

Use Cases

Current

  • Image classification: Determine if an image is one class or another (e.g., indoor/outdoor, person/not person). Includes training.
  • Image clustering: Group images based on visual similarity.
  • Face detection: Find all faces.
  • Keyframe: Find all keyframes in a set of videos (e.g., shot transitions, fast motion).
  • Image search: Build an image search index (initial code available).
  • Real-time REST API (initial code available).

Private (we have the code but, for various reasons, are still working on releasing it)

  • Video classification: Determine if a video is one class or another (e.g., dancing, skateboarding).
  • Object detection: Find the location of a specific object (e.g., car, person). Faces are currently supported.
  • Segmentation/Pixel-level classification: Classify individual image pixels as belonging to a specific class.

Requirements

Our projects

  • hadoopy (doc): Cython-based Hadoop library for Python. Efficient, simple, and powerful.
  • imfeat (doc): Image features (take image, produce feature vector) and support functions.
  • distpy (doc): Distance metrics.
  • classipy: Classifiers using a simple standardized interface (supports scikit-learn).
  • impoint: Image feature point detection and description.
  • vidfeat: Video features (take video, produce feature vector) and support functions.
  • keyframe: Video keyframe algorithms (take video, identify frame changes).

Third party

Useful Tools (Optional)

Our projects (ordered by relevance)

  • hadoopy_flow: Hadoopy monkey patch library to perform automatic job-level parallelism.
  • vision_data: Library of computer vision dataset interfaces with standardized output formats.
  • image_server: Server that displays all images in the current directory as a website (very convenient on headless servers).
  • static_server: Server that allows static file access to the current directory.
  • pycassa_server: Pycassa viewer.
  • vision_results: Library of HTML and JavaScript tools to display computer vision results.
  • hadoop_log: Tool to scrape Hadoop jobtracker logs and provide stderr output (simplifies debugging).
  • pyram: Tiny parameter optimization library (useful when tuning up algorithms).
  • mturk_vision: Mechanical turk scripts.

Hadoop Vision Jobs

picarus.vision.run_image_feature(hdfs_input, hdfs_output, feature, image_length=None, image_height=None, image_width=None, **kw)
picarus.vision.run_video_keyframe(hdfs_input, hdfs_output, min_interval, resolution, ffmpeg, **kw)
picarus.vision.run_predict_windows(hdfs_input, hdfs_classifier_input, feature, hdfs_output, image_height, image_width, **kw)

Video Feature Computation

Figure: video pipeline (_images/fig_video_pipeline.png)

Hadoop Cluster Jobs

picarus.cluster.run_whiten(hdfs_input, hdfs_output, image_hashes=None, **kw)
picarus.cluster.run_sample(hdfs_input, hdfs_output, num_clusters, **kw)
picarus.cluster.run_kmeans(hdfs_input, hdfs_prev_clusters, hdfs_image_data, hdfs_output, num_clusters, num_iters, num_samples, metric='l2sqr', local_json_output=None, image_hashes=None, **kw)
picarus.cluster.run_hac(hdfs_input, **kw)
picarus.cluster.run_local_kmeans(hdfs_input, hdfs_output, num_clusters, *args, **kw)
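
The metric='l2sqr' default of run_kmeans suggests squared Euclidean (L2) distance. As a minimal local sketch under that assumption, one iteration of standard Lloyd's k-means might look like the following. The function names are hypothetical, and this is plain Python for illustration, not the Hadoop job that Picarus runs.

```python
def l2sqr(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(point, clusters):
    """Index of the cluster center closest to point under l2sqr."""
    return min(range(len(clusters)), key=lambda i: l2sqr(point, clusters[i]))


def kmeans_step(points, clusters):
    """One Lloyd iteration: assign each point, then average each group."""
    groups = [[] for _ in clusters]
    for point in points:
        groups[nearest_cluster(point, clusters)].append(point)
    new_clusters = []
    for center, group in zip(clusters, groups):
        if not group:
            # Assumption: an empty cluster keeps its previous center.
            new_clusters.append(tuple(center))
            continue
        # Average the group one dimension at a time.
        new_clusters.append(tuple(sum(dim) / len(group) for dim in zip(*group)))
    return new_clusters
```

Calling kmeans_step repeatedly with its own output as the next set of clusters loosely corresponds to the num_iters parameter.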

Hadoop Classification Jobs

picarus.classify.run_classifier_labels(hdfs_input_pos, hdfs_input_neg, hdfs_output, classifier_name, classifier_extra, local_labels, classifier, **kw)

TODO: Finish docstring.

Args:

  • hdfs_output: Path to HDFS temporary output, or None if execution should be performed locally using hadoopy.launch_local.
picarus.classify.run_train_classifier(hdfs_input, hdfs_output, local_labels, **kw)
picarus.classify.run_predict_classifier(hdfs_input, hdfs_classifier_input, hdfs_output, classes=None, image_hashes=None, **kw)
picarus.classify.run_join_predictions(hdfs_predictions_input, hdfs_input, hdfs_output, local_image_output, **kw)
picarus.classify.run_thresh_predictions(hdfs_predictions_input, hdfs_input, hdfs_output, class_name, class_thresh, output_class, in_memory=False, **kw)

Hadoop IO

picarus.io.load_local(local_input, hdfs_output, output_format='kv', max_record_size=None, max_kv_per_file=None, **kw)

Read data, de-duplicate it, and put it on HDFS in the specified format.

Args:

  • local_input: Local directory path.
  • hdfs_output: HDFS output path.
  • output_format: One of ‘kv’ or ‘record’. If ‘kv’, output sequence files of the form (sha1_hash, binary_file_data). If ‘record’, output sequence files of the form (sha1_hash, metadata), where metadata has the keys listed below and only one of data or hdfs_path has to exist.
      • sha1: SHA-1 hash.
      • extension: File extension without a period (blah.avi -> avi, blah.foo.avi -> avi, blah -> ‘’).
      • full_path: Local file path.
      • hdfs_path: HDFS path of the file (if any); the data should be the binary contents of the file stored at this location on HDFS.
      • data: Binary file contents.
  • max_record_size: If using ‘record’ and the file size (in bytes) is larger than this, store the contents of the file in a directory called ‘_blobs’ inside the output path, named with the sha1 hash prefixed to the original file name (for example, hdfs_output/_blobs/sha1hash_origname). If None, there is no limit to the record size (default None).
  • max_kv_per_file: If not None, put only this number of kv pairs in each sequence file (default None).
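
To make the two output formats concrete, here is a minimal local sketch of how one (key, value) pair could be built for each format. It only mirrors the docstring above and does not use Picarus or HDFS. The helper names, the hex-encoded hash, and the default 'hdfs_output' path are assumptions made for illustration.

```python
import hashlib
import os


def file_extension(path):
    """Extension without a period: blah.foo.avi -> 'avi', blah -> ''."""
    return os.path.splitext(os.path.basename(path))[1].lstrip('.')


def make_kv(data):
    """'kv' style pair: (sha1_hash, binary_file_data)."""
    return hashlib.sha1(data).hexdigest(), data


def make_record(full_path, data, max_record_size=None, hdfs_output='hdfs_output'):
    """'record' style pair: (sha1_hash, metadata)."""
    sha1 = hashlib.sha1(data).hexdigest()
    metadata = {'sha1': sha1,
                'extension': file_extension(full_path),
                'full_path': full_path}
    if max_record_size is not None and len(data) > max_record_size:
        # Large file: store a path under '_blobs' instead of the bytes.
        name = '%s_%s' % (sha1, os.path.basename(full_path))
        metadata['hdfs_path'] = '/'.join([hdfs_output, '_blobs', name])
    else:
        # Small file: keep the contents inline.
        metadata['data'] = data
    return sha1, metadata


def dedupe(blobs):
    """Keep one copy of each distinct blob, keyed by its sha1 hash."""
    unique = {}
    for data in blobs:
        unique.setdefault(hashlib.sha1(data).hexdigest(), data)
    return unique
```

Because the key is a content hash, two files with identical bytes map to the same key, which is one plausible way the de-duplication step could work.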
picarus.io.dump_local(hdfs_input, local_output, extension='', **kw)

Read data from HDFS and store the contents as hash.ext.

Args:

  • hdfs_input: HDFS input path in either ‘kv’ or ‘record’ format.
  • local_output: Local directory output path.
  • extension: Use this file extension if none is available (kv format, or record with a missing extension) (default ‘’).
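
The hash.ext naming rule can be sketched locally as follows. This is an illustration under stated assumptions, not the Picarus implementation: dump_pairs is a hypothetical helper, pairs are assumed to be already read into memory, records that carry only hdfs_path are not handled, and a file with no extension at all is assumed to be written under the bare hash.

```python
import os


def dump_pairs(pairs, local_output, extension=''):
    """Write each (sha1_hash, value) pair to local_output as hash.ext."""
    os.makedirs(local_output, exist_ok=True)
    written = []
    for sha1, value in pairs:
        if isinstance(value, dict):
            # 'record' format: value is the metadata dict.
            data = value['data']
            ext = value.get('extension') or extension
        else:
            # 'kv' format: value is the raw bytes, no extension stored.
            data = value
            ext = extension
        name = '%s.%s' % (sha1, ext) if ext else sha1
        path = os.path.join(local_output, name)
        with open(path, 'wb') as fp:
            fp.write(data)
        written.append(path)
    return written
```

The extension argument acts only as a fallback: a record that already has a non-empty extension keeps it.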
picarus.io.run_record_to_kv(hdfs_input, hdfs_output, **kw)
picarus.io.run_kv_to_record(hdfs_input, hdfs_output, extension, base_path, **kw)
