Picarus
Visit https://github.com/bwhite/picarus/ for the source.
About
Picarus is a web-scale machine learning library with a focus on computer vision applications. It is still _highly_ experimental, and we are working in the open. If you are interested in working with us or trying it out, send us an email.
Use Cases
Current
- Image classification: Determine if an image is one class or another (e.g., indoor/outdoor, person/not person). Includes training.
- Image clustering: Group images based on visual similarity.
- Face detection: Find all faces.
- Keyframe: Find all keyframes in a set of videos (e.g., shot transitions, fast motion).
- Image search: Build an image search index (initial code available).
- Real-time REST API (initial code available).
Private (we have the code but, for various reasons, are still working on releasing it)
- Video classification: Determine if a video is one class or another (e.g., dancing, skateboarding)
- Object detection: Find the location of a specific object (e.g., car, person). Faces are currently supported.
- Segmentation/Pixel-level classification: Classify individual image pixels as belonging to a specific class.
Requirements
Our projects
- hadoopy (doc): Cython based Hadoop library for Python. Efficient, simple, and powerful.
- imfeat (doc): Image features (take image, produce feature vector) and support functions.
- distpy (doc): Distance metrics.
- classipy: Classifiers using a simple standardized interface (supports scikit-learn).
- impoint: Image feature point detection and description.
- vidfeat: Video features (take video, produce feature vector) and support functions.
- keyframe: Video keyframe algorithms (take video, identify frame changes).
Third party
Hadoop Vision Jobs
- picarus.vision.run_image_feature(hdfs_input, hdfs_output, feature, image_length=None, image_height=None, image_width=None, **kw)
- picarus.vision.run_video_keyframe(hdfs_input, hdfs_output, min_interval, resolution, ffmpeg, **kw)
- picarus.vision.run_predict_windows(hdfs_input, hdfs_classifier_input, feature, hdfs_output, image_height, image_width, **kw)
Video Feature Computation
Hadoop Cluster Jobs
- picarus.cluster.run_whiten(hdfs_input, hdfs_output, image_hashes=None, **kw)
- picarus.cluster.run_sample(hdfs_input, hdfs_output, num_clusters, **kw)
- picarus.cluster.run_kmeans(hdfs_input, hdfs_prev_clusters, hdfs_image_data, hdfs_output, num_clusters, num_iters, num_samples, metric='l2sqr', local_json_output=None, image_hashes=None, **kw)
- picarus.cluster.run_hac(hdfs_input, **kw)
- picarus.cluster.run_local_kmeans(hdfs_input, hdfs_output, num_clusters, *args, **kw)
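The default `metric='l2sqr'` in `run_kmeans` presumably names squared Euclidean distance; that reading is an assumption, as the signature does not say. Under that assumption, one k-means iteration can be sketched locally in plain Python. The helper names (`l2sqr`, `nearest_cluster`, `kmeans_step`) are hypothetical and are not part of Picarus, and this is not how the Hadoop job itself is implemented.

```python
def l2sqr(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(point, clusters):
    """Index of the cluster center closest to point under l2sqr."""
    return min(range(len(clusters)), key=lambda i: l2sqr(point, clusters[i]))


def kmeans_step(points, clusters):
    """One iteration: assign each point, then average the points per cluster."""
    sums = [[0.0] * len(c) for c in clusters]
    counts = [0] * len(clusters)
    for p in points:
        i = nearest_cluster(p, clusters)
        counts[i] += 1
        sums[i] = [s + x for s, x in zip(sums[i], p)]
    # Keep the old center when a cluster received no points.
    return [[s / counts[i] for s in sums[i]] if counts[i] else list(clusters[i])
            for i in range(len(clusters))]
```

Repeating `kmeans_step` a fixed number of times loosely mirrors the role of the `num_iters` argument; a distributed version would split the assignment and averaging across map and reduce stages.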
Hadoop Classification Jobs
- picarus.classify.run_classifier_labels(hdfs_input_pos, hdfs_input_neg, hdfs_output, classifier_name, classifier_extra, local_labels, classifier, **kw)
  TODO Finish docstring
  Args:
    hdfs_output: Path to HDFS temporary output, or None if execution should be performed locally using hadoopy.launch_local.
- picarus.classify.run_train_classifier(hdfs_input, hdfs_output, local_labels, **kw)
- picarus.classify.run_predict_classifier(hdfs_input, hdfs_classifier_input, hdfs_output, classes=None, image_hashes=None, **kw)
- picarus.classify.run_join_predictions(hdfs_predictions_input, hdfs_input, hdfs_output, local_image_output, **kw)
- picarus.classify.run_thresh_predictions(hdfs_predictions_input, hdfs_input, hdfs_output, class_name, class_thresh, output_class, in_memory=False, **kw)
Hadoop IO
- picarus.io.load_local(local_input, hdfs_output, output_format='kv', max_record_size=None, max_kv_per_file=None, **kw)
  Read data, de-duplicate, and put on HDFS in the specified format.
  Args:
    local_input: Local directory path
    hdfs_output: HDFS output path
    output_format: One of ‘kv’ or ‘record’. If ‘kv’ then output sequence
      files of the form (sha1_hash, binary_file_data). If ‘record’
      then output sequence files of the form (sha1_hash, metadata)
      where metadata has keys
        sha1: SHA-1 hash
        extension: File extension without a period (blah.avi -> avi,
          blah.foo.avi -> avi, blah -> ‘’)
        full_path: Local file path
        hdfs_path: HDFS path of the file (if any); the data should be the
          binary contents of the file stored at this location on HDFS.
        data: Binary file contents
      where only one of data or hdfs_path has to exist.
    max_record_size: If using ‘record’ and the file size (in bytes) is larger
      than this, then store the contents of the file in a directory called
      ‘_blobs’ inside the output path, with the name as the SHA-1 hash prefixed
      to the original file name (for example, hdfs_output/_blobs/sha1hash_origname).
      If None then there is no limit to the record size (default is None).
    max_kv_per_file: If not None then only put this number of kv pairs in each
      sequence file (default None).
- picarus.io.dump_local(hdfs_input, local_output, extension='', **kw)
  Read data from HDFS and store the contents as hash.ext.
  Args:
    hdfs_input: HDFS input path in either ‘kv’ or ‘record’ format
    local_output: Local directory output path
    extension: Use this file extension if none is available (kv format or
      record with missing extension) (default ‘’)
- picarus.io.run_record_to_kv(hdfs_input, hdfs_output, **kw)
- picarus.io.run_kv_to_record(hdfs_input, hdfs_output, extension, base_path, **kw)
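The ‘record’ layout documented for `picarus.io.load_local` and the hash.ext naming used by `picarus.io.dump_local` can be illustrated without Hadoop. The sketch below is a local approximation only, not the Picarus implementation: the function names are hypothetical, the use of a hex-encoded SHA-1 digest is an assumption, and so is the choice to omit the period when no extension is available.

```python
import hashlib
import os


def file_extension(path):
    """Extension without the period: blah.avi -> avi, blah.foo.avi -> avi, blah -> ''."""
    return os.path.splitext(os.path.basename(path))[1].lstrip('.')


def build_records(paths_and_data):
    """Yield (sha1_hash, metadata) pairs, skipping files with duplicate contents."""
    seen = set()
    for full_path, data in paths_and_data:
        sha1 = hashlib.sha1(data).hexdigest()  # assumption: hex digest as the key
        if sha1 in seen:  # identical bytes were already emitted: de-duplicate
            continue
        seen.add(sha1)
        yield sha1, {'sha1': sha1,
                     'extension': file_extension(full_path),
                     'full_path': full_path,
                     'data': data}


def dump_name(sha1, metadata, default_extension=''):
    """File name in hash.ext form, falling back to default_extension."""
    ext = metadata.get('extension') or default_extension
    # Assumption: no trailing period when there is no extension at all.
    return sha1 + '.' + ext if ext else sha1
```

For example, feeding two paths with identical bytes into `build_records` yields a single record, and `dump_name` applied to that record gives the name under which the contents would be written locally.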