Welcome to the BID Data Project! Here you will find resources for the fastest Big Data tools on the Web. See our Benchmarks on GitHub. BIDMach running on a single GPU-equipped host holds the records for many common machine learning problems, on single nodes or clusters.

Try It! BIDMach is an interactive environment designed to make it extremely easy to build and use machine learning models. BIDMach runs on Linux, Windows 7 and 8, and Mac OS X, and we provide a pre-loaded Amazon EC2 instance. See the instructions in the Download Section.

Develop with it. BIDMach includes core classes that take care of managing data sources, optimizing models, and distributing data over CPUs or GPUs. It’s very easy to write your own models by generalizing from those already included in the Toolkit.

Explore. Our Publications Section includes published reports on the project, and the topics of forthcoming papers.

Discuss. We have a Google group for BIDMach discussions here.

Contribute. BIDMach includes many popular machine learning algorithms, but there is much more work to do. Work in progress includes Random Forests, extremely fast Gibbs samplers for Bayesian graphical models, distributed Deep Learning networks, and graph algorithms. Ask us for an unpublished report on these topics. Please use GitHub’s issues page for bug reports or suggestions.

Lightning Overview

The BID Data Suite is a collection of hardware, software and design patterns that enable fast, large-scale data mining at very low cost.

Architecture of the Toolkit

The elements of the suite are:

  • Hardware. The data engine that balances storage, CPU and GPU acceleration for typical data mining workloads.
  • Software.
    • BIDMat, an interactive matrix library that integrates CPU and GPU acceleration and novel computational kernels.
    • BIDMach, a machine learning system that includes very efficient model optimizers and mixing strategies.
  • Scaling Up.
    • Butterfly Mixing, a communication strategy for clusters that hides the latency of the frequent model updates needed by fast optimizers.
    • Sparse AllReduce, an efficient MapReduce-like primitive for scalable communication of power-law data.
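To make the Sparse AllReduce idea concrete, here is a minimal conceptual sketch (an illustration only, not BIDMach's actual implementation, which organizes the exchange over a network of machines): each node contributes a sparse vector, and every node receives the element-wise sum. Only indices that actually occur somewhere are communicated, which is what makes the primitive attractive for power-law (mostly empty) data.

```scala
// Conceptual sketch of a sparse allreduce. Each node holds a sparse
// vector (index -> value); the allreduce delivers the element-wise sum
// of all contributions. Names here are illustrative, not BIDMach API.
object SparseAllReduceSketch {
  type SparseVec = Map[Int, Double]

  // Sum the sparse contributions of all nodes. Only indices present in
  // some contribution appear in the result, so dense zero entries are
  // never stored or communicated.
  def allreduce(contributions: Seq[SparseVec]): SparseVec =
    contributions.foldLeft(Map.empty[Int, Double]) { (acc, v) =>
      v.foldLeft(acc) { case (a, (i, x)) =>
        a.updated(i, a.getOrElse(i, 0.0) + x)
      }
    }

  def main(args: Array[String]): Unit = {
    val node0 = Map(0 -> 1.0, 5 -> 2.0)
    val node1 = Map(5 -> 3.0, 7 -> 1.0)
    val total = allreduce(Seq(node0, node1))
    assert(total == Map(0 -> 1.0, 5 -> 5.0, 7 -> 1.0))
  }
}
```

In a real cluster the fold would be replaced by pairwise exchanges between machines; the sketch only shows the reduction semantics.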

In the Benchmarks section, we present several benchmark problems to show how the above elements combine to yield multiple orders-of-magnitude improvements on each problem.

4 thoughts on “Overview”

  1. Lizhen


    The evaluation in the Benchmarks looks amazing! If I want to implement some deep learning models, where should I start? What is the quickest way of doing it? Is it easy to write code for (unit) tests on this platform? Is it easy to integrate integer linear programming packages such as Gurobi (it has a Java interface but no Scala interface)?


  2. John Canny

    Hi Lizhen,
    Thanks for the comments! You make some excellent suggestions. To your comments:
    1. We are actively working on integrating Caffe, a deep learning framework also from Berkeley. You will see the wrapper code if you pull a current version of BIDMach. It should be ready within the next month.

    2. There are several unit-test frameworks that you can use with Scala, and we have a few basic unit tests already in the distribution. We don’t really have the resources to write comprehensive unit tests right now. One shortcut we use is to test each learning algorithm on both GPU and CPU matrices. Since those use entirely different code at the implementation level (where bugs are most likely), any bug leads to different results, which are fairly easy to isolate.
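The cross-checking trick John describes can be sketched generically. Below, two independent implementations of the same reduction stand in for BIDMach's CPU and GPU code paths; the names are illustrative, not BIDMach API.

```scala
// Sketch of cross-checking two independent implementations of the same
// computation. In BIDMach the analogue would be running a learner on CPU
// (FMat) and GPU (GMat) matrices and comparing the results.
object CrossCheck {
  // Implementation 1: imperative loop (stands in for the "CPU path").
  def sumLoop(xs: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < xs.length) { s += xs(i); i += 1 }
    s
  }

  // Implementation 2: functional fold (stands in for the "GPU path").
  def sumFold(xs: Array[Double]): Double = xs.foldLeft(0.0)(_ + _)

  // Compare the two within a floating-point tolerance; a disagreement
  // localizes a bug to one of the two independent code paths.
  def agree(xs: Array[Double], tol: Double = 1e-9): Boolean =
    math.abs(sumLoop(xs) - sumFold(xs)) <= tol
}
```

The value of the technique is that a bug in either path shows up as a numeric disagreement, without hand-written expected outputs for every test case.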

    3. We haven’t looked at specific LP packages, but if Gurobi is in Java, then you can use it from Scala directly. Just put the jar in the lib directory, add it to the classpath when you start bidmach (i.e. modify the bidmach start-up script in the root directory), and import the classes you want to use. You don’t need any extra interface to use a Java class from Scala. We use several Java packages (e.g. apache.commons.math) that way.
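The Java-from-Scala interop described above needs no glue code at all. A minimal sketch using a JDK class; the same import-and-call pattern applies to any third-party jar (such as Gurobi) once it is on the classpath.

```scala
// Scala can use Java classes directly: import and call, no wrapper needed.
import java.util.ArrayList

object JavaInterop {
  def main(args: Array[String]): Unit = {
    val xs = new ArrayList[Int]()   // a plain Java collection class
    xs.add(3)                       // Scala boxes the Int automatically
    xs.add(4)
    println(xs.get(0) + xs.get(1))  // prints 7
  }
}
```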

  3. Marek

    Hi All,
    I recently came across your project and it looks amazing!
    I managed to run it on my Nvidia card (Maxwell architecture); however, I needed to recompile the GPU parts against CUDA 6.0 and add the sm_50 architecture to the Makefiles.
    I ran into some problems with the k-means algorithm and would be happy to ask a few questions. Do you have a user group or bug-tracking system where I could post them?

Leave a Reply