Introducing SAMOA, An Open Source Platform For Mining Big Data Streams.


https://github.com/yahoo/samoa

Machine learning and data mining are well-established techniques in the world of IT, especially among web companies and startups. Spam detection, personalization and recommendations are just a few of the applications made possible by mining the huge quantity of data available nowadays. However, “big data” is not only about Volume, but also about Velocity (and Variety, which together make up the three Vs of big data).

The usual pipeline for modeling data (what “data scientists” do) involves taking a sample from production data, cleaning and preprocessing it to make it usable, training a model for the task at hand and finally deploying it to production. The final output of this process is a pipeline that needs to run periodically (and be maintained) in order to keep the model up to date. Hadoop and its ecosystem (e.g., Mahout) have proven to be an extremely successful platform to support this process at web scale.

However, no solution is perfect and big data is “data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time”. The current challenge is to analyze data as soon as it arrives in the system, in near real time.

For example, models for mail spam detection become outdated over time and need to be retrained with new data. New data (i.e., spam reports) comes in continuously, and the model starts becoming outdated the moment it is deployed: all the new data sits idle, creating no value, until the next model update. By contrast, incorporating new data as soon as it arrives is what the “Velocity” in big data is about. In this case, Hadoop is not the ideal tool to cope with streams of fast-changing data.

Distributed stream processing engines are emerging as the platform of choice to handle this use case. Examples of these platforms are Storm, S4, and, more recently, Samza. These platforms combine the scalability of distributed processing with the fast response of stream processing. Yahoo has already adopted Storm as a key technology for low-latency big data processing.

Alas, there is currently no common solution for mining big data streams, that is, for doing machine learning on streams in a distributed environment.

Enter SAMOA

SAMOA (Scalable Advanced Massive Online Analysis) is a framework for mining big data streams. Like most of the big data ecosystem, it is written in Java. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

SAMOA is both a platform and a library. As a platform, it allows algorithm developers to abstract from the underlying execution engine, and therefore to reuse their code across different engines. It also makes it easy to write plug-in modules that port SAMOA to different execution engines.

As a library, SAMOA contains state-of-the-art implementations of algorithms for distributed machine learning on streams. The first alpha release allows classification and clustering.

For classification, we implemented a Vertical Hoeffding Tree (VHT), a distributed streaming version of decision trees tailored for sparse data (e.g., text). For clustering, we included a distributed algorithm based on CluStream. The library also includes meta-algorithms such as bagging.
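To give a feel for what sits underneath a Hoeffding tree: trees of this family typically decide when a leaf has seen enough instances to split by using the Hoeffding bound. After n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta. The sketch below only illustrates that bound and the resulting split test. It is not SAMOA's VHT implementation, and the class and method names are hypothetical.

```java
public class HoeffdingBoundSketch {

    // Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
    // range = R, delta = allowed error probability, n = instances seen at the leaf.
    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // Split when the best attribute beats the runner-up by more than epsilon.
    // The merits would be, for example, information gain values (an assumption here).
    static boolean shouldSplit(double bestMerit, double secondBestMerit,
                               double range, double delta, long n) {
        return bestMerit - secondBestMerit > hoeffdingBound(range, delta, n);
    }

    public static void main(String[] args) {
        double range = 1.0;   // e.g. information gain on a two-class problem
        double delta = 1e-7;  // illustrative confidence parameter
        for (long n : new long[] {100, 1_000, 10_000, 100_000}) {
            System.out.printf("n=%d epsilon=%.4f%n", n, hoeffdingBound(range, delta, n));
        }
    }
}
```

The bound shrinks as more instances arrive, so a leaf that cannot separate two candidate attributes early on may be able to do so later in the stream.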

HOW DOES IT WORK?

An algorithm in SAMOA is represented by a graph of nodes that communicate via messages along streams, each of which connects a pair of nodes. Borrowing the terminology from Storm, this graph is called a Topology. Each node in the Topology is a Processor that sends messages to a Stream. The user code that implements the algorithm resides inside a Processor. Figure 3 shows an example of a Processor joining two streams from two source Processors. Here is a code snippet to build such a topology in SAMOA.

TopologyBuilder builder;
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);
Processor join = new JoinProcessor();
builder.addProcessor(join)
    .connectInputShuffle(streamOne)
    .connectInputKey(streamTwo);
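The connectInputShuffle and connectInputKey calls in the snippet above suggest two ways of routing messages to the parallel instances of a Processor. The following plain-Java sketch shows how such routing is commonly understood in Storm-style engines, under the assumption that "shuffle" spreads messages evenly regardless of content (round-robin here) and "key" sends all messages with the same key to the same instance. It does not use the SAMOA API, and all names in it are hypothetical.

```java
public class GroupingSketch {

    // Shuffle-style routing: spread messages evenly, ignoring their content.
    // Round-robin by message index is one simple way to do this.
    static int shuffleTarget(long messageIndex, int parallelism) {
        return (int) (messageIndex % parallelism);
    }

    // Key-style routing: the same key always lands on the same instance.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int keyTarget(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 3;
        String[] keys = {"a", "b", "a", "c", "b", "a"};
        for (int i = 0; i < keys.length; i++) {
            System.out.printf("msg %d key=%s shuffle->%d key->%d%n",
                    i, keys[i],
                    shuffleTarget(i, parallelism),
                    keyTarget(keys[i], parallelism));
        }
    }
}
```

Key-based routing matters for a join or for per-key state: every message about a given key reaches the instance that holds that key's state, while shuffle routing is enough when any instance can handle any message.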

SWEET! HOW DO I GET STARTED?

1. Download SAMOA

git clone git@github.com:yahoo/samoa.git
cd samoa
mvn -Pstorm package

2. Download the Forest CoverType dataset.

wget "http://downloads.sourceforge.net/project/moa-datastream/Datasets/Classification/covtypeNorm.arff.zip"
unzip covtypeNorm.arff.zip

Forest CoverType contains the forest cover type for 30 x 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 581,012 instances and 54 attributes, and it has been used in several papers on data stream classification.

3. Download a simple logging library.

wget "http://repo1.maven.org/maven2/org/slf4j/slf4j-simple/1.7.2/slf4j-simple-1.7.2.jar"

4. Run an example: classify the CoverType dataset with the VerticalHoeffdingTree in local mode.

java -cp slf4j-simple-1.7.2.jar:target/SAMOA-Storm-0.0.1.jar com.yahoo.labs.samoa.DoTask "PrequentialEvaluation -l classifiers.trees.VerticalHoeffdingTree -s (ArffFileStream -f covtypeNorm.arff) -f 100000"

The output will be a sequence of the evaluation metrics for accuracy, taken every 100,000 instances.
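For readers unfamiliar with prequential evaluation, the idea is "test, then train": each incoming instance is first used to test the current model and only afterwards to update it, so accuracy is always measured on data the model has not yet seen. The sketch below illustrates that loop with a toy majority-class learner and a reporting frequency analogous to the -f option. It is an illustration under these assumptions, not the code behind the PrequentialEvaluation task, and all names in it are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class PrequentialSketch {

    // Toy learner: always predicts the most frequent label seen so far.
    static class MajorityClassLearner {
        private final Map<Integer, Integer> counts = new HashMap<>();

        int predict() {
            int best = -1;       // -1 means "no prediction yet"
            int bestCount = -1;
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best;
        }

        void train(int label) {
            counts.merge(label, 1, Integer::sum);
        }
    }

    // Runs test-then-train over the labels, printing cumulative accuracy
    // every `frequency` instances, and returns the final accuracy.
    static double evaluate(int[] labels, int frequency) {
        MajorityClassLearner learner = new MajorityClassLearner();
        long correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (learner.predict() == labels[i]) {
                correct++;                 // test on the instance first
            }
            learner.train(labels[i]);      // then use it for training
            int seen = i + 1;
            if (seen % frequency == 0) {
                System.out.printf("instances=%d accuracy=%.3f%n",
                        seen, (double) correct / seen);
            }
        }
        return labels.length == 0 ? 0.0 : (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] labels = {1, 1, 0, 1, 1, 0, 1, 1};
        evaluate(labels, 4);
    }
}
```

Because every instance is scored before the model learns from it, no separate hold-out set is needed, which suits streams where data arrives continuously.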

To run the example on Storm, please refer to the instructions on the wiki.

I WANT TO KNOW MORE!

For more information about SAMOA, see the README and the wiki on GitHub, or post a question on the mailing list.

SAMOA is licensed under the Apache Software License v2.0. You are welcome to contribute to the project! SAMOA accepts contributions under an Apache-style contributor license agreement.

Good luck! We hope you find SAMOA useful. We will continue developing the framework by adding new algorithms and platforms.

Gianmarco De Francisci Morales (gdfm@yahoo-inc.com) and Albert Bifet (abifet@yahoo.com) @ Yahoo Labs Barcelona
