Entries Tagged 'Data Services' ↓

New Web Service Analyzes Thousands of Data Sets For Any Community

Geographic data giant ESRI today released to the public its new web service called Community Analyst, which offers fast search and mapping of up to five simultaneous data sets among thousands of options. Community Analyst has been in beta testing for several months, can be used free for two weeks and then starts at just under one thousand dollars per user per year.

ESRI has been widely criticized by a new generation of geo-geeks as old, unwieldy and unduly influential (it's much used, if not much loved, by people in government), but the company appears determined to offer continually improved web-based means of analyzing large amounts of data with regard to spaces and places. The demos of Community Analyst look quite nice. All kinds of reports can be generated about one or multiple places, including sets of addresses uploaded from a spreadsheet.

A user could search by saying, for example: show me blocks in my city where median income is between $10,000 and $30,000, where there are at least 5 children living on the block, where access to medical care is unusually low, neighborhood nutrition is poor and public transportation substandard. That's where you want to put the medical clinic, right?
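A search like that boils down to a stack of range filters applied to every block at once. Here's a minimal sketch in Python of what that might look like. The field names, thresholds and sample values are all made up for illustration, the nutrition criterion is left out for brevity, and none of this reflects Community Analyst's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # All field names are hypothetical, chosen for illustration only.
    block_id: str
    median_income: float
    children: int
    clinic_access: float   # 0.0 (none) to 1.0 (excellent)
    transit_score: float   # 0.0 (none) to 1.0 (excellent)


def candidate_blocks(blocks, income_range=(10_000, 30_000), min_children=5,
                     max_clinic_access=0.3, max_transit_score=0.3):
    """Return the blocks that satisfy every criterion at the same time."""
    low, high = income_range
    return [
        b for b in blocks
        if low <= b.median_income <= high
        and b.children >= min_children
        and b.clinic_access <= max_clinic_access
        and b.transit_score <= max_transit_score
    ]


# Invented sample data, not real census figures.
SAMPLE_BLOCKS = [
    Block("A", 22_000, 8, 0.2, 0.1),
    Block("B", 55_000, 9, 0.1, 0.1),   # income too high
    Block("C", 18_000, 2, 0.2, 0.2),   # too few children
    Block("D", 12_500, 6, 0.25, 0.3),
]

matches = [b.block_id for b in candidate_blocks(SAMPLE_BLOCKS)]
print(matches)
```

With the sample data above, only blocks A and D pass all four filters. A real service would also have to join these attributes from separate data sets and map the results, which is the hard part.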

As a growing number of web services emerge and compete to offer the most and best data, the easiest interface and the best forms of added value, users around the world should benefit greatly from more, better and faster options.

This public launch received almost no discussion at all among geolocation geeks online today, but Community Analyst will presumably be widely discussed at next week's giant ESRI User Conference in San Diego.


Ravel Open Sources GoldenOrb, Big Data Graph Processing for Everyone

Traditional databases, even when they're called "relational databases," tend not to handle relationships very well, and the traditional way of processing data - particularly large-scale datasets - can actually mean that some of the relationships between objects are lost or obscured.

Several years ago, Google began encountering these sorts of problems with relationship-heavy data, particularly as this graph data didn't really fit into its MapReduce system for big data processing. So Google developed a system called Pregel, which solved the graph data problem and allowed that data to be processed on a massive scale.

While Pregel remains an in-house technology for Google, the data startup Ravel is releasing its Pregel-like, large-scale graph processing technology today.

GoldenOrb, which Ravel is also open sourcing (GitHub link), will solve some of the same types of problems as Pregel, but can be applied to many other areas beyond network analysis and social graph analysis, such as epidemiology and mathematics.

But most importantly, says Zach Richardson, the lead architect of the GoldenOrb project and the CTO of Ravel, it makes the programming that developers have to do far simpler. Rather than worrying about how they can get it to run across thousands of machines, "they can just focus on the algorithm for solving their particular problem." According to Richardson, this means that large scale data problems are now solvable even by startups.
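To give a feel for what "just focus on the algorithm" can mean in a Pregel-style system, here is a toy, single-process sketch in Python of vertex-centric processing: in each round (a "superstep") every vertex reads the messages sent to it, updates its own value and messages its neighbors. This is not GoldenOrb's or Pregel's API; the function name and the sample graph are invented for illustration, and a real system would spread the vertices across many machines.

```python
def propagate_max(values, edges):
    """Spread the largest value in a graph to every vertex it can reach.

    values: dict mapping vertex -> starting number
    edges:  dict mapping vertex -> list of neighbor vertices
    Returns the final values and the number of supersteps that ran.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbors.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for neighbor in edges.get(v, []):
            inbox[neighbor].append(val)
    supersteps = 1

    # Later supersteps: a vertex only acts if it received messages, and
    # only sends again if one of them improved its value. When no
    # messages are in flight, the computation halts.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if not messages:
                continue
            best = max(messages)
            if best > values[v]:
                values[v] = best
                for neighbor in edges.get(v, []):
                    outbox[neighbor].append(best)
        inbox = outbox
        supersteps += 1

    return values, supersteps


# Invented four-vertex chain: a - b - c - d
START = {"a": 3, "b": 6, "c": 2, "d": 1}
LINKS = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

final_values, rounds = propagate_max(START, LINKS)
print(final_values)
```

The per-vertex logic is a handful of lines; everything about distribution, message delivery and knowing when to stop is what a framework like this is meant to take off the developer's plate.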

Richardson says that Ravel opted to open source the technology so that others could work on writing algorithms and solve various problems that, in turn, Ravel hopes to be able to learn from as well. The company has no immediate plans to offer commercial support around GoldenOrb.


Factual and SimpleGeo Team Up to Offer Developers a Better Geo-Data Toolkit

Developers working on building location-aware applications need at least two things: a robust set of tools and quality geo-data.

A new strategic partnership announced today between two of the leading geo-data startups, Factual and SimpleGeo, will help with just that, as SimpleGeo will now be incorporating Factual's global places data into its Places API.

Factual's dataset contains more than 30 million places and local points of interest. The quantity of data is impressive, to be sure, but the key here is really the quality of that data. Factual has worked to "clean up" its location data and to standardize some of the naming conventions for the data fields. This is particularly important when it comes to handling data from multiple countries, for example, but aggregating and normalizing data is a challenge that big data startups all face.

Through this partnership, SimpleGeo's CEO Jay Adelson says his company will be able to "focus on our core strengths" - its location-aware toolset, which includes "Storage," its hosted database for location information, and "Context," a tool that lets developers query for data relevant to a given location. The partnership will mean that SimpleGeo can continue to focus on the tools, while Factual will maintain the quality of the places data.

The timing of the announcement comes as interest in location-based technologies is rapidly expanding, and it's likely that both startups will benefit from a partnership that clearly delineates their areas of expertise. But developers too will benefit, says Factual CEO Gil Elbaz, as they'll have more support, better tools and better data.

It also means that geo-data now has, in Elbaz's words, a "sophisticated stack" upon which developers can build.


Crazy Mashup: All Things Really Do Come Back to Philosophy, on Wikipedia

Mashup developer Jeffrey Winter was thinking about Wikipedia one day and specifically about a rumor that if you kept following the first link on each Wikipedia entry, you'd eventually land on the page for Philosophy. So, nerd that he must surely be, he built a web interface to trace this phenomenon and visualize it. The end result is very cool.

Called "All Roads Lead to 'Philosophy'," Winter's mashup tests what he believed to be a reasonable theory and it seems to test well. The fact is that Wikipedia is more regularly structured than one might think and as one commenter on Winter's post said, most Wikipedia articles begin by saying that the subject of the page is a subset of a larger concept. As you click through those larger and larger concepts, you will eventually hit the ultimate abstraction: philosophy! It's pretty cool, give it a try.
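The idea is simple enough to sketch in a few lines of Python: keep following the first link until you reach Philosophy, hit a dead end or start going in circles. The link table below is invented for illustration (these are not real Wikipedia first links), and this is not Winter's code; a real version would fetch each article and pull out its first link.

```python
def first_link_path(start, first_link, target="Philosophy", max_hops=100):
    """Follow first links from `start`; return (path, reached_target)."""
    path = [start]
    seen = {start}
    current = start
    while current != target and len(path) <= max_hops:
        nxt = first_link.get(current)
        # Stop on a dead end or as soon as we would revisit a page.
        if nxt is None or nxt in seen:
            return path, False
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path, current == target


# Hypothetical first-link table, made up for this example.
TOY_LINKS = {
    "Mashup": "Web application",
    "Web application": "Software",
    "Software": "Computer",
    "Computer": "Machine",
    "Machine": "Science",
    "Science": "Knowledge",
    "Knowledge": "Philosophy",
    "Loop A": "Loop B",
    "Loop B": "Loop A",
}

path, reached = first_link_path("Mashup", TOY_LINKS)
print(" > ".join(path), reached)
```

The loop check matters: without it, two articles whose first links point at each other would keep the trace running forever.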

Thanks are due to the always enjoyable blog Flowing Data for finding this one.

So on some level this is a statement about life and the world (and Philosophy), surely, but on another level it's a statement about Wikipedia.

Two and a half years ago I wrote that Wikipedia's future could be as a development platform. The site contains a gargantuan amount of human created and tended but largely machine readable and structured data. That's a potential gold mine in terms of a potential pay-off in innovation. Wikipedia can offer developers opportunities to glean analysis, supplemental content and structured data from its years-old store of collaboratively generated information.

At least one prominent startup since then, however, has stopped using Wikipedia content as a part of its service because of the site's tendency to explain things either too generally or too technically and the penalty that search engines impose on duplicate content around the web.

But if what was a web of pages is becoming a web of applications, perhaps duplicate content isn't so bad anymore. Perhaps content can become a commodity and platforms like Wikipedia can serve up what they do best (create content) and then their users can do with it what they do best (everything else).

The potential applications go well beyond fun head-scratchers like the Philosophy mashup above, but a project like this does demonstrate just how structured the wild-west of Wikipedia, and of the human experience, really is.


How Twitter + iOS 5 Will Change Mobile Apps

A deep integration of Twitter and iOS 5 was among the many things announced by Apple today but it's not just that you'll be able to post to Twitter from inside official Apple apps like photos and maps. Any 3rd party iOS developer will be able to leverage a number of Twitter Application Programming Interfaces (APIs) to make their apps better and more social. After email, SMS and iOS messaging, Twitter will now become a key social layer over the top of many of the apps on iOS devices.

The features that app developers will have access to closely resemble what other platforms make possible with Facebook integration, and Twitter's being the one to land this deal is a pretty big deal for the world's 2nd place social network. Twitter Developer Relations leader Jason Costa wrote this afternoon on the Twitter developers email list that the points of integration will "create huge opportunities for both Twitter and iOS developers." Here's what that might look like.

Costa, who just joined Twitter six weeks ago to try and give that company's relationship with developers a big refresh, announced to the community this afternoon that there will be an event on Wednesday at Twitter headquarters in San Francisco to talk about the new union of platforms.

My summary, in a sentence: iOS apps will look like, feel like, read from and publish to Twitter like never before. And they'll do that in many cases instead of using Facebook.

Costa summarizes thusly.

  • "There is single sign-on, which allows you to retrieve a user's identity, avatar, and other profile data." That sounds like Facebook Connect, but I'm going to guess that Twitter will not prohibit developers from caching that data for time-shifted, aggregate, offline or other interesting types of analysis. Letting users skip having to create an account with every new app they download and instead click to log-in with their Twitter accounts is going to make many users very happy and encourage every iOS owner to get a Twitter account if they don't have one already. App developers will get more and better populated user accounts, faster.

  • "There's also a frictionless core signing service, allowing you to make and sign any call to the Twitter API." To be honest, I'm not really sure what this means. Perhaps it means that parts of the Twitter API that require user authentication will be accessible via the same single sign-on feature discussed above.

  • "There is follow graph synchronization, which enables you to bootstrap a user's social graph for your app." In other words, apps will be able to offer users a way to find their Twitter friends who are also using a new app they've installed, and connect with them there too. That's the kind of solution to the user-level "cold start problem" that Facebook Connect has been so helpful with for web apps.

  • "Furthermore, there is the tweet sheet feature, giving your app distribution and reach across Twitter." Again, like Facebook Connect, this is a feature that appears to make it easy for apps to publish user activity and promotional messages out into the Twitter streams of a user's friends. Facebook has a complicated algorithm that determines how often an app is allowed to publish messages out into the Newsfeed of a user's friends, based on how many interactions messages from that app have received in the past. That's a spam control mechanism that I'm going to guess Twitter will not replicate, at least at first.

  • "Loren Brichter [creator of beautiful iPhone Twitter app Tweetie, which was acquired and turned into the official Twitter iPhone app] will also be talking about ABUIKit, a UI framework specifically for Mac, which we'll be open-sourcing." Those are Costa's words. Longtime social media leader Anil Dash has this to say about ABUIKit, "I know 3rd party client devs are still mad at Twitter, but every 'sign in with Twitter' app dev on iOS will be super excited about ABUIKit."

Twitter vs. Facebook

In other words, this looks a lot like Facebook Connect, but powered by Twitter: Fast account creation, quick friend discovery and social distribution of content. In some ways, it could be better for developers and for users. In other ways, like the number of users right now or the risk of spam, not so much.

It's pretty interesting that after much rhetoric from Facebook about making everything, including mobile devices, social - it was Twitter that managed to add the social layer to the world's most widely-admired phone. The funny strategic big-picture of all this is that there's probably no chance that Facebook and Google will team up to counter this move with Facebook-enabled Android phones, due to the intense rivalry between those two companies.

I assume that Apple's experience with music social network Ping, which is inside iTunes, may have been both a clear indication that a social layer is not something Apple is very good at building in-house and a good introduction to working with Twitter. Ping included some Twitter integration but nothing close to this. It's a shame Google hasn't come to such a realization yet, but if it does and it chooses to work with Twitter too, that could really rearrange the balance of power between Twitter and Facebook.


Google Acquires Postrank: A Fork in the Road for the Future of Social Media

One of my favorite startups in the world, Postrank, has been acquired by Google. Here at ReadWriteWeb we use Postrank every day and if Google shuts it down I am going to be sick. New account creation has already been shut off and a shell of the technology is most likely to become a part of Google Analytics.

Here's what Postrank does: you plug in any RSS feed to the system and it scores each post in that feed by the relative number of comments, inbound links, mentions on Twitter, saves on Delicious and other social media metrics. Then you can subscribe to a filtered feed of just the 10% most-discussed items in any feed. It's magic, it's gold and it's all too often unappreciated. Unfortunately, the company hardly focuses on that aspect of its business anymore. This deal could go one of two ways, very good or very bad, not just for Postrank but for its users and users of the entire social Web.
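Postrank's actual formula isn't public as far as I know, but the general idea of relative scoring can be sketched in Python. In this rough illustration each metric is scaled against the busiest post in the same feed, the scaled metrics are averaged, and only the top slice is kept. The metric names, the equal weighting and the sample feed are all assumptions made for the example.

```python
import math

# Hypothetical metric names; a real service would track more signals.
METRICS = ("comments", "inbound_links", "tweets", "delicious_saves")


def engagement_scores(posts):
    """Score each post from 0 to 1 relative to the rest of its feed."""
    # Largest value seen per metric (at least 1, to avoid dividing by zero).
    peaks = {m: max(max(p.get(m, 0) for p in posts), 1) for m in METRICS}
    return {
        p["title"]: sum(p.get(m, 0) / peaks[m] for m in METRICS) / len(METRICS)
        for p in posts
    }


def hottest(posts, fraction=0.10):
    """Return titles of the top `fraction` of posts, always at least one."""
    if not posts:
        return []
    scores = engagement_scores(posts)
    keep = max(1, math.ceil(len(posts) * fraction))
    ranked = sorted(posts, key=lambda p: scores[p["title"]], reverse=True)
    return [p["title"] for p in ranked[:keep]]


# Invented feed: nine quiet posts and one that took off.
FEED = [{"title": f"Quiet post {i}", "comments": i, "tweets": i}
        for i in range(1, 10)]
FEED.append({"title": "Hot post", "comments": 40, "inbound_links": 12,
             "tweets": 90, "delicious_saves": 7})

print(hottest(FEED))
```

Scoring relative to the feed itself is what lets a small niche blog's best post stand out even though its raw numbers would be noise next to a big publisher's.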

That core value proposition of Postrank, filtering various blogs for hot posts, has been moved to the background in favor of social media analytics of a publisher's own content. That, presumably, is what Google is interested in and will become a part of Google Analytics. Google Analytics is going to become a far more important product in the future than it is today; and it's already pretty important today.

Angels and Devils in Social Media Monitoring

Postrank can be used to do two things. (A) To help you listen to a larger number of voices than you might otherwise be able to. (B) To track what you've been saying that gets repeated and discussed most often. The company was much more focused on selling B than A when it was acquired by Google.

One of those things is an incredible tool for deep and meaningful growth. The other is useful, but when deemed the only use-case worth paying for, it becomes a sick mockery of the "social" in social media.

If you believe that social media has the potential to unearth ideas, knowledge, discourse and collaboration that will help solve some of the world's great problems - then a tool that will illuminate the contours of any new voice and shine a light on its finest work, is likely of interest to you.

If that focus gets turned around into a tool for already loud voices to narcissistically optimize their own choice of words with no higher goal in mind than further amplification of themselves for profit - that's like a beautiful fairy being enslaved by the devil.

Everybody's got to pay the bills though and not very many people believe in fairies anymore.

We use Postrank here at ReadWriteWeb to find hot topics of conversation in the haystack of hundreds of niche specialist blogs on topics like geolocation, big data and education. We use Postrank to determine which of the blogs on those topics get the most traction in a given week, what a newly discovered blog's audience is most responsive to and what a blog's greatest hits have been. With blog search feeds run through Postrank, we can also see which blog mentions of a keyword have seen the most social media traction.

I once helped a recruiter search for her client company's name in Google Blogsearch. Then, we put the results through Postrank to see who wrote about the company and which items got the most traction. Those were potential recruiting targets.

I have more than twenty art and design blogs run through Postrank and then through an RSS-to-IM service that messages me when one of their articles gets particularly hot. Those are great to read and fun to share on Twitter.

I once built a blog search aggregator for Sun Microsystems' annual conference using the now Yahoo-acquired Dapper to clean up blog search feed output and Postrank to populate a "hottest posts" widget next to the "newest posts" widgets for each of 15 topics. Sun liked that project so much they flew me down to the event and let me meet musical hero and surprise event guest Neil Young. Postrank helped me meet Neil Young - whose music I've listened to during some very trying times.

I built a mobile Web app for designers a few months ago and used Postrank-filtered blog feeds to populate a topical widget of all the hottest articles in the field, along with Dribbble screenshots and Tweets from top designers.

I'm going to have a conversation with someone about a topic I don't know a lot about this afternoon, and I'm going to study up quickly by reading the most-discussed blog posts from the most-engaged-with blogs covering that topic.

I once had a dream, I don't remember if I was awake or asleep, of building OPML files of the top blogs in every country on earth, running them through Postrank to filter for the hottest topics being discussed, and giving those files to the Obama administration as a tool for international diplomacy. That was just a dream, but it wouldn't have been hard to create.

These are a few of the incredible possibilities that a service like this opens up, a service that focuses on listening to other people. Not just listening to what other people are saying about you.

If social media is reduced from a world where anyone can speak and anyone can be heard to a world where we only listen to what people are saying about us or our companies, with each voice ranked for influence score and ignored if it doesn't score high enough, then I think some of humanity's lowest instincts will have triumphed over one of our most potent opportunities to use technology to better human relations.

Please don't kill the part of Postrank, Google, that is focused on tuning our attention to the incredible Web around us.


Is Klout’s New +K Feature Extra Kreepy?

There are hundreds of millions of people on Facebook and Twitter and a lot of people want to know who is more or less influential than others online regarding various topics. Klout is one of the best-known startups in this market (PeerIndex is another) and this week Klout has added a new feature: personal endorsements of people on topics they're influential on.

Called +K, the feature is an easy way to say, with a click, that a person has influenced you on something. I just endorsed RedBull's Andrew Nystrom on the topic of Social Media on Klout, because I value his opinion on the topic. Is this a good idea? Not everyone thinks it is.

Users of the new feature are allowed to allocate up to 5 +K points each day. The higher your personal Klout score is, the more impact your endorsements will have on the scores of others.

Klout scores are the result of an extensive algorithm measuring your influence on other people online. (E.g. how likely are your Tweets to be ReTweeted?) What are these scores good for? They are interesting to see when evaluating social media users you don't know yet - and a growing list of companies and organizations now offer special deals to social media users with high Klout scores. I, for example, have been invited to go watch the movie Kung-Fu Panda for free and to attend a talk by the World Wildlife Fund's chief scientist, Dr. Eric Dinerstein. Dr. Dinerstein, it seems, has no time for losers with too little influence on the Twitter. (Though I can't find that he has any Klout score himself at all!)

Klout's been criticized, too, for quantifying human social interactions and turning a new world of egalitarian social media and democratized self-publishing into a hierarchal meritocracy. To that criticism, Klout's advocates say that measuring of such hierarchy is inevitable because of the demand for it. (Plus, what have you got against free Kung-Fu Panda??)

Adding +K to the Mix

Now there's +K personal endorsements. Is it a recipe for pandering and groveling? Or a smart way to support the people who have impacted your life and work in important ways? It might be both.

plusK.jpg

Above: Big influencer Ken Waggoner believes that Shana Ray is particularly influential on the topic of Wine. Coincidentally, Ray has also endorsed Waggoner on the topic of Wine. No data is available concerning how much expertise either thinks the other possesses on the topic of Wine Expert Identification.

Boston social media strategist Nathaniel Boyle said tonight on Twitter that he thinks the feature "will...reinforce those on top. [B]ut fair is fair." Marketer John Refford says he's "not in favor; seems wrong on many levels."

Almost all the people who posted comments on the Klout blog post about the feature, though, were very, very positive.

What do you think? Myself, I use a browser plug-in that shows me every Twitter user's Klout score next to their username around the web - but I'm still a little skeptical of it. I don't want to like Klout, but I don't want to ignore it either. Sometimes I find it useful. I've canceled social engagements with some friends when their Klout scores dropped. It had to be done, I've got a career to think of. Just kidding.

The new +K feature doesn't seem nearly as anti-social as the fundamental quantification of people's relative worth that Klout is based on. This is a company whose "Dashboard" page is really just a checklist of ways to promote Klout itself around the web. Lots of things about Klout are icky.

However, if someone I know has endorsed someone I know less well on a particular topic - that's going to mean something to me. I actually like that better than putting a number on their expertise. I feel like I can trust it more. Sure, there's personal gain to be made in endorsing someone and you can never fully know the web of interests that might motivate an endorsement - but it feels more natural and comprehensible than an algorithm measuring a person.

Maybe that will seem like an antiquated perspective soon. These personal endorsements certainly aren't gaining anywhere near the traction that Klout's automated measuring of people, which doesn't require them to actively participate, has had. There aren't very many +K endorsements on the profiles I'm finding around the site so far.

Klout's numbers may be the way of the future more than explicit human interactions like this new +K feature. I'd love to hear your opinions about this matter, dear readers. Assuming, of course, that your Klout score is high enough. Who wants to hear the opinions of someone with a low Klout score?


Look Out, Future: Ubuntu CTO Matt Zimmerman Joins Locker Project & Singly

After seven years as the Chief Technology Officer of the world's leading Linux distro Ubuntu, Matt Zimmerman announced today that he's leaving that position to join a technology project we said was "aimed directly at the future of the web" when we first wrote about it earlier this year: open source personal data locker platform The Locker Project and its corporate counterpart, Singly.

Singly was co-founded by Jeremie Miller, creator of XMPP, the open source foundation of most of the instant messaging in the world. Adding Zimmerman to the team is huge news.

Here's how we described Singly's open source work in February:

From Zimmerman's Blog Post on The Locker Project

"Today, we are creating vastly greater amounts of personal data, and it's stored in many more places. We leave our trail on the Internet in the form of activity streams, messages and content, spread across different web sites, each with their own inscrutable terms of service and (if we're lucky) their own API. These disconnected silos prevent us from using all of this information effectively.

"Meanwhile, we want--and need--to connect with each other in more ways than ever before. We need applications which can connect us, through our personal data, to the services we need.

"Singly is building the technology to make this possible. It will be designed with the deepest respect for the relationship that we have with our personal data, and with a vision for truly personal computing."


Called The Locker Project, the open source service will capture what's called exhaust data from users' activities around the web and offline via sensors, put it firmly in their own possession and then allow them to run local apps that are built to leverage their data.

Here's how The Locker Project will work. Users will be able to download the data capture and storage code and run it on their own server, or sign up for a hosted service - like WordPress.org and WordPress.com. Then the service will pull in and archive all kinds of data that the user has permission to access and store into the user's personal Locker: Tweets, photos, videos, click-stream, check-ins, data from real-world sensors like heart monitors, health records and financial records like transaction histories.

Where data extraction is made easy already by APIs or feeds, Lockers will pull it that way. Where the data is appealing and the Locker community is motivated to do so, data connectors will be built.

Searching those data archives has been a technical challenge for many other startups, but the Locker team says it is trivial for them - because they only have to build search to scale across your personal data and the data you've been given permission to access by members of your network.

Search and sharing across a user's network will be powered by Miller's eagerly-anticipated open source P2P project called Telehash, described as "a new wire protocol for exchanging JSON in a real-time and fully decentralized manner, enabling applications to connect directly and participate as servers on the edge of the network."

"So delighted to see @mdzimm joining @singlyinc http://bit.ly/krMN7a Adds another great team member to a great project!"

Is This Just a Dream?

All of this is happening in a larger context that includes:

  • A widespread understanding of the deep disruption and opportunities being presented by strategic analysis of data is emerging across global markets. "Analyzing large data sets--so called big data--will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus, as long as the right policies and enablers are in place," wrote giant consulting firm McKinsey Global Institute in a major new report this month on the topic.
  • People are coming to realize that their personal digital data may be more complex than they thought when Mark Zuckerberg did a bait-and-switch with it more than a year ago. Now the Wall St. Journal is writing fear-mongering article after article called What They Know About You and multiple arms of the US Federal Government are taking action concerning data transmission, privacy and innovation.
  • A number of startups focusing on individual ownership over data are emerging - personal data as a platform for software development, outside of the silos like Facebook or Microsoft, is an increasingly common aspiration. See Kaliya Hamlin's organization the Personal Data Ecosystem Collaborative Consortium for more examples.

A lot of people are watching The Locker Project and hoping it can succeed in creating a big new space for each of us individuals and for our free will in the data-centric future. There are other stakeholders who would have all this data used for nothing but the profit and power of the already powerful. The team assembling at Singly may be small, and the whole project may be too geeky for all but the geekiest among us, but it's shaping up to be a remarkable effort.


Big Data Giant Joins InfoChimps to Save the World’s Structured Information

Sometimes highly accomplished people just have to join crazy little startups. It's always exciting to see what happens when they do. Data scientist Kurt Bollacker is one of those people; he's decided to join Austin-based bulk data marketplace startup Infochimps, one of the most interesting little companies we regularly write about here.

Bollacker's history is intense. He helped build one of the first online search engines for academic research papers and the first prototype of the Wayback Machine at the Internet Archive, where he was the Technical Director. He was a biomedical research engineer at the Duke University Medical Center, did research on long term digital archiving as the Digital Research Director at the Long Now Foundation and was the Chief Scientist at Metaweb, the massively ambitious semantic web project that Google acquired in the Summer of 2010. Those are some of the weightiest data projects in the Internet's young history; now he's joined InfoChimps. "The project that is Infochimps is in it for the long haul," Bollacker told ReadWriteWeb. "We're going to make something of lasting value. That's something I can buy into."

InfoChimps is a small startup that provides infrastructure for people to buy and sell large sets of data. We first wrote in-depth about the company when it made a controversial move of putting 1 billion data points from months of the Twitter firehose up for sale. Twitter's legal department quickly took the edge off of what the marketplace was able to offer its customers, but its splash was made and the web suddenly knew about InfoChimps.

InfoChimps offers a wide variety of types of data, however. Among its most popular sets, the company says, is a complete downloadable set of Major League Baseball data concerning every trade, drafting, free agency and other player transaction since 1873. You can also download the raw survey data used for the Zogby International book What Arabs Think, for $999.00.

Who cares about raw data? Data scientists do, of course, but there's ample reason for the rest of us to as well. Our big picture interest was well articulated by Dr Dirk Helbing of the Swiss Federal Institute of Technology, who is leading an effort to build what's being called the Living Earth Simulator (LES), a giant simulation of as many of the earth's natural and social problems as can be simulated at once.

"Many problems we have today - including social and economic instabilities, wars, disease spreading - are related to human behavior, but there is apparently a serious lack of understanding regarding how society and the economy work. Revealing the hidden laws and processes underlying societies constitutes the most pressing scientific grand challenge of our century."

That may or may not be overstated, but the point is: data is essential in order for us to develop the full extent of self-awareness that science can offer.

Metaweb

Metaweb, where Bollacker was Chief Scientist, was a company best known for its product Freebase, which it describes as "an entity graph of people, places and things, built by a community that loves open data." Founded by Danny Hillis, a computer scientist whose name is usually said in hushed tones, Metaweb raised nearly $60 million to build its giant structured semantic graph.

Metaweb was acquired this summer for an undisclosed sum, and parts of the Freebase technology have turned into Google Refine, "a power tool for working with messy data."

"At large scale there are classes of applications you can build that you can't do with 50 items in a data set, but with 50 million or 50 billion items," Bollacker explains. "Statistics, searches to find patterns, etc.

"I have no illusions that in 20 years, Google will still be paying to keep Freebase online as a service. I have an interest in making sure these bulk data sets stay alive. I think Infochimps has part of a model that could help that happen.

"One of the things I've learned is data that is loved tends to survive. I think the Freebase data is underloved. I think we can build extracts out of Freebase. They publish regular dumps. We're going to grab sections of those dumps, make them better indexed, better labeled and better described."

Bollacker received a Ph.D. in Computer Engineering from The University of Texas at Austin, and it was on his trips back to Austin that he met Infochimps CTO Flip Kromer, a Cornell-educated Mechanical Engineer, University of Texas physics education specialist and super-geek.

"The knowledge and experience is a huge known quantity," Kromer says of Bollacker's joining the company. "I got into this to build out the open data part of it. The best way to build the open data commons for the world is to do it within the context of a mixed open and commercial thing that makes everybody smarter. We're building out the commercial part, that's what we have to focus on. With Kurt on board, I have no fears that we're ever going to lose our soul. We won't lose sight over the central mission of making everybody smarter."


On Facebook, Angry People Are More Popular (Plus Other Fascinating Statistical Correlations)

Facebook's data team proved once again today that when you analyze a large set of anonymous user data from the world's biggest social network, you can learn some very interesting things about the state of humanity.

In a blog post today titled What's on your mind?, the company disclosed the results of its text analysis of 1 million anonymized messages. Among the findings: Young people swear more than older people, and older people talk about other people more than just themselves. Popular people are more likely to talk about other people, TV, movies, swear and use religious words. Less popular people are more likely to talk about work, sleeping, eating and thinking. These are but a few of the many observations made by the in-house data team. The biggest question about the data remains unanswered, though: what could a world of independent researchers discover in this data?

FBwordgraphs-1.jpg

Above: Facebook found that the words at the top of the left chart appeared more often in profiles of older people, and the words at the top of the right chart in profiles of more popular people. The company's blog post contains five more graphs concerning other word correlations.

For Facebook to make bulk, anonymized data available to independent researchers has long been a hope of mine and I've argued about how important an opportunity this is all the way up to Mark Zuckerberg himself.

My favorite example of how data like this can be important is from history. When US census data and bank home loan data were both made available for computer analysis and cross referencing for the first time, independent researchers unearthed a pattern of discrimination against African American families seeking to buy homes in big sections of major US cities. This practice was called Real Estate Red Lining and it was exposed thanks to aggregate data analysis. I am of the belief that social injustices of comparable significance, as well as opportunities for significant economic development, could be discovered in the patterns hidden across millions of Facebook status updates, friend connections, Likes and more.

It's great that Facebook is investing some of its resources into analyzing this data itself, but a great opportunity is lost if the company fails to allow outside researchers to analyze this data as well.

Oliver Chiang, at Forbes, agreed with my argument in an article this month: "But really, what Facebook should do...is open up its data for research. Because they don't, we get highly sanitized findings (like these top trends, or the finding that being active on Facebook leads to increased happiness), and even, reportedly, a black market for Facebook data. The company collects the thoughts, images and content of more than half a billion users -- that data could be used for good."

Slate.com's Michael Agger wrote last month in an article discussing the opportunities latent in Facebook's data: "It would be helpful for transportation planners to know the places where people complain the most about traffic. Educators could see the data and sentiment analysis around how a community feels about its local schools."

Bernardo Huberman, a social technology researcher at HP Labs who was able to gain access to bulk Facebook data years ago, before the site was as large, controversial and armed with lawyers as it is today, is both understanding and hopeful.

"This data is amazingly important from a commercial point of view," Huberman told me in a telephone interview last week.

"But [Zuckerberg], he's not a researcher, he's just a businessman. I have a feeling that Twitter's situation is roughly the same; all this research stuff and so on is gravy. [In recent years] I've had very little traction in terms of getting access to their data. They are busy with other things, with keeping their business viable.

"They have a different view of it. Perhaps in a few years, Zuckerberg will relax and say 'I want to be the kind of public figure that wants to release data'....but right now I don't think that will motivate these people."

I hope that's not correct. I hope that every time the Facebook Data Team performs another batch of analysis on anonymized, bulk Facebook data and gives us an opportunity to look into our own souls - the potential that lies untapped in that data will be taken all the more seriously. That potential will never be realized if analysis of it is limited to the eyes, minds, interests, skills and perspectives of the company's own researchers.
