SOURCES OF BIG DATA
- Social network profiles—Tapping user profiles from Facebook, LinkedIn, Yahoo, Google, and specific-interest social or travel sites, to cull individuals’ profiles and demographic information, and extend that to capture their hopefully-like-minded networks
- Social influencers—Editor, analyst and subject-matter expert blog comments, user forums, Twitter & Facebook “likes,” Yelp-style catalog and review sites, and other review-centric sites like Apple’s App Store, Amazon, ZDNet, etc.
- Activity-generated data—Computer and mobile device log files, aka “The Internet of Things.” This category includes web site tracking information, application logs, and sensor data – such as check-ins and other location tracking – among other machine-generated content.
- Software as a Service (SaaS) and cloud applications—Systems like Salesforce.com, NetSuite, SuccessFactors, etc. all represent data that’s already in the Cloud but is difficult to move and merge with internal data. (Distributed data integration technology, in-memory caching technology and API integration work may be appropriate here.)
- Public—Microsoft Azure MarketPlace/DataMarket, The World Bank, SEC/Edgar, Wikipedia, IMDb, etc. – data that is publicly available on the Web and may enhance the types of analysis that can be performed. (Use the same types of parsing, usage, search and categorization techniques as for the previously mentioned sources.)
- Hadoop MapReduce application results—The next-generation technology architectures for handling and parallel parsing of data from logs, Web posts, etc., promise to create a new generation of pre- and post-processed data. We foresee a ton of new products that will address application use cases for any kind of Big Data – just look at the partner lists of Cloudera and Hortonworks. In fact, we won’t be surprised if layers of MapReduce applications blending everything mentioned above (consolidating, “reducing” and aggregating Big Data in a layered or hierarchical approach) become their own “Big Data”.
- Data warehouse appliances—Teradata, IBM Netezza, EMC Greenplum, etc. collect internal, transactional data from operational systems, and that data is already prepared for analysis. These will likely become an integration target that will assist in enhancing the parsed and reduced results from your Big Data installation.
- Columnar/NoSQL data sources—MongoDB, Cassandra, InfoBright, etc. – examples of a new type of MapReduce repository and data aggregator. These are specialty applications that fill gaps in Hadoop-based environments, for example Cassandra’s use in collecting large volumes of real-time, distributed data.
- Network and in-stream monitoring technologies—Packet evaluation and distributed query processing-like applications, as well as email parsers, are also areas likely to explode with new startup technologies.
- Legacy documents—Archives of statements, insurance forms, medical records and customer correspondence are still an untapped resource. (Many archives are full of old PDF documents and print stream files that contain the original and only systems of record between organizations and their customers. Parsing this semi-structured legacy content can be challenging without specialty tools like Xenos.)
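
To make the layered MapReduce idea from the Hadoop bullet a bit more concrete, here is a minimal sketch in plain Python (not Hadoop itself). Everything in it is an assumption made for illustration: the log format, the sample lines, and the helper names `map_line` and `reduce_counts` are hypothetical. It shows a first map/reduce pass that turns raw log lines into hits per day and page, then a second pass that consumes the first pass's output and reduces it further to hits per page.

```python
from collections import Counter

# Hypothetical log lines in the form "<timestamp> <user> <page>".
LOG_LINES = [
    "2013-01-01T10:00 alice /home",
    "2013-01-01T10:01 bob /home",
    "2013-01-01T10:02 alice /pricing",
    "2013-01-02T09:30 carol /home",
    "2013-01-02T09:31 alice /home",
]


def map_line(line):
    """Map step: emit a ((day, page), 1) pair for one log line."""
    timestamp, _user, page = line.split()
    day = timestamp.split("T")[0]
    return ((day, page), 1)


def reduce_counts(pairs):
    """Reduce step: sum the values that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


# First layer: raw log lines -> hits per (day, page).
daily_page_hits = reduce_counts(map_line(line) for line in LOG_LINES)

# Second layer: the first layer's output is re-keyed by page only
# and reduced again, giving hits per page across all days.
page_hits = reduce_counts(
    (page, count) for (_day, page), count in daily_page_hits.items()
)

print(dict(daily_page_hits))
print(dict(page_hits))
```

The output of the first pass is itself a data set that the second pass treats as input, which is the sense in which layered MapReduce results can become their own "Big Data". A real Hadoop job would distribute the map and reduce steps across a cluster; this sketch runs them in a single process.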