“Scientia potentia est”. “Knowledge is power”.
For centuries this has been an axiom across all cultures, yet never before has it so truly manifested itself as it has in what we are calling big data today. Through the effective aggregation and management of information, advanced analytics and predictive modeling have driven unprecedented strides in science, healthcare, technology, business, and government. Big data steadily charges forward carrying the banner of innovation, but like any momentous undertaking, it too has its problems. It seems that for every article heralding big data as the lodestar of progress, a countering article is written warning of its wantonness. The term is now ubiquitous in both discussion and writing, almost ad nauseam. To be perfectly (and maybe pleasingly) forward, this will not be another conceptual or theoretical piece on big data and its potential for good or bad; rather, consider this simply to be a litany of the major faults and concerns voiced about the topic since its birth. Consider this a conspectus. Consider this an airing of the grievances.
Data and Analytical Inaccuracy
I used to imagine big data as being the fuel that drove an organization down the track to new and exciting insights. Just put it in the tank and go. I now believe, however, that big data would be better depicted as the thousands of different gas stations an organization fills up at to get down that same track - different sources, different octanes, different grades, different qualities, some not always compatible with others, some not right for the vehicle itself. There are so many kinds of information, modes of inputting, and ways to digest and analyze it that, in combination, they can lead you to sputter or drive astray. The risk of data and analytical inaccuracy lies within each of the multiple steps involved in information ingestion and interpretation.
People are human and data entry errors are inevitable. Mistyping, inverting numbers, selecting incorrect menu items in drop-downs, populating fields with unassociated data, labeling improperly, and failing to completely fill out forms can all undermine data veracity by introducing uncertain or imprecise information. Beyond inputting, when collecting data from many sources, terminology standardization can be rather difficult even when the collected information relates to the same things with the same attributes. Common data elements may be called by different names and different data elements may be classified the same. Consider electronic health records (“EHRs”) for example, where multiple terms may be used to describe a physician’s role in caring for a patient: attending physician, primary care provider, consultant, surgeon, specialist, etc. Even if all data is input correctly and is perfectly standardized across all sources, software defects driven by faulty program source code or design can negatively affect data quality. Furthermore, if the various information sources are not using interoperable systems, records can be fragmented and data could be displaced. Keep in mind that for every data set added, the likelihood of the aforementioned risks increases. And practically speaking, even if there’s full confidence in the integrity of the information gathered, analysis can fall flat on its face if the algorithms and formulas used are broken or unaligned with the endgame. When outlined like this, the pursuit of big data really seems to be more like a minefield.
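To make the terminology problem concrete, here is a minimal sketch of what standardizing the physician role field from the EHR example might look like. The mapping table, the field name `role`, and the function names are all hypothetical and chosen for illustration; a real integration would lean on an agreed clinical vocabulary rather than a hand-written dictionary.

```python
# Hypothetical mapping from free-text role labels to canonical ones.
# The canonical names are assumptions, not part of any EHR standard.
CANONICAL_ROLES = {
    "attending physician": "attending",
    "attending": "attending",
    "primary care provider": "primary_care",
    "pcp": "primary_care",
    "consultant": "consultant",
    "surgeon": "surgeon",
    "specialist": "specialist",
}


def normalize_role(raw):
    """Return a canonical role label, or None if the term is unrecognized."""
    # Lowercase and collapse repeated whitespace before the lookup.
    key = " ".join(raw.strip().lower().split())
    return CANONICAL_ROLES.get(key)


def standardize(records):
    """Split records into standardized ones and ones needing manual review."""
    clean, review = [], []
    for rec in records:
        role = normalize_role(rec.get("role") or "")
        if role is None:
            # Unknown or missing terms are set aside rather than guessed at.
            review.append(rec)
        else:
            clean.append({**rec, "role": role})
    return clean, review
```

Routing unrecognized terms to a review pile, instead of forcing them into the nearest category, keeps one source's vocabulary from silently corrupting the combined data set.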
Apart from the struggle to collect sound data, controlling a wealth of information can be quite a headache as well. I once heard a large organization refer to their data management framework (or lack thereof) as spaghetti soup, where they were hopelessly trying to sift through and make meaning of their own poorly structured goulash of accumulated information. The situation I refer to was attributed to the most common issues in handling data, these being:
- Employing siloed big data strategies that are blind to the interdependencies across business functions
- Neglecting to invest in technology that allows for large-scale, highly adaptive, fault-tolerant storage and processing and substantiated analytic decisioning
- Lacking the data scientists, architects, and supporting tactical players needed for mobilization
- Failing to earn leadership’s endorsement and ultimately the enterprise’s embrace
The Mosaic Effect and the Gravity of Breaches
The “mosaic effect”, a term coined at the turn of the century and later spotlighted in the White House’s Open Data Policy, defines the challenge big data presents to privacy. Best articulated by Adam Mazmanian of FCW, the mosaic effect occurs when “disparate threads [of data] can be pieced together in a way that yields information that is supposed to be private”. As more and more organizations and agencies ride on the big data bandwagon, gross amounts of information continue to be stashed (often without reason). As more data is discoverable and machine readable, it is likelier that individual datasets, discreet when in isolation, can be combined with other available and public information to identify or expose data subjects, even debilitating the best anonymization and data masking efforts. Moreover, the more information is collected, the worse breaches are going to be. Stockpiling data makes you a larger target for hacktivity, and when incidents occur, correspondingly more data subjects are laid open to harm, those data subjects have more to lose, legal ramifications are generally heavier, and organizations are ultimately left scraping more tarnish off of their reputations.
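The mechanics of the mosaic effect can be sketched in a few lines. The example below assumes two hypothetical datasets: an "anonymized" one with names removed but ZIP code, birth year, and sex left in, and a public one that carries names alongside those same three fields. The field names, the choice of those three fields, and the function name are illustrative assumptions only.

```python
from collections import defaultdict

# Fields that are harmless alone but identifying in combination (assumed).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def reidentify(anonymized, public):
    """Link nameless rows to names wherever the shared fields match exactly one person."""
    # Index the public dataset by its combination of quasi-identifiers.
    index = defaultdict(list)
    for person in public:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        index[key].append(person["name"])

    matches = []
    for row in anonymized:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        # A single candidate means the "anonymous" row now has a name.
        if len(names) == 1:
            matches.append((names[0], row["diagnosis"]))
    return matches
```

Neither dataset exposes anything sensitive about a named person on its own; the exposure comes only from the join, which is why masking names alone tends not to hold up once enough companion data is in circulation.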
Unethical Data Interpretation
Prejudice has long sullied our history, but unlike the lurid Bull Connor streets of Birmingham in the 1960s, there is now serious concern over the subtilization and elaboration of discrimination in this age. The information in reach today allows for the most exorbitant and covert profiling imaginable. The protected classes of American federal law are now more than ever vulnerable to underhanded assault at a lancer’s distance. When discussing the topic, I always think of Andrew Niccol’s film Gattaca - a movie set in the “not so distant future” that explores the unavoidable inequities of eugenics. The following excerpt summarizes it best:
“My father was right. It didn't matter how much I lied on my resume. My real resume was in my cells. Why should anybody invest all that money to train me when there were a thousand other applicants with a far cleaner profile? Of course, it's illegal to discriminate, 'genoism' it's called. But no one takes the law seriously. If you refuse to disclose, they can always take a sample from a door handle or a handshake, even the saliva on your application form. If in doubt, a legal drug test can just as easily become an illegal peek at your future in the company.”
- Vincent Freeman (Ethan Hawke), Gattaca
The concept of discrimination based on genetic predisposition was once thought of as only the makings of science fiction, but in time it became a reality, perpetrated by employers and insurers and in turn countered by the Genetic Information Nondiscrimination Act (GINA) of 2008. Like these past transgressions, the injustices of the future will come in unfathomable forms and will be greased by advancement and the accessibility of data.
Beyond outright discrimination, what is more confounding is the (often accidental) unethical interpretation of big data, particularly as it relates to marketing. Where do the benefits of segmentation and personalization end and disparate impact begin? If a certain race, gender, or age is statistically more or less profitable or costly than another, will the demographic segments you choose not to cater to be adversely affected to the point of substantiating discrimination? Or is that just good business? As it stands now, the law is quite unclear, and navigating through the gray may prove to be a long and trying journey.
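One way to at least see where a campaign sits in that gray area is to compare how often each group receives an offer. The sketch below borrows the "four-fifths" heuristic from U.S. employment selection guidelines, under which a group whose rate falls below 80% of the highest group's rate draws scrutiny. Applying it to marketing is purely an assumption made here for illustration, not an established legal test, and the function names and input shapes are hypothetical.

```python
def impact_ratios(offers, audience):
    """Return each group's offer rate relative to the best-served group."""
    # Offer rate per group; groups with no audience are skipped.
    rates = {
        group: offers.get(group, 0) / size
        for group, size in audience.items()
        if size > 0
    }
    if not rates:
        return {}
    top = max(rates.values())
    # If nobody received an offer, every ratio is reported as 0.0.
    return {g: (r / top if top > 0 else 0.0) for g, r in rates.items()}


def flag_groups(ratios, threshold=0.8):
    """List groups whose ratio falls below the threshold (0.8 is an assumed cutoff)."""
    return sorted(g for g, r in ratios.items() if r < threshold)
```

A flagged group is a prompt to ask why the gap exists, not a verdict; the number says nothing about whether the segmentation behind it has a legitimate business basis.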
Big data has undeniably armed organizations with the ability to make better decisions faster, reduce costs, and develop new products and services; however, it also seems to have fathered degrees of trepidation. The amassing of information is inevitable both micro- and macrocosmically, but in light of the associated potential for blunder and societal detriment, parameters, safeguards, and counterbalances need to be affixed fundamentally. Furthermore, beyond the fray, we need to be farsighted enough to prepare ourselves for the tomorrow that is on the doorstep.
About the Author
A manager with Schellman, Zach Schmitt has a concentration in IT security and privacy in the Washington D.C. attestation and compliance practice. He is a member of the International Association of Privacy Professionals (IAPP) and endeavors to share his observations and feelings on the certain evolution of data security and privacy. Zach is a graduate of Virginia Tech and has BAs in Accounting & Information Systems and Marketing Management.