The Master-Data Scientist

Tamer Khraisha
14 min read · Aug 4, 2018

By Tamer Khraisha, Ph.D. in Network Science

In this blog post, I will try to give an overview of what data science is and discuss its use in industry and academia. The main reason for writing this post is that I noticed that many discussions about data science revolve around the skills a data scientist needs to have, without actually asking why we need data science and where it is actually being employed. Additionally, I would like to point out that the skills usually listed as requirements for a data scientist are rarely found in one person. In fact, I named this post "The Master-Data Scientist" after the famous quote by the economist John Maynard Keynes (I am an economist by background) describing the skills needed to become a great economist:

The master-economist must possess a rare combination of gifts …. He must be mathematician, historian, statesman, philosopher — in some degree. He must understand symbols and speak in words. He must contemplate the particular, in terms of the general, and touch abstract and concrete in the same flight of thought. He must study the present in the light of the past for the purposes of the future. No part of man’s nature or his institutions must be entirely outside his regard ….

When one looks around for articles and blog posts about data science, it seems that people describe the data scientist as a person who possesses deep knowledge of statistics, mathematics, software engineering, econometrics, machine learning, data structures, cloud computing, the scientific method, programming in several languages, and a deep background in diverse areas of social science. Although some people can be quite good at many of these skills, these are specializations that are very hard to master all together with deep expertise (the division of labor is, after all, credited as one of the main drivers of growth in the capitalist system). I would argue that we need to focus on the demand side for data science in order to understand what data science is and why and where we need it. In the next section I will start by trying to identify a reasonable definition or characterization of data science.

What is Data Science?

Data Scientist: Person who is better at statistics than any software engineer and better at software engineering than any statistician.

- Josh Wills

A general and accepted definition of data science does not exist. People in different contexts provide different definitions, often tailored to specific areas of interest. Without entering into the details of such a discussion, a minimal definition of data science may involve four main features:

i) The presence of big and/or complex data, where big relates to volume (more observations), velocity (a high frequency of new observations), and variety (new variables about the same phenomena, or new data formats like images or videos), and complex concerns non-conventional data formats such as key-value, document, text, columnar, and graph formats.

ii) The need for advanced techniques for storing, manipulating, and accessing data.

iii) The use of the scientific method to ask a research question or identify a problem and formulate hypotheses about it.

iv) The use of advanced statistical and algorithmic methods to answer the research question for purposes like prediction, classification, or understanding in general.

Historically, the term data science appeared in several places and contexts. In 1966, Peter Naur proposed introducing the term datalogy into computer science to denote "the science of the nature and use of data". Later on, people from the field of statistics used the term data science as a possible replacement for the term statistics. In 1997, C. F. Jeff Wu gave the first modern, non-computer-science presentation titled "Statistics = Data Science?", in which he proposed that statistics be renamed data science and statisticians data scientists. In the following years, statisticians realized that data science could extend beyond the field of statistics to incorporate advances in data computing techniques. For this reason, data science, although not independent from statistics, needs to be differentiated from it (this paper offers a good review of the differences between statistics and data science).

Data science differs from statistics in aspects like scalability. Running a linear regression on one dataset can be done by a statistician, but at large scale, where one wants to run thousands of linear regressions or train a model using parallel, distributed frameworks like MapReduce, a new set of skills is needed. Storing and processing data also require skills that statisticians usually don't have (often referred to as building data pipelines). Even data visualization, an essential part of data science, has become more challenging for statisticians. For example, there is now demand for communicating complex data-driven concepts on the web, and one of the most popular tools for this purpose is the D3.js library. However, building a D3 visualization requires a good understanding of HTML, CSS, SVG, the DOM, JavaScript, geometry, color spaces, and data structures.
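To make the scalability point concrete, here is a minimal sketch (not a production pipeline) of fitting many independent linear regressions in parallel with Python. The dataset, column names, and the choice of joblib are my own assumptions for illustration:

```python
# A minimal sketch: fit one linear regression per group in parallel.
# Assumes a hypothetical DataFrame `df` with columns "group", "x", "y".
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def fit_one(group_df):
    # Ordinary least squares for a single group.
    X = np.column_stack([np.ones(len(group_df)), group_df["x"].to_numpy()])
    y = group_df["y"].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, slope]

# Fake data standing in for thousands of groups.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(np.arange(1000), 50),
    "x": rng.normal(size=50_000),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=len(df))

# Fit all regressions across CPU cores.
results = Parallel(n_jobs=-1)(
    delayed(fit_one)(g) for _, g in df.groupby("group")
)
print(len(results), results[0])
```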

Besides statistics, data scientists are supposed to possess computational skills, especially when the data is large and complex and resources such as time and memory are limited. Computational methods like optimization, agent-based simulation, sampling, bootstrapping, Markov Chain Monte Carlo (MCMC), and evolutionary computation are very efficient tools that data scientists can use with large and complex data.
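As one small illustration of these methods, here is a hedged sketch of the bootstrap: resampling a dataset with replacement to estimate the uncertainty of a statistic. The toy data and the number of resamples are arbitrary choices for the example:

```python
# A minimal bootstrap sketch: estimate a 95% confidence interval for the mean.
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=3.0, size=500)  # toy, skewed sample

n_resamples = 10_000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Resample with replacement and record the statistic of interest.
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = resample.mean()

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```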

Finally, and perhaps most importantly, data scientists must possess a good sense of how to put a human face on the data analysis process. As the authors of this paper argued, data science is an iterative process that involves "understanding a problem domain, deciding which data to acquire and how to process it, exploring and visualizing the data, selecting appropriate statistical models and computational methods, and communicating the results of the analyses. These skills are not usually taught in the traditional statistics or computer science classroom but instead, are gained through experience and collaboration with others." This implies that data science extends beyond big data analysis. These days data is becoming available to almost everybody. The more data is available to everybody at the same time, the more the advantage shifts to those with superior means of interpretation and communication.

To conclude this section, I would like to emphasize that using data science techniques to analyze big and complex data demands, more than ever, particular care in interpretation, since people have a tendency to uncritically equate big and new with true. "Big data + new data techniques" does not equal more objective and valid results! For example, a recent study which caused a media uproar claimed that deep neural networks can infer sexual orientation from facial images. Although the study was meant to show that machines are simply better than humans at certain tasks, its results have been criticized, since humans can be far too complex and unpredictable for their sexual orientation to be inferred simply using deep learning.

Data Science: Academia vs. Industry

Scientists and professionals in both academia and industry make use of the concepts and tools of data science; however, data science can serve different purposes in academia as compared to industry. In academia, the general goal is to understand the real world! In industry, while not directly declared by firms, the goal is rather to generate economic gain. Data science in academia is mostly intended to assist in answering old or new scientific questions for which big and complex data is available and advanced data techniques are needed. To illustrate the point, consider the following example: in 1689, the great philosopher John Locke published his famous A Letter Concerning Toleration, where he asked the question 'How do we tolerate each other?' In his book, Locke didn't use any data but proposed a central question that is still relevant to this day. Before big data, such "how" questions were mostly addressed with qualitative analysis. With the advent of big data, data scientists in academia are able to study the old questions by actually measuring the phenomenon being studied. For example, in a paper published in 2010 in PLoS ONE, the authors used a large dataset of 3.2 million Wikipedia articles to study how people tolerate each other in an online context. The reason I give this example is to show that data science in academia is not supposed to replace the scientific tradition or propose a new goal for academia; rather, it is an advanced set of tools and methods needed to answer the relevant questions (social, economic, political) for which big and complex data is available. The final outcome of academia is assumed to be policy recommendations for improving the well-being of humanity (hopefully!).

In industry, the goal of data science is to seize the opportunity presented by the increasing availability of data about customers, about firms themselves, and from external sources in order to achieve better performance (profits, growth, technological innovation, social responsibility, and so on). With the increasing use of technology and smart devices, humans are increasingly storing pieces of their knowledge, including thoughts, things they like, and emotions, in digitized form. For companies, this information is exactly what is needed to identify (and perhaps influence) consumer preferences, which is key to making profits. The outcome of data science in industry usually takes the form of a data product: a useful thing made from data that customers can use to solve certain problems. The data scientist here assumes the role of turning all available data into stories that are easy to understand for people who are not experts in data. Thousands of the services we use every day are actually data products. Google Search, Google Maps, YouTube video recommendations, Facebook friendship recommendations, and the broken yet helpful text completion on mobile keyboards are all examples of data products.

This doesn't mean, however, that academic and industry data science can't serve each other. During my PhD, my main research focus has been technological innovation. Academics have done a substantial amount of research on how to measure technological innovation, how to classify innovations, and how to measure the technological distance between innovations. In most cases, patent data is used as the most comprehensive source of information about the technological landscape. Patent data is massive and complicated, and requires data science tools like natural language processing, dimensionality reduction, distance measures, and clustering algorithms. Several industry applications came out of such research lines. For example, a team of researchers created a company called PatentVector which provides users (mostly innovating firms) with an interface to discover, analyze, and compare patents. On the other hand, some companies provide data science and computing services to academia. For example, at my current job at Alphacruncher, we provide academic researchers with an advanced cloud-based platform for conducting scientific research.
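As a toy illustration of how such patent analysis might look (a simplified assumption of my own, not how PatentVector actually works), the following sketch measures pairwise similarity between patent abstracts with TF-IDF and cosine similarity, then clusters them:

```python
# Minimal sketch: text-based "technological distance" between patent abstracts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

abstracts = [  # hypothetical patent abstracts
    "A lithium-ion battery electrode with silicon nanoparticles.",
    "Method for fast charging of lithium battery cells.",
    "Convolutional neural network for detecting objects in images.",
    "Image segmentation using deep neural networks.",
]

# Represent each abstract as a TF-IDF vector.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(abstracts)

# Cosine similarity: 1 = identical vocabulary, 0 = no overlap.
similarity = cosine_similarity(X)
print(similarity.round(2))

# Group patents into broad technology clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```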

Mastering Data Science

Many people ask what to do in order to become a data scientist. It is not clear whether there is a recipe, since the skills and knowledge of a good data scientist are likely context- and situation-specific. I would recommend instead focusing on understanding the demand side and the nature of data science. To this end, I offer the following advice:

1- Understand why and when you need data science and data scientists

Whether you are in industry or in academia, a decision maker or a job seeker, before deciding to invest in data science it would be useful to evaluate why and when you really need such an investment. It would be a mistake to say "there is much data out there, so let's extract as much meaning out of it as possible" or to be swayed by the fact that data science has been called "the sexiest job of the 21st century". Is your problem really that challenging? It might be that all you need is exploratory data analysis rather than data science. In other cases, the data at hand might not be very complex (like business information), so business intelligence specialists might be able to do the data analysis job perfectly well. Data science is not business intelligence: business intelligence is mostly about domain experience and does not necessarily rely on data-driven analysis and the related computational and statistical methods.

One should also not be seduced by big data. In certain cases, the effective number of data points needed is quite small. For example, many real-world phenomena exhibit a property known as scaling (or a long tail), which means that a few things are very frequent while most things are quite rare. For instance, words like 'the' and 'where' are very common, but a word like 'bibliopole' is very rare. One consequence of scaling is that, to learn about the phenomenon at hand, one may only need a small sample of the dataset, even if the data is quite large.
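To make the long-tail idea concrete, here is a small sketch using a Zipf-like distribution (a toy assumption of my own): a modest sample already covers the handful of items that account for most of the observations:

```python
# Sketch: long-tailed (Zipf-like) data, where a few items dominate.
import numpy as np

rng = np.random.default_rng(1)
population = rng.zipf(a=2.0, size=1_000_000)  # item ids, heavily skewed

# Share of all observations taken up by the 10 most frequent items.
ids, counts = np.unique(population, return_counts=True)
top10_share = np.sort(counts)[-10:].sum() / counts.sum()
print(f"top-10 items cover {top10_share:.1%} of the full data")

# A small sample already recovers most of those dominant items.
sample = rng.choice(population, size=5_000, replace=False)
s_ids, s_counts = np.unique(sample, return_counts=True)
top10_sample = set(s_ids[np.argsort(s_counts)[-10:]])
top10_full = set(ids[np.argsort(counts)[-10:]])
print(f"overlap of top-10 items: {len(top10_sample & top10_full)}/10")
```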

Additionally, one should not get excited about techniques like deep learning or Bayesian networks just because they sound fancy; what for? If one works in speech or image recognition, these techniques are highly valuable, but for other problems there may be no need at all for such complex techniques. A good data scientist should recognize when to use one technique rather than another. In some cases, a simple linear regression performs much better than a deep neural network.
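The following sketch (a contrived example of my own, with arbitrary model settings) shows one way to check this claim on a given dataset: compare a plain linear regression against a small neural network with cross-validation and keep whichever scores better:

```python
# Sketch: sanity-check whether a neural network actually beats linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# Toy data with a mostly linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=1.0, size=300)

models = {
    "linear regression": LinearRegression(),
    "neural network": MLPRegressor(hidden_layer_sizes=(64, 64),
                                   max_iter=2000, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```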

In corporate contexts, data science can serve multiple purposes depending on the growth stage of the company and the data being analysed. A startup might not have many clients at the beginning, so most of the attention is likely to be directed toward infrastructure design, data engineering, and marketing; you don't need a data scientist yet. However, once the startup grows and builds a wide client base, the management team might decide to hire data scientists to build, say, a recommender system for their products.

2- Start with a question or a defined problem (if there is any)

Data scientists need to start by asking a research question or identifying a business problem to solve using complex and/or big data. In academia, identifying the research question is mostly a matter of taste. One can have all the technical skills in the world, but what might be more important is the ability to ask an interesting question and formulate the right hypothesis. There is no recipe for identifying an interesting question, so data scientists must develop an intuition for what makes a question interesting. The research question may also emerge gradually as the data scientist iteratively loops through the data science process.

I do not pretend to start with precise questions. I do not think you can start with anything precise. You have to achieve such precision as you can, as you go along.

- Bertrand Russell

While in academia research questions tend to fall within certain scientific paradigms and build on existing literature, in industry most research questions are developed in response to business problems, new opportunities, or competitive pressure. The research-question part is one of the main features that distinguishes a data scientist from statisticians or programmers, who may lack the skills needed to ask questions about the data and formulate hypotheses about it.

3- Choose the right scientific approach to the data at hand and be aware of its shortcomings

A challenge that might arise with data science concerns the scientific method used in the analysis. In academia, one can always try to answer one of the big questions: What drives economic growth? Are financial markets efficient? What is the best educational system? On the other hand, one might decide not to rely on theoretical questions and instead adopt a purely data-driven approach where no well-defined questions are asked at the beginning. The data-driven method is an emerging trend both in academia and in industry. Tony Hey classified the scientific methods developed so far into four categories: empirical observation and experimentation, analytical or theoretical approaches, computational science or simulation, and finally the data-intensive approach, which was called the 'Fourth Paradigm'. It is believed that data science is gaining its importance within this fourth paradigm of data-intensive research. This does not mean, however, that data scientists cannot combine data-driven methods with computational or empirical ones. Although the data-driven approach can be a powerful tool in the big data era, some concerns might arise as researchers (especially in academia) fall into the trap of focusing on pattern recognition without relating it to the relevant research questions. This point was made clear in this editorial article in Research Policy:

Pessimists allege that Big Data may bring an end to social science research. One fear is that scholars will focus on pattern recognition rather than developing theory or engaging in hypothesis driven empirical research. As it becomes easier to manipulate large numbers of records it is seductive to keep collecting more and more observations, matching ever more and more diverse sources — the potential is unlimited. Resources may be diverted to never-ending data projects rather than focusing on questions that are answerable with currently available data. Moreover, with a sufficiently large sample it is simply easier to find associations and make dubious claims. Another worry is that rather than focusing on interesting questions researchers will limit their inquiry to questions they are able to examine rather than consider the more socially relevant questions, becoming like the proverbial drunk who seeks their car keys under the lamp post because it is easiest to look there.

For this reason, before investing in any data science project, an essential task is to identify what scientific approach to adopt for the problem at hand and to be aware of its limitations.

4- Structure your data science project and organize the data science team

Once a research question or a business problem is identified and the data is structured, one needs to consider the logistics of conducting the data analysis. First, data scientists are not queen bees; they have to collaborate and communicate with other members such as data engineers, cloud experts, domain experts, software developers, and decision makers. Second, a data scientist must have a good intuition of how much time to spend on tasks like data collection, data cleaning, data normalization, setting up the servers for computing, feature engineering, feature selection, and so on. A survey by CrowdFlower revealed that data scientists spend most of their time (around 60%) cleaning and organizing data.
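To give a sense of what that cleaning and organizing typically involves, here is a minimal pandas sketch; the file name, column names, and cleaning rules are hypothetical:

```python
# Sketch of routine data cleaning steps with pandas (hypothetical dataset).
import pandas as pd

df = pd.read_csv("customers.csv")  # assumed raw export

# Drop exact duplicate rows and rows missing the key identifier.
df = df.drop_duplicates()
df = df.dropna(subset=["customer_id"])

# Normalize obvious formatting problems.
df["country"] = df["country"].str.strip().str.upper()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Fill remaining missing numeric values with a simple, documented rule.
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

print(df.info())
```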

In machine learning projects, data scientists begin by structuring their machine learning strategy (in some cases people use the term machine learning pipeline). In one of his lectures, the famous computer scientist Andrew Ng offered an interesting example to illustrate how to structure a machine learning strategy. The example concerns the use of a neural network to build a computer vision system for detecting cats in pictures. Andrew showed that if the resulting neural network is not accurate enough, a machine learning strategy might consist of the following options (a small sketch of a couple of them follows the list):

  • Get more data: Collect more pictures of cats.
  • Collect a more diverse training set. For example, pictures of cats in unusual positions; cats with unusual coloration; pictures shot with a variety of camera settings; ….
  • Train the algorithm longer, by running more gradient descent iterations.
  • Try a bigger neural network, with more layers/hidden units/parameters.
  • Try a smaller neural network.
  • Try adding regularization (such as L2 regularization).
  • Change the neural network architecture (activation function, number of hidden units, etc.)
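
As a minimal sketch of a couple of these options (using scikit-learn rather than Andrew Ng's original setup, and with arbitrary hyperparameters), here is how one might compare a baseline network against a bigger and a more strongly regularized variant:

```python
# Sketch: trying a bigger network and stronger L2 regularization (alpha).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the "cat vs. not cat" image features.
X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

variants = {
    "baseline":        MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    "bigger network":  MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=1000, random_state=0),
    "more L2 penalty": MLPClassifier(hidden_layer_sizes=(32,), alpha=1e-2, max_iter=1000, random_state=0),
}
for name, clf in variants.items():
    acc = cross_val_score(clf, X, y, cv=3).mean()
    print(f"{name}: accuracy = {acc:.3f}")
```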

Because data science is a broad and technical field, having a strategy for the data science project is a must in order to avoid wasting time and resources.

Conclusion

It is true that there is pressure, both in academia and in industry, to invest in data science in order to produce more empirical research and not fall behind the competition. However, to take full advantage of data science it is more useful to focus on its actual uses and its demand side, so as to understand when it is needed, what its limitations are, and how to better structure a data science strategy. Academia can benefit substantially from data science if data scientists assist in answering relevant and challenging questions such as: How much do we tolerate each other? How do we estimate the role of media in political elections? How do we map the technological landscape of innovation? How do we measure possible shocks in the financial system? How do we measure collective behavior? And many others. Promising new fields like computational social science and network science have emerged from the intersection of social science and data science. Industry can also employ data science for socially useful purposes, like educational platforms that turn big data into usable interfaces, privacy-preserving data systems that help protect client privacy, recommender systems that save clients the time of searching through long lists of items, and image recognition that might one day become a product helping blind people move around and recognize objects. Of course, we run the risk that all these 'new' sciences turn out to be buzzwords if applied uncritically and without the necessary reflection on, and acknowledgement of, previous 'old-fashioned' scientific research.
