As Thomas Davenport and D.J. Patil wrote in their 2012 article "Data Scientist: The Sexiest Job of the 21st Century", data analysis is getting a lot of attention these days. It's taking on an ever larger role at companies of all sizes, including small startups. The practice of data analysis has developed gradually over time, benefiting enormously from the evolution of computing. Let's take a short journey together through the history of data analysis.

Data Analysis and Statistics

Data analysis is rooted in statistics, which has a long history. Statistics is said to have begun in ancient Egypt, where periodic censuses were taken for building the pyramids. Throughout history, statistics has played an important role for governments all across the world, which conducted censuses to support various planning activities (including, of course, taxation). Once the data is collected, we can move on to the next step: analyzing it. Data analysis is a process that begins with retrieving data from various sources and then analyzing it with the goal of discovering useful information. For example, analyzing population growth by district can help governments determine how many hospitals a given area will need.

Data Analysis and Computing

Advances in Collection Mechanisms

The invention of computers and the subsequent advances in computing technology dramatically enhanced what we can do with data analysis. Before computers, processing the data collected for the 1880 US Census and producing a final report took over seven years. To shorten that time, Herman Hollerith invented the "Tabulating Machine", which systematically processed data recorded on punch cards and was first put to work on the 1890 census. Thanks to the Tabulating Machine, the 1890 census was completed in only 18 months and on a much smaller budget.

Relational Databases

After the von Neumann architecture was invented, data came to be stored and processed programmatically on computers, and data analysis moved with it. The turning point was the appearance of the relational database (RDB) in the 1980s, which allowed users to write SQL queries to retrieve data from a database. For users, the advantage of relational databases and SQL is the ability to analyze their data on demand. This made getting at the data easy and helped spread database use. As you can see, the combination of easier, cheaper data collection with cheaper, faster data storage and retrieval has pushed the boundaries of what we can do with data.
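
To make that idea concrete, here is a minimal sketch of on-demand retrieval with SQL, using Python's built-in sqlite3 module as a stand-in relational database. The sales table, its columns, and the sample rows are hypothetical, invented purely for illustration.

```python
import sqlite3

# Create an in-memory relational database and a small, made-up "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.5), ("east", 45.0)],
)

# On-demand analysis: the moment a question comes up, express it as SQL
# and get an answer, no custom file-processing program required.
for region, total in conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
):
    print(region, total)

conn.close()
```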

Data Warehouses and Business Intelligence

From around the late 1980s, the amount of data collected kept increasing significantly, thanks to the ever decreasing cost of hard disk drives. That's when William H. Inmon proposed the "data warehouse", a system optimized for reporting and data analysis. The difference from ordinary relational databases is that data warehouses are typically optimized for query response time. Data is often stored with a timestamp, and operations such as DELETEs and UPDATEs are used much less frequently. For example, if a business wanted to compare sales trends for each month, all sales transactions can be stored with timestamps within a data warehouse and queried based on that timestamp. The term "BI (Business Intelligence)" was proposed by Howard Dresner at Gartner in 1989. BI supports better business decision making through searching, collecting, and analyzing the data a business accumulates. The birth of the concept was only natural, given the maturity of technologies like databases and data warehouses available to support it. Big companies in particular embraced BI, systematically analyzing customer data when making business decisions.
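
As a rough sketch of that monthly sales-trend example, the snippet below again uses sqlite3 as a stand-in: transactions are only appended with a timestamp and then aggregated by month, mirroring the append-and-query workload of a warehouse. Table and column names are hypothetical.

```python
import sqlite3

# A made-up table of timestamped sales transactions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2023-01-15", 100.0),
        ("2023-01-28", 250.0),
        ("2023-02-03", 175.0),
    ],
)

# Compare sales trends month by month; rows are only appended and queried,
# never UPDATEd or DELETEd, which matches the typical warehouse workload.
for month, total in conn.execute(
    "SELECT strftime('%Y-%m', sold_at) AS month, SUM(amount) "
    "FROM sales GROUP BY month ORDER BY month"
):
    print(month, total)

conn.close()
```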


Data Mining

Data mining, which appeared around the 1990s, is the computational process of discovering patterns in large datasets. By analyzing data in ways that differ from the usual reporting methods, unexpected but beneficial findings can surface. The development of data mining was made possible by database and data warehouse technologies, which enable companies to store more data and still analyze it in a reasonable amount of time. A general business trend emerged in which companies started to "predict" customers' potential needs based on analysis of historical purchasing patterns.
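
One classic data-mining idea is spotting items that are frequently bought together. The sketch below, with a handful of made-up shopping baskets, counts co-occurring item pairs; real data mining applies the same kind of pattern search to far larger datasets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (shopping baskets), invented for illustration.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "diapers", "beer"},
    {"bread", "butter"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs appearing in at least half of all baskets hint at a purchasing
# pattern a business might use to "predict" what a customer wants next.
min_support = len(transactions) / 2
for pair, count in pair_counts.most_common():
    if count >= min_support:
        print(pair, count)
```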

The next big change was the internet. To meet the demand for finding a particular website on the web, Larry Page and Sergey Brin developed the Google search engine, which processes and analyzes big data across distributed computers. Remarkably, the engine responds with the result you most likely wanted to see within seconds. The key points of this system are that it was "automated", "scalable", and "high performance". Google's white paper on MapReduce in 2004 greatly inspired engineers, pulling an influx of talent into the challenge of handling big data. In the late 2000s, many open source software projects like Apache Hadoop and Apache Cassandra were created to take on this challenge.
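
For a feel of the MapReduce pattern described in that paper, here is a minimal single-machine sketch in Python: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group. Systems like Hadoop run the same pattern in parallel across many machines; the sample documents here are invented.

```python
from collections import defaultdict

# Two tiny, made-up "documents" standing in for a large corpus.
documents = ["big data needs big tools", "data tools for data analysis"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all emitted values by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```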

Big Data Analysis on the Cloud

In the early 2010s, Amazon Redshift, a cloud-based data warehouse, and Google BigQuery, which processes each query across thousands of Google servers, were released. Both came with a remarkable fall in cost and lowered the hurdle to processing big data. Nowadays, almost any company can get an infrastructure for big data analysis within a reasonable budget. Even startups, which traditionally did not have the budget for such analysis, can now iterate PDCA (plan-do-check-act) cycles rapidly by using big data tools such as Amazon Redshift.

Conclusion

As we've seen, data analysis and computer technology have been developing and influencing each other ever since the advent of computing. As the volume of collected data has grown, new methods of data analysis have been introduced at each stage, out of necessity. As data collection and computing get even cheaper, we should continue to see breakthroughs in the area of big data.

How Integrate.io Helps

Integrate.io provides continuous, near real-time replication from RDS, MySQL, and PostgreSQL databases to Amazon Redshift. The Integrate.io Sync tool is an intuitive, powerful, cost-effective way to integrate your transactional databases and analytical data stores in a single interface, with no manual scripting.

You can start a 14-day Free Trial and begin syncing your data within minutes. For questions about Integrate.io and how we can help accelerate your use case and journey on Amazon Redshift, get in touch with our team.

References

[1] Data Scientist: The Sexiest Job of the 21st Century: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

[2] Censuses of Egypt: http://en.wikipedia.org/wiki/Censuses_of_Egypt

[3] Tabulating machine: http://en.wikipedia.org/wiki/Tabulating_machine

[4] Data warehouse: http://en.wikipedia.org/wiki/Data_warehouse

[5] Data mining: http://en.wikipedia.org/wiki/Data_mining

[6] BI: http://en.wikipedia.org/wiki/Business_intelligence