Thursday, March 3, 2016

Blog 3: Structured vs Unstructured Data

Data Overview
Structured data is data that is represented in a database, XML, CSV, or a similar format. It is easy for machines to process, which allows computations to be run on the data to enrich it or make it more meaningful. When data is both structured and formatted, it can be easily loaded into databases or data warehouses for queries and processing. (1)
Unstructured data, on the other hand, is something like this blog. It is usually stored in a human-readable format that is easy for us to understand but very difficult, and at times impossible, for a computer to understand. It can be analyzed by computers via parsers and similar tools, but it is much easier for a human to put data into a structured format than it is for a computer to take unstructured data and turn it into something structured.
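To make the contrast concrete, here is a minimal Python sketch that reads the same two facts once from CSV and once from a sentence. The column names, the sample values, and the sentence are all made up for illustration. The CSV version works for any number of rows, while the regular expression only works because we already know how this one sentence is phrased.

```python
import csv
import io
import re

# Structured: two purchases stored as CSV (hypothetical column names).
structured = "customer,amount\nAlice,30\nBob,45\n"
rows = list(csv.DictReader(io.StringIO(structured)))
total_structured = sum(int(row["amount"]) for row in rows)

# Unstructured: the same two purchases written as prose.
# The pattern is tied to this exact phrasing and breaks if the wording changes.
unstructured = "Alice spent 30 dollars yesterday, and Bob spent 45 dollars."
matches = re.findall(r"(\w+) spent (\d+) dollars", unstructured)
total_unstructured = sum(int(amount) for _, amount in matches)

print(total_structured, total_unstructured)  # prints: 75 75
```

If the sentence said "Bob paid 45 bucks" instead, the CSV code would be unaffected but the pattern would silently miss Bob.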











(2)






Data Types
Next, I will discuss three different types of data that are frequently seen in business. Communication data is for the most part unstructured, but its metadata can be structured. Transactional data is mostly structured. Finally, log data can be parsed into a structured form as well. The graphic below highlights that the three big sources of big data are transactions, emails (communications), and log data. These three data types are discussed in greater detail below.



















(3)

Communication (email) data is largely unstructured. It can be emails, text messages, phone calls, or video calls. Even chatting with your bro Jim in the hall about the game last night is communication. The actual communication itself is unstructured and difficult to process, but records of communication can be aggregated in a structured way. How frequently and for how long people communicate in business can readily be loaded into a structured format. In many large organizations, employees sign waivers allowing the organization to track communication data. That data can be loaded into a data warehouse, where communication patterns can be analyzed for investigations or business trends.
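As a rough illustration of turning communication records into something structured, the Python sketch below counts how often and for how long each pair of people talked. The record layout (caller, callee, minutes) and the names are assumptions for the example, not a real call-log format.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee, duration in minutes).
call_records = [
    ("alice", "bob", 12),
    ("alice", "bob", 3),
    ("carol", "alice", 20),
]

def summarize(records):
    """Return {(caller, callee): (number_of_calls, total_minutes)}."""
    counts = defaultdict(int)
    minutes = defaultdict(int)
    for caller, callee, duration in records:
        counts[(caller, callee)] += 1
        minutes[(caller, callee)] += duration
    return {pair: (counts[pair], minutes[pair]) for pair in counts}

summary = summarize(call_records)
print(summary[("alice", "bob")])  # prints: (2, 15)
```

Note that nothing here looks at what was actually said. Only the metadata is being structured.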


Transactional data is by nature the most structured of the three data types being discussed. Transactions are almost always tracked in a database and hold customer, supplier, product, and sale data. All of this data can be easily loaded from a transactional database into a data warehouse for processing and for analyzing metadata and trends. Data in a transactional database is more than likely in third or fourth normal form. The goal in a data warehouse is fast processing of large data sets, and the joins that normalization requires often slow this down, so it is better to flatten the data when loading it into a data warehouse.
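Here is a small sketch of what flattening can look like, using Python's built-in sqlite3 module with an in-memory database. The table and column names are invented for the example, and a real warehouse load would be far larger, but the idea is the same: do the joins once at load time so later queries read one wide table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized transactional tables (hypothetical schema).
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE sales     (id INTEGER PRIMARY KEY, customer_id INTEGER,
                            product_id INTEGER, quantity INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO products  VALUES (1, 'Widget', 2.5), (2, 'Gadget', 10.0);
    INSERT INTO sales     VALUES (1, 1, 1, 4), (2, 2, 2, 1);

    -- Flattened table: one wide row per sale, joins done once at load time.
    CREATE TABLE sales_flat AS
        SELECT s.id AS sale_id, c.name AS customer, p.name AS product,
               s.quantity, s.quantity * p.price AS revenue
        FROM sales s
        JOIN customers c ON c.id = s.customer_id
        JOIN products  p ON p.id = s.product_id;
""")

# Analysis queries can now read sales_flat without any joins.
flat_rows = conn.execute(
    "SELECT sale_id, customer, product, quantity, revenue "
    "FROM sales_flat ORDER BY sale_id"
).fetchall()
for row in flat_rows:
    print(row)
```

The tradeoff is that names and prices are now repeated on every row, which costs storage and makes updates harder. That is usually acceptable in a warehouse, where data is mostly read rather than changed.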
Log data takes on many forms, and big businesses generate tremendous amounts of logs every day. Logs can be collected from operating systems, applications, servers, databases, networking devices, and many other sources. Although this data follows a uniform layout on many operating systems, it is often unstructured. Companies such as Splunk make log aggregators that parse logs into structured formats. From there, log data can be analyzed on many different levels.
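The parsing step can be sketched in a few lines of Python. This is not how Splunk or any particular product works. It is just a regular expression over a made-up log layout ("date time LEVEL source: message"), with hypothetical host names, to show raw lines becoming records with named fields.

```python
import re

# Hypothetical log layout: "<date> <time> <LEVEL> <source>: <message>".
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<source>[\w.-]+): "
    r"(?P<message>.*)"
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

raw_logs = [
    "2016-03-03 09:15:02 ERROR db01: connection timed out",
    "2016-03-03 09:15:07 INFO web02: request served",
    "not a log line",
]

# Keep only the lines that parsed; the third line is dropped.
parsed = [record for record in map(parse_line, raw_logs) if record]
error_count = sum(1 for record in parsed if record["level"] == "ERROR")
print(len(parsed), error_count)  # prints: 2 1
```

Once the lines are dictionaries, questions like "how many errors came from db01 today" become simple filters instead of text searches.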


Data Warehouse Limitations



















(4)





Next, I will discuss the difficulties or limitations of data warehouses when dealing with different types of data. As you can see in the above image, data takes many different forms and is collected from many different locations in an organization. One limitation of a data warehouse is really an issue with the data itself that makes the data warehousing process very difficult: non-uniform data. When data comes from multiple sources, it is unlikely that all of it is uniform, and this can cause performance issues in the ETL process of a data warehouse. Another limitation is the sheer amount of data. Large organizations especially can accumulate terabytes of data every day and need a way to archive data to make sure that their data warehouses are performing adequately. The quality and amount of data can limit the effectiveness of data warehouses and their ability to run analysis of data in a reasonable amount of time.
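To show what non-uniform data means for the ETL step, here is a minimal Python sketch in which three source systems each write dates differently. The list of formats and the sample values are assumptions for the example. Every incoming value has to be tried against each known format, and anything that matches none of them has to be set aside, which is extra work on every single row.

```python
from datetime import datetime

# Hypothetical date formats used by three different source systems.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]

def normalize_date(text):
    """Return the date as ISO 8601 text, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # wrong format, try the next one
    return None

incoming = ["2016-03-03", "03/04/2016", "05.03.2016", "sometime in March"]
cleaned = [normalize_date(value) for value in incoming]
print(cleaned)
```

A value like "03/04/2016" also shows the quality problem: the code reads it as March 4 because of the format list above, but a source that writes day first would mean April 3, and nothing in the data itself says which is right.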

Where Data Warehouses are Headed
In my opinion, data warehouses will be leveraged to make macro predictions based on much more micro data. As our ability to process data more quickly and efficiently with physically smaller devices advances, we will be able to aggregate much larger data sets and analyze relationships between data in ways we thought were unimaginable years ago. Rather than simulate the economic outcome of events, we will be able to store micro data on such a large scale that we will be able to accurately model macro-level economic change. With this, we will be able to accurately predict how changes in GDP and small production changes will have a local, state, and country impact. Increasing the ability to compute with more efficiency and store data in smaller physical formats allows us to analyze data on a very large scale.


(4) http://hadooptutorial.info/wp-content/uploads/2014/12/Data-ware-house-environment.png

***This blog was submitted for grade and not to be taken as a professional recommendation***
