Friday, January 11, 2019

Differentiate between Structured, Semi-structured and Unstructured data


Structured Data: A dataset that fits into a pre-defined row and column format.

E.g. data stored in relational databases such as Oracle, MySQL, etc.
Example: An employee dataset in structured format

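| emp_id | emp_name       | emp_age | emp_sal | emp_address_id |
|--------|----------------|---------|---------|----------------|
| 1      | Michael Conway | 45      | 250000  | ADD1           |
| 2      | Sameer P       | 33      | 150000  | ADD2           |
| 3      | Arya           | 18      | 85000   | ADD3           |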

Structured data can be linked to other structured datasets using relational keys. E.g., emp_address_id might relate to a separate address table:

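| emp_address_id | addr_1        | addr_2   | city        | state | zip   |
|----------------|---------------|----------|-------------|-------|-------|
| ADD1           | 9232 abc rd   | Apt A222 | Marlborough | MA    | 02345 |
| ADD2           | 40098 main st |          | Boston      | MA    | 06789 |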
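
To make the idea of a relational key concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table names (employee, address) and the in-memory database are assumptions made for illustration; the post itself only names Oracle and MySQL, and any relational database would work the same way.

```python
import sqlite3

# In-memory database, used here only for illustration.
conn = sqlite3.connect(":memory:")

# Hypothetical table names; the columns follow the two tables in the post.
conn.executescript("""
    CREATE TABLE address (
        emp_address_id TEXT PRIMARY KEY,
        addr_1 TEXT, addr_2 TEXT, city TEXT, state TEXT, zip TEXT
    );
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        emp_name TEXT, emp_age INTEGER, emp_sal INTEGER,
        emp_address_id TEXT  -- logical link to address.emp_address_id
    );
""")

conn.executemany("INSERT INTO address VALUES (?, ?, ?, ?, ?, ?)", [
    ("ADD1", "9232 abc rd", "Apt A222", "Marlborough", "MA", "02345"),
    ("ADD2", "40098 main st", None, "Boston", "MA", "06789"),
])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", [
    (1, "Michael Conway", 45, 250000, "ADD1"),
    (2, "Sameer P", 33, 150000, "ADD2"),
    (3, "Arya", 18, 85000, "ADD3"),
])

# Join the two tables on the shared key. LEFT JOIN keeps employees
# whose address row is missing (ADD3 has no entry in the address table).
rows = conn.execute("""
    SELECT e.emp_name, a.city
    FROM employee e
    LEFT JOIN address a ON a.emp_address_id = e.emp_address_id
    ORDER BY e.emp_id
""").fetchall()

for name, city in rows:
    print(name, city)
```

Because both tables have a fixed set of columns, the join is a single declarative query. That is the main practical benefit of structured data.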

Semi-structured: Semi-structured data cannot be represented in a simple row and column format, but it does follow a format that makes it easy to process.

XML and JSON are common examples of such formats.

Example: The above Employee dataset in semi-structured JSON format

{
  "emp_id": 1,
  "emp_name": "Michael Conway",
  "emp_age": 45,
  "emp_sal": 250000,
  "emp_address": {
    "emp_address_id": "ADD1"
  }
}

Please note that although the structure is not as simple as a structured dataset, it can still be analyzed. Using certain libraries (for example, Java code that converts JSON to CSV) or tools (for example, a Chrome extension JSON-to-CSV converter), a semi-structured dataset can be converted into a structured dataset.
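
As a rough illustration of that conversion, here is a minimal sketch in Python (rather than the Java code mentioned above) using only the standard library. The record is modeled on the employee example, and the helper name flatten and the dotted column naming are assumptions made for this sketch, not a standard.

```python
import csv
import io
import json

# A record shaped like the employee example in this post.
raw = """
{
  "emp_id": 1,
  "emp_name": "Michael Conway",
  "emp_age": 45,
  "emp_sal": 250000,
  "emp_address": {"emp_address_id": "ADD1"}
}
"""

def flatten(record, prefix=""):
    """Turn nested objects into flat columns, e.g. emp_address.emp_address_id."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

row = flatten(json.loads(raw))

# Write the flattened record as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row), lineterminator="\n")
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()

print(csv_text)
```

This sketch only handles nested objects. Arrays inside the JSON would need an extra decision, such as one row per element, which is one reason semi-structured data takes more work than a plain table.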

Unstructured dataset: Any dataset which does not fall into the above categories. By commonly cited estimates, roughly 80% of the world's data is unstructured. Unstructured data includes raw text, multimedia content, audio/video, images and even your favorite novel or book. While such data may appear to have an internal structure (a book has an index and chapters, for example), it does not fit closely into a pre-defined data format.

Example: the above employee data could be part of unstructured text data
xyz Inc is a new startup company and it's growing fast. We had a chance to interview its founder and CEO Michael Conway (Age: 45) at his residence, 9232 abc rd, Apt A222, Marlborough, MA, 02345. Michael is busy strategizing the business plan for the company.

Analyzing and processing unstructured data can be tedious because it often contains incomplete, erroneous, or irrelevant data. Given that most data is unstructured, big data tools and ecosystems (such as Hadoop and Spark) provide standard ways to process it.
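
To give a sense of why this is tedious, here is a minimal sketch in Python that pulls a name and an age out of a shortened version of the interview sentence above. The regular expression is an assumption tailored to this one sentence and is not a general technique; real text would need far more robust handling.

```python
import re

# Shortened version of the interview sentence from the example above.
text = (
    "We had a chance to interview its founder and CEO "
    "Michael Conway (Age: 45) at his residence."
)

# Hypothetical pattern: two capitalized words followed by "(Age: <digits>)".
pattern = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) \(Age: (?P<age>\d+)\)"
)

match = pattern.search(text)
record = None
if match:
    # Map the extracted pieces onto the structured column names.
    record = {
        "emp_name": match.group("name"),
        "emp_age": int(match.group("age")),
    }

print(record)
```

A small change in wording, such as "aged 45" instead of "(Age: 45)", would break this pattern. The structured and semi-structured versions of the same data need no such guesswork.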
