Big Data Coder: Differentiate between Structured, Semi-structured and Unstructured Data

Structured Data: A dataset that fits into a predefined row-and-column format.

E.g. data stored in relational databases such as Oracle, MySQL, etc.

Example: An employee dataset in structured format

emp_id	emp_name	emp_age	emp_sal	emp_address_id
1	Michael Conway	45	250000	ADD1
2	Sameer P	33	150000	ADD2
3	Arya	18	85000	ADD3

Structured data can be linked to other structured datasets using relational keys. E.g. emp_address_id might refer to a row in a separate address table:

emp_address_id	addr_1	addr_2	city	state	zip
ADD1	9232 abc rd	Apt A222	Marlborough	MA	02345
ADD2	40098 main st		Boston	MA	06789
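
As a rough illustration of how a relational key links the two tables, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a database like Oracle or MySQL. The table names `employee` and `address` are assumptions made for this example; the column names and rows come from the tables above.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for Oracle/MySQL.
# Table names "employee" and "address" are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    emp_address_id TEXT PRIMARY KEY,
    addr_1 TEXT, addr_2 TEXT, city TEXT, state TEXT, zip TEXT
);
CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,
    emp_name TEXT, emp_age INTEGER, emp_sal INTEGER,
    emp_address_id TEXT
);
""")

conn.executemany(
    "INSERT INTO address VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("ADD1", "9232 abc rd", "Apt A222", "Marlborough", "MA", "02345"),
        ("ADD2", "40098 main st", "", "Boston", "MA", "06789"),
    ],
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Michael Conway", 45, 250000, "ADD1"),
        (2, "Sameer P", 33, 150000, "ADD2"),
        (3, "Arya", 18, 85000, "ADD3"),
    ],
)

# LEFT JOIN keeps employees whose address row is missing (ADD3 here).
rows = conn.execute("""
    SELECT e.emp_name, a.city, a.state
    FROM employee e
    LEFT JOIN address a ON a.emp_address_id = e.emp_address_id
    ORDER BY e.emp_id
""").fetchall()

for row in rows:
    print(row)
```

Since the address table has no row for ADD3, the join returns empty (NULL) address fields for that employee rather than dropping the record.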

Semi-structured Data: Semi-structured data cannot be represented in a simple row-and-column format, but it does have a format that makes it easy to process.

Examples of such formats are XML and JSON.

Example: The above Employee dataset in semi-structured JSON format

{
  "emp_id": 1,
  "emp_name": "Michael Conway",
  "emp_age": 45,
  "emp_sal": 250000,
  "emp_address": {
    "emp_address_id": "ADD1"
  }
}

Please note that although the structure is not as simple as that of a structured dataset, it can still be analyzed. Using a library (e.g. Java code that converts JSON to CSV) or a tool (e.g. a Chrome JSON-to-CSV converter extension), a semi-structured dataset can be converted into a structured one.
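
To make the JSON-to-CSV idea concrete, here is a minimal sketch in Python (rather than the Java library or Chrome extension mentioned above) using only the standard json and csv modules. The helper names `flatten` and `json_records_to_csv`, and the choice to join nested keys with a dot, are assumptions made for this example; the record is shaped like the employee example, with values taken from the structured table.

```python
import csv
import io
import json

# A record shaped like the employee JSON example (values assumed to
# match the structured employee table).
record_json = """
{
  "emp_id": 1,
  "emp_name": "Michael Conway",
  "emp_age": 45,
  "emp_sal": 250000,
  "emp_address": {"emp_address_id": "ADD1"}
}
"""


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with '.'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def json_records_to_csv(records):
    """Turn a list of (possibly nested) dicts into CSV text."""
    rows = [flatten(record) for record in records]
    # Union of all column names, in first-seen order.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = json_records_to_csv([json.loads(record_json)])
print(csv_text)
```

The nested emp_address object becomes a single column named emp_address.emp_address_id, so each record ends up as one row. Nested arrays are not handled in this sketch and would need their own rule, such as one row per element.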

Unstructured Data: Any dataset that does not fall into the above categories. By commonly cited estimates, around 80% of the world's data is unstructured. Unstructured data includes raw text, multimedia content such as audio, video and images, and even your favorite novel or book. While there may seem to be an internal structure within the data (a book has an index and chapters), it does not closely fit the concept of a predefined data format.

Example: the above employee data could be part of unstructured text data

xyz Inc. is a new startup company, and it's growing fast. We had a chance to interview its founder and CEO Michael Conway (Age: 45) at his residence, 9232 abc rd, Apt A222, Marlborough, MA 02345. Michael is busy strategizing the business plan for the company.

Analyzing and processing unstructured data can be tedious because it may contain incomplete, erroneous and irrelevant data. Given that most data is unstructured, big data tools and ecosystems (such as Hadoop and Spark) provide standard ways to process it.
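
To give a feel for why this is tedious, here is a minimal sketch that pulls a name and age out of free text like the interview snippet, using a plain regular expression in Python rather than Hadoop or Spark. The function name `extract_people` and the pattern are assumptions made for this example, and the pattern only works when the text happens to follow the "Name (Age: NN)" convention.

```python
import re

# Excerpt of the unstructured interview text from the example.
text = (
    "We had a chance to interview its founder and CEO "
    "Michael Conway (Age: 45) at his residence"
)

# Two or more capitalized words followed by "(Age: NN)".
# This pattern is an assumption about how the text is written.
PERSON_PATTERN = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)+) \(Age: (\d+)\)")


def extract_people(raw_text):
    """Return a list of {'name', 'age'} dicts found in free text."""
    return [
        {"name": name, "age": int(age)}
        for name, age in PERSON_PATTERN.findall(raw_text)
    ]


print(extract_people(text))
# A differently worded sentence yields nothing, which shows how brittle this is.
print(extract_people("Michael Conway, aged 45, founded the company"))
```

Every new way of writing the same fact needs another rule, which is one reason unstructured data is harder to process than the structured and semi-structured forms.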

Friday, January 11, 2019
