Structured data: A dataset that fits into a
pre-defined row and column format.
E.g. data stored in relational databases
such as Oracle, MySQL, etc.
Example: An employee dataset in structured format
emp_id | emp_name | emp_age | emp_sal | emp_address_id
1 | Michael Conway | 45 | 250000 | ADD1
2 | Sameer P | 33 | 150000 | ADD2
3 | Arya | 18 | 85000 | ADD3
Structured data can be linked to other structured datasets
using relational keys. E.g. emp_address_id might relate to rows in a separate address table:
emp_address_id | addr_1 | addr_2 | city | state | zip
ADD1 | 9232 abc rd | Apt A222 | Marlborough | MA | 02345
ADD2 | 40098 main st | | Boston | MA | 06789
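As a rough illustration of how a relational key links the two tables, here is a minimal sketch using Python's built-in sqlite3 module. The table names `employee` and `address` and the helper `employees_with_city` are assumptions made for this example; the column names and values follow the tables above.

```python
import sqlite3

# In-memory database; table names "employee" and "address" are assumed for
# illustration, the columns mirror the example tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        emp_address_id TEXT PRIMARY KEY,
        addr_1 TEXT, addr_2 TEXT, city TEXT, state TEXT, zip TEXT
    );
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        emp_name TEXT, emp_age INTEGER, emp_sal INTEGER,
        emp_address_id TEXT
    );
""")
conn.executemany("INSERT INTO address VALUES (?, ?, ?, ?, ?, ?)", [
    ("ADD1", "9232 abc rd", "Apt A222", "Marlborough", "MA", "02345"),
    ("ADD2", "40098 main st", None, "Boston", "MA", "06789"),
])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", [
    (1, "Michael Conway", 45, 250000, "ADD1"),
    (2, "Sameer P", 33, 150000, "ADD2"),
    (3, "Arya", 18, 85000, "ADD3"),
])

def employees_with_city(connection):
    """Join employees to addresses on the shared key emp_address_id.

    A LEFT JOIN keeps employees that have no matching address row;
    their city comes back as None.
    """
    return connection.execute("""
        SELECT e.emp_name, a.city
        FROM employee AS e
        LEFT JOIN address AS a
          ON e.emp_address_id = a.emp_address_id
        ORDER BY e.emp_id
    """).fetchall()

rows = employees_with_city(conn)
for row in rows:
    print(row)
```

Because the address table has no ADD3 row, the third employee is returned with an empty city rather than being dropped, which is the reason for choosing a LEFT JOIN here.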
Semi-structured data: Semi-structured data cannot be
represented in a simple row and column format, but it does have a format that
makes it easy to process.
Examples of such formats are XML and JSON.
Example: The above employee dataset in
semi-structured JSON format
{
  "emp_id": 1,
  "emp_name": "Michael Conway",
  "emp_age": 45,
  "emp_sal": 250000,
  "emp_address": {
    "emp_address_id": "ADD1"
  }
}
Please note that the structure is not as simple as
that of a structured dataset, but it can still be analyzed. Using certain libraries
(e.g. Java code to convert JSON to CSV) or tools (e.g. a Chrome extension
JSON-to-CSV converter), the semi-structured
dataset can be converted into a structured dataset.
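To make the conversion concrete, here is a minimal sketch in Python (standard library only) that flattens a nested JSON record into a single CSV row. The helper names `flatten` and `json_to_csv` and the dotted-key naming scheme are assumptions for this example, not part of any specific converter mentioned above.

```python
import csv
import io
import json

def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

def json_to_csv(json_text):
    """Convert a JSON object (or a list of objects) to CSV text."""
    records = json.loads(json_text)
    if isinstance(records, dict):
        records = [records]
    rows = [flatten(r) for r in records]
    # Column order: union of all keys, in first-seen order.
    fieldnames = list(dict.fromkeys(k for row in rows for k in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Sample record shaped like the employee example.
sample = """
{
  "emp_id": 1,
  "emp_name": "Michael Conway",
  "emp_age": 45,
  "emp_sal": 250000,
  "emp_address": {"emp_address_id": "ADD1"}
}
"""
csv_text = json_to_csv(sample)
print(csv_text)
```

The nested `emp_address` object becomes a column named `emp_address.emp_address_id`, so the result fits a plain row and column layout. Nested lists are not handled in this sketch and would need their own rule.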
Unstructured data: Any dataset that does not fall into
the above categories. By commonly cited estimates, almost 80% of the data in the world is
unstructured. Unstructured data includes raw text, multimedia content,
audio/video, images and even your favorite novel or book. While there might seem to be an internal structure
within the data (a book, for example, has an index and chapters), it does not closely fit
the concept of a pre-defined data format.
Example: The above employee data could be part of
unstructured text data
The xyz Inc is a new startup company and it's growing fast.
We had a chance to interview its founder and CEO Michael Conway (Age: 45) at
his residence 9232 abc rd, Apt A222, Marlborough, MA, 02345. Michael is busy
strategizing the business plan for the company.
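To get a feel for what pulling structured fields out of such text involves, here is a minimal sketch using a regular expression. The pattern and the helper `extract_employee` are hypothetical and tuned to this one sentence; real unstructured text would rarely follow such a predictable shape.

```python
import re

# One sentence taken from the passage above.
text = (
    "We had a chance to interview its founder and CEO Michael Conway "
    "(Age: 45) at his residence."
)

# Hypothetical pattern: a capitalized two-word name followed by "(Age: N)".
PATTERN = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+) \(Age: (\d+)\)")

def extract_employee(passage):
    """Return emp_name and emp_age if the pattern is found, else None."""
    match = PATTERN.search(passage)
    if match is None:
        return None
    return {"emp_name": match.group(1), "emp_age": int(match.group(2))}

record = extract_employee(text)
print(record)
```

A small change in wording, such as "aged 45" instead of "(Age: 45)", would make the pattern miss the record entirely, which hints at why this kind of processing does not scale by hand.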
Analyzing and processing unstructured data can be tedious
because the data may be incomplete, erroneous or irrelevant. Given
that most data is unstructured, big data tools and
ecosystems (like Hadoop and Spark) provide standard ways to process it.