Friday 2 February 2024

Data Preparation - Cleansing, integrating, and transforming data


 Data Retrieval Phase and Modeling

  • Data from retrieval phase is often "diamond in the rough."
  • Sanitization and preparation are crucial for better performance and less time spent on output correction.
  • Data transformation is necessary for the model to fit specific data formats.
  • Early correction of data errors is recommended.
  • Corrective actions may be necessary in realistic settings.
  • Below figure shows common actions during data cleansing, integration, and transformation.

1. Data Cleaning 

Data cleansing is a sub process of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.

1.1. Data Entry Errors Overview

  • Data collection and entry are error-prone processes requiring human intervention.
  • Human errors can include typos or loss of concentration.
  • Machine data collection also faces errors due to human sloppiness or machine or hardware failure.
  • Examples include transmission errors and bugs in the extract, transform, and load phase (ETL).
  • Hand-checking every value is recommended for small data sets.
  • Data errors can be detected by tabulating data with counts.
  • Frequency tables can be created for variables with only two values.

1.2.Outliers in Data Analysis 

  • Outliers are observations that seem distant from others or follow a different logic or generative process.
  • Finding outliers is easy using plots or tables with minimum and maximum values.
  • An example is provided where a normal distribution (Gaussian distribution) is expected, showing high values in the bottom graph.
  • Outliers can significantly influence data modeling, so it's crucial to investigate them first.

1.3. Dealing with Missing Values in Data Science

  • Missing values aren't always wrong but need separate handling.
  • They may indicate data collection errors or ETL process errors.
  • Common techniques used by data scientists are listed in table 2.4.

2. Transforming Data for Data Modeling 

  • Data cleansing and integration are crucial for data modeling.
  • Data transformation involves transforming data into a suitable form.
  • Linear relationships between input and output variables can be simplified by transforming the log of independent variables.
  • Combining two variables into a new variable can also be used.

Reducing Variables in Models

  • Overloading variables can hinder model handling.
  • Techniques like Euclidean distance perform best with 10 variables.
  • Reducing the number of variables can add new information to the model.

Turning Variables into Dummies in Data Science

  • Variables can be transformed into dummy variables, which can only take two values: true(1) or false(0).
  • Dummy variables indicate the absence of a categorical effect explaining an observation.
  • Separate columns for classes stored in one variable are created, with 1 indicating present classes and 0 otherwise.
  • Example: Turn Weekdays into Monday through Sunday columns to show if the observation was on a Monday.
  • This technique is popular in modeling and is not exclusive to economists.
  • The next step is to transform and integrate data into usable input for the modeling phase.

3. Data Combination from Different Sources

  • Data sources include databases, Excel files, text documents, etc.
  • Data science process is the focus, not presenting scenarios for every type of data.
  • Other data sources like key-value stores and document stores will be discussed in later sections.

Different Ways of Combining Data

  1. Joining: enriches an observation from one table with information from another.
  2. Appending or stacking: adds observations from one table to another.
  3.  Combining data allows creation of new physical or virtual tables.
  4. Views consume less disk space



Post a Comment

Note: only a member of this blog may post a comment.


Follow US

Join 12,000+ People Following





Java Tutorial


Digital Logic design Tutorial




ANU Materials