Friday, 2 February 2024

Data Preparation - Cleansing, integrating, and transforming data

 Data Retrieval Phase and Modeling

  • Data from the retrieval phase is often a "diamond in the rough."
  • Sanitization and preparation are crucial: they improve model performance and reduce the time spent correcting strange output.
  • Data often has to be transformed because the model expects it in a specific format.
  • Correct data errors as early as possible.
  • In realistic settings early correction is not always possible, so corrective actions may still be needed later in your program.
  • The figure below shows common actions during data cleansing, integration, and transformation.

1. Data Cleaning 

Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so that it becomes a true and consistent representation of the processes it originates from.

1.1. Data Entry Errors Overview

  • Data collection and entry are error-prone processes requiring human intervention.
  • Human errors can include typos or loss of concentration.
  • Machine data collection also faces errors due to human sloppiness or machine or hardware failure.
  • Examples include transmission errors and bugs in the extract, transform, and load phase (ETL).
  • Hand-checking every value is recommended for small data sets.
  • For larger data sets, errors can be detected by tabulating the data with counts.
  • For a variable that should take only two values, a frequency table shows whether those are truly the only values present.
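
The frequency-table check can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `quality` column, its misspelled entries, and the expected set of values are all made up for the example.

```python
from collections import Counter

# Hypothetical column that should contain only "Good" or "Bad".
quality = ["Good", "Bad", "Good", "Godo", "Bad", "Good", "Bade", "Good"]

# Tabulate each distinct value with its count.
frequency = Counter(quality)

# Any value outside the expected set is a candidate data entry error.
expected = {"Good", "Bad"}
suspects = {value: count for value, count in frequency.items()
            if value not in expected}

for value, count in frequency.most_common():
    print(f"{value:<6}{count}")
print("Suspect values:", suspects)
```

Values that occur only once or twice next to a frequent, similar-looking value are usually typos worth checking against the source.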

1.2. Outliers in Data Analysis

  • Outliers are observations that seem distant from others or follow a different logic or generative process.
  • The easiest way to find outliers is a plot or a table with the minimum and maximum values.
  • In the accompanying example, a normal (Gaussian) distribution is expected, and the bottom graph shows unexpectedly high values.
  • Outliers can significantly influence data modeling, so it's crucial to investigate them first.
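
A minimal sketch of the min/max check follows, using made-up measurements. The interquartile-range fences are an assumed rule of thumb added for illustration; they are not taken from the notes above, and other cut-offs are equally defensible.

```python
import statistics

# Hypothetical measurements; the last value is deliberately extreme.
values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 55.0]

# A quick min/max summary already hints that something is off.
print("min:", min(values), "max:", max(values))

# Interquartile-range fences (assumed rule of thumb, not from the source).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag every value that falls outside the fences.
outliers = [v for v in values if v < lower or v > upper]
print("outliers:", outliers)
```

A flagged value is only a candidate: it still has to be investigated before deciding whether it is an error or a genuine extreme observation.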

1.3. Dealing with Missing Values in Data Science

  • Missing values aren't always wrong but need separate handling.
  • They may indicate data collection errors or ETL process errors.
  • Common techniques used by data scientists are listed in table 2.4.
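
Two widely used ways of handling missing values, omitting them and imputing a static value, can be sketched as follows. The readings are invented, and this sketch does not claim to reproduce the full list in table 2.4.

```python
import statistics

# Hypothetical temperature readings; None marks a missing value.
readings = [21.5, None, 22.0, 23.5, None, 21.0]

# Option 1: omit the observations with missing values.
observed = [r for r in readings if r is not None]

# Option 2: impute a static value, here the mean of the observed values.
mean_value = statistics.mean(observed)
imputed = [mean_value if r is None else r for r in readings]

print("omitted:", observed)
print("imputed:", imputed)
```

Omitting loses observations, while imputing keeps them at the risk of distorting the distribution, so the choice depends on how much data is missing and why.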

2. Transforming Data for Data Modeling 

  • Data cleansing and integration are crucial for data modeling.
  • Data transformation converts the data into a form that is suitable for modeling.
  • Relationships between input and output variables are not always linear; taking the log of the independent variables can simplify the estimation problem.
  • Combining two variables into a single new variable is another useful transformation.
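
As an illustration of the log transformation, the sketch below fits a straight line after taking the log of the independent variable. The data is hypothetical and was generated from y = 2 + 3 * log10(x); `fit_line` is a small helper written for this example, not a library function.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data generated from y = 2 + 3 * log10(x).
x = [1, 10, 100, 1000]
y = [2.0, 5.0, 8.0, 11.0]

# Transform the independent variable, then fit a straight line.
log_x = [math.log10(v) for v in x]
slope, intercept = fit_line(log_x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * log10(x)")
```

On the raw x values the same points are strongly curved; after the transformation a simple linear model recovers the relationship.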

Reducing Variables in Models

  • Too many variables make a model difficult to handle.
  • Some techniques perform poorly when overloaded with input variables; for example, techniques based on Euclidean distance perform well only up to about 10 variables.
  • Variables that add no new information to the model can be removed.
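
One simple way to reduce the number of variables is to drop any variable that is almost perfectly correlated with one already kept. This is only one of several possible filters and is not presented in the notes above; the column names, the data, and the 0.95 threshold are assumptions made for the sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical data set: height_in is height_cm expressed in inches.
columns = {
    "height_cm": [160.0, 170.0, 180.0, 190.0],
    "height_in": [62.99, 66.93, 70.87, 74.80],
    "age": [35.0, 22.0, 41.0, 29.0],
}

threshold = 0.95  # assumed cut-off for "adds no new information"

# Keep a column only if it is not strongly correlated with a kept column.
kept = []
for name, values in columns.items():
    if all(abs(pearson(values, columns[k])) < threshold for k in kept):
        kept.append(name)

print(kept)
```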

Turning Variables into Dummies in Data Science

  • Variables can be transformed into dummy variables, which can only take two values: true(1) or false(0).
  • Dummy variables indicate the presence or absence of a categorical effect that may explain an observation.
  • Each class stored in one variable gets its own column, holding 1 if the class is present for that observation and 0 otherwise.
  • Example: turn a single Weekdays column into the columns Monday through Sunday, where the Monday column shows whether the observation was on a Monday.
  • This technique is popular in modeling and is not exclusive to economists.
  • The next step is to transform and integrate data into usable input for the modeling phase.
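
The weekday example can be sketched as below. The `to_dummies` helper and the sample observations are hypothetical and exist only to show the shape of the result.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def to_dummies(day):
    """Return one 0/1 column per weekday for a single observation."""
    return {name: int(name == day) for name in WEEKDAYS}


# Hypothetical observations with a single Weekdays value each.
observations = ["Monday", "Thursday", "Monday"]
rows = [to_dummies(day) for day in observations]

for day, row in zip(observations, rows):
    print(day, list(row.values()))
```

Each row contains exactly one 1, in the column of the weekday on which the observation was made.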

3. Data Combination from Different Sources

  • Data sources include databases, Excel files, text documents, etc.
  • The focus here is on the data science process rather than on presenting scenarios for every type of data.
  • Other data sources like key-value stores and document stores will be discussed in later sections.

Different Ways of Combining Data

  1. Joining: enriches an observation from one table with information from another table.
  2. Appending or stacking: adds the observations of one table to those of another table.

  • Either operation can produce a new physical table or a virtual table (a view).
  • A view consumes less disk space than a physical table because it stores no copy of the data.
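
Joining, appending, and views can be tried out with an in-memory SQLite database. This is a minimal sketch under assumed names: the `clients`, `sales_jan`, and `sales_feb` tables and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (client_id INTEGER, region TEXT);
    CREATE TABLE sales_jan (client_id INTEGER, amount REAL);
    CREATE TABLE sales_feb (client_id INTEGER, amount REAL);
    INSERT INTO clients VALUES (1, 'North'), (2, 'South');
    INSERT INTO sales_jan VALUES (1, 100.0), (2, 50.0);
    INSERT INTO sales_feb VALUES (1, 75.0);

    -- Appending: stack the monthly tables in a view (a virtual table).
    CREATE VIEW sales_all AS
        SELECT client_id, amount FROM sales_jan
        UNION ALL
        SELECT client_id, amount FROM sales_feb;
""")

# Joining: enrich each sale with the region of its client.
joined = conn.execute("""
    SELECT s.client_id, c.region, s.amount
    FROM sales_all AS s
    JOIN clients AS c ON c.client_id = s.client_id
    ORDER BY s.client_id, s.amount
""").fetchall()

print(joined)
conn.close()
```

The view holds only the query definition, so the stacked rows are computed each time it is read instead of being stored a second time.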

Retrieving Data

 Data Science Steps: Retrieving Required Data

  • Designing a data collection process yourself may sometimes be necessary.
  • Often, though, companies already collect and store the data you need.
  • Data that is not available in-house can often be purchased from third parties.
  • Don't hesitate to seek data outside your organization.
  • More organizations are making high-quality data freely available for public and commercial use.

Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need. This may be difficult, and even if you succeed, data is often like a diamond in the rough: it needs polishing to be of any use to you.

 1. Start with data stored within the company

 Assessing Data Relevance and Quality

  • Assess the quality and relevance of available data within the company.
  • Companies often have a data maintenance program, reducing cleaning work.
  • Data can be stored in official repositories like databases, data marts, data warehouses, and data lakes.
  • Databases are for data storage, data warehouses for data analysis, and data marts serve specific business units.
  • Data lakes contain raw data, while data warehouses and data marts are preprocessed.
  • Data may still exist in Excel files on a domain expert's desktop.

Data Management Challenges in Companies

  • Data becomes scattered across many places as companies grow.
  • Knowledge of the data disperses as people change positions or leave the company.
  • Documentation and metadata are not always prioritized.
  • Finding lost data may require Sherlock Holmes-like skills.

Data Access Challenges

  • Organizations often have policies that give everyone access to the data they need and nothing more.
  • These policies create physical and digital barriers, known as "Chinese walls."
  • These "walls" are mandatory and well-regulated for customer data in most countries.
  • Accessing data can be time-consuming and influenced by company politics.


2. Don’t be afraid to shop around

Data Sharing and its Importance

  • Companies like Nielsen and GfK specialize in collecting valuable information.
  • Twitter, LinkedIn, and Facebook provide data for enriching their services and ecosystem.
  • Governments and organizations share their data for free, covering a broad range of topics.
  • This data is useful for enriching proprietary data and training data science skills at home.
  • Table 2.1 shows a small selection from the growing number of open-data providers.





3. Do data quality checks now to prevent problems later


Data Science Project Overview
  • Data correction and cleansing are crucial and often take up to 80% of project time.
  • Data retrieval is the first point in the data science process at which you inspect the data.
  • Errors in data retrieval can be easily identified, but carelessness can lead to long-term data issues.
  • Data investigation occurs during import, data preparation, and exploratory phases.
  • Data retrieval checks if the data is equal to the source document and if the data types match.
  • Data preparation involves a more detailed check, aiming to eliminate typos and data entry errors.
  • The exploratory phase focuses on learning from the data, examining statistical properties like distributions, correlations, and outliers.
  • Iteration over these phases is common, as outliers can indicate data entry errors.
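
The checks done at retrieval time, comparing the imported data with the source and verifying data types, can be sketched as follows. The schema in `EXPECTED_TYPES`, the `check_import` helper, and the sample rows are all hypothetical.

```python
# Hypothetical schema for the imported table (assumed for this sketch).
EXPECTED_TYPES = {"order_id": int, "amount": float, "country": str}


def check_import(rows, expected_row_count):
    """Return a list of human-readable problems found in the imported rows."""
    problems = []

    # Does the number of imported rows equal the count in the source?
    if len(rows) != expected_row_count:
        problems.append(
            f"expected {expected_row_count} rows, got {len(rows)}")

    # Does every value have the expected data type?
    for index, row in enumerate(rows):
        for column, expected_type in EXPECTED_TYPES.items():
            value = row.get(column)
            if not isinstance(value, expected_type):
                problems.append(
                    f"row {index}: {column}={value!r} "
                    f"is not {expected_type.__name__}")
    return problems


rows = [
    {"order_id": 1, "amount": 19.99, "country": "BE"},
    {"order_id": 2, "amount": "12,50", "country": "NL"},  # text, not a number
]

problems = check_import(rows, expected_row_count=3)
for problem in problems:
    print(problem)
```

Checks like these are cheap at import time; the more detailed hunt for typos and entry errors belongs to the data preparation phase.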

Setting the research goal

A project starts by understanding the what, the why, and the how of your project. What does the company expect you to do? And why does management place such a value on your research? Is it part of a bigger strategic picture or a “lone wolf” project originating from an opportunity someone detected? Answering these three questions (what, why, how) is the goal of the first phase, so that everybody knows what to do and can agree on the best course of action.

The output should be a clear research goal, a good understanding of the context, well-defined deliverables, and a plan of action with a timetable. This information is best recorded in a project charter. The length and formality of the charter can vary between projects and companies. This part of the project is often led by more senior staff, since business acumen and people skills matter more at this early stage than technical ability.

Spend time understanding the goals and context of your research

Research Goal Importance

  • Outlines the purpose of the assignment clearly.
  • Essential for understanding business goals and context.
  • Keep asking questions and devising examples until you understand the exact business expectations.
  • Identify project's fit in the larger picture.
  • Understand how research will change the business.
  • Understand how results will be used.
  • Avoid misunderstanding business goals and context.
  • Many data scientists fail here: despite their technical skill, they never grasp the business goals and context.

 Create a project charter

 A project charter requires teamwork, and your input covers at least the following:

  • A clear research goal
  • The project mission and context
  • How you’re going to perform your analysis
  • What resources you expect to use
  • Proof that it’s an achievable project, or proofs of concept
  • Deliverables and a measure of success
  • A timeline
