Friday 2 February 2024

Retrieving Data


 Data Science Steps: Retrieving Required Data

  • You may need to design a data collection process yourself.
  • Often the company has already collected and stored the data for you.
  • Data the company does not have can often be purchased from third parties.
  • Don't hesitate to seek data outside your organization.
  • More organizations are making high-quality data freely available for public and commercial use.

Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need. This may be difficult, and even if you succeed, data is often like a diamond in the rough: it needs polishing to be of any use to you.

1. Start with data stored within the company

Assessing Data Relevance and Quality

  • Assess the quality and relevance of available data within the company.
  • Many companies run a data maintenance program, which reduces the cleaning work left for you.
  • Data can be stored in official repositories like databases, data marts, data warehouses, and data lakes.
  • Databases are built primarily for storing data, data warehouses for reading and analyzing it, and data marts are subsets of a data warehouse that serve a specific business unit.
  • Data lakes hold data in its raw format, while data warehouses and data marts hold preprocessed data.
  • Some data may still live only in Excel files on a domain expert's desktop.
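
As a rough illustration of pulling data out of an internal repository, the sketch below reads every row of a table into a list of dictionaries. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a company database; the table name `sales`, its columns, and the helper `fetch_rows` are hypothetical and exist only for this example.

```python
import sqlite3


def fetch_rows(conn, table):
    """Return all rows of `table` as a list of dicts (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so check the name
    # against the database catalog before using it in the query text.
    known = {
        r[0]
        for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]  # column names from the cursor
    return [dict(zip(columns, row)) for row in cur]


# Stand-in for a company database: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.5), ("south", 80.0)],
)
rows = fetch_rows(conn, "sales")
```

A real data warehouse or data mart would need its own driver and credentials; the shape of the task (connect, query, collect rows with their column names) stays much the same.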

Data Management Challenges in Companies

  • Data becomes scattered across many places as a company grows.
  • Knowledge of the data disperses as people change positions or leave the company.
  • Documentation and metadata are not always treated as a priority.
  • Finding lost data can take Sherlock Holmes-like skills.

Data Access Challenges

  • Organizations often have policies in place so that each person can access only the data they need.
  • These policies translate into physical and digital barriers known as "Chinese walls."
  • These "walls" are mandatory and well-regulated for customer data in most countries.
  • Accessing data can be time-consuming and influenced by company politics.

2. Don’t be afraid to shop around

Data Sharing and Its Importance

  • Companies like Nielsen and GfK specialize in collecting valuable information.
  • Twitter, LinkedIn, and Facebook provide data so that you, in turn, can enrich their services and ecosystem.
  • A growing number of governments and organizations share their data for free, covering a broad range of topics.
  • This data is useful for enriching proprietary data and for practicing your data science skills at home.
  • Table 2.1 shows a small selection from the growing number of open-data providers.

3. Do data quality checks now to prevent problems later

Data Science Project Overview

  • Expect to spend a large share of project time on data correction and cleansing, sometimes up to 80%.
  • Data retrieval is the first time you inspect the data in the data science process.
  • Most errors at the retrieval stage are easy to spot, but carelessness here can cost many hours fixing issues that could have been prevented during import.
  • You investigate the data during the import, data preparation, and exploratory phases; what differs is the goal and depth of the investigation.
  • During data retrieval, you check whether the data is equal to the data in the source document and whether the data types are right.
  • Data preparation involves a more detailed check, aiming to eliminate typos and data entry errors.
  • The exploratory phase focuses on learning from the data, examining statistical properties like distributions, correlations, and outliers.
  • Iterating over these phases is common; for example, an outlier found during exploration can point to a data entry error.
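
The retrieval-time check described above (does the imported data match the source, and are the data types right?) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a complete validation routine: the function `check_import`, the CSV layout, and the column names are all hypothetical, and the check only compares row counts and value types.

```python
import csv
import io


def check_import(source_text, imported_rows, expected_types):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Check 1: the import should contain as many rows as the source file.
    source_rows = list(csv.DictReader(io.StringIO(source_text)))
    if len(source_rows) != len(imported_rows):
        problems.append(
            f"row count: source has {len(source_rows)}, "
            f"import has {len(imported_rows)}"
        )

    # Check 2: every imported value should have the expected Python type.
    for i, row in enumerate(imported_rows):
        for column, expected in expected_types.items():
            value = row.get(column)
            if not isinstance(value, expected):
                problems.append(
                    f"row {i}, column {column!r}: expected "
                    f"{expected.__name__}, got {type(value).__name__}"
                )
    return problems


# Hypothetical source file and a faulty import: the second amount stayed a string.
source = "region,amount\nnorth,120.5\nsouth,80.0\n"
imported = [
    {"region": "north", "amount": 120.5},
    {"region": "south", "amount": "80.0"},
]
problems = check_import(source, imported, {"region": str, "amount": float})
```

Here `problems` holds a single entry flagging the string value in the `amount` column of the second row. Running a check like this right after import is cheap, and it catches type mismatches before they surface in the preparation or exploratory phases.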

