Monday, 29 January 2024

Data Science Process

Data science is mostly applied in the context of an organization. When the business asks you to perform a data science project, you’ll first prepare a project charter. This charter contains information such as what you’re going to research, how the company benefits from that, what data and resources you need, a timetable, and deliverables. The process itself typically consists of six steps:

1. Setting the research goal: This initial step involves defining the specific problem or question you want to answer using data. It's crucial to have a clear and well-defined goal to guide the rest of the process.

2. Retrieving data: Once you know what you're looking for, you need to gather the relevant data. This can involve accessing existing data sources, designing and conducting surveys or experiments, or scraping data from the web.

3. Data preparation: Raw data is rarely ready for analysis, so this step involves cleaning, organizing, and formatting the data to make it suitable for modeling. This might include tasks like:

  • Data cleaning: Fixing errors, inconsistencies, and missing values.
  • Data integration: Combining data from multiple sources.
  • Data transformation: Converting data into a format compatible with your chosen analysis tools.
  • Feature engineering: Creating new features from existing data to improve the performance of your models.
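The four preparation tasks above can be sketched in a few lines of plain Python. This is only a minimal illustration: the records, the field names, and the choice to fill a missing age with the mean are all assumptions made for the example, not something prescribed by the process.

```python
# Hypothetical raw records; all field names and values are made up for illustration.
customers = [
    {"id": 1, "name": " Alice ", "age": "34"},
    {"id": 2, "name": "bob", "age": ""},      # missing age
    {"id": 3, "name": "Carol", "age": "41"},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 3, "amount": 50.0},
]


def clean(record):
    """Data cleaning and transformation: tidy names, convert age text to int or None."""
    age = record["age"].strip()
    return {
        "id": record["id"],
        "name": record["name"].strip().title(),
        "age": int(age) if age else None,
    }


cleaned = [clean(r) for r in customers]

# Fill missing ages with the mean of the known ages (one of several possible strategies).
known_ages = [r["age"] for r in cleaned if r["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
for r in cleaned:
    if r["age"] is None:
        r["age"] = mean_age

# Data integration: bring order totals from a second source onto each customer.
totals = {}
for order in orders:
    cid = order["customer_id"]
    totals[cid] = totals.get(cid, 0.0) + order["amount"]

# Feature engineering: derive new columns from the combined data.
prepared = [
    {**r, "total_spent": totals.get(r["id"], 0.0), "is_buyer": r["id"] in totals}
    for r in cleaned
]

for row in prepared:
    print(row)
```

In practice a library such as pandas would usually handle these steps, but the logic stays the same.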

4. Data exploration: This is where you start to get a feel for the data by analyzing its properties and identifying patterns, trends, and relationships. Exploratory data analysis (EDA) can involve techniques like:

  • Descriptive statistics: Summarizing the data using measures like mean, median, and standard deviation.
  • Data visualization: Creating charts and graphs to represent the data visually.
  • Correlation analysis: Identifying relationships between different variables.
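As a rough sketch of these three techniques, the snippet below uses Python's standard `statistics` module, a hand-written Pearson correlation, and a text-only bar chart. The study-hours and exam-score numbers are invented for the example.

```python
import statistics

# Hypothetical data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

# Descriptive statistics.
summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),
}
print(summary)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Correlation analysis: a value near 1 suggests a strong positive linear relationship.
r = pearson(hours, scores)
print(f"correlation = {r:.3f}")

# Data visualization, reduced to a text bar chart (one '#' per 5 points).
for h, s in zip(hours, scores):
    print(f"{h}h | {'#' * (s // 5)} {s}")
```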

5. Data modeling: This step involves using the prepared data to build a model that can answer your research question or make predictions. There are many different types of data models, such as:

  • Regression models: Used to predict a continuous outcome variable based on one or more predictor variables.
  • Classification models: Used to predict a categorical outcome variable.
  • Clustering algorithms: Used to group similar data points together.
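To make the idea of a regression model concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares. The data points are made up, and real projects would normally rely on a library rather than a hand-written fit.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Predict a continuous outcome for a new predictor value."""
    return slope * x + intercept


print(slope, intercept, predict(5))
```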

6. Presentation and automation: Finally, you need to communicate your findings to others and, if applicable, deploy your model into production. This might involve:

  • Creating reports and presentations: Summarizing your results and insights in a clear and concise way.
  • Developing dashboards and visualizations: Making your results more accessible and interactive.
  • Deploying the model: Integrating your model into a production environment to make predictions on new data.
 
 Reference:
 
 Davy Cielen, Arno D. B. Meysman, Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016

Facets of data

 In data science and big data you’ll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these:

  1. Structured data
  2. Unstructured data
  3. Natural language data
  4. Machine-generated data
  5. Graph-based data
  6. Audio, video, and images data
  7. Streaming data

Let’s explore all these interesting data types.

1. Structured data

  • Structured data is data that depends on a data model and resides in a fixed field within a record.
  • Because of this, it is often easy to store structured data in tables within databases or Excel files.
  • SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases.
  • You may also come across structured data that is difficult to store in a conventional relational database.
  • One example is hierarchical data, such as a family tree.
  • The world isn’t made up of structured data, though; structure is imposed upon it by humans and machines. More often, data comes unstructured.
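As a small illustration of structured data and SQL, the sketch below stores records in a table and queries them using Python's built-in `sqlite3` module. The `employees` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Each record has fixed fields defined by the table schema (the data model).
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [("Ann", "Sales", 50000), ("Bo", "Sales", 60000), ("Cy", "IT", 70000)],
)

# Because the structure is known in advance, aggregate questions are easy to ask.
rows = conn.execute(
    "SELECT department, AVG(salary) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()

for department, avg_salary in rows:
    print(department, avg_salary)

conn.close()
```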


2. Unstructured data 

  • Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying.
  • One example of unstructured data is your regular email (figure 1.2).
  • Although email contains structured elements such as the sender, title, and body text, it’s a challenge to find the number of people who have written an email complaint about a specific employee because so many ways exist to refer to a person, for example.
  • The thousands of different languages and dialects out there further complicate this.
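The email example can be sketched with Python's standard `email` module. The message, the addresses, and the name patterns below are invented; the point is only that the headers parse cleanly while the free-text body needs fragile guesswork.

```python
import re
from email import message_from_string

# A hypothetical complaint email.
raw = """From: customer@example.com
To: support@example.com
Subject: Complaint about service

I spoke with Mr. Smith yesterday. John was rude, and J. Smith never called back.
"""

msg = message_from_string(raw)

# The structured elements are easy to extract.
sender = msg["From"]
subject = msg["Subject"]
body = msg.get_payload()

# The unstructured body is harder: one employee can be named in many ways.
# These patterns are a naive guess and would miss nicknames, typos, or other languages.
name_patterns = [r"\bJohn Smith\b", r"\bMr\.? Smith\b", r"\bJ\. Smith\b", r"\bJohn\b"]
mentions = [p for p in name_patterns if re.search(p, body)]

print(sender, "|", subject)
print("patterns that matched:", mentions)
```

Even in this tiny message, three different spellings refer to the same person, and the most obvious pattern ("John Smith") never appears.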



3. Natural language data

  • Natural language is a special type of unstructured data; it’s challenging to process because it requires knowledge of specific data science techniques and linguistics.
  • The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don’t generalize well to other domains.
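  • The concept of meaning itself is questionable here: natural language is ambiguous by nature, and even humans can read the same text and disagree about what it means.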
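The domain problem can be illustrated with a deliberately naive, lexicon-based sentiment scorer. The word lists are hypothetical and far too small for real use; the sketch only shows how a rule that works for customer reviews can misread slang from another domain.

```python
import re

# Tiny hypothetical lexicons, as might be built from customer-service reviews.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "terrible", "sick"}


def sentiment(text):
    """Return (positive word count) minus (negative word count)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


review_score = sentiment("The service was great")
slang_score = sentiment("That guitar solo was sick")

print(review_score)  # positive, as intended
print(slang_score)   # negative, although the slang is meant as praise
```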


4. Machine-generated data

  • Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention.
  • Machine-generated data is becoming a major data resource and will continue to be one.
  • The analysis of machine data relies on highly scalable tools because of its high volume and speed.
  • Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
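As a sketch of working with machine-generated data, the snippet below parses web server log lines and counts HTTP status codes. The log lines are invented and assume the widely used Common Log Format; real logs may use a different layout.

```python
import re
from collections import Counter

# Pattern for the assumed Common Log Format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical log lines, including one malformed entry.
log_lines = [
    '192.0.2.1 - - [29/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '192.0.2.2 - - [29/Jan/2024:10:00:01 +0000] "GET /missing HTTP/1.1" 404 512',
    '192.0.2.1 - - [29/Jan/2024:10:00:02 +0000] "POST /login HTTP/1.1" 200 256',
    "not a valid log line",
]

status_counts = Counter()
for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match is None:
        continue  # skip lines that do not fit the expected format
    status_counts[match.group("status")] += 1

print(dict(status_counts))
```

At production volumes the same parsing logic would typically run on a distributed or streaming platform rather than in a single loop.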

5. Graph-based data

Graph-based data represents entities and their relationships as nodes and edges in a graph. This makes it a powerful tool for modeling complex relationships between entities, such as social networks, financial transactions, and knowledge graphs.

For example, in a social network, people are represented as nodes and their friendships are represented as edges. This allows us to analyze things like the spread of information, the formation of communities, and the influence of individuals.
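A minimal sketch of the social network example: people are nodes, friendships are edges stored in an adjacency dictionary, and a breadth-first search measures how many friendship steps separate one person from everyone else. The names and friendships are made up.

```python
from collections import deque

# Hypothetical undirected friendship graph: node -> set of neighbouring nodes.
friends = {
    "Ann": {"Bob", "Cara"},
    "Bob": {"Ann", "Dan"},
    "Cara": {"Ann"},
    "Dan": {"Bob", "Eve"},
    "Eve": {"Dan"},
}


def degrees_of_separation(graph, start):
    """Breadth-first search returning the shortest hop count from start to each node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph[person]:
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return distance


distances = degrees_of_separation(friends, "Ann")
print(distances)
```

The hop counts give a simple proxy for how far information starting at one person must travel to reach another.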

6. Audio, video, and images data

Audio, video, and images are collectively known as multimedia data. This type of data is characterized by its rich and complex nature, and it can be challenging to store, process, and analyze. However, it also has the potential to provide valuable insights that other types of data cannot.

Here are some examples of how multimedia data is used:

  • Computer vision: Analyzing images and videos to understand the content, such as identifying objects, people, and actions.
  • Speech recognition: Converting spoken language into text.
  • Natural language processing: Understanding the meaning of text and speech.
  • Medical imaging: Analyzing medical images to diagnose diseases.
  • Entertainment: Creating movies, games, and other forms of entertainment.

7. Streaming data

Streaming data is generated continuously and in real time. This type of data is becoming increasingly common because of the growth of Internet of Things (IoT) devices and other sensors that generate data constantly.

Here are some examples of how streaming data is used:

  • Fraud detection: Analyzing financial transactions in real-time to identify fraudulent activity.
  • Traffic monitoring: Monitoring traffic flows in real-time to optimize traffic management.
  • Social media analysis: Analyzing social media posts in real-time to understand public opinion and trends.
  • Industrial automation: Monitoring and controlling industrial processes in real-time.
  • Scientific research: Collecting and analyzing data from scientific experiments in real-time.
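The defining trait of streaming data is that values are processed as they arrive rather than after the whole dataset is available. The sketch below simulates this with Python generators and a sliding-window average; the sensor readings and the window size are assumptions for illustration.

```python
from collections import deque


def sensor_readings():
    """Stand-in for a live sensor feed; yields one hypothetical reading at a time."""
    for value in [20.0, 21.0, 19.0, 30.0, 22.0]:
        yield value


def moving_average(stream, window=3):
    """Yield the average of the most recent `window` values after each new reading."""
    recent = deque(maxlen=window)  # old values drop out automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# Each average is available as soon as its reading arrives.
averages = []
for avg in moving_average(sensor_readings(), window=3):
    averages.append(avg)
    print(f"{avg:.2f}")
```

Because only the last few values are kept in memory, the same approach works on a stream that never ends.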

 Reference:

Davy Cielen, Arno D. B. Meysman, Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016


Benefits and uses of data science and big data

Data science and big data are used almost everywhere in both commercial and noncommercial settings.

Commercial companies in almost every industry use data science and big data to gain insights into their customers, processes, staff, competition, and products.

Many companies use data science to offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their offerings.

A good example of this is Google AdSense, which collects data from internet users so relevant commercial messages can be matched to the person browsing the internet.

Data Science:

Benefits:

  • Improved decision-making: Data-driven insights can help businesses make better decisions across all levels, from strategic planning to marketing campaigns.
  • Increased efficiency and productivity: Automation and optimization based on data analysis can streamline processes and free up resources for higher-value tasks.
  • Enhanced customer understanding: Analyzing customer data allows businesses to personalize experiences, tailor marketing efforts, and predict customer behavior.
  • Innovation and new product development: Data insights can reveal previously unknown trends and patterns, leading to the development of new products and services.
  • Risk management and fraud detection: Identifying patterns in data can help detect fraudulent activity and prevent financial losses.

Uses:

  • Predictive maintenance: Analyzing sensor data from equipment can predict and prevent failures, reducing downtime and maintenance costs.
  • Personalized medicine: Analyzing medical data can help tailor treatment plans for individual patients and improve healthcare outcomes.
  • Fraud detection: Identifying patterns in financial transactions can help detect and prevent financial fraud.
  • Sentiment analysis: Analyzing social media data and customer reviews can provide insights into public perception and brand sentiment.
  • Targeted advertising: Data analysis can help personalize advertising campaigns and increase their effectiveness.

Big Data:

Benefits:

  • Scalability: Big data systems can handle massive amounts of data, making them well-suited for applications with large datasets.
  • Velocity: Big data technologies can process data in real-time, enabling faster insights and quicker decision-making.
  • Variety: Big data systems can handle diverse data formats, from structured databases to unstructured social media posts.
  • Veracity: Big data tools can help filter and clean noisy data, improving the accuracy of insights.

Uses:

  • Real-time traffic management: Analyzing traffic data in real-time can help optimize traffic flow and reduce congestion.
  • Cybersecurity: Analyzing network data can help detect and prevent cyberattacks.
  • Weather forecasting: Analyzing weather data from various sources can improve the accuracy of weather forecasts.
  • Smart cities: Big data can be used to manage and optimize city infrastructure, such as energy grids and transportation systems.
  • Scientific research: Analyzing large datasets can lead to new discoveries in scientific fields like genomics and astronomy.

 
 Reference:

1. Davy Cielen, Arno D. B. Meysman, Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016
 
