- Presenting findings to stakeholders after successful data analysis and model development.
- Automating models to meet the demand for repeatable predictions and insights.
- Implementing model scoring or creating applications for automatic updates of reports, Excel spreadsheets, or PowerPoint presentations.
- Emphasizing the importance of soft skills in the final stage of data science.
- Recommendation: Find dedicated books and information on the subject to enhance your skills.
Friday, 2 February 2024
Modeling - Build the models
Model Building Process
- Clean data and a good understanding of its content are crucial.
- Goals include better predictions, object classification, and system understanding.
- More focused phase than exploratory analysis.
- You already know what you are looking for and what outcome you want.
- The figure below illustrates the components of model building.
Building a model is an iterative process. The way you build your model depends on whether you go with classic statistics or the somewhat more recent machine learning school, and the type of technique you want to use. Either way, most models consist of the following main steps:
1. Model and variable selection
- Selecting variables and modeling technique based on exploratory analysis findings.
- Judgment required to choose the right model for a problem.
- Consideration of model performance and project requirements.
- Factors to consider: model's suitability for production environment, maintenance challenges, and model's ease of explanation.
- Once these choices are made, the next step is to implement the model.
2. Model execution
Once you’ve chosen a model, you’ll need to implement it in code. Here are two examples.
Example1:
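A minimal sketch of executing a linear regression, assuming synthetic data and a hand-written ordinary least squares fit for a single predictor. The function name `fit_simple_ols`, the true coefficients, and the noise level are illustrative assumptions, not taken from the original post.

```python
import random


def fit_simple_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data (assumption): y = 1.0 + 0.5 * x plus Gaussian noise.
random.seed(0)
xs = [random.random() for _ in range(500)]
ys = [1.0 + 0.5 * x + random.gauss(0, 0.1) for x in xs]

intercept, slope = fit_simple_ols(xs, ys)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

In practice a library such as statsmodels or scikit-learn would usually do the fitting; the hand-written version is used here only to keep the sketch dependency-free.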
The code above shows how a linear regression model is executed.
Example2:
3. Model diagnostics and model comparison
- Several models are built, and the best one is chosen based on multiple criteria.
- Holdout sample is used to evaluate the model after building.
- The model should work on unseen data.
- Only a fraction of the data is used for model estimation.
- The model is then unleashed on unseen data and error measures calculated.
- Multiple error measures are available, with the mean squared error being a common choice.
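The holdout procedure described above can be sketched as follows. This is a minimal illustration, assuming synthetic data, an 80/20 split, and two candidate models (a fitted line and a mean-only baseline); the names `fit_line` and `mean_squared_error` are hypothetical helpers written for this sketch.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(actual, predicted):
    """MSE = (1/n) * sum((y_i - yhat_i)^2)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
random.seed(1)
data = []
for _ in range(200):
    x = random.uniform(0, 10)
    data.append((x, 2.0 * x + 1.0 + random.gauss(0, 0.2)))

# Keep 20% of the rows aside as a holdout sample the models never see.
random.shuffle(data)
split = int(0.8 * len(data))
train, holdout = data[:split], data[split:]

train_x = [x for x, _ in train]
train_y = [y for _, y in train]
holdout_x = [x for x, _ in holdout]
holdout_y = [y for _, y in holdout]

# Model A: fitted line. Model B: always predict the training mean.
intercept, slope = fit_line(train_x, train_y)
line_predictions = [intercept + slope * x for x in holdout_x]
baseline_predictions = [sum(train_y) / len(train_y)] * len(holdout_y)

holdout_mse = mean_squared_error(holdout_y, line_predictions)
baseline_mse = mean_squared_error(holdout_y, baseline_predictions)
print(f"line MSE={holdout_mse:.4f}, baseline MSE={baseline_mse:.4f}")
```

The model with the lower error on the holdout sample is the better candidate by this criterion, although other criteria such as ease of explanation may still decide the final choice.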
Exploration - Exploratory data analysis
Exploratory Data Analysis Overview
- Deep dive into data using graphical techniques.
- Requires an open mind and open eyes to understand how variables interact.
- The goal is not to cleanse the data, but anomalies missed earlier are often discovered here.
- Such anomalies may require taking a step back to fix them.
Visualization Techniques in Data Analysis
- Techniques range from simple line graphs or histograms to complex diagrams such as Sankey and network graphs.
- Simple graphs can be combined into composite graphs for deeper insight into the data.
- Graphs can be animated or made interactive to make exploration easier and more enjoyable.
Interactive Data Exploration Techniques
- Combining plots for deeper insights.
- Overlaying several plots for better understanding.
- Using Pareto diagrams or 80-20 diagrams.
- Brushing and linking for automatic transfer of changes from one graph to another.
- For example, plotting the average score per country can indicate a high correlation between answers.
- Selecting points on one subplot highlights the corresponding points on the other graphs.
- Histogram: Cuts a variable into discrete categories (bins) and counts the occurrences in each category.
- Boxplot: Provides distribution within categories, showing maximum, minimum, median, and other characterizing measures.
- Techniques include visualization, tabulation, clustering, and other modeling techniques.
- Building simple models can also be part of exploratory analysis.
- After data exploration, move on to building models.
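The numbers behind the histogram and boxplot mentioned above can be computed without any plotting library. The sketch below is a minimal illustration; the `scores` values, the bin edges, and the helper names `histogram_counts` and `five_number_summary` are made up for this example.

```python
import statistics


def histogram_counts(values, bin_edges):
    """Count values per bin. Bins are [left, right); the last bin also
    includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    return counts


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum:
    the measures a boxplot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Illustrative data (assumption): survey scores on a 0-10 scale.
scores = [1, 2, 2, 3, 5, 8, 9, 10, 4, 6, 7, 7]

print("bin counts:", histogram_counts(scores, [0, 2, 4, 6, 8, 10]))
print("five-number summary:", five_number_summary(scores))
```

A plotting library would normally turn these counts and quartiles into the actual figures; computing them directly shows what each chart summarizes.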
Key objectives of EDA:
- Gaining familiarity with the data: This involves understanding the structure of the dataset, the data types of each variable, and any missing values present.
- Identifying patterns and trends: EDA helps uncover relationships between variables, outliers, and potential errors within the data.
- Formulating hypotheses: Based on the observations and insights gained, you can start forming hypotheses that you can later test through modeling or analysis.
- Guiding further analysis: EDA lays the groundwork for choosing the appropriate techniques for modeling, feature engineering, and data cleaning.
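Gaining familiarity with a dataset usually starts with a quick profile of its columns. The sketch below is one possible way to do this, assuming the data is held as a list of dictionaries with `None` marking missing values; the `records` and the `profile` helper are hypothetical.

```python
def profile(records):
    """Report the number of missing values and the observed data types
    for every column found in the records."""
    columns = sorted({key for row in records for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in records]
        present = [v for v in values if v is not None]
        report[col] = {
            "missing": len(values) - len(present),
            "types": sorted({type(v).__name__ for v in present}),
        }
    return report


# Illustrative records (assumption): None marks a missing value.
records = [
    {"country": "BE", "age": 34, "score": 7.5},
    {"country": "NL", "age": None, "score": 6.0},
    {"country": None, "age": 29, "score": None},
    {"country": "BE", "age": 41, "score": 8.0},
]

report = profile(records)
for column, info in report.items():
    print(column, info)
```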
Common steps involved in EDA:
- Data import and cleaning: This involves loading the data into your chosen environment and addressing any missing values, inconsistencies, or formatting issues.
- Univariate analysis: This step examines each variable individually, using summary statistics like mean, median, and standard deviation for numerical variables and frequency distributions for categorical variables. Visualizations like histograms, boxplots, and bar charts are helpful in understanding the distribution of each variable.
- Bivariate analysis: This step explores the relationships between two variables. Scatter plots, heatmaps, and correlation matrices are commonly used to visualize these relationships.
- Multivariate analysis: This step involves exploring the relationships between multiple variables simultaneously. Dimensionality reduction techniques such as principal component analysis (PCA) can be used for this purpose.
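The univariate and bivariate steps above can be sketched with the standard library alone. This is a minimal illustration; the `heights` and `weights` values are made up, and `pearson` is a hand-written helper for the Pearson correlation coefficient.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative data (assumption): heights in cm and weights in kg.
heights = [150, 160, 165, 172, 180, 185]
weights = [52, 58, 63, 70, 79, 84]

# Univariate analysis: summary statistics for one variable at a time.
summary = {
    "mean": statistics.fmean(heights),
    "median": statistics.median(heights),
    "stdev": statistics.stdev(heights),
}
print("height summary:", summary)

# Bivariate analysis: strength of the linear relationship between two variables.
r = pearson(heights, weights)
print(f"correlation(height, weight) = {r:.3f}")
```

A correlation close to 1 or -1 suggests a strong linear relationship worth inspecting in a scatter plot, while a value near 0 only rules out a linear one.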
Benefits of EDA:
- Improved data understanding: A thorough EDA provides a deep understanding of the data, its strengths, and weaknesses, allowing you to make informed decisions about further analysis.
- Enhanced data quality: By identifying and addressing data quality issues early on, you can ensure the reliability and accuracy of your results.
- More effective modeling: Understanding the data's characteristics helps you choose the most appropriate modeling techniques and avoid common pitfalls.
- Clearer communication: EDA findings can be effectively communicated to stakeholders through data visualizations and reports, fostering better collaboration and project understanding.