Sunday, 11 February 2024

Getting Data in and out of R

1. Reading and Writing Data

There are a few principal functions for reading data into R.

  • read.table, read.csv, for reading tabular data
  • readLines, for reading lines of a text file
  • source, for reading in R code files (inverse of dump)
  • dget, for reading in R code files (inverse of dput)
  • load, for reading in saved workspaces
  • unserialize, for reading single R objects in binary form

There are, of course, many R packages that have been developed to read in all kinds of other datasets, and you may need to resort to one of these packages if you are working in a specific area.

There are analogous functions for writing data to files:

  • write.table, for writing tabular data to text files (e.g. CSV) or connections

  • writeLines, for writing character data line-by-line to a file or connection

  • dump, for dumping a textual representation of multiple R objects

  • dput, for outputting a textual representation of an R object

  • save, for saving an arbitrary number of R objects in binary format (possibly compressed) to a file.

  • serialize, for converting an R object into a binary format for outputting to a connection (or file).
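
To make the read/write pairs above concrete, here is a minimal sketch of two round trips: dput()/dget() for a single object in text form, and save()/load() for several objects in binary form. The object names and temporary file paths are assumptions chosen only for illustration.

```r
# Round trip of a single object through its textual representation
x <- list(a = 1:3, b = "text")
txt_file <- tempfile(fileext = ".R")
dput(x, file = txt_file)
x_text <- dget(txt_file)

# Round trip of several objects through one binary file
y <- c(2.5, 3.5)
bin_file <- tempfile(fileext = ".rda")
save(x, y, file = bin_file)
rm(x, y)
load(bin_file)  # restores x and y under their original names

identical(x_text, x)
```

Note that dget() returns the object so you can assign it to any name, while load() recreates the objects under the names they had when saved.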

2. Reading Data Files with read.table()

The read.table() function is one of the most commonly used functions for reading data. The help file for read.table() is worth reading in its entirety if only because the function gets used a lot (run ?read.table in R). I know, I know, everyone always says to read the help file, but this one is actually worth reading.

The read.table() function has a few important arguments:

  • file, the name of a file, or a connection
  • header, logical indicating if the file has a header line
  • sep, a string indicating how the columns are separated
  • colClasses, a character vector indicating the class of each column in the dataset
  • nrows, the number of rows in the dataset. By default read.table() reads an entire file.
  • comment.char, a character string indicating the comment character. This defaults to "#". If there are no commented lines in your file, it’s worth setting this to be the empty string "".
  • skip, the number of lines to skip from the beginning
  • stringsAsFactors, should character variables be coded as factors? Before R 4.0.0 this defaulted to TRUE because back in the old days, if you had data that were stored as strings, it was because those strings represented levels of a categorical variable. Now we have lots of data that is text data and they don’t always represent categorical variables, and since R 4.0.0 the default is FALSE. In older versions of R you may want to set this to FALSE explicitly in those cases, or set a global option via options(stringsAsFactors = FALSE). I’ve never seen so much heat generated on discussion forums about an R function argument as about the stringsAsFactors argument. Seriously.

For small to moderately sized datasets, you can usually call read.table() without specifying any other arguments:

> data <- read.table("foo.txt")

In this case, R will automatically

  • skip lines that begin with a #
  • figure out how many rows there are (and how much memory needs to be allocated)
  • figure what type of variable is in each column of the table.

Telling R all these things directly makes R run faster and more efficiently. The read.csv() function is identical to read.table except that some of the defaults are set differently (like the sep argument).
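
As an illustration of telling R these things directly, the sketch below writes a tiny whitespace-separated file to a temporary path and reads it back with the arguments spelled out. The file contents and column names are made up for the example.

```r
# Write a small example file so the sketch is self-contained
path <- tempfile(fileext = ".txt")
writeLines(c("id score label",
             "1 2.5 a",
             "2 3.0 b",
             "3 4.5 c"), path)

# Spell out what R would otherwise have to work out for itself
dat <- read.table(path,
                  header = TRUE,
                  sep = "",            # any whitespace (the read.table default)
                  colClasses = c("integer", "numeric", "character"),
                  nrows = 3,
                  comment.char = "",   # the file has no commented lines
                  stringsAsFactors = FALSE)
str(dat)
```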

3. Reading in Larger Datasets with read.table

With much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking.

  • Read the help page for read.table, which contains many hints

  • Make a rough calculation of the memory required to store your dataset (see the next section for an example of how to do this). If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.

  • Set comment.char = "" if there are no commented lines in your file.

  • Use the colClasses argument. Specifying this option instead of using the default can make read.table() run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are “numeric”, for example, then you can just set colClasses = "numeric". A quick and dirty way to figure out the classes of each column is the following:

> initial <- read.table("datatable.txt", nrows = 100)
> classes <- sapply(initial, class)
> tabAll <- read.table("datatable.txt", colClasses = classes)

  • Set nrows. This doesn’t make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool wc to calculate the number of lines in a file.

In general, when using R with larger datasets, it’s also useful to know a few things about your system.

  • How much memory is available on your system?
  • What other applications are in use? Can you close any of them?
  • Are there other users logged into the same system?
  • What operating system are you using? Some operating systems can limit the amount of memory a single process can access.

4. Calculating Memory Requirements for R Objects

Because R stores all of its objects in physical memory, it is important to be cognizant of how much memory is being used up by all of the data objects residing in your workspace. One situation where it’s particularly important to understand memory requirements is when you are reading a new dataset into R. Fortunately, it’s easy to make a back-of-the-envelope calculation of how much memory will be required by a new dataset.

For example, suppose I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame? Well, on most modern computers double precision floating point numbers are stored using 64 bits of memory, or 8 bytes. Given that information, you can do the following calculation

1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes
                                  = 1,440,000,000 / 2^20 bytes/MB
                                  = 1,373.29 MB
                                  = 1.34 GB

So the dataset would require about 1.34 GB of RAM. Most computers these days have at least that much RAM. However, you need to be aware of

  • what other programs might be running on your computer, using up RAM
  • what other R objects might already be taking up RAM in your workspace

Reading in a large dataset for which you do not have enough RAM is one easy way to freeze up your computer (or at least your R session). This is an unpleasant experience that usually requires you to kill the R process, in the best case scenario, or reboot your computer, in the worst case. So make sure to do a rough calculation of memory requirements before reading in a large dataset.
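
The back-of-the-envelope calculation above can be reproduced in R itself. This is only a sketch: the variable names are arbitrary, and the estimate covers the numeric values alone, not the small overhead R adds per object.

```r
rows <- 1500000
cols <- 120
bytes_per_numeric <- 8  # double precision: 64 bits

bytes <- rows * cols * bytes_per_numeric
mb <- bytes / 2^20
gb <- mb / 2^10
round(c(bytes = bytes, MB = mb, GB = gb), 2)

# For an object that already exists, object.size() reports its actual footprint
small <- numeric(1000)
object.size(small)
```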

Saturday, 10 February 2024

ANU UG/Degree 4th Sem Exam Fee Notification April 2023

The ANU UG/Degree 4th Sem Exam Fee Notification April 2023 has been released (Acharya Nagarjuna University Degree 4th Semester exam fee notification 2024). The last date for payment of the exam fee without fine is 14.03.2023.




Download official notification from below



Acharya Nagarjuna University UG 3rd Semester Results 2023 Out Now! Check Regular & Supply Exam Marks Here!

ANU UG 3rd Semester Regular & Supply Exam November 2023 results are now available. Candidates can check their results from here.

Calling all Nagarjuna University UG students! The wait is finally over! The results for the 3rd Semester Regular & Supply Examinations held in November 2023 have been officially declared.

Here's what you'll learn:

  • How to access your results: We'll provide you with the official links and guide you through the process of checking your marks.
  • Important dates and deadlines: Get all the crucial information about revaluation, mark verification, and other important procedures.
  • Tips for analyzing your results: Understand your performance and strategize for future semesters.
  • Additional resources: Find helpful links to the university website, exam cell contact details, and other relevant information.

Don't wait any longer! Head over to the official website of Acharya Nagarjuna University to check your results now:

Official Results Link: UG 3rd Semester Results 2023 Out Now! Check Here

Pro Tip: Remember to have your hall ticket number handy while accessing your results.

Here are some additional tips for analyzing your results:

  • Compare your marks with the subject average: This will give you an idea of how you performed compared to your peers.
  • Identify your strengths and weaknesses: Focus on the subjects where you excelled and the ones that require improvement.
  • Seek guidance from your professors: Don't hesitate to consult your professors if you have any questions or concerns about your performance.

Remember: These results are an opportunity to learn and grow. Use them to understand your strengths and weaknesses, set goals for the future, and make the most of your academic journey at Acharya Nagarjuna University!

We wish you all the best!

P.S. Share this blog post with your fellow ANU UG students! Let everyone know the results are out!

 

Friday, 2 February 2024

ANU UG/Degree 2nd & 4th Sem RV Results July/Aug 2023 @ Available Now

 Acharya Nagarjuna University (ANU) has released the revaluation results for the UG/Degree 2nd and 4th Sem examinations held in July/August 2023. Students who were not satisfied with their initial results can now check their revaluation results online.

Key Points

  • ANU UG/Degree 2nd and 4th Sem Revaluation Results July/Aug 2023 are available online
  • Students can check their revaluation results on the university's website or through the official mobile app
  • The revaluation results are available in PDF format
  • Students can also download their revaluation mark sheets online

How to Check Revaluation Results

Students can follow these steps to check their revaluation results online:

  1. Go to the link given below
  2. Click on the 'Results' tab
  3. Select 'UG/Degree' from the drop-down menu
  4. Select 'Revaluation' from the drop-down menu
  5. Select '2nd Sem' or '4th Sem' from the drop-down menu
  6. Enter your roll number and date of birth
  7. Click on 'Submit'

The revaluation results will be displayed on the screen. Students can also download their revaluation mark sheets online.

 Acharya Nagarjuna University ANU UG/Degree 2nd  Sem Revaluation Results July/Aug 2023

 Acharya Nagarjuna University ANU UG/Degree 4th Sem Revaluation Results July/Aug 2023

 

Data Presentation and Automation - Presenting findings and building applications on top of them

  • Presenting findings to stakeholders after successful data analysis and model development.
  • Automating models to meet the demand for repeatable predictions and insights
  • Implementing model scoring or creating applications for automatic updates of reports, Excel spreadsheets, or PowerPoint presentations.
  • Emphasizing the importance of soft skills in the final stage of data science.
  • Recommendation: Find dedicated books and information on the subject to enhance your skills.


Modeling - Build the models

 Model Building Process

  1. Clean data and understanding of content are crucial.
  2. Goals include better predictions, object classification, and system understanding.
  3. Focused phase compared to exploratory analysis.
  4. Outcomes determined by desired outcomes.
  5. The figure below illustrates model building components.


 Building a model is an iterative process. The way you build your model depends on whether you go with classic statistics or the somewhat more recent machine learning school, and the type of technique you want to use. Either way, most models consist of the following main steps:

 1. Model and variable selection

  • Selecting variables and modeling technique based on exploratory analysis findings.
  • Judgment required to choose the right model for a problem.
  • Consideration of model performance and project requirements.
  • Factors to consider: model's suitability for production environment, maintenance challenges, and model's ease of explanation.
  • Action required once the model is developed.

2. Model execution

Once you’ve chosen a model you’ll need to implement it in code. Here are two examples.

Example1:
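
The original listing for this example is not reproduced on this page. As a stand-in, here is a minimal sketch in R that fits a linear regression with lm(). The simulated data, the variable names x1, x2 and y, and the coefficients are assumptions made up for illustration, not the original example.

```r
set.seed(42)

# Simulated predictors and a target that depends linearly on them
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.1)

# Fit the model and inspect coefficients and fit statistics
fit <- lm(y ~ x1 + x2)
summary(fit)
```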


In the above code we provided how a linear regression model will be executed.

Example2:


 3. Model diagnostics and model comparison

  • Multiple models are built and chosen based on multiple criteria.
  • Holdout sample is used to evaluate the model after building.
  • The model should work on unseen data.
  • Only a fraction of the data is used for model estimation.
  • The model is then unleashed on unseen data and error measures calculated.
  • Multiple error measures are available, with the mean squared error being a common choice.
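
The holdout idea in the list above can be sketched as follows: part of the data is set aside, the model is estimated on the rest, and the mean squared error is computed on the unseen part. The simulated data and the 70/30 split are assumptions for illustration only.

```r
set.seed(1)

# Simulated data with a known linear relationship
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- 3 * dat$x + rnorm(n, sd = 0.2)

# Use 70% of the rows for estimation, keep the rest as a holdout sample
train_idx <- sample(seq_len(n), size = 70)
train <- dat[train_idx, ]
holdout <- dat[-train_idx, ]

# Estimate on the training rows, then score the unseen rows
fit <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = holdout)
mse <- mean((holdout$y - pred)^2)
mse
```

Comparing this holdout error across candidate models is one simple way to choose between them.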

Exploration - Exploratory data analysis

 Exploratory Data Analysis Overview

  • Deep dive into data using graphical techniques.
  • Uses open mind and eyes for understanding data interactions.
  • Aims to discover anomalies not previously identified.
  • May require taking a step back to fix anomalies that were missed earlier.


 Visualization Techniques in Data Analysis

  • Uses range from simple line graphs or histograms to complex diagrams like Sankey and network graphs.
  • Composes composite graphs for deeper data insight.
  • Animates or makes interactive graphs for ease and enjoyment.

 Interactive Data Exploration Techniques

  • Combining plots for deeper insights.
  • Overlaying several plots for better understanding.
  • Using Pareto diagrams or 80-20 diagrams.
  • Brushing and linking for automatic transfer of changes from one graph to another.
  • High correlation between answers indicated by average score per country.
  • Selection of points on subplots corresponds to similar points on other graphs.
  • Histogram: Categorizes variables into discrete categories, summarizing occurrences in each category.
  • Boxplot: Provides distribution within categories, showing maximum, minimum, median, and other characterizing measures.
  • Techniques include visualization, tabulation, clustering, and other modeling techniques.
  • Building simple models can also be part of exploratory analysis.
  • After data exploration, move on to building models.

Key objectives of EDA:

  • Gaining familiarity with the data: This involves understanding the structure of the dataset, the data types of each variable, and any missing values present.
  • Identifying patterns and trends: EDA helps uncover relationships between variables, outliers, and potential errors within the data.
  • Formulating hypotheses: Based on the observations and insights gained, you can start forming hypotheses that you can later test through modeling or analysis.
  • Guiding further analysis: EDA lays the groundwork for choosing the appropriate techniques for modeling, feature engineering, and data cleaning.

Common steps involved in EDA:

  1. Data import and cleaning: This involves loading the data into your chosen environment and addressing any missing values, inconsistencies, or formatting issues.
  2. Univariate analysis: This step examines each variable individually, using summary statistics like mean, median, and standard deviation for numerical variables and frequency distributions for categorical variables. Visualizations like histograms, boxplots, and bar charts are helpful in understanding the distribution of each variable.
  3. Bivariate analysis: This step explores the relationships between two variables. Scatter plots, heatmaps, and correlation matrices are commonly used to visualize these relationships.
  4. Multivariate analysis: This step involves exploring the relationships between multiple variables simultaneously. Techniques like principal component analysis (PCA) and other dimensionality reduction methods can be used for this purpose.
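
The four steps above can be sketched on R's built-in mtcars dataset. The choice of dataset and of the columns examined is an assumption made for illustration; a real analysis would follow the questions of the project.

```r
# 1. Import and basic checks: structure and missing values
data(mtcars)
str(mtcars)
colSums(is.na(mtcars))

# 2. Univariate analysis: summary statistics and a frequency table
summary(mtcars$mpg)
table(mtcars$cyl)

# 3. Bivariate analysis: correlation between two variables
cor(mtcars$mpg, mtcars$wt)

# 4. Multivariate analysis: PCA on a few standardized numeric columns
pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")], scale. = TRUE)
summary(pca)
```

Plots such as hist(mtcars$mpg) or plot(mtcars$wt, mtcars$mpg) would normally accompany each step.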

Benefits of EDA:

  • Improved data understanding: A thorough EDA provides a deep understanding of the data, its strengths, and weaknesses, allowing you to make informed decisions about further analysis.
  • Enhanced data quality: By identifying and addressing data quality issues early on, you can ensure the reliability and accuracy of your results.
  • More effective modeling: Understanding the data's characteristics helps you choose the most appropriate modeling techniques and avoid common pitfalls.
  • Clearer communication: EDA findings can be effectively communicated to stakeholders through data visualizations and reports, fostering better collaboration and project understanding.

Data Preparation - Cleansing, integrating, and transforming data

 Data Retrieval Phase and Modeling

  • Data from retrieval phase is often "diamond in the rough."
  • Sanitization and preparation are crucial for better performance and less time spent on output correction.
  • Data transformation is necessary for the model to fit specific data formats.
  • Early correction of data errors is recommended.
  • Corrective actions may be necessary in realistic settings.
  • The figure below shows common actions during data cleansing, integration, and transformation.

1. Data Cleaning 

Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.

1.1. Data Entry Errors Overview

  • Data collection and entry are error-prone processes requiring human intervention.
  • Human errors can include typos or loss of concentration.
  • Machine data collection also faces errors due to human sloppiness or machine or hardware failure.
  • Examples include transmission errors and bugs in the extract, transform, and load phase (ETL).
  • Hand-checking every value is recommended for small data sets.
  • Data errors can be detected by tabulating data with counts.
  • Frequency tables can be created for variables with only two values.
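
As a sketch of the tabulation check described above, the example below counts the values of a variable that should only contain "Good" and "Bad". The data and the misspellings are invented for illustration.

```r
# A variable that should contain only two values, with two typos
quality <- c("Good", "Bad", "Good", "Godo", "Bad", "Good", "Bade")

# The frequency table exposes values that occur unexpectedly
counts <- table(quality)
counts

# Anything outside the allowed set is a candidate data entry error
unexpected <- setdiff(names(counts), c("Good", "Bad"))
unexpected
```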

1.2. Outliers in Data Analysis

  • Outliers are observations that seem distant from others or follow a different logic or generative process.
  • Finding outliers is easy using plots or tables with minimum and maximum values.
  • An example is provided where a normal distribution (Gaussian distribution) is expected, showing high values in the bottom graph.
  • Outliers can significantly influence data modeling, so it's crucial to investigate them first.
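
A minimal sketch of spotting an outlier with minimum and maximum values and with the boxplot rule. The simulated values and the injected value 25 are assumptions for illustration; boxplot.stats() flags points more than 1.5 times the interquartile range beyond the quartiles, which is one convention among several.

```r
set.seed(7)

# Roughly normal values around 10, plus one suspicious observation
values <- c(rnorm(50, mean = 10, sd = 1), 25)

# Minimum and maximum already hint at the problem
range(values)

# Points flagged by the boxplot rule
out <- boxplot.stats(values)$out
out
```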

1.3. Dealing with Missing Values in Data Science

  • Missing values aren't always wrong but need separate handling.
  • They may indicate data collection errors or ETL process errors.
  • Common techniques used by data scientists are listed in table 2.4.
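
The referenced table is not reproduced here. As a hedged illustration, the sketch below shows two widely used options, omitting missing values and imputing the mean; these are not necessarily the techniques listed in that table, and the age vector is made up.

```r
age <- c(23, 35, NA, 41, NA, 29)

# How many values are missing?
sum(is.na(age))

# Option 1: omit the missing observations
age_complete <- age[!is.na(age)]

# Option 2: impute a static value, here the mean of the observed values
age_imputed <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)
age_imputed
```

Omitting loses observations, while imputing can distort the distribution, so the right choice depends on why the values are missing.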

2. Transforming Data for Data Modeling 

  • Data cleansing and integration are crucial for data modeling.
  • Data transformation involves transforming data into a suitable form.
  • Relationships that are not linear in the raw variables, such as y = a·e^(bx), can often be made linear by taking the log of a variable.
  • Combining two variables into a new variable can also be used.
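
To illustrate how a log transformation can simplify estimation, the sketch below generates data from y = a·e^(bx) with a = 2 and b = 0.5 (values assumed for the example) and recovers both parameters with an ordinary linear fit on log(y).

```r
# Exponential relationship: y = 2 * exp(0.5 * x)
x <- 1:10
y <- 2 * exp(0.5 * x)

# log(y) = log(2) + 0.5 * x is linear in x, so lm() can estimate it
fit <- lm(log(y) ~ x)
coef(fit)

# Back-transform the intercept to recover a
exp(coef(fit)[["(Intercept)"]])
```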

Reducing Variables in Models

  • Too many variables can make a model hard to handle.
  • Techniques based on Euclidean distance perform well only up to about 10 variables.
  • Variables that add no new information to the model can be dropped while retaining most of the information in the data.

Turning Variables into Dummies in Data Science

  • Variables can be transformed into dummy variables, which can only take two values: true(1) or false(0).
  • Dummy variables indicate the presence or absence of a categorical effect that may explain an observation.
  • Separate columns for classes stored in one variable are created, with 1 indicating present classes and 0 otherwise.
  • Example: Turn Weekdays into Monday through Sunday columns to show if the observation was on a Monday.
  • This technique is popular in modeling and is not exclusive to economists.
  • The next step is to transform and integrate data into usable input for the modeling phase.
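
The weekday example above can be sketched in R with model.matrix(), which expands a factor into one 0/1 column per level. The four observations are invented for illustration.

```r
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

# Four observations, each recorded on some weekday
obs <- data.frame(day = factor(c("Monday", "Tuesday", "Monday", "Sunday"),
                               levels = days))

# One column per weekday; "- 1" drops the intercept so no level is omitted
dummies <- model.matrix(~ day - 1, data = obs)
colnames(dummies) <- days
dummies
```

Each row contains a single 1 in the column of the day on which the observation was made.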

3. Data Combination from Different Sources

  • Data sources include databases, Excel files, text documents, etc.
  • Data science process is the focus, not presenting scenarios for every type of data.
  • Other data sources like key-value stores and document stores will be discussed in later sections.

Different Ways of Combining Data

  1. Joining: enriches an observation from one table with information from another.
  2. Appending or stacking: adds observations from one table to another.
  3. Combining data allows creation of new physical or virtual tables (views).
  4. Views consume less disk space than physical tables.
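
Joining and appending can be sketched with base R's merge() and rbind(). The tables, column names, and values below are hypothetical.

```r
# Joining: enrich each client with the region stored in another table
clients <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
regions <- data.frame(id = c(1, 2, 3), region = c("North", "South", "East"))
joined <- merge(clients, regions, by = "id")
joined

# Appending: stack the observations of two tables with the same columns
jan <- data.frame(id = c(1, 2), sales = c(100, 150))
feb <- data.frame(id = c(1, 3), sales = c(120, 90))
stacked <- rbind(jan, feb)
stacked
```

Joining adds columns to existing rows, while appending adds rows under existing columns.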

Retrieving Data

 Data Science Steps: Retrieving Required Data

  • Designing data collection process may be necessary.
  • Companies often collect and store data.
  • Data that is not available in-house can be purchased from third parties.
  • Don't hesitate to seek data outside your organization.
  • More organizations are making high-quality data freely available for public and commercial use.

Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need. This may be difficult, and even if you succeed, data is often like a diamond in the rough: it needs polishing to be of any use to you.

 1. Start with data stored within the company

 Assessing Data Relevance and Quality

  • Assess the quality and relevance of available data within the company.
  • Companies often have a data maintenance program, reducing cleaning work.
  • Data can be stored in official repositories like databases, data marts, data warehouses, and data lakes.
  • Databases are for data storage, data warehouses for data analysis, and data marts serve specific business units.
  • Data lakes contain raw data, while data warehouses and data marts are preprocessed.
  • Data may still exist in Excel files on a domain expert's desktop.

Data Management Challenges in Companies

  • Data scattered as companies grow.
  • Knowledge dispersion due to position changes and departures.
  • Documentation and metadata not always prioritized.
  • Need for Sherlock Holmes-like skills to find lost data.

Data Access Challenges

  • Organizations often have policies ensuring data access only for necessary information.
  • These policies create physical and digital barriers, known as "Chinese walls."
  • These "walls" are mandatory and well-regulated for customer data in most countries.
  • Accessing data can be time-consuming and influenced by company politics.


2. Don’t be afraid to shop around

Data Sharing and its Importance

  • Companies like Nielsen and GFK specialize in collecting valuable information.
  • Twitter, LinkedIn, and Facebook provide data for enriching their services and ecosystem.
  • Governments and organizations share their data for free, covering a broad range of topics.
  • This data is useful for enriching proprietary data and training data science skills at home.
  • Table 2.1 shows a small selection from the growing number of open-data providers.





3. Do data quality checks now to prevent problems later


Data Science Project Overview

  • Data correction and cleansing are crucial, often taking up to 80% of project time.
  • Data retrieval is the first phase of data inspection in the data science process.
  • Errors in data retrieval can be easily identified, but carelessness can lead to long-term data issues.
  • Data investigation occurs during import, data preparation, and exploratory phases.
  • Data retrieval checks if the data is equal to the source document and if the data types match.
  • Data preparation involves a more detailed check, aiming to eliminate typos and data entry errors.
  • The exploratory phase focuses on learning from the data, examining statistical properties like distributions, correlations, and outliers.
  • Iteration over these phases is common, as outliers can indicate data entry errors.

Setting the research goal

A project starts by understanding the what, the why, and the how of your project. What does the company expect you to do? And why does management place such a value on your research? Is it part of a bigger strategic picture or a “lone wolf” project originating from an opportunity someone detected? Answering these three questions (what, why, how) is the goal of the first phase, so that everybody knows what to do and can agree on the best course of action.

The output should be a clear research aim, a strong grasp of the context, well-defined deliverables, and a plan of action with a time frame. The appropriate location for this information is then in a project charter. Naturally, the duration and formality might vary throughout projects and businesses. This component of the project will frequently be led by more senior staff since during this early stage, commercial acumen and people skills are more crucial than exceptional technical ability.

Spend time understanding the goals and context of your research

Research Goal Importance

  • Outlines the purpose of the assignment clearly.
  • Essential for understanding business goals and context
  • Keep asking questions and requesting examples until you understand the business expectations.
  • Identify project's fit in the larger picture.
  • Understand how research will change the business.
  • Understand how results will be used.
  • Avoid misunderstanding business goals and context.
  • Many data scientists fail due to lack of understanding.

 Create a project charter

 A project charter requires teamwork, and your input covers at least the following:

  • A clear research goal
  • The project mission and context
  • How you’re going to perform your analysis
  • What resources you expect to use
  • Proof that it’s an achievable project, or proof of concepts
  • Deliverables and a measure of success
  • A timeline

Overview of the Data Science Process

Following a structured approach to data science helps you to maximize your chances of success in a data science project at the lowest cost. It also makes it possible to take up a project as a team, with each team member focusing on what they do best. Take care, however: this approach may not be suitable for every type of project or be the only way to do good data science. 

The typical data science process consists of six steps through which you’ll iterate, as shown in figure 2.1.


Figure 2.1 summarizes the data science process and shows the main steps and actions you’ll take during a project. The following list is a short introduction:

1. The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project. In every serious project this will result in a project charter.

2. The second phase is data retrieval. You want to have data available for analysis, so this step includes finding suitable data and getting access to the data from the data owner.  The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.

3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different  kinds of errors in the data, combine data from different data sources, and transform it. If you have successfully completed this step, you can progress to data visualization and modeling.

4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data. You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The insights you gain from this phase will enable you to start modeling.

5. Finally, we get to model building. It is now that you attempt to gain the insights or make the predictions stated in your project charter.

Now is the time to bring out the heavy guns, but remember research has taught us that often (but not always) a combination of simple models tends to outperform one complicated model. If you’ve done this phase right, you’re almost done.


6. The last step of the data science process is presenting your results and automating the analysis, if needed. One goal of a project is to change a process and/or make better decisions. You may still need to convince the business that your findings will indeed change the business process as expected.

This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic and tactical level. Certain projects require you to perform the business process over and over again, so automating the project will save time.

R Nuts and Bolts Part-II


4.8 Explicit Coercion

Objects can be explicitly coerced from one class to another using the as.* functions, if available.

> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"

Sometimes, R can’t figure out how to coerce an object and this can result in NAs being produced.

> x <- c("a", "b", "c")
> as.numeric(x)
Warning: NAs introduced by coercion
[1] NA NA NA
> as.logical(x)
[1] NA NA NA
> as.complex(x)
Warning: NAs introduced by coercion
[1] NA NA NA

When nonsensical coercion takes place, you will usually get a warning from R.

4.9 Matrices

Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (number of rows, number of columns).

> m <- matrix(nrow = 2, ncol = 3) 
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3

Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner and running down the columns.

> m <- matrix(1:6, nrow = 2, ncol = 3) 
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Matrices can also be created directly from vectors by adding a dimension attribute.

> m <- 1:10 
> m
 [1]  1  2  3  4  5  6  7  8  9 10
> dim(m) <- c(2, 5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

Matrices can be created by column-binding or row-binding with the cbind() and rbind() functions.

> x <- 1:3
> y <- 10:12
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y) 
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12

4.10 Lists

Lists are a special type of vector that can contain elements of different classes. Lists are a very important data type in R and you should get to know them well. Lists, in combination with the various “apply” functions discussed later, make for a powerful combination.

Lists can be explicitly created using the list() function, which takes an arbitrary number of arguments.

> x <- list(1, "a", TRUE, 1 + 4i) 
> x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

We can also create an empty list of a prespecified length with the vector() function.

> x <- vector("list", length = 5)
> x
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL

4.11 Factors

Factors are used to represent categorical data and can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label. Factors are important in statistical modeling and are treated specially by modelling functions like lm() and glm().

Using factors with labels is better than using integers because factors are self-describing. Having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.

Factor objects can be created with the factor() function.

> x <- factor(c("yes", "yes", "no", "yes", "no")) 
> x
[1] yes yes no  yes no 
Levels: no yes
> table(x) 
x
 no yes 
  2   3 
> ## See the underlying representation of factor
> unclass(x)  
[1] 2 2 1 2 1
attr(,"levels")
[1] "no"  "yes"

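In versions of R before 4.0.0, factors were often created automatically when you read a dataset in using a function like read.table(), because those functions defaulted to creating factors when they encountered data that look like characters or strings. Since R 4.0.0, character columns stay character vectors unless you set stringsAsFactors = TRUE.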

The order of the levels of a factor can be set using the levels argument to factor(). This can be important in linear modeling because the first level is used as the baseline level.

> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x  ## Levels are put in alphabetical order
[1] yes yes no  yes no 
Levels: no yes
> x <- factor(c("yes", "yes", "no", "yes", "no"),
+             levels = c("yes", "no"))
> x
[1] yes yes no  yes no 
Levels: yes no
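
The following sketch shows one way to check which level is the baseline and to change it after the factor already exists. It uses relevel() from the stats package, which is not covered in the text above; x2 is just an illustrative name.

```r
x <- factor(c("yes", "yes", "no", "yes", "no"),
            levels = c("yes", "no"))

# The first level is the baseline
print(levels(x)[1])

# The underlying integer codes follow the level order
print(as.integer(x))

# relevel() moves a chosen level to the front without retyping the data
x2 <- relevel(x, ref = "no")
print(levels(x2))
```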

4.12 Missing Values

Missing values are denoted by NA, or by NaN for undefined mathematical operations.

  • is.na() is used to test whether the elements of an object are NA

  • is.nan() is used to test for NaN

  • NA values have a class also, so there are integer NA, character NA, etc.

  • A NaN value is also NA but the converse is not true

> ## Create a vector with NAs in it
> x <- c(1, 2, NA, 10, 3)  
> ## Return a logical vector indicating which elements are NA
> is.na(x)    
[1] FALSE FALSE  TRUE FALSE FALSE
> ## Return a logical vector indicating which elements are NaN
> is.nan(x)   
[1] FALSE FALSE FALSE FALSE FALSE
> ## Now create a vector with both NA and NaN values
> x <- c(1, 2, NaN, NA, 4)
> is.na(x)
[1] FALSE FALSE  TRUE  TRUE FALSE
> is.nan(x)
[1] FALSE FALSE  TRUE FALSE FALSE
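
To make the point about NA classes concrete, here is a minimal sketch using the typed NA constants that base R provides. The variable names are chosen only for this example.

```r
# A plain NA is logical; typed versions exist for other classes
na_classes <- c(class(NA), class(NA_integer_),
                class(NA_real_), class(NA_character_))
print(na_classes)

# An undefined operation such as 0 / 0 produces NaN
nan_val <- 0 / 0
print(is.nan(nan_val))   # NaN is NaN ...
print(is.na(nan_val))    # ... and also counts as NA
print(is.nan(NA))        # but NA is not NaN
```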

4.13 Data Frames

Data frames are used to store tabular data in R. They are an important type of object in R and are used in a variety of statistical modeling applications. Hadley Wickham’s package dplyr has an optimized set of functions designed to work efficiently with data frames.

Data frames are represented as a special type of list where every element of the list has to have the same length. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows.

Unlike matrices, data frames can store different classes of objects in each column. Matrices must have every element be the same class (e.g. all integers or all numeric).

In addition to column names, indicating the names of the variables or predictors, data frames have a special attribute called row.names which indicates information about each row of the data frame.

Data frames are usually created by reading in a dataset using read.table() or read.csv(). However, data frames can also be created explicitly with the data.frame() function or they can be coerced from other types of objects like lists.

Data frames can be converted to a matrix by calling data.matrix(). While it might seem that the as.matrix() function should be used to coerce a data frame to a matrix, almost always, what you want is the result of data.matrix().

> x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE)) 
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
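
The difference between data.matrix() and as.matrix(), and coercion from a list, can be sketched as follows. The data frame df and the other names are hypothetical, chosen so that one column is non-numeric.

```r
# A data frame with an integer column and a factor column
df <- data.frame(id = 1:3, grp = factor(c("a", "b", "a")))

# data.matrix() converts factors to their integer codes,
# so the result stays numeric
m_num <- data.matrix(df)
print(m_num)

# as.matrix() falls back to a character matrix when a column is non-numeric
m_chr <- as.matrix(df)
print(typeof(m_chr))

# A list whose elements all have the same length can be coerced to a data frame
from_list <- as.data.frame(list(foo = 1:2, bar = c("a", "b")))
print(dim(from_list))
```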

4.14 Names

R objects can have names, which is very useful for writing readable code and self-describing objects. Here is an example of assigning names to an integer vector.

> x <- 1:3
> names(x)
NULL
> names(x) <- c("New York", "Seattle", "Los Angeles") 
> x
   New York     Seattle Los Angeles 
          1           2           3 
> names(x)
[1] "New York"    "Seattle"     "Los Angeles"

Lists can also have names, which is often very useful.

> x <- list("Los Angeles" = 1, Boston = 2, London = 3) 
> x
$`Los Angeles`
[1] 1

$Boston
[1] 2

$London
[1] 3

> names(x)
[1] "Los Angeles" "Boston"      "London"     

Matrices can have both column and row names.

> m <- matrix(1:4, nrow = 2, ncol = 2)
> dimnames(m) <- list(c("a", "b"), c("c", "d")) 
> m
  c d
a 1 3
b 2 4

Column names and row names can be set separately using the colnames() and rownames() functions.

> colnames(m) <- c("h", "f")
> rownames(m) <- c("x", "z")
> m
  h f
x 1 3
z 2 4

Note that for data frames, there is a separate function for setting the row names, the row.names() function. Also, data frames do not have column names; they just have names (like lists). So to set the column names of a data frame, just use the names() function. Yes, I know it's confusing. Here's a quick summary:

Object       Set column names   Set row names
data frame   names()            row.names()
matrix       colnames()         rownames()
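
The summary can be tried out with a short sketch like the one below. The objects df and m and their labels are invented for the example.

```r
# Data frame: names() for columns, row.names() for rows
df <- data.frame(a = 1:2, b = c("u", "v"))
names(df) <- c("id", "label")
row.names(df) <- c("first", "second")
print(df)

# Matrix: colnames() for columns, rownames() for rows
m <- matrix(1:4, nrow = 2, ncol = 2)
colnames(m) <- c("h", "f")
rownames(m) <- c("x", "z")
print(m)
```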

