3 Credits | 200 Level | 38 Contact hours
R for Data Science - Hadley Wickham & Garrett Grolemund
Exploratory Data Analysis with R - Roger D. Peng
Data scientists face datasets that vary widely in origin, format, organisation, and coding. Acquiring and organising the data correctly, removing potentially erroneous values (outliers), imputing missing data, transforming variables, selecting the most relevant features of a high-dimensional set (feature selection), and eliminating redundant data together form one of the most costly stages of a data analysis problem. This stage is crucial to treating the problem correctly and to the reliability and robustness of the results obtained in later stages of analysis (model selection, classification, clustering, estimation, hypothesis testing, etc.).
Term-specific information provided onsite.
Topic 1. Introduction to data processing
1. Why analyse data?
2. Overview of a data processing problem
Topic 2. Getting data
1. Data repositories
2. Data file formats
3. Merging data from different sources
4. Access to databases
Topic 3. Data visualisation
1. Explanatory and exploratory graphics
2. Graphic systems in R
Topic 4. Preparation of the data
1. Structure of a data set for analysis: basic data operations
2. Data manipulation
3. Data handling
Topic 5. Exploratory data analysis
1. Exploring a new dataset
2. Characterisation of variables
3. Displaying relationships between variables
4. Anomalies in numerical variables: Outliers
5. Detection methods
6. Anomalies in numerical variables: missing and absent data
Topic 6. Working with text data
1. Fundamentals of text data analysis
2. Basic functions for handling characters in R
3. Regular expressions
4. Character encoding
Upon successful completion of this course, students will be able to:
● Load data files of any kind (text, tabular formats, etc.).
● Clean and complete the contents of a dataset after loading (correcting typographical errors, imputing missing data, etc.).
● Choose the appropriate data type for each variable according to its nature (integer, real, factor, text, etc.).
● Filter samples, select features, and create new features from tabular formats such as the data frame.
● Characterise the data according to their type.
● Create basic data visualisations to draw preliminary conclusions from the data.
● Identify the problems that arise with highly unbalanced datasets.
Class Attendance 10%
Assignments 25%
Midterm exam 30%
Final Exam 35%