Data processing

3 Credits | 200 Level | 38 Contact hours

REQUIRED TEXTBOOKS AND COURSE MATERIALS

R for Data Science - Hadley Wickham & Garrett Grolemund
Exploratory Data Analysis with R - Roger D. Peng

DESCRIPTION

Data scientists routinely face datasets that differ widely in origin, format, organisation and coding. Acquiring and organising the data correctly, removing possibly erroneous values (outliers), imputing missing data, transforming variables, selecting the most relevant features of a high-dimensional set (feature selection) and eliminating redundant data together form one of the most costly stages of a data analysis problem. This stage is crucial to treating the problem correctly and to the reliability and robustness of the results obtained in later stages of analysis (model selection, classification, clustering, estimation, hypothesis testing, etc.).

OUTLINE

Term-specific information provided onsite.

Topic 1. Introduction to data processing
1. Why analyse data?
2. Overview of a data processing problem

Topic 2. Getting data
1. Data repositories
2. Data file formats
3. Merging data from different sources
4. Access to databases

Topic 3. Data visualization
1. Explanatory and exploratory graphics
2. Graphic systems in R

Topic 4. Preparation of the data
1. Structure of a data set for analysis: basic data operations
2. Data manipulation
3. Data handling

Topic 5. Exploratory data analysis
1. Exploring a new dataset
2. Characterisation of variables
3. Displaying relationships between variables
4. Anomalies in numerical variables: Outliers
5. Detection methods
6. Anomalies in numerical variables: missing data

Topic 6. Working with text data
1. Basics of text data analysis
2. Basic functions for handling characters in R
3. Regular expressions
4. Character encoding

STUDENT LEARNING/COURSE OUTCOMES

Upon successful completion of this course, students will be able to:

● Load data files of any kind (text, tabular formats, etc.).
● Clean and complete the contents of a dataset after loading (correcting typographical errors, imputing missing data, etc.).
● Choose the appropriate data type for each variable according to its nature (integer, real, factor, text, etc.).
● Filter samples, select features, and create new ones from tabular formats such as the data frame.
● Characterise the data according to their type.
● Create basic data visualisations to draw preliminary conclusions from the data.
● Identify the problems that arise with highly unbalanced data sets.

ASSESSMENT/GRADES

Class Attendance 10%
Assignments 25%
Midterm exam 30%
Final Exam 35%

• Attendance

• Assessments

• Sexual Harassment Policy

• Students With Disabilities

• Academic Honesty Policy

• University Ombudsman

• Statement On Audio And Video Recording

• Syllabus Change Policy
