## Introduction to Data Science

This course is listed in USOS as: Introduction to Data Science WFAIS.IF-N018.0 (Wprowadzenie do Analityki Danych), 60 hours, 6 ECTS.

### When

The course starts in the winter semester of 2017/2018.

### Lecturers

### Target audience

Students of 1st and 2nd level studies (preferably 1st).

### Prerequisites

- A course in statistics and/or probability calculus
- A programming course (preferably Python)

### Course outcomes

After finishing this course, students are expected to understand the overall data science process, which consists of:

- Acquiring the data
- Cleaning and validating the data
- Doing exploratory data analysis
- Modeling the data (linear and logistic regression)
- Visualizing the data
- Statistical inference: drawing conclusions/answering questions/testing hypotheses
- Presenting findings

Students will also gain an overview of the machine learning “family tree”: supervised, unsupervised and reinforcement learning. Specifically, after finishing the course students should be able to:

- Acquire structured and, to some extent, unstructured data from various sources such as text files, PDF files, databases (SQL and NoSQL) and the WWW.
- Clean the acquired data where necessary, including handling missing data.
- Manipulate the collected data using operations that include slicing, indexing, multi-indexing, grouping, merging and aggregating.
- Perform exploratory data analysis, including calculating various statistical descriptors and estimating the errors on those descriptors.
- Perform dimensionality reduction using principal component analysis (PCA).
- Visualize the data using appropriate plots.
- Fit a linear model to the data and check its validity and robustness.
- Use logistic regression for classification.
- Perform clustering using the k-means algorithm as an example of unsupervised learning.
- Estimate the errors of the calculated descriptors or parameters using resampling methods such as jackknife and/or bootstrap.
- State a hypothesis concerning the data and test it at a specified significance level. Evaluate an A/B test.
- Implement a k-nearest neighbors classifier.
- Implement a Bayesian classifier.
- Use PySpark to handle data sets that are too large to fit into the memory of a single computer.
- Present and justify their findings.
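
To give a flavour of one of the outcomes listed above, the sketch below estimates the error of a statistical descriptor with the bootstrap. It is only an illustration, not course material: the function name `bootstrap_standard_error`, the synthetic normally distributed data, the number of resamples and the random seeds are all assumptions made for this example.

```python
import numpy as np


def bootstrap_standard_error(sample, statistic=np.mean, n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement.

    Hypothetical helper written for this illustration.
    """
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Each row holds the indices of one bootstrap resample of the same size as the data.
    indices = rng.integers(0, n, size=(n_resamples, n))
    # Evaluate the statistic on every resample.
    replicates = np.array([statistic(sample[row]) for row in indices])
    # The spread of the replicates approximates the standard error of the statistic.
    return replicates.std(ddof=1)


# Synthetic data (an assumption for this example): 200 draws from a normal distribution.
data = np.random.default_rng(42).normal(loc=10.0, scale=2.0, size=200)

se_boot = bootstrap_standard_error(data)
# For the mean, the textbook formula s / sqrt(n) provides a cross-check.
se_analytic = data.std(ddof=1) / np.sqrt(data.size)

print(f"bootstrap SE of the mean: {se_boot:.4f}")
print(f"analytic SE of the mean:  {se_analytic:.4f}")
```

For the mean, the two estimates should agree closely. The bootstrap becomes more useful for descriptors such as the median, where no simple closed-form error is available; passing `statistic=np.median` to the same function covers that case.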

Students will also acquire a working knowledge of the tools described in the next section.

### Tools

We will use the Python language and its packages. The course is based on Jupyter notebooks together with the Python packages SciPy, NumPy, pandas, scikit-learn, matplotlib, plotly and seaborn. PySpark will additionally be used for the analysis of large data sets.

### Assessment

During the course, students will be required to carry out at least two data science projects and present their conclusions for assessment.