BD4151 Foundations of Data Science Syllabus – Anna University PG Syllabus Regulation 2021

COURSE OBJECTIVES:

• To apply fundamental algorithms to process data.
• To learn to turn hypotheses and data into actionable predictions.
• To document and transfer results, and to communicate findings effectively using visualization techniques.
• To learn the statistical methods and machine learning algorithms required for data science.
• To develop the fundamental knowledge and understand the concepts needed to become a data science professional.

UNIT I INTRODUCTION TO DATA SCIENCE

Data science process – roles and stages in a data science project – working with data from files – working with relational databases – exploring data – managing data – cleaning and sampling for modeling and validation – introduction to NoSQL.

UNIT II MODELING METHODS

Choosing and evaluating models – mapping problems to machine learning, evaluating clustering models, validating models – cluster analysis – K-means algorithm, Naïve Bayes – Memorization Methods – Linear and logistic regression – unsupervised methods.

UNIT III INTRODUCTION TO R

Reading and getting data into R – ordered and unordered factors – arrays and matrices – lists and data frames – reading data from files – probability distributions – statistical models in R – manipulating objects – data distribution.

UNIT IV MAP REDUCE

Introduction – distributed file system – algorithms using MapReduce, Matrix-Vector Multiplication by MapReduce – Hadoop – Understanding the MapReduce architecture – Writing Hadoop MapReduce Programs – Loading data into HDFS – Executing the Map phase – Shuffling and sorting – Reducing phase execution.

UNIT V DATA VISUALIZATION

Documentation and deployment – producing effective presentations – Introduction to graphical analysis – plot() function – displaying multivariate data – matrix plots – multiple plots in one window – exporting graph using graphics parameters – Case studies.

TOTAL: 45 PERIODS

COURSE OUTCOMES:

CO1: Obtain, clean/process and transform data.
CO2: Analyze and interpret data using an ethically responsible approach.
CO3: Use appropriate models of analysis, assess the quality of input, derive insight from results, and investigate potential issues.
CO4: Apply computing theory, languages and algorithms, as well as mathematical and statistical models, and the principles of optimization to appropriately formulate and use data analyses.
CO5: Formulate and use appropriate models of data analysis to solve business-related challenges.

REFERENCES:

1. Nina Zumel, John Mount, “Practical Data Science with R”, Manning Publications, 2014.
2. Mark Gardener, “Beginning R – The Statistical Programming Language”, John Wiley & Sons, Inc., 2012.
3. W. N. Venables, D. M. Smith and the R Core Team, “An Introduction to R”, 2013.
4. Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta, “Practical Data Science Cookbook”, Packt Publishing Ltd., 2014.
5. Nathan Yau, “Visualize This: The FlowingData Guide to Design, Visualization, and Statistics”, Wiley, 2011.