BD4151 Foundations of Data Science Syllabus – Anna University PG Syllabus Regulation 2021
COURSE OBJECTIVES:
To apply fundamental algorithms to process data.
To learn to turn hypotheses and data into actionable predictions.
To document and transfer the results, and to communicate the findings effectively using visualization techniques.
To learn statistical methods and machine learning algorithms required for Data Science.
To develop the fundamental knowledge and understand concepts to become a data science professional.
UNIT I INTRODUCTION TO DATA SCIENCE
Data science process – roles, stages in data science project – working with data from files – working with relational databases – exploring data – managing data – cleaning and sampling for modeling and validation – introduction to NoSQL.
UNIT II MODELING METHODS
Choosing and evaluating models – mapping problems to machine learning, evaluating clustering models, validating models – cluster analysis – K-means algorithm, Naïve Bayes – memorization methods – linear and logistic regression – unsupervised methods.
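As an illustration of the K-means algorithm listed in this unit, the following is a minimal sketch, not course material. It is written in Python rather than R, uses only the standard library, and assumes squared Euclidean distance, a fixed number of iterations, and caller-supplied starting centroids. The function name `kmeans` and the sample points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-means: alternate between assigning points and updating centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

In practice the starting centroids are usually chosen at random and the loop stops once assignments no longer change; both are left out here to keep the sketch short.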
UNIT III INTRODUCTION TO R
Reading and getting data into R – ordered and unordered factors – arrays and matrices – lists and data frames – reading data from files – probability distributions – statistical models in R – manipulating objects – data distribution.
UNIT IV MAPREDUCE
Introduction – distributed file system – algorithms using MapReduce, matrix-vector multiplication by MapReduce – Hadoop – understanding the MapReduce architecture – writing Hadoop MapReduce programs – loading data into HDFS – executing the map phase – shuffling and sorting – reducing phase execution.
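As an illustration of matrix-vector multiplication by MapReduce, the following is a minimal single-process sketch, not course material. It is written in Python and only imitates the map, shuffle-and-sort, and reduce phases in memory; it does not use Hadoop or HDFS. It assumes the matrix is given as `(row, column, value)` triples and that the whole vector is available to every map call. All function names and the sample data are hypothetical.

```python
from collections import defaultdict


def map_phase(matrix_entries, vector):
    """For each matrix entry (i, j, m_ij), emit the pair (i, m_ij * v_j)."""
    for i, j, m_ij in matrix_entries:
        yield i, m_ij * vector[j]


def shuffle(pairs):
    """Group intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Sum the partial products of each row to get one output component."""
    return {i: sum(values) for i, values in groups.items()}


def matvec(matrix_entries, vector):
    return reduce_phase(shuffle(map_phase(matrix_entries, vector)))


# Hypothetical example: M = [[1, 2], [3, 4]] and v = [10, 20].
entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
result = matvec(entries, [10, 20])
print(result)
```

The key emitted by the map step is the row index, so after grouping each reduce call sees exactly the products that belong to one component of the output vector.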
UNIT V DATA VISUALIZATION
Documentation and deployment – producing effective presentations – Introduction to graphical analysis – plot() function – displaying multivariate data – matrix plots – multiple plots in one window – exporting graph using graphics parameters – Case studies.
TOTAL: 45 PERIODS
COURSE OUTCOMES:
CO1: Obtain, clean/process and transform data.
CO2: Analyze and interpret data using an ethically responsible approach.
CO3: Use appropriate models of analysis, assess the quality of input, derive insight from results, and investigate potential issues.
CO4: Apply computing theory, languages and algorithms, as well as mathematical and statistical models, and the principles of optimization to appropriately formulate and use data analyses.
CO5: Formulate and use appropriate models of data analysis to solve business-related challenges.
REFERENCES:
1. Nina Zumel, John Mount, “Practical Data Science with R”, Manning Publications, 2014.
2. Mark Gardener, “Beginning R – The Statistical Programming Language”, John Wiley & Sons, Inc., 2012.
3. W. N. Venables, D. M. Smith and the R Core Team, “An Introduction to R”, 2013.
4. Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta, “Practical Data Science Cookbook”, Packt Publishing Ltd., 2014.
5. Nathan Yau, “Visualize This: The FlowingData Guide to Design, Visualization, and Statistics”, Wiley, 2011.