BD4251 Big Data Mining and Analytics Syllabus – Anna University PG Syllabus Regulation 2021

COURSE OBJECTIVES:

 To understand the computational approaches to modeling and feature extraction
 To understand the need for and applications of MapReduce
 To understand the various search algorithms applicable to Big Data
 To analyze and interpret streaming data
 To learn how to handle large data sets in main memory and learn the various clustering techniques applicable to Big Data

UNIT I DATA MINING AND LARGE SCALE FILES

Introduction to Statistical Modeling – Machine Learning – Computational Approaches to Modeling – Summarization – Feature Extraction – Statistical Limits on Data Mining – Distributed File Systems – MapReduce – Algorithms using MapReduce – Efficiency of Cluster Computing Techniques.

UNIT II SIMILAR ITEMS

Nearest Neighbor Search – Shingling of Documents – Similarity preserving summaries – Locality sensitive hashing for documents – Distance Measures – Theory of Locality Sensitive Functions – LSH Families – Methods for High Degree of Similarities.

UNIT III MINING DATA STREAMS

Stream Data Model – Sampling Data in the Stream – Filtering Streams – Counting Distinct Elements in a Stream – Estimating Moments – Counting Ones in a Window – Decaying Windows.

UNIT IV LINK ANALYSIS AND FREQUENT ITEMSETS

PageRank – Efficient Computation – Topic-Sensitive PageRank – Link Spam – Market Basket Model – A-Priori Algorithm – Handling Larger Datasets in Main Memory – Limited Pass Algorithm – Counting Frequent Itemsets.

UNIT V CLUSTERING

Introduction to Clustering Techniques – Hierarchical Clustering – Algorithms – K-Means – CURE – Clustering in Non-Euclidean Spaces – Streams and Parallelism – Case Study: Advertising on the Web – Recommendation Systems.

TOTAL: 45 PERIODS

COURSE OUTCOMES:

Upon completion of this course, the students will be able to
CO1: Design algorithms by employing the MapReduce technique for solving Big Data problems.
CO2: Design algorithms for Big Data by deciding on the apt feature set.
CO3: Design algorithms for handling petabytes of data.
CO4: Design algorithms and propose solutions for Big Data by optimizing main memory consumption.
CO5: Design solutions for problems in Big Data by suggesting appropriate clustering techniques.

REFERENCES:

1. Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, “Mining of Massive Datasets”, Cambridge University Press, 3rd Edition, 2020.
2. Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publications, Third Edition, 2012.
3. Ian H. Witten, Eibe Frank, “Data Mining: Practical Machine Learning Tools and Techniques”, Morgan Kaufmann Publications, Third Edition, 2011.
4. David Hand, Heikki Mannila and Padhraic Smyth, “Principles of Data Mining”, MIT Press, 2001.

WEB REFERENCES:

1. https://swayam.gov.in/nd2_arp19_ap60/preview
2. https://nptel.ac.in/content/storage2/nptel_data3/html/mhrd/ict/text/106104189/lec1.pdf

ONLINE RESOURCES:

1. https://examupdates.in/big-data-analytics/
2. https://www.tutorialspoint.com/big_data_analytics/index.htm
3. https://www.tutorialspoint.com/data_mining/index.htm