INFM 203-10
Big Data Analytics and Management
Fall 2021 Syllabus

Glen Mules PhD
E-mail
Office Hours:  Virtually — by appointment (use e-mail to schedule) — Zoom optional. Drop-in office hours might also be held as needed — Eastern Timezone. Telephone 914.235.7916 — Eastern timezone.



Canvas Information: This course will be available on Canvas Oct 11, 2021, at 6 am PT. 

This 2-unit course runs from Oct 11, 2021, to Dec 6, 2021.

You will be enrolled in the Canvas site automatically.

Course Description

Covers important big data technologies, trends, infrastructure, and management issues that enable users to make informed and strategic decisions in the presence of large-scale data sets.

Course Requirements

This two-unit course is an offering of SJSU’s School of Information (iSchool) in the College of Professional and Global Education. All iSchool courses are completely online. Home computing requirements are posted online for prospective students at https://ischool.sjsu.edu/home-computing-environment. Students must meet those minimum requirements to participate in the activities for this course, in addition to the following requirements.

Computer Hardware Requirements

To participate fully in the practical Labs for this course, the following hardware and network requirements apply:

  • Laptop / Desktop computer running an up-to-date version of Windows 10 or macOS — with Administrator privileges
  • Minimum of 8 GB RAM (Random Access Memory) — but 16 GB or more is definitely preferred
  • 100 GB available disk storage
  • Strongly recommended:  1 TB or 2 TB external USB 3.0 disk drive for short-term and long-term storage of Virtual Machine (VM) images and all other course materials
  • Strong Internet connection, preferably Cable/Fiber with 20+ Mbps connectivity, needed to download the rather large files used in the Labs (e.g., the 5+ GB virtual machine, installation software, etc.). These downloads are mainly done in Weeks 01 & 02.
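To plan the Weeks 01 & 02 downloads, the transfer time can be estimated from the figures above. This is only a rough sketch: the helper name `download_minutes` is hypothetical, and it assumes decimal gigabytes, a sustained connection at the stated speed, and no protocol overhead, so real downloads will usually take longer.

```python
def download_minutes(size_gb: float, speed_mbps: float) -> float:
    """Estimate download time in minutes for a file of size_gb gigabytes."""
    if size_gb < 0 or speed_mbps <= 0:
        raise ValueError("size must be >= 0 and speed must be > 0")
    # Assumption: 1 GB = 1000**3 bytes and 1 Mbps = 1000**2 bits per second.
    bits = size_gb * 8 * 1000**3
    seconds = bits / (speed_mbps * 1000**2)
    return seconds / 60


# A 5 GB virtual machine image over a 20 Mbps connection.
print(round(download_minutes(5, 20), 1))  # prints 33.3
```

Under these assumptions, the 5 GB virtual machine alone needs a little over half an hour at the recommended minimum speed.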

Course Calendar 

This schedule and the related dates, readings, and assignments are tentative and subject to change with fair notice. Any changes will be announced in due time in class and on the course’s website in the Canvas Learning Management System. Students are obliged to consult the most up-to-date and detailed version of the reading material and syllabus, which will be posted on the course’s website.

Week# -- Starting Date -- CLOs -- Topics

Week 1 -- Mon 11-Oct-21 -- CLOs 1
  1. What is Big Data? What is Data Analytics?
  2. Roles: Business Analyst, Data Engineer, Data Scientist
Project Initialization

Week 2 -- Mon 18-Oct-21 -- CLOs 1, 2
  1. Hadoop & HDFS
  2. MapReduce and Distributed Computing

Week 3 -- Mon 25-Oct-21 -- CLOs 2
  1. Spark
  2. The Hadoop Ecosystem

Week 4 -- Mon 01-Nov-21 -- CLOs 2
  1. Data Lakes / Data Fabric / Cloud
  2. Relational Databases & the NoSQL movement

Week 5 -- Mon 08-Nov-21 -- CLOs 1-4
  1. Data Movement
Multiple Choice Quiz (MCQ)

Week 6 -- Mon 15-Nov-21 -- CLOs 3
  1. Programming for Big Data
  2. Using Data Notebooks for Data Science
  3. Data Visualization

Week 7 -- Mon 22-Nov-21 -- CLOs 4
  1. Management, Governance, and Data Security

Week 8 -- Mon 29-Nov-21 -- CLOs 1-4
Course Review
Project Presentations / Submissions

Week 9 -- Mon 06-Dec-21
Last Day of Instruction
Final Exam (MCQ + Short Answer) (on Canvas)
Final Practical Exam (sent to Instructor by email)

Assignments

All assignments are due by Sunday midnight PT at the end of the week in which they are scheduled, as noted in the table below. Completed Practical Labs work is evidenced by submitting an MS Word file (YourLastName,YourFirstName-Weeknn-Report.docx) to the discussion section and the appropriate week folder for this course on Canvas. Assignments are subject to change with fair notice.
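The report file name pattern can be produced programmatically, which helps avoid typos in weekly submissions. This is a minimal sketch: the helper name `report_filename` is hypothetical, and it assumes that `nn` in the pattern means a two-digit, zero-padded week number, which the syllabus does not state explicitly.

```python
def report_filename(last_name: str, first_name: str, week: int) -> str:
    """Build a lab-report file name following the syllabus pattern."""
    if not 1 <= week <= 99:
        raise ValueError("week must be between 1 and 99")
    # Assumption: 'nn' means a two-digit, zero-padded week number (e.g. 03).
    return f"{last_name},{first_name}-Week{week:02d}-Report.docx"


print(report_filename("Doe", "Jane", 3))  # prints Doe,Jane-Week03-Report.docx
```

The names "Doe" and "Jane" are placeholders; substitute your own last and first name.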

Assignment Due Dates:

Most assignments are due by Midnight Pacific Time on Sunday Night at the end of the relevant week unless noted differently below.

Week# -- Starting Date -- Practical-Labs & Semester-Project Assignments

Week 1 -- Mon 11-Oct-21
Lab #1: Download and install Virtual Machine (VM) for Hadoop Labs.
(Note: EACH WEEK, the lab generally requires the submission of a Report, written in MS Word, that shows what was accomplished in the Lab work and what else was learned during that week.)
Project: Semester Mini-Project Initialization

Week 2 -- Mon 18-Oct-21
Lab #2: Explore HDFS and run a simple Hadoop job
Project: Choose the dataset that you will use

Week 3 -- Mon 25-Oct-21
Project Milestone 1: Initial Thoughts (discuss direction of project)
In addition, you must comment on the work of at least two projects by the end of the week (Sunday)
Lab #3: Additional exploratory work with Hadoop VM

Week 4 -- Mon 01-Nov-21
Lab #4: TBD

Week 5 -- Mon 08-Nov-21
Lab #5: Data Movement
Project Milestone 2: One-page report on current progress of the project
In addition, you must comment on the work of at least two projects by the end of the week (Sunday)
Mid-Semester Quiz/Exam (MCQ + Short Answers)

Week 6 -- Mon 15-Nov-21
Lab #6: Using Jupyter Notebooks for running Python, etc. and for Data Visualization

Week 7 -- Mon 22-Nov-21
Research Paper on Management, Governance, Data Security, and especially the Ethics of Working with Data -- due date is by end of this week (Fri 29-Nov-21).
This week is Thanksgiving Week in 2021.

Week 8 -- Mon 29-Nov-21
Project Presentations / Submissions: Submit the final version of your Mini-Project -- due date is Tuesday 30-Nov-21.
In addition, you must comment on the work of at least two projects by the end of that week (Sunday, 05-Dec-21)

Week 9 -- Mon 06-Dec-21
Last Day of Instruction
Final Written Exam: online on Canvas (both MCQ + Short Answer), during Exam Week, Wed 08-Dec-21 to Tue 14-Dec-21
+ Final Practical Exam submission by email to Instructor by Wednesday 08-Dec-21

Labs / Hands-on Practical (40% of overall grade, supports CLO 2)

Individual, hands-on practices will be given throughout the semester to help students review and reinforce what they have learned in class. Students will learn how to analyze and visualize big data with practical tools. These Labs (or Hands-on Practices) are an important part of the course. The Labs generally require the submission of a Lab Report (MS Word File, .docx) to the Canvas website for the course.

Semester Mini-Project (30% of overall grade, supports CLOs 1-4)

Students will work in teams or alone on a Mini-Project that consists of three phases (more details TBA on the Canvas site). The main requirement of the project is that it uses at least one tool covered in the class.

  • Milestone I - Initial Thoughts: Students will submit a short paragraph discussing the potential topics and directions of the semester project. Students will also briefly present the motivation of the study and the approach that might be taken. 

Emphasis in the Project is on understanding the chosen dataset, data gathering, data munging / preparation, and the steps leading towards data analysis (this work is generally 60-80% of any data project). Heavy analysis and computer programming are beyond the scope of the project.

  • Milestone II - Mini Report: Students will submit a one-page report outlining the current progress of the project. The report will include what has been done, what the current status and results are, and what needs to be accomplished.
  • Final Report & Demo: Students will submit a detailed, 10-page report for the project. The report should at least include the following sections: motivation, problem statement, methodology, analysis results, discussion, and any conclusions. Students will also prepare a short “demo” to present and discuss their work.

Research Paper (5% of overall grade, supports CLO 4). The Research Paper will deal with topics such as Ethics related to Big Data Acquisition and Usage.

Mid-Semester Quiz (Multiple Choice Questions [MCQ] + Short-Answer Questions)
+ Final Exam Online (Multiple Choice Questions [MCQ] + Short-Answer Questions) & Final Practical Exam
(25% of overall grade, supports CLOs 1-4)

Grading Information

Grading will be based on a total accumulation of possible 100 percent, distributed as follows:

Deliverables -- Percent of overall grading (Total = 100%)

Hands-on Practices (“Labs”), evidenced by 7+ Weekly Reports -- 40%

Semester Mini-Project:
  • Milestone I: 2%
  • Milestone II: 3%
  • Final Report & Project Demo: 25%
Grading is based on the submissions plus constructive comments on at least two other students' Projects

Research Paper -- 5%

Mid-Semester Quiz/Exam & Final Exam -- 25%

These deliverables will be graded using larger point values, but the totals for each type of deliverable will be scaled to the relative percentages shown above. Letter grades follow the iSchool grading scale shown below.
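The scaling of point totals to percentages can be sketched as a weighted sum. This is an illustration only, not the instructor's actual gradebook calculation: the names `WEIGHTS` and `overall_percent` are hypothetical, and each deliverable's score is assumed to be expressed as the fraction of its available points earned.

```python
from typing import Mapping

# Weights taken from the grading table (percent of overall grade).
WEIGHTS = {
    "labs": 40,
    "project_milestone_1": 2,
    "project_milestone_2": 3,
    "project_final_report_and_demo": 25,
    "research_paper": 5,
    "quiz_and_final_exam": 25,
}


def overall_percent(fractions: Mapping[str, float]) -> float:
    """Combine per-deliverable fractions (0.0 to 1.0) into an overall percent."""
    missing = WEIGHTS.keys() - fractions.keys()
    if missing:
        raise ValueError(f"missing deliverables: {sorted(missing)}")
    # Each deliverable contributes its weight times the fraction of points earned.
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


# The weights add up to the full 100 percent.
assert sum(WEIGHTS.values()) == 100
```

For example, a student who earns half of the available Lab points and full marks elsewhere would lose 20 of the 40 Lab percentage points.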

Supplementary reading (see also the Textbooks below)

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. Retrieved from https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation

Various Apache Software Foundation (ASF) websites:

Overview: http://hadoop.apache.org and http://hadoop.apache.org/docs/current

MapReduce Tutorial: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

WordCount 2: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v2.0

Spark: http://spark.apache.org

Flume: http://flume.apache.org

Sqoop: http://sqoop.apache.org

Google’s Paper on Big Table: http://research.google.com/archive/bigtable.html

Google’s Paper on MapReduce: http://research.google.com/archive/mapreduce.html

FTC Report on Big Data: Tool for Inclusion and Exclusion: https://goo.gl/YgPCWv

Articles on specific topics, e.g.:

Spark: https://databricks.com/spark

Other electronic articles as referenced during the course itself.

Course Workload Expectations

Success in this course is based on the expectation that students will spend, for each unit of credit, a minimum of forty-five hours over the length of the course (normally 3 hours per unit per week with 1 of the hours used for lecture) for instruction or preparation/studying or course related activities including but not limited to internships, labs, clinical practica. Other course structures will have equivalent workload expectations as described in the syllabus.
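Applied to this course, the workload rule works out as follows. This is a rough sketch using the two units and the Oct 11 to Dec 6, 2021 run stated in this syllabus; spreading the hours evenly across the weeks is a simplifying assumption, and the variable names are illustrative.

```python
from datetime import date

HOURS_PER_UNIT = 45          # minimum hours per unit of credit
UNITS = 2                    # this is a 2-unit course

start, end = date(2021, 10, 11), date(2021, 12, 6)
weeks = (end - start).days / 7           # 56 days -> 8.0 weeks

total_hours = HOURS_PER_UNIT * UNITS     # 90 hours over the whole course
# Assumption: the hours are spread evenly across the weeks.
hours_per_week = total_hours / weeks

print(total_hours, weeks, hours_per_week)  # prints 90 8.0 11.25
```

Because the course is compressed into eight weeks, the minimum expectation is roughly 11 hours per week, well above the 3 hours per unit per week of a full-length semester.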

Instructional time may include but is not limited to:
Working on posted modules or lessons prepared by the instructor; discussion forum interactions with the instructor and/or other students; making presentations and getting feedback from the instructor; attending office hours or other synchronous sessions with the instructor.

Student time outside of class:
In any seven-day period, a student is expected to be academically engaged through submitting an academic assignment; taking an exam or an interactive tutorial, or computer-assisted instruction; building websites, blogs, databases, social media presentations; attending a study group; contributing to an academic online discussion; writing papers; reading articles; conducting research; engaging in small group work.

Course Prerequisites

Graduate Standing or Instructor Consent

Course Learning Outcomes

Upon successful completion of the course, students will be able to:

  1. Describe and explain how the main technologies and trends in big data work, specifically data visualization, large-scale database management, map-reduce paradigm, and big data mining.
  2. Demonstrate proficiency in using current big data technologies to solve big data analytical problems.
  3. Interpret and communicate big data analysis and visualization results appropriately, effectively and accurately.
  4. Discuss, articulate and compare various big data management issues (e.g., big data privacy).

SLOs and PLOs

This course supports Informatics SLO 3: Demonstrate proficiency in using current big data and electronic records technologies to solve analytical problems; including developing policies, standards, and practices in particular specialized contexts and interpreting and communicating analysis and visualization results appropriately and accurately.

SLO 3 supports the following Informatics Program Learning Outcomes (PLOs):

  • PLO 2 Evaluate, manage, and develop electronic records programs and applications in a specific organizational setting.
  • PLO 3 Demonstrate strong understanding of security and ethics issues related to informatics, user interface, and inter-professional application of informatics in specific fields by designing and implementing appropriate information assurance and ethics and privacy solutions.
  • PLO 6 Conduct informatics analysis and visualization applied to different real-world fields, such as health science and sports.

Textbooks

Required Textbooks:

  • Akhtar, S. (2018). Big data architect's handbook: A guide to building proficiency in tools and systems used by leading big data experts. Packt Publishing. Available through Publisher
  • Döbler, M., & Großmann, T. (2019). Data visualization with Python. Packt Publishing. Available through Publisher
  • Godsey, B. (2017). Think like a data scientist: Tackle the data science process step-by-step. Manning. Available through Amazon: 1633430278
  • McCreary, D., & Kelly, A. (2013). Making sense of NoSQL: A guide for managers and the rest of us. Manning Publications. Available through Amazon: 1617291072

Recommended Textbooks:

  • Kaldero, N. (2018). Data science for executives: Leveraging machine intelligence to drive business ROI. Lioncrest Publishing. Available through Amazon: 1544511256
  • Spivey, B., & Echeverria, J. (2015). Hadoop security: Protecting your big data platform. O'Reilly Media. Available through Amazon: 1491900989

Grading Scale

The standard SJSU School of Information Grading Scale is utilized for all iSchool courses:

97 to 100 A
94 to 96 A minus
91 to 93 B plus
88 to 90 B
85 to 87 B minus
82 to 84 C plus
79 to 81 C
76 to 78 C minus
73 to 75 D plus
70 to 72 D
67 to 69 D minus
Below 67 F
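The scale above can be read as a lookup from an overall percentage to a letter grade. This is a minimal sketch, not an official grading tool: the names `SCALE` and `letter_grade` are hypothetical, and because the syllabus does not say how fractional percentages are rounded, the lower bound of each band is used as a threshold.

```python
# (lower bound, letter) pairs from the grading scale, highest band first.
SCALE = [
    (97, "A"), (94, "A minus"),
    (91, "B plus"), (88, "B"), (85, "B minus"),
    (82, "C plus"), (79, "C"), (76, "C minus"),
    (73, "D plus"), (70, "D"), (67, "D minus"),
]


def letter_grade(percent: float) -> str:
    """Map an overall percentage (0 to 100) to a letter grade."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Assumption: a score belongs to the highest band whose lower bound it reaches.
    for lower, letter in SCALE:
        if percent >= lower:
            return letter
    return "F"


print(letter_grade(88))  # prints B
```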

 

In order to provide consistent guidelines for assessment for graduate level work in the School, these terms are applied to letter grades:

  • C represents Adequate work; a grade of "C" counts for credit for the course;
  • B represents Good work; a grade of "B" clearly meets the standards for graduate level work or undergraduate (for BS-ISDA);
    For core courses in the MLIS program (not MARA, Informatics, BS-ISDA) — INFO 200, INFO 202, INFO 204 — the iSchool requires that students earn a B in the course. If the grade is less than B (B- or lower) after the first attempt you will be placed on administrative probation. You must repeat the class if you wish to stay in the program. If - on the second attempt - you do not pass the class with a grade of B or better (not B- but B) you will be disqualified.
  • A represents Exceptional work; a grade of "A" will be assigned for outstanding work only.

Graduate Students are advised that it is their responsibility to maintain a 3.0 Grade Point Average (GPA). Undergraduates must maintain a 2.0 Grade Point Average (GPA).

University Policies

Per University Policy S16-9, university-wide policy information relevant to all courses, such as academic integrity, accommodations, etc., is available on the Office of Graduate and Undergraduate Programs' Syllabus Information web page at: https://www.sjsu.edu/curriculum/courses/syllabus-info.php. Make sure to visit this page and to review and become familiar with these university policies and resources.

In order to request an accommodation in a class please contact the Accessible Education Center and register via the MyAEC portal.
