INFO 246-03
INFO 246-15
Information Technology Tools and Applications – Advanced Topic: Web/Text/Data Mining for LIS
Fall 2020 Syllabus
Dr. Geoffrey Z. Liu
E-mail
Other contact information: telephone: (408) 924-2467
Office Location: Clark Hall 418L, SJSU Campus
Office Hours: Email, Zoom Chat, and in-person by appointment
Canvas Information: Courses will be available beginning on August 19th at 6 am PT, unless you are taking an intensive or a one-unit or two-unit class that starts on a different day. In that case, the class will open on the first day that the class meets.
You will be enrolled in the Canvas site automatically.
Course Description
This course is an introduction to web, text, and data mining from the perspective of library and information services. Students will learn basic concepts, approaches, and practical techniques of web/text/data mining by conducting group topical research and completing one individual mining project (consisting of ten stages/exercises) with RapidMiner (a free data mining application with extensions for web/text processing). No programming (coding) skill is required, but some understanding of basic computational processes is desirable.
Course Requirements
Assignments
Students' performance in this class will be evaluated on the basis of the following assignments:
- Self-introduction (1%) -- CLO #1, #2
- Group online discussion (two sessions, 2% each) -- CLO #1, #2
- Quizzes embedded in assigned textbook chapters (20%) -- CLO #1, #2, #3
- Individual mining project (10 stages, 75% in total) -- CLO #4, #5
  - RapidMiner installation/configuration
  - Excel data import/export
  - Reading/writing text files
  - Correlation analysis
  - Web crawling and text extraction
  - Preprocessing of texts
  - Document clustering
  - Constructing a random set for model training
  - Building a data model (NN/Bayesian classifier)
  - Testing a data model (NN/Bayesian classifier)
At the start of the semester, students will be randomly assigned to groups (ideally of five) to conduct two sessions of online discussion on given topics in their group forums. The group discussions are mainly for sharing findings and commenting on related issues. The first session is on the topic "Web/text/data mining in the library and information science fields", and the second session is on the topic "Ethical/social issues related to big data and data mining". In each session, every member of a group will create and moderate one thread in the group's forum by making a lead post. The lead post may be a critical review of a scholarly article, an analysis of a related project, or a reflection on personal experience/observation. A reference and the article PDF should be included when applicable. Other group members will participate by responding.
Detailed instructions for each stage of the individual mining project will be provided on the Canvas class site, along with other course materials.
Students are encouraged (but not required) to record presentations of the individual mining project in Zoom. The presentation recordings may be used later for the INFO 289 e-portfolio as competency evidence or for a job interview as a demo of skills.
All written work should be professionally prepared following the APA editorial style and established conventions of academic writing, free of grammatical errors and spelling mistakes. Tutorials, assistance, and resources for improving academic writing skills are available at the Writing Resources Center.
It is the students' responsibility to submit and maintain the electronic version of their work until the final grade is issued.
Course Calendar
(Tentative. A final/extensive version will be provided in Canvas)
Session | Topic | Tasks & Dues |
0 8/22 | Orientation | Zoom Meeting (mandatory) |
1 8/24 | Introduction | Ex-1 DUE on 8/28 (Installing RapidMiner) |
* Sat. 8/29 | Lab Session 1: Initiation on RapidMiner (Zoom Meeting) (mandatory for whole class) | Zoom Meeting (mandatory); Self Intro End |
* 9/7 | Labor Day (campus closed) | |
3 9/8 | Survey of Software Tools | Ex-2 DUE (First process: import/export Excel data) |
4 9/14 | Statistical Data Analysis (for mining) | Ex-3 DUE (Read/write text files); GD-1 Start (Web/Text/Data mining in LIS) |
5 9/21 | Web Mining (Content and Structure) | Ex-4 DUE (Correlation analysis) |
* Sat. 9/26- | Lab Session 2: Web crawling & document extraction (Zoom Meeting) (Hourly timeslots for individual tutoring) | |
6 9/28 | (Reserved) | |
7 10/5 | NLP and Text Preprocessing | Ex-5 DUE (Crawling & extracting web docs); GD-1 End |
* Sat. 10/10- | Lab Session 3: Text Preprocessing (Zoom Meeting) (Hourly timeslots for individual meetings) | |
8 10/12 | (Reserved) | |
9 10/19 | Statistical Text Mining | Ex-6 DUE (Preprocessing texts) |
* Sat. 10/24- | Lab Session 4: Document clustering (Zoom Meeting) (Hourly timeslots for individual meetings) | GD-2 Start (Ethical/social issues) |
10 10/26 | (Reserved) | |
11 11/2 | Data Transformation & Modeling | Ex-7 DUE (Document clustering) |
* Sat. 11/7- | Lab Session 5: NN-based data modeling (Zoom Meeting) (Hourly timeslots for individual meetings) | Ex-8 DUE on 11/9 (Constructing random training sets) |
12 11/9 | (Reserved) | |
* 11/11 | Veterans Day (campus closed) | |
13 11/16 | Transaction Log Analysis (Web Analytics) | GD-2 End |
14 11/23 | Competitive Intelligence | Ex-9 DUE (Building data model -- NN classifier) |
* 11/26-27 | Thanksgiving (campus closed) | |
15 12/7 | Conclusion | Ex-10 DUE (Testing data model -- NN classifier) |
Notes:
* Ex-#: Stages of an individual mining project (exercises); GD-#: Group discussion online.
** Lab sessions 2-5 are optional individual meetings for 1-to-1 tutoring. Students in need of assistance will sign up for hourly timeslots on the specified day(s), using the Canvas appointment scheduling tool.
*** Quizzes embedded in assigned chapters for reading are not listed.
Grading
Students' work will be evaluated according to the following specific criteria:
- Basic content as required (90%)
- Originality and creativity (10%)
Group online discussion on assigned topics will be graded based on meeting the minimum expectation of postings as tracked by the Canvas system, with necessary adjustment for quality of contribution. Individual exercises (as stages of a data/text mining project) will be letter-graded.
The SJSU iSchool's Standard Grading Scale will be used to convert letter grades to percentage scores. Per-assignment scores are added up proportionately to yield the total of earned points, which in turn is converted back into a letter grade using the same scale. No additional work is offered for extra credit or for making up for a missed assignment.
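As an illustration of the arithmetic described above, the following is a minimal sketch in Python, not the instructor's actual grade book. The weights come from the Assignments list and the bands from the Grading Scale in this syllabus. The function names, the example scores, and the handling of fractional totals (compared directly against each band's lower bound, without rounding) are assumptions made for illustration only; the syllabus does not specify them.

```python
# Weights (percent of the final grade) from the Assignments list.
WEIGHTS = {
    "self_introduction": 1,
    "group_discussion_1": 2,
    "group_discussion_2": 2,
    "quizzes": 20,
    "mining_project": 75,
}

# Lower bound of each band in the grading scale; below 67 is an F.
SCALE = [
    (97, "A"), (94, "A minus"), (91, "B plus"), (88, "B"),
    (85, "B minus"), (82, "C plus"), (79, "C"), (76, "C minus"),
    (73, "D plus"), (70, "D"), (67, "D minus"),
]


def weighted_total(scores: dict[str, float]) -> float:
    """Combine per-assignment percentage scores (0-100) into a course total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def to_letter(total: float) -> str:
    """Map a course total to a letter grade (assumption: no rounding)."""
    for lower, letter in SCALE:
        if total >= lower:
            return letter
    return "F"


# Hypothetical scores, for illustration only.
example = {
    "self_introduction": 100,
    "group_discussion_1": 100,
    "group_discussion_2": 100,
    "quizzes": 90,
    "mining_project": 88,
}
total = weighted_total(example)
print(total, to_letter(total))  # prints: 89.0 B
```

Because the individual mining project carries 75% of the weight, its score dominates the total: in this example, full marks on the three smallest components add only 5 points.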
Late submissions will not be accepted unless appropriate documentation of a legitimate cause for the delay is provided, either as part of a prior arrangement or promptly afterward. Requests for deadline extensions will be handled in the same way as requests for an RP (incomplete), in accordance with university/school policy.
Software Requirement
- Microsoft Excel (version 2009 or later, included in Microsoft Office)
- Screen capture tool (such as the Windows Snipping Tool)
- RapidMiner Studio (free academic licensing for both instructor and students)
Course Workload Expectations
Success in this course is based on the expectation that students will spend, for each unit of credit, a minimum of forty-five hours over the length of the course (normally 3 hours per unit per week with 1 of the hours used for lecture) for instruction or preparation/studying or course related activities including but not limited to internships, labs, clinical practica. Other course structures will have equivalent workload expectations as described in the syllabus.
Instructional time may include but is not limited to:
Working on posted modules or lessons prepared by the instructor; discussion forum interactions with the instructor and/or other students; making presentations and getting feedback from the instructor; attending office hours or other synchronous sessions with the instructor.
Student time outside of class:
In any seven-day period, a student is expected to be academically engaged through submitting an academic assignment; taking an exam or an interactive tutorial, or computer-assisted instruction; building websites, blogs, databases, social media presentations; attending a study group; contributing to an academic online discussion; writing papers; reading articles; conducting research; engaging in small group work.
Course Prerequisites
INFO 246 has no prerequisite requirements.
Course Learning Outcomes
Upon successful completion of the course, students will be able to:
- Describe key concepts and terminologies in the field of text, data, and Web mining.
- Describe major approaches and techniques of text, data, and Web mining.
- Discuss the roles of text, data, and Web mining in intelligence and knowledge discovery.
- Use a software tool to accomplish a reasonably sophisticated text, data, or Web mining task.
- Integrate, summarize, and report the findings of mining research.
Core Competencies (Program Learning Outcomes)
INFO 246 supports the following core competencies:
- E Design, query, and evaluate information retrieval systems.
- G Demonstrate understanding of basic principles and standards involved in organizing information such as classification and controlled vocabulary systems, cataloging systems, metadata schemas or other systems for making information accessible to a particular clientele.
- H Demonstrate proficiency in identifying, using, and evaluating current and emerging information and communication technologies.
- J Describe the fundamental concepts of information-seeking behaviors and how they should be considered when connecting individuals or groups with accurate, relevant and appropriate information.
Textbooks
Required Textbooks:
- North, M. (2020). Data mining for the masses: With implementations in RapidMiner and R (4th ed.). MyEducator. Available online from the publisher.
Recommended Textbooks:
- Han, J., & Kamber, M. (2005). Data mining: Concepts and techniques (2nd ed.). Morgan Kaufmann. Available through Amazon: 1558609016
Grading Scale
The standard SJSU School of Information Grading Scale is utilized for all iSchool courses:
97 to 100 | A |
94 to 96 | A minus |
91 to 93 | B plus |
88 to 90 | B |
85 to 87 | B minus |
82 to 84 | C plus |
79 to 81 | C |
76 to 78 | C minus |
73 to 75 | D plus |
70 to 72 | D |
67 to 69 | D minus |
Below 67 | F |
In order to provide consistent guidelines for assessment for graduate level work in the School, these terms are applied to letter grades:
- C represents Adequate work; a grade of "C" counts for credit for the course;
- B represents Good work; a grade of "B" clearly meets the standards for graduate level work or undergraduate (for BS-ISDA);
  For core courses in the MLIS program (not MARA, Informatics, BS-ISDA), namely INFO 200, INFO 202, and INFO 204, the iSchool requires that students earn a B in the course. If the grade is less than B (B- or lower) after the first attempt you will be placed on administrative probation. You must repeat the class if you wish to stay in the program. If - on the second attempt - you do not pass the class with a grade of B or better (not B- but B) you will be disqualified.
- A represents Exceptional work; a grade of "A" will be assigned for outstanding work only.
Graduate Students are advised that it is their responsibility to maintain a 3.0 Grade Point Average (GPA). Undergraduates must maintain a 2.0 Grade Point Average (GPA).
University Policies
Per University Policy S16-9, university-wide policy information relevant to all courses, such as academic integrity and accommodations, is available on the Office of Graduate and Undergraduate Programs' Syllabus Information web page at: https://www.sjsu.edu/curriculum/courses/syllabus-info.php. Make sure to visit this page and to review and become familiar with these university policies and resources.
In order to request an accommodation in a class please contact the Accessible Education Center and register via the MyAEC portal.