Machine Learning and Data Analysis (Strijov's practice)
- Short URL to this page: http://goo.gl/7yWAAX
Introduction
"Machine Learning and Data Analysis" is a practical course that focuses on methods for scientific research. The course teaches students how to conduct research projects in the field of machine learning and data analysis. The abstract goal is to learn to convey ideas in a precise, clear, and elegant way; the specific goal is to write a research paper that is accepted by other researchers in the field of machine learning and data analysis, and to present a report. The expected result is a research paper submitted to a peer-reviewed journal from the list compiled by the Higher Attestation Commission.
The course introduces students to technologies used in scientific research and teaches them to present the results of their studies in the format used by other researchers in the field of machine learning and data analysis. By the end of the term, each student is expected to write a research paper and submit it to a peer-reviewed journal from the list compiled by the Higher Attestation Commission. During the course the students learn the basics of scientific writing and of designing computational experiments, using associated tools such as the typesetting system LaTeX, the bibliographic system BibTeX, and the computing environment MATLAB.
The work on a project includes exploring the literature, writing a mathematical problem statement and an algorithm description, investigating the algorithm's properties, and running computational experiments. Each student selects a personal problem from the list of suggested research topics. The student analyzes recent publications on the selected topic, formulates the problem, and presents it to the group. Then the student gives a mathematical description and analysis of the suggested methods, followed by an intermediate report. The last step is to run computational experiments that illustrate the method's properties using real or synthetic data. Each paper undergoes a revision process with the student's peers acting as reviewers. The work is synchronized via SourceForge.org, in the MLAlgorithms project.
Course format. Each project is aided by an assistant and an expert. A student is willing to learn how to formally state research problems, find adequate references, and generate novel and significant ideas for problem solving.
An assistant helps the student with technical issues, consults the student on topics of machine learning, reacts promptly to arising problems, and performs evaluations and grading. Each assistant is supposed to have sufficient publishing experience. Ideally, the assistant is writing a paper on an adjacent topic. It is recommended to organize the weekly reviewing process in such a way that the student enters the corrections personally.
An expert guarantees the novelty and importance of the paper, suggests problems, and provides data.
Course-related materials
- Brief description of the course: goals, structure and grading policy CourseShort.pdf
- Slides in PDF with course overview (goals, syllabus, summary of 2009-2014 results) CourseSlides.pdf
- Basic schedule with the list of tasks to complete
- Report on the course results, Fall 2013 Report2013Fall.pdf
- Report presentation templates in pdf, tex
-  Lists of recommended journals on Machine Learning and Data Analysis: 
- High impact factor High_IF_ScientificJournals.pdf
- Low impact factor Low_IF_ScientificJournals.pdf
 
-  On reviewing/resubmitting/correcting the paper:
- Examples of feedback from reviewers: Review1.pdf, Review2.pdf, Review3.pdf
- Sample responses Response1.pdf, Response2.pdf
- Correction sample CorrectedPaper.pdf
 
Past terms
| Link to the course page | Description | 
|---|---|
| Group 274, summer 2015 (In Russian) | My first publication in a Higher Attestation Commission journal. The course involves experts and personal assistants. | 
| Group YАД, summer 2015 (In Russian) | My first publication in a Higher Attestation Commission journal. The course involves experts and personal assistants. | 
| Group 174, summer 2015 (In Russian) | Research planning. | 
| Group 174, winter 2014 (In Russian) | Conducting commercially oriented research, developing applications. The problems are chosen from industrial and academic sources. | 
| Group 974, winter 2014 (In Russian) | Lectures on emerging machine learning issues. Essays and practice in Mathematica. | 
| Group 174, summer 2014 (In Russian) | My first publication in a Higher Attestation Commission journal. The course involves experts and personal assistants. | 
| Group 074, summer 2014 (In Russian) | Writing essays: brief problem statements and analysis | 
| Group 974, summer 2014 (In Russian) | The "Software engineering" course, professor L. Karpov | 
Requirements
Basic
- Students must have previously passed courses in analysis, discrete mathematics, probability theory, statistical inference, and optimization algorithms.
Advanced
- The students are encouraged to get acquainted with materials of the lecture course on machine learning by K. Vorontsov.
Approximate syllabus
- Find and describe the data. Compose a reference list and store it in a .bib file. Write an annotation (abstract) for the paper.
- Visualize the data. Make a literature review.
- Write an introduction to the paper. The introduction should include a review of existing methods and a description of the proposed approach.
- Write a problem statement. Emphasize the novelty of the suggested approach. Draft a solution.
- Design a computational experiment and obtain initial results.
- Describe the suggested approach in detail.
- Complete computational experiments.
- Describe the results of computational experiments. This includes error analysis and comparison to other methods.
- Correct the paper according to the reviewers' comments.
- Correct theoretical content.
- Correct the paper's structure.
- Submit the manuscript of the paper to a journal.
- Make a report
Consulting and grading
- The project is divided into separate tasks, each followed by a list of requirements that determine the quality criteria for grading.
- Each task must be completed during the week and submitted the day preceding the lecture.
- Preferably, each task is improved and resubmitted several times before the deadline.
Each completed task (marked with a corresponding letter) yields 1 point, and the suffix +/- adds/subtracts 0.25 points.
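The scoring rule above can be sketched in a few lines. This is an illustration only: it assumes a mark is written as a task letter optionally followed by a single `+` or `-` (for example `A`, `I+`, `L-`), and the function names are hypothetical.

```python
def task_score(mark: str) -> float:
    """Score for one task mark such as "A", "I+", or "L-".

    Assumption: the mark is a task letter with an optional '+' or '-' suffix.
    """
    base = 1.0  # each completed task yields 1 point
    if mark.endswith("+"):
        return base + 0.25
    if mark.endswith("-"):
        return base - 0.25
    return base


def total_score(marks: list[str]) -> float:
    """Sum the scores of all completed tasks."""
    return sum(task_score(mark) for mark in marks)


print(total_score(["A", "I+", "L-", "S"]))  # 4.0
```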
Homeworks
Note for assistants. The tasks listed below provide quality criteria for homework grading.
Homework 1: synchronization tools
- Acquire the technical computing environment (MATLAB or Octave).
- Install the typesetting system TeX (MikTeX for Windows, TeX Live for Linux and Mac OS).
- Install a text editor, for example TeXnicCenter or WinEdt for Windows, and TeXworks for Linux.
- Install the bibliographic reference manager JabRef.
- Create an account at the [1] repository and e-mail the login to the group's coordinator. Read the [introductory materials] on version control systems.
- Install a Subversion client (TortoiseSVN for Windows, RabbitVCS for Linux).
- Following the guidelines, check out the MLAlgorithms repository.
- Create an account at MachineLearning.ru and e-mail the login to the group's coordinator.
Run the installed tools and get acquainted with their interfaces.
Homework 1: LaTeX
- If necessary, read the LaTeX and BibTeX articles.
- Download the article template (ZIP) and compile it.
Homework 1: MATLAB
- Read the [introductory materials] on MATLAB.
- Read the documentation conventions in MATLAB Programming Style Guidelines.
Homework 2: test programming problem
This should take between two and six hours. The purpose of this homework is for the students to practice using MATLAB/Octave before the projects are launched.
- Select a problem from the list (rus) and enter its number in the table on the group's page.
- In the directory MLAlgorithms/Group274/Example2015Code create a subdirectory Surname2015Problem0 (your surname, year, "Problem", problem number).
- Upload your code and figures into your directory.
- Provide a brief annotation for the problem that includes a description of the solution and the results. Give a short report (approximately 45 seconds) based on this description.
- Note that all figures should be formatted according to recommendations from JMLDA/Fig (rus).
- The code should be documented according to MATLAB Programming Style Guidelines.
- Tip: first formulate your solution in formulas, then rewrite it as code.
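The directory naming convention above (surname, year, "Problem", problem number) can be expressed as a small helper. This is a sketch: the function name, the surname `Ivanov`, and the problem number are hypothetical, and the base path is simply the one quoted in the task.

```python
import posixpath


def problem_dir(surname: str, year: int, problem_number: int) -> str:
    """Build a subdirectory name of the form Surname2015Problem0."""
    return f"{surname}{year}Problem{problem_number}"


def problem_path(surname: str, year: int, problem_number: int,
                 base: str = "MLAlgorithms/Group274/Example2015Code") -> str:
    """Join the base directory from the task with the subdirectory name."""
    return posixpath.join(base, problem_dir(surname, year, problem_number))


# Hypothetical student and problem number, for illustration only.
print(problem_path("Ivanov", 2015, 7))
# MLAlgorithms/Group274/Example2015Code/Ivanov2015Problem7
```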
Examples:
- [2]. The code is documented, results are visualized, and the solution is described. This example lacks a report of the results and a description of the figures. Figure formatting: compare fig1 and fig2.
- [3]. The code is documented, results are visualized, and the solution is described. Figure formatting is OK. This example lacks a report of the results and a description of the figures.
Examples of basic machine learning problems in MATLAB: [4].
Homework AIL: Annotation, Introduction, Literature
- Select a problem and insert the project title into the table on the page (Group 174, summer 2014).
- Collect a reference base and store the bibliographic information using BibTeX. Take care to fill in the name, volume, and pages fields.
-  Write the annotation section (about 600 characters) and insert the text into the article template. Follow the plan:
- provide a general overview of the study,
- define the more specific focus of the study,
- list particular features of the problem,
- highlight the author’s contribution and novelty of the methods; and
- mention some practical problems that illustrate the study.
 
- Think of several keywords.
- Create a project directory. Create subdirectories (doc, code,...), following guidelines from “Working with repository” section, and upload the template.
- Write synopses of the sources in the reference list and compose a literature review.
-  Write the introduction section. The introduction section should contain 
- the central message of the paper in one or two phrases,
- a review of the existing literature concerning state of the art methods, limitations, and problems (two to four paragraphs),
- a description of the proposed method (approximately two paragraphs).
 
- Insert a link to your directory and to the PDF file of the paper into the table. Examples: [5], pdf.
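A quick way to check that the required BibTeX fields are filled is sketched below. It is a rough illustration, not a BibTeX parser: the regular expression only handles simple `field = {value}` or `field = "value"` pairs without nested braces, the "name" field is assumed to correspond to `journal`, and the entry itself is invented.

```python
import re

# Assumption: the "name, volume and pages" fields map to these BibTeX keys.
REQUIRED_FIELDS = ("journal", "volume", "pages")


def missing_fields(entry: str, required=REQUIRED_FIELDS) -> list[str]:
    """Return the required fields that are absent or empty in one entry."""
    pairs = re.findall(r'(\w+)\s*=\s*[{"]([^{}"]*)[}"]', entry)
    found = {name.lower(): value.strip() for name, value in pairs}
    return [field for field in required if not found.get(field)]


# Hypothetical entry, for illustration only.
entry = """@article{example2015,
  author  = {A. Author},
  title   = {A hypothetical title},
  journal = {A Hypothetical Journal},
  volume  = {},
  year    = {2015}
}"""

print(missing_fields(entry))  # ['volume', 'pages']
```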
Homework SBR: Problem Statement, Basic algorithm, Report
-  Discuss the problem statement with your assistant. Look through example problem statements from papers written by senior students during the course and from papers in peer-reviewed journals. Write your problem statement (0.5 to 1 page), using the recommended notation from this archive (rus). The problem statement should include:
- a data set description or list of permissible operations on samples,
- statistical or other assumptions on the origins of the data set,
- description and (optionally) rationale of the error function, loss function, or other quality function, which is used to measure the quality of the solution,
- a description of the constraints (if present),
- additional quality functions, and
- a statement of the optimization problem.
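As an illustration of how the listed components fit together, a regression problem statement might look like the sketch below. The notation and the particular choices (Gaussian noise, squared error, a norm constraint) are assumptions made for this example; they are not the course's recommended notation.

```latex
% Illustrative sketch only; requires amssymb for \mathfrak and \mathbb.
Let $\mathfrak{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ be a sample with
$\mathbf{x}_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$, and assume
$y_i = f(\mathbf{w}, \mathbf{x}_i) + \varepsilon_i$ with independent
$\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.
The error function is
\[
  S(\mathbf{w} \mid \mathfrak{D})
  = \sum_{i=1}^{m} \bigl(y_i - f(\mathbf{w}, \mathbf{x}_i)\bigr)^2 ,
\]
and the optimal parameters solve the constrained problem
\[
  \hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathcal{W}}
  S(\mathbf{w} \mid \mathfrak{D}),
  \qquad
  \mathcal{W} = \{\mathbf{w} \in \mathbb{R}^k : \|\mathbf{w}\|_2 \le \lambda\}.
\]
```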
 
-  Design a computational experiment that provides a baseline quality. The baseline algorithm for primary experiments is chosen with the assistant or fixed by the expert in the problem description.
- Prepare a synthetic dataset or choose easily accessible real data with an uncomplicated structure.
- Run the baseline algorithm on this data and evaluate the error function.
- Illustrate the results using straightforward figures and graphs. Remember to use formatting recommendations.
 
-  Write a concise report, including
- a brief description of experimental goals;
- a brief description of the way synthetic data was generated or the origins of real data (see, for instance, Bishop C.M. Pattern recognition and machine learning, 2006. Pp. 677-683); and
- a brief description of the figures and results.
 
- Insert the problem statement and the initial report into the respective sections of the paper. Upload the data and code into directories Example2014Code/data and Example2014Code/code.
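A baseline experiment of the kind described above can be very small. The sketch below is written in Python for illustration (the course itself uses MATLAB/Octave) and assumes a particular setting that the source does not prescribe: synthetic data from a noisy line, an ordinary least-squares fit as the baseline algorithm, and mean squared error as the error function. All function names and parameter values are hypothetical.

```python
import random


def make_synthetic(m=50, slope=2.0, intercept=1.0, noise=0.1, seed=0):
    """Generate m points on a line with additive Gaussian noise."""
    rng = random.Random(seed)  # fixed seed keeps the experiment reproducible
    xs = [rng.uniform(0.0, 1.0) for _ in range(m)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    residuals = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return sum(residuals) / len(residuals)


xs, ys = make_synthetic()
slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f} intercept={intercept:.2f} "
      f"mse={mse(xs, ys, slope, intercept):.4f}")
```

Because the generating parameters are known, the report can compare the recovered slope and intercept with the true ones before moving on to real data.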
Homework
- Prepare data for the real-life problem, as described in the problem description. Upload data to the repository (if the data exceeds 5 MB or there are too many files, refer to the assistant for advice).
- State computational experiment goals, write a plan for the experiments, and describe the data.
- If needed, introduce additional quality criteria and constraints.
- List the methods to be compared and explain the error analysis procedure.
- Make a list of figures and tables for visualization purposes and describe them.
- Run computational experiments (in some simplified settings it may be possible to use synthetic data), draw figures, and generate data for tables.
-  Prepare a presentation of the project (up to 3 minutes). Rehearse your presentation and time it. Follow the plan:
- project plan and general contents,
- specific features of the project,
- what has been done,
- obtained results, and
- further improvements.
 
-  Presentation templates in pdf and tex format.
- Title, 1 slide.
- Project goals, the problem, approaches, 1 slide.
- Illustration of the problem, 1-2 slides (free-form).
- Primary references, 1 slide.
- Problem statement, 1-2 slides.
- Theoretical part, 1-10 slides (not expected yet).
- Aims of experiments, 1 slide.
- Experimental design, 1-5 slides.
- Experimental results, 1-3 slides.
- Conclusion, main results.
 
- The talk should take about 3 min.
Homework T: Theory
- Write a theoretical description of your proposed solution.
- Compare this text with the code, and then correct the code and the text.
- Contact your assistant as early as possible to discuss the results.
Homework D: Document
Prepare a draft of the paper containing all the following main sections.
- Title and annotation. Correct the first draft of the annotation according to the changes made/results obtained.
- Keywords: the most important terms. To identify good keywords, try using them for a web search. The topic of the search results must be close to the project’s topic.
- Introduction and problem statement.
- Theoretical aspects of the proposed solution.
- Other possible sections as appropriate.
-  Methodological aspects of the proposed solution and computational experiments. These sections can be merged.
- Comparison to other methods.
- Results analysis.
 
-  Conclusion. Provide a brief review of the results. 
- Preferably, give a link to the corresponding directory in MLAlgorithms with the main project files, so that other researchers can reproduce the reported results.
 
-  Reference list:
- papers that are focused on a narrow area; they may contain a method description or application,
- at least five papers published in the last 5 years, and
- three to five fundamental reviews.
- Tip: use the \nocite{*} command to obtain a complete reference list.
 
Homework E: Error analysis
- Complete the description of computational experiments, and implement one or more of the suggested error analysis methods.
- Prepare a final presentation of the project. The presentation should have a similar structure to the paper.
- See recommendations (rus) on reporting results.
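The source does not list the suggested error analysis methods. As one common possibility, the sketch below estimates the test error and its spread with k-fold cross-validation; the choice of method, the function names, and the constant-mean baseline model are all assumptions made for illustration.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_mse(xs, ys, fit, predict, k=5, seed=0):
    """Return the held-out mean squared error for each of the k folds."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in test_idx]
        errors.append(statistics.mean(fold_errors))
    return errors


def fit_mean(xs, ys):
    """Hypothetical baseline: always predict the mean of the training targets."""
    return statistics.mean(ys)


def predict_mean(model, x):
    return model


# Toy data, for illustration only.
xs = [float(i) for i in range(20)]
ys = [0.5 * x for x in xs]
errors = cross_val_mse(xs, ys, fit_mean, predict_mean, k=5)
print(f"mean={statistics.mean(errors):.3f} stdev={statistics.stdev(errors):.3f}")
```

Reporting the spread across folds alongside the mean makes comparisons with other methods easier to interpret.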

