Machine Learning and Data Analysis (Strijov practice)/Group 074, Fall 2013

Main article: Machine Learning and Data Analysis (Strijov practice, in Russian)


The completed projects are located at http://mvr.jmlda.org

Problems

Author | Problem name | Link | [BMF]LSICUDTPRWS | Total Grade
Bunakov Vasiliy | Fraud Signature Recognition Using SVM Method | [1] | [BM+F]L+SI+CU-DTPRWS | 14.5
Vdovina Evgeniya | Visualization of Results of Keyword Groups Mapping | [2] | [BF]L-S+I+C0DT-0R-0S | 9.75
Voronov Sergey | Google Street View Text Detection and Recognition | [3] | [BM+F]LS-I+CU+DTP+R-W+S-- | 14.25
Grinchuk Oleg | Macroeconomic Conditions Forecasting | [4] | [BMF]L-SI-C-0DTPRWS | 12.25
Dubovik Anna | Classification and Exploring of Source Code of Python Projects. | [5] | [M]L0I-->>>000 |
Zhelavskaya Irina | Automatic Filters Generator for Gmail | [6] | [BM+]LS->>>>>00I |
Zhuykov Vladimir | Fraud Signature Recognition | [7] | [B]L--0I-->>>>> |
Ivanov Sergey | Personalize Expedia Hotel Searches | [8] | [B]+L-SI+>> |
Ivanov Aleksandr | Detecting Unsolicited SMS Messages | [9] | [BM+]LSIC->>U>DTPR |
Kasatkin Sergey | Determination of the Type of Human Activity Based on the Data from the Accelerometer | [10] | [B]L-S-I-->>>000 |
Katrutsa Aleksandr | Search Engine Results Ranking | [11] | [BM+F]L+SI+CUDTPR+W+S | 15.25
Kolchanov Andrey | The Financial Bubbles Detection in The Stock Data | [12] | [B]0S-I->>> |
Kostin Aleksandr | Classify Handwritten Digits | [13] | [B]L+S-IS- |
Kotenko Lengold Ekaterina | Satellite Imagery Processing for NDVI Estimation | [14] | [BMF-]L-S-IC-UD--000W--S-- | 8.5
Kudryashova Aleksandra | Satellite Imagery Processing for NDVI Estimation | [15] | [BMF-]L-S-IC-UD--000W--S-- | 8.5
Levdik Pavel | Electricity Prices Forecasting | [16] | [BM+]L-SIC--U-D->PR-W> | 9.75
Matrosov Mikhail | Short-term Forecasting of Musical Compositions | [17] | [BF]L-SIC--UDT>>W+S | 9.5
Mityashov Andrey | Unstructured Social Data Processing in Classification Problem | [18] | [M+F]L+SI--C-UDT--P00S- | 10
Neklyudov Kirill | Face Recognition | [19] | [BM+F]LS-I+CU-DTPR-WS- | 13.5
Perekrestenko Dmitriy | Human Activity Recognition Using Deep Learning | [20] | [BM+F]L-SI-CU-DTPRW+S | 13.75
Prilepskiy Roman | Text Detection on Google Street View Images. | [21] | [B]L+00>>>000 |
Pushnyakov Aleksey | Color Image Segmentation | [22] | [BM+F]L+S+I+C+UDT+P+R+W+S | 16.25
Ryskina Mariya | Topic Modeling Using PLSA algorithm | [23] | [BM+F]L-S+I+CUDT+PR+W+S | 15.25
Stenin Sergey | Detection of Topically Similar Abstracts of Scientific Conference | [24] | [B]L+S+I+CUD |
Urzhumtsev Oleg | Similar Conferences Abstract Search | [25] | [BM+F]L-S-IC>D>>R--WS | 10.25
Feyzkhanov Rustem | Email Filter Generation | [26] | [BM+F-]LS-IC--U->(D-T)>>PR |
Shuyskiy Nikolay | Melody Recognition using Spectral Analysis | [27] | [B]0S-0>>>>> |
Yashkov Daniil | Face Detection Using Viola-Jones | [28] | [M+F]L-S-IC->>>UDTP |

Schedule

Date | Result | To discuss | Code
September 18 | Select a problem, an advisor. | machinelearning.ru record. | -
September 25 | Collect literature, write comments. | Bibliography list, mini-report. | Literature
October 2 | Problem statement (synthetic data). Write mathematical statement in TeX-format. | ~1 page of text (problem statement) | Statement
October 9 | Create report file. Make project description. Describe architecture and main system interfaces (synthetic data). | Description, IDEF0. | Idef
October 16 | Detail interfaces, write a code (first version). | Code (synthetic data). | Code
October 23 | Write Unit tests with a launch module. | Unit tests. | Unit-test
October 30 | Collect real data. Finish IDEF0-schema. Write loading data modules. | Data, second IDEF0-schema, modules. | Data
November 6 | Write and launch system tests. Write a review on a project. | Tests, review. | Tests
November 13 | Optimize the code. | Profiler report before and after. | Profiler
November 20 | Make visualization report. | Finished technical report. | Report
November 27 | Develop web interface. | Code on a site. | Web
December 4 | Make user interface and examples. | Report. | Show

Work and consultations

  1. Finish each work within a week.
  2. It is desirable to submit each work several times before the deadline.
  3. Deadline for the final version: Tuesday, 6:00 am.
  4. Elapsed week time will be added to the report.
  • Each completed work stage: +1 point (A--, A-, A, A+, A++).
  • Each work stage not done: 0.

Home tasks

Literature

  1. Complete section 1.1.2 "Motivation" of SysDocs;
  2. Complete section 1.1.3 "Literature";
  3. Prepare a 40-second oral report on the problem.

Statement

Compose the problem statement (using LaTeX). Here[29] is a "template" of a problem statement.

Here are some examples from the class presentation; it is strongly recommended to review all of them before starting:

[30] [31] [32] [33] [34] [35] [36] [37] [38] [39]

You can also review several articles from the JMLDA journal archive [40].
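
For orientation, a problem statement of the requested kind might read roughly as follows. This is only a minimal sketch for a hypothetical linear regression problem; the notation and the choice of model and criterion are assumptions made for illustration, and it is not the course template [29].

```latex
\paragraph{Problem statement.}
Let $\mathfrak{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ be a sample of $m$ objects,
where $\mathbf{x}_i \in \mathbb{R}^{n}$ is a feature vector and $y_i \in \mathbb{R}$
is the target value. Consider a parametric model
$f(\mathbf{w}, \mathbf{x}) = \mathbf{w}^{\mathsf{T}} \mathbf{x}$ with parameters
$\mathbf{w} \in \mathbb{R}^{n}$. The quality criterion is the squared error
\[
  S(\mathbf{w}) = \sum_{i=1}^{m} \bigl( y_i - f(\mathbf{w}, \mathbf{x}_i) \bigr)^2 .
\]
The problem is to find
\[
  \hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathbb{R}^{n}} S(\mathbf{w}).
\]
```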

Idef

  1. Correct the problem statement if necessary.
  2. Write the abstract according to the plan (section 1.1.1 of SystemDocs).
  3. Design a two-level IDEF0 diagram (sections 1.2.2, 1.2.3 of SystemDocs), preferably separating the learning stage from the final utilization stage.
  4. Describe the general data formats and structures (section 1.4 of SystemDocs).
  5. Describe the module interfaces (section 2 of SystemDocs).

Some useful links that can help:

  1. MATLAB Programming Style Guidelines[41]
  2. IDEF0[42]
  3. Function heading style example[43]
  4. System of notations[44] (file Strijov2013Notation.pdf)

Code

  1. Create launchable source code.
  2. To complete this task you also need to describe all module interfaces (section 2 of SystemDocs) and function headings in more detail.

Unit-test

  1. Create the final version of the core project code: the launchable code should compute the project results in "one click".
  2. Write unit tests for each module, according to the manual.
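
As an illustration of what a unit test for one module could look like, here is a minimal sketch built on plain assert calls. The module under test (standardize, a column-wise feature standardization) is hypothetical, and the layout does not claim to follow the course manual.

```matlab
% Hypothetical module under test: column-wise standardization of a feature matrix.
standardize = @(X) bsxfun(@rdivide, bsxfun(@minus, X, mean(X, 1)), std(X, 0, 1));

% Test 1: known input, known output.
X = [1 10; 3 30];
Z = standardize(X);
expected = [-1 -1; 1 1] / sqrt(2);
assert(max(abs(Z(:) - expected(:))) < 1e-12, 'Unexpected standardized values');

% Test 2: invariants on random data (size preserved, zero mean, unit deviation).
rng(1);
X = randn(100, 4);
Z = standardize(X);
assert(isequal(size(Z), size(X)), 'Output size differs from input size');
assert(all(abs(mean(Z, 1)) < 1e-10), 'Column means are not zero');
assert(all(abs(std(Z, 0, 1) - 1) < 1e-10), 'Column deviations are not one');

disp('All unit tests passed.');
```

A launch module can then simply run one such script per module and stop at the first failed assertion.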

Data

  1. Finish IDEF0: detail the block of user data processing and make the second level of the schema. The second level is devoted to checking the adequacy of the user data, in particular:
    1. the presence of viruses in the uploaded data (do not execute commands from the data, e.g. mpeg),
    2. the uploaded data type,
    3. the uploaded data size,
    4. the admissibility of the expected time complexity of the algorithm (not more than 15 sec),
    5. the admissibility of the memory complexity (not more than 200 Mb),
    6. the adequacy of the input data structure (the algorithm should signal inadequate data).
  2. Gather real data in the folder 'data' to demonstrate how the algorithm performs (and possibly for testing, if the data are not too big). If the data are big, put files with internet links to the real data into 'data'. Alternatively, the link can be located in the data loading module. Describe the data in SystemDocs.
  3. Prepare modules for loading and checking the user data. The module must load one user file.
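
The checks listed above can be sketched as a short loading script. Everything specific here is an assumption made for illustration: the demo file name, the allowed extensions, the 5 MB size limit, and the expected three-column numeric structure are not taken from the course materials.

```matlab
% Write a small demo file so that the sketch is self-contained.
filename = fullfile(tempdir, 'user_upload_demo.csv');
csvwrite(filename, rand(50, 3));

maxBytes = 5 * 2^20;               % assumed upload size limit (5 MB)
allowedExt = {'.csv', '.txt'};     % assumed allowed data types
expectedCols = 3;                  % assumed structure: three numeric columns

report = {};                       % collected error messages
[~, ~, ext] = fileparts(filename);
info = dir(filename);
if isempty(info)
    report{end+1} = 'File not found.';
elseif ~ismember(lower(ext), allowedExt)
    report{end+1} = ['Unsupported file type: ' ext];
elseif info.bytes > maxBytes
    report{end+1} = sprintf('File too large: %d bytes.', info.bytes);
else
    try
        X = csvread(filename);     % parses numbers only; nothing from the file is executed
        if size(X, 2) ~= expectedCols || any(~isfinite(X(:)))
            report{end+1} = 'Unexpected data structure.';
        end
    catch err
        report{end+1} = ['Cannot parse file: ' err.message];
    end
end

isAdequate = isempty(report);      % true when every check passed
```

Collecting messages in report rather than stopping at the first error makes it easy to show the user why a file was rejected.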

Please note:

  1. The main stages of system testing and error analysis.
    1. Check the adequacy of the data.
    2. Check the adequacy of the models (overfitting, complexity, stability, accuracy, etc.).
    3. Check the adequacy of the obtained results. Error analysis (e.g. residual analysis).
    4. Check the adequacy of the system (time complexity, convergence of optimization algorithms, stability of the algorithm on similar data).
  2. Methods of algorithm complexity calculation.
    1. Method 1, theoretical.
      1. Estimate time complexity, e.g. O(n ln n).
      2. Estimate a constant in O().
      3. Estimate time required for the user file processing.
    2. Method 2, technical.
      1. Measure the algorithm's running time on samples of different sizes.
      2. Plot elapsed time against sample size.
      3. Fit a regression of elapsed time on sample size.
      4. Estimate the time required to process the user file.
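
Method 2 can be sketched as follows. The timed operation (sort), the sample sizes, the user file size, and the power-law form of the regression are assumptions chosen for illustration; only the 15-second limit comes from the Data section.

```matlab
sizes = [1e5 2e5 4e5 8e5 16e5];    % sample sizes to measure (assumed)
elapsed = zeros(size(sizes));

% Step 1: measure running time on samples of different sizes.
for k = 1:numel(sizes)
    x = rand(sizes(k), 1);
    tic;
    y = sort(x);                   % stand-in for the algorithm under study
    elapsed(k) = toc;
end

% Step 3: fit log(t) = a*log(n) + b, i.e. an assumed power law t = exp(b) * n^a.
coef = polyfit(log(sizes), log(elapsed), 1);
predictTime = @(n) exp(polyval(coef, log(n)));

% Step 2: plot the measurements together with the fitted curve.
figure('Visible', 'off');
loglog(sizes, elapsed, 'o', sizes, predictTime(sizes), '-');
xlabel('Sample size'); ylabel('Elapsed time, s');

% Step 4: extrapolate to the size of the user file (assumed 5e6 rows).
userSize = 5e6;
estimated = predictTime(userSize);
withinLimit = estimated <= 15;     % 15-second limit from the Data section
fprintf('Estimated time for %d rows: %.3f s\n', userSize, estimated);
```

Timings from single runs are noisy, so in practice each size is better measured several times and averaged before fitting.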

Tests

  1. Write a review following the plan below and put it into a file named like YourSurname2013ReviewSurname.
  2. Prepare a 1-minute speech.
  3. Create system tests: test data sets and a module (script) for launching them. Put a reference to this module in section 5.2 of the SystemDocs file.

Review plan:

  1. Briefly: what is the main topic, what do you consider most important in this project, how does its aim compare with similar projects, and how can the results be applied (is it relevant? important?).
  2. Project strengths (what positively surprised you?) and weaknesses (what should be considered in more detail).
  3. Project details: clarity of the project description in SystemDocs and ProblemStatement; code readability, interface usability, test coverage.
  4. Conclusion.

Profiler

Using the built-in MATLAB profiler, optimize the bottlenecks in your code. Report the results in section 5.3 of the SystemDocs file (using profiler reports and comments on the improvements).

Bottlenecks are code fragments that unexpectedly turn out to be time-expensive during the experiment. You should show that the source code was improved by replacing loops with matrix operations and that the code is efficient enough. If necessary, include the most significant lines from the profiler reports (usually the first 10-15 lines), either by copy-pasting lines from the HTML report generated by the profiler or by using the profiler's export utilities (several examples are provided in the MATLAB manual).
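
Replacing loops with matrix operations can be illustrated with a minimal sketch. The task (pairwise squared Euclidean distances on synthetic data) and all variable names are assumptions chosen for illustration, not taken from any of the projects.

```matlab
rng(0);
X = rand(300, 20);                 % 300 objects, 20 features (synthetic)
n = size(X, 1);

profile on                         % start the built-in profiler

% Loop version: pairwise squared Euclidean distances.
tic;
Dloop = zeros(n);
for i = 1:n
    for j = 1:n
        d = X(i, :) - X(j, :);
        Dloop(i, j) = d * d';
    end
end
tLoop = toc;

% Matrix version: ||xi||^2 + ||xj||^2 - 2 * xi' * xj.
tic;
sq = sum(X.^2, 2);
Dvec = bsxfun(@plus, sq, sq') - 2 * (X * X');
tVec = toc;

profile off
stats = profile('info');           % stats.FunctionTable holds per-function timings

fprintf('loop: %.4f s, matrix: %.4f s, max difference: %.2e\n', ...
    tLoop, tVec, max(abs(Dloop(:) - Dvec(:))));
```

Comparing the two results numerically, as in the last line, shows that the optimized version still computes the same quantity.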

It is recommended to parallelize the execution of your algorithm where possible. One of the easiest ways to parallelize your program is the parfor construct, which is just a "parallel for". See the documentation ("doc parfor") for examples.

Example:

>> matlabpool(3)

>> tic; parfor i=1:3, c(:,i) = eig(rand(1000)); end; toc
Elapsed time is 3.712837 seconds.

>> tic; for i=1:3, c(:,i) = eig(rand(1000)); end; toc
Elapsed time is 5.807167 seconds.

Report

Using the results of the system tests and of the computational experiment aimed at error rate analysis, create plots and tables with explanations and put them into section 5.2 of SystemDocs. Please separate the different parts of this report into appropriately named paragraphs.

Required parts of the computational experiment:

  1. Visualization of the procedure of model selection and structural parameters optimization.
  2. Visualization of the resulting model or algorithm, visualization of the applied optimization method, and dependence of the loss function or quality criterion on the level of inserted noise or on other factors.
  3. Visualization of the obtained error rate in the "web" section (also a plot or table).
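
The dependence of a quality criterion on the level of inserted noise can be plotted along the following lines. This is a minimal sketch: the linear model, the synthetic data, and the choice of RMSE against the noise-free target are assumptions made for illustration.

```matlab
rng(0);
n = 200;
x = linspace(0, 1, n)';
X = [ones(n, 1), x];               % design matrix of a linear model (assumed)
wTrue = [1; 2];                    % parameters used to generate synthetic data

noiseLevels = [0 0.05 0.1 0.2 0.4 0.8];
rmse = zeros(size(noiseLevels));
for k = 1:numel(noiseLevels)
    y = X * wTrue + noiseLevels(k) * randn(n, 1);    % insert noise of a given level
    w = X \ y;                                       % least-squares estimate
    rmse(k) = sqrt(mean((X * w - X * wTrue).^2));    % error against the noise-free target
end

figure('Visible', 'off');
plot(noiseLevels, rmse, 'o-');
xlabel('Noise standard deviation');
ylabel('RMSE against noise-free target');
title('Quality criterion vs. inserted noise');
```

For a report, the same loop is usually repeated over several random draws so that the plot can show a mean and a spread instead of a single noisy curve.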

Web

The folder "web" should contain the following mandatory files:

  1. File "config.json" (the name and extension must be exactly these). Fill in this file using the example in the folder "Group074/Kuznetsov2013SSAForecasting/web/".
  2. File "main.m" with one input argument and one output variable: html = main(filename), where filename is a text string containing the file name and html is a text string containing the visual "web" report in HTML format.
  3. File "test.csv" (you can use another extension). This file should contain a test object (text, time series, image, sound, video, etc.) for forecasting.
  4. Other files required by the function "main" (in particular, a file with the parameters and structural parameters of the forecasting model/algorithm).

For testing purposes it is strongly recommended to run the function writeHTML. It calls "main('test.csv')" and saves the result into "out.html". This file should contain either the "web" report on the forecasting results or an error message describing a problem with forecasting (the types of errors were considered in the data loading section).
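
A file "main.m" with the required signature might be sketched as below. Only the signature html = main(filename) comes from the requirements above; the numeric CSV input, the mean-value "forecast" (a placeholder, not a real model), and the HTML layout are assumptions made for illustration.

```matlab
function html = main(filename)
% MAIN  Hypothetical sketch of the web entry point: html = main(filename).
% Returns an HTML report as a string, or an HTML error message on failure.
try
    series = csvread(filename);    % assumes a numeric CSV time series
    if isempty(series) || any(isnan(series(:)))
        error('main:badData', 'Input file is empty or contains NaN values.');
    end
    forecast = mean(series(:));    % placeholder model, not a real forecaster
    html = sprintf(['<html><body><h1>Forecast report</h1>' ...
        '<p>Input: %s, %d values.</p>' ...
        '<p>Forecast: %.4f</p></body></html>'], ...
        filename, numel(series), forecast);
catch err
    % Any loading or forecasting problem becomes a readable error report.
    html = sprintf('<html><body><p>Error: %s</p></body></html>', err.message);
end
end
```

Calling html = main('test.csv') and writing the returned string to "out.html" with fopen, fprintf, and fclose mimics what writeHTML is described as doing.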
