7022DATSCI Big Data Analysis

7022DATSCI—Mini-projects
Master of Sensors Data and Management
Big Data Analysis

Instructions:

  • You should work on the mini-projects in groups of up to 5 students.
  • Use electronic communication for organising your group work. Support will be provided online via
  • Together with your group, prepare a PowerPoint presentation of your project with 10 minutes of recorded audio. All group members will receive the same mark for the presentation.
  • In addition, everyone should hand in a one-page summary of the project.
  • In the week after the deadline (week commencing Monday, 27th April 2020) each group should meet with me via videoconference and explain the code that they have produced for the project (“code walkthrough”). Each student will receive an individual mark for the code demonstration.

Project: Big Data Analysis

The aim of the Big Data Analysis project is to apply a machine learning method in a practical setting. In each of the following projects you are asked to...

  1. Work on a practical machine learning project.
  2. Present your work in a presentation.

You will work on your projects in groups of 3-5 students. The following list contains suggestions for project topics. Additional topics might become available and you can also suggest alternative topics:

  • “3, 6, 8, 9?”—recognising hand-written digits with principal component analysis

Apply principal component analysis to recognise handwritten digits in the MNIST data set (http://yann.lecun.com/exdb/mnist/), as explained in (Lu, 2017) but without the pre-processing using Histograms of Oriented Gradients (HOG).
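As a starting point, here is a minimal sketch of one common way to classify with PCA: fit a separate low-dimensional PCA subspace to the training images of each digit, then assign a test image to the digit whose subspace reconstructs it with the smallest error. This is an assumption about the general approach and may differ in detail from the method in (Lu, 2017). The function names `fit_class_subspaces` and `predict` are hypothetical, and loading MNIST into an array `X` (one flattened image per row) and a label vector `y` is left out.

```python
import numpy as np


def fit_class_subspaces(X, y, n_components):
    """Fit one PCA subspace (mean and leading principal directions) per class."""
    models = {}
    for label in np.unique(y):
        X_class = X[y == label]
        mean = X_class.mean(axis=0)
        # The rows of Vt are the principal directions of the centred class data,
        # ordered by decreasing singular value.
        _, _, Vt = np.linalg.svd(X_class - mean, full_matrices=False)
        models[label] = (mean, Vt[:n_components])
    return models


def predict(models, X):
    """Assign each row of X to the class with the smallest reconstruction error."""
    labels = list(models)
    errors = np.empty((X.shape[0], len(labels)))
    for j, label in enumerate(labels):
        mean, components = models[label]
        centred = X - mean
        # Project onto the class subspace and map back to the original space.
        reconstruction = centred @ components.T @ components
        errors[:, j] = np.linalg.norm(centred - reconstruction, axis=1)
    return np.array(labels)[errors.argmin(axis=1)]
```

The number of components per class is a tuning parameter; comparing the test accuracy for several values is one way to choose it.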

  • Googling food webs—the PageRank of extinction

Implement the variant of the PageRank algorithm described in (Allesina and Pascual, 2009) and reproduce the study for some of the food webs from this article. Note that some of these food webs are available in R through the cheddar package.
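The sketch below illustrates the general idea of an eigenvector-based importance ranking for a food web. It rests on two assumptions that you should check against (Allesina and Pascual, 2009) before relying on it: the web is made irreducible by adding a root node (the root feeds every species without resources, and every species feeds the root) instead of using a damping factor, and importance flows from consumers to their resources. The function name `food_web_importance` and the input format are hypothetical.

```python
import numpy as np


def food_web_importance(n_species, feeding_links):
    """Rank species by the leading eigenvector of a column-normalised food web.

    feeding_links is an iterable of (resource, consumer) index pairs.
    """
    root = n_species  # index of the extra root node (assumption, see text)
    A = np.zeros((n_species + 1, n_species + 1))
    for resource, consumer in feeding_links:
        A[resource, consumer] = 1.0

    # Assumption: species without any resource are primary producers
    # and receive their input from the root.
    basal = A[:n_species, :n_species].sum(axis=0) == 0
    A[root, np.flatnonzero(basal)] = 1.0
    # Assumption: every species feeds back into the root.
    A[:n_species, root] = 1.0

    # Each consumer splits its importance equally among its resources.
    Q = A / A.sum(axis=0)

    # Q is column-stochastic, so its leading eigenvalue is 1.
    eigenvalues, eigenvectors = np.linalg.eig(Q)
    leading = np.argmax(eigenvalues.real)
    importance = np.abs(eigenvectors[:, leading].real)[:n_species]
    return importance / importance.sum()
```

For a three-species chain (plant, herbivore, carnivore), `food_web_importance(3, [(0, 1), (1, 2)])` ranks the plant highest, because every other species depends on it directly or indirectly.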

  • MCMC for code cracking

A highly original application of Markov chain Monte Carlo (MCMC) was presented by (Diaconis, 2009) and extended by (Chen and Rosenthal, 2012). Implement and test the approach by reproducing the example described in (Diaconis, 2009).
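The following sketch shows the Metropolis algorithm for a substitution cipher in the spirit of (Diaconis, 2009): a candidate key is scored by the product of first-order transition probabilities of the decoded text, a new key is proposed by swapping the values assigned to two symbols, and the proposal is accepted with probability min(1, ratio of scores). Several details are assumptions made for illustration: the 27-symbol alphabet, the add-one smoothing, working on the log scale, and keeping track of the best key seen. All function names are hypothetical.

```python
import math
import random
import string
from collections import Counter

ALPHABET = string.ascii_lowercase + " "  # assumed symbol set


def bigram_log_probs(reference_text):
    """Log transition probabilities between symbols, with add-one smoothing."""
    text = [c for c in reference_text.lower() if c in ALPHABET]
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    size = len(ALPHABET)
    return {
        (a, b): math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + size))
        for a in ALPHABET
        for b in ALPHABET
    }


def log_plausibility(text, key, log_probs):
    """Sum of log transition probabilities of the text decoded with key."""
    decoded = [key[c] for c in text]
    return sum(log_probs[pair] for pair in zip(decoded, decoded[1:]))


def metropolis_decrypt(ciphertext, log_probs, n_steps, seed=0):
    """Search for a decoding key; ciphertext may only contain ALPHABET symbols."""
    rng = random.Random(seed)
    key = dict(zip(ALPHABET, ALPHABET))  # start from the identity key
    current = log_plausibility(ciphertext, key, log_probs)
    best_key, best = dict(key), current
    for _ in range(n_steps):
        # Propose a new key by swapping the decodings of two symbols.
        a, b = rng.sample(ALPHABET, 2)
        key[a], key[b] = key[b], key[a]
        proposed = log_plausibility(ciphertext, key, log_probs)
        # Metropolis rule: always accept improvements, otherwise accept
        # with probability exp(proposed - current).
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed
            if current > best:
                best_key, best = dict(key), current
        else:
            key[a], key[b] = key[b], key[a]  # reject: undo the swap
    return "".join(best_key[c] for c in ciphertext), best
```

The quality of the decoding depends strongly on the length of the cipher text and on the reference text used to estimate the transition probabilities, so expect to experiment with both as well as with the number of steps.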

References

Allesina, S., Pascual, M., 2009. Googling food webs: Can an eigenvector measure species’ importance for coextinctions? PLOS Computational Biology 5 (9), 1–6.

URL https://doi.org/10.1371/journal.pcbi.1000494

Chen, J., Rosenthal, J., 2012. Decrypting classical cipher text using Markov chain Monte Carlo. Statistics and Computing 22, 397–413.

URL https://doi.org/10.1007/s11222-011-9232-5

Diaconis, P., 2009. The Markov Chain Monte Carlo Revolution. Bulletin of the American Mathematical Society 46 (2), 179–205.

Lu, W., 2017. Handwritten digits recognition using PCA of histogram of oriented gradient. In: 2017 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM). pp. 1–5.

What you should hand in

  1. Each group: A PowerPoint presentation with 10 minutes of recorded audio (25%).
  2. Every student: A one-page summary of your mini-project (25%).
  3. Every student: A text file containing your commented source code (50%).

Important! All group members will receive the same mark for the PowerPoint presentation; the one-page summary and the code demonstration will be marked individually.

Presentation/One-page summary

Partial marks:

Introduction (5%)
  • Brief description of your application
  • Motivation: Which challenge are you going to address?

Implementation (15%)
  • What are the challenges of implementing the algorithm?
  • Explain how you implemented the method.

Results (10%)
  • What have you found out about your data set?
  • Show how your machine learning method addresses the challenge described in the Introduction.

Discussion (10%)
  • Brief summary of the analysis of the data
  • Critically reflect on how well the challenge described in the Introduction was solved by your machine learning approach.

Formal marks (10%)
  • Visual presentation
  • Delivery of the talk
  • Time keeping

Total: 50%

Source code (submitted to Canvas and demonstration)

Partial marks:

  • Completeness of the implementation: 20%
  • Demonstration: 10%
  • Clarity of the code: 10%
  • Quality of comments: 10%

Total: 50%