Ict707 Big Data Assignment Answers Assessment Answer

Answer:

Introduction:

Bike-sharing rental activity is highly correlated with environmental and seasonal settings. For instance, weather conditions, precipitation, day of the week, season, and hour of the day can all affect rental behavior (Fanaee-T, 2014). The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA, which is publicly available at System Data | Capital Bikeshare. We aggregated the data on both an hourly and a daily basis and then added the corresponding weather and seasonal information, extracted from https://www.freemeteo.com.

Requirements

The program's requirements are basic and can be met by any ordinary computer. The essential requirements are:

  • Hardware: a desktop or a laptop.
  • Software: Apache Spark with PySpark, plus the Matplotlib (including Pylab) and NumPy Python libraries.
  • Operating system: any platform that supports Python, such as Ubuntu, Windows 10, or macOS.
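
Before running the program, it can help to confirm that the listed libraries are importable. The following is a minimal sketch and not part of the original program: the helper name `missing_modules` is hypothetical, and the module names are assumed to be the usual import names of the libraries in the list above.

```python
def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for the libraries in the requirements list.
required = ["pyspark", "matplotlib", "numpy"]

missing = missing_modules(required)
print("Missing modules: %s" % (", ".join(missing) or "none"))
```

An empty result means every listed library can be imported; otherwise the printed names still need to be installed.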

Methodology

Regression models define the relationship between a dependent variable and one or more independent variables (Kutner, 2004). They are concerned with target variables that can take any real value. The underlying principle is to find a model that maps input features to predicted target values. A few examples of where regression models are used include:

  • Predicting stock returns and other economic variables.
  • Predicting customer lifetime value in a retail, mobile, or other business based on user behavior and spending patterns.
  • Predicting losses from loan defaults, among many others.
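
To make the idea of mapping input features to a predicted target concrete, here is a minimal sketch of a one-variable least-squares fit in plain Python. It is an illustration only: the function name `fit_line` and the temperature and rental numbers are made up for the example and do not come from the bike-sharing data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: temperature (input feature) and rentals (target).
temps = [10.0, 15.0, 20.0, 25.0]
rentals = [30.0, 40.0, 50.0, 60.0]

slope, intercept = fit_line(temps, rentals)
predicted = slope * 30.0 + intercept  # prediction for an unseen input
print("slope=%.2f intercept=%.2f predicted=%.2f" % (slope, intercept, predicted))
```

The fitted line is the "model", and evaluating it at a new input gives the predicted target value.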

In our program, we predict the bike rental count from environmental and seasonal settings using PySpark and Python (Srinivasa, 2015), combining historical rental patterns with weather and seasonal data to forecast rental demand.
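
The program that follows encodes each categorical column as a binary (one-hot) vector before fitting the linear model. A minimal sketch of that encoding step without Spark is given here; `build_mapping` and `one_hot` are hypothetical helper names, and the sample values are invented for illustration.

```python
def build_mapping(values):
    """Map each distinct category to a position, in first-seen order."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def one_hot(value, mapping):
    """Return a list with 1.0 at the category's position and 0.0 elsewhere."""
    vec = [0.0] * len(mapping)
    vec[mapping[value]] = 1.0
    return vec


# Invented sample column; the real program reads its columns from the CSV file.
seasons = ["spring", "summer", "spring", "winter"]
mapping = build_mapping(seasons)
encoded = [one_hot(s, mapping) for s in seasons]
print(encoded[1])
```

The order in which categories receive their positions may differ from the order Spark assigns, but the property that matters is the same: each category gets exactly one slot in the vector.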

Source code

# This program predicts future bike rental demand.

import matplotlib.pyplot  # also binds the `matplotlib` name used by the plotting calls


from pylab import hist

from pyspark.mllib.regression import LabeledPoint

import numpy as np

from pyspark.mllib.regression import LinearRegressionWithSGD, RidgeRegressionWithSGD

from pyspark.mllib.tree import DecisionTree, GradientBoostedTrees

from pyspark import SparkContext

spark = SparkContext('local', 'Assignment Project', '/usr/spark-hadoop')

raw_data = spark.textFile('/home/livinggoods/Projects/Others/regression_modelling/data/hour-noheader.csv')

# Count the rows once; the result is printed below and reused later.

data_count = raw_data.count()

records = raw_data.map(lambda x: x.split(','))

first = records.first()

print first

print data_count

 

records.cache()

def get_mapping(rdd, idx):

    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

 

print "Mapping of first categorical feature column: %s" % get_mapping(records, 2)

 

mappings = [get_mapping(records, i) for i in range(2,10)]

cat_len = sum(map(len, mappings))

num_len = len(records.first()[10:14])  # the same numeric columns that extract_features reads

total_len = num_len + cat_len

 

print "Feature vector length for categorical features: %d" % cat_len

print "Feature vector length for numerical features: %d" % num_len

print "Total feature vector length: %d" % total_len

 

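def extract_features(record):

    # One-hot encode the categorical columns, then append the numeric columns.

    cat_vec = np.zeros(cat_len)

    i = 0

    step = 0

    # Walk columns 2-9, one per mapping built above; `step` is the offset of the current column's block.

    for field in record[2:10]:

        m = mappings[i]

        idx = m[field]

        cat_vec[idx + step] = 1

        i = i + 1

        step = step + len(m)

    num_vec = np.array([float(field) for field in record[10:14]])

    return np.concatenate((cat_vec, num_vec))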

 

def extract_label(record):

    return float(record[-1])

 

data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))

first_point = data.first()

print "Raw data: " + str(first[2:])

 

print "Label: " + str(first_point.label)

 

print "Linear Model feature vector:\n" + str(first_point.features)

def evaluate(train, test, iterations, step, regParam, regType, intercept):

   # Train a linear model on `train` and return its RMSLE on `test`.

   model = LinearRegressionWithSGD.train(train, iterations, step, regParam=regParam, regType=regType, intercept=intercept)

   tp = test.map(lambda p: (p.label, model.predict(p.features)))

   rmsle = np.sqrt(tp.map(lambda (t, p): squared_log_error(t, p)).mean())

   return rmsle

def extract_features_dt(record):

   # Decision trees can use the raw column values directly, so no one-hot encoding is needed.

   return np.array(map(float, record[2:14]))

def squared_error(actual, pred):

   return (pred - actual)**2

def abs_error(actual, pred):

   return np.abs(pred - actual)

def squared_log_error(actual, pred):

   # Adding 1 keeps the logarithm defined when a value is 0.

   return (np.log(pred + 1) - np.log(actual + 1))**2

print "Linear Model feature vector length: " + str(len(first_point.features))

data_dt = records.map(lambda r: LabeledPoint(extract_label(r), extract_features_dt(r)))

first_point_dt = data_dt.first()

print "Decision Tree feature vector: " + str(first_point_dt.features)

print "Decision Tree feature vector length: " + str(len(first_point_dt.features))

linear_model = LinearRegressionWithSGD.train(data, iterations=10, step=0.1, intercept=False)

true_vs_predicted = data.map(lambda p: (p.label, linear_model.predict(p.features)))

print "Linear Model predictions: " + str(true_vs_predicted.take(5))

dt_model = DecisionTree.trainRegressor(data_dt,{})

preds = dt_model.predict(data_dt.map(lambda p: p.features))

actual = data_dt.map(lambda p: p.label)

true_vs_predicted_dt = actual.zip(preds)

print "Decision Tree predictions: " + str(true_vs_predicted_dt.take(5))

print "Decision Tree depth: " + str(dt_model.depth())

print "Decision Tree number of nodes: " + str(dt_model.numNodes())

mse = true_vs_predicted.map(lambda (t, p): squared_error(t,p)).mean()

mae = true_vs_predicted.map(lambda (t, p): abs_error(t, p)).mean()

rmsle = np.sqrt(true_vs_predicted.map(lambda (t, p): squared_log_error(t, p)).mean())

print "Linear Model - Mean Squared Error: %2.4f" % mse

print "Linear Model - Mean Absolute Error: %2.4f" % mae

print "Linear Model - Root Mean Squared Log Error: %2.4f" % rmsle

mse_dt = true_vs_predicted_dt.map(lambda (t, p): squared_error(t,p)).mean()

mae_dt = true_vs_predicted_dt.map(lambda (t, p): abs_error(t, p)).mean()

rmsle_dt = np.sqrt(true_vs_predicted_dt.map(lambda (t, p): squared_log_error(t, p)).mean())

print "Decision Tree - Mean Squared Error: %2.4f" % mse_dt

print "Decision Tree - Mean Absolute Error: %2.4f" % mae_dt

print "Decision Tree - Root Mean Squared Log Error: %2.4f" % rmsle_dt

targets = records.map(lambda r: float(r[-1])).collect()

hist(targets, bins=40, color='lightblue', normed=True)

fig = matplotlib.pyplot.gcf()

fig.set_size_inches(16, 10)

log_targets = records.map(lambda r: np.log(float(r[-1]))).collect()

hist(log_targets, bins=40, color='lightblue', normed=True)

fig = matplotlib.pyplot.gcf()

fig.set_size_inches(16, 10)

sqrt_targets = records.map(lambda r: np.sqrt(float(r[-1]))).collect()

hist(sqrt_targets, bins=40, color='lightblue', normed=True)

fig = matplotlib.pyplot.gcf()

fig.set_size_inches(16, 10)

data_log = data.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))

model_log = LinearRegressionWithSGD.train(data_log, iterations=10, step=0.1)

true_vs_predicted_log = data_log.map(lambda p: (np.exp(p.label), np.exp(model_log.predict(p.features))))

mse_log = true_vs_predicted_log.map(lambda (t, p): squared_error(t, p)).mean()

mae_log = true_vs_predicted_log.map(lambda (t, p): abs_error(t, p)).mean()

rmsle_log = np.sqrt(true_vs_predicted_log.map(lambda (t, p): squared_log_error(t, p)).mean())

print "Mean Squared Error: %2.4f" % mse_log

print "Mean Absolute Error: %2.4f" % mae_log

print "Root Mean Squared Log Error: %2.4f" % rmsle_log

print "Non log-transformed predictions:\n" + str(true_vs_predicted.take(3))

print "Log-transformed predictions:\n" + str(true_vs_predicted_log.take(3))

data_dt_log = data_dt.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))

dt_model_log = DecisionTree.trainRegressor(data_dt_log,{})

preds_log = dt_model_log.predict(data_dt_log.map(lambda p: p.features))

actual_log = data_dt_log.map(lambda p: p.label)

true_vs_predicted_dt_log = actual_log.zip(preds_log).map(lambda (t, p): (np.exp(t), np.exp(p)))

mse_log_dt = true_vs_predicted_dt_log.map(lambda (t, p): squared_error(t, p)).mean()

mae_log_dt = true_vs_predicted_dt_log.map(lambda (t, p): abs_error(t, p)).mean()

rmsle_log_dt = np.sqrt(true_vs_predicted_dt_log.map(lambda (t, p): squared_log_error(t, p)).mean())

print "Mean Squared Error: %2.4f" % mse_log_dt

print "Mean Absolute Error: %2.4f" % mae_log_dt

print "Root Mean Squared Log Error: %2.4f" % rmsle_log_dt

print "Non log-transformed predictions:\n" + str(true_vs_predicted_dt.take(3))

print "Log-transformed predictions:\n" + str(true_vs_predicted_dt_log.take(3))

data_with_idx = data.zipWithIndex().map(lambda (k, v): (v, k))

test = data_with_idx.sample(False, 0.2, 42)

train = data_with_idx.subtractByKey(test)

train_data = train.map(lambda (idx, p): p)

test_data = test.map(lambda (idx, p) : p)

train_size = train_data.count()

test_size = test_data.count()     

print "Training data size: %d" % train_size

print "Test data size: %d" % test_size

print "Total data size: %d " % data_count

print "Train + Test size : %d" % (train_size + test_size)

data_with_idx_dt = data_dt.zipWithIndex().map(lambda (k, v): (v, k))

test_dt = data_with_idx_dt.sample(False, 0.2, 42)

train_dt = data_with_idx_dt.subtractByKey(test_dt)

train_data_dt = train_dt.map(lambda (idx, p): p)

test_data_dt = test_dt.map(lambda (idx, p) : p)

params = [1, 5, 10, 20, 50, 100]

metrics = [evaluate(train_data, test_data, param, 0.01, 0.0, 'l2', False) for param in params]

print params

print metrics

matplotlib.pyplot.plot(params, metrics)

fig = matplotlib.pyplot.gcf()

matplotlib.pyplot.xscale('log')
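
The three error measures used in the program above (mean squared error, mean absolute error, and root mean squared log error) can be checked by hand on a few values. The sketch below uses only the standard library; the helper names and the sample numbers are illustrative assumptions, not results from the bike-sharing data.

```python
import math


def mse(pairs):
    """Mean squared error over (actual, predicted) pairs."""
    return sum((p - a) ** 2 for a, p in pairs) / float(len(pairs))


def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return sum(abs(p - a) for a, p in pairs) / float(len(pairs))


def rmsle(pairs):
    """Root mean squared log error; the +1 keeps the log defined at zero."""
    total = sum((math.log(p + 1) - math.log(a + 1)) ** 2 for a, p in pairs)
    return math.sqrt(total / len(pairs))


# Invented (actual, predicted) pairs.
pairs = [(10.0, 12.0), (100.0, 90.0), (0.0, 1.0)]

print("MSE:   %.4f" % mse(pairs))
print("MAE:   %.4f" % mae(pairs))
print("RMSLE: %.4f" % rmsle(pairs))
```

Because RMSLE compares logarithms, the 10-unit miss on a target of 100 contributes less to it than the 1-unit miss on a target of 0, which is why it suits targets such as rental counts that span a wide range.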

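The final plot produced by the listing (not reproduced here) shows the test-set RMSLE against the number of training iterations on a log-scaled x-axis.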

Conclusion

We can therefore conclude that the program does what it set out to do. A bike rental organization could use it to predict future rental demand, and it could also support event and anomaly detection.

References

Fanaee-T, H. and Gama, J., 2014. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2-3), pp. 113-127.

Kutner, M., Nachtsheim, C. and Neter, J., 2004. Applied Linear Regression Models. McGraw-Hill/Irwin.

Srinivasa, K. and Muppalla, A., 2015. Getting Started with Spark. In: Guide to High Performance Distributed Computing. Springer, pp. 73-99.

Other sources from the internet include:

  • System Data | Capital Bikeshare
  • https://www.freemeteo.com.
