25.6.17.txt
descriptive
numerical
mean and median
categorical
mode and frequency table
mean can be shown on a histogram
median on a scatter plot (a box plot shows it more directly)
mode shown in a bar chart
frequency table in a pie chart
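A minimal sketch of these summaries in base R, using made-up toy vectors (x and colour are hypothetical). Note that base R's mode() returns the storage type, not the statistical mode, so the mode is taken from the frequency table here.

```r
# Toy data, invented for illustration only
x <- c(2, 4, 4, 5, 10)                              # numerical
colour <- c("red", "blue", "red", "green", "red")   # categorical

# numerical: mean and median
x_mean <- mean(x)
x_median <- median(x)

# categorical: frequency table and mode (most frequent level)
freq <- table(colour)
mode_value <- names(freq)[which.max(freq)]

print(x_mean)
print(x_median)
print(freq)
print(mode_value)
```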
Missing Values
if the % of rows with missing values is very small, drop them
if it's numerical, replace the missing value with the mean
if it's categorical, replace with the mode : if there are too many missing values (MV) this may cause a bias, so create a new category instead
replacing with a random number : the numbers should be valid (within the observed range) because regression is affected by outliers
predict the missing values
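The first three options above can be sketched in base R as follows. The data frame, its column names and the "Unknown" label are all hypothetical, chosen only to show the idea.

```r
# Toy data with missing values (hypothetical columns)
df <- data.frame(
  age  = c(25, NA, 35, 40),
  city = c("Pune", "Delhi", NA, "Pune"),
  stringsAsFactors = FALSE
)

# percent of missing values per column
pct_missing <- colMeans(is.na(df)) * 100

# option 1: drop rows that contain any missing value
df_dropped <- na.omit(df)

# option 2: numerical column, replace NA with the mean
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# option 3a: categorical column, replace NA with the mode
mode_city <- names(which.max(table(df$city)))
df$city_mode <- ifelse(is.na(df$city), mode_city, df$city)

# option 3b: categorical column, put NA in a new category
df$city_new <- ifelse(is.na(df$city), "Unknown", df$city)

print(pct_missing)
print(df)
```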
STEPS
Reading data from csv, tsv, sas, txt
Schema
Cleaning
Descriptive - outliers
Missing values (MV)
READING
read.csv(filename)
data.table package : use fread
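A small sketch of both readers. The file is a temporary CSV written on the spot so the example runs anywhere; its contents are invented. fread is only called if the data.table package happens to be installed.

```r
# Write a tiny CSV to a temp file so there is something to read
path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,90", "2,85", "3,70"), path)

# base R reader
df <- read.csv(path)
str(df)   # shows the schema: column names and types

# data.table reader, usually faster on large files (optional package)
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(path)
  print(dt)
}
```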
CLEANING
gsub : replace
regexpr : like indexOf (position of the first match)
substr : extract part of a string
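A short sketch of the three functions on made-up strings (the price and email values are hypothetical):

```r
# gsub: replace every match of a pattern
raw <- c("$1,200", "$350", "$4,075")
clean <- as.numeric(gsub("[$,]", "", raw))   # strip "$" and ","

# regexpr: position of the first match (-1 if there is none)
email <- "user@example.com"
pos <- as.integer(regexpr("@", email, fixed = TRUE))

# substr: characters from start to stop
user <- substr(email, 1, pos - 1)

print(clean)
print(pos)
print(user)
```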
DESCRIPTIVE
summary(data)
gives the five-number summary (min, Q1, median, Q3, max) plus the mean for numerical columns
factors : ordered or nominal
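A minimal sketch of summary() and of the two kinds of factor. The values and level names are invented for illustration.

```r
# numerical column: summary() reports min, quartiles, median, mean, max
income <- c(10, 20, 30, 40, 500)
s <- summary(income)
print(s)

# nominal factor: levels have no order
colour <- factor(c("red", "blue", "red"))

# ordered factor: levels have a meaningful order
size <- factor(c("S", "M", "L", "M", "S"),
               levels = c("S", "M", "L"), ordered = TRUE)

print(is.ordered(colour))
print(is.ordered(size))
print(size[1] < size[3])   # comparisons only work on ordered factors
```

In this toy vector the mean sits far above the median because of the single large value, which is the kind of outlier the summary helps spot.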
DATA STRUCTURES
vectors
matrix
hist()
plot()
barplot()
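A rough sketch of a vector, a matrix and the three plotting calls, on made-up numbers. The pdf(NULL) and dev.off() lines are only there so the script runs without opening a window or writing a file; in an interactive session they can be dropped.

```r
# vector: one-dimensional, all elements of the same type
v <- c(3, 1, 4, 1, 5, 9)

# matrix: two-dimensional, filled column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
print(dim(m))
print(m[2, 3])   # row 2, column 3

pdf(NULL)            # null graphics device, nothing is written to disk
hist(v)              # distribution of a numerical vector
plot(v)              # scatter plot of the values against their index
barplot(table(v))    # bar chart of the frequency table
invisible(dev.off())
```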
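R^2 = 1 - SSE/SST, where SSE = squared differences of actual from the best fit and SST = squared differences of actual from the mean
variance explained by the model = SST - SSE (difference from mean - difference from best fit) : OLS picks the best fit that minimises SSE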
in general, the normal distribution assumption does not hold true
critical assumptions
no correlation between the independent variables (no multicollinearity)
no autocorrelation in the errors
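One rough way to eyeball these two assumptions in base R is sketched below. The data are simulated, the variable names are hypothetical, and looking at cor() and the lag-1 value from acf() is only a quick check, not a formal test.

```r
set.seed(1)

# simulated data: two predictors and a response (hypothetical)
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 1 + 2 * x1 - x2 + rnorm(50)

fit <- lm(y ~ x1 + x2)

# correlation between the predictors: values near +1 or -1 are a warning sign
pred_cor <- cor(x1, x2)

# autocorrelation of the residuals at lag 1: values near 0 are what we want
res_acf <- acf(resid(fit), plot = FALSE)
lag1 <- res_acf$acf[2]

print(pred_cor)
print(lag1)
```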
total variation (SST) = variation of actual values around the mean
error variation (SSE) = variation of actual values around the best fit
model variation = SST - SSE
R^2 = (SST - SSE)/SST
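A small sketch that computes SST, SSE and R^2 by hand and compares the result with what lm() reports. The x and y values are made up.

```r
# toy data, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)   # OLS best-fit line

sst <- sum((y - mean(y))^2)        # actual vs mean
sse <- sum((y - fitted(fit))^2)    # actual vs best fit

r2_a <- 1 - sse / sst
r2_b <- (sst - sse) / sst          # same quantity, written the other way

print(c(r2_a, r2_b, summary(fit)$r.squared))
```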
NULL HYPOTHESIS