-
Notifications
You must be signed in to change notification settings - Fork 0
/
Prosper Loans EDA.rmd
495 lines (378 loc) · 22 KB
/
Prosper Loans EDA.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
Exploring Prosper Loan Data by Abdullah Muhammed
========================================================
## Abstract
[Prosper](https://www.prosper.com/) is a peer-to-peer lending platform that aims to connect people who need money with those who have money to invest. In this Exploratory Data Analysis, I explore a Prosper dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages that you end up using in your analysis in this code
# chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.
# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
library(ggplot2)
library(knitr)
library(ggthemes)
library(gridExtra)
library(dplyr)
library(tidyr)
library(scales)
theme_set(theme_economist())
```
### Exploring and Cleaning Variables
In this analysis we want to answer the following questions:
Univariate analysis
- What are the characteristics of the loan (loan amount, term, etc.)?
- Who uses Prosper and where?
- Why people use Prosper?
# Load the Data
```{r echo=FALSE, Load_the_Data}
prosper <- read.csv('prosperLoanData.csv')
summary(prosper)
str(prosper)
```
# Univariate Plots Section
# Why people use Prosper?
```{r}
labels <- c("Debt Consolidation", "Home Improvement", "Business", "Personal Loan", "Student Use", "Auto", "Baby & Adoption", "Boat", "Cosmetic Procedure", "Engagement Ring", "Green Loans", "Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle", "RV", "Taxes", "Vacation", "Wedding", "Other", "Not Applicable")
prosper$ListingCategory <- factor(prosper$ListingCategory..numeric.,
levels = c(1:6, 8:20, 7, 0),
labels = labels)
#Number of Loans by Category
ggplot(prosper, aes(x=ListingCategory)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90))+
ggtitle('Number of Loans by Category') +
xlab('Category') +
ylab('Number of Loans')
```
Majority loans are debt consolidations, with small but significant amounts of home improvement and business loans.It is really strange that people took loans to reimburse loans/debts and lenders also issued loans to these people.
### How much do people borrow?
```{r echo=FALSE, Univariate_Plots}
# Number of Loans by Loan Amount
qplot(x = LoanOriginalAmount, data = prosper,
binwidth = 1000, colour = I("#424242"), fill = I("#0077C0")) +
scale_x_continuous(breaks = seq(0, 35000, 5000))
```
Most loans lies below $15,000, which indicates that most of the borrowers using Prosper are looking for small(er) loans and spikes in the count on round values like $10,000, $15,000, and even $20,000 and $25,000.
```{r}
summary(prosper$LoanOriginalAmount)
```
The minimum loan is 1000, with the median of 6500 and mean of 8337.
### How much do borrowers earn(Income Range)?
```{r}
qplot(IncomeRange, data = prosper) +
theme(axis.text.x = element_text(angle = -90))+
ggtitle('How much do borrowers earn?')
```
Most of borrowers Income ranges from $25,000 - $74,999.
Question: Why does there is borrowers with "Zero"" income?
Maybe exploring more information about those borrowers would make it clearer that why they can get the loan.
```{r}
# Histogram Showing the distribution of Borrower Rate
ggplot(aes(x=BorrowerRate),data = prosper)+
geom_histogram(binwidth = 0.01, fill='#5760AB')+
scale_x_continuous(limits = c(0.0, 0.4), breaks = seq(0.0, 0.4, 0.1))+
ggtitle('Distribution of Borrower Rate')
```
```{r}
summary(prosper$BorrowerRate)
```
The Borrower Rate seems quite uniformly distributed with peak around 0.16. Most people borrow between 10%-20% while mean being at 19.28%.There's spike occurs, around 0.3.
```{r}
# Formatting dates
prosper$ListingCreationDate <- prosper$ListingCreationDate %>%
as.Date(format = '%Y-%m-%d')
prosper$LoanOriginationDate <- prosper$LoanOriginationDate %>%
as.Date(format = '%Y-%m-%d')
range(prosper$ListingCreationDate)
```
```{r}
# Histogram Showing the Growth of Number of Loans
ggplot(aes(x = LoanOriginationDate), data = prosper) +
geom_histogram(binwidth = 30, color = 'black', fill = '#099DD9') +
ggtitle('Number of Loans by Date') +
xlab('Date') +
ylab('Number of Loans')
```
There was a dip in loans offered in 2009 corresponding to their [Prosper SEC relaunch](https://en.wikipedia.org/wiki/Prosper_Marketplace#2009_post_SEC_relaunch), but there's been a growth of loan offerings ever since. The dip in 2014 reflects the fact that the data was collected up until 03/11/2014
```{r}
prosper$ProsperRating <-
factor(prosper$ProsperRating..Alpha,
levels = c('AA', 'A', 'B', 'C', 'D', 'E', 'HR', 'NA'))
# Histogram of Amount Loaned by Propser Rating
ggplot(prosper, aes(ProsperRating, LoanOriginalAmount)) +
geom_bar(stat='identity', aes(fill = ProsperRating)) +
ggtitle('Amount Loaned by Prosper Rating') +
xlab('Prosper Rating') +
ylab('Amount Loaned')
```
The plot looked approximately normally distrubuted, except for "NA" Not Applicable.'C' is the most frequent rating in our data and the highest (AA) and the lowest (HR) rating are less common comparing with other ratings in between.Null values because Prosper Rating only applicable for loans originated after July 2009.
```{r}
# Boxplots of loan amounts by Prosper Rating
ggplot(subset(prosper, !is.na(ProsperRating)),
aes(ProsperRating, LoanOriginalAmount)) +
geom_boxplot(aes(color = ProsperRating))+
ggtitle('LoanOriginalAmount for ProsperRating')
```
These boxplots shows with higher prosper rating you get higher loan amounts, and less than 'C' you chance decrease.
What most
```{r}
#Bar chart show distribution among occupation
ggplot(aes(x = reorder(Occupation, Occupation,
function(x) -length(x)),
fill = I("#0077C0")),
data = prosper) +
geom_bar() +
xlab("Occupation") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle('occupation of borrowers')
```
'Other' and 'Professional' are ambiguous options for occupations and they are away from the rest of the occupations include analysts, accountants, computer programmers, teachers and executives.
```{r}
#Bar chart show distribution among States
ggplot(aes(x = reorder(BorrowerState, BorrowerState,
function(x) length(x)),
fill = I("#0077C0")),data = prosper) +
geom_bar() +
xlab("Borrower State") +
ggtitle('Distribution of Borrower\'s among States') +
theme(axis.text.x = element_text(angle = 90))
```
California is by far the biggest, and that was expected because it is the state where Prosper was founded followed by other popular states include Florida, New York, Texas and Illinois.
```{r}
ggplot(aes(x = Term), data = prosper) +
geom_bar( binwidth =2)+
ggtitle('Number of Loans by Term') +
xlab('Term') +
ylab('Number of Loans')
```
```{r}
summary(prosper$Term)
```
Most loan-terms are 36-month, with some 60-month, and a negligible amount of 12-month terms.Interestingly there are no records with 24 and 48 months tenure.
# Univariate Analysis
### What is the structure of your dataset?
There are 113937 loans in this dataset and 81 dataset (I only use 11 among them)
Other observations:
- Most loans are taken for debt consolidation. Individuals in all income categories take loans for debt consolidation and the amount varies largely
- Individuals earning between $25k and $50k constitute the largest group of borrowers. Those with over $100k income borrow more
- Most of loans term was 36 months and there was not even single loan in 24 and 48 months tenure.
### What is/are the main feature(s) of interest in your dataset?
The main features of the dataset is the Prosper rating.
### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?
Loans' Category, loan amount, borrower rate (or interest rate),borrower income, borrower employment status.
### Did you create any new variables from existing variables in the dataset?
I adjust "ProsperRating" based on ProsperRating..Alpha.
Formatting dates "ListingCreationDate" and "LoanOriginationDate" instead of using them as factor.
Convert ListingCategory from numeric to factor variable using the keys given in the Google Spreadsheet
### Of the features you investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the form \
of the data? If so, why did you do this?
Most of the distributions are positively skewed and there's also peopole who makes 0 dolars monthly salary.
Interestingly there are no Term with 24 and 48 months tenure.
The biggest reason people take loan is to pay them back. Second and Third largest category are not even categories in themselves i.e. Not Available and Other.
I formatted two variable (i.e. ProsperRating and ListingCategory) names to be more concise and meaningful. Created factor levels for ListingCategory variables for better understanding and data manipulation.
# Bivariate Plots Section
```{r echo=FALSE, Bivariate_Plots}
# Loan amount vs number of investors
ggplot(prosper, aes(LoanOriginalAmount, Investors)) +
scale_x_continuous(breaks = seq(0, 35000, 5000)) +
geom_point(alpha = 1/50,
position = 'jitter')+
ggtitle('Loan amount vs number of investors')
```
The number of investors per loan seem to be heavily positive skewed.Spikes on Loan amount 5000,1000,2000 and 2500 indicates most borrowers want fixed and relatively small loans.
```{r}
# create box plot of Home Owner and Loan Amount
ggplot(aes(y = LoanOriginalAmount, x = IsBorrowerHomeowner), data = prosper) +
geom_boxplot() +
scale_y_continuous(
limits = c(0, quantile(prosper$LoanOriginalAmount, 0.99, na.rm = TRUE)))+
ggtitle('Home Owner and Loan Amount')
```
Borrower who home owner had receive bigger loan than other one who not own the home.
```{r echo=FALSE}
rating <- c("HR", "E", "D", "C", "B", "A", "AA")
ggplot(aes(x = ProsperRating, y = CreditScoreRangeUpper,
fill = I("#F5F5F5"), color = I("#506E86")),
data = prosper) +
geom_boxplot(lwd = 0.75, outlier.color = I("#FBB448"), outlier.shape = 1) +
scale_x_discrete(limits = rating) +
coord_cartesian(ylim = c(600, 950))+
ggtitle('Credit Scorer and Prosper Rating')
```
High Credit Score Upper is assign to excellent rating (AA), and surprise is Poor prosper rating "HR" have high credit score that rating E.
```{r}
ggplot(aes(x = ProsperRating, y = BorrowerAPR * 100,
fill = I("#F5F5F5"), color = I("#506E86")),
data = subset(prosper, !is.na(BorrowerAPR))) +
geom_boxplot(lwd = 0.75, outlier.colour = I("#FBB448"), outlier.shape = 1) +
scale_x_discrete(limits = rating)+
ggtitle('Borrower APR and Prosper Rating') +
ylab("Borrower APR")
```
The boxplots above show the relationship between borrower's Prosper rating and their assigned Annual Percentage Rate (APR).The variation in APRs also decreases as the loans get less riskier as displayed by the decreasing size of the boxes in the boxplots when going from 'E' to 'AA'.
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. How did the feature(s) of interest vary with other features in \
the dataset?
For the borrower, BorrowerRate depend on ProsperRating, IsBorrowerHomeowner and Loan Amount,
### Did you observe any interesting relationships between the other features \
(not the main feature(s) of interest)?
I also observe when the borrower who are home owner receive bigger loan than other one who not own the home.
### What was the strongest relationship you found?
I found the strongest relationshipbetween ProsperRating and BorrowerAPR.
# Multivariate Plots Section
```{r echo=FALSE, Multivariate_Plots}
ggplot(aes(x = DelinquenciesLast7Years, y = AmountDelinquent),
data = filter(prosper, AmountDelinquent > 0 &
EmploymentStatus != "Other" )) +
geom_point() +
xlim(0, quantile(prosper$DelinquenciesLast7Years, 0.99, na.rm = T)) +
ylim(0, quantile(prosper$AmountDelinquent, 0.99, na.rm = T)) +
facet_wrap(~EmploymentStatus)+
ggtitle('Relationship between Delinquencies and Employment Status')
```
I wanted to see the relationship between the amount borrowers were delinquent and the number of delinquencies they've had over the last 7 years then separated that by employment status to see if people that weren't full employed had higher delinquencies or owed more money.Overplotting and general dispersion of data doesn't really reveal the trend
```{r}
# draw the histogram to get a better understanding of DIT(Debt to Income) Ratio
ggplot(aes(x=DebtToIncomeRatio), data=subset(prosper,DebtToIncomeRatio<=1)) +
geom_density(aes(color = IncomeRange,
fill = IncomeRange),
alpha = 0.5) +
xlim(0, quantile(prosper$DebtToIncomeRatio, 0.95, na.rm = T)) +
ggtitle('Borrowers APR to Income Range') +
facet_wrap(~IncomeRange, ncol=2,
scales = "free")
```
```{r}
quantile(prosper$DebtToIncomeRatio, 0.99, na.rm=T)
```
After limiting the ratio to less or equal to 1, we include more than 99% of the data but get a much finer graphic. The distribution has a bell shape and is left-skewed.Income ranges from $25,000 - $74,999 are normaly distributed.
```{r}
# A scatter plot of borrower's APR and the debt to income ratio of the borrower
ggplot(aes(x = DebtToIncomeRatio, y = BorrowerAPR,
color =as.integer(ProsperRating), group = ProsperRating),
data = subset(prosper, !is.na(BorrowerAPR) & ProsperRating %in% rating)) +
geom_jitter() +
xlim(0, quantile(prosper$DebtToIncomeRatio, 0.995, na.rm = T)) +
scale_color_gradient2(high = "#FA4659", midpoint = 4, mid = "#FBB448",
low = "#2EB872",
breaks = seq(1, 7, 1),
labels = rev(rating),
limits = c(1, 7),
name = 'Category of Risk') +
ggtitle('Borrowers APR to Debt To Income Ratio ') +
theme(legend.position = "right",
legend.direction = "vertical",
legend.key.size = unit(1, "cm"))
```
A scatter plot of borrower's APR and the debt to income ratio of the borrower, with the colors describing the risk category given to the particular loan.It displays the progression from a safe green to a risky red.Removing outliers that spread the graph by limited the x-axis by removing 0.05% of the points furthest away from the median.When APR increases as the rating gets riskier.Most people tend to have debt-to-income ratios below 1, regardless of risk category
```{r}
# create term of loan by factor the Term
prosper$TermInMonth <- as.factor(prosper$Term)
ggplot(data = subset(prosper,ProsperRating %in% rating),
aes(x = LenderYield,
y = Investors,
colour =as.integer(ProsperRating), group = ProsperRating
)) +
geom_jitter(alpha = 0.5,
position = 'jitter') +
facet_wrap(~TermInMonth, ncol = 3) +
scale_color_gradient2(name = 'Prosper Rating',
breaks = seq(1, 7, 1),
labels = rating,
limits = c(1, 7)) +
xlab("Lender yield") +
ylab("Number of Investor") +
ggtitle("Lender yield by Term and Categories")+
theme(legend.position = "right",
legend.direction = "vertical",
legend.key.size = unit(1, "cm"))
```
This plot shows the relationship between a lender yield on the loan and the number of investors.And I made individual graphs to show that relationsip based on the duration of the loan and finally coloured it based on risk rating.Most data are in 36 and left-skewed.Increase in prosper rating show increasing in how much lender yield.
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. Were there features that strengthened each other in terms of \
looking at your feature(s) of interest?
Monthly income had positive and late payments has negative correlation with Rating.
Increase in prosper rating show increasing in how much lender yield, however most loans in 36 months.
### Were there any interesting or surprising interactions between features?
High amount of loans should generate high risk, since it's risky if someone doesn't pay the the loans. But here the rating doesn't event consider the loan amount. Even the opposite, low loan amount could generate higher risk.
### OPTIONAL: Did you create any models with your dataset? Discuss the \
strengths and limitations of your model.
> No,I did not.
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, Plot_One}
ggplot(aes(x=ListingCategory, fill = TermInMonth ), data = prosper) +
geom_bar() +
scale_y_continuous(trans = log10_trans()) +
guides(fill = guide_legend(title = "Term In Month"))+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90))+
ggtitle('Number of Loans by Category') +
xlab('Category') +
ylab('Number of Loans')
```
I have chosen this plot because of it's combination of detail and simplicity. It makes for an easy way to evaluate why people borrow money and period of pay this loan.Scale transformed in log scale so we can clearly see distibuation of number of loans in Category of taking loans.Except for "Other" and "Not Applicable" majority loans are debt consolidations then home Improvments then business.people took loans to reimburse loans/debts thay took prevoiusly.
Three years term is most popular choice for almost all categories of loans.
### Description One
### Plot Two
```{r echo=FALSE, Plot_Two}
#BorrowerAPR : The Borrower's Annual Percentage Rate (APR) for the loan
ggplot(aes(x = ProsperRating, y = BorrowerAPR * 100,
fill = I("#F5F5F5"), color = I("#506E86")),
data = subset(prosper, !is.na(BorrowerAPR))) +
geom_boxplot(lwd = 0.75, outlier.colour = I("#FBB448"), outlier.shape = 1) +
scale_x_discrete(limits = rating) +
ggtitle('Borrower APR and ProsperRating') +
ylab("Borrower APR")
```
### Description Two
The boxplots above show the relationship between borrower's Prosper rating and their assigned Annual Percentage Rate (APR).It's very clear that as we go down the ladder of risk - from a 'High Risk' to an 'AA' rating - the APR for the borrower reduces drastically.The variation in APRs also decreases as the loans get less riskier as displayed by the decreasing size of the boxes in the boxplots when going from 'E' to 'AA'. There is also a reduction in the number of outliers.
This indeed proves that the lenders like to charge less from borrowers with better prosper score.
### Plot Three
```{r echo=FALSE, Plot_Three}
ggplot(aes(x = DebtToIncomeRatio, y = BorrowerAPR,
color =as.integer(ProsperRating), group = ProsperRating),
data = subset(prosper, !is.na(BorrowerAPR) & ProsperRating %in% rating)) +
geom_jitter() +
geom_point(size = 1.1) +
xlim(0, quantile(prosper$DebtToIncomeRatio, 0.995, na.rm = T)) +
scale_color_gradient2(high = "#FA4659", midpoint = 4, mid = "#FBB448",
low = "#2EB872",
breaks = seq(1, 7, 1),
labels = rev(rating),
limits = c(1, 7),
name = 'Category of Risk') +
theme(legend.position = "right",
legend.direction = "vertical",
legend.key.size = unit(1, "cm"))+
ggtitle('Borrower APR and Debt To Income Ratio')
```
### Description Three
This is a great plot with a lot of information.A scatter plot of borrower's APR and the debt to income ratio of the borrower, with the colors describing the risk category given to the particular loan.
It displays the progression from a safe green to a risky red.
Removing outliers that spread the graph by limited the x-axis by removing 0.05% of the points furthest away from the median.When APR increases as the rating gets riskier and most people tend to have debt-to-income ratios below 1, regardless of risk category.Also, there is this unusual horizontal line in the 'HR' category that extends past 1 and all the way to
------
# Reflection
The Prosper Loan data set contains information on 113937 loan records.We have been able to analyze that Prosper's portfolio has really changed between 2005 and 2014.I started by going through the Prosper website to understand the concept and details of some of the fields in the data set. Then I started understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots.
Following are some of the interesting features which I observed during the exploratory analysis:
* There are 3 loan terms 12, 36 and 60 months. The most popular loan term is 36 months
* Borrower who home owner had receive bigger loan than other one who not own the home.
* High Credit Score Upper is assign to excellent rating (AA), and surprise is Poor prosper rating "HR" have high credit * score that rating E.
* Most of borrowers Income ranges from $25,000 - $74,999.
* Most loans are taken for debt consolidation. Individuals in all income categories take loans for debt consolidation and the amount varies largely.
* The Borrower rate of Interest and Lender Yield is low for higher Credit Grades and high for lower Credit Grades.
* The median gradually increases from higher Credit Grade (AA) to lower Credit Grade (HR).
Although some ideas for future exploration can improve this analysis substantially.
I really think regression model can be improved further using Gradient Decend to better approximate the slope and the intercept of the line.