+This was my synthesis project, completed during the final year of my Bachelor's in Economics at UQAM. The goal of the project was to synthesize economics concepts. I chose a microeconomics project related to health, using a dataset provided by our professor and econometric (statistical analysis) techniques. The report is in French, but I have translated the conclusion and summary below. The graphs and diagrams in the report give an idea of the whole analysis.
+Project done during my Financial Econometrics class. Here I showcase the Stata code and the PDF report (which can be downloaded). The project is written in French and will eventually be translated.
+## Stata code
+Below is the code I wrote in Stata (the software used for the statistical analysis).
+``` C++
+global root ="C:\Users\Maricarmen\Desktop\TRAVAIL SESSION 8620\travail de session\DATA\01-do-file"
+global raw = "C:\Users\Maricarmen\Desktop\TRAVAIL SESSION 8620\travail de session\DATA\02 raw file"
+global work = "C:\Users\Maricarmen\Desktop\TRAVAIL SESSION 8620\travail de session\DATA\03-work"
+use "$raw\ppeco8620MaricarmenArenas.dta" , clear
+// time series (before building my tables and regressions, I had to convert my data to time series)
+/*gen temps2 = date(temps, "MDY")
+ // dummy variables for each decade
+format temps2 %td
+tsset temps2
+gen d1990=0
+gen d2000=0
+gen d2010=0
+replace d1990=1 if tin(01mar1990,31dec1999)
+replace d2000=1 if tin(01jan2000,31dec2009)
+replace d2010=1 if tin(01jan2010,31mar2016)*/
+//scatter snp3_6 temps2 if d2000 , connect(2) clwidth(medthick) clcolor(black) clpattern(dot) || scatter corrmed6 temps2 if tin(01jan2000,01dec2009) , connect(2) clwidth(medthick) clcolor(black) clpattern(dot) mps2 if tin(01jan2000,01dec2009) , connect(2) clwidth(medthick) clcolor(black) clpattern(dot)
+// graphs/figures: S&P 500 and correlations over time
+ scatter SP500 temps2 if d1990, connect(2) clwidth(medthick) clcolor(black) clpattern(dot) c(l) yaxis(1)||scatter corrmed18 temps2 if d1990 , connect(2) clwidth(medthick) clcolor(black) clpattern(dot) c(l) yaxis(2) title("Décénnies 1990: corrélations vs rendements S&P500 ")
+ scatter SP500 temps2 if d2000 , connect(2) clwidth(medthick) clcolor(black) clpattern(dot) c(l) yaxis(1) || scatter corrmed18 temps2 if d2000 , connect(2) clwidth(medthick) clcolor(black) clpattern(dot) c(l) yaxis(2) title("Décénnies 2000: corrélations vs rendements S&P500 ")
+ scatter SP500 temps2 if d2010, connect(2) clwidth(medthick) clcolor(black) clpattern(dot) c(l) yaxis(1) || scatter corrmed18 temps2 if d2010, connect(2) clwidth(medthick) clcolor(black) clpattern(dot) c(l) yaxis(2) title("Décénnies 2010: corrélations vs rendements S&P500 ")
+// table of means, medians, standard deviations, and quantiles
+tabstat snp3_6 snp6_6 snp9_6 snp12_6 snp3_12 snp6_12 snp9_12 snp12_12 snp3_18 snp6_18 snp9_18 snp12_18 corrmed6 corrmed12 corrmed18 corpbondyield GDP_growth unemp_rate, stats(mean p1 p5 p10 p25 p50 p75 p90 p95 p99 sd) columns(statistics)
+// simple regressions
+// 12 regressions
+reg snp3_6 corrmed6 GDP_growth unemp_rate corpbondyield, vce(robust)
+outreg2 using Tp.xls, replace ctitle(Regression)
+reg snp6_6 corrmed6 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+reg snp9_6 corrmed6 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+reg snp12_6 corrmed6 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+reg snp3_12 corrmed12 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+reg snp6_12 corrmed12 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+reg snp9_12 corrmed12 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+reg snp12_12 corrmed12 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+reg snp3_18 corrmed18 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+reg snp6_18 corrmed18 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+reg snp9_18 corrmed18 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+reg snp12_18 corrmed18 corpbondyield GDP_growth unemp_rate , vce(robust)
+outreg2 using Tp.xls, append ctitle(Regression)
+// 36 regressions
+// regressions by decade (dummy variables), starting with the 1990s
+reg snp3_6 corrmed6 unemp_rate corpbondyield GDP_growth if d1990, vce(robust)
+outreg2 using ppMaricarmen.xls, replace ctitle(Regression)
+reg snp6_6 corrmed6 corpbondyield GDP_growth unemp_rate if d1990, vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp9_6 corrmed6 corpbondyield GDP_growth unemp_rate if d1990 , vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp12_6 corrmed6 corpbondyield GDP_growth unemp_rate if d1990, vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp3_12 corrmed12 corpbondyield GDP_growth unemp_rate if d1990, vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp6_12 corrmed12 corpbondyield GDP_growth unemp_rate if d1990, vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp9_12 corrmed12 corpbondyield GDP_growth unemp_rate if d1990, vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp12_12 corrmed12 corpbondyield GDP_growth unemp_rate if d1990, vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp3_18 corrmed18 corpbondyield GDP_growth unemp_rate if d1990, vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp6_18 corrmed18 corpbondyield GDP_growth unemp_rate if d1990 , vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp9_18 corrmed18 corpbondyield GDP_growth unemp_rate if d1990, vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp12_18 corrmed18 corpbondyield GDP_growth unemp_rate if d1990, vce(robust)
+outreg2 using ppMaricarmen.xls, append ctitle(Regression)
+reg snp3_6 corrmed6 unemp_rate corpbondyield if d2000, vce(robust)
+outreg2 using ppMaricarmen2.xls, replace ctitle(Regression)
+reg snp6_6 corrmed6 corpbondyield GDP_growth unemp_rate if d2000, vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp9_6 corrmed6 corpbondyield GDP_growth unemp_rate if d2000 , vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp12_6 corrmed6 corpbondyield GDP_growth unemp_rate if d2000, vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp3_12 corrmed12 corpbondyield GDP_growth unemp_rate if d2000, vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp6_12 corrmed12 corpbondyield GDP_growth unemp_rate if d2000, vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp9_12 corrmed12 corpbondyield GDP_growth unemp_rate if d2000, vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp12_12 corrmed12 corpbondyield GDP_growth unemp_rate if d2000, vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp3_18 corrmed18 corpbondyield GDP_growth unemp_rate if d2000, vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp6_18 corrmed18 corpbondyield GDP_growth unemp_rate if d2000 , vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp9_18 corrmed18 corpbondyield GDP_growth unemp_rate if d2000, vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp12_18 corrmed18 corpbondyield GDP_growth unemp_rate if d2000, vce(robust)
+outreg2 using ppMaricarmen2.xls, append ctitle(Regression)
+reg snp3_6 corrmed6 unemp_rate corpbondyield GDP_growth if d2010, vce(robust)
+outreg2 using ppMaricarmen3.xls, replace ctitle(Regression)
+reg snp6_6 corrmed6 corpbondyield unemp_rate GDP_growth if d2010, vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+reg snp9_6 corrmed6 corpbondyield unemp_rate GDP_growth if d2010 , vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+reg snp12_6 corrmed6 corpbondyield unemp_rate GDP_growth if d2010, vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+reg snp3_12 corrmed12 corpbondyield unemp_rate GDP_growth if d2010, vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+reg snp6_12 corrmed12 corpbondyield unemp_rate GDP_growth if d2010, vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+reg snp9_12 corrmed12 corpbondyield unemp_rate GDP_growth if d2010, vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+reg snp12_12 corrmed12 corpbondyield unemp_rate GDP_growth if d2010, vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+reg snp3_18 corrmed18 corpbondyield unemp_rate GDP_growth if d2010, vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+reg snp6_18 corrmed18 corpbondyield unemp_rate GDP_growth if d2010 , vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+reg snp9_18 corrmed18 corpbondyield unemp_rate GDP_growth if d2010, vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+reg snp12_18 corrmed18 corpbondyield unemp_rate GDP_growth if d2010, vce(robust)
+outreg2 using ppMaricarmen3.xls, append ctitle(Regression)
+// correlations for GMM
+pwcorr snp12_6 snp6_6 snp3_6 snp9_6 corrmed6
+pwcorr x1 x2 x3 x4 x5 x6 corrmed6
+pwcorr x1 x2 x3 x4 x5 x6 snp3_6 snp6_6 snp9_6 snp12_6
+pwcorr snp3_12 snp6_12 snp9_12 snp12_12 corrmed12
+pwcorr x7 x8 x9 x10 x11 x12 corrmed12
+pwcorr x7 x8 x9 x10 x11 x12 snp3_12 snp6_12 snp9_12 snp12_12
+pwcorr snp3_18 snp6_18 snp9_18 snp12_18 corrmed18
+pwcorr x13 x14 x15 x16 x17 x18 corrmed18
+pwcorr x13 x14 x15 x16 x17 x18 snp3_18 snp6_18 snp9_18 snp12_18
+// 12 regressions
+sca b0 = 1
+gmm (snp3_6 - {xb:corrmed6 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x1 x2 x3 x4 x5 x6) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp6_6 - {xb:corrmed6 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x1 x2 x3 x4 x5 x6) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp9_6 - {xb:corrmed6 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x1 x2 x3 x4 x5 x6) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp12_6 - {xb:corrmed6 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x1 x2 x3 x4 x5 x6) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp3_12 - {xb:corrmed12 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x7 x8 x9 x10 x11 x12) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp6_12 - {xb:corrmed12 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x7 x8 x9 x10 x11 x12) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp9_12 - {xb:corrmed12 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x7 x8 x9 x10 x11 x12) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp12_12 - {xb:corrmed12 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x7 x8 x9 x10 x11 x12) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp3_18 - {xb:corrmed18 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x13 x14 x15 x16 x17 x18) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp6_18 - {xb:corrmed18 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x13 x14 x15 x16 x17 x18) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp9_18 - {xb:corrmed18 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x13 x14 x15 x16 x17 x18) twostep vce(unadjusted)
+sca b0 = 1
+gmm (snp12_18 - {xb:corrmed18 unemp_rate corpbondyield GDP_growth} - {b0} ), instruments (x13 x14 x15 x16 x17 x18) twostep vce(unadjusted)
+return list
+ereturn list
+mat resultat = r(table)
+sca b1 = resultat[1,1]
+dis b1
\ No newline at end of file
+While completing my degree at UQAM (2013-2016), I took a directed readings class on behavioural economics applied to retirement saving policies. This was an independent class, meaning that I took the initiative to reach out to a professor and I was the only student working on the project. The report is in French, but I have included the translated introduction below.
+layout: post
+title: "SQL Exercises"
+date: 2023-07-20 13:32:20 +0300
+description: Sample SQL exercises done during one of my Business Intelligence classes. These are complex queries! # Add post description (optional)
+img: sql.png # Add image post (optional)
+fig-caption: Sample SQL exercises done during one of my Business Intelligence classes. These are complex queries!
+tags: [sql]
+# SQL exercises
+Below are some exercises I worked on for my 'Business Intelligence Techniques' class. Every exercise took some time to figure out, and I decided they would make a good work sample to showcase what I can accomplish. During this class I also learned relational database and SQL concepts. The class was in French, so the questions were written in French; I am working on their translations (started on 20 July 2023).
+``` sql
+/*
+ * TECH 60701 -- Technologies de l'intelligence d'affaires
+ * HEC Montréal
+ */
+ use AdventureWorks2019
+ go
+/*
+ Question #1:
+ AdventureWorks would like to implement "the vitality model" of former General Electric Chairman and CEO
+ Jack Welch, which has been described as a "20-70-10" system. The "most important 20%" of employees are the most productive,
+ and 70% (the "indispensable 70%") work well. The remaining 10% are non-producers and must be let go.
+ Using a ranking clause and a subquery, you need to write a query to identify the "top 20%" of salespeople
+ (to congratulate and encourage them!) as well as the bottom 10% (to kick them out!).
+ So we don't want the salespeople belonging to the remaining 70% to appear in the report.
+ (By salespeople, we're referring to sales clerks, regardless of job title.)
+ Since AdventureWorks sells mainly bicycles, spring (March to May, incl.) is crucial to its financial results.
+ Therefore, the analysis should only take into account the subtotal sales that salespeople have achieved for this
+ period, whatever the year. For each result, the following should be displayed:
+ - Salesperson ID
+ - Seller's National ID Number
+ - Seller's first name
+ - Seller's surname
+ - The seller's subtotal sales for the fourth quarter (formatted in dollars, i.e. $xxx.xx)
+ - The seller's percentage rank (formatted as a percentage with two points of precision)
+ - The decile to which the seller belongs
+ - Personalized message for: 1st decile 'Excellent performance!'; 2nd decile 'Keep up the good work!';
+ 10th decile 'Are you looking for a job elsewhere?'
+*/
+ select *,
+ Case
+ when Decile =1 then 'Excellente performance !'
+ when Decile =2 then 'Continuez, vous allez bien !'
+ when Decile =10 then 'Cherchez vous un emploi ailleurs !'
+ else 'Whatever'
+ end as 'Status'
+ from (
+ select
+ soh.SalesPersonID,
+ e.NationalIDNumber,
+ p.FirstName,
+ p.LastName,
+ FORMAT(SUM(soh.SubTotal), 'c', 'en-us') as 'Somme sous-total Ventes',
+ /*Le rang en pourcentage du vendeur (formaté en pourcentage avec deux points de précision)*/
+ FORMAT(ROUND(percent_rank() over(order by sum(soh.SubTotal) desc), 2), 'p') as 'Le rang en %',
+ NTILE(10) over(order by SUM(soh.SubTotal) desc) as 'Decile'
+ from
+ Sales.SalesOrderHeader soh
+ inner join Sales.SalesPerson sp1 on soh.SalesPersonID = sp1.BusinessEntityID
+ inner join Person.Person p on sp1.BusinessEntityID = p.BusinessEntityID
+ inner join HumanResources.Employee e on p.BusinessEntityID= e.BusinessEntityID
+ where Month(OrderDate) in (3,4,5)
+ group by soh.SalesPersonID, e.NationalIDNumber, p.FirstName, p.LastName
+ ) as Table2
+ where Decile in (1,2,10);
+/*
+ Question #2:
+ AdventureWorks would like to explore its customers' purchases of accessories (non-manufactured products). We are particularly interested in accessories that
+ were ordered by stores located in Canada at the same time as they made bicycle purchases (products manufactured by AdventureWorks).
+ Therefore, data should be displayed only for sales made to stores (not individual customers) who purchased bicycles.
+ Using a CTE, you should display a list containing information grouped by product identifier, product name, and
+ product number.
+ Your report should contain only four columns, as follows:
+ ProductID | Name                       | ProductNumber | OrderCount | Rang
+ 715       | Long-Sleeve Logo Jersey, L | LJ-0192-L     | 238        | 1
+ 712       | AWC Logo Cap               | CA-1098       | 237        | 2
+ 708       | Sport-100 Helmet, Black    | HL-U509       | 190        | 3
+ ...       | ...                        | ...           | ...        | ...
+ This indicates, for example, that of all the orders placed by stores in which manufactured products were purchased, 238
+ orders also included the purchase of product 715 (Long-Sleeve Logo Jersey, L), 237 orders included the purchase of product 712 (AWC Logo Cap),
+ etc. The rank used does not allow value jumps.
+ Sort by "OrderCount", in descending order.
+*/
+ --7385
+with CTEQ2(ProductID, Name, ProductNumber, SalesOrderID, SalesOrderDetailID) as
+(
+ select
+pt.ProductID, pt.Name, pt.ProductNumber,
+soh.SalesOrderID, sod.SalesOrderDetailID
+ from Sales.SalesOrderHeader soh
+ inner join Sales.Customer c on c.CustomerID = soh.CustomerID
+ inner join Person.BusinessEntityAddress bea on c.StoreID = bea.BusinessEntityID
+ inner join person.Address a on a.AddressID = bea.AddressID
+ inner join Sales.SalesOrderDetail sod on soh.SalesOrderID = sod.SalesOrderID
+ inner join Production.Product pt on sod.ProductID = pt.ProductID
+ inner join Person.StateProvince sp on sp.StateProvinceID =a.StateProvinceID
+ where sp.CountryRegionCode = 'CA' AND pt.MakeFlag =1 --AND soh.SalesOrderID='55280'
+ )
+ select
+ p.ProductID,
+ p.Name,
+ p.ProductNumber,
+ COUNT( distinct(CTEQ2.SalesOrderID)) as OrderCount,
+ dense_rank()over(order by COUNT(distinct(CTEQ2.SalesOrderID)) desc) as 'Rang'
+ from CTEQ2
+ inner join Sales.SalesOrderDetail sod on sod.SalesOrderID = CTEQ2.SalesOrderID and sod.SalesOrderDetailID <> CTEQ2.SalesOrderDetailID
+ inner join Production.Product p on p.ProductID = sod.ProductID
+ where p.MakeFlag =0
+ group by p.ProductID, p.[Name], p.ProductNumber;
+/*
+ Question #3 a):
+ You are asked to provide a query displaying the following details of active suppliers, with preferred status,
+ with whom AdventureWorks has placed fewer than 30 orders. Show:
+ - Supplier ID
+ - Order date
+ - A sequence number assigned to each order placed with the supplier, starting with the most recent order
+ - The subtotal of each order (formatted in dollars, i.e. $xxx.xx)
+*/
+select poh.VendorID
+ , poh.OrderDate, row_number()over(partition by poh.VendorID order by OrderDate desc) as 'No_sequence' , FORMAT(poh.SubTotal, 'C')
+ from Purchasing.PurchaseOrderHeader poh
+ inner join Purchasing.Vendor v on v.BusinessEntityID = poh.VendorID
+ where ActiveFlag =1 AND PreferredVendorStatus=1 AND
+ poh.VendorID in (select VendorID from Purchasing.PurchaseOrderHeader group by VendorID having count(PurchaseOrderID) < 30);
+/*
+ Question #3 b):
+ AdventureWorks would like to know which of these preferred suppliers (with whom AdventureWorks has placed fewer than 30 orders)
+ tend to increase their prices. The company would like to use this information to remove their "preferred
+ supplier" status. The assumption here is that orders from a supplier remain stable over time and are therefore
+ always for similar products/quantities.
+ Using a CTE based on the query in Part a), build a query that will display the list of suppliers for which
+ the average amount (using the subtotal) of their three most recent orders is greater than the average amount they have charged
+ AdventureWorks to date.
+ We'd like to display:
+ - Supplier ID
+ - The average amount of all orders placed with the supplier
+ - The average amount of the three most recent orders placed with the supplier
+ - The difference between the average amount of the three most recent orders placed with the supplier and the average amount of
+ all orders placed with the supplier.
+ Your report should contain only these four columns, and be sorted by the reduction in acquisition costs,
+ so that the largest reduction is at the top of the list. All amounts must be formatted in dollars, i.e. $xxx.xx.
+*/
+with CTEQ3b(VendorID, OrderDate, RowNum, SubTotal) as
+(select poh.VendorID
+ , poh.OrderDate,
+ row_number()over(partition by poh.VendorID order by OrderDate desc) ,
+ poh.SubTotal
+ from Purchasing.PurchaseOrderHeader poh
+ inner join Purchasing.Vendor v on v.BusinessEntityID = poh.VendorID
+ where v.ActiveFlag =1 AND v.PreferredVendorStatus=1 AND
+ poh.VendorID in (select VendorID from Purchasing.PurchaseOrderHeader group by VendorID having count(PurchaseOrderID) < 30)
+)
+select c.VendorID,
+(select format(avg(Subtotal),'C') from Purchasing.PurchaseOrderHeader poh where c.VendorID=poh.VendorID) as 'total',
+format(avg(c.SubTotal),'C') as 'total 3' ,
+Format(avg(SubTotal) - (select avg(Subtotal) from Purchasing.PurchaseOrderHeader poh where c.VendorID=poh.VendorID),'C') as 'difference'
+from CTEQ3b c
+where c.RowNum<=3
+group by c.VendorID
+having avg(SubTotal) - (select avg(Subtotal) from Purchasing.PurchaseOrderHeader poh where c.VendorID=poh.VendorID) >0
+order by avg(SubTotal) - (select avg(Subtotal) from Purchasing.PurchaseOrderHeader poh where c.VendorID=poh.VendorID) desc
\ No newline at end of file
+layout: post
+title: "Twitter API V2"
+date: 2023-07-20 13:32:20 +0300
+description: Python code to retrieve tweets with version 2 of the Twitter API. I modified this code so I could retrieve exactly what I needed, and used it to collect the data for my Master's thesis.
+ # Add post description (optional)
+img: twitter-api.jpg # Add image post (optional)
+tags: [twitter, coding, python]
+This application was built in Python with version 2 of the Twitter API.
+I modified existing Python code (taken from the Beyond Data Science website) so I could get more Twitter information. I added an algorithm to retrieve hourly tweets, and a piece of code that writes 3 more files of Twitter information (user info, place info, retweet info) in addition to the main file. Everything is saved to CSV files.
+This produces 4 CSV files containing Twitter information.
+File 'AcademicMain' contains fields: 'author id', 'created_at', 'place_id', 'referenced_id', 'Retweet', 'id', 'conversation_id', 'lang', 'source', 'tweet', 'username_mentioned', 'username_mentioned_id', 'urls_expanded', 'urls', 'tag'
+File 'AcademicUsers' contains fields: 'author id', 'username', 'place_id', 'description', 'name', 'followers_count', 'following_count', 'verified', 'profile_image_url'
+File 'AcademicRetweetInfo' contains fields: 'conversation_id', 'referenced_id', 'place_id', 'text2', 'username_mentioned2', 'username_mentioned_id2', 'url2', 'urls_expanded2', 'tag2'
+File 'AcademicPlaces' contains fields: 'place_id', 'name_country', 'full_name_country', 'name_country', 'country_code', 'place_type'
+After obtaining the files, merge them on their shared keys: join 'AcademicUsers' to 'AcademicMain' on author id, join 'AcademicRetweetInfo' to 'AcademicMain' on referenced id, and join 'AcademicPlaces' to 'AcademicMain' on place id.
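The merge step can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code used for the thesis. It assumes the CSV files have header rows with the field names listed above (the scraper shown here writes rows without headers, so they would need to be added), and the sample values are made up. The helper names `normalize_id` and `left_join` are hypothetical. Note that the scraper prefixes some IDs with an apostrophe (for example the place id in 'AcademicPlaces') but not others (the place id in 'AcademicMain'), so the sketch strips that prefix before matching.

```python
import csv
import io


def normalize_id(value):
    """Strip whitespace and the leading apostrophe the scraper adds to some IDs."""
    return value.strip().lstrip("'")


def left_join(left_rows, right_rows, left_key, right_key):
    """Attach the first matching right-hand row to each left-hand row."""
    index = {}
    for row in right_rows:
        index.setdefault(normalize_id(row[right_key]), row)
    joined = []
    for row in left_rows:
        match = index.get(normalize_id(row[left_key]), {})
        # Only add columns the left-hand row does not already have
        extra = {k: v for k, v in match.items() if k != right_key and k not in row}
        joined.append({**row, **extra})
    return joined


# Tiny in-memory stand-ins for the real files (hypothetical sample values).
main_csv = "author id,place_id,tweet\n'11,abc,hello\n'22, ,world\n"
users_csv = "author id,username\n'11,alice\n'22,bob\n"
places_csv = "place_id,full_name_country\n'abc,\"Montreal, Quebec\"\n"

main = list(csv.DictReader(io.StringIO(main_csv)))
users = list(csv.DictReader(io.StringIO(users_csv)))
places = list(csv.DictReader(io.StringIO(places_csv)))

merged = left_join(main, users, "author id", "author id")
merged = left_join(merged, places, "place_id", "place_id")
for row in merged:
    print(row)
```

The same `left_join` call would apply to 'AcademicRetweetInfo' with 'referenced_id' as the key on both sides. With real files, `io.StringIO(...)` would be replaced by `open(path, newline="", encoding="utf-8")`.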
+``` python
+## For sending GET requests from the API
+import requests
+# For saving access tokens and for file management when creating and adding to the dataset
+import os
+# For dealing with json responses we receive from the API
+import json
+# For displaying the data after
+#import pandas as pd
+# For saving the response data in CSV format
+import csv
+# For parsing the dates received from twitter in readable formats
+import datetime
+import dateutil.parser
+import unicodedata
+#To add wait time between requests
+import time
+from pathlib import Path
+import pytz
+from datetime import date, timedelta
+# Replace the placeholder with your own bearer token; never commit a real token
+os.environ['TOKEN'] = '<your_bearer_token>'
+def listdates(a, b):
+    """Return two lists of ISO timestamps (window start, window end), one pair per day from a to b."""
+    sdate = a  # start date
+    edate = b  # end date
+    delta = edate - sdate  # as timedelta
+    begin_list = []
+    end_list = []
+    for i in range(delta.days + 1):
+        day = sdate + timedelta(days=i)
+        # Each window starts at midnight of the current day
+        begin_time = datetime.datetime(day.year, day.month, day.day, 0)
+        #local_dt = local.localize(begin_time, is_dst=None)
+        #utc_dt = local_dt.astimezone(pytz.utc)
+        begin_list.append(begin_time.isoformat("T") + ".000Z")
+        # Each window ends 12 hours after it starts
+        final_time = begin_time + timedelta(hours=12, minutes=0, seconds=0)
+        end_list.append(final_time.isoformat("T") + ".000Z")
+    return (begin_list, end_list)
+#time_change = datetime.timedelta(hours=10)
+#new_time = date_and_time + time_change
+def auth():
+    return os.getenv('TOKEN')
+def create_headers(bearer_token):
+    headers = {"Authorization": "Bearer {}".format(bearer_token)}
+    return headers
+def create_url(keyword, start_date, end_date, max_results):
+    search_url = "https://api.twitter.com/2/tweets/search/all"  # Change to the endpoint you want to collect data from
+    # change params based on the endpoint you are using
+    query_params = {'query': keyword,
+                    'start_time': start_date,
+                    'end_time': end_date,
+                    'max_results': max_results,
+                    'expansions': 'author_id,in_reply_to_user_id,geo.place_id,referenced_tweets.id,attachments.media_keys',
+                    'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,public_metrics,lang,entities,reply_settings,source',
+                    'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
+                    'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
+                    'next_token': {}}
+    return (search_url, query_params)
+def connect_to_endpoint(url, headers, params, next_token = None):
+    params['next_token'] = next_token  # params object received from create_url function
+    response = requests.request("GET", url, headers = headers, params = params)
+    print("Endpoint Response Code: " + str(response.status_code))
+    if response.status_code != 200:
+        raise Exception(response.status_code, response.text)
+    return response.json()
+def write_json(new_data, filenamejson):
+    # Append the response to the JSON file and close the file when done
+    with open(filenamejson, 'a') as jsonfile:
+        json.dump(new_data, jsonfile, indent=4, sort_keys=True)
+def append_to_csv(json_response, fileName):
+    # A counter variable
+    counter = 0
+    # Open OR create the target CSV file
+    csvFile = open(fileName, "a", newline="", encoding='utf-8')
+    csvWriter = csv.writer(csvFile)
+    # Loop through each tweet
+    for tweet in json_response['data']:
+        # We create a variable for each field, since some of the keys might not exist for some tweets
+        # 1. Author ID (stored with a leading apostrophe)
+        author_id = "'" + tweet['author_id']
+        # 2. Time created
+        created_at = dateutil.parser.parse(tweet['created_at'])
+        # 3. Geolocation
+        if 'geo' in tweet and 'place_id' in tweet['geo']:
+            place_id = tweet['geo']['place_id']
+        else:
+            place_id = " "
+        # Referenced tweet and its type, if any
+        if 'referenced_tweets' in tweet:
+            referenced_id = "'" + tweet['referenced_tweets'][0]['id']
+            Retweet = "'" + tweet['referenced_tweets'][0]['type']
+        else:
+            referenced_id = ' '
+            Retweet = ' '
+        # 4. Tweet ID
+        tweet_id = "'" + tweet['id']
+        conversation_id = "'" + tweet['conversation_id']
+        # 5. Language
+        lang = tweet['lang']
+        # 6. Source
+        source = tweet['source']
+        # 7. Tweet text
+        text = tweet['text']
+        # 8. Entities: mentions, URLs and hashtags (a single space when absent)
+        entities = tweet.get('entities', {})
+        if 'mentions' in entities:
+            username_mentioned = [m['username'] for m in entities['mentions']]
+            username_mentioned_id = [m['id'] for m in entities['mentions']]
+        else:
+            username_mentioned = ' '
+            username_mentioned_id = ' '
+        if 'urls' in entities:
+            urls_expanded = [u['expanded_url'] for u in entities['urls']]
+            urls = [u['url'] for u in entities['urls']]
+        else:
+            urls_expanded = ' '
+            urls = ' '
+        if 'hashtags' in entities:
+            tag = [h['tag'] for h in entities['hashtags']]
+        else:
+            tag = ' '
+        res = [author_id, created_at, place_id, referenced_id, Retweet, tweet_id, conversation_id, lang, source, text,
+               username_mentioned, username_mentioned_id, urls_expanded, urls, tag]
+        csvWriter.writerow(res)
+        counter += 1
+    # When done, close the CSV file
+    csvFile.close()
+    # Print the number of tweets for this iteration
+    print("# of Tweets added from this response: ", counter)
+def append_to_csvUsers(json_response, fileName):
+    # A counter variable
+    counter = 0
+    # Open OR create the target CSV file
+    csvFile = open(fileName, "a", newline="", encoding='utf-8')
+    csvWriter = csv.writer(csvFile)
+    # Loop through each user included in the response
+    if 'users' in json_response['includes']:
+        for user in json_response['includes']['users']:
+            author_id = "'" + user['id']
+            username = user['username']
+            if 'geo' in user and 'place_id' in user['geo']:
+                place_id = user['geo']['place_id']
+            else:
+                place_id = " "
+            name = user['name']
+            description = user['description']
+            followers_count = user['public_metrics']['followers_count']
+            following_count = user['public_metrics']['following_count']
+            verified = user['verified']
+            if 'profile_image_url' in user:
+                profile_image_url = user['profile_image_url']
+            else:
+                profile_image_url = ' '
+            res = [author_id, username, place_id, description, name, followers_count, following_count, verified, profile_image_url]
+            csvWriter.writerow(res)
+            counter += 1
+    # When done, close the CSV file
+    csvFile.close()
+    # Print the number of users for this iteration
+    print("# of users added from this response: ", counter)
+def append_to_csvExtended(json_response, fileName):
+    # A counter variable
+    counter = 0
+    # Open OR create the target CSV file
+    csvFile = open(fileName, "a", newline="", encoding='utf-8')
+    csvWriter = csv.writer(csvFile)
+    # Loop through each referenced tweet included in the response
+    if 'tweets' in json_response['includes']:
+        for tweet in json_response['includes']['tweets']:
+            conversation_id = "'" + tweet['conversation_id']
+            referenced_id = "'" + tweet['id']
+            if 'geo' in tweet and 'place_id' in tweet['geo']:
+                place_id = tweet['geo']['place_id']
+            else:
+                place_id = " "
+            text2 = tweet['text']
+            # Entities: mentions, URLs and hashtags (a single space when absent)
+            entities = tweet.get('entities', {})
+            if 'mentions' in entities:
+                username_mentioned2 = [m['username'] for m in entities['mentions']]
+                username_mentioned_id2 = [m['id'] for m in entities['mentions']]
+            else:
+                username_mentioned2 = ' '
+                username_mentioned_id2 = ' '
+            if 'urls' in entities:
+                urls_expanded2 = [u['expanded_url'] for u in entities['urls']]
+                urls2 = [u['url'] for u in entities['urls']]
+            else:
+                urls_expanded2 = ' '
+                urls2 = ' '
+            if 'hashtags' in entities:
+                tag2 = [h['tag'] for h in entities['hashtags']]
+            else:
+                tag2 = ' '
+            res = [conversation_id, referenced_id, place_id, text2, username_mentioned2, username_mentioned_id2, urls2, urls_expanded2, tag2]
+            csvWriter.writerow(res)
+            counter += 1
+    # When done, close the CSV file
+    csvFile.close()
+    # Print the number of referenced tweets for this iteration
+    print("# of Tweets added from this response: ", counter)
+def append_to_csvPlaces(json_response, fileName):
+    # A counter variable
+    counter = 0
+    # Open OR create the target CSV file
+    csvFile = open(fileName, "a", newline="", encoding='utf-8')
+    csvWriter = csv.writer(csvFile)
+    # Loop through each place attached to the response
+    if 'places' in json_response['includes']:
+        for place in json_response['includes']['places']:
+            place_id = str("'" + place['id'])
+            name_country = place['name']
+            full_name_place = place['full_name']
+            country = place['country']
+            country_code = place['country_code']
+            place_type = place['place_type']
+            # The fourth column holds the country
+            res = [place_id, name_country, full_name_place, country, country_code, place_type]
+            csvWriter.writerow(res)
+            counter += 1
+    # When done, close the CSV file
+    csvFile.close()
+    # Print the number of places for this iteration
+    print("# of Places added from this response: ", counter)
+#Inputs for tweets
+bearer_token = auth()
+headers = create_headers(bearer_token)
+keyword = 'onlyfans -promotion -promote lang:en'
+# '"new comer" "escort" "call girls" OR #callgirl lang:en'
+from datetime import timedelta, date
+ #2020-04-06
+answer = listdates(date(2021, 1, 16) , datetime.datetime.now().date() )
+start_list = answer[0]
+end_list = answer[1]
+max_results = 500
+#Total number of tweets we collected from the loop
+total_tweets = 0
+# Create file
+timestr = time.strftime("%Y%m%d-%H%M%S")
+filename1 = Path("/data") / ('AcademicMain' + timestr + ".csv")
+csvFile = open(filename1, "a", newline="", encoding='utf-8')
+csvWriter = csv.writer(csvFile)
+filename2 = Path("/data") / ('AcademicUsers' + timestr + ".csv")
+csvFile2 = open(filename2, "a", newline="", encoding='utf-8')
+csvWriter2 = csv.writer(csvFile2)
+filename3 = Path("/data") / ('AcademicRetweetInfo' + timestr + ".csv")
+csvFile3 = open(filename3, "a", newline="", encoding='utf-8')
+csvWriter3 = csv.writer(csvFile3)
+filename4 = Path("/data") / ('AcademicPlaces' + timestr + ".csv")
+csvFile4 = open(filename4, "a", newline="", encoding='utf-8')
+csvWriter4 = csv.writer(csvFile4)
+# Create headers for the data you want to save; in this example, we only want to save these columns in our dataset
+csvWriter.writerow(['author id', 'created_at', 'place_id', 'referenced_id', 'Retweet', 'id', 'conversation_id', 'lang',
+                    'source', 'tweet', 'username_mentioned', 'username_mentioned_id', 'urls_expanded', 'urls', 'tag'])
+csvWriter2.writerow(['author id', 'username', 'place_id', 'description', 'name', 'followers_count', 'following_count', 'verified', 'profile_image_url'])
+csvWriter3.writerow(['conversation_id', 'referenced_id', 'place_id', 'text2', 'username_mentioned2', 'username_mentioned_id2', 'url2', 'urls_expanded2', 'tag2'])
+csvWriter4.writerow(['place_id', 'name_country', 'full_name_country', 'country', 'country_code', 'place_type'])
+# Close the header handles so the header rows are flushed before the append_* functions reopen the files
+csvFile.close()
+csvFile2.close()
+csvFile3.close()
+csvFile4.close()
+for i in range(0, len(start_list)):
+    # Inputs
+    count = 0  # Counting tweets per time period
+    max_count = 1500  # Max tweets per time period
+    flag = True
+    next_token = None
+    # Check if flag is true
+    while flag:
+        # Check if max_count reached
+        if count >= max_count:
+            break
+        print("-------------------")
+        print("Token: ", next_token)
+        url = create_url(keyword, start_list[i], end_list[i], max_results)
+        json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
+        result_count = json_response['meta']['result_count']
+        write_json(json_response, "/data/JSON" + timestr + ".json")
+        if 'next_token' in json_response['meta']:
+            # Save the token to use for next call
+            next_token = json_response['meta']['next_token']
+            print("Next Token: ", next_token)
+            if result_count is not None and result_count > 0 and next_token is not None:
+                print("Start Date: ", start_list[i])
+                append_to_csv(json_response, filename1)
+                append_to_csvUsers(json_response, filename2)
+                append_to_csvExtended(json_response, filename3)
+                append_to_csvPlaces(json_response, filename4)
+                count += result_count
+                total_tweets += result_count
+                print("Total # of Tweets added: ", total_tweets)
+                print("-------------------")
+                time.sleep(5)
+        # If no next token exists
+        else:
+            if result_count is not None and result_count > 0:
+                print("-------------------")
+                print("Start Date: ", start_list[i])
+                append_to_csv(json_response, filename1)
+                append_to_csvUsers(json_response, filename2)
+                append_to_csvExtended(json_response, filename3)
+                append_to_csvPlaces(json_response, filename4)
+                count += result_count
+                total_tweets += result_count
+                print("Total # of Tweets added: ", total_tweets)
+                print("-------------------")
+                time.sleep(5)
+            # Since this is the final request, turn flag to false to move to the next time period.
+            flag = False
+            next_token = None
+        time.sleep(5)
+print("Total number of results: ", total_tweets)
\ No newline at end of file
+layout: post
+title: Detecting Coordinated Activities through OnlyFans Tweets using Machine learning
+date: 2023-07-20 13:32:20 +0300
+description: This thesis was conducted during my studies at HEC Montreal. Here I showcase my work as a PDF that you can download. # Add post description (optional)
+img: thesis.png # Add image post (optional)
+tags: [machine learning, ai, statistics]
+This thesis was conducted during my studies at HEC Montreal. Here I showcase my work as a PDF that you can download.
+I decided to pursue a master's degree with a thesis because I was very curious about machine learning research and social issues, and I believed that writing a thesis was the only way to satisfy that curiosity. I think I learned more technical skills doing my thesis than I would have by taking classes and doing a short project. I worked with big data, Python, graphs, Mila servers and unsupervised learning techniques. I also had the opportunity to work on a paper (the publication is available online) alongside a PhD student who specializes in graphs, and we had the honor of presenting our work at WebSci 2023 (WebSci'23).
+## Thesis Abstract
+In this thesis by articles, we present a research paper that we submitted to the WebSci’23 conference and that is now under review. In addition to the article itself, the thesis provides further detail on the motivation, background, literature review and research. The aim of this thesis is to provide a method that can facilitate the work of individuals combating online human trafficking. The majority of trafficking victims report being advertised online, which explains why online sex trafficking has been on the rise in the past few years. Meanwhile, the use of OnlyFans as a platform for adult content has increased exponentially in the past three years, and Twitter has been its main advertising tool. Since we know that traffickers usually work within a network and control multiple victims, we suspect that there may be networks of traffickers promoting multiple OnlyFans accounts belonging to their victims. Based on these observations, we decided to conduct the first study looking at organized activities on Twitter through OnlyFans advertisements. Preliminary analysis of this space shows that most tweets related to OnlyFans contain generic text, making text-based methods less reliable. Instead, focusing on what ties the authors of these tweets together, we propose a novel method for uncovering coordinated networks of users based on their behaviour. Our method, called Multi-Level Clustering (MLC), combines two levels of clustering. In the first level, we detect communities based on username mentions and shared URLs. The second level follows one of two approaches: (i) the Partial Intersections (PI) of URL and mention communities, or (ii) Joint Clustering (JT), which applies a dense subgraph detection algorithm. We additionally show that our JT approach, applied to synthetically generated data (with injected ground truth), outperforms competitive baselines.
Furthermore, we apply MLC to real-world tweets pertaining to OnlyFans, analyse the detected groups, and show that our Partial Intersections approach provides good-quality clusters (high entropy of OnlyFans accounts). Our paper and our thesis end with a discussion in which we show carefully chosen examples of organized clusters and provide several points that support our method.
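To make the second-level idea more concrete, here is a minimal Python sketch of one possible reading of a partial-intersection step. It is an illustration only, not the thesis's actual algorithm: the function name `partial_intersections`, the `min_shared` threshold and the toy communities are all assumptions.

```python
from itertools import product


def partial_intersections(mention_communities, url_communities, min_shared=2):
    """Return groups of users found in both a mention community and a URL community.

    Each argument is a list of sets of user ids. A pair of communities is kept
    when it overlaps in at least `min_shared` users (a hypothetical threshold).
    """
    groups = []
    for m_comm, u_comm in product(mention_communities, url_communities):
        shared = m_comm & u_comm
        if len(shared) >= min_shared:
            groups.append(shared)
    return groups


# Hypothetical first-level output: communities detected separately on each graph.
mention_communities = [{"a", "b", "c", "d"}, {"e", "f"}]
url_communities = [{"a", "b", "x"}, {"c", "e", "f"}]

groups = partial_intersections(mention_communities, url_communities)
print(groups)
```

Users who end up in the same group behave alike on both signals (who they mention and which links they share), which is the intuition behind combining the two community structures.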
+My full thesis is below.
+layout: post
+title: "Baltimore Crime Data Analysis"
+date: 2023-07-20 13:32:20 +0300
+description: Statistical analysis of crimes in Baltimore done on Kaggle. Here I show you a summary and a link to my Kaggle account. I also included a PDF of the Python/Jupyter notebook. # Add post description (optional)
+img: crimeBaltimore.png # Add image post (optional)
+tags: [crime, data analysis, statistical analysis]
+This is a statistical analysis of crimes in Baltimore done on Kaggle. Here I show you a summary and a link to my Kaggle account. I also included a PDF of the Python/Jupyter notebook.
+In this notebook I analyse the "Part1_Crime_Data.csv" dataset taken from Data Baltimore City. This dataset represents the location and characteristics of major (Part 1) crimes against persons, such as homicide, shooting, robbery and aggravated assault, within the City of Baltimore. The data is updated weekly. This is an exploratory analysis.
+The data was last updated on May 17, 2023; the original CSV file contains 565,726 records and 20 columns. Attributes (columns): CCNO, CrimeDateTime, Location, Description, Inside_Outside, Weapon, Post, Gender, Age, Race, Ethnicity, District, Neighborhood, Latitude, Longitude, Geolocation, Premise, Total_incidents,
+Here is the link to my Kaggle analysis:
+## Crimes in Baltimore conclusion analysis
+The Baltimore dataset contains data starting from the 1960s; however, the early entries are sparse (only a few out of a total of half a million). The data becomes more consistent from 2012 onward, but it is incomplete for 2023 (since the year isn't finished). Therefore the analysis covers 2012 to 2022.
+The data shows that specific types of crime are more common regardless of the year, namely larceny, common assault and burglary, while others are less common regardless of the year, namely homicide, rape and arson. Larceny and larceny from auto both show a downward trend. Aggravated assault and homicide seem to follow the same upward trend. Robbery and rape both reached a peak in 2017. Shootings increased sharply from 2012 to 2015 and have risen steadily since.
+Frankford is the neighborhood with the highest crime level, while the district with the highest level of crime is Southeast. However, when we look at the heatmap, no particular neighborhood or district stands out. From the above analysis we find that larceny, common assault and aggravated assault are the three most common crimes around the densest crime location (based on latitude and longitude).
+The average time at which crimes were perpetrated varies depending on the year. The only noticeable pattern is that crimes tend to happen between the afternoon (from 15:00) and midnight.
+I performed a simple regression with the year as the independent variable and the number of crimes per type of crime as the dependent variable. I then predicted the number of crimes for 2023 and compared the results with the 2023 data we already had (doubling the partial-year count to approximate a full year). I concluded that the results are off, and that a deeper analysis should be done if we want to forecast the number of crimes (e.g. using time series methods).
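The regression and the doubling step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper `fit_line` and all the numbers are hypothetical, not the Baltimore figures or the notebook's actual code.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical yearly counts for one crime type (not the real data).
years = [2018, 2019, 2020, 2021, 2022]
counts = [1000, 950, 900, 850, 800]

intercept, slope = fit_line(years, counts)
predicted_2023 = intercept + slope * 2023

# Rough full-year estimate from a partial year, mirroring the doubling step.
partial_2023 = 330  # hypothetical count up to the last data update
estimated_2023 = 2 * partial_2023

print(round(predicted_2023), estimated_2023)
```

Comparing `predicted_2023` with `estimated_2023` is the kind of check described above. A straight line ignores seasonality and trend changes, which is one reason a time series model may be a better fit for forecasting.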
+I also checked whether race, age or gender has an impact on the type of crime by performing a `chi2_contingency` test, and concluded that it does. Further analysis would be needed to see exactly what the differences are.
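For readers unfamiliar with the independence test mentioned above, here is a small pure-Python sketch of Pearson's chi-square statistic on a contingency table. The helper `chi_square_independence` and the toy records are hypothetical. SciPy's `scipy.stats.chi2_contingency` also returns a p-value and, by default, applies a continuity correction on 2x2 tables, so its statistic can differ from this one.

```python
from collections import Counter


def chi_square_independence(pairs):
    """Return (statistic, degrees_of_freedom) for Pearson's chi-square test.

    `pairs` is an iterable of (row_category, column_category) observations.
    """
    pairs = list(pairs)
    observed = Counter(pairs)
    row_totals = Counter(r for r, _ in pairs)
    col_totals = Counter(c for _, c in pairs)
    n = len(pairs)

    statistic = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            # Expected count if the two variables were independent
            expected = r_total * c_total / n
            statistic += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical toy records: (gender, crime description), not the real data.
records = (
    [("M", "LARCENY")] * 30 + [("M", "ROBBERY")] * 20
    + [("F", "LARCENY")] * 40 + [("F", "ROBBERY")] * 10
)

statistic, dof = chi_square_independence(records)
print(f"chi2 = {statistic:.3f}, dof = {dof}")
```

A large statistic relative to the chi-square distribution with `dof` degrees of freedom suggests that the two variables are not independent. The test does not say which categories drive the difference.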
+Below is a PDF version of my Kaggle project.
\ No newline at end of file
+layout: post
+title: Tableau - Prime Amazon Analysis
+date: 2023-07-20 13:32:20 +0300
+description: This will redirect you to my Tableau Dashboard - The dataset is taken from Kaggle # Add post description (optional)
+img: tableau.png # Add image post (optional)
+fig-caption: This will redirect you to my Tableau Dashboard - The dataset is taken from Kaggle
+tags: [Tableau, Data analysis, graphs, Dashboard]
Tableau - Prime Amazon Analysis
+## See link below
+Tableau - Prime Amazon Analysis
