GitHub - majdialali/web_crawler: In this big-sized project, I build a web crawler that is able to download websites and capture sensitive data from them. I used the BeutifulSoup library of Python. In addition, I implemented some NLP technics on the scraped data plus some statistics.

majdialali / web_crawler Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

In this big-sized project, I build a web crawler that is able to download websites and capture sensitive data from them. I used the BeutifulSoup library of Python. In addition, I implemented some NLP technics on the scraped data plus some statistics.

GPL-2.0 license

0 stars 0 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
LICENSE		LICENSE
README.txt		README.txt
raport .pdf		raport .pdf
web_crawler.ipynb		web_crawler.ipynb
web_crawler.py		web_crawler.py

Repository files navigation

ACIT4420 - “Problem-solving with scripting”, master subject at OsloMet – Fall Semester 2021

Final project submission and evaluation

For their final assessment of ACIT4420, students must develop a project and submit both the code
base and a project report on December 17th 2021 before 12.00 (noon), via Inspera. The submission
must be anonymous and contain the candidate number as the only identifying information (no name
or student number should be included). The project report must be submitted as a single pdf file. The
code base should be submitted as a single zip file. All files necessary to run and test the project should
be included in that zip file.
Students must work on their projects individually. They must choose one of the project options
presented in this document. The code for the project must be written in Python. Each project option
contains a number of minimum features which must be implemented, but the students are expected
to go beyond these minimum features and extend the project with their own ideas.
The project report must be written in English and must be between 6000 and 12000 words long (not
counting the code base, which is not part of the report). The report counts for 100% of the grade in
this course. The report should contain: an introduction, an overview of available solutions, a simplified
description of the design chosen by the student, detailed explanations of the implementation, analysis
and testing of the project with examples, a final evaluation. The report will be graded according to:
general impression (10%), overview and alternatives (10%), design (10%), implementation details
(40%), testing and examples (10%), analysis (10%), format (10%).

Project Option 1 - Website Crawler
Create a Python project that is able to download websites and capture sensitive data from them. The
program should accept the following inputs from the user:
• The initial URL to start the crawling.
• The depth of the crawling (how many links to subpages and to subpages of those to
take into account).
• User-defined regular expressions for sensitive data.
Minimum features:
• Download the website from the provided URL, identify links inside the source code
and download all subpages linked there. Repeat until the crawling depth is reached.
• Identify email addresses and phone numbers and create a list of the captured values.
• Identify comments inside the source code and make a list of them, indicating the file
names and line numbers where they appear.
• Identify special data using the regular expressions provided by the user and create
lists with them.
• Create a list of the most common words used on the crawled websites.