Note: I did not use the NLTK package for these tasks. Instead, I used:
- TextBlob for sentiment analysis
- spaCy for various text processing tasks
- Syllapy for counting syllables in words
🗂 Directories and Files
📝 Cleaned Articles
- cleaned_articles: Contains cleaned articles ready for analysis.
📂 Extracted Articles
- extracted_articles: Holds raw articles extracted for the project.
📚 Master Dictionary
- master_dictionary: Collection of files for sentiment analysis.
  - `cleaned_negative_words.txt`: List of cleaned negative words.
  - `cleaned_positive_words.txt`: List of cleaned positive words.
  - `negative-words.txt`: Raw negative words for sentiment analysis.
  - `positive-words.txt`: Raw positive words for sentiment analysis.
📑 Project Introduction
- project_introduction: Overview and objectives of the project.
🧪 Test Assessment
- test_assessment: Contains test assignments and notebooks.
  - `dataextraction.ipynb`: Jupyter Notebook for data extraction tasks.
  - `testassessment.ipynb`: Jupyter Notebook for additional test assessments.
💻 Code and Markdown
- testassignment: Code and markdown files related to assignments.
  - `Code + Markdown/`: Contains code snippets and explanations.
  - `Run All/`: Script to execute all code cells in notebooks.
🚫 Stop Words
- Stop Words: Directory with various stop words files for preprocessing.
📊 Text Analysis
- text_analysis: Files for performing text analysis.
  - `textanalysis.ipynb`: Jupyter Notebook for text analysis.
  - `sentiment_analysis.log`: Log file for sentiment analysis results.
  - `textblob_sentiment_result.csv`: CSV file with sentiment analysis results.
📈 Additional Files
- additional_files: Summary results and metrics.
  - `analysis_results.csv`: Various analysis results.
  - `final_text_analysis_results.xlsx`: Final compiled analysis results.
- Objective: Extract textual data from provided URLs and perform text analysis.
- Data Extraction:
  - Input from `Input.xlsx`.
  - Tools: Python, BeautifulSoup, Selenium, Scrapy.
- Data Analysis:
  - Output in CSV or Excel format.
  - Variables include Positive Score, Negative Score, Polarity Score, etc.
- Timeline: Duration of 6 days.
- Submission: Via Google Form with required files.
- Sentiment Analysis: Clean text using stop words, create dictionaries of positive/negative words, and extract variables.
- Readability Analysis: Calculate average sentence length, percentage of complex words, and Fog Index (see the sketch below).
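For reference, a minimal Python sketch of those readability formulas, assuming the standard Gunning Fog definition (the assignment may define "percentage of complex words" as a fraction rather than a percentage):

```python
def readability_metrics(words, sentences, complex_words):
    """Compute the three readability variables from simple counts.

    words         -- total word count in the article
    sentences     -- total sentence count
    complex_words -- words with more than two syllables
    """
    avg_sentence_length = words / sentences
    pct_complex_words = (complex_words / words) * 100
    fog_index = 0.4 * (avg_sentence_length + pct_complex_words)
    return avg_sentence_length, pct_complex_words, fog_index

# Example: a 500-word article with 25 sentences and 90 complex words
asl, pcw, fog = readability_metrics(500, 25, 90)
print(asl, pcw, fog)  # 20.0, 18.0, 15.2
```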
Objective:
The ProText-Analyzer project extracts article content from provided URLs and performs various text analysis tasks like sentiment scoring, readability measurement, and more. The results are structured in a clean and organized format, ready for review and further use.
The goal of ProText-Analyzer is to:
- Extract Textual Data: Fetch the article content from URLs provided in the `Input.xlsx` file.
- Perform Textual Analysis: Calculate the following metrics:
- Sentiment scores (positive, negative, polarity, subjectivity)
- Readability scores (Fog Index, Avg. Sentence Length)
- Word count, syllable count, and other word statistics
- Python 🐍
- Libraries:
  - `TextBlob` for sentiment analysis
  - `spaCy` for text processing tasks (tokenization, POS tagging, etc.)
  - `Syllapy` for syllable counting
  - `BeautifulSoup` for HTML parsing during data extraction
  - `Requests` for handling HTTP requests
  - `Pandas` for data management
- Excel/CSV for input/output handling
- Clone the repository to your local machine:

  ```bash
  git clone https://github.com/rubydamodar/ProText-Analyzer.git
  cd ProText-Analyzer
  ```

- Install the required Python libraries:

  ```bash
  pip install -r requirements.txt
  ```
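The repository's actual `requirements.txt` is not reproduced here, but based on the libraries listed above it would contain roughly the following (`openpyxl` is an assumption, added for reading/writing the Excel files with pandas):

```text
textblob
spacy
syllapy
beautifulsoup4
requests
pandas
openpyxl
```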
The ProText-Analyzer extracts the article title and body from each URL listed in the `Input.xlsx` file and stores the text for further analysis.
- Read Input File: Load the URLs and their associated IDs from `Input.xlsx`.
- Extract Article Content:
  - Fetch HTML content using `requests`.
  - Parse the HTML using `BeautifulSoup` to extract the article's title and body.
  - Save the extracted content into text files named after the `URL_ID`.
- Each article's content is saved in its own text file, keeping the pipeline clean for further analysis.
- Error handling ensures proper management of file I/O and network issues (see the sketch below).
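A minimal sketch of that extraction loop, assuming `Input.xlsx` has `URL_ID` and `URL` columns as the assignment specifies, and that the article title sits in an `<h1>` tag with the body in `<p>` tags; the real notebook may use different selectors:

```python
import os

import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.read_excel("Input.xlsx")  # expects URL_ID and URL columns
os.makedirs("extracted_articles", exist_ok=True)

for _, row in df.iterrows():
    try:
        resp = requests.get(row["URL"], timeout=10)
        resp.raise_for_status()
    except requests.RequestException as err:
        print(f"Skipping {row['URL_ID']}: {err}")
        continue

    soup = BeautifulSoup(resp.text, "html.parser")
    # Assumed selectors: title in <h1>, article body in <p> tags.
    h1 = soup.find("h1")
    title = h1.get_text(strip=True) if h1 else ""
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

    # One text file per article, named after its URL_ID.
    with open(f"extracted_articles/{row['URL_ID']}.txt", "w", encoding="utf-8") as f:
        f.write(title + "\n" + body)
```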
The extracted text undergoes several analysis steps to compute the following variables:
- Sentiment Analysis:
  - Implemented using `TextBlob` to compute Positive Score, Negative Score, Polarity Score, and Subjectivity Score.
  - Text is cleaned by removing stop words and irrelevant characters.
- Readability Analysis:
  - Calculated using the Gunning Fog Index.
  - Additional metrics: Average Sentence Length and Percentage of Complex Words.
- Word-Level Metrics:
  - Word Count, Complex Word Count, Syllable Count per Word (via `syllapy`), Personal Pronouns Count (using regex), and Average Word Length.
The results are saved in Excel/CSV format as per the structure outlined in `Output Data Structure.xlsx`. The following variables are included:
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Average Sentence Length
- Complex Word Count
- Word Count
- Syllable Count
- Personal Pronouns Count
- Average Word Length
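A minimal sketch of assembling and saving those variables with `pandas`; the column names and values here are illustrative, since the authoritative layout comes from `Output Data Structure.xlsx`:

```python
import pandas as pd

results = [
    # One dict per analyzed article; values come from the analysis steps above.
    {"URL_ID": 1, "Positive Score": 12, "Negative Score": 4,
     "Polarity Score": 0.5, "Subjectivity Score": 0.42,
     "Avg Sentence Length": 20.0, "Complex Word Count": 90,
     "Word Count": 500, "Syllable Count": 810,
     "Personal Pronouns": 3, "Avg Word Length": 4.8},
]

df = pd.DataFrame(results)
df.to_csv("analysis_results.csv", index=False)
df.to_excel("final_text_analysis_results.xlsx", index=False)  # needs openpyxl
```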
- Data Extraction: Run the script to extract article data from the URLs:

  ```bash
  python data_extraction.py
  ```

- Text Analysis: Run the text analysis script to process the extracted articles:

  ```bash
  python text_analysis.py
  ```

The results will be saved in the output directory in `.csv` or `.xlsx` format.
- Error Handling: Implemented robust error handling to manage potential network and file-related issues.
- Text Processing: Utilized `spaCy` for precise text tokenization and POS tagging, and `syllapy` for syllable counting.
- Personal Pronouns: Regex was used to capture personal pronouns accurately without mistakenly counting words like "US" (see the sketch below).
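A minimal sketch of such a pronoun regex, assuming the usual pronoun list for this assignment (I, we, my, ours, us); the case-sensitive check on "US" is what keeps the country abbreviation from being counted:

```python
import re

# Case-insensitive match on the pronoun list; "US" is filtered out below.
PRONOUN_RE = re.compile(r"\b(I|we|my|ours|us)\b", re.IGNORECASE)

def count_personal_pronouns(text):
    count = 0
    for match in PRONOUN_RE.finditer(text):
        if match.group(0) == "US":  # exclude the country abbreviation
            continue
        count += 1
    return count

print(count_personal_pronouns("We flew to the US because the boss told us to."))  # 2
```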
We welcome contributions to enhance ProText-Analyzer! To contribute:
- Fork the repository.
- Create a new branch for your changes.
- Submit a pull request with a detailed description of your changes.
This project is licensed under the MIT License.
Ruby Poddar
Email: [email protected]