Site Logo Site Flag

AI-Powered Parsing & Brand Monitoring Case Study | Flexi IT’s Scalable Solution

The client required monitoring articles on a prominent website to track mentions of their brand.

AI-Powered Parsing & Brand Monitoring Case Study | Flexi IT’s Scalable Solution

Introduction

Our goal was to implement the most effective parsing process for articles on a well-known website, preserving both textual content and associated images for subsequent analysis, while specifically ignoring live streaming content.

Challenge

The client required monitoring articles on a prominent website to track mentions of their brand. Additionally, an automated solution was necessary to identify whether these mentions occurred in a positive or negative context.

Solution

We developed a custom parser designed to extract content from articles on the target website, with the capability to analyse the content for specified keywords and determine their context.

Parsing Process

  • Link Collection: Scanning the website and selecting relevant URLs.
  • Article Crawling: Extracting text and images while deliberately skipping live streams.
  • Filtering: Keyword detection within the article text.
  • Data Storage: Efficiently saving data into a database and caching where necessary.
  • Context Detection: Utilising advanced AI/ML tools for accurate sentiment analysis.
  • Error Handling: Robust logging mechanisms with automated notifications for error tracking.

Architecture

We selected a microservices architecture for optimal scalability and modular expansion capabilities. The architecture includes:

  • URL Collector Service: Gathers links to relevant articles.
  • Article Parser: Processes the articles and extracts necessary data.
  • Context Analyzer: Determines sentiment around keyword mentions.
  • Database: Stores collected data.
  • Cache System: Provides rapid data access.
  • Message Broker: Facilitates communication between microservices.
  • Error Tracking: Centralised logging and issue reporting.

Source Code

The project's codebase is accessible here: GitLab Repository

Technology Stack

Programming Language:

  • Python

Databases:

  • MongoDB
  • Redis (enhanced with RedisInsight for improved monitoring)

Cache:

  • Redis (monitored with RedisInsight)

Message Broker:

  • Apache Kafka (monitored with KafkaUI)

Version Control and CI/CD:

  • GitLab

Error Tracking and Logging:

  • Sentry

Virtualisation:

  • Docker
  • Docker-compose

Code Quality Tools:

  • flake8, mypy, black, ruff, dlint
  • pytest

Python Libraries:

  • urllib3
  • requests
  • loguru
  • kafka-python-ng3
  • python-dotenv
  • pydantic
  • sentry-sdk[loguru]
  • beautifulsoup4
  • redis
  • pymongo
  • openai

Outcomes and Benefits

The technologies implemented have created an efficient, scalable system for parsing articles:

  • Python, combined with powerful libraries like BeautifulSoup, facilitated reliable data extraction and asynchronous processing.
  • MongoDB provided flexible, scalable data storage capabilities essential for managing large volumes of information
  • Redis improved system performance significantly by acting as a cache, reducing database load, and ensuring fast data access.
  • Apache Kafka effectively handled inter-service messaging, enhancing system reliability and scalability.
  • Sentry enabled rapid error detection and resolution, contributing to system stability.
  • Docker and docker-compose simplified system deployment, ensuring isolation of components and streamlining testing and integration processes.
  • Code Quality Tools ensured consistently high standards of code, proactively addressing potential issues during development.

The resulting system efficiently meets the client's requirements for monitoring and parsing, providing robust scalability for dynamic workload adjustments. Operational management has been substantially simplified through the integration of monitoring tools such as RedisInsight and KafkaUI.

 

 

 

 

Up