Indexing Solution for portal with large recipe data

Indexing, Crawling Solutions Portal for Large Data

About Customer & Business Need:

The customer developed a web portal that indexes over 1.5 million recipes. The portal acts as a search engine for finding recipes in cookbooks, magazines and blogs. It is also a platform for exchanging recommendations, finding new ideas and following blogs written by cooking experts.

The goal is to continuously index more recipe books, but the manual process used for crawling and indexing was too slow to add new data quickly. It also required tedious coordination between crawler staff, indexer staff, editors and portal administrators, because all data had to pass careful quality assurance before being added to the portal.

Our team studied the various book formats, the data formats and the existing manual crawling and indexing process. We then proposed developing and implementing an automated crawling and indexing solution, which would be of significant benefit to the business.

Our Solution & Process:

The project was divided into two parts: first, system analysis and solution envisioning; second, solution development.

A. System Analysis

Ace Publishing spent 4.5 weeks on a detailed analysis of the possible solution, with the following goals in mind:

- Drastically reducing the time taken by the data crawling/indexing process. This helps the customer improve the productivity of the portal's content management and provides a technological advantage in the industry.
- Capturing a wide variety of data points.
- Capturing the descriptions and steps contained in the data across several book formats.
- Establishing a strong foundation for tackling content management issues that may arise in the coming years. A structured data solution also makes it possible to explore further possibilities and additional service channels within the boundary of the core offerings.

The analysis phase produced tangible outcomes. In addition to the benefits mentioned below, screen designs were completed, the workflow was finalized and the technical architecture for the crawling/indexing solution was created.

The analysis covered the following:

- Analyzed books/PDFs in 4 different layouts, which showed how the solution would handle different formats and layouts.
- Converted PDF files of 75 to 100 MB to HTML, which helped assess the performance and the file sizes the solution would need to handle.
- Converted a single book PDF of 350+ pages to HTML, which helped assess how well the solution performs on a high volume of pages.
- Checked the width of two pages, which helped determine the maximum page size/resolution.

B. Crawling & Indexing Solution Development

A new Crawling & Indexing Automation solution was developed as a system separate from the existing content portal. The new system enables the customer to drastically reduce the time needed to crawl and index content and data, and also allows a variety of data points to be captured.
The solution included the following features and sections:
- Uploading of PDFs and Docs
- The system crawls the data and makes it available for storage in the database in structured form
- Manual entry of metadata from the book's content
- Selection of the page sections to be crawled/indexed
- The system can replicate the crawling/indexing flow on subsequent pages
- Stored data can be viewed and searched by any keyword or metadata information
- The crawled data stored in the database can later be used for future integrations or for creating a unique data model for sales and marketing purposes
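
The flow in the list above (select page sections once, replicate them on subsequent pages, store the result in structured form, then search by keyword or metadata) can be illustrated with a minimal sketch. It is not the delivered implementation: the source does not name the technology used, so an in-memory SQLite database stands in for the database, lists of text lines stand in for crawled page content, and every name here (Region, extract_fields, build_index, search, the sample book) is a hypothetical chosen for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A page section selected for crawling (hypothetical model).

    Lines top..bottom-1 of a page are captured into the named field.
    """
    field: str
    top: int
    bottom: int


def extract_fields(page_lines, regions):
    """Apply one region template to a single page of text lines."""
    return {r.field: " ".join(page_lines[r.top:r.bottom]).strip() for r in regions}


def build_index(pages, regions, metadata):
    """Replicate the same regions on every page and store the rows."""
    conn = sqlite3.connect(":memory:")  # assumption: stand-in for the real database
    conn.execute(
        "CREATE TABLE recipe (page INTEGER, book TEXT, author TEXT, "
        "title TEXT, ingredients TEXT, steps TEXT)"
    )
    for number, lines in enumerate(pages, start=1):
        fields = extract_fields(lines, regions)
        conn.execute(
            "INSERT INTO recipe VALUES (?, ?, ?, ?, ?, ?)",
            (number, metadata["book"], metadata["author"],
             fields["title"], fields["ingredients"], fields["steps"]),
        )
    conn.commit()
    return conn


def search(conn, keyword):
    """Return (page, title) rows whose content or metadata contains keyword."""
    like = f"%{keyword}%"
    rows = conn.execute(
        "SELECT page, title FROM recipe "
        "WHERE title LIKE ? OR ingredients LIKE ? OR steps LIKE ? "
        "OR book LIKE ? OR author LIKE ? ORDER BY page",
        (like,) * 5,
    )
    return rows.fetchall()


# Illustrative data only: two pages that share the same layout.
PAGES = [
    ["Tomato Soup", "tomatoes, onion, basil",
     "Simmer for 20 minutes.", "Blend until smooth."],
    ["Basil Pesto", "basil, pine nuts, olive oil",
     "Crush everything together.", "Season to taste."],
]
REGIONS = [Region("title", 0, 1), Region("ingredients", 1, 2), Region("steps", 2, 4)]
METADATA = {"book": "Sample Cookbook", "author": "Example Author"}

conn = build_index(PAGES, REGIONS, METADATA)
print(search(conn, "basil"))
print(search(conn, "sample cookbook"))
```

In this sketch the regions are defined once and reused for every page, which mirrors the idea of replicating the crawling/indexing flow on subsequent pages. The manually entered metadata is stored next to the crawled fields so that one query can match either.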

5 Months