Browsing Category Content Extraction

Extracting Media from HTML

Over the last decade, there has been a rapid development from static webpages with little interactive content to media-rich and dynamic web pages. This can be attributed to new web technologies that make embedding of multimedia content incredibly simple. For example, adding a video in HTML5 is as easy as wrapping it in <video> tags.…

Continue Reading
0 Comments

A Benchmark Comparison Of Content Extraction From HTML Pages

Introduction Content extraction is the task of separating boilerplate such as comments, navigation bars, social media links, ads, etc, from the main body of text of an article formatted as HTML. The main content typically accounts for only a small portion of a page’s source code (highlighted in red in the image below). Extraction is…

Continue Reading
0 Comments