The FAQ Extract and Import features support extracting FAQs only from the following sources:
- JSON, CSV, and PDF files
- Web pages
Comma-Separated Values (CSV)
- The import interprets the text in the first column as a question and the text in the second column as an answer.
- The file must not have any headers.
- Any headers and the text present in the other columns are ignored.
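The column rule above can be illustrated with a minimal Python sketch. It is not the product's importer: the function name `read_faq_rows` and the choice to skip rows that lack a question or an answer are assumptions made for the example.

```python
import csv
import io


def read_faq_rows(csv_text):
    """Return (question, answer) pairs taken from the first two columns of each row."""
    pairs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # assumption: rows without both columns are skipped
        question, answer = row[0].strip(), row[1].strip()
        if question and answer:
            # Columns beyond the second are never read, so they are effectively ignored.
            pairs.append((question, answer))
    return pairs


sample = (
    'What are your hours?,We are open 9 to 5.,extra note\n'
    '"How do I reset my password?","Click ""Forgot password"" on the login page."\n'
)
faqs = read_faq_rows(sample)
for question, answer in faqs:
    print(question, "->", answer)
```

Using the standard `csv` module rather than splitting on commas keeps quoted fields intact, so answers that contain commas or quotation marks stay in a single column.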
Portable Document Format (PDF)
- FAQ extraction from PDF files processes the content of a PDF and converts it into question-answer pairs.
- Documents with a table of contents: A document with a table of contents is preferred. In such cases, the table of contents is extracted first and then used to parse the document and identify headings. The information in the table of contents is used to derive the hierarchy of headings (headings, subheadings, nested subheadings, and so on). During extraction, these levels are separated by a vertical line as a delimiter (heading | subheading | sub-subheading).
- Documents with no table of contents: In such cases, a pre-trained machine learning model identifies headings based on either font style or font size. When font size is used, the heading hierarchy can also be derived.
- The text is then formatted with a uniform header and paragraph blocks.
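As a rough illustration of the vertical-line delimiter described above, the following Python sketch joins heading levels into paths. The input shape (a list of `(level, title)` tuples) and the function name `heading_paths` are hypothetical, since the internal representation of the table of contents is not documented here.

```python
def heading_paths(toc_entries):
    """Build 'heading | subheading | ...' paths from (level, title) entries.

    Assumes level 1 is a top-level heading, level 2 a subheading, and so on.
    """
    stack = []  # titles of the headings currently "open", one per level
    paths = []
    for level, title in toc_entries:
        # Close any headings at the same or a deeper level than this entry.
        del stack[level - 1:]
        stack.append(title)
        paths.append(" | ".join(stack))
    return paths


toc = [
    (1, "Billing"),
    (2, "Refunds"),
    (3, "Refund timelines"),
    (2, "Invoices"),
    (1, "Accounts"),
]
paths = heading_paths(toc)
for path in paths:
    print(path)
```

For the sample entries, the third path is `Billing | Refunds | Refund timelines`, which mirrors the heading | subheading | sub-subheading pattern.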
Web Pages
The Extract FAQs feature supports the following three types of FAQ web pages:
- Plain FAQ pages with linear question-answer pairs.
- Pages with question hyperlinks that point to answers on the same page.
- Pages with question hyperlinks that point to answers on a different page.
Extraction of certain FAQs on the webpage fails under the following conditions:
- The question text is split between multiple HTML tags on the FAQ page.
- The tag applied to the answer is neither the child nor the sibling of the extracted question as per the HTML DOM structure.
- The question does not have a hyperlink to the answer (applies to FAQs with hyperlinks).
- The question hyperlinks to the answer, but the question statement is not repeated above the answer (applies to FAQs with hyperlinks).
Extraction of the entire FAQ page fails if the page combines more than one of the FAQ page types listed above.
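The child-or-sibling condition listed above can be checked on a small, well-formed snippet. The sketch below uses Python's standard `xml.etree.ElementTree`. It is not the extractor's actual logic: the helper name `answer_is_reachable`, the sample markup, and the reading of "child" as a direct child are all assumptions.

```python
import xml.etree.ElementTree as ET


def answer_is_reachable(root, question, answer):
    """Return True if `answer` is a direct child or a sibling of `question`."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    answer_parent = parent_of.get(answer)
    is_child = answer_parent is question
    is_sibling = answer_parent is not None and answer_parent is parent_of.get(question)
    return is_child or is_sibling


# Hypothetical FAQ markup: the first answer sits next to its question,
# the second answer lives in an unrelated part of the tree.
page = ET.fromstring(
    "<div>"
    "<section><h3 id='q1'>What is a bot?</h3><p id='a1'>A bot automates chat.</p></section>"
    "<section><h3 id='q2'>What is an FAQ?</h3></section>"
    "<aside><p id='a2'>A list of common questions.</p></aside>"
    "</div>"
)
by_id = {el.get("id"): el for el in page.iter() if el.get("id")}

first_ok = answer_is_reachable(page, by_id["q1"], by_id["a1"])
second_ok = answer_is_reachable(page, by_id["q2"], by_id["a2"])
print(first_ok, second_ok)
```

In this sample, the second pair corresponds to the failure case: the answer is neither a child nor a sibling of its question.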