MoreThanSentiments: A text analysis package

Open Access. Published: December 21, 2022. DOI: https://doi.org/10.1016/j.simpa.2022.100456

      Highlights

      • Convert non-trivial text quantification methods into simple functions.
      • Domain-agnostic information extraction algorithms.
      • Work efficiently with large text corpora.
      • Facilitate text characterization beyond sentiment analysis, counting of words, and computing readability.

      Abstract

      Text mining on a large corpus of data has gained utility and popularity over recent years owing to advancements in information retrieval and machine learning methods. However, popular text mining software packages mainly focus on either sentiment analysis or semantic meaning extraction, requiring pretraining on a large corpus of text data. In comparison, MoreThanSentiments provides computation of newer text attribution measures, including boiler score, specificity, redundancy, and hard info, which have been proposed in accounting analytics literature. Our software package, available in Python, is flexible in terms of parameter setting and is adaptable to different applications. Through this package, we seek to simplify the process of deploying nontrivial information extraction techniques published in domain-specific text analysis research into domain-agnostic analytics applications.

      Keywords

      Code metadata
      Table 1
      Current code version: V 0.2.0
      Permanent link to code/repository used for this code version: https://github.com/SoftwareImpacts/SIMPAC-2022-293
      Permanent link to reproducible capsule: https://codeocean.com/capsule/3686195/tree/v1
      Legal code license: BSD 3-Clause License
      Code versioning system used: Git
      Software code languages, tools and services used: Python
      Compilation requirements, operating environments and dependencies: tqdm (4.59.0), spacy (3.3.0), pandas (1.2.4), nltk (3.6.1)
      If available, link to developer documentation/manual: https://github.com/jinhangjiang/morethansentiments/blob/main/README.md
      Support email for questions: [email protected]

      1. Motivation and significance

      Natural language processing, one of the major data mining methods, has been extended and applied in many research fields over the past decades. Researchers have put considerable effort into studying unstructured text data by leveraging information extraction and retrieval techniques. Pre-trained models, such as the sentiment analysis models provided by TextBlob [Loria, 2020] or VADER [Hutto and Gilbert, 2014], allow users to deploy powerful models for those tasks with less training time and fewer resources. Recent studies have also proposed newer methods, such as Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al., 2018], for extracting the semantic meaning of text. However, text features from existing software packages in Python and R either rely on pre-trained, high-performance models whose features cannot be directly interpreted, or offer features with limited scope for explaining the nature of the text content. Sentiment analysis software typically provides the polarity or aggregate negative/positive sentiment expressed in text. Text analysis software such as py-readability-metrics provides metrics, including Gunning Fog, SMOG, and Flesch-Kincaid, that focus on the readability of the text. Deep learning methods, in turn, have been used to summarize longer texts into short paragraphs [Yousefi-Azar and Hamey, 2017].
      MoreThanSentiments [Jiang and Srinivasan, 2022] was motivated by users' demand for more nontrivial methods to quantify text and summarize its structure. Currently, the package supports the following text complexity metrics: Boilerplate [Lang and Stice-Lawrence, 2015], a measurement of informativeness; Redundancy [Cazier and Pfeiffer, 2017], a measurement of usefulness; Specificity [Hope et al., 2016], a measurement of how uniquely a text relates to a particular subject; and Relative Prevalence [Blankespoor, 2019], a measurement of hard information. This domain-agnostic package can easily be applied to text quantification tasks in various projects. Additionally, we expect that adopting the features in this package in new settings can enable a range of downstream work.

      2. Software description

      In this section, we discuss the functionality of MoreThanSentiments, followed by the demonstrations of the main functions.

      2.1 Software architecture

      The package, MoreThanSentiments, is implemented in Python. Currently, it is composed of one major module that supports all of the features:
      • Read raw .txt files into a pandas DataFrame
      • Clean and preprocess the text corpora
      • Calculate Boilerplate, Redundancy, Specificity, Relative Prevalence

      2.2 Software functionalities

      A boilerplate is a group of words (e.g., a tetragram) that may be omitted from a statement without altering its semantic meaning. The boiler score [Lang and Stice-Lawrence, 2015] is therefore a measurement of informativeness: it compares the number of words in sentences that contain boilerplate language to the total number of words in the document. Thus, the higher the boiler score, the lower the informativeness of the given document. To identify boilerplate, the user first sets the n-gram length; the default is four words (a tetragram). The whole corpus is then scanned, and the number of documents in which each candidate n-gram appears is recorded. By default, only n-grams that appear in at least five documents and in fewer than 75% of all documents are used to calculate the boiler scores; this frequency threshold serves as a bias control. The Boilerplate formula is as follows:
      \[
      \text{Boilerplate} = \frac{W_s}{W_d} \tag{1}
      \]
      $W_s$ is the word count of the sentences that contain a boilerplate.
      $W_d$ is the word count of the whole document.
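      As a rough illustration of Eq. (1), the sketch below computes the score for a single document once a set of boilerplate n-grams is known. The function name `boiler_score` and the input layout (one token list per sentence) are assumptions made for this example; this is not the package's internal implementation.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def boiler_score(sentences, boilerplates, n=4):
    """Eq. (1): words in sentences containing a boilerplate / all words.

    sentences    -- list of token lists, one per sentence (assumed layout)
    boilerplates -- set of n-gram tuples already flagged as boilerplate
    """
    total_words = sum(len(s) for s in sentences)  # W_d
    if total_words == 0:
        return 0.0
    flagged_words = sum(  # W_s
        len(s) for s in sentences
        if any(g in boilerplates for g in ngrams(s, n))
    )
    return flagged_words / total_words


# Toy document: two six-word sentences, the first contains a flagged tetragram.
doc = [
    "this report contains forward looking statements".split(),
    "revenue rose in the third quarter".split(),
]
flagged = {("contains", "forward", "looking", "statements")}
score = boiler_score(doc, flagged)  # 6 flagged words out of 12
```

      Because whole sentences are counted, a single short boilerplate phrase inside a long sentence contributes the full sentence length to the numerator.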
      The degree of Redundancy indicates how useful a document is. It is the proportion of very long word sequences (e.g., 10-grams) that appear more than once in a given document. If a very long statement or phrase is used repeatedly, the author is restating the same information, so the repeated passage is treated as not useful. As with Boilerplate, the higher the Redundancy, the less useful the given document is.
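      One way to read this definition is sketched below. It assumes that the proportion is taken over n-gram occurrences, so that every occurrence of a repeated n-gram counts; the function name `redundancy` is hypothetical, and the package may count differently.

```python
from collections import Counter


def redundancy(tokens, n=10):
    """Share of n-gram occurrences whose n-gram appears more than once.

    tokens -- list of tokens for one document (assumed already cleaned)
    n      -- n-gram length; 10 mirrors the example length in the text
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    # Sum the occurrences of every n-gram that is seen at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


# Small n for readability: the trigram (a, b, c) occurs twice among 5 trigrams.
example = redundancy("a b c a b c d".split(), n=3)
```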
      Specificity is a measure of the ability to relate specifically to a certain subject. It is described as the number of specific entity names, numerical values, and times/dates scaled by the overall word count of a document. Currently, the Named Entity Recognizer from spaCy serves as the foundation for the Specificity function.
      Hard information of a given corpus is measured by Relative Prevalence. It compares the number of numerical values to the overall length of the text. It aids in assessing the amount of quantitative data in a particular text.
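      A minimal sketch of this ratio follows. The whitespace tokenization, the regular expression for what counts as a numerical value, and the name `relative_prevalence` are all assumptions made for illustration, not the package's actual rules.

```python
import re

# Assumed pattern: optional sign, optional currency sign, digits with
# optional thousands separators and decimals, optional percent sign.
NUMBER = re.compile(r"^[-+]?\$?\d[\d,]*(\.\d+)?%?$")


def relative_prevalence(text):
    """Number of numeric tokens divided by the total number of tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    # Strip surrounding punctuation so that "2021." still counts as a number.
    numeric = sum(1 for t in tokens if NUMBER.match(t.strip(".,;:()")))
    return numeric / len(tokens)


# Three numeric tokens ("12%", "$3,400", "2021") out of seven.
ratio = relative_prevalence("Sales rose 12% to $3,400 in 2021")
```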

      3. Illustrative examples

      In this section, we illustrate three usage examples of MoreThanSentiments. For a full usage guide, please refer to the library documentation (https://pypi.org/project/MoreThanSentiments/). The dataset we use in our experiments is the BBC Business news dataset [Greene and Cunningham, 2006].
      The code below demonstrates how to read the raw text data (“read_txt_files”) and perform text cleaning (“clean_data”) as needed.
      For the data cleaning function (“clean_data”), we offer the following options:
      • lower: make all the words lowercase
      • punctuations: remove all the punctuations in the corpus
      • number: remove all the digits in the corpus
      • unicode: remove all the types of unicode in the corpus
      • stop_words: remove the stopwords in the corpus
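      To make the effect of these switches concrete, here is a standalone sketch with the same flag names. It is not the package's `clean_data` implementation: the function name `clean_text`, the tiny stop-word set, and the reading of “unicode” as “drop non-ASCII characters” are assumptions.

```python
import re
import string

# Tiny illustrative stop-word set; a real pipeline would use a fuller list.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and", "is"}


def clean_text(text, lower=False, punctuations=False, number=False,
               unicode=False, stop_words=False):
    """Apply the selected cleaning steps and normalize whitespace."""
    if unicode:
        # Assumption: "remove unicode" means dropping non-ASCII characters.
        text = text.encode("ascii", "ignore").decode("ascii")
    if lower:
        text = text.lower()
    if punctuations:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if number:
        text = re.sub(r"\d+", "", text)
    words = text.split()
    if stop_words:
        words = [w for w in words if w.lower() not in STOP_WORDS]
    return " ".join(words)


cleaned = clean_text("The café sold 25 cups, in 2021!", lower=True,
                     punctuations=True, number=True, unicode=True,
                     stop_words=True)
```

      Keeping each step behind its own flag lets the same text be prepared differently per metric, for example keeping digits when the downstream measure counts numerical values.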
      The following code illustrates how to calculate the boiler score. The function must be applied to the whole corpus rather than to a single document.
      For the “Boilerplate” function, we offer the following options:
      • input_data: this function requires tokenized documents.
      • n: number of the ngrams to use. The default is 4.
      • min_doc: when building the ngram list, ignore the ngrams that have a document frequency strictly lower than the given threshold. The default is 5 documents. 30% of the number of documents is recommended. When the parameter is given as a decimal (e.g., 0.3), it is read as a percentage of the number of documents.
      • get_ngram: if this parameter is set to “True”, the function returns a dataframe with all the ngrams and their corresponding frequencies, and the “min_doc” parameter has no effect.
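      The document-frequency filtering described for “min_doc” can be sketched as follows. The function name `select_boilerplates` and the `max_doc` parameter name are hypothetical; the upper bound of 0.75 reflects the 75% default mentioned in Section 2.2, and the nesting of the input (documents, then sentences, then tokens) is an assumption.

```python
from collections import Counter


def select_boilerplates(docs, n=4, min_doc=5, max_doc=0.75):
    """Return n-grams whose document frequency df satisfies lo <= df < hi.

    docs    -- list of documents, each a list of token lists (sentences)
    min_doc -- lower bound; a value below 1 is read as a share of the corpus
    max_doc -- hypothetical upper bound, interpreted the same way
    """
    n_docs = len(docs)
    lo = min_doc * n_docs if 0 < min_doc < 1 else min_doc
    hi = max_doc * n_docs if 0 < max_doc < 1 else max_doc
    doc_freq = Counter()
    for doc in docs:
        seen = set()  # count each n-gram at most once per document
        for sent in doc:
            for i in range(len(sent) - n + 1):
                seen.add(tuple(sent[i:i + n]))
        doc_freq.update(seen)
    return {g for g, df in doc_freq.items() if lo <= df < hi}


# Four one-sentence documents; bigrams are used to keep the example small.
corpus = [
    [["we", "thank", "you"]],
    [["we", "thank", "them"]],
    [["we", "thank", "you"]],
    [["we", "thank", "all"]],
]
kept = select_boilerplates(corpus, n=2, min_doc=2, max_doc=0.75)
```

      In this toy corpus the bigram that occurs in every document is dropped by the upper bound, while bigrams seen in only one document are dropped by the lower bound.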

      4. Impact

      Our proposed software package contains functions that can benefit research in multiple disciplines, including but not limited to accounting, finance, information systems, marketing, management science, information sciences, applied computer science, and applied linguistics. For example, the boiler scores of financial disclosures could indicate the extent to which firms adopt common phrases from other firms or reuse statements from their previous disclosures. Trends in disclosure reporting behavior regarding boilerplate, specificity, and hard information could help explain the role of disclosure scripting in firm image, performance, and market behavior. Another application that examines Specificity, Boilerplate, and Redundancy could be the impact of pre-written bot responses to customer queries. Which textual characteristics of bot responses are most appreciated by end users, and how they contribute to problem resolution, are interesting research questions to address. A third potential application is a comparison of the text characteristics of spam mail versus regular email. One could expect boiler scores for regular emails to be much lower than those for spam emails, as spam often relies on common ‘key’ phrases geared towards fear-mongering and click-baiting of end users. Though measures such as Boilerplate and Redundancy have been widely used in accounting literature [Lang and Stice-Lawrence, 2015; Cazier and Pfeiffer, 2017; Hope et al., 2016; Blankespoor, 2019], recent studies have considered such measures in applications such as predicting crowdfunding success [Pu and Srinivasan, 2022]. Our software code has been widely downloaded from the GitHub repository. Therefore, we decided to convert the code to a software package for ease of use and program replicability.

      5. Conclusions

      We propose a new software package called MoreThanSentiments that includes a list of text characterization features. The textual features we present via the Python software package are unavailable elsewhere as reproducible code or software for general-purpose applications. These features originate from multiple studies in the accounting analytics discipline that focus on gleaning a variety of quantifiable information from the financial disclosures of firms. We make these quantitative features available for general-purpose applications by allowing users to generate them through simple functions with user-defined parameters. Our package facilitates text characterization beyond sentiment analysis, counting of words, and computing readability.
      The future development of MoreThanSentiments will focus on expanding its capabilities in terms of the text attribution measures it can compute [Davis et al., 2012; Li, 2010; Brown and Tucker, 2011], as well as making the software more user-friendly and adaptable to different applications. This will involve continuing to improve the underlying machine learning algorithms and information retrieval methods, as well as incorporating user feedback to ensure that the software meets the needs of a wide range of users. By providing a flexible and easy-to-use package for text mining on large corpora of data, MoreThanSentiments has the potential to become an essential tool for researchers and practitioners in a variety of fields.

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      References

        • Loria S. TextBlob documentation release 0.16.0, TextBlob. 2020 (https://textblob.readthedocs.io/en/dev/)
        • Hutto C.J., Gilbert E. VADER: A parsimonious rule-based model for sentiment analysis of social media text. Eighth International AAAI Conference on Weblogs and Social Media. 2014: 18 (https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/viewPaper/8109)
        • Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. 2018 (http://arxiv.org/abs/1810.04805)
        • Yousefi-Azar M., Hamey L. Text summarization using unsupervised deep learning. Exp. Syst. Appl. 2017; 68: 93-105 https://doi.org/10.1016/j.eswa.2016.10.017
        • Jiang J., Srinivasan K. MoreThanSentiments. 2022 https://doi.org/10.5281/zenodo.6853351
        • Lang M., Stice-Lawrence L. Textual analysis and international financial reporting: Large sample evidence. J. Account. Econ. 2015; 60: 110-135 https://doi.org/10.1016/j.jacceco.2015.09.002
        • Cazier R.A., Pfeiffer R.J. 10-K disclosure repetition and managerial reporting incentives. J. Financial Rep. 2017; 2: 107-131 https://doi.org/10.2308/jfir-51912
        • Hope O.-K., Hu D., Lu H. The benefits of specific risk-factor disclosures. Rev. Account. Stud. 2016; 21: 1005-1045 https://doi.org/10.1007/s11142-016-9371-1
        • Blankespoor E. The impact of information processing costs on firm disclosure choice: Evidence from the XBRL mandate. J. Account. Res. 2019; 57: 919-967
        • Greene D., Cunningham P. Practical solutions to the problem of diagonal dominance in kernel document clustering. ACM Int. Conf. Proc. Ser. 2006; 148: 377-384 https://doi.org/10.1145/1143844.1143892
        • Pu S., Srinivasan K. Are project narrative attributes indicative of pre-order campaign success on crowdfunding platforms? A text-mining approach. In: MWAIS 2022 Proceedings, AIS Electronic Library (AISeL). 2022.
        • Davis A.K., Piger J.M., Sedor L.M. Beyond the numbers: Measuring the information content of earnings press release language. Contemp. Account. Res. 2012; 29: 845-868 https://doi.org/10.1111/j.1911-3846.2011.01130
        • Li F. The information content of forward-looking statements in corporate filings—A naïve Bayesian machine learning approach. J. Account. Res. 2010; 48: 1049-1102 https://doi.org/10.1111/j.1475-679X.2010.00382
        • Brown S.V., Tucker J.W. Large-sample evidence on firms’ year-over-year MD&A modifications. J. Account. Res. 2011; 49: 309-346 https://doi.org/10.1111/j.1475-679X.2010.00396