Highlights
- Convert non-trivial text quantification methods into simple functions.
- Domain-agnostic information extraction algorithms.
- Work efficiently with large text corpora.
- Facilitate text characterization beyond sentiment analysis, word counting, and readability scoring.
Abstract
Keywords
Current code version | V 0.2.0 |
Permanent link to code/repository used for this code version | https://github.com/SoftwareImpacts/SIMPAC-2022-293 |
Permanent link to reproducible capsule | https://codeocean.com/capsule/3686195/tree/v1 |
Legal code license | BSD 3-Clause License |
Code versioning system used | Git |
Software code languages, tools and services used | Python |
Compilation requirements, operating environments and dependencies | tqdm(4.59.0), spacy(3.3.0), pandas(1.2.4), nltk(3.6.1) |
If available, link to developer documentation/manual | https://github.com/jinhangjiang/morethansentiments/blob/main/README.md |
Support email for questions | [email protected] |
1. Motivation and significance
2. Software description
2.1 Software architecture
- Read raw .txt files into a pandas DataFrame
- Clean and preprocess the text corpus
- Calculate Boilerplate, Redundancy, Specificity, and Relative Prevalence
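The first step of this pipeline can be sketched with the standard library alone. The helper name `read_txt_folder` and the record keys `file` and `text` are assumptions made for illustration; they are not the package's own API. The sketch only shows the general idea of turning a folder of .txt files into one row per document.

```python
from pathlib import Path


def read_txt_folder(folder):
    """Return one record per .txt file in `folder` (hypothetical helper).

    Each record holds the file name and the raw text. Passing the list to
    pandas.DataFrame(records) would give one row per document.
    """
    records = []
    # Sort the paths so that the row order is reproducible across runs.
    for path in sorted(Path(folder).glob("*.txt")):
        records.append({
            "file": path.name,
            "text": path.read_text(encoding="utf-8"),
        })
    return records
```

Keeping the reader separate from the cleaning and scoring steps means each stage can be checked on its own before the measures are computed.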
2.2 Software functionalities
3. Illustrative examples
- lower: convert all words to lowercase
- punctuations: remove all punctuation from the corpus
- number: remove all digits from the corpus
- unicode: remove Unicode characters from the corpus
- stop_words: remove stopwords from the corpus
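The five switches above can be illustrated with a minimal, standard-library sketch. The function name `clean_text`, the tiny `STOP_WORDS` set, and the reading of "unicode" as "drop non-ASCII characters" are all assumptions for illustration; the package itself may implement each step differently (for example, with a full NLTK stopword list).

```python
import re
import string

# Tiny illustrative stopword list (an assumption, not the package's list).
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}


def clean_text(text, lower=False, punctuations=False, number=False,
               unicode=False, stop_words=False):
    """Apply the selected cleaning steps to one string (hypothetical helper)."""
    if lower:
        text = text.lower()
    if unicode:
        # Assumed interpretation: drop every non-ASCII character.
        text = text.encode("ascii", "ignore").decode("ascii")
    if punctuations:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if number:
        text = re.sub(r"\d+", "", text)
    if stop_words:
        text = " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
    # Collapse any whitespace left behind by the removals.
    return " ".join(text.split())
```

Each switch defaults to off in this sketch, so a caller opts in to exactly the steps a given measure needs.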
- input_data: the documents to analyze; this function requires tokenized documents.
- n: the length of the n-grams to use (number of tokens per n-gram). The default is 4.
- min_doc: when building the n-gram list, ignore n-grams whose document frequency is strictly lower than the given threshold. The default is 5 documents; 30% of the number of documents is recommended. A decimal value (e.g., 0.3) is read as a percentage of the documents.
- get_ngram: if this parameter is set to “True”, the function returns a dataframe of all n-grams with their corresponding frequencies, and the “min_doc” parameter has no effect.
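How these parameters interact can be sketched as follows. This is a simplified illustration, not the package's implementation: the function names are hypothetical, each document is assumed to be a list of sentences that are themselves lists of word tokens, and the score (the share of a document's words that sit in sentences containing at least one frequent n-gram) is an assumed scoring rule that the parameter list above does not spell out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_doc_frequency(docs, n=4):
    """Count, for every n-gram, the number of documents it appears in."""
    doc_freq = Counter()
    for doc in docs:
        seen = set()
        for sentence in doc:
            seen.update(ngrams(sentence, n))
        doc_freq.update(seen)  # each document counts an n-gram once
    return doc_freq


def boilerplate_share(docs, n=4, min_doc=5, get_ngram=False):
    """Illustrative boilerplate score per document (assumed definition)."""
    doc_freq = ngram_doc_frequency(docs, n)
    if get_ngram:
        # Return the full frequency table; min_doc is not applied.
        return doc_freq
    # A decimal min_doc is read as a fraction of the number of documents.
    threshold = min_doc * len(docs) if 0 < min_doc < 1 else min_doc
    common = {g for g, count in doc_freq.items() if count >= threshold}
    scores = []
    for doc in docs:
        total = sum(len(s) for s in doc)
        flagged = sum(len(s) for s in doc
                      if any(g in common for g in ngrams(s, n)))
        scores.append(flagged / total if total else 0.0)
    return scores
```

Inspecting the table returned with `get_ngram=True` before choosing `min_doc` is one way to see how many n-grams a given threshold would keep.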
4. Impact
5. Conclusions
Declaration of Competing Interest
References
- TextBlob documentation release 0.16.0. TextBlob. 2020 (https://textblob.readthedocs.io/en/dev/)
- A Parsimonious Rule-based Model for, Eighth International AAAI Conference on Weblogs and Social Media. 2014: 18 (https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/viewPaper/8109)
- BERT: Pre-training of deep bidirectional transformers for language understanding. 2018 (http://arxiv.org/abs/1810.04805)
- Text summarization using unsupervised deep learning. Exp. Syst. Appl. 2017; 68: 93-105 https://doi.org/10.1016/j.eswa.2016.10.017
- MoreThanSentiments. 2022 https://doi.org/10.5281/zenodo.6853351
- Textual analysis and international financial reporting: Large sample evidence. J. Account. Econ. 2015; 60: 110-135 https://doi.org/10.1016/j.jacceco.2015.09.002
- 10-K disclosure repetition and managerial reporting incentives. J. Financial Rep. 2017; 2: 107-131 https://doi.org/10.2308/jfir-51912
- The benefits of specific risk-factor disclosures. Rev. Account. Stud. 2016; 21: 1005-1045 https://doi.org/10.1007/s11142-016-9371-1
- The impact of information processing costs on firm disclosure choice: Evidence from the XBRL mandate. J. Account. Res. 2019; 57: 919-967
- Practical solutions to the problem of diagonal dominance in kernel document clustering. ACM Int. Conf. Proc. Ser. 2006; 148: 377-384 https://doi.org/10.1145/1143844.1143892
- S. Pu, K. Srinivasan, Are Project Narrative Attributes Indicative of Pre-order Campaign Success on Crowdfunding Platforms? A Text-Mining Approach, in: MWAIS 2022 Proceedings, 2022.
- Beyond the numbers: Measuring the information content of earnings press release language. Contemp. Account. Res. 2012; 29: 845-868 https://doi.org/10.1111/j.1911-3846.2011.01130
- The information content of forward-looking statements in corporate filings: A naïve Bayesian machine learning approach. J. Account. Res. 2010; 48: 1049-1102 https://doi.org/10.1111/j.1475-679X.2010.00382
- Large-sample evidence on firms’ year-over-year MD&A modifications. J. Account. Res. 2011; 49: 309-346 https://doi.org/10.1111/j.1475-679X.2010.00396
Footnotes
The code (and data) in this article has been certified as Reproducible by Code Ocean: (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
User license | Creative Commons Attribution (CC BY 4.0) |