Cyber-enabled Competitive Data Theft: A Framework for Modeling Long-Run Cybersecurity Consequences

Allan A. Friedman (former Brookings expert; Director of Cybersecurity Initiatives, National Telecommunications and Information Administration, U.S. Department of Commerce), Austen Mack-Crane (Senior Research Assistant, Center on Social Dynamics and Policy), and Ross A. Hammond (Director, Center on Social Dynamics and Policy; Senior Fellow, Economic Studies)

December 6, 2013

Cybersecurity has become a pressing policy issue, and has drawn the attention of the national security community. Yet there is an emerging consensus among experts that one of the largest policy problems faced in cyberspace may be not a question of military threats in a new domain, but the massive exfiltration of competitive information from American companies. Economic espionage has existed at least since the industrial revolution, but the scope of modern cyber-enabled competitive data theft may be unprecedented.

Much of the conversation surrounding the impact of cyber-enabled data theft has focused on how much theft is occurring today and how much this theft costs our economy today. Since data on the former (the level of theft) are extremely limited and almost certainly incomplete, efforts to estimate the latter (the present cost of theft) have suffered from both limited data and limited analytical approaches, leading to widely varying estimates. The focus in this paper is instead on the long-term consequences of cybertheft for the innovative sectors of activity that are at the core of US economic success. Friedman, Mack-Crane, and Hammond conceive of the problem as one of diminished growth, rather than purloined assets. They explore the long-run implications of a world with no more (or with selectively fewer) digital secrets, examining which sectors or industries will be hurt the most or remain resilient, and which policies or technologies might be priorities for limiting economic harm in the future.

The authors begin by developing a framework to unpack the concept of “cyber-enabled competitive data theft” (CCDT), which comprises many different dynamic pathways. The type of data stolen is important: even files typically seen as mundane, such as email archives, could be of great value to an attacker. The right emails can reveal a bidding strategy for a billion-dollar deal, for example. They also consider how different protection “regimes” (investments in particular forms of cybersecurity) map onto what types of information are or are not effectively protected. They detail the types of data that any firm might use to create value and that are also of interest to attackers. These classes of information can be mapped to industries and sectors based on how attackers use strategic information. They then explicitly catalogue how firms suffer direct, first-order harms from data theft. In the model, they instantiate industry-specific patterns of information use and the related harms from theft, drawn from extensive case studies, interviews, and the published literature. They then model the expected long-run shifts in the distribution of production and investment in innovative activity resulting from any particular pattern of harms.

With this paper, Friedman, Mack-Crane, and Hammond present what they believe is the first economic framework and model to understand the long-run impact of competitive data theft on an economy by taking into account the actual mechanisms and pathways by which theft harms the victims. The initial results suggest five important conclusions:

  • The three dimensions along which the framework differentiates CCDT can all be important to model outcomes. In some cases, sector matters, in others the type of data stolen matters, and in others protection regime matters.
  • By viewing stolen data from a business process perspective, rather than as a lost asset, the authors were able to understand the problem over a longer time frame. This not only avoids the challenges of short-term analysis and provides the context of equilibria, but is also more extensible in a policy analysis.
  • These simulations demonstrate that different interventions will have different effects. Not only is there no ‘silver bullet,’ but some sectors will benefit from solutions that may offer no help to others.
  • The framework introduces a new way of thinking about cybersecurity that does not easily map onto existing theoretical structures or evidence. The modeling process revealed the need for further theoretical work to properly integrate the diversity of impacts the framework identifies into a model of growth.
  • This basic model is not only extensible, but can help us understand a range of critical cybersecurity policy problems. A particularly promising extension of the model would be to divide each sector into two groups: defenders and self-insurers. The defending firms spend some of their fixed capital in a one-time investment, but are less vulnerable to attacks. The remainder of the sector chooses to use their capital for growth, as before.
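
The defender versus self-insurer extension described in the last bullet can be illustrated with a toy simulation. This is a minimal sketch and not the authors' model: the growth rate, attack probabilities, loss fraction, defence cost and all function names are hypothetical values and names chosen only to show the trade-off between a one-time security investment and unprotected growth.

```python
import random


def simulate_firm(capital, growth, attack_prob, loss_frac, periods, rng):
    """Grow capital each period; with probability attack_prob, lose a fraction to theft."""
    for _ in range(periods):
        capital *= 1.0 + growth
        if rng.random() < attack_prob:
            capital *= 1.0 - loss_frac
    return capital


def compare_strategies(initial=100.0, defence_cost=10.0, growth=0.05,
                       p_defended=0.05, p_undefended=0.30, loss_frac=0.20,
                       periods=20, n_firms=2000, seed=0):
    """Return mean final capital for (defenders, self-insurers).

    All parameter defaults are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    # Defenders pay a one-time cost up front but face a lower attack probability.
    defenders = [
        simulate_firm(initial - defence_cost, growth, p_defended, loss_frac, periods, rng)
        for _ in range(n_firms)
    ]
    # Self-insurers keep all their capital for growth but are attacked more often.
    self_insurers = [
        simulate_firm(initial, growth, p_undefended, loss_frac, periods, rng)
        for _ in range(n_firms)
    ]
    return sum(defenders) / n_firms, sum(self_insurers) / n_firms


if __name__ == "__main__":
    mean_defended, mean_self_insured = compare_strategies()
    print(f"defenders: {mean_defended:.1f}, self-insurers: {mean_self_insured:.1f}")
```

Under these assumed parameters the defended group ends with more capital on average, while setting both attack probabilities to zero reverses the ranking, since the defence cost then buys nothing.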

Cybercrime and Intellectual Property Theft: An Analysis of Modern Digital Forensics

  • Conference paper
  • First Online: 13 October 2022

  • Andrew K. Blaskovic,
  • John-David Rusk,
  • Victor C. Parker Jr. &
  • Bryson R. Payne

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 560))

Included in the following conference series:

  • Proceedings of the Future Technologies Conference

Technology use has become ubiquitous, and with thousands of cybercrimes happening each day, there is a high demand for those who know how to identify these crimes and investigate them safely and appropriately. Because new and revolutionary ideas and technologies emerge almost daily, the risk of intellectual property theft aimed at acquiring trade secrets and proprietary products grows as well. Theft of intellectual property has also become a growing threat to our national security. By refusing to recognize these actions as criminal, other countries shield many of the individuals responsible for these crimes. This research aims to examine how countries, companies, and agencies deal with intellectual property theft and to provide a breakdown of how intellectual property theft affects today’s industries and the damage it can cause. When intellectual property thefts do occur, it is important to understand the necessary steps to handle the situation and how professionals go about collecting evidence from these crimes.

Author information

Authors and affiliations.

University of North Georgia, Dahlonega, GA, USA

Andrew K. Blaskovic, John-David Rusk, Victor C. Parker Jr. & Bryson R. Payne

Corresponding author

Correspondence to Bryson R. Payne .

Editor information

Editors and affiliations.

Faculty of Science and Engineering, Saga University, Saga, Japan

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper.

Blaskovic, A.K., Rusk, JD., Parker, V.C., Payne, B.R. (2023). Cybercrime and Intellectual Property Theft: An Analysis of Modern Digital Forensics. In: Arai, K. (eds) Proceedings of the Future Technologies Conference (FTC) 2022, Volume 2. FTC 2022 2022. Lecture Notes in Networks and Systems, vol 560. Springer, Cham. https://doi.org/10.1007/978-3-031-18458-1_36

DOI : https://doi.org/10.1007/978-3-031-18458-1_36

Published : 13 October 2022

Publisher Name : Springer, Cham

Print ISBN : 978-3-031-18457-4

Online ISBN : 978-3-031-18458-1

eBook Packages : Intelligent Technologies and Robotics (R0)

An official website of the United States government, Department of Justice.

Identity Theft - A Research Review

Based upon "Identity Theft Literature Review" (Graeme R. Newman and Megan M. McNally, July 2005), this online publication assesses what is known about identity theft and recommends areas that need further research.

The research found that identity theft generally involves three stages: acquisition of the identity information, the thief's use of the information for personal gain to the detriment of the victim of identity theft, and discovery of the identity theft. Evidence indicates that the longer it takes to discover the theft, the greater the loss incurred and the less likely it is that prosecution will be successful. Older persons and those with less education are less likely to discover the identity theft quickly and to report it after discovery. The research also found that access to personal information about potential victims and the anonymity the Internet offers would-be thieves are major facilitators of identity theft. Major topics on identity theft reviewed in this report are the definition of identity theft, the extent and patterns of identity theft, types of identity theft, recording and reporting identity theft, law enforcement issues and response, the cost of identity theft, and issues that need more research. Regarding the latter topic, the researchers recommend more research on the best ways to prevent identity theft crimes. Specifically, research should address practices and operating environments of document-issuing agencies that allow offenders to exploit opportunities to obtain identity documents. Research should also focus on practices and operating environments of document-authenticating agencies that allow offenders access to identity data. Also, the structure and operations of the information systems involved with the operational procedures of the aforementioned agents should be researched. The report reviewed more than 160 literature sources that ranged from traditional journal articles to Web sites and presentations.

Cyber risk and cybersecurity: a systematic review of data availability

Frank Cremer, Barry Sheehan, Michael Fortmann, Arash N. Kia, Martin Mullins, Finbarr Murphy, Stefan Materne

1 University of Limerick, Limerick, Ireland

2 TH Köln University of Applied Sciences, Cologne, Germany

Cybercrime is estimated to have cost the global economy just under USD 1 trillion in 2020, indicating an increase of more than 50% since 2018. With the average cyber insurance claim rising from USD 145,000 in 2019 to USD 359,000 in 2020, there is a growing necessity for better cyber information sources, standardised databases, mandatory reporting and public awareness. This research analyses the extant academic and industry literature on cybersecurity and cyber risk management with a particular focus on data availability. From a preliminary search resulting in 5219 peer-reviewed cyber studies, the application of the systematic methodology resulted in 79 unique datasets. We posit that the lack of available data on cyber risk poses a serious problem for stakeholders seeking to tackle this issue. In particular, we identify a lacuna in open databases that undermines collective endeavours to better manage this set of risks. The resulting data evaluation and categorisation will support cybersecurity researchers and the insurance industry in their efforts to comprehend, metricise and manage cyber risks.
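
The growth figures quoted in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: the helper name is hypothetical, and the 2018 ceiling is an inference from the stated "more than 50%" increase applied to the USD 945 billion estimate for 2020 given in the Introduction, not a number reported in the text.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


# Average cyber insurance claim, 2019 -> 2020 (USD), as quoted in the abstract.
claim_growth = pct_change(145_000, 359_000)
print(f"Average claim growth: {claim_growth:.1f}%")  # 147.6%

# If USD 945 billion in 2020 is more than 50% above the 2018 cost,
# the 2018 cost must lie below 945 / 1.5 (in USD billions).
implied_2018_ceiling = 945 / 1.5
print(f"Implied 2018 ceiling: USD {implied_2018_ceiling:.0f} billion")
```

In other words, the average claim rose to roughly two and a half times its 2019 level in a single year.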

Supplementary Information

The online version contains supplementary material available at 10.1057/s41288-022-00266-6.

Introduction

Globalisation, digitalisation and smart technologies have escalated the propensity and severity of cybercrime. Although cybersecurity is an emerging field of research and industry, the importance of robust cybersecurity defence systems has been highlighted at the corporate, national and supranational levels. The impacts of inadequate cybersecurity are estimated to have cost the global economy USD 945 billion in 2020 (Maleks Smith et al. 2020). Cyber vulnerabilities pose significant corporate risks, including business interruption, breach of privacy and financial losses (Sheehan et al. 2019). Despite the increasing relevance for the international economy, the availability of data on cyber risks remains limited. The reasons for this are many. Firstly, it is an emerging and evolving risk; therefore, historical data sources are limited (Biener et al. 2015). Secondly, institutions that have been hacked generally do not publish the incidents (Eling and Schnell 2016). The lack of data poses challenges for many areas, such as research, risk management and cybersecurity (Falco et al. 2019). The importance of this topic is demonstrated by the announcement of the European Council in April 2021 that a centre of excellence for cybersecurity will be established to pool investments in research, technology and industrial development. The goal of this centre is to increase the security of the internet and other critical network and information systems (European Council 2021).

This research takes a risk management perspective, focusing on cyber risk and considering the role of cybersecurity and cyber insurance in risk mitigation and risk transfer. The study reviews the existing literature and open data sources related to cybersecurity and cyber risk. This is the first systematic review of data availability in the general context of cyber risk and cybersecurity. By identifying and critically analysing the available datasets, this paper supports the research community by aggregating, summarising and categorising all available open datasets. In addition, further information on datasets is attached to provide deeper insights and support stakeholders engaged in cyber risk control and cybersecurity. Finally, this research paper highlights the need for open access to cyber-specific data, without price or permission barriers.

The identified open data can support cyber insurers in their efforts on sustainable product development. To date, traditional risk assessment methods have been untenable for insurance companies due to the absence of historical claims data (Sheehan et al. 2021). These high levels of uncertainty mean that cyber insurers are more inclined to overprice cyber risk cover (Kshetri 2018). Combining external data with insurance portfolio data therefore seems to be essential to improve the evaluation of the risk and thus lead to risk-adjusted pricing (Bessy-Roland et al. 2021). This argument is also supported by the fact that some re/insurers reported that they are working to improve their cyber pricing models (e.g. by creating or purchasing databases from external providers) (EIOPA 2018). Figure 1 provides an overview of pricing tools and factors considered in the estimation of cyber insurance based on the findings of EIOPA (2018) and the research of Romanosky et al. (2019). The term cyber risk refers to all cyber risks and their potential impact.

Fig. 1 An overview of the current cyber insurance informational and methodological landscape, adapted from EIOPA (2018) and Romanosky et al. (2019)

Besides the advantage of risk-adjusted pricing, the availability of open datasets helps companies benchmark their internal cyber posture and cybersecurity measures. The research can also help to improve risk awareness and corporate behaviour. Many companies still underestimate their cyber risk (Leong and Chen 2020). For policymakers, this research offers starting points for a comprehensive recording of cyber risks. Although companies in many countries are obliged to report data breaches to the respective supervisory authority, this information is usually not accessible to the research community. Furthermore, the economic impact of these breaches is usually unclear.

As well as the cyber risk management community, this research also supports cybersecurity stakeholders. Researchers are provided with an up-to-date review of available datasets drawn from the peer-reviewed literature, showing where these datasets have been used. For example, this includes datasets that have been used to evaluate the effectiveness of countermeasures in simulated cyberattacks or to test intrusion detection systems. This reduces the time-consuming search for suitable datasets and ensures a comprehensive review of those available. Through the dataset descriptions, researchers and industry stakeholders can compare and select the most suitable datasets for their purposes. In addition, it is possible to combine the datasets from one source in the context of cybersecurity or cyber risk. This supports efficient and timely progress in cyber risk research and is beneficial given the dynamic nature of cyber risks.

Cyber risks are defined as “operational risks to information and technology assets that have consequences affecting the confidentiality, availability, and/or integrity of information or information systems” (Cebula et al. 2014). Prominent cyber risk events include data breaches and cyberattacks (Agrafiotis et al. 2018). The increasing exposure and potential impact of cyber risk have been highlighted in recent industry reports (e.g. Allianz 2021; World Economic Forum 2020). Cyberattacks on critical infrastructures are ranked 5th in the World Economic Forum's Global Risk Report. Ransomware, malware and distributed denial-of-service (DDoS) are examples of the evolving modes of cyberattack. One example is the ransomware attack on the Colonial Pipeline, which shut down a 5500-mile pipeline system and critical liquid fuel infrastructure that delivers 2.5 million barrels of fuel per day from oil refineries to states along the U.S. East Coast (Brower and McCormick 2021). These and other cyber incidents have led the U.S. to strengthen its cybersecurity and introduce, among other things, a public body to analyse major cyber incidents and make recommendations to prevent a recurrence (Murphey 2021a). Another example of the scope of cyberattacks is the ransomware NotPetya in 2017. The damage amounted to USD 10 billion, as the ransomware exploited a vulnerability in the Windows operating system, allowing it to spread independently across networks worldwide (GAO 2021). In the same year, the ransomware WannaCry was launched by cybercriminals. The cyberattack on Windows software held user data hostage and demanded ransom payments in the cryptocurrency Bitcoin (Smart 2018). The victims included the National Health Service in Great Britain. As a result, ambulances were redirected to other hospitals because information technology (IT) systems failed, leaving people in need of urgent assistance waiting. It has been estimated that the attack resulted in 19,000 cancelled treatment appointments and losses of GBP 92 million (Field 2018). Throughout the COVID-19 pandemic, ransomware attacks increased significantly, as working-from-home arrangements increased vulnerability (Murphey 2021b).

Besides cyberattacks, data breaches can also cause high costs. Under the General Data Protection Regulation (GDPR), companies are obliged to protect personal data and safeguard the data protection rights of all individuals in the EU area. The GDPR allows data protection authorities in each country to impose sanctions and fines on organisations they find in breach. “For data breaches, the maximum fine can be €20 million or 4% of global turnover, whichever is higher” (GDPR.EU 2021). Data breaches often involve a large amount of sensitive data that has been accessed without authorisation by external parties, and are therefore considered important for information security due to their far-reaching impact (Goode et al. 2017). A data breach is defined as a “security incident in which sensitive, protected, or confidential data are copied, transmitted, viewed, stolen, or used by an unauthorized individual” (Freeha et al. 2021). Depending on the amount of data, the extent of the damage caused by a data breach can be significant, with the average cost of a breach of more than 50 million records being USD 392 million (IBM Security 2020).
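
The fine ceiling quoted above ("€20 million or 4% of global turnover, whichever is higher") amounts to taking the larger of two numbers. The following is a minimal sketch of that rule only; the function name is hypothetical, and actual fines are set case by case by the data protection authorities and may be far below this ceiling.

```python
def gdpr_max_fine_eur(global_turnover_eur: float) -> float:
    """Upper bound on a GDPR data breach fine: the higher of EUR 20 million
    and 4% of global turnover, as quoted from GDPR.EU (2021)."""
    fixed_cap = 20_000_000.0
    turnover_cap = 0.04 * global_turnover_eur
    return max(fixed_cap, turnover_cap)


if __name__ == "__main__":
    for turnover in (100e6, 500e6, 1e9, 50e9):
        print(f"turnover EUR {turnover:,.0f} -> ceiling EUR {gdpr_max_fine_eur(turnover):,.0f}")
```

The two caps coincide at a global turnover of EUR 500 million; below that the fixed EUR 20 million cap applies, and above it the turnover-based cap takes over.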

This research paper reviews the existing literature and open data sources related to cybersecurity and cyber risk, focusing on the datasets used to improve academic understanding and advance the current state-of-the-art in cybersecurity. Furthermore, important information about the available datasets is presented (e.g. use cases), and a plea is made for open data and the standardisation of cyber risk data for academic comparability and replication. The remainder of the paper is structured as follows. The next section describes the related work regarding cybersecurity and cyber risks. The third section outlines the review method used in this work and the process. The fourth section details the results of the identified literature. Further discussion is presented in the penultimate section and the final section concludes.

Related work

Due to the significance of cyber risks, several literature reviews have been conducted in this field. Eling (2020) reviewed the existing academic literature on the topic of cyber risk and cyber insurance from an economic perspective. A total of 217 papers with the term ‘cyber risk’ were identified and classified in different categories. As a result, open research questions are identified, showing that research on cyber risks is still in its infancy because of their dynamic and emerging nature. Furthermore, the author highlights that particular focus should be placed on the exchange of information between public and private actors. An improved information flow could help to measure the risk more accurately and thus make cyber risks more insurable and help risk managers to determine the right level of cyber risk for their company. In the context of cyber insurance data, Romanosky et al. (2019) analysed the underwriting process for cyber insurance and revealed how cyber insurers understand and assess cyber risks. For this research, they examined 235 American cyber insurance policies that were publicly available and looked at three components (coverage, application questionnaires and pricing). The authors state in their findings that many of the insurers used very simple, flat-rate pricing (based on a single calculation of expected loss), while others used more parameters in the calculation, such as the asset value of the company (or company revenue), standard insurance metrics (e.g. deductible, limits) and the industry. This is in keeping with Eling (2020), who states that an increased amount of data could help to measure cyber risk more accurately and thus make it more insurable. Similar research on cyber insurance and data was conducted by Nurse et al. (2020). The authors examined cyber insurance practitioners' perceptions and the challenges they face in collecting and using data. In addition, gaps were identified during the research where further data is needed. The authors concluded that cyber insurance is still in its infancy, and there are still several unanswered questions (for example, cyber valuation, risk calculation and recovery). They also pointed out that a better understanding of data collection and use in cyber insurance would be invaluable for future research and practice. Bessy-Roland et al. (2021) come to a similar conclusion. They proposed a multivariate Hawkes framework to model and predict the frequency of cyberattacks. They used a public dataset with characteristics of data breaches affecting the U.S. industry. In the conclusion, the authors argue that an insurer has better knowledge of cyber losses, but that this knowledge is based on a small dataset, so combining it with external data sources seems essential to improve the assessment of cyber risks.
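
To give a sense of what a multivariate Hawkes framework involves, the sketch below evaluates the conditional intensity of a self- and mutually-exciting point process, in which each past event temporarily raises the rate of future events. It assumes the common exponential decay kernel with a single decay rate; the kernel and parameterisation used by Bessy-Roland et al. (2021) may differ, and all names and parameter values here are hypothetical.

```python
import math


def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity of each component of a multivariate Hawkes process.

    t      : time at which to evaluate the intensity
    events : events[j] is the list of past event times of component j
    mu     : mu[i] is the baseline rate of component i
    alpha  : alpha[i][j] is the jump in component i's intensity caused by
             an event in component j
    beta   : exponential decay rate of the excitation (assumed shared)
    """
    dims = len(mu)
    intensities = []
    for i in range(dims):
        # Sum the decayed contribution of every event strictly before t.
        excitation = sum(
            alpha[i][j] * math.exp(-beta * (t - s))
            for j in range(dims)
            for s in events[j]
            if s < t
        )
        intensities.append(mu[i] + excitation)
    return intensities


if __name__ == "__main__":
    # Two hypothetical attack types; one type-0 event occurred at time 0.
    lam = hawkes_intensity(
        t=1.0,
        events=[[0.0], []],
        mu=[0.1, 0.2],
        alpha=[[0.5, 0.0], [0.2, 0.0]],
        beta=1.0,
    )
    print(lam)
```

With no past events the intensity reduces to the baseline rates, which is what distinguishes a Hawkes process from a plain Poisson model of attack frequency.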

Several systematic reviews have been published in the area of cybersecurity (Kruse et al. 2017; Lee et al. 2020; Loukas et al. 2013; Ulven and Wangen 2021). In these papers, the authors concentrated on a specific area or sector in the context of cybersecurity. This paper adds to this extant literature by focusing on data availability and its importance to risk management and insurance stakeholders. With a priority on healthcare and cybersecurity, Kruse et al. (2017) conducted a systematic literature review. The authors identified 472 articles with the keywords ‘cybersecurity and healthcare’ or ‘ransomware’ in the databases Cumulative Index of Nursing and Allied Health Literature, PubMed and Proquest. Articles were eligible for this review if they satisfied three criteria: (1) they were published between 2006 and 2016, (2) the full-text version of the article was available, and (3) they were published in a peer-reviewed or scholarly journal. The authors found that technological development and federal policies (in the U.S.) are the main factors exposing the health sector to cyber risks. Loukas et al. (2013) conducted a review with a focus on cyber risks and cybersecurity in emergency management. The authors provided an overview of cyber risks in communication, sensor, information management and vehicle technologies used in emergency management and showed areas for which there is still no solution in the literature. Similarly, Ulven and Wangen (2021) reviewed the literature on cybersecurity risks in higher education institutions. For the literature review, the authors used the keywords ‘cyber’, ‘information threats’ or ‘vulnerability’ in connection with the terms ‘higher education’, ‘university’ or ‘academia’. A similar literature review with a focus on Internet of Things (IoT) cybersecurity was conducted by Lee et al. (2020). The review revealed that qualitative approaches focus on high-level frameworks, and quantitative approaches to cybersecurity risk management focus on risk assessment and quantification of cyberattacks and impacts. In addition, the findings presented a four-step IoT cyber risk management framework that identifies, quantifies and prioritises cyber risks.

Datasets are an essential part of cybersecurity research, as underlined by the following works. Ilhan Firat et al. (2021) examined various cybersecurity datasets in detail. The study was motivated by the fact that with the proliferation of the internet and smart technologies, the mode of cyberattacks is also evolving. However, in order to prevent such attacks, they must first be detected; the dissemination and further development of cybersecurity datasets is therefore critical. In their work, the authors reviewed studies of datasets used in intrusion detection systems. Khraisat et al. (2019) also identified a need for new datasets in the context of cybersecurity. The researchers presented a taxonomy of current intrusion detection systems, a comprehensive review of notable recent work, and an overview of the datasets commonly used for assessment purposes. In their conclusion, the authors noted that new datasets are needed because most machine-learning techniques are trained and evaluated on old datasets. These datasets do not contain new and comprehensive information and are partly derived from datasets from 1999. The authors noted that the core of this issue is the availability of new public datasets as well as their quality. The availability of data and how it is used, created and shared were also investigated by Zheng et al. (2018). The researchers analysed 965 cybersecurity research papers published between 2012 and 2016. They created a taxonomy of the types of data that are created and shared and then analysed the data collected via datasets. The researchers concluded that while datasets are recognised as valuable for cybersecurity research, the proportion of publicly available datasets is limited.

The main contributions of this review and what differentiates it from previous studies can be summarised as follows. First, as far as we can tell, it is the first work to summarise all available datasets on cyber risk and cybersecurity in the context of a systematic review and present them to the scientific community and cyber insurance and cybersecurity stakeholders. Second, we investigated, analysed, and made available the datasets to support efficient and timely progress in cyber risk research. And third, we enable comparability of datasets so that the appropriate dataset can be selected depending on the research area.

Methodology

Process and eligibility criteria.

The structure of this systematic review is inspired by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework (Page et al. 2021 ), and the search was conducted from 3 to 10 May 2021. Due to the continuous development of cyber risks and their countermeasures, only articles published in the last 10 years were considered. In addition, only articles published in peer-reviewed journals written in English were included. As a final criterion, only articles that make use of one or more cybersecurity or cyber risk datasets met the inclusion criteria. Specifically, these studies presented new or existing datasets, used them for methods, or used them to verify new results, as well as analysed them in an economic context and pointed out their effects. The criterion was fulfilled if it was clearly stated in the abstract that one or more datasets were used. A detailed explanation of this selection criterion can be found in the ‘Study selection’ section.

Information sources

In order to cover a complete spectrum of literature, various databases were queried to collect relevant literature on the topic of cybersecurity and cyber risks. Due to the spread of related articles across multiple databases, the literature search was limited to the following four databases for simplicity: IEEE Xplore, Scopus, SpringerLink and Web of Science. This is similar to other literature reviews addressing cyber risks or cybersecurity, including Sardi et al. ( 2021 ), Franke and Brynielsson ( 2014 ), Lagerström (2019), Eling and Schnell ( 2016 ) and Eling ( 2020 ). In this paper, all databases used in the aforementioned works were considered. However, only two studies also used all the databases listed. The IEEE Xplore database contains electrical engineering, computer science, and electronics work from over 200 journals and three million conference papers (IEEE 2021 ). Scopus includes 23,400 peer-reviewed journals from more than 5000 international publishers in the areas of science, engineering, medicine, social sciences and humanities (Scopus 2021 ). SpringerLink contains 3742 journals and indexes over 10 million scientific documents (SpringerLink 2021 ). Finally, Web of Science indexes over 9200 journals in different scientific disciplines (Science 2021 ).

A search string was created and applied to all databases. To make the search efficient and reproducible, the following search string with Boolean operators was used in all databases: cybersecurity OR cyber risk AND dataset OR database. To ensure uniformity of the search across all databases, some adjustments had to be made for the respective search engines. In Scopus, for example, the Advanced Search was used, and the field code ‘TITLE-ABS-KEY’ was integrated into the search string. For IEEE Xplore, the search string was entered in the Command Search with ‘All Metadata’ selected. In the Web of Science database, the Advanced Search was used. This search had to be carried out in individual steps: the first search used the terms cybersecurity OR cyber risk with the field tag Topic (TS=), and the second used dataset OR database. These two searches were then combined to deliver the articles for review. For SpringerLink, the search string was used in the Advanced Search under the category ‘Find the resources with all of the words’. This search returned 5219 studies. After the eligibility criteria (period, language and only scientific journals) were applied, 1581 studies remained across the databases:

  • IEEE Xplore: 364
  • Scopus: 135
  • SpringerLink: 548
  • Web of Science: 534
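The search string as printed leaves operator precedence open, and many search engines bind AND more tightly than OR. The following minimal Python sketch shows the presumably intended grouping, (cybersecurity OR "cyber risk") AND (dataset OR database), applied to abstracts as a plain substring filter. The function name and the sample abstracts are illustrative assumptions; real database engines apply their own field, stemming and phrase rules.

```python
# Hypothetical helper: mimics the presumably intended grouping of the search
# string, (cybersecurity OR "cyber risk") AND (dataset OR database).
TOPIC_TERMS = ("cybersecurity", "cyber risk")
DATA_TERMS = ("dataset", "database")


def matches_search_string(text: str) -> bool:
    """Return True if text has at least one topic term AND one data term."""
    lowered = text.lower()
    has_topic = any(term in lowered for term in TOPIC_TERMS)
    has_data = any(term in lowered for term in DATA_TERMS)
    return has_topic and has_data


# Invented sample abstracts, for illustration only.
abstracts = [
    "We release a new dataset for cybersecurity research.",
    "A database of protein structures.",
    "Cyber risk is discussed from a legal perspective.",
]
selected = [a for a in abstracts if matches_search_string(a)]
print(selected)
```

Only the first sample abstract passes, because it is the only one that contains both a topic term and a data term.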

An overview of the process is given in Fig. 2. After the results from the four databases were combined and duplicates removed, 854 articles remained.
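The text does not state how duplicates were matched. One common approach, sketched here as an assumption rather than as the authors' procedure, is to key each record on its DOI when one is present and on a normalised title otherwise. The record fields and sample entries are hypothetical.

```python
import re


def record_key(record: dict) -> str:
    """Build a matching key: DOI if available, else a normalised title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    # Lower-case the title and collapse anything that is not a letter or digit.
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return "title:" + title


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Invented records standing in for exports from the four databases.
exports = [
    {"source": "Scopus", "doi": "10.1000/abc", "title": "Cyber Risk Data"},
    {"source": "Web of Science", "doi": "10.1000/ABC", "title": "Cyber risk data."},
    {"source": "SpringerLink", "doi": "", "title": "IoT Intrusion Datasets"},
    {"source": "IEEE Xplore", "doi": None, "title": "IoT intrusion datasets"},
]
unique = deduplicate(exports)
print(len(unique))
```

A limitation of this simple keying is that a record exported with a DOI will not match a copy of the same article exported without one, so a manual check of near-duplicates remains advisable.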

Fig. 2 Literature search process and categorisation of the studies

Study selection

In the final step of the selection process, the articles were screened for relevance. Due to a large number of results, the abstracts were analysed in the first step of the process. The aim was to determine whether the article was relevant for the systematic review. An article fulfilled the criterion if it was recognisable in the abstract that it had made a contribution to datasets or databases with regard to cyber risks or cybersecurity. Specifically, the criterion was considered to be met if the abstract used datasets that address the causes or impacts of cyber risks, and measures in the area of cybersecurity. In this process, the number of articles was reduced to 288. The articles were then read in their entirety, and an expert panel of six people decided whether they should be used. This led to a final number of 255 articles. The years in which the articles were published and the exact number can be seen in Fig.  3 .
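The record counts reported for each stage of the selection process can be put side by side to show where most articles were excluded. The short Python sketch below uses only the counts given in the text; the stage labels are paraphrases chosen for this illustration.

```python
# Record counts as reported in the text, in screening order.
stages = [
    ("initial search hits", 5219),
    ("after eligibility criteria", 1581),
    ("after duplicate removal", 854),
    ("after abstract screening", 288),
    ("after full-text review", 255),
]


def attrition(stages):
    """Yield (label, count, dropped since previous stage, share of initial hits)."""
    first = stages[0][1]
    previous = first
    for label, count in stages:
        yield label, count, previous - count, count / first
        previous = count


for label, count, dropped, share in attrition(stages):
    print(f"{label:28s} {count:5d}  dropped {dropped:5d}  retained {share:6.1%}")
```

Under these counts, the final set of 255 articles corresponds to roughly 4.9% of the initial search hits.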

Fig. 3 Distribution of studies

Data collection process and synthesis of the results

For the data collection process, various data were extracted from the studies, including the names of the respective creators, the name of the dataset or database and the corresponding reference. It was also determined where the data came from. In the context of accessibility, it was determined whether access is free, controlled, available for purchase or not available. It was also determined when the datasets were created and the time period referenced. The application type and domain characteristics of the datasets were identified.
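The extracted attributes listed above map naturally onto a small record type. The following Python sketch is one possible encoding, not the authors' actual extraction sheet; the class and field names are assumptions, and the example values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Access(Enum):
    """Accessibility levels named in the text."""
    FREE = "free"
    CONTROLLED = "controlled"
    PURCHASE = "available for purchase"
    NOT_AVAILABLE = "not available"


@dataclass
class DatasetRecord:
    """One row of a hypothetical extraction sheet; field names are illustrative."""
    name: str
    creators: list[str]
    reference: str
    data_source: Optional[str] = None       # where the data came from
    access: Optional[Access] = None
    year_created: Optional[int] = None
    period_covered: Optional[str] = None    # time period referenced
    application_type: Optional[str] = None
    domain: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """Names of fields whose value could not be determined."""
        return [key for key, value in vars(self).items() if value is None]


# Placeholder values only; not taken from the supplementary tables.
example = DatasetRecord(
    name="Example dataset",
    creators=["Hypothetical Lab"],
    reference="https://example.org",
    access=Access.FREE,
)
print(example.missing_fields())
```

Keeping undetermined values as `None` makes it straightforward to report which attributes are still open for a given dataset.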

This section analyses the results of the systematic literature review. The previously identified studies are divided into three categories: datasets on the causes of cyber risks, datasets on the effects of cyber risks and datasets on cybersecurity. The classification is based on the intended use of the studies. This system of classification makes it easier for stakeholders to find the appropriate datasets. The categories are evaluated individually. Although complete information is available for a large proportion of datasets, this is not true for all of them. Accordingly, the abbreviation N/A has been inserted in the respective fields to indicate that this information could not be determined by the time of submission. The term ‘use cases in the literature’ in the following and supplementary tables refers to the application areas in which the corresponding datasets were used in the literature. The areas listed there refer to the topic area on which the researchers conducted their research. Since some datasets were used across disciplines, the listed use cases in the literature are correspondingly longer. Before discussing each category in the next sections, Fig. 4 provides an overview of the number of datasets found and their year of creation. Figure 5 then shows the relationship between studies and datasets in the period under consideration. Figure 6 shows the distribution of studies, their use of datasets and their creation date. The number of datasets used is higher than the number of studies because the studies often used several datasets (Table 1).

Fig. 4 Distribution of dataset results

Fig. 5 Correlation between the studies and the datasets

Fig. 6 Distribution of studies and their use of datasets

Table 1 Percentage contribution of datasets for each place of origin

Most of the datasets are generated in the U.S. (up to 58.2%). Canada and Australia rank next, with 11.3% and 5% of all the reviewed datasets, respectively.

Additionally, to create value for the datasets for the cyber insurance industry, an assessment of the applicability of each dataset has been provided for cyber insurers. This ‘Use Case Assessment’ includes the use of the data in the context of different analyses, calculation of cyber insurance premiums, and use of the information for the design of cyber insurance contracts or for additional customer services. To reasonably account for the transition of direct hyperlinks in the future, references were directed to the main websites for longevity (nearest resource point). In addition, the links to the main pages contain further information on the datasets and different versions related to the operating systems. The references were chosen in such a way that practitioners get the best overview of the respective datasets.

Causes datasets

This section presents selected articles that use the datasets to analyse the causes of cyber risks. The datasets help identify emerging trends and allow pattern discovery in cyber risks. This information gives cybersecurity experts and cyber insurers the data to make better predictions and take appropriate action. For example, if certain vulnerabilities are not adequately protected, cyber insurers will demand a risk surcharge leading to an improvement in the risk-adjusted premium. Due to the capricious nature of cyber risks, existing data must be supplemented with new data sources (for example, new events, new methods or security vulnerabilities) to determine prevailing cyber exposure. The datasets of cyber risk causes could be combined with existing portfolio data from cyber insurers and integrated into existing pricing tools and factors to improve the valuation of cyber risks.

A portion of these datasets consists of several taxonomies and classifications of cyber risks. Aassal et al. (2020) proposed a new taxonomy of phishing characteristics based on the interpretation and purpose of each characteristic. In comparison, Hindy et al. (2020) presented a taxonomy of network threats and the impact of current datasets on intrusion detection systems. A similar taxonomy was suggested by Kiwia et al. (2018). The authors presented a cyber kill chain-based taxonomy of banking Trojan features. The taxonomy was built on a real-world dataset of 127 banking Trojans collected from December 2014 to January 2016 by a major U.K.-based financial organisation.

In the context of classification, Aamir et al. ( 2021 ) showed the benefits of machine learning for classifying port scans and DDoS attacks in a mixture of normal and attack traffic. Guo et al. ( 2020 ) presented a new method to improve malware classification based on entropy sequence features. The evaluation of this new method was conducted on different malware datasets.

To reconstruct attack scenarios and draw conclusions based on the evidence in the alert stream, Barzegar and Shajari (2018) used the DARPA2000 and MACCDC 2012 datasets for their research. Giudici and Raffinetti (2020) proposed a rank-based statistical model aimed at predicting the severity levels of cyber risk. The model used cyber risk data from the University of Milan. In contrast to the previous datasets, Skrjanc et al. (2018) used the older dataset KDD99 to monitor large-scale cyberattacks using a Cauchy clustering method.

Amin et al. ( 2021 ) used a cyberattack dataset from the Canadian Institute for Cybersecurity to identify spatial clusters of countries with high rates of cyberattacks. In the context of cybercrime, Junger et al. ( 2020 ) examined crime scripts, key characteristics of the target company and the relationship between criminal effort and financial benefit. For their study, the authors analysed 300 cases of fraudulent activities against Dutch companies. With a similar focus on cybercrime, Mireles et al. ( 2019 ) proposed a metric framework to measure the effectiveness of the dynamic evolution of cyberattacks and defensive measures. To validate its usefulness, they used the DEFCON dataset.

Due to the rapidly changing nature of cyber risks, it is often impossible to obtain all information on them. Kim and Kim ( 2019 ) proposed an automated dataset generation system called CTIMiner that collects threat data from publicly available security reports and malware repositories. They released a dataset to the public containing about 640,000 records from 612 security reports published between January 2008 and 2019. A similar approach is proposed by Kim et al. ( 2020 ), using a named entity recognition system to extract core information from cyber threat reports automatically. They created a 498,000-tag dataset during their research (Ulven and Wangen 2021 ).

Within the framework of vulnerabilities and cybersecurity issues, Ulven and Wangen (2021) proposed an overview of mission-critical assets and everyday threat events, suggested a generic threat model, and summarised common cybersecurity vulnerabilities. With a focus on hospitality, Chen and Fiscus (2018) identified several issues related to cybersecurity in this sector. They analysed 76 security incidents from the Privacy Rights Clearinghouse database. Supplementary Table 1 lists all findings that belong to the cyber causes dataset.

Impact datasets

This section outlines selected findings of the cyber impact dataset. For cyber insurers, these datasets can form an important basis for information, as they can be used to calculate cyber insurance premiums, evaluate specific cyber risks, formulate inclusions and exclusions in cyber wordings, and re-evaluate as well as supplement the data collected so far on cyber risks. For example, information on financial losses can help to better assess the loss potential of cyber risks. Furthermore, the datasets can provide insight into the frequency of occurrence of these cyber risks. The new datasets can be used to close any data gaps that were previously based on very approximate estimates or to find new results.

Eight studies addressed the costs of data breaches. For instance, Eling and Jung (2018) reviewed 3327 data breach events from 2005 to 2016 and identified an asymmetric dependence of monthly losses by breach type and industry. The authors used datasets from the Privacy Rights Clearinghouse for analysis. The Privacy Rights Clearinghouse datasets and the Breach Level Index database were also used by De Giovanni et al. (2020) to describe relationships between data breaches and bitcoin-related variables using the cointegration methodology. The data were obtained from the Department of Health and Human Services of healthcare facilities reporting data breaches and a national database of technical and organisational infrastructure information. Also in the context of data breaches, Algarni et al. (2021) developed a comprehensive, formal model that estimates the two components of security risks: breach cost and the likelihood of a data breach within 12 months. For their survey, the authors used two industrial reports from the Ponemon Institute and Verizon. To illustrate the scope of data breaches, Neto et al. (2021) identified 430 major data breach incidents among more than 10,000 incidents. The database created is available and covers the period 2018 to 2019.

With a direct focus on insurance, Biener et al. ( 2015 ) analysed 994 cyber loss cases from an operational risk database and investigated the insurability of cyber risks based on predefined criteria. For their study, they used data from the company SAS OpRisk Global Data. Similarly, Eling and Wirfs ( 2019 ) looked at a wide range of cyber risk events and actual cost data using the same database. They identified cyber losses and analysed them using methods from statistics and actuarial science. Using a similar reference, Farkas et al. ( 2021 ) proposed a method for analysing cyber claims based on regression trees to identify criteria for classifying and evaluating claims. Similar to Chen and Fiscus ( 2018 ), the dataset used was the Privacy Rights Clearinghouse database. Within the framework of reinsurance, Moro ( 2020 ) analysed cyber index-based information technology activity to see if index-parametric reinsurance coverage could suggest its cedant using data from a Symantec dataset.

Paté-Cornell et al. (2018) presented a general probabilistic risk analysis framework for cybersecurity in an organisation to be specified. The results are distributions of losses from cyberattacks, with and without the countermeasures considered, in support of risk management decisions based both on past data and anticipated incidents. The data used were from the Common Vulnerabilities and Exposures database and from confidential access to a database of cyberattacks on a large, U.S.-based organisation. A different conceptual framework for cyber risk classification and assessment was proposed by Sheehan et al. (2021). This framework showed the importance of proactive and reactive barriers in reducing companies’ exposure to cyber risk and quantifying the risk. Another approach to cyber risk assessment and mitigation was proposed by Mukhopadhyay et al. (2019). They estimated the probability of an attack using generalised linear models, predicted the security technology required to reduce the probability of cyberattacks, and used gamma and exponential distributions to best approximate the average loss data for each malicious attack. They also calculated the expected loss due to cyberattacks, calculated the net premium that would need to be charged by a cyber insurer, and suggested cyber insurance as a strategy to minimise losses. They used the CSI-FBI survey (1997–2010) to conduct their research.
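The chain of calculations described for Mukhopadhyay et al. (attack probability, loss severity, expected loss, net premium) can be illustrated with a small Python sketch. It assumes a logistic model for the attack probability and a gamma-distributed loss, and takes the net premium to equal the expected loss with no loadings. All coefficients and parameters are made-up values, not estimates from the CSI-FBI survey, and the functions are not taken from the cited study.

```python
import math


def attack_probability(intercept: float, coef: float, x: float) -> float:
    """Logistic (binomial GLM with logit link) probability of an attack."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))


def gamma_mean(shape: float, scale: float) -> float:
    """Mean of a gamma distribution; shape = 1 gives the exponential case."""
    return shape * scale


def expected_loss(p_attack: float, mean_severity: float) -> float:
    """Expected loss = probability of attack x mean loss given an attack."""
    return p_attack * mean_severity


# Hypothetical inputs for illustration only.
p = attack_probability(intercept=-1.0, coef=-0.5, x=2.0)  # x: security controls in place
severity = gamma_mean(shape=2.0, scale=50_000.0)          # mean loss per attack
net_premium = expected_loss(p, severity)                  # assumed: no loadings added

print(f"P(attack) = {p:.3f}, mean severity = {severity:,.0f}, "
      f"net premium = {net_premium:,.0f}")
```

With a negative coefficient on the control variable, adding security controls lowers the modelled attack probability and therefore the net premium, which mirrors the qualitative argument in the text.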

In order to highlight the lack of data on cyber risks, Eling ( 2020 ) conducted a literature review in the areas of cyber risk and cyber insurance. Available information on the frequency, severity, and dependency structure of cyber risks was filtered out. In addition, open questions for future cyber risk research were set up. Another example of data collection on the impact of cyberattacks is provided by Sornette et al. ( 2013 ), who use a database of newspaper articles, press reports and other media to provide a predictive method to identify triggering events and potential accident scenarios and estimate their severity and frequency. A similar approach to data collection was used by Arcuri et al. ( 2020 ) to gather an original sample of global cyberattacks from newspaper reports sourced from the LexisNexis database. This collection is also used and applied to the fields of dynamic communication and cyber risk perception by Fang et al. ( 2021 ). To create a dataset of cyber incidents and disputes, Valeriano and Maness ( 2014 ) collected information on cyber interactions between rival states.

To assess trends and the scale of economic cybercrime, Levi ( 2017 ) examined datasets from different countries and their impact on crime policy. Pooser et al. ( 2018 ) investigated the trend in cyber risk identification from 2006 to 2015 and company characteristics related to cyber risk perception. The authors used a dataset of various reports from cyber insurers for their study. Walker-Roberts et al. ( 2020 ) investigated the spectrum of risk of a cybersecurity incident taking place in the cyber-physical-enabled world using the VERIS Community Database. The datasets of impacts identified are presented below. Due to overlap, some may also appear in the causes dataset (Supplementary Table 2).

Cybersecurity datasets

General intrusion detection.

General intrusion detection systems account for the largest share of countermeasure datasets. For companies or researchers focused on cybersecurity, the datasets can be used to test their own countermeasures or obtain information about potential vulnerabilities. For example, Al-Omari et al. (2021) proposed an intelligent intrusion detection model for predicting and detecting attacks in cyberspace, which was applied to the UNSW-NB15 dataset. A similar approach was taken by Choras and Kozik (2015), who used machine learning to detect cyberattacks on web applications. To evaluate their method, they used the HTTP dataset CSIC 2010. For the identification of unknown attacks on web servers, Kamarudin et al. (2017) proposed an anomaly-based intrusion detection system using an ensemble classification approach. Ganeshan and Rodrigues (2020) showed an intrusion detection system approach, which clusters the database into several groups and detects the presence of intrusion in the clusters. In comparison, AlKadi et al. (2019) used a localisation-based model to discover abnormal patterns in network traffic. Hybrid models have been recommended by Bhattacharya et al. (2020) and Agrawal et al. (2019); the former is a machine-learning model based on principal component analysis for the classification of intrusion detection system datasets, while the latter is a hybrid ensemble intrusion detection system for anomaly detection using different datasets to detect patterns in network traffic that deviate from normal behaviour.

Agarwal et al. (2021) used three different machine learning algorithms in their research to find the most suitable for efficiently identifying patterns of suspicious network activity. The UNSW-NB15 dataset was used for this purpose. Kasongo and Sun (2020), with a Feed-Forward Deep Neural Network (FFDNN), Keshk et al. (2021), with a privacy-preserving anomaly detection framework, and others also used the UNSW-NB15 dataset as part of intrusion detection systems. The same dataset and others were used by Binbusayyis and Vaiyapuri (2019) to identify and compare key features for cyber intrusion detection. Atefinia and Ahmadi (2021) proposed a deep neural network model to reduce the false positive rate of an anomaly-based intrusion detection system. Fossaceca et al. (2015) focused on the development of a framework that combined the outputs of multiple learners in order to improve the efficacy of network intrusion detection, and Gauthama Raman et al. (2020) presented a search algorithm based on a Support Vector Machine to improve the detection rate and false alarm rate of intrusion detection techniques. Ahmad and Alsemmeari (2020) targeted extreme learning machine techniques due to their good capabilities in classification problems and handling huge data. They used the NSL-KDD dataset as a benchmark.
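Several of the studies above are assessed in terms of detection rate and false positive (false alarm) rate. As a reminder of how these two quantities are usually computed from labelled predictions, here is a minimal Python sketch; the labels are invented and the function names are illustrative, not taken from any of the cited works.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN where 1 = attack and 0 = normal traffic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def detection_rate(tp, fn):
    """Share of attacks that were flagged (also called recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp, tn):
    """Share of normal records wrongly flagged as attacks."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Invented labels: 4 attacks and 6 normal records.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(detection_rate(tp, fn), false_positive_rate(fp, tn))
```

In this toy example three of four attacks are detected and one of six normal records raises a false alarm; a method that lowers the false positive rate without reducing the detection rate improves on both measures discussed in the text.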

With reference to prediction, Bakdash et al. (2018) used datasets from the U.S. Department of Defense to predict cyberattacks by malware. This dataset consists of weekly counts of cyber events over approximately seven years. Another prediction method was presented by Fan et al. (2018), who showed an improved integrated cybersecurity prediction method based on spatial-time analysis. Also, with reference to prediction, Ashtiani and Azgomi (2014) proposed a framework for the distributed simulation of cyberattacks based on high-level architecture. Kirubavathi and Anitha (2016) recommended an approach to detect botnets, irrespective of their structures, based on network traffic flow behaviour analysis and machine-learning techniques. Dwivedi et al. (2021) introduced a multi-parallel adaptive technique to utilise an adaption mechanism in the group of swarms for network intrusion detection. AlEroud and Karabatis (2018) presented an approach that used contextual information to automatically identify and query possible semantic links between different types of suspicious activities extracted from network flows.

Intrusion detection systems with a focus on IoT

In addition to general intrusion detection systems, a proportion of studies focused on IoT. Habib et al. (2020) presented an approach for converting traditional intrusion detection systems into smart intrusion detection systems for IoT networks. To enhance the process of diagnostic detection of possible vulnerabilities within an IoT system, Georgescu et al. (2019) introduced a method that uses a named entity recognition-based solution. With regard to IoT in the smart home sector, Heartfield et al. (2021) presented a detection system that is able to autonomously adjust the decision function of its underlying anomaly classification models to a smart home’s changing condition. Another intrusion detection system was suggested by Keserwani et al. (2021), which combined Grey Wolf Optimization and Particle Swarm Optimization to identify various attacks for IoT networks. They used the KDD Cup 99, NSL-KDD and CICIDS-2017 datasets to evaluate their model. Abu Al-Haija and Zein-Sabatto (2020) provide a comprehensive development of a new intelligent and autonomous deep-learning-based detection and classification system for cyberattacks in IoT communication networks that leverages the power of convolutional neural networks, abbreviated as IoT-IDCS-CNN (IoT-based Intrusion Detection and Classification System using Convolutional Neural Network). To evaluate the development, the authors used the NSL-KDD dataset. Biswas and Roy (2021) recommended a model that identifies malicious botnet traffic using novel deep-learning approaches such as artificial neural networks, gated recurrent units and long short-term memory models. They tested their model with the Bot-IoT dataset.

With a more forensic background, Koroniotis et al. (2020) submitted a network forensic framework, which described the digital investigation phases for identifying and tracing attack behaviours in IoT networks. The suggested work was evaluated with the Bot-IoT and UNSW-NB15 datasets. With a focus on big data and IoT, Chhabra et al. (2020) presented a cyber forensic framework for big data analytics in an IoT environment using machine learning. Furthermore, the authors mentioned different publicly available datasets for machine-learning models.

A stronger focus on mobile phones was exhibited by Alazab et al. (2020), who presented a classification model that combined permission requests and application programming interface calls. The model was tested with a malware dataset containing 27,891 Android apps. A similar approach was taken by Li et al. (2019a, b), who proposed a reliable classifier for Android malware detection based on factorisation machine architecture and extraction of Android app features from manifest files and source code.

Literature reviews

In addition to the different methods and models for intrusion detection systems, various literature reviews on the methods and datasets were also found. Liu and Lang ( 2019 ) proposed a taxonomy of intrusion detection systems that uses data objects as the main dimension to classify and summarise machine learning and deep learning-based intrusion detection literature. They also presented four different benchmark datasets for machine-learning detection systems. Ahmed et al. ( 2016 ) presented an in-depth analysis of four major categories of anomaly detection techniques, which include classification, statistical, information theory and clustering. Hajj et al. ( 2021 ) gave a comprehensive overview of anomaly-based intrusion detection systems. Their article gives an overview of the requirements, methods, measurements and datasets that are used in an intrusion detection system.

Within the framework of machine learning, Chattopadhyay et al. ( 2018 ) conducted a comprehensive review and meta-analysis on the application of machine-learning techniques in intrusion detection systems. They also compared different machine learning techniques in different datasets and summarised the performance. Vidros et al. ( 2017 ) presented an overview of characteristics and methods in automatic detection of online recruitment fraud. They also published an available dataset of 17,880 annotated job ads, retrieved from the use of a real-life system. An empirical study of different unsupervised learning algorithms used in the detection of unknown attacks was presented by Meira et al. ( 2020 ).

New datasets

Kilincer et al. (2021) reviewed different intrusion detection system datasets in detail. They had a closer look at the UNSW-NB15, ISCX-2012, NSL-KDD and CIDDS-001 datasets. Stojanovic et al. (2020) also provided a review on datasets and their creation for use in advanced persistent threat detection in the literature. Another review of datasets was provided by Sarker et al. (2020), who focused on cybersecurity data science as part of their research and provided an overview from a machine-learning perspective. Avila et al. (2021) conducted a systematic literature review on the use of security logs for data leak detection. They recommended a new classification of information leaks based on the GDPR principles, identified the most widely used publicly available dataset for threat detection, and described the attack types in the datasets and the algorithms used for data leak detection. Tuncer et al. (2020) presented a bytecode-based detection method consisting of feature extraction using local neighbourhood binary patterns. They chose a byte-based malware dataset to investigate the performance of the proposed local neighbourhood binary pattern-based detection method. With a different focus, Mauro et al. (2020) gave an experimental overview of neural-based techniques relevant to intrusion detection. They assessed the value of neural networks using the Bot-IoT and UNSW-NB15 datasets.

Another category of results in the context of countermeasure datasets is those that were presented as new. Moreno et al. (2018) developed a database of 300 security-related accidents from European and American sources. The database contained cybersecurity-related events in the chemical and process industry. Damasevicius et al. (2020) proposed a new dataset (LITNET-2020) for network intrusion detection. The dataset is a new annotated network benchmark dataset obtained from a real-world academic network. It presents real-world examples of normal and under-attack network traffic. With a focus on IoT intrusion detection systems, Alsaedi et al. (2020) proposed new benchmark IoT/IIoT datasets for assessing intrusion detection system-enabled IoT systems. Also in the context of IoT, Vaccari et al. (2020) proposed a dataset focusing on message queue telemetry transport protocols, which can be used to train machine-learning models. To evaluate the performance of machine-learning classifiers, Mahfouz et al. (2020) created a dataset called Game Theory and Cybersecurity (GTCS). A dataset containing 22,000 malware and benign samples was constructed by Martin et al. (2019). The dataset can be used as a benchmark to test algorithms for Android malware classification and clustering techniques. In addition, Laso et al. (2017) presented a dataset created to investigate how data and information quality estimates enable the detection of anomalies and malicious acts in cyber-physical systems. The dataset contained various cyberattacks and is publicly available.

In addition to the results described above, several other studies were found that fit into the category of countermeasures. Johnson et al. (2016) examined the time between vulnerability disclosures. Using another vulnerabilities database, Common Vulnerabilities and Exposures (CVE), Subroto and Apriyana (2019) presented an algorithm model that uses big data analysis of social media and statistical machine learning to predict cyber risks. A similar databank but with a different focus, Common Vulnerability Scoring System, was used by Chatterjee and Thekdi (2020) to present an iterative data-driven learning approach to vulnerability assessment and management for complex systems. Using the CICIDS2017 dataset to evaluate the performance, Malik et al. (2020) proposed a control plane-based orchestration for varied, sophisticated threats and attacks. The same dataset was used in another study by Lee et al. (2019), who developed an artificial security information event management system based on a combination of event profiling for data processing and different artificial neural network methods. To exploit the interdependence between multiple series, Fang et al. (2021) proposed a statistical framework. In order to validate the framework, the authors applied it to a dataset of enterprise-level security breaches from the Privacy Rights Clearinghouse and Identity Theft Center database. Another framework with a defensive aspect was recommended by Li et al. (2021) to increase the robustness of deep neural networks against adversarial malware evasion attacks. Sarabi et al. (2016) investigated whether and to what extent business details can help assess an organisation's risk of data breaches and the distribution of risk across different types of incidents to create policies for protection, detection and recovery from different forms of security incidents. They used data from the VERIS Community Database.

Datasets that have been classified into the cybersecurity category are detailed in Supplementary Table 3. Due to overlap, records from the previous tables may also be included.

This paper presented a systematic literature review of studies on cyber risk and cybersecurity that used datasets. Within this framework, 255 studies were fully reviewed and classified into three categories, and 79 datasets were consolidated from these studies. These datasets were then analysed, and the relevant information was extracted through a filtering process, recorded in a table and enhanced with further information as part of the literature analysis. The result is a comprehensive overview of the datasets: for example, each dataset entry describes where the data came from and how the data have been used to date, which allows datasets to be compared and the appropriate dataset to be selected for a given use case. This research has limitations, so our selection of datasets cannot necessarily be taken as representative of all available datasets related to cyber risks and cybersecurity. For example, the literature searches were conducted in four academic databases and therefore only found datasets that were used in the literature. Many research projects also used old datasets that may no longer reflect current developments. In addition, the data often focus on only one observation and are limited in scope; for example, the datasets can only be applied to specific contexts and are subject to further limitations (e.g. region, industry, operating system). Regarding the applicability of the datasets, it is unfortunately not possible to make a clear statement on the extent to which they can be integrated into academic or practical areas of application, or how much effort this would require. Finally, this is an overview of currently available datasets, which are subject to constant change.

Due to the lack of datasets on cyber risks in the academic literature, additional datasets on cyber risks were integrated as part of a further search. The search was conducted on the Google Dataset Search portal using the search term ‘cyber risk datasets’. Over 100 results were found. However, because many results were of low significance or could not be verified, only 20 selected datasets were included. These can be found in Table 2 in the “Appendix”.

Summary of Google datasets

The results of the literature review and datasets also showed that there continues to be a lack of available, open cyber datasets. This lack of data is reflected in cyber insurance, for example, where it is difficult to set a risk-based premium without a sufficient database (Nurse et al. 2020). The global cyber insurance market was estimated at USD 5.5 billion in 2020 (Dyson 2020). When compared to the USD 1 trillion global losses from cybercrime (Maleks Smith et al. 2020), it is clear that there exists a significant cyber risk awareness challenge for both the insurance industry and international commerce. Without comprehensive, high-quality data on cyber losses, it can be difficult to estimate potential losses from cyberattacks and price cyber insurance accordingly (GAO 2021). For instance, the average cyber insurance loss increased from USD 145,000 in 2019 to USD 359,000 in 2020 (FitchRatings 2021). Cyber insurance is an important risk management tool to mitigate the financial impact of cybercrime, and this is particularly evident in the impact on different industries. In the Energy & Commodities financial markets, a ransomware attack on the Colonial Pipeline led to a substantial impact on the U.S. economy. As a result of the attack, about 45% of the U.S. East Coast was temporarily unable to obtain supplies of diesel, petrol and jet fuel. This caused the average price in the U.S. to rise 7 cents to USD 3.04 per gallon, the highest in seven years (Garber 2021). In addition, Colonial Pipeline confirmed that it paid a USD 4.4 million ransom to a hacker gang after the attack. Another ransomware attack occurred in the healthcare and government sector. The victim of this attack was the Irish Health Service Executive (HSE), and a ransom payment of USD 20 million was demanded from the Irish government to restore services after the hack (Tidy 2021). In the car manufacturing sector, Miller and Valasek (2015) demonstrated a cyberattack that resulted in the recall of 1.4 million vehicles and cost manufacturers EUR 761 million. The risk that arises in the context of these events is the potential for the accumulation of cyber losses, which is why cyber insurers are not expanding their capacity. An example of this accumulation of cyber risks is the NotPetya malware attack, which originated in Russia, struck in Ukraine, and rapidly spread around the world, causing at least USD 10 billion in damage (GAO 2021). These events highlight the importance of proper cyber risk management.
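The figures quoted in this paragraph can be set side by side with a few lines of arithmetic. The following Python sketch is illustrative only: it uses the numbers as cited in the text, and the helper name `pct_change` and all variable names are assumptions introduced here, not part of any cited source.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from `old` to `new`, expressed in percent."""
    return (new - old) / old * 100.0


# Figures as cited in the text (all in USD).
cyber_insurance_market_2020 = 5.5e9   # global cyber insurance market (Dyson 2020)
global_cybercrime_losses = 1.0e12     # global losses from cybercrime
avg_loss_2019 = 145_000               # average cyber insurance loss (FitchRatings 2021)
avg_loss_2020 = 359_000
fuel_price_after = 3.04               # average U.S. price per gallon (Garber 2021)
fuel_price_rise = 0.07                # the 7-cent increase

# Size of the insurance market relative to estimated losses, in percent.
market_share_of_losses = cyber_insurance_market_2020 / global_cybercrime_losses * 100.0

# Year-on-year growth of the average insured loss, in percent.
avg_loss_growth = pct_change(avg_loss_2019, avg_loss_2020)

# Relative fuel price increase after the Colonial Pipeline attack, in percent.
fuel_price_growth = pct_change(fuel_price_after - fuel_price_rise, fuel_price_after)

print(f"Insurance market relative to cybercrime losses: {market_share_of_losses:.2f}%")
print(f"Growth in average cyber insurance loss, 2019 to 2020: {avg_loss_growth:.1f}%")
print(f"Rise in average U.S. fuel price: {fuel_price_growth:.1f}%")
```

On these figures, the insurance market corresponds to roughly 0.55% of the estimated cybercrime losses, and the average insured loss more than doubled within one year.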

This research provides cyber insurance stakeholders with an overview of cyber datasets. Cyber insurers can use the open datasets to improve their understanding and assessment of cyber risks. For example, the impact datasets can be used to better measure financial impacts and their frequencies. These data could be combined with existing portfolio data from cyber insurers and integrated with existing pricing tools and factors to better assess cyber risk valuation. Most cyber insurers hold some historical cyber policy and claims data, but these data remain too sparse at present for accurate prediction (Bessy-Roland et al. 2021). A combination of portfolio data and external datasets would support risk-adjusted pricing for cyber insurance, which would also benefit policyholders. In addition, cyber insurance stakeholders can use the datasets to identify patterns and make better predictions, which would benefit sustainable cyber insurance coverage. In terms of cyber risk cause datasets, cyber insurers can use the data to review their insurance products. For example, the data could provide information on which cyber risks have not been sufficiently considered in product design or where improvements are needed. A combination of cyber cause and cybersecurity datasets can help establish uniform definitions to provide greater transparency and clarity. Consistent terminology could lead to a more sustainable cyber market, where cyber insurers make informed decisions about the level of coverage and policyholders understand their coverage (The Geneva Association 2020).
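One way to picture the combination of sparse portfolio data with an external dataset is a credibility-weighted blend of pure premiums. The Python sketch below is a minimal illustration under stated assumptions: it uses a Bühlmann-style weight Z = n / (n + k), which is a standard actuarial device chosen here for illustration and not a method proposed in the reviewed literature. The class `LossExperience`, the constant `k` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LossExperience:
    """Aggregate loss experience from one data source (hypothetical structure)."""
    exposure_years: float   # number of policy-years observed
    claim_count: int        # number of cyber claims or incidents
    total_loss: float       # total loss amount in USD

    @property
    def frequency(self) -> float:
        """Claims per policy-year."""
        return self.claim_count / self.exposure_years

    @property
    def severity(self) -> float:
        """Average loss per claim (0 if there are no claims)."""
        return self.total_loss / self.claim_count if self.claim_count else 0.0

    @property
    def pure_premium(self) -> float:
        """Expected loss per policy-year, i.e. frequency times severity."""
        return self.frequency * self.severity


def credibility_weight(exposure_years: float, k: float) -> float:
    """Buehlmann-style weight Z = n / (n + k); k is an assumed constant."""
    return exposure_years / (exposure_years + k)


def blended_pure_premium(portfolio: LossExperience,
                         external: LossExperience,
                         k: float) -> float:
    """Weight the insurer's own experience against an external dataset."""
    z = credibility_weight(portfolio.exposure_years, k)
    return z * portfolio.pure_premium + (1.0 - z) * external.pure_premium


# Hypothetical numbers, for illustration only.
portfolio = LossExperience(exposure_years=200, claim_count=4, total_loss=800_000)
external = LossExperience(exposure_years=10_000, claim_count=300, total_loss=90_000_000)

blended = blended_pure_premium(portfolio, external, k=800)
print(f"Portfolio pure premium: {portfolio.pure_premium:,.0f} USD")
print(f"External pure premium:  {external.pure_premium:,.0f} USD")
print(f"Blended pure premium:   {blended:,.0f} USD")
```

With little own experience, the weight on the portfolio is small and the blended estimate stays close to the external figure. As the insurer's exposure grows, its own data gradually dominate.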

In addition to the cyber insurance community, this research also supports cybersecurity stakeholders. The reviewed literature can be used to provide a contemporary, contextual and categorised summary of available datasets. This supports efficient and timely progress in cyber risk research and is beneficial given the dynamic nature of cyber risks. With the help of the described cybersecurity datasets and the identified information, a comparison of different datasets is possible. The datasets can be used to evaluate the effectiveness of countermeasures in simulated cyberattacks or to test intrusion detection systems.

In this paper, we conducted a systematic review of studies on cyber risk and cybersecurity databases. We found that most of the datasets are in the field of intrusion detection and machine learning and are used for technical cybersecurity aspects. Datasets on cyber risks were comparatively underrepresented. Owing to the dynamic nature of cyber risk and the lack of historical data, assessing and understanding cyber risk is a major challenge for cyber insurance stakeholders. To address this challenge, a greater density of cyber data is needed to support cyber insurers in risk management and researchers working on cyber risk-related topics. With reference to ‘Open Science’ FAIR data (Jacobsen et al. 2020), mandatory reporting of cyber incidents could help improve cyber understanding, awareness and loss prevention among companies and insurers. Through greater availability of data, cyber risks can be better understood, enabling researchers to conduct more in-depth research into these risks. Companies could incorporate this new knowledge into their corporate culture to reduce cyber risks. For insurance companies, this would have the advantage that all insurers would have the same understanding of cyber risks, which would support sustainable risk-based pricing. In addition, common definitions of cyber risks could be derived from new data.

The cybersecurity databases summarised and categorised in this research could provide a different perspective on cyber risks that would enable the formulation of common definitions in cyber policies. The datasets can help companies addressing cybersecurity and cyber risk as part of risk management assess their internal cyber posture and cybersecurity measures. The paper can also help improve risk awareness and corporate behaviour, and provides the research community with a comprehensive overview of peer-reviewed datasets and other available datasets in the area of cyber risk and cybersecurity. This approach is intended to support the free availability of data for research. The complete tabulated review of the literature is included in the Supplementary Material.

This work suggests several directions for future work. First, there are currently few publicly available datasets for cyber risk and cybersecurity. The older datasets that are still widely used no longer reflect today's technical environment. Moreover, they can often only be used in one context, and the scope of the samples is very limited. It would be of great value if more publicly available datasets reflected current conditions. This could help intrusion detection systems to take current events into account and thus achieve a higher success rate. It could also compensate for the disadvantages of older datasets by collecting larger quantities of samples across a wider range of contexts. Another area of research may be the integrability and adaptability of cybersecurity and cyber risk datasets. For example, it is often unclear to what extent datasets can be integrated with or adapted to existing data. For cyber risks and cybersecurity, it would be helpful to know what requirements need to be met or what is needed to use the datasets appropriately. In addition, it would be helpful to know whether datasets can be modified to be used for cyber risks or cybersecurity. Finally, the ability for stakeholders to identify machine-readable cybersecurity datasets would be useful because it would allow for even clearer delineations or comparisons between datasets. Due to the lack of publicly available datasets, concrete benchmarks often cannot be applied.
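As an illustration of what a machine-readable dataset description could look like, the following Python sketch defines a small metadata record and a simple keyword filter. It is a hypothetical structure, not a schema proposed in this review: the field names and the functions `find_by_focus` and `to_json` are assumptions, the entry names marked as descriptive labels are placeholders, and the entries only restate details given in this paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical machine-readable description of one dataset."""
    name: str
    reference: str
    focus: str
    publicly_available: Optional[bool] = None  # None means "not stated here"
    sample_count: Optional[int] = None         # None means "not stated here"


# Entries restate details from this review; unnamed datasets get descriptive labels.
CATALOGUE: List[DatasetRecord] = [
    DatasetRecord("LITNET-2020", "Damasevicius et al. 2020",
                  "network intrusion detection"),
    DatasetRecord("TON_IoT", "Alsaedi et al. 2020",
                  "IoT/IIoT intrusion detection"),
    DatasetRecord("Android malware dataset (descriptive label)", "Martin et al. 2019",
                  "Android malware classification", sample_count=22000),
    DatasetRecord("Cyber-physical systems dataset (descriptive label)", "Laso et al. 2017",
                  "anomaly detection in cyber-physical systems",
                  publicly_available=True),
]


def find_by_focus(records: List[DatasetRecord], keyword: str) -> List[DatasetRecord]:
    """Return all records whose focus contains the keyword (case-insensitive)."""
    return [r for r in records if keyword.lower() in r.focus.lower()]


def to_json(records: List[DatasetRecord]) -> str:
    """Serialise the catalogue so that other tools can read and compare entries."""
    return json.dumps([asdict(r) for r in records], indent=2)


for record in find_by_focus(CATALOGUE, "intrusion"):
    print(record.name, "-", record.reference)
print(to_json(CATALOGUE))
```

A shared record format of this kind would let stakeholders filter and compare datasets programmatically instead of reading each publication individually.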


Biographies

is a PhD student at the Kemmy Business School, University of Limerick, as part of the Emerging Risk Group (ERG). He is researching in joint cooperation with the Institute for Insurance Studies (ivwKöln), TH Köln, where he is working as a Research Assistant at the Cologne Research Centre for Reinsurance. His current research interests include cyber risks, cyber insurance and cybersecurity. Frank is a Fellow of the Chartered Insurance Institute (FCII) and a member of the German Association for Insurance Studies (DVfVW).

is a Lecturer in Risk and Finance at the Kemmy Business School at the University of Limerick. In his research, Dr Sheehan investigates novel risk metrication and machine learning methodologies in the context of insurance and finance, attentive to a changing private and public emerging risk environment. He is a researcher with significant insurance industry and academic experience. With a professional background in actuarial science, his research uses machine-learning techniques to estimate the changing risk profile produced by emerging technologies. He is a senior member of the Emerging Risk Group (ERG) at the University of Limerick, which has long-established expertise in insurance and risk management and has continued success within large research consortia including a number of SFI, FP7 and EU H2020 research projects. In particular, he contributed to the successful completion of three Horizon 2020 EU-funded projects, including PROTECT, Vision Inspired Driver Assistance Systems (VI-DAS) and Cloud Large Scale Video Analysis (Cloud-LSVA).

is a Professor at the Institute of Insurance at the Technical University of Cologne. His activities include teaching and research in insurance law and liability insurance. His research focuses include D&O, corporate liability, fidelity and cyber insurance. In addition, he heads the Master’s degree programme in insurance law and is the Academic Director of the Automotive Insurance Manager and Cyber Insurance Manager certificate programmes. He is also chairman of the examination board at the Institute of Insurance Studies.

Arash Negahdari Kia

is a postdoctoral Marie Curie scholar and Research Fellow at the Kemmy Business School (KBS), University of Limerick (UL), a member of the Lero Software Research Center and Emerging Risk Group (ERG). He researches the cybersecurity risks of autonomous vehicles using machine-learning algorithms in a team supervised by Dr Finbarr Murphy at KBS, UL. For his PhD, he developed two graph-based, semi-supervised algorithms for multivariate time series for global stock market indices prediction. For his Master’s, he developed neural network models for Forex market prediction. Arash’s other research interests include text mining, graph mining and bioinformatics.

Martin Mullins is a Professor in Risk and Insurance at the Kemmy Business School, University of Limerick. He has worked on a number of insurance-related research projects, including four EU Commission-funded projects around emerging technologies and risk transfer. Prof. Mullins maintains strong links with the international insurance industry and works closely with Lloyd’s of London and XL Catlin on emerging risk. His work also encompasses the area of applied ethics as it pertains to new technologies. In the field of applied ethics, Prof. Mullins works closely with the insurance industry and lectures on cultural and technological breakthroughs of high societal relevance. In that respect, he has been appointed to a European expert group to advise EIOPA on the development of digital responsibility principles in insurance.

Finbarr Murphy is Executive Dean of the Kemmy Business School. A computer engineering graduate, Finbarr worked for over 10 years in investment banking before returning to academia and completing his PhD in 2010. Finbarr has authored or co-authored over 70 refereed journal papers, edited books and book chapters. His research has been published in leading research journals in his discipline, such as Nature Nanotechnology, Small, Transportation Research A-F and the Review of Derivatives Research. A former Fulbright Scholar and Erasmus Mundus Exchange Scholar, Finbarr has delivered numerous guest lectures in America, mainland Europe, Israel, Russia, China and Vietnam. His research interests include quantitative finance and, more recently, emerging technological risk. Finbarr is currently engaged in several EU H2020 projects and with Science Foundation Ireland.

(FCII) has held the Chair of Reinsurance at the Institute of Insurance of TH Köln since 1998, focusing on the efficiency of reinsurance, industrial insurance and alternative risk transfer (ART). He studied mathematics and computer science with a focus on artificial intelligence and researched from 1988 to 1991 at the Fraunhofer Institute for Autonomous Intelligent Systems (AiS) in Schloß Birlinghoven. From 1991 to 2004, Prof. Materne worked for Gen Re (formerly Cologne Re) in various management positions in Germany and abroad, and from 2001 to 2003, he served as General Manager of Cologne Re of Dublin in Ireland. In 2008, Prof. Materne founded the Cologne Reinsurance Research Centre, of which he is the Director. Current issues in reinsurance and related fields are analysed and discussed with practitioners, with valuable contacts through the ‘Förderkreis Rückversicherung’ and the organisation of the annual Cologne Reinsurance Symposium. Prof. Materne holds various international supervisory boards, board of directors and advisory board mandates at insurance and reinsurance companies, captives, InsurTechs, EIOPA, as well as at insurance-scientific institutions. He also acts as an arbitrator and party representative in arbitration proceedings.

Open Access funding provided by the IReL Consortium.

Declarations

On behalf of all authors, the corresponding author states that there is no conflict of interest.

1 Average cost of a breach of more than 50 million records.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

  • Aamir M, Rizvi SSH, Hashmani MA, Zubair M, Ahmad J. Machine learning classification of port scanning and DDoS attacks: A comparative analysis. Mehran University Research Journal of Engineering and Technology. 2021; 40 (1):215–229. doi: 10.22581/muet1982.2101.19.
  • Aamir M, Zaidi SMA. DDoS attack detection with feature engineering and machine learning: The framework and performance evaluation. International Journal of Information Security. 2019; 18 (6):761–785. doi: 10.1007/s10207-019-00434-1.
  • El Aassal A, Baki S, Das A, Verma RM. An in-depth benchmarking and evaluation of phishing detection research for security needs. IEEE Access. 2020; 8 :22170–22192. doi: 10.1109/ACCESS.2020.2969780.
  • Abu Al-Haija Q, Zein-Sabatto S. An efficient deep-learning-based detection and classification system for cyber-attacks in IoT communication networks. Electronics. 2020; 9 (12):26. doi: 10.3390/electronics9122152.
  • Adhikari U, Morris TH, Pan SY. Applying Hoeffding adaptive trees for real-time cyber-power event and intrusion classification. IEEE Transactions on Smart Grid. 2018; 9 (5):4049–4060. doi: 10.1109/tsg.2017.2647778.
  • Agarwal A, Sharma P, Alshehri M, Mohamed AA, Alfarraj O. Classification model for accuracy and intrusion detection using machine learning approach. PeerJ Computer Science. 2021. doi: 10.7717/peerj-cs.437.
  • Agrafiotis Ioannis, Nurse Jason R.C., Goldsmith M, Creese S, Upton D. A taxonomy of cyber-harms: Defining the impacts of cyber-attacks and understanding how they propagate. Journal of Cybersecurity. 2018; 4 :tyy006. doi: 10.1093/cybsec/tyy006.
  • Agrawal A, Mohammed S, Fiaidhi J. Ensemble technique for intruder detection in network traffic. International Journal of Security and Its Applications. 2019; 13 (3):1–8. doi: 10.33832/ijsia.2019.13.3.01.
  • Ahmad I, Alsemmeari RA. Towards improving the intrusion detection through ELM (extreme learning machine). CMC Computers Materials & Continua. 2020; 65 (2):1097–1111. doi: 10.32604/cmc.2020.011732.
  • Ahmed M, Mahmood AN, Hu JK. A survey of network anomaly detection techniques. Journal of Network and Computer Applications. 2016; 60 :19–31. doi: 10.1016/j.jnca.2015.11.016.
  • Al-Jarrah OY, Alhussein O, Yoo PD, Muhaidat S, Taha K, Kim K. Data randomization and cluster-based partitioning for Botnet intrusion detection. IEEE Transactions on Cybernetics. 2016; 46 (8):1796–1806. doi: 10.1109/TCYB.2015.2490802.
  • Al-Mhiqani MN, Ahmad R, Abidin ZZ, Yassin W, Hassan A, Abdulkareem KH, Ali NS, Yunos Z. A review of insider threat detection: Classification, machine learning techniques, datasets, open challenges, and recommendations. Applied Sciences—Basel. 2020; 10 (15):41. doi: 10.3390/app10155208.
  • Al-Omari M, Rawashdeh M, Qutaishat F, Alshira'H M, Ababneh N. An intelligent tree-based intrusion detection model for cyber security. Journal of Network and Systems Management. 2021; 29 (2):18. doi: 10.1007/s10922-021-09591-y.
  • Alabdallah A, Awad M. Using weighted Support Vector Machine to address the imbalanced classes problem of Intrusion Detection System. KSII Transactions on Internet and Information Systems. 2018; 12 (10):5143–5158. doi: 10.3837/tiis.2018.10.027.
  • Alazab M, Alazab M, Shalaginov A, Mesleh A, Awajan A. Intelligent mobile malware detection using permission requests and API calls. Future Generation Computer Systems—the International Journal of eScience. 2020; 107 :509–521. doi: 10.1016/j.future.2020.02.002.
  • Albahar MA, Al-Falluji RA, Binsawad M. An empirical comparison on malicious activity detection using different neural network-based models. IEEE Access. 2020; 8 :61549–61564. doi: 10.1109/ACCESS.2020.2984157.
  • AlEroud AF, Karabatis G. Queryable semantics to detect cyber-attacks: A flow-based detection approach. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2018; 48 (2):207–223. doi: 10.1109/TSMC.2016.2600405.
  • Algarni AM, Thayananthan V, Malaiya YK. Quantitative assessment of cybersecurity risks for mitigating data breaches in business systems. Applied Sciences (Switzerland). 2021. doi: 10.3390/app11083678.
  • Alhowaide A, Alsmadi I, Tang J. Towards the design of real-time autonomous IoT NIDS. Cluster Computing—the Journal of Networks Software Tools and Applications. 2021. doi: 10.1007/s10586-021-03231-5.
  • Ali S, Li Y. Learning multilevel auto-encoders for DDoS attack detection in smart grid network. IEEE Access. 2019; 7 :108647–108659. doi: 10.1109/ACCESS.2019.2933304.
  • AlKadi O, Moustafa N, Turnbull B, Choo KKR. Mixture localization-based outliers models for securing data migration in cloud centers. IEEE Access. 2019; 7 :114607–114618. doi: 10.1109/ACCESS.2019.2935142.
  • Allianz. 2021. Allianz Risk Barometer. https://www.agcs.allianz.com/content/dam/onemarketing/agcs/agcs/reports/Allianz-Risk-Barometer-2021.pdf . Accessed 15 May 2021.
  • Almiani Muder, AbuGhazleh Alia, Al-Rahayfeh Amer, Atiewi Saleh, Razaque Abdul. Deep recurrent neural network for IoT intrusion detection system. Simulation Modelling Practice and Theory. 2020; 101 :102031. doi: 10.1016/j.simpat.2019.102031.
  • Alsaedi A, Moustafa N, Tari Z, Mahmood A, Anwar A. TON_IoT telemetry dataset: A new generation dataset of IoT and IIoT for data-driven intrusion detection systems. IEEE Access. 2020; 8 :165130–165150. doi: 10.1109/access.2020.3022862.
  • Alsamiri J, Alsubhi K. Internet of Things cyber attacks detection using machine learning. International Journal of Advanced Computer Science and Applications. 2019; 10 (12):627–634. doi: 10.14569/IJACSA.2019.0101280.
  • Alsharafat W. Applying artificial neural network and eXtended classifier system for network intrusion detection. International Arab Journal of Information Technology. 2013; 10 (3):230–238.
  • Amin RW, Sevil HE, Kocak S, Francia G, III, Hoover P. The spatial analysis of the malicious uniform resource locators (URLs): 2016 dataset case study. Information (Switzerland). 2021; 12 (1):1–18. doi: 10.3390/info12010002.
  • Arcuri MC, Gai LZ, Ielasi F, Ventisette E. Cyber attacks on hospitality sector: Stock market reaction. Journal of Hospitality and Tourism Technology. 2020; 11 (2):277–290. doi: 10.1108/jhtt-05-2019-0080.
  • Arp Daniel, Spreitzenbarth Michael, Hubner Malte, Rieck Konrad, et al. Drebin: Effective and explainable detection of android malware in your pocket. NDSS Conference. 2014; 14 :23–26.
  • Ashtiani M, Azgomi MA. A distributed simulation framework for modeling cyber attacks and the evaluation of security measures. Simulation—Transactions of the Society for Modeling and Simulation International. 2014; 90 (9):1071–1102. doi: 10.1177/0037549714540221.
  • Atefinia R, Ahmadi M. Network intrusion detection using multi-architectural modular deep neural network. Journal of Supercomputing. 2021; 77 (4):3571–3593. doi: 10.1007/s11227-020-03410-y.
  • Avila R, Khoury R, Khoury R, Petrillo F. Use of security logs for data leak detection: A systematic literature review. Security and Communication Networks. 2021; 2021 :29. doi: 10.1155/2021/6615899.
  • Azeez NA, Ayemobola TJ, Misra S, Maskeliunas R, Damasevicius R. Network Intrusion Detection with a Hashing Based Apriori Algorithm Using Hadoop MapReduce. Computers. 2019; 8 (4):15. doi: 10.3390/computers8040086.
  • Bakdash JZ, Hutchinson S, Zaroukian EG, Marusich LR, Thirumuruganathan S, Sample C, Hoffman B, Das G. Malware in the future forecasting of analyst detection of cyber events. Journal of Cybersecurity. 2018. doi: 10.1093/cybsec/tyy007.
  • Barletta VS, Caivano D, Nannavecchia A, Scalera M. Intrusion detection for in-vehicle communication networks: An unsupervised Kohonen SOM approach. Future Internet. 2020. doi: 10.3390/FI12070119.
  • Barzegar M, Shajari M. Attack scenario reconstruction using intrusion semantics. Expert Systems with Applications. 2018; 108 :119–133. doi: 10.1016/j.eswa.2018.04.030.
  • Bessy-Roland Yannick, Boumezoued Alexandre, Hillairet Caroline. Multivariate Hawkes process for cyber insurance. Annals of Actuarial Science. 2021; 15 (1):14–39. doi: 10.1017/S1748499520000093.
  • Bhardwaj A, Mangat V, Vig R. Hyperband tuned deep neural network with well posed stacked sparse AutoEncoder for detection of DDoS attacks in cloud. IEEE Access. 2020; 8 :181916–181929. doi: 10.1109/ACCESS.2020.3028690.
  • Bhati BS, Rai CS, Balamurugan B, Al-Turjman F. An intrusion detection scheme based on the ensemble of discriminant classifiers. Computers & Electrical Engineering. 2020; 86 :9. doi: 10.1016/j.compeleceng.2020.106742.
  • Bhattacharya S, Krishnan SSR, Maddikunta PKR, Kaluri R, Singh S, Gadekallu TR, Alazab M, Tariq U. A novel PCA-firefly based XGBoost classification model for intrusion detection in networks using GPU. Electronics. 2020; 9 (2):16. doi: 10.3390/electronics9020219.
  • Bibi I, Akhunzada A, Malik J, Iqbal J, Musaddiq A, Kim S. A dynamic DL-driven architecture to combat sophisticated android malware. IEEE Access. 2020; 8 :129600–129612. doi: 10.1109/ACCESS.2020.3009819.
  • Biener C, Eling M, Wirfs JH. Insurability of cyber risk: An empirical analysis. Geneva Papers on Risk and Insurance: Issues and Practice. 2015; 40 (1):131–158. doi: 10.1057/gpp.2014.19.
  • Binbusayyis A, Vaiyapuri T. Identifying and benchmarking key features for cyber intrusion detection: An ensemble approach. IEEE Access. 2019; 7 :106495–106513. doi: 10.1109/ACCESS.2019.2929487.
  • Biswas R, Roy S. Botnet traffic identification using neural networks. Multimedia Tools and Applications. 2021. doi: 10.1007/s11042-021-10765-8.
  • Bouyeddou B, Harrou F, Kadri B, Sun Y. Detecting network cyber-attacks using an integrated statistical approach. Cluster Computing—the Journal of Networks Software Tools and Applications. 2021; 24 (2):1435–1453. doi: 10.1007/s10586-020-03203-1.
  • Bozkir AS, Aydos M. LogoSENSE: A companion HOG based logo detection scheme for phishing web page and E-mail brand recognition. Computers & Security. 2020; 95 :18. doi: 10.1016/j.cose.2020.101855.
  • Brower, D., and M. McCormick. 2021. Colonial pipeline resumes operations following ransomware attack. Financial Times.
  • Cai H, Zhang F, Levi A. An unsupervised method for detecting shilling attacks in recommender systems by mining item relationship and identifying target items. The Computer Journal. 2019; 62 (4):579–597. doi: 10.1093/comjnl/bxy124.
  • Cebula, J.J., M.E. Popeck, and L.R. Young. 2014. A Taxonomy of Operational Cyber Security Risks Version 2.
  • Chadza T, Kyriakopoulos KG, Lambotharan S. Learning to learn sequential network attacks using hidden Markov models. IEEE Access. 2020; 8 :134480–134497. doi: 10.1109/ACCESS.2020.3011293.
  • Chatterjee S, Thekdi S. An iterative learning and inference approach to managing dynamic cyber vulnerabilities of complex systems. Reliability Engineering and System Safety. 2020. doi: 10.1016/j.ress.2019.106664.
  • Chattopadhyay M, Sen R, Gupta S. A comprehensive review and meta-analysis on applications of machine learning techniques in intrusion detection. Australasian Journal of Information Systems. 2018; 22 :27. doi: 10.3127/ajis.v22i0.1667.
  • Chen HS, Fiscus J. The inhospitable vulnerability: A need for cybersecurity risk assessment in the hospitality industry. Journal of Hospitality and Tourism Technology. 2018; 9 (2):223–234. doi: 10.1108/JHTT-07-2017-0044.
  • Chhabra GS, Singh VP, Singh M. Cyber forensics framework for big data analytics in IoT environment using machine learning. Multimedia Tools and Applications. 2020; 79 (23–24):15881–15900. doi: 10.1007/s11042-018-6338-1.
  • Chiba Z, Abghour N, Moussaid K, Elomri A, Rida M. Intelligent approach to build a Deep Neural Network based IDS for cloud environment using combination of machine learning algorithms. Computers and Security. 2019; 86 :291–317. doi: 10.1016/j.cose.2019.06.013.
  • Choras M, Kozik R. Machine learning techniques applied to detect cyber attacks on web applications. Logic Journal of the IGPL. 2015; 23 (1):45–56. doi: 10.1093/jigpal/jzu038.
  • Chowdhury Sudipta, Khanzadeh Mojtaba, Akula Ravi, Zhang Fangyan, Zhang Song, Medal Hugh, Marufuzzaman Mohammad, Bian Linkan. Botnet detection using graph-based feature clustering. Journal of Big Data. 2017; 4 (1):14. doi: 10.1186/s40537-017-0074-7.
  • Cybersecurity & Infrastructure Security Agency. 2020. Cost of a cyber incident: Systematic review and cross-validation. https://www.cisa.gov/sites/default/files/publications/CISA-OCE_Cost_of_Cyber_Incidents_Study-FINAL_508.pdf .
  • D'Hooge L, Wauters T, Volckaert B, De Turck F. Classification hardness for supervised learners on 20 years of intrusion detection data. IEEE Access. 2019; 7 :167455–167469. doi: 10.1109/access.2019.2953451.
  • Damasevicius R, Venckauskas A, Grigaliunas S, Toldinas J, Morkevicius N, Aleliunas T, Smuikys P. LITNET-2020: An annotated real-world network flow dataset for network intrusion detection. Electronics. 2020; 9 (5):23. doi: 10.3390/electronics9050800.
  • De Giovanni Domenico, Leccadito Arturo, Pirra Marco. On the determinants of data breaches: A cointegration analysis. Decisions in Economics and Finance. 2020. doi: 10.1007/s10203-020-00301-y.
  • Deng Lianbing, Li Daming, Yao Xiang, Wang Haoxiang. Retracted Article: Mobile network intrusion detection for IoT system based on transfer learning algorithm. Cluster Computing. 2019; 22 (4):9889–9904. doi: 10.1007/s10586-018-1847-2.
  • Donkal G, Verma GK. A multimodal fusion based framework to reinforce IDS for securing Big Data environment using Spark. Journal of Information Security and Applications. 2018; 43 :1–11. doi: 10.1016/j.jisa.2018.10.001.
  • Dunn C, Moustafa N, Turnbull B. Robustness evaluations of sustainable machine learning models against data Poisoning attacks in the Internet of Things. Sustainability. 2020; 12 (16):17. doi: 10.3390/su12166434.
  • Dwivedi S, Vardhan M, Tripathi S. Multi-parallel adaptive grasshopper optimization technique for detecting anonymous attacks in wireless networks. Wireless Personal Communications. 2021. doi: 10.1007/s11277-021-08368-5.
  • Dyson, B. 2020. COVID-19 crisis could be ‘watershed’ for cyber insurance, says Swiss Re exec. https://www.spglobal.com/marketintelligence/en/news-insights/latest-news-headlines/covid-19-crisis-could-be-watershed-for-cyber-insurance-says-swiss-re-exec-59197154 . Accessed 7 May 2020.
  • EIOPA. 2018. Understanding cyber insurance—a structured dialogue with insurance companies. https://www.eiopa.europa.eu/sites/default/files/publications/reports/eiopa_understanding_cyber_insurance.pdf . Accessed 28 May 2018.
  • Elijah AV, Abdullah A, JhanJhi NZ, Supramaniam M, Abdullateef OB. Ensemble and deep-learning methods for two-class and multi-attack anomaly intrusion detection: An empirical study. International Journal of Advanced Computer Science and Applications. 2019; 10 (9):520–528. doi: 10.14569/IJACSA.2019.0100969.
  • Eling M, Jung K. Copula approaches for modeling cross-sectional dependence of data breach losses. Insurance Mathematics & Economics. 2018; 82 :167–180. doi: 10.1016/j.insmatheco.2018.07.003.
  • Eling M, Schnell W. What do we know about cyber risk and cyber risk insurance? Journal of Risk Finance. 2016; 17 (5):474–491. doi: 10.1108/jrf-09-2016-0122.
  • Eling M, Wirfs J. What are the actual costs of cyber risk events? European Journal of Operational Research. 2019; 272 (3):1109–1119. doi: 10.1016/j.ejor.2018.07.021. [ CrossRef ] [ Google Scholar ]
  • Eling Martin. Cyber risk research in business and actuarial science. European Actuarial Journal. 2020; 10 (2):303–333. doi: 10.1007/s13385-020-00250-1. [ CrossRef ] [ Google Scholar ]
  • Elmasry W, Akbulut A, Zaim AH. Empirical study on multiclass classification-based network intrusion detection. Computational Intelligence. 2019; 35 (4):919–954. doi: 10.1111/coin.12220. [ CrossRef ] [ Google Scholar ]
  • Elsaid Shaimaa Ahmed, Albatati Nouf Saleh. An optimized collaborative intrusion detection system for wireless sensor networks. Soft Computing. 2020; 24 (16):12553–12567. doi: 10.1007/s00500-020-04695-0. [ CrossRef ] [ Google Scholar ]
  • Estepa R, Díaz-Verdejo JE, Estepa A, Madinabeitia G. How much training data is enough? A case study for HTTP anomaly-based intrusion detection. IEEE Access. 2020; 8 :44410–44425. doi: 10.1109/ACCESS.2020.2977591. [ CrossRef ] [ Google Scholar ]
  • European Council. 2021. Cybersecurity: how the EU tackles cyber threats. https://www.consilium.europa.eu/en/policies/cybersecurity/ . Accessed 10 May 2021
  • Falco Gregory, Eling Martin, Jablanski Danielle, Weber Matthias, Miller Virginia, Gordon Lawrence A, Wang Shaun Shuxun, Schmit Joan, Thomas Russell, Elvedi Mauro, Maillart Thomas, Donavan Emy, Dejung Simon, Durand Eric, Nutter Franklin, Scheffer Uzi, Arazi Gil, Ohana Gilbert, Lin Herbert. Cyber risk research impeded by disciplinary barriers. Science (american Association for the Advancement of Science) 2019; 366 (6469):1066–1069. doi: 10.1126/science.aaz4795. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Fan ZJ, Tan ZP, Tan CX, Li X. An improved integrated prediction method of cyber security situation based on spatial-time analysis. Journal of Internet Technology. 2018; 19 (6):1789–1800. doi: 10.3966/160792642018111906015. [ CrossRef ] [ Google Scholar ]
  • Fang ZJ, Xu MC, Xu SH, Hu TZ. A framework for predicting data breach risk: Leveraging dependence to cope with sparsity. IEEE Transactions on Information Forensics and Security. 2021; 16 :2186–2201. doi: 10.1109/tifs.2021.3051804. [ CrossRef ] [ Google Scholar ]
  • Farkas S, Lopez O, Thomas M. Cyber claim analysis using Generalized Pareto regression trees with applications to insurance. Insurance: Mathematics and Economics. 2021; 98 :92–105. doi: 10.1016/j.insmatheco.2021.02.009. [ CrossRef ] [ Google Scholar ]
  • Farsi H, Fanian A, Taghiyarrenani Z. A novel online state-based anomaly detection system for process control networks. International Journal of Critical Infrastructure Protection. 2019; 27 :11. doi: 10.1016/j.ijcip.2019.100323. [ CrossRef ] [ Google Scholar ]
  • Ferrag MA, Maglaras L, Moschoyiannis S, Janicke H. Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study. Journal of Information Security and Applications. 2020; 50 :19. doi: 10.1016/j.jisa.2019.102419. [ CrossRef ] [ Google Scholar ]
  • Field, M. 2018. WannaCry cyber attack cost the NHS £92m as 19,000 appointments cancelled. https://www.telegraph.co.uk/technology/2018/10/11/wannacry-cyber-attack-cost-nhs-92m-19000-appointments-cancelled/ . Accessed 9 May 2018.
  • FitchRatings. 2021. U.S. Cyber Insurance Market Update (Spike in Claims Leads to Decline in 2020 Underwriting Performance). https://www.fitchratings.com/research/insurance/us-cyber-insurance-market-update-spike-in-claims-leads-to-decline-in-2020-underwriting-performance-26-05-2021 .
  • Fossaceca JM, Mazzuchi TA, Sarkani S. MARK-ELM: Application of a novel Multiple Kernel Learning framework for improving the robustness of network intrusion detection. Expert Systems with Applications. 2015; 42 (8):4062–4080. doi: 10.1016/j.eswa.2014.12.040. [ CrossRef ] [ Google Scholar ]
  • Franke Ulrik, Brynielsson Joel. Cyber situational awareness – A systematic review of the literature. Computers & Security. 2014; 46 :18–31. doi: 10.1016/j.cose.2014.06.008. [ CrossRef ] [ Google Scholar ]
  • Freeha Khan, Hwan Kim Jung, Lars Mathiassen, Robin Moore. Data breach management: An integrated risk model. Information & Management. 2021; 58 (1):103392. doi: 10.1016/j.im.2020.103392. [ CrossRef ] [ Google Scholar ]
  • Ganeshan R, Rodrigues Paul. Crow-AFL: Crow based adaptive fractional lion optimization approach for the intrusion detection. Wireless Personal Communications. 2020; 111 (4):2065–2089. doi: 10.1007/s11277-019-06972-0. [ CrossRef ] [ Google Scholar ]
  • GAO. 2021. CYBER INSURANCE—Insurers and policyholders face challenges in an evolving market. https://www.gao.gov/assets/gao-21-477.pdf . Accessed 16 May 2021.
  • Garber, J. 2021. Colonial Pipeline fiasco foreshadows impact of Biden energy policy. https://www.foxbusiness.com/markets/colonial-pipeline-fiasco-foreshadows-impact-of-biden-energy-policy . Accessed 4 May 2021.
  • Gauthama Raman MR, Somu Nivethitha, Jagarapu Sahruday, Manghnani Tina, Selvam Thirumaran, Krithivasan Kannan, Shankar Sriram VS. An efficient intrusion detection technique based on support vector machine and improved binary gravitational search algorithm. Artificial Intelligence Review. 2020; 53 (5):3255–3286. doi: 10.1007/s10462-019-09762-z. [ CrossRef ] [ Google Scholar ]
  • Gavel S, Raghuvanshi AS, Tiwari S. Distributed intrusion detection scheme using dual-axis dimensionality reduction for Internet of things (IoT) Journal of Supercomputing. 2021 doi: 10.1007/s11227-021-03697-5. [ CrossRef ] [ Google Scholar ]
  • GDPR.EU. 2021. FAQ. https://gdpr.eu/faq/ . Accessed 10 May 2021.
  • Georgescu TM, Iancu B, Zurini M. Named-entity-recognition-based automated system for diagnosing cybersecurity situations in IoT networks. Sensors (switzerland) 2019 doi: 10.3390/s19153380. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Giudici Paolo, Raffinetti Emanuela. Cyber risk ordering with rank-based statistical models. AStA Advances in Statistical Analysis. 2020 doi: 10.1007/s10182-020-00387-0. [ CrossRef ] [ Google Scholar ]
  • Goh, J., S. Adepu, K.N. Junejo, and A. Mathur. 2016. A dataset to support research in the design of secure water treatment systems. In CRITIS.
  • Gong XY, Lu JL, Zhou YF, Qiu H, He R. Model uncertainty based annotation error fixing for web attack detection. Journal of Signal Processing Systems for Signal Image and Video Technology. 2021; 93 (2–3):187–199. doi: 10.1007/s11265-019-01494-1. [ CrossRef ] [ Google Scholar ]
  • Goode Sigi, Hoehle Hartmut, Venkatesh Viswanath, Brown Susan A. USER compensation as a data breach recovery action: An investigation of the sony playstation network breach. MIS Quarterly. 2017; 41 (3):703–727. doi: 10.25300/MISQ/2017/41.3.03. [ CrossRef ] [ Google Scholar ]
  • Guo H, Huang S, Huang C, Pan Z, Zhang M, Shi F. File entropy signal analysis combined with wavelet decomposition for malware classification. IEEE Access. 2020; 8 :158961–158971. doi: 10.1109/ACCESS.2020.3020330. [ CrossRef ] [ Google Scholar ]
  • Habib Maria, Aljarah Ibrahim, Faris Hossam. A Modified multi-objective particle swarm optimizer-based Lévy flight: An approach toward intrusion detection in Internet of Things. Arabian Journal for Science and Engineering. 2020; 45 (8):6081–6108. doi: 10.1007/s13369-020-04476-9. [ CrossRef ] [ Google Scholar ]
  • Hajj S, El Sibai R, Abdo JB, Demerjian J, Makhoul A, Guyeux C. Anomaly-based intrusion detection systems: The requirements, methods, measurements, and datasets. Transactions on Emerging Telecommunications Technologies. 2021; 32 (4):36. doi: 10.1002/ett.4240. [ CrossRef ] [ Google Scholar ]
  • Heartfield R, Loukas G, Bezemskij A, Panaousis E. Self-configurable cyber-physical intrusion detection for smart homes using reinforcement learning. IEEE Transactions on Information Forensics and Security. 2021; 16 :1720–1735. doi: 10.1109/tifs.2020.3042049. [ CrossRef ] [ Google Scholar ]
  • Hemo, B., T. Gafni, K. Cohen, and Q. Zhao. 2020. Searching for anomalies over composite hypotheses. IEEE Transactions on Signal Processing 68: 1181–1196. 10.1109/TSP.2020.2971438
  • Hindy H, Brosset D, Bayne E, Seeam AK, Tachtatzis C, Atkinson R, Bellekens X. A taxonomy of network threats and the effect of current datasets on intrusion detection systems. IEEE Access. 2020; 8 :104650–104675. doi: 10.1109/ACCESS.2020.3000179. [ CrossRef ] [ Google Scholar ]
  • Hong W, Huang D, Chen C, Lee J. Towards accurate and efficient classification of power system contingencies and cyber-attacks using recurrent neural networks. IEEE Access. 2020; 8 :123297–123309. doi: 10.1109/ACCESS.2020.3007609. [ CrossRef ] [ Google Scholar ]
  • Husák Martin, Zádník M, Bartos V, Sokol P. Dataset of intrusion detection alerts from a sharing platform. Data in Brief. 2020; 33 :106530. doi: 10.1016/j.dib.2020.106530. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • IBM Security. 2020. Cost of a Data breach Report. https://www.capita.com/sites/g/files/nginej291/files/2020-08/Ponemon-Global-Cost-of-Data-Breach-Study-2020.pdf . Accessed 19 May 2021.
  • IEEE. 2021. IEEE Quick Facts. https://www.ieee.org/about/at-a-glance.html . Accessed 11 May 2021.
  • Firat Ilhan, Kilincer Ertam Fatih, Abdulkadir Sengur. Machine learning methods for cyber security intrusion detection: Datasets and comparative study. Computer Networks. 2021; 188 :107840. doi: 10.1016/j.comnet.2021.107840. [ CrossRef ] [ Google Scholar ]
  • Jaber AN, Ul Rehman S. FCM-SVM based intrusion detection system for cloud computing environment. Cluster Computing—the Journal of Networks Software Tools and Applications. 2020; 23 (4):3221–3231. doi: 10.1007/s10586-020-03082-6. [ CrossRef ] [ Google Scholar ]
  • Jacobs, J., S. Romanosky, B. Edwards, M. Roytman, and I. Adjerid. 2019. Exploit prediction scoring system (epss). arXiv:1908.04856
  • Jacobsen Annika, de Miranda Ricardo, Azevedo Nick Juty, Batista Dominique, Coles Simon, Cornet Ronald, Courtot Mélanie, Crosas Mercè, Dumontier Michel, Evelo Chris T, Goble Carole, Guizzardi Giancarlo, Hansen Karsten Kryger, Hasnain Ali, Hettne Kristina, Heringa Jaap, Hooft Rob W.W., Imming Melanie, Jeffery Keith G, Kaliyaperumal Rajaram, Kersloot Martijn G, Kirkpatrick Christine R, Kuhn Tobias, Labastida Ignasi, Magagna Barbara, McQuilton Peter, Meyers Natalie, Montesanti Annalisa, van Reisen Mirjam, Rocca-Serra Philippe, Pergl Robert, Sansone Susanna-Assunta, da Silva Luiz Olavo Bonino, Santos Juliane Schneider, Strawn George, Thompson Mark, Waagmeester Andra, Weigel Tobias, Wilkinson Mark D, Willighagen Egon L, Wittenburg Peter, Roos Marco, Mons Barend, Schultes Erik. FAIR principles: Interpretations and implementation considerations. Data Intelligence. 2020; 2 (1–2):10–29. doi: 10.1162/dint_r_00024. [ CrossRef ] [ Google Scholar ]
  • Jahromi AN, Hashemi S, Dehghantanha A, Parizi RM, Choo KKR. An enhanced stacked LSTM method with no random initialization for malware threat hunting in safety and time-critical systems. IEEE Transactions on Emerging Topics in Computational Intelligence. 2020; 4 (5):630–640. doi: 10.1109/TETCI.2019.2910243. [ CrossRef ] [ Google Scholar ]
  • Jang S, Li S, Sung Y. FastText-based local feature visualization algorithm for merged image-based malware classification framework for cyber security and cyber defense. Mathematics. 2020; 8 (3):13. doi: 10.3390/math8030460. [ CrossRef ] [ Google Scholar ]
  • Javeed D, Gao TH, Khan MT. SDN-enabled hybrid DL-driven framework for the detection of emerging cyber threats in IoT. Electronics. 2021; 10 (8):16. doi: 10.3390/electronics10080918. [ CrossRef ] [ Google Scholar ]
  • Johnson P, Gorton D, Lagerstrom R, Ekstedt M. Time between vulnerability disclosures: A measure of software product vulnerability. Computers & Security. 2016; 62 :278–295. doi: 10.1016/j.cose.2016.08.004. [ CrossRef ] [ Google Scholar ]
  • Johnson P, Lagerström R, Ekstedt M, Franke U. Can the common vulnerability scoring system be trusted? A Bayesian analysis. IEEE Transactions on Dependable and Secure Computing. 2018; 15 (6):1002–1015. doi: 10.1109/TDSC.2016.2644614. [ CrossRef ] [ Google Scholar ]
  • Junger Marianne, Wang Victoria, Schlömer Marleen. Fraud against businesses both online and offline: Crime scripts, business characteristics, efforts, and benefits. Crime Science. 2020; 9 (1):13. doi: 10.1186/s40163-020-00119-4. [ CrossRef ] [ Google Scholar ]
  • Kalutarage Harsha Kumara, Nguyen Hoang Nga, Shaikh Siraj Ahmed. Towards a threat assessment framework for apps collusion. Telecommunication Systems. 2017; 66 (3):417–430. doi: 10.1007/s11235-017-0296-1. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kamarudin MH, Maple C, Watson T, Safa NS. A LogitBoost-based algorithm for detecting known and unknown web attacks. IEEE Access. 2017; 5 :26190–26200. doi: 10.1109/ACCESS.2017.2766844. [ CrossRef ] [ Google Scholar ]
  • Kasongo SM, Sun YX. A deep learning method with wrapper based feature extraction for wireless intrusion detection system. Computers & Security. 2020; 92 :15. doi: 10.1016/j.cose.2020.101752. [ CrossRef ] [ Google Scholar ]
  • Keserwani Pankaj Kumar, Govil Mahesh Chandra, Pilli Emmanuel S, Govil Prajjval. A smart anomaly-based intrusion detection system for the Internet of Things (IoT) network using GWO–PSO–RF model. Journal of Reliable Intelligent Environments. 2021; 7 (1):3–21. doi: 10.1007/s40860-020-00126-x. [ CrossRef ] [ Google Scholar ]
  • Keshk M, Sitnikova E, Moustafa N, Hu J, Khalil I. An integrated framework for privacy-preserving based anomaly detection for cyber-physical systems. IEEE Transactions on Sustainable Computing. 2021; 6 (1):66–79. doi: 10.1109/TSUSC.2019.2906657. [ CrossRef ] [ Google Scholar ]
  • Khan IA, Pi DC, Bhatia AK, Khan N, Haider W, Wahab A. Generating realistic IoT-based IDS dataset centred on fuzzy qualitative modelling for cyber-physical systems. Electronics Letters. 2020; 56 (9):441–443. doi: 10.1049/el.2019.4158. [ CrossRef ] [ Google Scholar ]
  • Khraisat A, Gondal I, Vamplew P, Kamruzzaman J, Alazab A. Hybrid intrusion detection system based on the stacking ensemble of C5 decision tree classifier and one class support vector machine. Electronics. 2020; 9 (1):18. doi: 10.3390/electronics9010173. [ CrossRef ] [ Google Scholar ]
  • Khraisat Ansam, Gondal Iqbal, Vamplew Peter, Kamruzzaman Joarder. Survey of intrusion detection systems: Techniques, datasets and challenges. Cybersecurity. 2019; 2 (1):20. doi: 10.1186/s42400-019-0038-7. [ CrossRef ] [ Google Scholar ]
  • Kilincer IF, Ertam F, Sengur A. Machine learning methods for cyber security intrusion detection: Datasets and comparative study. Computer Networks. 2021; 188 :16. doi: 10.1016/j.comnet.2021.107840. [ CrossRef ] [ Google Scholar ]
  • Kim D, Kim HK. Automated dataset generation system for collaborative research of cyber threat analysis. Security and Communication Networks. 2019; 2019 :10. doi: 10.1155/2019/6268476. [ CrossRef ] [ Google Scholar ]
  • Kim Gyeongmin, Lee Chanhee, Jo Jaechoon, Lim Heuiseok. Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network. International Journal of Machine Learning and Cybernetics. 2020; 11 (10):2341–2355. doi: 10.1007/s13042-020-01122-6. [ CrossRef ] [ Google Scholar ]
  • Kirubavathi G, Anitha R. Botnet detection via mining of traffic flow characteristics. Computers & Electrical Engineering. 2016; 50 :91–101. doi: 10.1016/j.compeleceng.2016.01.012. [ CrossRef ] [ Google Scholar ]
  • Kiwia D, Dehghantanha A, Choo KKR, Slaughter J. A cyber kill chain based taxonomy of banking Trojans for evolutionary computational intelligence. Journal of Computational Science. 2018; 27 :394–409. doi: 10.1016/j.jocs.2017.10.020. [ CrossRef ] [ Google Scholar ]
  • Koroniotis N, Moustafa N, Sitnikova E. A new network forensic framework based on deep learning for Internet of Things networks: A particle deep framework. Future Generation Computer Systems. 2020; 110 :91–106. doi: 10.1016/j.future.2020.03.042. [ CrossRef ] [ Google Scholar ]
  • Kruse Clemens Scott, Frederick Benjamin, Jacobson Taylor, Kyle Monticone D. Cybersecurity in healthcare: A systematic review of modern threats and trends. Technology and Health Care. 2017; 25 (1):1–10. doi: 10.3233/THC-161263. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kshetri N. The economics of cyber-insurance. IT Professional. 2018; 20 (6):9–14. doi: 10.1109/MITP.2018.2874210. [ CrossRef ] [ Google Scholar ]
  • Kumar R, Kumar P, Tripathi R, Gupta GP, Gadekallu TR, Srivastava G. SP2F: A secured privacy-preserving framework for smart agricultural Unmanned Aerial Vehicles. Computer Networks. 2021 doi: 10.1016/j.comnet.2021.107819. [ CrossRef ] [ Google Scholar ]
  • Kumar R, Tripathi R. DBTP2SF: A deep blockchain-based trustworthy privacy-preserving secured framework in industrial internet of things systems. Transactions on Emerging Telecommunications Technologies. 2021; 32 (4):27. doi: 10.1002/ett.4222. [ CrossRef ] [ Google Scholar ]
  • Laso PM, Brosset D, Puentes J. Dataset of anomalies and malicious acts in a cyber-physical subsystem. Data in Brief. 2017; 14 :186–191. doi: 10.1016/j.dib.2017.07.038. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lee J, Kim J, Kim I, Han K. Cyber threat detection based on artificial neural networks using event profiles. IEEE Access. 2019; 7 :165607–165626. doi: 10.1109/ACCESS.2019.2953095. [ CrossRef ] [ Google Scholar ]
  • Lee SJ, Yoo PD, Asyhari AT, Jhi Y, Chermak L, Yeun CY, Taha K. IMPACT: Impersonation attack detection via edge computing using deep Autoencoder and feature abstraction. IEEE Access. 2020; 8 :65520–65529. doi: 10.1109/ACCESS.2020.2985089. [ CrossRef ] [ Google Scholar ]
  • Leong Yin-Yee, Chen Yen-Chih. Cyber risk cost and management in IoT devices-linked health insurance. The Geneva Papers on Risk and Insurance—Issues and Practice. 2020; 45 (4):737–759. doi: 10.1057/s41288-020-00169-4. [ CrossRef ] [ Google Scholar ]
  • Levi, M. 2017. Assessing the trends, scale and nature of economic cybercrimes: overview and Issues: In Cybercrimes, cybercriminals and their policing, in crime, law and social change. Crime, Law and Social Change 67 (1): 3–20. 10.1007/s10611-016-9645-3.
  • Li C, Mills K, Niu D, Zhu R, Zhang H, Kinawi H. Android malware detection based on factorization machine. IEEE Access. 2019; 7 :184008–184019. doi: 10.1109/ACCESS.2019.2958927. [ CrossRef ] [ Google Scholar ]
  • Li DQ, Li QM. Adversarial deep ensemble: evasion attacks and defenses for malware detection. IEEE Transactions on Information Forensics and Security. 2020; 15 :3886–3900. doi: 10.1109/tifs.2020.3003571. [ CrossRef ] [ Google Scholar ]
  • Li DQ, Li QM, Ye YF, Xu SH. A framework for enhancing deep neural networks against adversarial malware. IEEE Transactions on Network Science and Engineering. 2021; 8 (1):736–750. doi: 10.1109/tnse.2021.3051354. [ CrossRef ] [ Google Scholar ]
  • Li RH, Zhang C, Feng C, Zhang X, Tang CJ. Locating vulnerability in binaries using deep neural networks. IEEE Access. 2019; 7 :134660–134676. doi: 10.1109/access.2019.2942043. [ CrossRef ] [ Google Scholar ]
  • Li X, Xu M, Vijayakumar P, Kumar N, Liu X. Detection of low-frequency and multi-stage attacks in industrial Internet of Things. IEEE Transactions on Vehicular Technology. 2020; 69 (8):8820–8831. doi: 10.1109/TVT.2020.2995133. [ CrossRef ] [ Google Scholar ]
  • Liu HY, Lang B. Machine learning and deep learning methods for intrusion detection systems: A survey. Applied Sciences—Basel. 2019; 9 (20):28. doi: 10.3390/app9204396. [ CrossRef ] [ Google Scholar ]
  • Lopez-Martin M, Carro B, Sanchez-Esguevillas A. Application of deep reinforcement learning to intrusion detection for supervised problems. Expert Systems with Applications. 2020 doi: 10.1016/j.eswa.2019.112963. [ CrossRef ] [ Google Scholar ]
  • Loukas G, Gan D, Vuong Tuan. A review of cyber threats and defence approaches in emergency management. Future Internet. 2013; 5 :205–236. doi: 10.3390/fi5020205. [ CrossRef ] [ Google Scholar ]
  • Luo CC, Su S, Sun YB, Tan QJ, Han M, Tian ZH. A convolution-based system for malicious URLs detection. CMC—Computers Materials Continua. 2020; 62 (1):399–411. doi: 10.32604/cmc.2020.06507. [ CrossRef ] [ Google Scholar ]
  • Mahbooba B, Timilsina M, Sahal R, Serrano M. Explainable artificial intelligence (XAI) to enhance trust management in intrusion detection systems using decision tree model. Complexity. 2021; 2021 :11. doi: 10.1155/2021/6634811. [ CrossRef ] [ Google Scholar ]
  • Mahdavifar S, Ghorbani AA. DeNNeS: Deep embedded neural network expert system for detecting cyber attacks. Neural Computing & Applications. 2020; 32 (18):14753–14780. doi: 10.1007/s00521-020-04830-w. [ CrossRef ] [ Google Scholar ]
  • Mahfouz A, Abuhussein A, Venugopal D, Shiva S. Ensemble classifiers for network intrusion detection using a novel network attack dataset. Future Internet. 2020; 12 (11):1–19. doi: 10.3390/fi12110180. [ CrossRef ] [ Google Scholar ]
  • Maleks Smith, Z., E. Lostri, and J.A. Lewis. 2020. The hidden costs of cybercrime. https://www.mcafee.com/enterprise/en-us/assets/reports/rp-hidden-costs-of-cybercrime.pdf . Accessed 16 May 2021.
  • Malik J, Akhunzada A, Bibi I, Imran M, Musaddiq A, Kim SW. Hybrid deep learning: An efficient reconnaissance and surveillance detection mechanism in SDN. IEEE Access. 2020; 8 :134695–134706. doi: 10.1109/ACCESS.2020.3009849. [ CrossRef ] [ Google Scholar ]
  • Manimurugan S. IoT-Fog-Cloud model for anomaly detection using improved Naive Bayes and principal component analysis. Journal of Ambient Intelligence and Humanized Computing. 2020 doi: 10.1007/s12652-020-02723-3. [ CrossRef ] [ Google Scholar ]
  • Martin A, Lara-Cabrera R, Camacho D. Android malware detection through hybrid features fusion and ensemble classifiers: The AndroPyTool framework and the OmniDroid dataset. Information Fusion. 2019; 52 :128–142. doi: 10.1016/j.inffus.2018.12.006. [ CrossRef ] [ Google Scholar ]
  • Mauro MD, Galatro G, Liotta A. Experimental review of neural-based approaches for network intrusion management. IEEE Transactions on Network and Service Management. 2020; 17 (4):2480–2495. doi: 10.1109/TNSM.2020.3024225. [ CrossRef ] [ Google Scholar ]
  • McLeod A, Dolezel D. Cyber-analytics: Modeling factors associated with healthcare data breaches. Decision Support Systems. 2018; 108 :57–68. doi: 10.1016/j.dss.2018.02.007. [ CrossRef ] [ Google Scholar ]
  • Meira J, Andrade R, Praca I, Carneiro J, Bolon-Canedo V, Alonso-Betanzos A, Marreiros G. Performance evaluation of unsupervised techniques in cyber-attack anomaly detection. Journal of Ambient Intelligence and Humanized Computing. 2020; 11 (11):4477–4489. doi: 10.1007/s12652-019-01417-9. [ CrossRef ] [ Google Scholar ]
  • Miao Y, Ma J, Liu X, Weng J, Li H, Li H. Lightweight fine-grained search over encrypted data in Fog computing. IEEE Transactions on Services Computing. 2019; 12 (5):772–785. doi: 10.1109/TSC.2018.2823309. [ CrossRef ] [ Google Scholar ]
  • Miller, C., and C. Valasek. 2015. Remote exploitation of an unaltered passenger vehicle. Black Hat USA 2015 (S 91).
  • Mireles JD, Ficke E, Cho JH, Hurley P, Xu SH. Metrics towards measuring cyber agility. IEEE Transactions on Information Forensics and Security. 2019; 14 (12):3217–3232. doi: 10.1109/tifs.2019.2912551. [ CrossRef ] [ Google Scholar ]
  • Mishra N, Pandya S. Internet of Things applications, security challenges, attacks, intrusion detection, and future visions: A systematic review. IEEE Access. 2021 doi: 10.1109/ACCESS.2021.3073408. [ CrossRef ] [ Google Scholar ]
  • Monshizadeh M, Khatri V, Atli BG, Kantola R, Yan Z. Performance evaluation of a combined anomaly detection platform. IEEE Access. 2019; 7 :100964–100978. doi: 10.1109/ACCESS.2019.2930832. [ CrossRef ] [ Google Scholar ]
  • Moreno VC, Reniers G, Salzano E, Cozzani V. Analysis of physical and cyber security-related events in the chemical and process industry. Process Safety and Environmental Protection. 2018; 116 :621–631. doi: 10.1016/j.psep.2018.03.026. [ CrossRef ] [ Google Scholar ]
  • Moro ED. Towards an economic cyber loss index for parametric cover based on IT security indicator: A preliminary analysis. Risks. 2020 doi: 10.3390/risks8020045. [ CrossRef ] [ Google Scholar ]
  • Moustafa N, Adi E, Turnbull B, Hu J. A new threat intelligence scheme for safeguarding industry 4.0 systems. IEEE Access. 2018; 6 :32910–32924. doi: 10.1109/ACCESS.2018.2844794. [ CrossRef ] [ Google Scholar ]
  • Moustakidis S, Karlsson P. A novel feature extraction methodology using Siamese convolutional neural networks for intrusion detection. Cybersecurity. 2020 doi: 10.1186/s42400-020-00056-4. [ CrossRef ] [ Google Scholar ]
  • Mukhopadhyay Arunabha, Chatterjee Samir, Bagchi Kallol K, Kirs Peteer J, Shukla Girja K. Cyber Risk Assessment and Mitigation (CRAM) framework using Logit and Probit models for cyber insurance. Information Systems Frontiers. 2019; 21 (5):997–1018. doi: 10.1007/s10796-017-9808-5. [ CrossRef ] [ Google Scholar ]
  • Murphey, H. 2021a. Biden signs executive order to strengthen US cyber security. https://www.ft.com/content/4d808359-b504-4014-85f6-68e7a2851bf1?accessToken=zwAAAXl0_ifgkc9NgINZtQRAFNOF9mjnooUb8Q.MEYCIQDw46SFWsMn1iyuz3kvgAmn6mxc0rIVfw10Lg1ovJSfJwIhAK2X2URzfSqHwIS7ddRCvSt2nGC2DcdoiDTG49-4TeEt&sharetype=gift?token=fbcd6323-1ecf-4fc3-b136-b5b0dd6a8756 . Accessed 7 May 2021.
  • Murphey, H. 2021b. Millions of connected devices have security flaws, study shows. https://www.ft.com/content/0bf92003-926d-4dee-87d7-b01f7c3e9621?accessToken=zwAAAXnA7f2Ikc8L-SADkm1N7tOH17AffD6WIQ.MEQCIDjBuROvhmYV0Mx3iB0cEV7m5oND1uaCICxJu0mzxM0PAiBam98q9zfHiTB6hKGr1gGl0Azt85yazdpX9K5sI8se3Q&sharetype=gift?token=2538218d-77d9-4dd3-9649-3cb556a34e51 . Accessed 6 May 2021.
  • Murugesan V, Shalinie M, Yang MH. Design and analysis of hybrid single packet IP traceback scheme. IET Networks. 2018; 7 (3):141–151. doi: 10.1049/iet-net.2017.0115. [ CrossRef ] [ Google Scholar ]
  • Mwitondi KS, Zargari SA. An iterative multiple sampling method for intrusion detection. Information Security Journal. 2018; 27 (4):230–239. doi: 10.1080/19393555.2018.1539790. [ CrossRef ] [ Google Scholar ]
  • Neto NN, Madnick S, De Paula AMG, Borges NM. Developing a global data breach database and the challenges encountered. ACM Journal of Data and Information Quality. 2021; 13 (1):33. doi: 10.1145/3439873. [ CrossRef ] [ Google Scholar ]
  • Nurse, J.R.C., L. Axon, A. Erola, I. Agrafiotis, M. Goldsmith, and S. Creese. 2020. The data that drives cyber insurance: A study into the underwriting and claims processes. In 2020 International conference on cyber situational awareness, data analytics and assessment (CyberSA), 15–19 June 2020.
  • Oliveira N, Praca I, Maia E, Sousa O. Intelligent cyber attack detection and classification for network-based intrusion detection systems. Applied Sciences—Basel. 2021; 11 (4):21. doi: 10.3390/app11041674. [ CrossRef ] [ Google Scholar ]
  • Page Matthew J, McKenzie Joanne E, Bossuyt Patrick M, Boutron Isabelle, Hoffmann Tammy C, Mulrow Cynthia D, Shamseer Larissa, Tetzlaff Jennifer M, Akl Elie A, Brennan Sue E, Chou Roger, Glanville Julie, Grimshaw Jeremy M, Hróbjartsson Asbjørn, Lalu Manoj M, Li Tianjing, Loder Elizabeth W, Mayo-Wilson Evan, McDonald Steve, McGuinness Luke A, Stewart Lesley A, Thomas James, Tricco Andrea C, Welch Vivian A, Whiting Penny, Moher David. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. Systematic Reviews. 2021; 10 (1):89. doi: 10.1186/s13643-021-01626-4. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Pajouh HH, Javidan R, Khayami R, Dehghantanha A, Choo KR. A two-layer dimension reduction and two-tier classification model for anomaly-based intrusion detection in IoT backbone networks. IEEE Transactions on Emerging Topics in Computing. 2019; 7 (2):314–323. doi: 10.1109/TETC.2016.2633228. [ CrossRef ] [ Google Scholar ]
  • Parra GD, Rad P, Choo KKR, Beebe N. Detecting Internet of Things attacks using distributed deep learning. Journal of Network and Computer Applications. 2020; 163 :13. doi: 10.1016/j.jnca.2020.102662. [ CrossRef ] [ Google Scholar ]
  • Paté-Cornell ME, Kuypers M, Smith M, Keller P. Cyber risk management for critical infrastructure: A risk analysis model and three case studies. Risk Analysis. 2018; 38 (2):226–241. doi: 10.1111/risa.12844. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Pooser, D.M., M.J. Browne, and O. Arkhangelska. 2018. Growth in the perception of cyber risk: evidence from U.S. P&C Insurers. The Geneva Papers on Risk and Insurance—Issues and Practice 43 (2): 208–223. 10.1057/s41288-017-0077-9.
  • Pu, G., L. Wang, J. Shen, and F. Dong. 2021. A hybrid unsupervised clustering-based anomaly detection method. Tsinghua Science and Technology 26 (2): 146–153. 10.26599/TST.2019.9010051.
  • Qiu J, Luo W, Pan L, Tai Y, Zhang J, Xiang Y. Predicting the impact of android malicious samples via machine learning. IEEE Access. 2019; 7 :66304–66316. doi: 10.1109/ACCESS.2019.2914311. [ CrossRef ] [ Google Scholar ]
  • Qu X, Yang L, Guo K, Sun M, Ma L, Feng T, Ren S, Li K, Ma X. Direct batch growth hierarchical self-organizing mapping based on statistics for efficient network intrusion detection. IEEE Access. 2020; 8 :42251–42260. doi: 10.1109/ACCESS.2020.2976810. [ CrossRef ] [ Google Scholar ]
  • Shafiur Rahman, Md, Sajal Halder Md, Uddin Ashraf, Acharjee Uzzal Kumar. An efficient hybrid system for anomaly detection in social networks. Cybersecurity. 2021; 4 (1):10. doi: 10.1186/s42400-021-00074-w. [ CrossRef ] [ Google Scholar ]
  • Ramaiah M, Chandrasekaran V, Ravi V, Kumar N. An intrusion detection system using optimized deep neural network architecture. Transactions on Emerging Telecommunications Technologies. 2021; 32 (4):17. doi: 10.1002/ett.4221. [ CrossRef ] [ Google Scholar ]
  • Raman, M.R.G., K. Kannan, S.K. Pal, and V.S.S. Sriram. 2016. Rough set-hypergraph-based feature selection approach for intrusion detection systems. Defence Science Journal 66 (6): 612–617. 10.14429/dsj.66.10802.
  • Rathore, S., J.H. Park. 2018. Semi-supervised learning based distributed attack detection framework for IoT. Applied Soft Computing 72: 79–89. 10.1016/j.asoc.2018.05.049.
  • Romanosky Sasha, Ablon Lillian, Kuehn Andreas, Jones Therese. Content analysis of cyber insurance policies: How do carriers price cyber risk? Journal of Cybersecurity (oxford) 2019; 5 (1):tyz002. [ Google Scholar ]
  • Sarabi A, Naghizadeh P, Liu Y, Liu M. Risky business: Fine-grained data breach prediction using business profiles. Journal of Cybersecurity. 2016; 2 (1):15–28. doi: 10.1093/cybsec/tyw004. [ CrossRef ] [ Google Scholar ]
  • Sardi Alberto, Rizzi Alessandro, Sorano Enrico, Guerrieri Anna. Cyber risk in health facilities: A systematic literature review. Sustainability. 2021; 12 (17):7002. doi: 10.3390/su12177002. [ CrossRef ] [ Google Scholar ]
  • Sarker Iqbal H, Kayes ASM, Badsha Shahriar, Alqahtani Hamed, Watters Paul, Ng Alex. Cybersecurity data science: An overview from machine learning perspective. Journal of Big Data. 2020; 7 (1):41. doi: 10.1186/s40537-020-00318-5. [ CrossRef ] [ Google Scholar ]
  • Scopus. 2021. Factsheet. https://www.elsevier.com/__data/assets/pdf_file/0017/114533/Scopus_GlobalResearch_Factsheet2019_FINAL_WEB.pdf . Accessed 11 May 2021.
  • Sentuna A, Alsadoon A, Prasad PWC, Saadeh M, Alsadoon OH. A novel Enhanced Naïve Bayes Posterior Probability (ENBPP) using machine learning: Cyber threat analysis. Neural Processing Letters. 2021; 53 (1):177–209. doi: 10.1007/s11063-020-10381-x. [ CrossRef ] [ Google Scholar ]
  • Shaukat K, Luo SH, Varadharajan V, Hameed IA, Chen S, Liu DX, Li JM. Performance comparison and current challenges of using machine learning techniques in cybersecurity. Energies. 2020; 13 (10):27. doi: 10.3390/en13102509. [ CrossRef ] [ Google Scholar ]
  • Sheehan B, Murphy F, Mullins M, Ryan C. Connected and autonomous vehicles: A cyber-risk classification framework. Transportation Research Part a: Policy and Practice. 2019; 124 :523–536. doi: 10.1016/j.tra.2018.06.033. [ CrossRef ] [ Google Scholar ]
  • Sheehan Barry, Murphy Finbarr, Kia Arash N, Kiely Ronan. A quantitative bow-tie cyber risk classification and assessment framework. Journal of Risk Research. 2021; 24 (12):1619–1638. doi: 10.1080/13669877.2021.1900337. [ CrossRef ] [ Google Scholar ]
  • Shlomo A, Kalech M, Moskovitch R. Temporal pattern-based malicious activity detection in SCADA systems. Computers & Security. 2021; 102 :17. doi: 10.1016/j.cose.2020.102153. [ CrossRef ] [ Google Scholar ]
  • Singh KJ, De T. Efficient classification of DDoS attacks using an ensemble feature selection algorithm. Journal of Intelligent Systems. 2020; 29 (1):71–83. doi: 10.1515/jisys-2017-0472. [ CrossRef ] [ Google Scholar ]
  • Skrjanc I, Ozawa S, Ban T, Dovzan D. Large-scale cyber attacks monitoring using Evolving Cauchy Possibilistic Clustering. Applied Soft Computing. 2018; 62 :592–601. doi: 10.1016/j.asoc.2017.11.008. [ CrossRef ] [ Google Scholar ]
  • Smart, W. 2018. Lessons learned review of the WannaCry Ransomware Cyber Attack. https://www.england.nhs.uk/wp-content/uploads/2018/02/lessons-learned-review-wannacry-ransomware-cyber-attack-cio-review.pdf . Accessed 7 May 2021.
  • Sornette D, Maillart T, Kröger W. Exploring the limits of safety analysis in complex technological systems. International Journal of Disaster Risk Reduction. 2013; 6 :59–66. doi: 10.1016/j.ijdrr.2013.04.002. [ CrossRef ] [ Google Scholar ]
  • Sovacool Benjamin K. The costs of failure: A preliminary assessment of major energy accidents, 1907–2007. Energy Policy. 2008; 36 (5):1802–1820. doi: 10.1016/j.enpol.2008.01.040. [ CrossRef ] [ Google Scholar ]
  • SpringerLink. 2021. Journal Search. https://rd.springer.com/search?facet-content-type=%22Journal%22 . Accessed 11 May 2021.
  • Stojanovic B, Hofer-Schmitz K, Kleb U. APT datasets and attack modeling for automated detection methods: A review. Computers & Security. 2020; 92 :19. doi: 10.1016/j.cose.2020.101734. [ CrossRef ] [ Google Scholar ]
  • Subroto A, Apriyana A. Cyber risk prediction through social media big data analytics and statistical machine learning. Journal of Big Data. 2019 doi: 10.1186/s40537-019-0216-1. [ CrossRef ] [ Google Scholar ]
  • Tan Z, Jamdagni A, He X, Nanda P, Liu RP, Hu J. Detection of denial-of-service attacks based on computer vision techniques. IEEE Transactions on Computers. 2015; 64 (9):2519–2533. doi: 10.1109/TC.2014.2375218. [ CrossRef ] [ Google Scholar ]
  • Tidy, J. 2021. Irish cyber-attack: Hackers bail out Irish health service for free. https://www.bbc.com/news/world-europe-57197688 . Accessed 6 May 2021.
  • Tuncer T, Ertam F, Dogan S. Automated malware recognition method based on local neighborhood binary pattern. Multimedia Tools and Applications. 2020; 79 (37–38):27815–27832. doi: 10.1007/s11042-020-09376-6. [ CrossRef ] [ Google Scholar ]
  • Uhm Y, Pak W. Service-aware two-level partitioning for machine learning-based network intrusion detection with high performance and high scalability. IEEE Access. 2021; 9 :6608–6622. doi: 10.1109/ACCESS.2020.3048900. [ CrossRef ] [ Google Scholar ]
  • Ulven JB, Wangen G. A systematic review of cybersecurity risks in higher education. Future Internet. 2021; 13 (2):1–40. doi: 10.3390/fi13020039. [ CrossRef ] [ Google Scholar ]
  • Vaccari I, Chiola G, Aiello M, Mongelli M, Cambiaso E. MQTTset, a new dataset for machine learning techniques on MQTT. Sensors. 2020; 20 (22):17. doi: 10.3390/s20226578. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Valeriano B, Maness RC. The dynamics of cyber conflict between rival antagonists, 2001–11. Journal of Peace Research. 2014; 51 (3):347–360. doi: 10.1177/0022343313518940. [ CrossRef ] [ Google Scholar ]
  • Varghese JE, Muniyal B. An Efficient IDS framework for DDoS attacks in SDN environment. IEEE Access. 2021; 9 :69680–69699. doi: 10.1109/ACCESS.2021.3078065. [ CrossRef ] [ Google Scholar ]
  • Varsha M. V., Vinod P., Dhanya K. A. Identification of malicious android app using manifest and opcode features. Journal of Computer Virology and Hacking Techniques. 2017; 13 (2):125–138. doi: 10.1007/s11416-016-0277-z. [ CrossRef ] [ Google Scholar ]
  • Velliangiri S, Pandey HM. Fuzzy-Taylor-elephant herd optimization inspired Deep Belief Network for DDoS attack detection and comparison with state-of-the-arts algorithms. Future Generation Computer Systems—the International Journal of Escience. 2020; 110 :80–90. doi: 10.1016/j.future.2020.03.049. [ CrossRef ] [ Google Scholar ]
  • Verma A, Ranga V. Machine learning based intrusion detection systems for IoT applications. Wireless Personal Communications. 2020; 111 (4):2287–2310. doi: 10.1007/s11277-019-06986-8. [ CrossRef ] [ Google Scholar ]
  • Vidros S, Kolias C, Kambourakis G, Akoglu L. Automatic detection of online recruitment frauds: Characteristics, methods, and a public dataset. Future Internet. 2017; 9 (1):19. doi: 10.3390/fi9010006. [ CrossRef ] [ Google Scholar ]
  • Vinayakumar R, Alazab M, Soman KP, Poornachandran P, Al-Nemrat A, Venkatraman S. Deep learning approach for intelligent intrusion detection system. IEEE Access. 2019; 7 :41525–41550. doi: 10.1109/access.2019.2895334. [ CrossRef ] [ Google Scholar ]
  • Walker-Roberts S, Hammoudeh M, Aldabbas O, Aydin M, Dehghantanha A. Threats on the horizon: Understanding security threats in the era of cyber-physical systems. Journal of Supercomputing. 2020; 76 (4):2643–2664. doi: 10.1007/s11227-019-03028-9. [ CrossRef ] [ Google Scholar ]
  • Web of Science. 2021. Web of Science: Science Citation Index Expanded. https://clarivate.com/webofsciencegroup/solutions/webofscience-scie/ . Accessed 11 May 2021.
  • World Economic Forum. 2020. WEF Global Risk Report. http://www3.weforum.org/docs/WEF_Global_Risk_Report_2020.pdf . Accessed 13 May 2020.
  • Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018; 6 :35365–35381. doi: 10.1109/ACCESS.2018.2836950. [ CrossRef ] [ Google Scholar ]
  • Xu, C., J. Zhang, K. Chang, and C. Long. 2013. Uncovering collusive spammers in Chinese review websites. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management.
  • Yang J, Li T, Liang G, He W, Zhao Y. A Simple recurrent unit model based intrusion detection system with DCGAN. IEEE Access. 2019; 7 :83286–83296. doi: 10.1109/ACCESS.2019.2922692. [ CrossRef ] [ Google Scholar ]
  • Yuan BG, Wang JF, Liu D, Guo W, Wu P, Bao XH. Byte-level malware classification based on Markov images and deep learning. Computers & Security. 2020; 92 :12. doi: 10.1016/j.cose.2020.101740. [ CrossRef ] [ Google Scholar ]
  • Zhang S, Ou XM, Caragea D. Predicting cyber risks through national vulnerability database. Information Security Journal. 2015; 24 (4–6):194–206. doi: 10.1080/19393555.2015.1111961. [ CrossRef ] [ Google Scholar ]
  • Zhang Ying, Li Peisong, Wang Xinheng. Intrusion detection for IoT based on improved genetic algorithm and deep belief network. IEEE Access. 2019; 7 :31711–31722. doi: 10.1109/ACCESS.2019.2903723. [ CrossRef ] [ Google Scholar ]
  • Zheng, Muwei, Hannah Robbins, Zimo Chai, Prakash Thapa, and Tyler Moore. 2018. Cybersecurity research datasets: taxonomy and empirical analysis. In 11th {USENIX} workshop on cyber security experimentation and test ({CSET} 18).
  • Zhou X, Liang W, Shimizu S, Ma J, Jin Q. Siamese neural network based few-shot learning for anomaly detection in industrial cyber-physical systems. IEEE Transactions on Industrial Informatics. 2021; 17 (8):5790–5798. doi: 10.1109/TII.2020.3047675. [ CrossRef ] [ Google Scholar ]
  • Zhou YY, Cheng G, Jiang SQ, Dai M. Building an efficient intrusion detection system based on feature selection and ensemble classifier. Computer Networks. 2020; 174 :17. doi: 10.1016/j.comnet.2020.107247. [ CrossRef ] [ Google Scholar ]


  • Open access
  • Published: 05 May 2023

Does commerce promote theft? A quantitative study from Beijing, China

  • Yutian Jiang (ORCID: orcid.org/0000-0001-9839-4551) &
  • Na Zhang

Humanities and Social Sciences Communications, volume 10, Article number: 203 (2023)


  • Criminology
  • Environmental studies

Commerce, as both an environmental and a social factor, is essential to the study of the causes of urban crimes. This paper aims to comprehensively propose research hypotheses based on these two types of commercial factors and optimise statistical tools with which to analyse commerce’s impact on the level of theft in Beijing. Combining criminal verdicts, census data, points of interest, and information on nighttime lighting, this paper first applies a hierarchical regression model to verify the effectiveness of using commercial environmental and social factors to explain theft statistics and then constructs a structural equation model to analyse the joint influence of multiple commercial factors on those statistics. This paper finds that Beijing’s commerce does not significantly promote theft, verifies the effectiveness of two types of commercial variables and the corresponding Western theories in explaining commerce’s impact on theft in Beijing, and provides empirical data for the study of the causes of theft in a non-Western context.


Introduction

In contemporary China, urbanisation has caused the population to rapidly concentrate in large cities (Lu, 2016), thus promoting the aggregation of social activities such as crime, which undoubtedly exacerbates the vulnerability of urban security. Park (1915), the founder of the Chicago School, advanced the classic topic that “Crime is the problem of the city.” Currently, there is far more crime in large cities than in small cities or rural areas, and this crime is more violent (De Nadai et al., 2020). Among urban crimes in China, theft is of particular concern. First, theft accounts for an extremely high proportion of all crimes. From 1978 to 2019, theft ranked first among all criminal cases filed by public security forces. In 2020, theft ranked second only to property fraud (affected by the depressed economic environment during COVID-19, this type of crime, represented by telecom fraud, saw a 10.58% surge; see Footnote 1). It can be seen that the theft problem is severe and has not been fundamentally addressed. At the same time, theft generally arises from a temporary intention and is thus more susceptible to the influence of the surrounding environment, so spatial prevention and control measures can be used to effectively suppress it (Shan, 2020). In summary, it is necessary to conduct spatial research on theft in Beijing to effectively govern the growing theft problem, reduce governance costs, maintain urban stability, improve residents’ happiness, and provide a reference to be used by other major cities in China to solve their own theft problems.

In studying the causes of urban crime such as theft, commerce (see Footnote 2) deserves special attention. First, commerce is an essential part of urban functions in the Athens Charter, and commercial or economic conditions are common and major causes of theft that need to be studied. Second, the dual economic structure is an unavoidable and vital phenomenon in China. Urban and rural products or economies are isolated and therefore evolve separately to a certain extent and have their own characteristics. Thus, developed commerce and a prosperous economy distinguish cities from rural areas, so studying commerce is particularly important for understanding urban crimes. Since commerce is both an environmental and a social factor, it should be comprehensively characterised by these two types of commercial factors, and its impact on thefts should be analysed using various theories.

Based on the existing research, this paper aims to effectively explain the impact of commerce on thefts through reasonable hypothesis construction and optimised statistical tests (Weisburd et al., 2014), to verify whether Western-derived theories such as environmental criminology and social disorganisation theory (SDT) apply in non-Western settings (Piquero et al., 2019), and to provide this field with an Eastern empirical study conducted in Beijing, China. First, this paper uses correlation analysis and a hierarchical regression model to verify the applicability and effectiveness of the two types of commercial factors in explaining the impact of commerce on thefts in Beijing. On this basis, this paper constructs a structural equation model that can reflect the primary spatial relationships and analyses the joint influence of multiple commercial factors on thefts with the help of the latent variable of commerce, so that the commerce→theft relationship is more intuitive.

The research significance of this paper is mainly reflected in the following two points: (1) By studying cases of theft, which have been judged to be criminal offences and account for a very high proportion of the total crime in Beijing, this research is meant to provide insight into the impact of commerce on theft in Beijing, providing a reference for the prevention and control of urban theft and a basis for protecting the happiness of urban residents (Cheng and Smyth, 2015). (2) Criminological research is more highly developed in the West. For example, environmental criminology research focuses on European and American cities by applying Western perspectives (Musah et al., 2020). Due to the extreme lack of statistics at the community level and the fact that most criminologists come from a legal background and lack sociology skills, Chinese SDT-based research still faces difficulties, such as innovating data sources and overcoming the challenges of combining interdisciplinary theories. This is why the total amount of such research is still small. Expanding the research scope to investigate new situations can provide more cases for research in these fields. By studying the theft problem in Beijing, a Chinese megacity, this paper can test whether the influencing factors identified in Western research also explain the same problem in China, and can explore whether these factors can be manipulated to guide the formulation and implementation of prevention and control measures in large Chinese cities.

The main innovative contributions of this paper are as follows: In addition to providing Eastern empirical data, this paper selects two statistical models that complement each other and most closely match the dual characteristics of commerce. Ideal statistical models are applied to verify the accuracy of the research assumptions and variable selection, assure consistency with realities and statistical requirements, and facilitate the study of the combined impact of commerce on thefts, given that commerce is both environmental and social in nature. The hierarchical regression model can be used to highlight the optimal choice of commercial environmental and social variables, to prove the validity and applicability of the two by placing these two types of variables at different levels, and to verify certain research hypotheses, their construction method and the partial results of the structural equation model. The structural equation model can not only depict realistic spatial relationships and meet the statistical requirements, but it can also analyse the comprehensive impact of the two types of commercial variables on theft through the application of the latent variable, making the relationship between commerce and theft intuitively visible. By selecting statistical models that are more compatible with the dual characteristics of commerce, this paper better analyses the comprehensive impact of commerce on theft in Beijing and provides ideas for future researchers.
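The idea of placing the two types of variables at different levels can be illustrated with a minimal sketch of hierarchical (blockwise) regression. The sketch below is not the authors' model: the data are synthetic, the variable names (`env`, `soc`, `theft`) are hypothetical, and entering the environmental block first is an assumption made only for illustration. It fits an ordinary least squares model on the first block, then on both blocks, and reports the change in R² attributable to the second block.

```python
import random


def ols_r2(X, y):
    """Return R^2 of an OLS fit of y on the columns of X (intercept added)."""
    n = len(y)
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Build the normal equations (X'X) b = X'y as an augmented matrix.
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum(
        (yi - sum(b * x for b, x in zip(beta, r))) ** 2
        for r, yi in zip(rows, y)
    )
    return 1.0 - ss_res / ss_tot


# Synthetic data (assumption): one environmental and one social predictor.
random.seed(0)
n = 200
env = [random.gauss(0, 1) for _ in range(n)]  # hypothetical environmental variable
soc = [random.gauss(0, 1) for _ in range(n)]  # hypothetical social variable
theft = [0.5 * e + 0.8 * s + random.gauss(0, 1) for e, s in zip(env, soc)]

# Level 1: environmental block only. Level 2: environmental + social blocks.
r2_env = ols_r2([[e] for e in env], theft)
r2_full = ols_r2([[e, s] for e, s in zip(env, soc)], theft)
delta_r2 = r2_full - r2_env  # variance explained by the social block beyond level 1

print(f"R2 level 1: {r2_env:.3f}, R2 level 2: {r2_full:.3f}, delta R2: {delta_r2:.3f}")
```

A clearly positive change in R² at the second level is the kind of evidence such a design uses to argue that the second block of variables adds explanatory power beyond the first.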

Literature review

Influencing factors of theft

Existing studies have multiple interpretations of the spatial distribution characteristics of theft, covering both the environmental and social factors in play at the place where theft occurs. In a residential theft study, Wu et al. (2015) found that street network structure and socioeconomic factors were dual factors influencing residential theft and drew some important conclusions, for example, that high street permeability can dampen theft and that commerce is positively correlated with theft. Chen et al. (2017) studied the impact of sociodemographic factors on residential theft. They found that factors such as the proportion of the population who rent their houses and the proportion of residents who were originally from other provinces have a positive effect on the number of residential thefts and that factors such as the percentage of residents with a bachelor’s degree or higher are negatively correlated with the number of residential thefts when controlling for some factors in the bus system. Xiao et al. (2018) found that the distance of crime in residential theft cases was closely related to the characteristics of the home community and the target community. They found that after searching for crime targets in and near their home community, perpetrators then decide whether to commit crimes at a longer distance based on factors such as the wealth of the target community. In a general theft study, based on social disorganisation theory and routine activity theory (RAT), Liu and Zhu (2016) argued that community social characteristics largely influence the incidence of theft. They found that communities with higher-than-average bar and department store density have a higher risk of theft. Yue et al. (2018) found that a range of factors related to socioeconomic and communal facilities impacted thefts and validated this effect with sociodemographic data and facility data. 
They found that higher local permeability ensures better street safety, while higher nonlocal permeability poses a threat to security. At the same time, Mao et al. (2018) found that Shanghai vehicle theft occurs in stable crime hotspots. They suggested that highly mobile and short-lived populations, higher population densities, and more traffic flows all positively impacted motor vehicle thefts. These findings also support routine activity theory. It can be seen from these observations that many factors influence theft and that the focus of existing studies varies, so researchers must make deliberate choices and trade-offs among research factors. Among them, commercial or economic factors are the most common and major research factors. Scholars have used a variety of datasets to measure their impact on theft from different perspectives.

Commerce and thefts

Commerce can be seen as an environmental factor. Related research explains the relationship between commercial environmental factors and theft with the help of rational choice theory (RCT) and RAT. RCT is an essential foundational theory of environmental criminology that posits that potential perpetrators weigh the potential benefits and consequences of theft and then rationally judge whether to commit the theft. On this basis, RAT proposes that theft can occur when potential offenders, suitable criminal targets, and incompetent supervision converge at a particular crime location. In environmental criminology, a basic consensus is that drinking places and other undesirable enterprises can increase the theft rate (Liu and Zhu, 2016). Yu and Maxfield (2014) studied the impact of ordinary enterprises on commercial and residential theft (but their research results only pertain to commercial theft and not residential theft). They found that ordinary enterprises also increase the risk of theft because these harmless or ordinary places can expose targets to the criminal population. Sohn (2016) studied the relationship between commercial land use and residential theft and found that the relationship between the two is relatively complex. For example, the impact of commercial land use on residential theft varies by the type of commercial facilities, and not all commercial uses increase crimes in the area. That is, an increase in specific commercial land use may either increase or decrease crime. Based on this finding, this paper suggests that when commercial facilities that encourage legitimate activities are incorporated into the community, they can offset the opportunity effect of attracting offenders by enhancing the positive effect of surveillance. According to the regression analysis results, Mao et al. (2018) found that the number of commercial premises in an area has a restraining effect on motor vehicle theft cases in that area. 
They first argued that this result is contrary to common sense, since “areas with more commercial premises have more potential targets for crime”, but after empirical investigation they finally attributed this phenomenon to the idea that areas with more commercial premises also have better and more orderly area management. The relationship between commercial environmental factors and theft is complex and requires extensive verification, especially in non-Western settings.

At the same time, commerce can also be seen as a social factor. As SDT puts it, social factors are important determinants of crimes. Crimes are caused by people’s natural reactions to society. SDT explains the impact of society on crimes, with a strong emphasis on commercial or economic factors. This theory holds that economic growth and commercial prosperity have profoundly changed the social life of urban residents, driving social problems such as economic inequality and population migration and giving rise to crimes such as theft (Chen et al., 2021). Accordingly, Wu et al. (2015), Chen et al. (2017), Xiao et al. (2018), Liu and Zhu (2016), Yue et al. (2018), and Mao et al. (2018) all introduced social factors such as economy, population, and employment to explain the theft phenomenon. Using measures such as population mobility, housing stability, ethnic consistency, and urbanisation to capture neighbourhood characteristics, the relevant research attempts to introduce the structural characteristics of neighbourhood society to analyse crime causes. Most extant environmental criminology studies focus on the environmental factors of the places where thefts occur and introduce social factors to improve the accuracy of spatial theft analysis, thus comprehensively analysing the theft problem, which is a problem that is closely related to society and the environment simultaneously (Tang et al., 2019). This is a valuable exploration, but due to the limited social data sources at the community level in China, innovative and careful use of unofficial data is vital.

To address this issue, nighttime light data is used in this study to reflect commerce prosperity at the township level. In traditional criminological research, scholars have verified that nighttime lights can prevent crimes, and most of their studies have linked nighttime lights with street lighting as a microscopic measure (Welsh et al., 2022). Some recent studies have increasingly used nighttime data, such as nighttime lights or nighttime social media, to represent social factors. For example, Zhang et al. (2021) found that using nighttime social media data to measure burglary victims worked well, and that these data worked best in the early hours of the morning. They demonstrated that the use of daytime active population data did not explain burglary effectively, while the use of nighttime data had strong interpretive power. Based on the hypothesis advanced in crime pattern theory that edges affect crime, Liu et al. (2020) and Zhou et al. (2019) studied the compound edge effect. They verified the existence of compound edges and their effects on crime and proved that using nighttime light data to measure compound edges effectively improved model fit. Nighttime light data are crucial for understanding compound edges, including physical and social edges, and have the potential to represent social factors. In economics research, nighttime light data are often used to represent regional economic situations. Economists have verified the validity of nighttime light data in representing macro- and meso-economic conditions. They note the shortcomings of these data in reflecting the economic situations of low-density cities or rural areas and microscopic lighting differences, as well as their advantages: high spatial accuracy, the ability to compensate for missing GDP data, and suitability for scales as fine as one square kilometre (Gibson, 2021; Gibson et al., 2020; Gibson et al., 2021). 
Therefore, this paper uses nighttime light data to represent the commerce prosperity of Beijing, a high-density developed city, at the meso township level (1.09 to 84.5 km²) and uses fine data to measure the commercial social factors and capture the characteristics of areas where crime occurs more accurately through the use of innovative data sources (Snaphaan and Hardyns, 2021).
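As a rough illustration of how gridded nighttime light values could be summarised at the township level, the sketch below averages pixel radiance within each zone. The aggregation rule (a simple per-township mean), the function name `zonal_mean`, and the toy grid are all assumptions for illustration; the text above does not specify how the light data were aggregated.

```python
from collections import defaultdict


def zonal_mean(radiance, zone_ids):
    """Average pixel radiance per zone.

    radiance: 2D list of pixel values.
    zone_ids: 2D list of the same shape giving each pixel's zone label,
              or None for pixels outside every zone.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rad_row, zone_row in zip(radiance, zone_ids):
        for value, zone in zip(rad_row, zone_row):
            if zone is None:
                continue  # skip pixels that fall outside all townships
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


# Toy 2x3 grid (hypothetical values): townships "A" and "B", one unassigned pixel.
radiance = [
    [10.0, 12.0, 3.0],
    [14.0, 4.0, 5.0],
]
zone_ids = [
    ["A", "A", "B"],
    ["A", "B", None],
]

means = zonal_mean(radiance, zone_ids)
print(means)
```

Because township areas vary widely, a mean (rather than a sum) keeps the indicator comparable across units of different size; other summaries are equally plausible.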

Other factors and thefts

In addition to commerce, this paper examines the impact of transportation, police agencies, and population factors on theft.

According to the Athens Charter, the four functions of a city are residence, recreation, work and transportation. Commerce can reflect the recreation and work activities of city dwellers. Therefore, this paper introduces transportation factors with which to comprehensively evaluate the urban environment. Among these transportation factors, road accessibility, bus station location and density, number of bus routes, and traffic flow are often studied. For example, Liu et al. (2017) found that adding a one-way bus line can significantly decrease burglary throughout a region and other developed regions but does not have a statistically significant impact on burglary in developing regions. Chen et al. (2017) verified that bus routes and bus stop density exert a significant positive effect on burglary based on rational choice theory, as bus routes improve the accessibility of the area so that criminals can come and go freely and increase the levels of criminal motivation. Nevertheless, they also noted that some studies have found that inaccessible areas may be more vulnerable. The relationship between public transportation and theft in China needs more extensive empirical research. Commerce and transportation also interact with each other. Using correlation analysis, Porta et al. (2012) studied the relationship between street distribution and economic activities in Barcelona, Spain. They found that the spatial distribution of secondary economic activities, that is, the distribution of commercial activities such as retail commerce, hotels, restaurants, and cafes, is highly correlated with the distribution of streets. However, other commercial activities that are unrelated to the public have a low correlation with the spatial distribution of streets. 
This study suggested that secondary economic activities are more dependent on passers-by and are therefore more limited by their locations, while major economic activities make people more willing to travel to them through the attractiveness of their functions. The impact of transportation on theft and its interrelationship with commerce is analysed in this paper.

Police agencies are also important research subjects. Liu and Zhu (2016) and Liu et al. (2019) explain that the nonsignificant or positive relationship between policing factors and theft is due to the fact that the enhancement of policing is a natural response to high crime rates. Shan (2020) found that the crime containment effect of government agencies and police agencies is relatively limited near large commercial places because shopping malls attract crime and the spatial layout of institutions is difficult to adjust to crime hotspots in a timely manner. Blesse and Diegmann (2022) found that closing police stations increased car thefts and burglary cases because this practice reduced police visibility and criminal deterrence. Piza et al. (2020) found that police stations increase the visibility of police, thus creating a deterrent effect within the precinct. They also found that the substation’s role could be strengthened with increased policing activities.

In terms of population, concepts such as hukou, resident population, and floating population complicate the Chinese population issue. The hukou is a legal document in China that records household population information and is used to regulate population distribution and migration. Resident population describes those residents who have lived in a household for more than 6 months, including both the resident population with hukou and the resident floating population who have lived for more than 6 months in the area. The floating population refers to the resident population that can legally work for a long time but lacks certain rights in public services, education, and medical care compared to those with hukou. Jing et al. (2020) found that when people do not have local hukou, they feel a greater sense of social disorder, less social integration, and a greater fear of crime. Chen et al. (2017) demonstrated that the proportion of the population that rents their houses and the proportion of the population from other provinces promote residential theft and that the proportion of the population with an educational level of bachelor’s degree or higher inhibits residential theft. SDT explains these findings. Residential mobility leads to weak local social connections, low community belonging, and weak oversight and guardianship capacity, which further weakens informal social control networks and ultimately causes crime. As Beijing’s population density ranks high in China and the hukou is designed to regulate population distribution, the proportion of the floating population in Beijing’s resident population is high. At the same time, Beijing is China’s higher education and cultural centre, and the proportion of higher educated people in the floating population is high. Accordingly, this paper argues the following: (1) The resident population of each township officially announced by Beijing Municipality suitably reflects the neighbourhood characteristics related to crime. 
That is, it is ideal for explaining thefts. (2) Beijing’s resident population (including both the local population with hukou and the resident floating population) feels a lower level of social disorder and a higher level of social integration, which may discourage thefts.

In summary, existing research has addressed the environmental and social factors affecting thefts. Among them, commerce, transportation, police agencies, and population factors are vital. Because the national conditions of China differ from those of the West, there are specific differences between Chinese and Western studies. For example, Western studies often examine population heterogeneity in terms of race and immigration, whereas Chinese studies treat the resident population and the population with hukou as essential indicators. China’s transportation system is also more complex (Liu et al., 2017), so some Western findings may not apply to China. Therefore, this paper evaluates the impact of commerce on theft in Beijing, verifies the effectiveness and applicability of commercial environmental and social factors in explaining the relationship between commerce and theft in China, and provides Chinese empirical data for the study of theft causes.

Research design

Research objective and research method

Commerce is an important object in studying the causes of urban crimes. With the vigorous development of commerce in Beijing’s main urban area, exploring the relationship between commerce and theft is particularly important to understanding the theft phenomenon. This paper divides commerce into commercial environmental and commercial social factors, aiming to comprehensively evaluate the commercial situation in Beijing and thoroughly analyse the impact of commerce on thefts. After fundamental statistical analysis and geographical portrait research, this paper mainly applies hierarchical regression and structural equation models to achieve the above research objectives.

In the hierarchical regression model, other factors, commercial environmental factors, and commercial social factors can be entered at different levels. This highlights the combined impact of multiple factors on theft at each level, helps to identify the different roles that the various factors play in explaining the theft phenomenon, and makes it easier to judge the impact of each group of factors. Accordingly, the effectiveness of the two types of commercial variables in analysing the impact of commerce on thefts can be verified.

The structural equation model has two main advantages. First, compared with the multiple regression model, the structural equation model can portray the complex relationships between variables, which better aligns the model with reality. Therefore, considering the complex interrelationships between spatial factors, a structural equation model is necessary. Second, the structural equation model can introduce a latent variable to represent commercial factors, inferring it from observed variables such as restaurants rather than measuring it directly. Therefore, this model has lower requirements for data characteristics, and it compensates for some of the disadvantages of traditional methods such as path analysis and negative binomial regression. The structural equation model also has some disadvantages, however. First, although it is suitable for predicting dependent variables and verifying an established theoretical model, it places greater demands on the underlying theory and research assumptions. This is why this paper explores its topic using a hierarchical regression model before applying the structural equation model, aiming to ensure the correctness of certain assumptions and to test the necessity of introducing these two types of factors. Second, due to computational limitations, the structural equation model assumes that all relationships are linear (Najaf et al., 2018). In this paper, the latent variables that play an essential role in the structural equation model are used to infer, from multiple observed variables, the overall nature of factors such as commerce that are difficult to represent directly with a single indicator; to measure the relationships between this latent variable and other variables; and to make the relationship between commerce and theft, which is the focus of this paper, more intuitively visible (Skrondal and Rabe-Hesketh, 2007; Lee and Song, 2014).

In summary, given the high compatibility of hierarchical regression and structural equation models with the study’s purpose, this paper first applies the hierarchical regression model to analyse the applicability of commercial environmental and social factors in explaining theft numbers in Beijing, exploring the correctness of certain assumptions. After that, this paper takes the structural equation model as its main method and uses a latent variable to highlight the effect of commerce on theft.

Research hypotheses

Based on the existing research results (Wortley et al., 2021) and the actual situation in China, this paper formulates the conceptual analysis model shown in Fig. 1. This model later supports the structural equation model, which helps to overcome the limitations of traditional regression models, such as their limited explanatory power and equal treatment of individual variables. Drawing on the ideas of economic modelling, this paper aims to build a simple conceptual model that highlights the main points, rather than a complex and comprehensive one that depicts all aspects of the actual situation. This paper studies four types of factors that influence theft and first introduces the relationships between these four types of factors and theft. Since public transport facilities are almost entirely planned and built by the government in China and should be regarded as an exogenous variable that exerts an essential impact on other factors, this paper introduces public transport facilities→commerce and public transport facilities→public security institutions as two pathways that are in line with reality and have been experimentally verified. In this way, while meeting the statistical requirements, the model remains both realistic and concise.

figure 1

Conceptual analysis model. Commerce not only affects theft but also is affected by public transport, and public transport also affects theft and public security institutions. Theft is also influenced by public security institutions and permanent residents.

First, the number of public transportation facilities is an important indicator of the mobility of the population in a specific area, and it is also a common factor in the study of theft. According to the situation in China, areas with a greater number of public transportation facilities tend to receive more police forces to prevent potential crimes in these high-traffic areas. For example, the Beijing West Railway Station Substation, the Beijing West Railway Station Police Station, the Zhanbei Police Station, the Zhannan Police Station, and other police stations have been set up in proximity to the Beijing West Railway Station. The Beijing South Railway Station Police Station, Yangqiao Police Station, You’anmen Police Station, Xiluoyuan Police Station, and other police stations have also been set up in proximity to the Beijing South Railway Station. Given that public transportation facilities are mostly planned and constructed by the government and should be regarded as exogenous variables and that the distribution of public security agencies, including mobile police booths, police cars, and public security booths, is more flexible, the latter should be treated as endogenous variables as they are affected by the former. In addition, limited by economic costs, the spatial layout of public security institutions, such as public security substations and police stations, cannot be easily changed, which limits their flexibility (Shan, 2020). This paper advances the claim that the distribution of police forces is difficult to adjust flexibly and promptly to the distribution of thefts, so a one-way relationship is drawn between the two. This paper follows the assumption that public transportation facilities affect thefts, public transportation facilities affect public security institutions, and public security institutions affect thefts in that area.

Second, the relationship between commerce and thefts can be explained by environmental criminology and SDT. The routine activity theory holds that crimes originate from daily life, and commercial premises undertake two primary functions of urban areas, namely, recreation and work, which are closely related to people’s everyday lives. By providing food, clothing, entertainment, etc., commercial premises such as catering, shopping, and leisure places are key nodes in the daily activities of both citizens and potential perpetrators and often see the highest incidence of theft because they limit the scope of people’s activities (Piquero et al., 2019). SDT values the commercial social factor of nighttime light intensity and claims that the neighbourhood characteristic of commercial prosperity is closely related to crimes such as thefts. As far as the Beijing urban area is concerned, the different commercial prosperity of each township reflects its particular urban function. For example, the Capital Airport township, with the highest nighttime light intensity, is an important transportation hub that is busy around the clock, whereas the Xiangshan township, with the lowest intensity, is a remote mountain and forest scenic area. Thus, their different levels of commercial prosperity mean that these townships’ urban functions and neighbourhood characteristics differ, leading to different effects of informal social control on theft. As a result, commerce is often seen as an influencing factor on theft. Based on the research results and the current situation in China, this paper posits that a complex relationship exists between public transportation facilities and commerce (Shan, 2020; Porta et al., 2012). In China, public transportation facilities can be considered to be exogenous variables. Therefore, this paper sets the relationship between the two as a one-way relationship in which public transportation facilities affect commerce and emphasises that this relationship varies across the specific circumstances of each country.

Third, the population factor is essential to theft, and resident population density is suitable data for studying theft in Beijing (Jing et al., 2020; Chen et al., 2017). SDT attaches great importance to informal social control, which describes the informal and unofficial activity of regional inhabitants meant to fight crime through collective intervention. The higher the density of the resident population in a township, the less mobile the population, and the more likely the regional inhabitants are to pursue shared values and consciously unite to implement regulatory activities to maintain effective social control and combat theft. Therefore, this paper sets the resident population density, an important measure of informal social control, as the explanatory factor for theft.

Study area, study variables, and data sources

The urban area of Beijing, which includes Dongcheng District, Xicheng District, Chaoyang District, Haidian District, Fengtai District, and Shijingshan District, was divided into 134 townships in 2018 and is located in the south-central part of Beijing. Correspondingly, other areas are considered rural areas or suburbs. With the development of urbanisation, the difference between these six districts and their suburbs has become increasingly apparent. As the heart of Beijing, the urban area has a higher population density and greater theft numbers than other areas, making it a natural object of urban crime research. This paper aims to study the problem of theft in cities, so the urban area of Beijing is selected as the research object. The township is chosen as the unit of analysis for three reasons: environmental criminology requires fine-grained spatial research; commercial social factors are challenging to measure at the micro level; and townships are the smallest administrative unit among the three levels of administrative divisions in China and the smallest announced unit for government plans and policies, such as police prevention and control policies. Accordingly, this paper explores, at the township level, the influence on theft numbers in Beijing of the most fundamental commercial social factors and of the commercial environmental factors that facilitate police deployment, comprehensively analysing the relationship between commerce and theft.

Study variables and data sources

Table 1 summarises the variables, data, and descriptive statistical indicators used in the study. The collinearity of all variables is checked using the variance inflation factor (VIF). Most VIF values lie between 1.885 and 3.093; the exceptions are the density of sports and leisure places, at 6.445, and the density of catering places, at 8.967. These values indicate that there is no unacceptably strong correlation among the variables.
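As a rough aid to reading these values, the VIF of a predictor is conventionally defined as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below inverts this definition for the four VIF values quoted above. The function name `implied_r_squared`, the dictionary labels, and the cut-off of 10 (a common rule of thumb) are assumptions made for illustration; the paper does not state which threshold it applies.

```python
def implied_r_squared(vif: float) -> float:
    """Invert VIF = 1 / (1 - R^2) to recover the auxiliary-regression R^2."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


# VIF values quoted in the text; the dictionary keys are illustrative labels.
reported_vifs = {
    "lower end of the typical range": 1.885,
    "upper end of the typical range": 3.093,
    "sports and leisure density": 6.445,
    "catering density": 8.967,
}

RULE_OF_THUMB = 10.0  # assumed cut-off, not stated in the paper

for label, vif in reported_vifs.items():
    r2 = implied_r_squared(vif)
    side = "above" if vif > RULE_OF_THUMB else "below"
    print(f"{label}: VIF = {vif:.3f}, implied R^2 = {r2:.3f} ({side} the assumed cut-off)")
```

Under this reading, a VIF of 8.967 means that roughly 89% of the variance in catering density is shared with the other predictors, which is high but still under the assumed cut-off.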

This paper examines criminal cases of theft for which case data are readily available and whose social impact is relatively severe. Referring to the classification of ordinary theft, burglary, and the theft of motor vehicles presented in the Law Yearbook of China, it sets aside the specific types of theft and focuses on ordinary theft. The numbers of thefts are calculated from the public judgement documents on the judgement document network, which is the official Chinese platform used to publish and provide all effective court judgement documents.

The process of collecting and processing the judgement document materials is as follows:

(1) Based on the judgement document network, this work retrieves the criminal judgements on theft that were published from January 1, 2018, to January 15, 2022 and manually screens them on the conditions that the actual place of the crime is Beijing and the actual time of the crime (rather than the time of publication of the documents) is 2018, which results in a total of 3088 relevant judgement documents being retained.

(2) This paper manually screens the 3088 judgement documents on the condition that the crime actually occurred in the urban area of Beijing. Among them, 1751 documents were related to thefts in the urban area, accounting for 56.70%, while 1337 documents were only related to thefts in the remaining ten districts, accounting for 43.30%. Despite its significantly smaller footprint, the urban area accounted for 56.70% of the total thefts in that year, which is in line with the classic statement that “crime is a problem of the city.”

(3) The location of each theft extracted from the judgement documents is identified among the 134 townships. Most of China’s theft judgements specify the criminal address at least to the level of the township jurisdiction, and this official uniform practice facilitates the extraction and precise location of criminal addresses. Of the 1751 documents, 243 contain insufficiently detailed addresses for positioning and 1508 provide sufficiently detailed addresses, thus enabling the successful location of 86.12% of the criminal theft offences in the six districts.

(4) Of the 1508 theft cases available for research, this paper classifies theft cases into the categories of ordinary theft, burglary, and motor vehicle theft using the classification that the Law Yearbook of China has used since 2010, which is in line with Chinese national conditions. A total of 976 ordinary thefts are identified, accounting for 64.72% of the total. This proportion is similar to the 67.91% of ordinary theft criminal cases reported by public security forces nationwide in 2018, indicating the representativeness of this study sample.

Finally, the number of ordinary thefts in each township is tallied. Among the crime addresses, 17 describe bus thefts in the form of “the theft occurred on the way from Station A to Station B.” In this regard, this paper assigns Stations A and B a weight of 0.5 each. When a case involves multiple thefts, this paper counts the thefts at each location once and avoids double-counting thefts at the same location.
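The screening shares and the counting rules described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a dict with `in_transit` and `locations` keys), the function names, and the example cases are hypothetical and are not taken from the study's data or code. Only the counts 3088, 1751, 1337, 1508, and 976 come from the text.

```python
from collections import defaultdict


def share(part: int, total: int) -> float:
    """Percentage of `total` made up by `part`, rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Screening shares, recomputed from the counts quoted in the text.
print(share(1751, 3088))  # urban share of the 2018 Beijing theft judgements
print(share(1508, 1751))  # urban judgements with a locatable address
print(share(976, 1508))   # ordinary thefts among the locatable urban cases


def tally_thefts(cases):
    """Count thefts per township under the two rules described in the text.

    Hypothetical record layout (an assumption of this sketch):
      {"in_transit": False, "locations": [(address, township), ...]}
      {"in_transit": True,  "locations": [(station_a, township_a),
                                          (station_b, township_b)]}
    """
    counts = defaultdict(float)
    for case in cases:
        if case["in_transit"]:
            # Theft "on the way from Station A to Station B": 0.5 to each end.
            for _, township in case["locations"]:
                counts[township] += 0.5
        else:
            # Count each distinct location once, even if it is listed twice.
            for _, township in set(case["locations"]):
                counts[township] += 1.0
    return dict(counts)


# Made-up cases for illustration only.
example_cases = [
    {"in_transit": False, "locations": [("Mall X", "Township 1"), ("Mall X", "Township 1")]},
    {"in_transit": False, "locations": [("Shop Y", "Township 1"), ("Shop Z", "Township 2")]},
    {"in_transit": True, "locations": [("Station A", "Township 2"), ("Station B", "Township 3")]},
]
print(tally_thefts(example_cases))
```

In the made-up example, the repeated "Mall X" address contributes one theft rather than two, and the in-transit case adds half a theft to each of the two station townships.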

The research period was chosen with care. First, given the objective existence of temporal crime hotspots, this paper, as a spatial study, analyses thefts on a yearly basis. Second, given that the disclosure of judgement documents takes a long time, judgement documents from the most recent years are unsuitable. On January 12, 2023, this paper retrieved criminal judgements on thefts in Beijing and obtained a total of 3752 criminal judgements in 2018, 3765 criminal judgements in 2019, 2449 criminal judgements in 2020, 1390 criminal judgements in 2021, and 389 criminal judgements in 2022, which shows that the numbers of judgements in later years are significantly lower. It is highly likely that the judgements for later years have not yet been fully disclosed and are therefore not suitable for this study. After careful consideration, this paper uses the 2018 judgement documents.

The nighttime light data obtained from global satellite observations from VIIRS_VNL V2 are among the academic community’s most widely used geospatial data products and have been verified to perform better when representing economic activities (Gibson et al., 2021). Based on the 2018 data and ArcGIS software, this paper calculates the average nighttime light intensity of the 134 townships in 2018.

The numbers of commercial premises, public security institutions, and public transportation facilities are tallied from the POI data provided by Amap. This paper uses POI data from 2018, which were continuously updated until the end of 2018, and filters out discontinued or deactivated points of interest. Based on these data, the various POI data of the 134 townships are extracted and counted. Specifically, among commercial premises, this paper first studies shopping places and catering places because these two types of places are closely related to people’s lives. Second, this paper also studies two types of commercial premises with high risks of theft, namely, finance and insurance establishments and sports and leisure establishments. Sports and leisure establishments include undesirable commercial premises such as bars and nightclubs, which are often of concern in environmental criminology. As it is difficult to assign certain commercial establishments, such as roller skating rinks, dance halls, and golf clubs, to either the sports or the leisure category, this paper combines these two types of places to avoid duplication of statistics. For the same reason, public security substations, police stations, police workstations, etc., are combined into one category of public security institutions. Stations and metro stations are used to represent the public transportation facilities of townships. Stations include bus and coach stations, and metro stations include light rail and subway stations. By distinguishing different places of the same type, the structural equation model can be best leveraged and the ecological fallacy can be avoided.

Limited by data availability, the population data for each township come from the Seventh National Census of China in 2020, the bulletin published closest to the study time.

The administrative division area data of each township were collected from Baidu Encyclopedia as of January 16, 2022. These data were verified against the area data of some townships officially released in the China County Statistical Yearbook (Township Volume) in 2018 and against area data estimated using ArcGIS software, and the encyclopaedia data were found to be accurate. Because the official data are incomplete and the software estimates contain some errors, the densities of the various facilities, premises, and populations are calculated based on the Baidu Encyclopedia data.

In summary, all the data are collated at the level of the official township administrative divisions. In the following, the hierarchical regression and structural equation models are analysed at the township level.

Regional characteristics

Thefts within the area

A large number of thefts have occurred in the urban area of Beijing. Across the 134 townships, thefts are clustered, as shown in Fig. 2. The theft numbers in certain townships are significantly higher, which may be related to the unique environmental and social conditions in these regions. The aggregation of thefts in the research area is thus verified, providing a basis for accurately portraying the cold spots and hotspots of crime and for the empirical analysis in the following sections.

figure 2

Statistical chart of the distribution of theft numbers. The number of thefts varies between townships.

When thefts are spatially clustered, the incidence of theft forms both hotspots and cold spots. Based on ArcGIS software, this paper draws a map of theft hotspots at the township level and intuitively displays the differences in theft numbers in various administrative regions in Fig. 3. In the urban area of Beijing, the regions with the highest theft numbers, and hence the crime hotspots, are concentrated in the central area and the southwest. At the same time, based on ArcGIS software, this paper calculates the spatial autocorrelation index of theft numbers in the urban area and obtains the following values: Moran’s I = 0.059218, Z-score = 2.759186, and P-value = 0.005795. There is a significant positive spatial autocorrelation in the urban theft numbers, and theft shows a significantly clustered distribution.
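For readers unfamiliar with the statistic, the following minimal sketch shows how a global Moran's I can be computed and how the reported Z-score maps to a two-sided P-value under a standard normal approximation. It assumes binary contiguity weights and uses a toy four-area example; the weighting scheme actually configured in ArcGIS is not stated in the text, and the function names are illustrative.

```python
import math


def morans_i(values, neighbours):
    """Global Moran's I with binary weights (w_ij = 1 if j neighbours i)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(len(neighbours[i]) for i in range(n))  # sum of all weights
    cross = sum(dev[i] * dev[j] for i in range(n) for j in neighbours[i])
    return (n / s0) * cross / sum(d * d for d in dev)


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Toy example: four areas in a row, each adjacent to the next (not study data).
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(morans_i([1, 1, 5, 5], chain))  # similar values side by side: positive
print(morans_i([1, 5, 1, 5], chain))  # alternating values: negative

# Consistency check on the figures reported in the text.
print(round(two_sided_p(2.759186), 6))
```

The last line reproduces the reported P-value of about 0.0058 from the reported Z-score, which suggests that the significance test is two-sided.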

figure 3

Theft hotspot map at the township level. Theft numbers in Beijing have hotspots in terms of space.

Various study factors in the area

Among the 134 townships, this paper classifies the 27 townships whose crime numbers are in the top 20% as hotspots and the 27 townships whose crime numbers are in the bottom 20% as cold spots and displays the means of the theft numbers and various factors in the hotspots, ordinary regions, and cold spots in Table 2. For ease of comparison, ten line charts drawn from these means are shown in Fig. 4.
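The grouping rule can be sketched as below. The sketch assumes that 20% of 134 is rounded up to 27 and that ties in theft counts are broken by input order; the paper states only the resulting group sizes, so both choices are assumptions, and the township names and counts are synthetic.

```python
def classify_townships(theft_counts, percent=20):
    """Label each township as 'hotspot', 'cold spot', or 'ordinary'."""
    n = len(theft_counts)
    k = -(-n * percent // 100)  # integer ceiling of n * percent / 100
    if 2 * k > n:
        raise ValueError("hotspot and cold-spot groups would overlap")
    ranked = sorted(theft_counts, key=theft_counts.get, reverse=True)
    hot, cold = set(ranked[:k]), set(ranked[n - k:])
    return {
        name: "hotspot" if name in hot else "cold spot" if name in cold else "ordinary"
        for name in theft_counts
    }


# Synthetic counts for 134 hypothetical townships (not the study data).
synthetic = {f"T{i:03d}": float(i) for i in range(134)}
labels = classify_townships(synthetic)
print(sum(v == "hotspot" for v in labels.values()))    # 27
print(sum(v == "cold spot" for v in labels.values()))  # 27
print(sum(v == "ordinary" for v in labels.values()))   # 80
```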

figure 4

Means of theft number and various factors. Theft number and various factors have significant differences in hot, ordinary, and cold spots.

As seen in Fig. 4, the nighttime light intensity, the densities of various commercial premises, the density of public security institutions, and the densities of various public transport facilities are significantly higher in hotspots than in cold spots. In contrast, the density of the resident population is lower in hotspots than in cold spots. This may be explained as follows:

(1) As stated by RAT, various commercial premises and public transport facilities serve as key nodes in the routine activities of both city dwellers and potential offenders and thus lead to high theft rates (see sections ‘Commerce and thefts’ and ‘Other factors and thefts’ for details).

(2) The degree of commercial prosperity as represented by the nighttime light intensity is the social factor and neighbourhood characteristic that SDT most focuses on. At the same time, according to RCT, a higher degree of commercial prosperity can attract potential thieves to a township because they might believe that it offers more profit opportunities.

(3) According to SDT and the specific situation of Beijing, the higher the resident population density is, the stronger the informal social control, and thus the greater the regulation of crime. This explains the higher density of the resident population in theft cold spots and suggests that the relationship between thefts and the resident population may be negative.

(4) As described in section ‘Other factors and thefts’, traditional criminology proposes, from the perspective of “deterrence,” that the presence of police agencies can suppress crimes. In Chinese environmental criminology studies, scholars have observed that the mean density of public security institutions in hotspots is much higher than in ordinary areas and cold spots and have explained this phenomenon as follows: “enhancing policing strength is a natural response to high crime rates” and “the ability of police agencies to suppress crimes is relatively limited near crime hotspots” (Liu and Zhu, 2016; Liu et al., 2019; Shan, 2020). This means that the relationship between thefts and police agencies may be positive.

Statistical analysis models are then established to test these theoretical speculations.

Research results based on the hierarchical regression model

Based on SPSS software, the relationship between each variable and TD was analysed using bivariate correlation analysis and two-tailed significance tests. Figure 5 displays the Pearson correlation coefficients, significance levels, and bivariate scatter plots. Multiple commercial premises factors and the nighttime light factor are significantly related to thefts. In other words, these factors may improve the ability of regression models to explain theft. Therefore, it is necessary to establish a hierarchical regression model that places these two types of factors at different levels and to observe their roles and effects in explaining thefts.
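As a minimal sketch of the bivariate analysis described above, the code below computes a Pearson correlation coefficient and the t statistic that underlies the usual two-tailed test of zero correlation. The data are toy values, not the study data, and the function names are illustrative. Converting the t statistic into a P-value requires the Student t distribution with n - 2 degrees of freedom, which is left to a statistics package.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Toy data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(r, t_statistic(r, len(x)))
```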

figure 5

TD theft, NI nighttime light intensity, SL sports and leisure places, SP shopping places, FI finance and insurance places, CA catering places, PD resident population, PS public security institutions, BS station, MS metro station, Sig. significance.

A hierarchical regression model based on multiple linear regression is designed in which each of the other variables serves as an independent variable and TD is the dependent variable (Table 3). This paper introduces the relevant data on the resident population, public security institutions, public transportation facilities, and commercial premises in Models 1–4, each in turn, and finally introduces the most critical observed factor, nighttime light intensity, to establish the final Model 0. Accordingly, this paper observes five types of factors at different levels, highlighting their different effects on the dependent variable.

In introducing variables at different levels, the relevant variables of public security institutions and public transportation facilities increased R², improving the model’s explanatory ability. The introduction of the relevant variables regarding commercial premises also improves the explanatory ability of the model. That is, introducing commercial environmental factors enhances the model’s ability to explain the theft phenomenon. This demonstrates the effectiveness and applicability of using environmental criminology theories in explaining the impact of commerce on theft in Beijing. After introducing nighttime light intensity, the ΔR² becomes the largest of the four values. That is, introducing a commercial social factor significantly enhances the model’s ability to explain the theft phenomenon. This means that theories such as SDT, which view commerce as a social factor, can provide practical assistance in explaining the theft phenomenon, so introducing this factor is extremely valuable. Therefore, the model results validate certain theoretical assumptions and support the construction of the structural equation model.
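The logic of entering blocks of variables in turn and comparing the change in R² can be sketched as follows. This is a toy illustration in plain Python, not the SPSS procedure used in the study: the data, the block names, and the helper functions are all hypothetical, and ordinary least squares is solved through the normal equations for brevity.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def r_squared(predictors, y):
    """R^2 of an OLS fit of y on the given predictor columns plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + predictors
    k = len(cols)
    xtx = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(k)]
           for i in range(k)]
    xty = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(beta[i] * cols[i][t] for i in range(k)) for t in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((y[t] - fitted[t]) ** 2 for t in range(n))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot


def hierarchical_r2(blocks, y):
    """Enter blocks cumulatively; return (block name, R^2, delta R^2) per step."""
    results, entered, previous = [], [], 0.0
    for name, columns in blocks:
        entered = entered + columns
        r2 = r_squared(entered, y)
        results.append((name, r2, r2 - previous))
        previous = r2
    return results


# Toy data: y depends exactly on both predictors (not the study data).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

steps = hierarchical_r2([("block A", [x1]), ("block B", [x2])], y)
for name, r2, delta in steps:
    print(f"{name}: R^2 = {r2:.4f}, delta R^2 = {delta:.4f}")
```

In the toy data the second block accounts for everything the first block leaves unexplained, so the final R² is 1; with real data, the size of each ΔR² is what indicates how much a newly entered block adds.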

In addition, the significance of the individual variables shows that the relationship between NI and TD is significant, which also verifies the value of NI data in explaining thefts. It is worth mentioning that the relationships between the other individual variables and TD are nonsignificant, so Model 0, which is based on multiple linear regression, is difficult to apply to the comprehensive analysis of thefts in Beijing. Therefore, a structural equation model is later introduced to reflect Beijing’s theft problem more thoroughly.

Research results based on the structural equation model

The fitting degree

Based on the conceptual analysis model constructed above, this paper uses Amos software and various data to fit and evaluate the model. The fitting indices of the structural equation model are displayed in Table 4. As seen from the table, each fitting index is within the acceptable range, and the fitting results of the model are good.

The measurement results of the model

This paper shows the model’s measurement results and standardised path coefficients in Fig. 6.

figure 6

BS station, MS metro station, PT public transport facilities, NI nighttime light intensity, SL sports and leisure places, SP shopping places, FI finance and insurance places, CA catering places, CI commerce, PS public security institutions, TD theft, PD resident population.

The factor loadings of the two observed variables for public transportation facilities are greater than or equal to 0.81, which indicates that they are strong indicators of public transportation facilities and that the factor selection in this paper is reasonable. The factor loadings of all four commercial environmental variables are greater than or equal to 0.79, indicating that they are all strong indicators of commerce. Commercial environmental factors can therefore be used effectively to characterise commerce. The commercial social variable NI has a factor loading of 0.67, which meets the statistical requirements and indicates that it is a helpful indicator for representing commerce. Combined with the hierarchical regression model results, this paper argues that this commercial social factor can also be used to effectively characterise commerce. The two types of commercial variables together, then, can effectively describe the commerce phenomenon in Beijing.

Variables’ relationships

The path test results of the structural equation model are displayed in Table 5.

As seen from the table:

(1) There is a nonsignificant positive relationship between PT and TD, a significant positive relationship between PT and PS, and a nonsignificant positive relationship between PS and TD. Public transportation facilities provide nonsignificant convenience for theft and, as crowded hubs, attract more policing. The positive relationship between policing and theft is consistent with previous findings from China (Liu and Zhu, 2016; Liu et al., 2019; Shan, 2020).

(2) There is a nonsignificant positive relationship between CI and TD and a significant positive relationship between PT and CI. Public transportation facilities drive commercial activities in the region, and commerce attracts thefts to a certain extent.

(3) There is a significant negative relationship between PD and TD at the 5% level. Beijing’s resident population effectively increases the level of informal social control to curb theft.

The above results (especially the signs of the effects) corroborate the conceptual analysis model and research hypotheses presented in section ‘Research hypotheses’. After verifying the correctness of the conceptual model, the impact of Beijing’s commerce on theft can be explored based on the relationship between the latent variable of CI and the latent variable of TD as highlighted by the structural equation model. In addition, the following points need to be made. In Models 2–4, Model 0, and the structural equation model, PD and TD are negatively related, which supports the credibility of this relationship. However, the relationship between PS and TD is significantly positive in Model 2, nonsignificantly positive in Models 3 and 4, and nonsignificantly negative in Model 0. Since the sign of this relationship varies across the more complete models and the relationship is not significant, it is difficult to judge its sign based on the results of the hierarchical regression model. It is precisely because multiple individual variables are not significant in the hierarchical regression model that a structural equation model is constructed, as it better aligns with actual situations and statistical requirements. Since the hierarchical regression model cannot be used to verify the relationship between PS and TD in the structural equation model, this relationship cannot be judged from the statistical results alone; instead, this paper relies on existing findings to support the view that the positive relationship has a certain degree of credibility (Liu and Zhu, 2016; Liu et al., 2019; Shan, 2020).

First, commerce does not significantly facilitate theft in Beijing. After verifying the effectiveness of the commercial environmental and social variables in explaining theft numbers using the hierarchical regression model, a structural equation model is constructed that is more realistic, more statistically satisfactory and more comprehensively analyses commerce’s influence on theft with the help of the latent variable of commerce. The relationship between the two is nonsignificant, as seen from the statistical results regarding CI → TD. The higher factor loads of commercial environmental variables in the structural equation model and the nonsignificant relationships between them, as well as the dependent variable in Model 0, can arguably be used to verify this relationship. In other words, the commercial environmental variables can better reflect Beijing’s commerce than the commercial social variables, and their positive impact on theft is minimal, so the promotion effect of Beijing’s commerce on theft is not significant overall.

Second, two kinds of commercial variables and their corresponding theories, namely, environmental criminology and SDT, can be used to effectively explain commerce’s impact on theft in Beijing. In the hierarchical regression model based on multiple linear regression, the introduction of commercial environmental and social variables improves the explanatory ability of the model, which means that commercial environmental and social variables can both be used to explain theft numbers and identify theft hotspots. In the structural equation model, the factor loads of the five observed variables of the commercial environmental and social categories are greater than or equal to 0.67, which shows that they are ideal indicators for representing commerce. Therefore, this paper argues that these two types of variables and their corresponding theories can be used to effectively explain the commerce → theft relationship in Beijing.

Conclusions

Commerce is the primary and essential dynamic in studying theft causes and has a complex relationship with theft. Hence, commerce’s impact on theft still needs to be further verified, especially in non-Western contexts. Given that classical theories such as environmental criminology and SDT mostly suggest that commerce can facilitate theft, this paper uses hierarchical regression and structural equation models to analyse the relationship between commerce and theft in Beijing and verifies the applicability and validity of the two types of commercial variables and classical Western-origin theories in explaining Beijing’s phenomenon. This paper reaches the following conclusions:

First, commerce does not significantly promote theft in Beijing. Given the limited effectiveness of Model 0 in the hierarchical regression model in explaining the combined impact of commerce on thefts, this paper constructs a structural equation model that is more in line with actual situations and statistical requirements. In the latter, a nonsignificant positive impact of commerce on theft can be detected, which can be verified to some extent by the former. Second, both types of commercial variables and classical theories, such as environmental criminology and SDT, are valid and applicable in interpreting the relationship between commerce and theft in Beijing. According to the hierarchical regression model, introducing commercial environmental and social variables improves the explanatory ability of the model. According to the structural equation model, both kinds of observed variables have high factor loads, i.e., factor loads of greater than or equal to 0.67, effectively representing commerce. Therefore, these two types of variables and the corresponding classical theories are essential in explaining commerce’s impact on theft in Beijing.

This paper has three main limitations. First, in terms of data selection, due to the limitations of data availability and the tendency to conduct research based on whole years, this paper uses only one year of theft data, so it is difficult to determine whether fluctuations occur between years. Due to the time lag involved in the discovery of crimes and the lack of information in the judgement documents, this paper fails to extract accurate and sufficient data on the time of thefts, so the time factor cannot be sufficiently accounted for. Second, in terms of method selection, although the main structural equation model in this paper solves some problems with traditional research methods, it also has some limitations, such as its assumption that all relationships are linear (Najaf et al., 2018 ). Third, this paper examines ordinary theft in Beijing and does not verify the model’s applicability in two specific areas of theft, namely, burglary and motor vehicle theft.

Based on the above limitations, this paper proposes some potential directions for improvement for future research. First, the study period can be extended and the time factor considered. Second, the fixed relationship assumption problem of the structural equation model can be solved and the nonlinear function form can be introduced into the model design, further developing the model built in this paper. Third, the model can be verified or modified by ascertaining the applicability of the model constructed in this paper to the study of other types of theft.

Data availability

The data supporting this study’s findings are available on request from the corresponding author.

According to the Law Yearbook of China, property fraud increased from 29.49% in 2019 to 40.07% in 2020, an overall proportion statistic that does not distinguish between categories of property fraud. According to the analysis of crime trends in the “Blue Book of Chinese Crime Governance (2020),” this phenomenon is related to the grim situation of cybercrimes. In 2020, the theft of personal information was frequent, promoting targeted telecom fraud in emails, text messages, etc.

Commerce is an organised activity that provides goods and services required by consumers and is an important part of urban functions. Commerce can be seen both as an activity that enables the circulation of goods and services in specific places and as an urban cultural phenomenon in social development. Therefore, this paper uses two types of factors to comprehensively characterise commerce in Beijing, namely, social and commercial environmental factors.

China adopts a dualistic approach to addressing theft, dividing related cases into criminal and public security cases. Criminal theft cases are comprehensively defined in a quantitative and qualitative manner based on the amount involved and the nature of the case. They accounted for 43.90% of all theft cases accepted by public security organs nationwide in 2020, seriously impact society, and are of great research value.

Blesse S, Diegmann A (2022) The place-based effects of police stations on crime: Evidence from station closures. J Public Econ. https://doi.org/10.1016/J.JPUBECO.2022.104605

Chen JG, Liu L, Zhou SH, Xiao LZ, Song GW, Ren F (2017) Modeling spatial effect in residential burglary: a case study from ZG City, China. ISPRS Int J Geo-Inform 5. https://doi.org/10.3390/ijgi6050138

Chen S, Gao CD, Jiang D, Hao MM, Ding FY, Ma T, Zhang SZ, Li SD (2021) The spatiotemporal pattern and driving factors of cyber fraud crime in China. ISPRS Int J Geo-Inform 12. https://doi.org/10.3390/IJGI10120802

Cheng ZM, Smyth R (2015) Crime victimisation, neighborhood safety and happiness in China. Econ Model 424–435. https://doi.org/10.1016/j.econmod.2015.08.027

De Nadai M, Xu YY, Letouzé E, Gonzalez MC, Lepri B (2020) Socio-economic, built environment, and mobility conditions associated with crime: a study of multiple cities. Sci Rep 1. https://doi.org/10.1038/s41598-020-70808-2

Gibson J (2021) Better night lights data, for longer. Oxford Bull Econ Stat 3. https://doi.org/10.1111/OBES.12417

Gibson J, Olivia S, Boe‐Gibson G (2020) Night lights in economics: sources and uses. J Econ Surv 5. https://doi.org/10.1111/joes.12387

Gibson J, Olivia S, Boe-Gibson G, Li C (2021) Which night lights data should we use in economics, and where? J Dev Econ (prepublish). https://doi.org/10.1016/J.JDEVECO.2020.102602

Jing FR, Liu L, Zhou SH, Song GW (2020) Examining the relationship between Hukou Status, perceived neighborhood conditions, and fear of crime in Guangzhou, China. Sustainability 22. https://doi.org/10.3390/su12229614

Lee SY, Song XY (2014) Bayesian structural equation model. Wiley Interdisciplinary Reviews: Comput Stat 4. https://doi.org/10.1002/wics.1311

Liu HQ, Zhu XY (2016) Exploring the influence of neighborhood characteristics on burglary risks: a Bayesian random effects modeling approach. ISPRS Int J Geo-Inform 7. https://doi.org/10.3390/ijgi5070102

Liu HQ, Zhu XY, Zhang DY, Liu Z (2019) Investigating contextual effects on burglary risks: a contextual effects model built based on Bayesian spatial modeling strategy. ISPRS Int J Geo-Inform 11. https://doi.org/10.3390/ijgi8110488

Liu L, Jiang C, Zhou SH, Liu K, Du FY (2017) Impact of public bus system on spatial burglary patterns in a Chinese urban context. Appl Geogr. https://doi.org/10.1016/j.apgeog.2017.11.002

Liu L, Zhou HL, Lan MX, Wang ZL (2020) Linking Luojia 1-01 nightlight imagery to urban crime. Appl Geogr 125:102267. https://doi.org/10.1016/j.apgeog.2020.102267

Lu M (2016) Great state needs bigger city. Shanghai People Publishing House, Shanghai

Mao YY, Dai SZ, Ding JJ, Zhu W, Wang C, Ye XY (2018) Space–time analysis of vehicle theft patterns in Shanghai, China. ISPRS Int J Geo-Inform 9. https://doi.org/10.3390/ijgi7090357

Musah A, Umar F, Yakubu KN, Ahmad M, Babagana A, Ahmed A, Thieme TA, Cheshire JA (2020) Assessing the impacts of various street-level characteristics on the burden of urban burglary in Kaduna, Nigeria. Appl Geogr C. https://doi.org/10.1016/j.apgeog.2019.102126

Najaf P, Thill JC, Zhang W, Fields MG (2018) City-level urban form and traffic safety: a structural equation modeling analysis of direct and indirect effects. J Trans Geogr 69:257–270. https://doi.org/10.1016/j.jtrangeo.2018.05.003

Park RE (1915) The city: suggestions for the investigation of human behavior in the city environment. Am J Sociol 5:577–612. https://doi.org/10.1086/212433

Piquero AR et al. (2019) The handbook of criminological theory. Law Press, Beijing

Piza EL, Wheeler AP, Connealy NT, Feng SQ (2020) Crime control effects of a police substation within a business improvement district: a quasi‐experimental synthetic control evaluation. Criminol Public Policy 19(2):653–684. https://doi.org/10.1111/1745-9133.12488

Porta S, Latora V, Wang FH, Rueda S, Strano E, Scellato S, Cardillo A, Belli E, Cardenas F, Cormenzana B, Latora L (2012) Street centrality and the location of economic activities in Barcelona. Urban Stud 49(7):1471–1488. https://doi.org/10.1177/0042098011422570

Shan Y (2020) Research on urban defense space based on crime hotspot mapping. Law Press, Beijing

Skrondal A, Rabe-Hesketh S (2007) Latent variable modelling: a survey. Scand J Stat 34(4):712–745. https://doi.org/10.1111/j.1467-9469.2007.00573.x

Snaphaan T, Hardyns W (2021) Environmental criminology in the big data era. Eur J Criminol 5. https://doi.org/10.1177/1477370819877753

Sohn DW (2016) Do all commercial land uses deteriorate neighborhood safety?: examining the relationship between commercial land-use mix and residential burglary. Habitat Int. https://doi.org/10.1016/j.habitatint.2016.03.007

Tang YC, Zhu XY, Guo W, Wu L, Fan YX (2019) Anisotropic Diffusion for Improved Crime Prediction in Urban China. ISPRS Int J Geo-Inform 5. https://doi.org/10.3390/ijgi8050234

Weisburd D, Groff ER, Yang SM (2014) Understanding and controlling hot spots of crime: the importance of formal and informal social controls. Prev Sci 15(1). https://doi.org/10.1007/s11121-012-0351-9

Welsh BC, Farrington DP, Douglas S (2022) The impact and policy relevance of street lighting for crime prevention: a systematic review based on a half‐century of evaluation research. Criminol Public Policy 3. https://doi.org/10.1111/1745-9133.12585

Wortley R, Townsley M et al. (2021) Environmental criminology and crime analysis 2nd ed. Tsinghua University Press, Beijing

Wu L, Liu XD, Ye XY, Leipnik M, Lee J, Zhu XY (2015) Permeability, space syntax, and the patterning of residential burglaries in urban China. Appl Geogr 261–265. https://doi.org/10.1016/j.apgeog.2014.12.001

Xiao LZ, Liu L, Song GW, Ruiter S, Zhou SH (2018) Journey-to-crime distances of residential burglars in China disentangled: origin and destination effects. ISPRS Int J Geo-Inform 8. https://doi.org/10.3390/ijgi7080325

Yu SSV, Maxfield MG (2014) Ordinary business: impacts on commercial and residential burglary. Br J Criminol 54(2):298–320. https://doi.org/10.1093/bjc/azt064

Yue H, Zhu XY, Ye XY, Hu T, Kudva S (2018) Modelling the effects of street permeability on burglary in Wuhan, China. Appl Geogr. https://doi.org/10.1016/j.apgeog.2018.06.005

Zhang ZF, Liu L, Cheng SS (2021) Measurement of Potential Victims of Burglary at the Mesoscale: Comparison of Census, Phone Users, and Social Media Data. ISPRS Int J Geo-Inform 5. https://doi.org/10.3390/IJGI10050280

Zhou HL, Liu L, Lan MX, Yang B, Wang ZL (2019) Assessing the impact of nightlight gradients on street robbery and burglary in Cincinnati of Ohio State, USA. Remot Sens 11(17):1958. https://doi.org/10.3390/rs11171958

Acknowledgements

This research was supported by the Beijing Laboratory of National Economic Security Early-warning Engineering, Beijing Jiaotong University. The work was supported by the R&D Program of Beijing Municipal Education Commission (Grant No. KJZD20191000401), the Program of the Co-Construction with Beijing Commission of Education of China (Grant No. B20H100020, B19H00010), and the Key Project of Beijing Social Science Foundation Research Base (Grant No. 19JDYJA001).

Author information

Authors and affiliations.

Department of Economics, School of Economics and Management, Beijing Jiaotong University, Beijing, China

Yutian Jiang & Na Zhang

Beijing Laboratory of National Economic Security Early-warning Engineering, Beijing Jiaotong University, Beijing, China

Contributions

All authors contributed significantly to the article and approved the submitted version. Contributors to the concept or design of the article: YJ. Contributed to analysis and interpretation of data for the study: YJ. Drafting work or critically revising it for important intellectual content: YJ, NZ. Final approval of the version to be published: YJ. Agreement to be responsible for all aspects of work in ensuring that questions regarding the accuracy or completeness of any part of work are properly investigated and resolved: YJ, NZ.

Corresponding authors

Correspondence to Yutian Jiang or Na Zhang.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Informed consent

Additional information.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Jiang, Y., Zhang, N. Does commerce promote theft? A quantitative study from Beijing, China. Humanit Soc Sci Commun 10, 203 (2023). https://doi.org/10.1057/s41599-023-01706-x

Received: 30 September 2022

Accepted: 18 April 2023

Published: 05 May 2023

DOI: https://doi.org/10.1057/s41599-023-01706-x

Original Research Article: Electricity Theft Detection in Power Consumption Data Based on Adaptive Tuning Recurrent Neural Network

  • 1 Metrology Center of Guangdong Power Grid Corporation, Guangzhou, China
  • 2 College of Electrical Engineering, Zhejiang University, Hangzhou, China
  • 3 Zhanjiang Power Supply Bureau of Guangdong Power Grid Co. Ltd., Zhanjiang, China
  • 4 China-EU Institute for Clean and Renewable Energy, Huazhong University of Science and Technology, Wuhan, China
  • 5 School of Data Science, Guangzhou Huashang College, Guangzhou, China

Electricity theft seriously affects the normal operation of the power grid and the economic benefits of power enterprises. Intelligent anti-power-theft algorithms are required to monitor power consumption data and recognize electricity theft. In this paper, an adaptive time-series recurrent neural network (TSRNN) architecture was built to detect abnormal users (i.e., electricity theft users) in time-series power consumption data. In fusion with the synthetic minority oversampling technique (SMOTE) algorithm, a batch of virtual abnormal observations was generated to supplement the data for training the TSRNN model. The power consumption record was characterized with the sharp data (ARP), the peak data (PEA), and the shoulder data (SHO). In the TSRNN architectural framework, a basic network unit was formed with three input nodes linked to one hidden neuron for extracting data features from the three characteristic variables. For time-series analysis, the TSRNN structure was re-formed by circulating the basic unit. Each hidden node was designed to receive data from both the current input neurons and the hidden neuron of the previous time step, thus forming a combination of network linking weights for adaptive tuning. Optimizing the TSRNN model means automatically searching for the most suitable values of these linking weights, driven by the collected and simulated data. The TSRNN model was trained and optimized with a high discriminant accuracy of 95.1%, and evaluated to have 89.3% accuracy. Finally, the optimized TSRNN model was used to predict the 47 real abnormal samples, with only three samples falsely predicted. These experimental results indicate that the proposed adaptive TSRNN architecture combined with SMOTE is feasible for identifying abnormal electricity theft behavior. It is a promising candidate for online monitoring and distributed analysis of large-scale electricity power consumption data.

Introduction

With the increasing scale of the power grid, power consumption is growing year by year. Attention is focused on the economic operation of the power network, the saving of electric resources, the reduction of grid line loss, and the structural optimization of power consumption (Dileep, 2020). However, electricity theft by customers continues to emerge. This infraction has seriously affected the normal operation of the power grid and the economic benefits of power enterprises (Li et al., 2019; Zhang et al., 2020). The electricity theft rate in developing countries is as high as 30%, and social power supply and consumption have also been greatly affected. According to rough statistics, China’s power enterprises lose as much as 20 billion CNY every year due to power theft. Therefore, power enterprises must carry out efficient anti-electricity-theft work to guarantee reasonable power supply and rational use of electricity, and thus reduce economic losses as much as possible (Aryanezhad, 2019).

The traditional detection methods of power theft mainly rely on scheduled operations by technicians working in power supply enterprises. The operation consists of reading the electricity meter and then recording, counting, and performing manual analysis and calculation. On the hardware side, multifaceted measures can prevent energy theft, such as installing a specialized watt-hour metering box, using a kind of conductor that closes the low-voltage outlet to the metering device, adding an anti-theft function to the watt-hour meter, and improving the application rate of the electrical acquisition system (Jokar et al., 2016). However, most of these traditional anti-theft methods focus on improving power devices. There is a lack of sufficient anti-power-theft algorithms for analyzing massive historical power consumption data, so it is difficult to find the power consumption characteristics of power-stealing users and to detect power theft realized by advanced attack means (Ahmad et al., 2015). Therefore, the power industry needs to strengthen the development of new artificial intelligence, information, and automation technology. With the continuous improvement of dynamic monitoring and acquisition technology for the power consumption data of grid users, it is of great engineering significance to study intelligent anti-power-theft algorithms based on big power consumption data to identify power theft behavior (Ren et al., 2020; Zhang et al., 2021).

At present, the most popular scheme is to lay out a smart grid detection architecture, collect the power consumption data and upload them to a centralized data processing center through terminal smart meters, and then analyze the centralized data with intelligent algorithms to detect electricity theft. The prevalent anti-power-theft data mining algorithms include clustering, the BP neural network, and local outlier detection (Al-Dahidi et al., 2019; Li Y. et al., 2021). Previous research reports many practical experiments. A typical load curve was extracted from power consumption data by applying the adaptive K-means clustering algorithm to realize load forecasting and load control (Zhu et al., 2016). An abnormal-point detection method based on a fuzzy neural network was proposed to deal with various data, which provides a new idea for mining abnormal data from power consumption records (Mozaffar et al., 2018). The flying anomaly factor detection and analysis method was investigated to detect flying anomalies of electric energy meters (Li et al., 2016). A novel power theft detection method was constructed based on the one-class SVM algorithm: a calibration model was established by analyzing a large amount of historical data, and if the current data are inconsistent with the model, power theft is considered possible (Dou et al., 2018). Also, an RBF neural network was proposed to detect electricity-stealing behavior from the data characteristics of voltage, current, and power factor (Cao et al., 2018).

Due to the wide layout of the power grid, the large-scale deployment of smart meters consumes a lot of resources. To save the energy consumed by distributed terminal nodes and reduce non-essential data transmission, it is necessary to study modern data mining technology in integration with machine learning algorithms (Wang et al., 2020; Li Z. et al., 2021). The application of indirect data anomaly detection, together with some preprocessing and analysis technologies, is necessary to achieve online detection of power theft. However, data-driven power theft detection is a special type of anomaly detection with a serious class imbalance problem (Avila et al., 2018): the number of normal power consumption users is much larger than the number of abnormal users. This inherent imbalance affects the performance of traditional machine learning methods. Until now, only a few studies have considered class imbalance in power theft detection (Zhang et al., 2019). These works mainly rely on undersampling and oversampling at the level of the data analysis algorithm. Some apply random oversampling and undersampling simultaneously and select the best detection effect by testing different sampling ratios. Others increase the misclassification cost of abnormal users to improve the detection rate of electricity theft, by setting penalty parameters for support vector machine misclassification of normal and abnormal users (Hu et al., 2019).

Generally, electricity theft monitoring data are a kind of time-series data. The difficulty of data analysis lies in finding the abnormal data in a constantly updated dynamic data flow, so as to accurately predict the theft users. The extreme imbalance of the data is the first analytical difficulty. Many experiments have shown that oversampling is a solution to the class imbalance problem. In essence, the random oversampling method increases the weight of the minority class in the sample set by randomly copying minority samples. It does not increase classification accuracy and easily causes over-fitting (He and Garcia, 2019). The synthetic minority oversampling technique (SMOTE) is a method for handling imbalanced data that is built on linear interpolation. It uses the local prior distribution information of samples to improve the accuracy on minority samples and thus address the data imbalance problem (Zhu et al., 2017). Furthermore, the recurrent neural network (RNN) is an intelligent machine learning method that is especially effective for monitoring and analyzing time-series dynamic data flow. The RNN is derived from the conventional fully connected neural network (FCNN) model. Its core operation is to compute the result of each neuron not only from its input data (as in the FCNN) but also from the historical variables of its former calculations (unlike the FCNN). The RNN model is widely used for sequential data processing tasks (Liu et al., 2020). The RNN structure produces a neuron output by fusing the current status data with the previous status data of the system. The RNN is able to automatically learn the time correlation of the input data without specifying any lag observations (Cossu et al., 2021).
It is well known that traditional time-series analytical methods (such as auto-correlation) need to identify the seasonality and stability of the time-series data. The effectiveness of identification may vary according to the network structure and the calculation speed, and it needs to be adjusted for each simulation (Chen et al., 2018; Farjaminezhad et al., 2021). The characteristic of the RNN is to create a closed-loop calculation in the hidden layer, which forms a circulating adaptive model that captures the internal hidden historical state features by iterative update and thus completes the process of error level accumulation in the training stage. In effect, the RNN model is forced to adapt to the error accumulation, which improves the model robustness (Ståhl et al., 2019).
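The recurrence described above, in which each hidden value depends on both the current input and the previous hidden value, can be sketched as follows. This is a minimal illustration rather than the TSRNN of this paper: the tanh activation, the names `rnn_forward`, `w_in`, `w_rec`, and `bias`, and the layer sizes are all assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(inputs, w_in, w_rec, bias):
    """Return the hidden state at every time step of an input sequence."""
    hidden = np.zeros(w_rec.shape[0])  # no history before the first step
    states = []
    for x_t in inputs:
        # Current input and previous hidden state are combined, then squashed
        # (tanh is an assumed activation, not taken from the source).
        hidden = np.tanh(w_in @ x_t + w_rec @ hidden + bias)
        states.append(hidden)
    return np.array(states)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(6, 3))  # 6 time steps, 3 input features (hypothetical)
w_in = rng.normal(size=(4, 3))    # input-to-hidden weights (hypothetical sizes)
w_rec = rng.normal(size=(4, 4))   # hidden-to-hidden (recurrent) weights
bias = np.zeros(4)

states = rnn_forward(inputs, w_in, w_rec, bias)
print(states.shape)  # (6, 4)
```

Setting `w_rec` to zero removes the dependence on history, so each output then depends on the current input only, which is the FCNN-like case the paragraph contrasts with.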

This paper aims to design a data-driven time-series RNN (TSRNN) architecture with adaptive parameter optimization, so that intelligent machine learning can solve the problem of monitoring abnormal power consumption. The TSRNN architecture with an adaptive training strategy is constructed by monitoring, collecting, and analyzing the observed data of a stage. Then, the non-linear features of the observed data are extracted by developing a hyperparameter optimization mode for the RNN, in fusion with a SMOTE solution to data imbalance. On this algorithmic basis, the power-stealing users with abnormal characteristics are identified among a large number of power user samples. In structural detail, grid search is designed for the parameter selection of the RNN linking weights, and a fault-tolerance iteration mechanism is adopted for parameter optimization in the closed-loop training stage to control the error accumulation in model prediction and thereby enhance the model robustness. In this way, the proposed intelligent TSRNN architecture with data-driven adaptive parameter optimization is validated through data training and prediction. The optimized model is effective for accurately extracting the data features of power-stealing behavior. The intelligent TSRNN model is expected to overcome the costly, laborious, and time-consuming problems of traditional methods for monitoring electricity theft. It can speed up the location of abnormal watt-hour meter terminals and accurately identify the power-stealing users. The proposed method helps promote the development of artificial intelligence and information analysis technology in the field of power grid operation and maintenance.

Methodologies

In this section, we discuss the basic structure of the TSRNN architecture and the algorithmic procedure of SMOTE balancing. The energy theft detection model is established and further optimized by fusing TSRNN and SMOTE. Discriminant indicators based on the confusion matrix are then introduced for the quasi-qualitative recognition of abnormal user data.

The Principle of SMOTE

The SMOTE algorithm is an oversampling method based on synthetic sampling proposed by Chawla et al. (2002). Geometrically, SMOTE first takes the minority samples and connects each of them to a batch of its surrounding samples. It then produces new samples by random insertion on the connecting lines. The connection and insertion operation reduces the imbalance of the sample space and simultaneously prevents over-fitting by avoiding excessive repetition of the original minority samples (Fernández et al., 2018; Chen et al., 2021). The schematic diagram for generating new samples by the SMOTE algorithm is shown in Figure 1. Specifically, the SMOTE sample-generating procedure is described in the following steps:

Step 1: Let $\{x_i \mid i = 1, 2, \ldots\}$ be the minority samples and set the sampling number $r$ according to the ratio of the number of majority samples to the number of minority samples.

Step 2: Search $k$ samples in the neighborhood of each minority sample, where $k > r$.

Step 3: Randomly select $r$ samples from the $k$ neighborhood samples to form the neighborhood sample set $\{y_1, y_2, \ldots, y_r\}$.

Step 4: Generate a set of new samples by random linear interpolation. The new samples are denoted as $\{p_j \mid j = 1, 2, \ldots\}$, where

$$p_j = x_i + \mathrm{rand}(0,1) \cdot (y_j - x_i),$$

with $\mathrm{rand}(0,1)$ representing a random number in the interval $[0, 1]$.

Step 5: The newly generated samples $\{p_j\}$ are regarded as the algorithmic supplement to the minority samples and are added to the original sample set, forming a brand new training sample set together with the majority samples.


FIGURE 1 . Schematic diagram of the SMOTE algorithm.

SMOTE synthesizes minority samples artificially by random interpolation. Compared with traditional random replication, SMOTE reduces the redundant information in the newly generated minority samples and helps avoid over-fitting in subsequent data mining. The main uncertainty in the algorithm lies in selecting the nearest neighbors of the original minority samples: the number of neighbor samples $k$ has a great influence on model performance. When SMOTE is fused with the TSRNN architecture, $k$ is therefore treated as one of the tunable parameters in network model optimization.
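The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `smote_sketch`, the Euclidean neighbor search, and the restriction of neighbors to other minority samples (as in Chawla et al., 2002) are assumptions made here.

```python
import numpy as np

def smote_sketch(minority, r, k, seed=0):
    """Generate r synthetic points per minority sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if not (r < k < n):
        raise ValueError("need r < k < number of minority samples")
    synthetic = []
    for i, x in enumerate(minority):
        # Step 2: the k nearest minority neighbours of x (assumed Euclidean),
        # excluding x itself.
        dist = np.linalg.norm(minority - x, axis=1)
        dist[i] = np.inf
        neighbours = np.argsort(dist)[:k]
        # Step 3: choose r of the k neighbours at random.
        chosen = rng.choice(neighbours, size=r, replace=False)
        # Step 4: place one new point on the segment from x to each neighbour.
        for j in chosen:
            gap = rng.random()  # plays the role of rand(0, 1)
            synthetic.append(x + gap * (minority[j] - x))
    # Step 5 (appending to the training set) is left to the caller.
    return np.array(synthetic)
```

Because every synthetic point is a convex combination of two minority samples, it always lies inside the bounding box of the minority class, which is why the choice of $k$ controls how far new points can spread.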

Time-Series RNN Model

The data-driven time-series analysis problem is described theoretically by a general ordinary differential model (Li and Yang, 2021), formulated as

$$\frac{dz}{dt} = f(z, x), \tag{2}$$

where $z \in \mathbb{R}^d$ is the current state of the system and $x \in \mathbb{R}^d$ represents the instantaneous input data. In general, the model function $f$ is unknown, but it can be estimated from discrete observations of the current state $z$ and the input $x$. Along these lines, the fully connected neural network (FCNN) is suitable for resolving data-driven analytical models.

An FCNN module is traditionally applied as a black box that transforms the input data directly to the hidden layer and then to the output. The data generated at each neuron node are described as $z_{t+1} = g(z_t, x_t)$, where the activation function $g(\cdot)$ is usually a simple linear transformation, and the operations inside the FCNN have no physical interpretation. Such a black-box model may not capture the detailed data transitions in a time series. The TSRNN is proposed to address this issue.

The TSRNN architecture is built on the cyclic computation of the hidden layer. Figure 2 shows the TSRNN structure with this cycle unfolded along the time axis. At the initial time $t = 1$, the power consumption user data are input into the network and delivered to the first hidden layer ($H_1$). The data are transformed to extract the first level of neural features, which are then delivered to the next hidden layer as $t$ advances. At each time step, the result of each neuron computation depends not only on the current input but also on the computation results of the previous time step. In this way, the TSRNN captures the correlation between the longitudinal (time-series) parameters and the cross-sectional parameters. There are thus two kinds of network linking weights: one describes the direct effect of layer-to-layer delivery, and the other describes the indirect influence of the time-series circulation of the hidden layers. Any change in the direct or indirect weights causes a change in the output at any moment in time (Alkinani et al., 2021).


FIGURE 2 . Structural design of the time-series RNN (TSRNN) architecture.

Figure 2 also presents a simple TSRNN cell structure at a given time $t$. Specifically, a TSRNN cell is a single layer of hidden neurons. This hidden layer is denoted $H(t)$ and contains $m$ hidden neurons, i.e., $H(t) = \{h_i(t) \mid i = 1, 2, \ldots, m\}$. Suppose the current input data from the power consumption users are $X(t) = \{x_i(t) \mid i = 1, 2, \ldots, n\}$, regarded as the direct input. The time-lagged input is the result computed in the hidden layer $H(t-1)$ at time $t-1$, taken as the indirect input. $H(t)$ then works as the hidden layer at time $t$, extracting data features from both the direct and the indirect inputs, so its output is influenced by both $X(t)$ and $H(t-1)$. It can be formulated as

$$H(t) = f\big(W \cdot X(t) + U \cdot H(t-1)\big), \tag{3}$$

where the function $f(\cdot)$ is the sigmoid function, which limits the transformed features to the standard variable range $[-1, 1]$. The parameters $W$ and $U$ are the linking weights for the data connection and for the time-variance connection, respectively.
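The recurrence described here can be illustrated with a short NumPy sketch. It assumes the common form $H(t) = f(W X(t) + U H(t-1))$ with a zero initial state; the function names and the absence of bias terms are assumptions, not details confirmed by the source. The activation is passed as a parameter because the logistic sigmoid maps into $(0, 1)$, whereas an output range of $[-1, 1]$ corresponds to a tanh-type sigmoid.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def tsrnn_hidden_states(X, W, U, activation=sigmoid):
    """Unroll the hidden layer over time (hypothetical helper).

    X: (T, n) inputs, one row per time step.
    W: (m, n) input-to-hidden weights (direct input).
    U: (m, m) hidden-to-hidden weights (indirect, time-lagged input).
    Returns an array of shape (T, m) holding H(1), ..., H(T).
    """
    X = np.asarray(X, dtype=float)
    T, m = X.shape[0], W.shape[0]
    H = np.zeros((T, m))
    h_prev = np.zeros(m)  # assumption: the state before t = 1 is zero
    for t in range(T):
        # Current input and previous hidden state both feed the new state.
        h_prev = activation(W @ X[t] + U @ h_prev)
        H[t] = h_prev
    return H
```

Calling `tsrnn_hidden_states(X, W, U, activation=np.tanh)` gives hidden values in $(-1, 1)$ instead of $(0, 1)$.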

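Subsequently, $H(t)$, i.e., the set of feature data $\{h_i(t)\}$, is delivered to a softmax unit for discriminant calculation. The neural network output at time $t$ is thus expressed as

$$O(t) = K\big(V \cdot H(t)\big), \tag{4}$$

where $V$ denotes the linking weights that transform $H(t)$ to $O(t)$, and the function $K(\cdot)$ performs $k$-means clustering with the Mahalanobis distance

$$\mathrm{mah}(O_i, O_j) = \sqrt{(O_i - O_j)^{\mathsf{T}} S^{-1} (O_i - O_j)}, \tag{5}$$

where $S$ is the covariance matrix of the samples. The Mahalanobis distance between every pair of the $n$ samples is calculated according to Eq. 5 to obtain the distance matrix $KM$ at time $t$, namely,

$$KM = \begin{bmatrix} \mathrm{mah}_{11} & \mathrm{mah}_{12} & \cdots & \mathrm{mah}_{1n} \\ \mathrm{mah}_{21} & \mathrm{mah}_{22} & \cdots & \mathrm{mah}_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{mah}_{n1} & \mathrm{mah}_{n2} & \cdots & \mathrm{mah}_{nn} \end{bmatrix}, \tag{6}$$

where $\mathrm{mah}_{ij} \triangleq \mathrm{mah}(O_i, O_j)$.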

Finally, the Mahalanobis-based k-means clustering results of the TSRNN-extracted feature data are used to calculate the discriminant indicators, which help identify the abnormal users among all of the electric power consumption data.
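As an illustration of the pairwise distance matrix, the sketch below computes Mahalanobis distances between all rows of an output array. It assumes the standard definition with the sample covariance of the outputs; the function name is hypothetical, a pseudo-inverse is used so that a singular covariance does not raise an error, and the clustering step itself is not shown.

```python
import numpy as np

def mahalanobis_matrix(O):
    """Pairwise Mahalanobis distances between the rows of O (hypothetical helper).

    O: (n, d) array of network outputs, or a length-n vector when d = 1.
    Returns an (n, n) symmetric matrix with zeros on the diagonal.
    """
    O = np.asarray(O, dtype=float)
    if O.ndim == 1:
        O = O[:, None]
    # Sample covariance of the outputs (assumption: estimated from O itself).
    cov = np.atleast_2d(np.cov(O, rowvar=False))
    cov_inv = np.linalg.pinv(cov)
    # diff[i, j] = O[i] - O[j]
    diff = O[:, None, :] - O[None, :, :]
    # Quadratic form diff^T * cov_inv * diff for every pair (i, j).
    sq = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))
```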

Discriminant Indicators

The power consumption data are inherently imbalanced because normal electricity users far outnumber electricity thieves, which makes it expensive to identify the abnormal users. In our algorithmic design, SMOTE alleviates the data imbalance, and the adaptive TSRNN model extracts features of the power consumption data to improve discrimination accuracy with the k-means Mahalanobis measure. The model should be evaluated with quantitative indicators. The confusion matrix is a basic tool for evaluating model performance (see Table 1), and the indicators of each model are computed from it.


TABLE 1 . Confusion matrix for evaluation of the discrimination/classification models.

In the confusion matrix, normal power consumption users are treated as negative records and abnormal users as positive. The table entries are interpreted as follows:

- TP: an abnormal user (positive) is correctly predicted as abnormal (positive),

- TN: a normal user (negative) is correctly predicted as normal (negative),

- FP: a normal user (negative) is falsely predicted as abnormal (positive),

- FN: an abnormal user (positive) is falsely predicted as normal (negative).

Several indicators are then calculated from the confusion matrix, including the classification accuracy (ACC), the true positive rate (TPR), and the false alarm rate (FAR):

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \tag{7}$$

$$TPR = \frac{TP}{TP + FN}, \tag{8}$$

$$FAR = \frac{FP}{FP + TN}. \tag{9}$$

These indicators are used to evaluate the performance of the adaptive parametric-scaling TSRNN architecture. Eqs. 7–9 show that the higher TP and TN are, the better the model performs.
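A small helper makes the three indicators concrete. It uses the standard confusion-matrix definitions of accuracy, true positive rate, and false alarm rate; the function name and the counts in the usage note are made up for illustration.

```python
def discriminant_indicators(tp, tn, fp, fn):
    """Return (ACC, TPR, FAR) from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total   # share of all users classified correctly
    tpr = tp / (tp + fn)      # share of abnormal users that were caught
    far = fp / (fp + tn)      # share of normal users flagged by mistake
    return acc, tpr, far
```

For hypothetical counts `tp=8, tn=85, fp=5, fn=2`, the helper returns an ACC of 0.93 and a TPR of 0.8; FAR is 5/90, so a model can have high accuracy while still raising false alarms on a few percent of normal users.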

For fault-tolerant analysis, the model predictions can be monitored at every moment of the time series. Exporting them yields a series of normal/abnormal classification results for each user. The number of times each user is identified as abnormal is then counted over the whole time axis, which provides an extra confirmation of the model predictions.
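The counting step can be sketched as follows, assuming the per-step predictions are stored as a 0/1 array with one row per time step and one column per user. The function names and the `min_fraction` threshold are hypothetical; the source does not state how the counts are turned into a confirmation.

```python
import numpy as np

def abnormal_frequency(flags):
    """Fraction of time steps at which each user was flagged abnormal.

    flags: (T, n_users) array of 0/1 predictions.
    """
    flags = np.asarray(flags, dtype=float)
    return flags.mean(axis=0)

def confirm_abnormal(flags, min_fraction=0.5):
    """Confirm a user as abnormal when flagged in at least min_fraction of steps.

    min_fraction is an illustrative threshold, not a value from the paper.
    """
    return abnormal_frequency(flags) >= min_fraction
```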

Analysis of Power Consumption Data

A total of 929 electricity consumption users were monitored continuously from January 1, 2017, to March 31, 2019, with a time step of 1 day; thus, we recorded 820 time points in a time series spanning 27 months. Their electricity use data were collected as total usage amounts within different periods of the day. Specifically, the electricity used during 00:00–08:00 is named the off-peak data (OPE), during 08:00–12:00 the peak data (PEA), during 18:00–22:00 the sharp data (ARP), and during the remaining hours the shoulder data (SHO).
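The period boundaries and the length of the monitoring window can be checked with a few lines of standard-library Python. The function name `tou_period` is hypothetical, and each interval is assumed to include its start hour and exclude its end hour.

```python
from datetime import date

def tou_period(hour):
    """Map an hour of the day (0-23) to its time-of-use label."""
    if not 0 <= hour < 24:
        raise ValueError("hour must be in 0..23")
    if hour < 8:
        return "OPE"   # off-peak, 00:00-08:00
    if hour < 12:
        return "PEA"   # peak, 08:00-12:00
    if 18 <= hour < 22:
        return "ARP"   # sharp, 18:00-22:00
    return "SHO"       # shoulder, all remaining hours

# Number of daily records, counting both the first and the last day.
n_days = (date(2019, 3, 31) - date(2017, 1, 1)).days + 1
```

Under this mapping the shoulder period covers 12:00–18:00 and 22:00–24:00, and `n_days` matches the 820 daily records per user.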

Taking the electricity users as the analytical samples, the power consumption characteristics of the 929 samples are described by the recorded OPE, PEA, ARP, and SHO data, with 820 daily records per user. Because the maximum record exceeds thirty thousand and the minimum is zero, the dataset was normalized before analysis using the min–max normalization method (Jin et al., 2015). We then derived the sample distribution from the average electricity consumption over the 820 time nodes (see Figure 3). Figure 3 shows that users do not consume electricity in every period; for example, some users have high consumption in the ARP period but low or even zero consumption in SHO, and some have high consumption in PEA but zero in ARP or OPE. In particular, the OPE sub-figure (the blue plot) shows that only one of the 929 users consistently uses electricity during the OPE period. The OPE data therefore provide hardly any information for discriminating abnormal users and are excluded from the subsequent SMOTE balancing and TSRNN training.
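Min–max normalization rescales values linearly so that the smallest becomes 0 and the largest becomes 1. The sketch below applies it to a single array; whether the authors normalized per variable, per user, or over the whole dataset is not stated, so the scope here is an assumption, as is the function name.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale x linearly to the interval [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # A constant series carries no spread; map it to zeros.
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)
```

With records ranging from zero to thirty thousand, for example, a value of 15,000 is mapped to 0.5.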


FIGURE 3 . Statistical descriptive plots of the power consumption data in different time period partitions.

Data Balancing by SMOTE

In practice, a prior target classification label is available for each of the 929 power consumption user samples. There are originally 882 normal samples and only 47 abnormal samples, so the normal samples form the majority and the abnormal ones the minority, with an imbalance ratio of about 19:1. The scatter distribution of the 929 samples is plotted in 3D on the three basic variables ARP, PEA, and SHO (see Figure 4A). To ease this heavy imbalance, the SMOTE algorithm is applied to increase the proportion of minority samples by linear interpolation. Following the principle introduced in The Principle of SMOTE, a batch of virtual samples is generated by interpolation from the original 47 abnormal samples.


FIGURE 4 . Distribution of the power consumption user samples (panel (A) is for the original 929 samples, and panel (B) is for the SMOTE-balanced output of the 1,151 samples).

Theoretically, one virtual sample is generated on the edge linking every pair of samples. The 47 available samples can thus generate $\binom{47}{2} = 1{,}081$ new samples in all, from which we randomly chose 222 to supplement the minority class. After SMOTE balancing, we have a total of 1,151 samples for modeling, of which 269 are abnormal and 882 are the original normal samples. The scatter distribution is shown in Figure 4B. The ratio of normal to abnormal samples is now about 3:1.
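The sample counts in this paragraph follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative; all numbers come from the text.

```python
from math import comb

n_normal, n_abnormal, n_virtual = 882, 47, 222

pairs = comb(n_abnormal, 2)                   # one candidate per pair: 1081
n_abnormal_balanced = n_abnormal + n_virtual  # 47 real + 222 virtual
n_total = n_normal + n_abnormal_balanced      # all samples after balancing

ratio_before = n_normal / n_abnormal          # roughly 19 to 1
ratio_after = n_normal / n_abnormal_balanced  # roughly 3 to 1
```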

Hereafter, the 1,151 SMOTE-balanced samples were used to train the TSRNN model (defined in Time-Series RNN Model), building an intelligent network architecture with adaptive grid optimization of parameters for accurate recognition of the abnormal power users who are stealing electricity.

Discriminations Based on TSRNN Training and Testing

An applicable discrimination model for detecting electricity theft was trained using the TSRNN architecture based on the power consumption data of the 1,151 SMOTE-balanced samples. The recorded ARP, PEA, and SHO variables are taken as the network input. The data have a time-series record of 820 days.

The data samples were divided into two sets: 918 samples (∼80%) for training and 233 (∼20%) for testing. The training data were used for data-driven machine learning optimization of the TSRNN model. The model was constructed with three input neurons and one hidden neuron to produce the output. There are therefore three input-to-hidden linking weights ($w_1$, $w_2$, and $w_3$) and one hidden-to-output linking weight ($v$) to adjust, plus a linking weight ($u$) that accepts the input from the previous time step of the recurrent loop. During training, these linking weights were adaptively tuned to their most suitable values, and the testing data were then used to examine the discrimination effectiveness of the model with the resulting data-driven parameters.

In the training process, the 918 training samples were fed to the input layer at every time step and then used to compute the hidden variables. Notably, the RNN architecture is characterized by the recurrence of the hidden layer: the hidden variables at time $t$ are affected by both the input at time $t$ and the hidden variables at time $t-1$, where $t = 1, 2, \ldots, 820$. A series of phased discriminant results was thus obtained from the output layer at every time step. Specifically, we segmented the full time series from January 1, 2017, to March 31, 2019, by setting five time markers (see Table 2) and observed the five phased modeling outputs to examine the progress of model optimization.


TABLE 2 . Markers of the five special time nodes for investigation of the TSRNN model performance.

Based on the 918 training samples, the TSRNN model was trained by iterating its parameters through the recurrent updating of the hidden neurons. We calculated the model discriminant indicators at each phase stopping point $t_1$, $t_2$, $t_3$, $t_4$, and $t_5$ and drew the ROC curves (see Figure 5). The ROC curves show that the TSRNN model improved continuously as the time series advanced. Eventually, the optimal model was observed at $t = t_5 = 820$.


FIGURE 5 . ROC curves for the evaluation of the TSRNN training effects at the five selected time markers based on the 918 training samples.
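For readers who want to reproduce this kind of plot, an ROC curve can be computed from true labels and model scores as sketched below. This is a generic construction, not the authors' code: the function names are hypothetical, label 1 is assumed to mean abnormal, and tied scores are not merged into a single threshold.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by sweeping the decision threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Visit samples from the highest score to the lowest.
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted == 1)  # positives captured so far
    fps = np.cumsum(y_sorted == 0)  # negatives wrongly captured so far
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A curve that hugs the upper-left corner, with an area close to 1, corresponds to a model that separates abnormal from normal users well at most thresholds.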

To study how machine learning optimizes the parameters, we further investigated the adaptive tuning of the TSRNN linking weights. Denoting the linking weights as the combination $(w_1, w_2, w_3, v, u)$, we initialized this combination as $(100, 100, 100, 100, 1)$ for model optimization by network iteration over the time series. As time advanced, more and more power consumption data were input to the network, and the linking weights were adjusted to improve the TSRNN model. The value of each linking weight was recorded every 20 time steps, giving the variation trends of the five linking weights (see Figure 6). Figure 6 shows that the network weights $w_i$ and $v$ followed an overall downward trend with cyclical recovery fluctuations, ending at optimal values close to zero relative to their initial value of 100, while the parameter $u$ (the weight of the time-series iteration) first fell and then rose. In the end, the optimal value of $(w_1, w_2, w_3, v, u)$ was $(2.763, 0.767, 0.821, 3.254, 0.564)$ after 820 iterations over the time series, where $u = 0.564$ applies to the recurrent step from $t = 819$ to $t = 820$. These optimal values indicate that the trained TSRNN model has a linear expression with simple weight coefficients, while the recurrent iteration over the time series makes a definite contribution to the network model.


FIGURE 6 . Training of linking weights in the TSRNN structure.

The predictive performance of the TSRNN discriminant model with adaptively tuned network weights was further evaluated on the 233 test samples, which were treated as “unknown” because they were not involved in training. The test set contained 53 abnormal samples and 180 normal samples. The optimal TSRNN model achieved relatively high prediction accuracy on the quantitative indicators: the predictive ACC, TPR, and FAR were 89.3%, 92.5%, and 11.7%, respectively. The corresponding confusion matrix is shown in Table 3.
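The reported rates can be cross-checked against the class sizes of the test set. The counts below are inferred by rounding the reported TPR and FAR to whole samples; they are not read from Table 3, and they assume the standard definitions of ACC, TPR, and FAR.

```python
n_abnormal, n_normal = 53, 180

# Whole-sample counts implied by the reported TPR = 92.5% and FAR = 11.7%.
tp = round(0.925 * n_abnormal)  # abnormal users correctly flagged
fp = round(0.117 * n_normal)    # normal users wrongly flagged
fn = n_abnormal - tp
tn = n_normal - fp

acc = (tp + tn) / (n_abnormal + n_normal)
tpr = tp / n_abnormal
far = fp / n_normal
```

The inferred counts reproduce all three reported percentages after rounding to one decimal place, so the figures in the text are mutually consistent.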


TABLE 3 . Confusion matrix of the discriminating results predicted by the optimal TSRNN model for the 233 test samples.

Because the aim is to find electricity theft among the real power consumption users, the optimal model output its discriminant result for each sample (shown in Figure 7). The virtual data produced by SMOTE balancing are not targets for prediction, so the real abnormal data must be distinguished from the virtual abnormal data. We used solid stars to mark the 10 real abnormal samples in the figure; only two of them were predicted incorrectly. The results indicate that the adaptive TSRNN architecture is able to predict the abnormal cases in the daily records of the power consumption data.


FIGURE 7 . Discrimination for each test sample by the optimal TSRNN model.

Furthermore, the well-trained TSRNN architecture was used to monitor the time-series data from January 1, 2017, to March 31, 2019, to recognize the power consumption users who probably engage in electricity theft. The identification of the real abnormal users is listed in Table 4, which shows that the optimal TSRNN model successfully identified 44 of the 47 abnormal users. The results show that the proposed adaptive TSRNN architecture, combined with the SMOTE sample balancing technique, can accurately find the abnormal samples by analyzing the time-series power consumption data and thus recognize electricity theft behavior.


TABLE 4 . Discrimination results for the 47 real abnormal data of the electricity theft users.

Conclusion

In this paper, an adaptive TSRNN architecture was built to detect electricity theft from time-series power consumption data. The data were recorded continuously from January 1, 2017, to March 31, 2019 (820 days in total). Based on the ARP, PEA, and SHO data, the users suspected of stealing electricity were denoted as abnormal samples and the other users as normal. We collected data for 882 normal samples and 47 abnormal samples. Because the abnormal users are a small minority of the recorded data, the SMOTE algorithm was used to ease the data imbalance by generating 222 virtual abnormal samples, bringing the ratio of normal to abnormal samples to about 3:1.

The TSRNN model was established on the total of 1,151 user samples over the 820 time-series moments. A basic network was formed with three input nodes, receiving the three variables ARP, PEA, and SHO, and one hidden neuron for extracting data features. The network output was then computed as a k-means classification result, based on the Mahalanobis distance, that discriminates each sample as abnormal or normal. To analyze the continuously arriving time-series data, the TSRNN structure was formed by cycling this basic network. Each hidden node is influenced by the input data at the current time and by the data delivered from the hidden node of the previous time step, so the output can be optimized by adaptively tuning the combination of linking weights $(w_1, w_2, w_3, v, u)$. In our empirical experiment, the optimal values of the linking weights were $(2.763, 0.767, 0.821, 3.254, 0.564)$ after 820 iterations over the time series, giving a discriminant model with a high prediction accuracy of ACC = 95.1%. The optimal TSRNN model proved effective on the 233 test samples, with testing ACC = 89.3%, TPR = 92.5%, and FAR = 11.7%. Finally, the adaptive TSRNN model was used to predict the 47 real abnormal samples, and the discriminating results are encouraging: only three samples were predicted incorrectly, a prediction accuracy of 93.6%.

The experimental results indicate that the proposed adaptive TSRNN architecture, fused with the SMOTE balancing technique, can feasibly extract data features for monitoring abnormal electricity theft behavior. The methodological framework is a promising candidate for online monitoring in big data analysis of large-scale electric power consumption.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material, and further inquiries can be directed to the corresponding author.

Author Contributions

YL conceptualized the idea and supervised the work. GL and SH performed the methodology. GL and HW visualized the results. HW was involved in formal analysis. SH and ZN investigated the data. ZN validated the data. GL and HF wrote the original draft. HF curated the data and ran the software. XF and SH reviewed and edited the paper. XF obtained the resources.

This research was funded by the project supported by the China Southern Power Grid Corporation (Grant No. GDKJXM20185800).

Conflict of Interest

The author HW was employed by Zhanjiang Power Supply Bureau of Guangdong Power Grid Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Ahmad, T., Hasan, D. Q. U., and Zada, S. (2015). Non-Technical Loss Detection, Prevention and Suppression Issues for AMI in Smart Grid. Ijser 6 (3), 217–228. doi:10.14299/ijser.2015.03.001


Al-Dahidi, S., Ayadi, O., Adeeb, J., and Louzazni, M. (2019). Assessment of Artificial Neural Networks Learning Algorithms and Training Datasets for Solar Photovoltaic Power Production Prediction. Front. Energ. Res. 7, 1–18. doi:10.3389/fenrg.2019.00130

Alkinani, H. H., Al-Hameedi, A. T. T., and Dunn-Norman, S. (2021). Data-driven Recurrent Neural Network Model to Predict the Rate of Penetration. Upstream Oil Gas Techn. 7, 100047. doi:10.1016/j.upstre.2021.100047

Aryanezhad, M. (2019). A Novel Approach to Detection and Prevention of Electricity Pilferage over Power Distribution Network. Int. J. Electr. Power Energ. Syst. 111, 191–200. doi:10.1016/j.ijepes.2019.04.005

Avila, N. F., Figueroa, G., and Chu, C.-C. (2018). NTL Detection in Electric Distribution Systems Using the Maximal Overlap Discrete Wavelet-Packet Transform and Random Undersampling Boosting. IEEE Trans. Power Syst. 33, 7171–7180. doi:10.1109/tpwrs.2018.2853162

Cao, M., Zou, J., Wei, L., Zhao, X., Zhang, L., and Li, P. (2018). Detection of Power Theft Behavior of Distribution Network Based on RBF Neural Network. J. Yunnan Univ. Nat. Sci. Ed. 40 (5), 872–878. doi:10.7540/j.ynu.20170426

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. jair 16, 321–357. doi:10.1613/jair.953

Chen, H., Liu, X., Jia, Z., Liu, Z., Shi, K., and Cai, K. (2018). A Combination Strategy of Random forest and Back Propagation Network for Variable Selection in Spectral Calibration. Chemometrics Intell. Lab. Syst. 182, 101–108. doi:10.1016/j.chemolab.2018.09.002

Chen, W., Chen, H., Feng, Q., Mo, L., and Hong, S. (2021). A Hybrid Optimization Method for Sample Partitioning in Near-Infrared Analysis. Spectrochimica Acta A: Mol. Biomol. Spectrosc. 248, 119182. doi:10.1016/j.saa.2020.119182

Cossu, A., Carta, A., Lomonaco, V., and Bacciu, D. (2021). Continual Learning for Recurrent Neural Networks: An Empirical Evaluation. Neural Networks 143, 607–627. doi:10.1016/j.neunet.2021.07.021


Dileep, G. (2020). A Survey on Smart Grid Technologies and Applications. Renew. Energ. 146, 2589–2625. doi:10.1016/j.renene.2019.08.092

Dou, J., Liu, X., Lu, J., Wu, D., and Wang, X. (2018). Research on Electricity Anti-stealing Method Based on Power Consumption Information Acquisition and Big Data. Elec. Meas. Instrum. 55 (21), 43–49.


Farjaminezhad, R., Safari, S., and Moghadam, A. M. E. (2021). Recurrent Neural Networks Models for Analyzing Single and Multiple Transient Faults in Combinational Circuits. Microelectronics J. 112, 104993. doi:10.1016/j.mejo.2021.104993

Fernández, A., García, S., Herrera, F., and Chawla, N. V. (2018). SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. jair 61, 863–905. doi:10.1613/jair.1.11192

He, H., and Garcia, E. A. (2019). Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 21, 1263–1284. doi:10.1109/tkde.2008.239

Hu, T., Guo, Q., and Sun, H. (2019). Nontechnical Loss Detection Based on Stacked Uncorrelating Autoencoder and Support Vector Machine. Autom. Elec. Power Syst. 43 (1), 119–127. doi:10.7500/AEPS20180630013

Jin, J., Li, M., and Jin, L. (2015). Data Normalization to Accelerate Training for Linear Neural Net to Predict Tropical Cyclone Tracks. Math. Probl. Eng. 2015. doi:10.1155/2015/931629

Jokar, P., Arianpoo, N., and Leung, V. C. M. (2016). Electricity Theft Detection in AMI Using Customers' Consumption Patterns. IEEE Trans. Smart Grid 7, 216–226. doi:10.1109/tsg.2015.2425222

Li, S., Han, Y., Yao, X., Yingchen, S., Wang, J., and Zhao, Q. (2019). Electricity Theft Detection in Power Grids with Deep Learning and Random Forests. J. Electr. Comput. Eng. 2019, 4136874. doi:10.1155/2019/4136874

Li, S., and Yang, Y. (2021). A Recurrent Neural Network Framework with an Adaptive Training Strategy for Long-Time Predictive Modeling of Nonlinear Dynamical Systems. J. Sound Vibration 506, 116167. doi:10.1016/j.jsv.2021.116167

Li, Y., Hao, G., Liu, Y., Yu, Y., Ni, Z., and Zhao, Y. (2021a). Many-objective Distribution Network Reconfiguration via Deep Reinforcement Learning Assisted Optimization Algorithm. IEEE Trans. Power Deliv. , 1. doi:10.1109/tpwrd.2021.3107534

Li, Y., Song, W., Peng, F., Ding, N., and Wang, F. (2016). The Intelligent Analysis on the Trend Anomaly of the Electric Energy Meter Based on LOF Algorithm. Elec. Meas. Instrum. 53 (18), 69–73.

Li, Z., Li, Y., Liu, Y., Wang, P., Lu, R., and Gooi, H. B. (2021b). Deep Learning Based Densely Connected Network for Load Forecasting. IEEE Trans. Power Syst. 36, 2829–2840. doi:10.1109/tpwrs.2020.3048359

Liu, L., Finch, A., Utiyama, M., and Sumita, E. (2020). Agreement on Target-Bidirectional Recurrent Neural Networks for Sequence-To-Sequence Learning. jair 67, 581–606. doi:10.1613/jair.1.12008

Mozaffar, M., Paul, A., Al-Bahrani, R., Wolff, S., Choudhary, A., Agrawal, A., et al. (2018). Data-driven Prediction of the High-Dimensional thermal History in Directed Energy Deposition Processes via Recurrent Neural Networks. Manufacturing Lett. 18, 35–39. doi:10.1016/j.mfglet.2018.10.002

Ren, H., Hou, Z. J., Vyakaranam, B., Wang, H., and Etingov, P. (2020). Power System Event Classification and Localization Using a Convolutional Neural Network. Front. Energ. Res. 8, 1–11. doi:10.3389/fenrg.2020.607826

Ståhl, N., Mathiason, G., Falkman, G., and Karlsson, A. (2019). Using Recurrent Neural Networks with Attention for Detecting Problematic Slab Shapes in Steel Rolling. Appl. Math. Model. 70, 365–377. doi:10.1016/j.apm.2019.01.027

Wang, H., Cai, R., Zhou, B., Aziz, S., Qin, B., Voropai, N., et al. (2020). Solar Irradiance Forecasting Based on Direct Explainable Neural Network. Energ. Convers. Manag. 226, 113487. doi:10.1016/j.enconman.2020.113487

Zhang, C., Xiao, X., and Zheng, Z. (2019). Electricity Theft Detection for Customers in Power Utility Based on Real-Valued Deep Belief Network. Power Syst. Techn. 43 (3), 1083–1091.

Zhang, K., Zhou, B., Or, S. W., Li, C., Chung, C. Y., and Voropai, N. I. (2021). Optimal Coordinated Control of Multi-Renewable-To-Hydrogen Production System for Hydrogen Fueling Stations. IEEE Trans. Ind. Applicat. , 1. doi:10.1109/TIA.2021.3093841

Zhang, Y., Ai, Q., Wang, H., Li, Z., and Zhou, X. (2020). Energy Theft Detection in an Edge Data center Using Threshold-Based Abnormality Detector. Int. J. Electr. Power Energ. Syst. 121, 106162. doi:10.1016/j.ijepes.2020.106162

Zhu, L., Lu, C., Dong, Z. Y., and Hong, C. (2017). Imbalance Learning Machine-Based Power System Short-Term Voltage Stability Assessment. IEEE Trans. Ind. Inf. 13, 2533–2543. doi:10.1109/tii.2017.2696534

Zhu, W., Wang, Y., Luo, M., Lin, G., Cheng, J., and Kang, C. (2016). Distributed Clustering Algorithm for Awareness of Electricity Consumption Characteristics of Massive Consumers. Autom. Elec. Power Syst. 40 (12), 21–27. doi:10.7500/AEPS20160316007

Keywords: electricity theft, TSRNN, adaptive parameter tuning, intelligent learning, SMOTE, power consumption data

Citation: Lin G, Feng H, Feng X, Wen H, Li Y, Hong S and Ni Z (2021) Electricity Theft Detection in Power Consumption Data Based on Adaptive Tuning Recurrent Neural Network. Front. Energy Res. 9:773805. doi: 10.3389/fenrg.2021.773805

Received: 10 September 2021; Accepted: 04 October 2021; Published: 10 November 2021.


Copyright © 2021 Lin, Feng, Feng, Wen, Li, Hong and Ni. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Shaoyong Hong, [email protected]

This article is part of the Research Topic

Advanced Technologies for Planning and Operation of Prosumer Energy Systems

COMMENTS

  1. Digging Deeper into Data Breaches: An Exploratory Data Analysis of

    Data breaches represent a permanent threat to all types of organizations. Although the types of breaches are different, the impacts are always the same. This paper focuses on analyzing over 9000 data breaches made public since 2005 that led to the loss of 11,5 billion individual records which have a significant financial and technical impact.

  2. The Effects of Privacy and Data Breaches on Consumers' Online Self

    Identity theft was once again the most prevalent data breach type. It accounted for approximately 83% of the accounts breached in H1 2018, a massive growth of 757% over the previous year. ... Five major streams of research inform our work in this paper: (1) technology adoption model (TAM), (2) consumer privacy paradox, (3) service failure, (4 ...

  3. Phishing Attacks: A Recent Comprehensive Study and a New Anatomy

    Data theft is an unauthorized accessing and stealing of confidential information for a business or individuals. ... Research on social media-based phishing, Voice Phishing, and SMS Phishing is sparse and these emerging threats are predicted to be significantly increased over the next years. 3. Laws and legislations that apply for phishing are ...

  4. (PDF) Data Breaches and Identity Theft: Costs and Responses Rita O

    The paper is organized as follows: section 2 reviews the literature while section 3 gives an overview of data breaches and identity theft. Section 4 discusses the economic costs of identity th ...

  5. A Systematic Analysis of the Capital One Data Breach: Critical Lessons

    (1) The attacker used anonymizing services (such as TOR and VPN service provider IPredator end nodes) to access Capital One's cloud network between March 12 and July 19, 2019. This was confirmed, ex post facto, by Capital One after reviewing their network data logs. The intrusions, querying of subsequent backend resources, and exfiltration of data, all remained undetected by the intrusion ...

  6. (PDF) Enterprise data breach: causes, challenges, prevention, and

    WIREs Data Mining Knowl Discov 2017, 7:e1211. doi: 10.1002/widm.1211 This article is categorized under: Application Areas > Business and Industry Fundamental Concepts of Data and Knowledge > Key ...

  7. PDF A Case Study of the Capital One Data Breach

    According to our research, the number of data records breached increased from 4.3 billion in 2018 to over 11.5 billion in 2019. There are a number of frameworks, standards and best practices in the industry to support organizations ... For the purpose of this paper, we selected U.S. bank Capital One as the object of study due to the severity

  8. Cyber risk and cybersecurity: a systematic review of data availability

    Cybercrime is estimated to have cost the global economy just under USD 1 trillion in 2020, indicating an increase of more than 50% since 2018. With the average cyber insurance claim rising from USD 145,000 in 2019 to USD 359,000 in 2020, there is a growing necessity for better cyber information sources, standardised databases, mandatory reporting and public awareness. This research analyses ...

  9. PDF Preventing Identity Theft: Perspectives on Technological Solutions from

    services provided by the Identity Theft Resource Center (Green et al., 2020). In phase 2, they collected qualitative data through interviews and focus groups with experts from public and private sector organizations engaged in preventing or remediating identity theft, and they analyzed these data (Green et al., 2020).

  10. The Financial and Psychological Impact of Identity Theft Among Older

    Society's growing reliance on technology to transfer private information has created more opportunities for identity thieves to access and misuse personal data. Research on identity theft specifically among adults aged 65 and older is virtually nonexistent, yet research focusing on victims of all ages indicates a positive association between ...

  11. Cyber-enabled Competitive Data Theft: A Framework for ...

    With this paper, Friedman, Mack-Crane, and Hammond present what they believe is the first economic framework and model to understand the long-run impact of competitive data theft on an economy by ...

  12. Healthcare Data Breaches: Insights and Implications

    The total number of healthcare records that were exposed, stolen, or illegally disclosed in the year 2019 was 41.2 million in 505 healthcare data breaches [8]. According to an IBM report, the average cost of a data breach in 2019 was $3.92 million, while a healthcare industry breach typically costs $6.45 million [9].

  13. Identity Theft: A Review of Critical Issues by Mark Hwang :: SSRN

    Abstract. Identity theft is a serious crime growing rapidly due to the ever-tighter integration of technology into people's lives. The psychological and financial loss to individual victims is devastating, and its costs to society at large staggering. In order to better understand the problem and to combat the crime more effectively, a ...

  14. Cybercrime and Intellectual Property Theft: An Analysis of Modern

    1 Introduction. Intellectual property (IP) theft is one of many cybercrimes committed daily. Anyone can conceivably commit these crimes—sometimes unwittingly as there are many different types of property. Computer software and data are intellectual property and are thus covered by the United States copyright law.

  15. (PDF) SOCIAL MEDIA AND CYBER SECURITY: PROTECTING ...

    bullying, data theft, identity theft, and phishing scams. This paper investigates the numerous effects of social media on cyber security and looks at defence mechanisms that could be used to

  16. Identity Theft

    The research found that identity theft generally involves three stages: acquisition of the identity information, the thief's use of the information for personal gain to the detriment of the victim of identity theft, and discovery of the identity theft. Evidence indicates that the longer it takes to discover the theft, the greater the loss ...

  17. Cyber risk and cybersecurity: a systematic review of data availability

    The paper can also help improve risk awareness and corporate behaviour, and provides the research community with a comprehensive overview of peer-reviewed datasets and other available datasets in the area of cyber risk and cybersecurity. This approach is intended to support the free availability of data for research.

  18. Full article: The Changing Face of Financial Crime: New Technologies

    Their research is based on interviews with approximately 60 information technology security professionals, "hackers," and academic researchers. The next technology-oriented article, by Claire S. Lee spotlights victimization in one of the most densely populated nations in the world with a rapidly growing financial crime problem: China.

  19. Research paper A comprehensive review study of cyber-attacks and cyber

    Encryption is a reversible method of encrypting data that requires a key to decrypt. Encryption can be used in conjunction with encryption, which provides another level of confidentiality (Sun et al., 2018). Encryption is the implementation and study of data encryption and decryption so that the data can only be decrypted by specific individuals.

  20. Does commerce promote theft? A quantitative study from Beijing, China

    The research significance of this paper is mainly reflected in the following two points: (1) By studying cases of theft, which have been judged to be criminal offences and account for a very high ...

  21. (PDF) Cybercrime -Identity Theft

    consequences of identity theft have vast implications for both privacy and security. As identified by idSafe (2019), identity theft is typically an invisible crime, made possible by the ...

  22. Frontiers

    Electricity theft behavior has serious influence on the normal operation of power grid and the economic benefits of power enterprises. Intelligent anti-power-theft algorithm is required for monitoring the power consumption data to recognize electricity power theft. In this paper, an adaptive time-series recurrent neural network (TSRNN) architecture was built up to detect the abnormal users (i ...