
[Request] Can you add attackToExcel.get_stix_data_from( "/path/to/export/folder") to make loading data much faster? Or some other more efficient cache file format? #73

Open
jt0dd opened this issue Apr 28, 2022 · 1 comment

Comments


jt0dd commented Apr 28, 2022

Is your feature request related to a problem?

The example from the usage page we've been using takes an extremely long time to load.

Describe the solution you'd like

Make it clearer (in the basic usage example) how we can not only export, but also cache and re-import the ATT&CK matrix data, rather than re-downloading and re-parsing it on every run.

Describe alternatives you've considered

There doesn't seem to be an alternative, since the documentation only mentions an export feature, not an import feature.

Additional context

import mitreattack.attackToExcel.attackToExcel as attackToExcel
import mitreattack.attackToExcel.stixToDf as stixToDf

# download and parse ATT&CK STIX data

# SUGGESTED ADDITION / PSEUDO CODE:
attackToExcel.export("enterprise-attack", "v8.1", "/path/to/export/folder")
# instead of:
# attackdata = attackToExcel.get_stix_data("enterprise-attack")
# allow:
attackdata = attackToExcel.get_stix_data_from("/path/to/export/folder")
# END ADDITION

# get Pandas DataFrames for techniques, associated relationships, and citations
techniques_data = stixToDf.techniquesToDf(attackdata, "enterprise-attack") 

# show T1102 and sub-techniques of T1102
techniques_df = techniques_data["techniques"]
print(techniques_df[techniques_df["ID"].str.contains("T1102")]["name"])

I don't know whether exporting to Excel is the most efficient way to cache the data (probably not), but it seems to be the only format supported. My only goal is to get the data into a DataFrame as efficiently as possible, instead of taking a 5-minute coffee break every time I restart my Jupyter kernel.

We're going to work around this by adding code that stores the DataFrames in Apache Parquet, but that wouldn't make sense as a PR to a library designed for Excel conversion. That said, people shouldn't need to invent their own caching solution for this, in my opinion. It would make sense to support one by default when the library takes 3-5 minutes to load the data into a DataFrame.

Like I said, I don't know if it really fits into the library, since it's named as an Excel conversion tool, but I'm thinking something like:

attackToExcel.export_parquet("enterprise-attack", "v8.1", "/path/to/export/file")
attackdata = attackToExcel.import_parquet("/path/to/export/file")
techniques_data = stixToDf.techniquesToDf(attackdata, "enterprise-attack")

jt0dd commented May 2, 2022

Oh, I should've suggested Python's pickling feature rather than Parquet; pickle is better suited to diverse data structures like these, while Parquet is aimed at very large tabular data. I only have a year of Python experience, so I'd forgotten it was the right option here. Nonetheless, I still think caching the data to a file should be a built-in option in the library rather than something the user has to do manually. I could understand if the maintainers of the project feel differently; it's not hard to cache with pandas.DataFrame.to_pickle. Just my suggestion / opinion.
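For what it's worth, a minimal sketch of the pickle-based cache described above, using only the standard library: fetch once, pickle to disk, and reload on subsequent runs. The `load_or_fetch` helper and the cache path are hypothetical names, not part of the mitreattack API; in practice `fetch` would wrap `attackToExcel.get_stix_data(...)`, assuming the object it returns is picklable.

```python
import os
import pickle


def load_or_fetch(cache_path, fetch):
    """Return the object cached at cache_path if it exists;
    otherwise call fetch(), cache its result, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Slow path, e.g. attackToExcel.get_stix_data("enterprise-attack")
    data = fetch()
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```

The slow fetch runs only on the first call; after that the data comes straight off disk, so a kernel restart costs seconds instead of minutes. The same pattern works with `pandas.DataFrame.to_pickle` / `pandas.read_pickle` for the individual DataFrames.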
