Fileorganize

General purpose functions for working with corpora.

Ronald Sprouse

Install

pip install fileorganize

Functions

fileorganize.dir_to_df(dirname, fnpat=None, dirpat=None, addcols=[], sentinel='', to_datetime=True, dotfiles=False, dotdirs=False, sort_by=['relpath', 'fname'], **kwargs)

Recursively generate the filenames in a directory tree using os.walk and store them as rows of a DataFrame.

‘Hidden’ files and directories (those with names that start with ‘.’) are ignored by default. dir_to_df will also not descend into a directory tree that contains a sentinel file.
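
The pruning rules described above can be sketched with os.walk alone. This is an illustration of the documented behaviour under stated assumptions, not the library's implementation: the helper walk_visible, the demo tree, and the sentinel name .skipme are all hypothetical.

```python
import os
import tempfile


def walk_visible(top, sentinel=""):
    """Yield (relpath, fname) pairs, skipping dot-names and sentinel-marked trees.

    Hypothetical helper, not part of fileorganize.
    """
    # topdown=True (the os.walk default) lets us prune `dirs` in place.
    for root, dirs, files in os.walk(top):
        if sentinel and sentinel in files:
            dirs[:] = []  # do not descend below a sentinel-marked directory
            continue      # and report none of its own files
        dirs[:] = sorted(d for d in dirs if not d.startswith("."))
        relpath = os.path.relpath(root, top)
        for fname in sorted(files):
            if not fname.startswith("."):
                yield relpath, fname


def demo():
    """Build a small throwaway tree and list what survives the pruning."""
    with tempfile.TemporaryDirectory() as top:
        paths = [
            ("keep", "a.wav"),
            ("keep", ".hidden"),
            ("skip", ".skipme"),  # hypothetical sentinel file
            ("skip", "b.wav"),
            (os.path.join("skip", "deep"), "c.wav"),
            (".cache", "d.wav"),
        ]
        for sub, fname in paths:
            os.makedirs(os.path.join(top, sub), exist_ok=True)
            open(os.path.join(top, sub, fname), "w").close()
        return list(walk_visible(top, sentinel=".skipme"))


print(demo())  # [('keep', 'a.wav')]
```

os.walk ignores in-place edits to the directory list when it is called with topdown=False, so this style of pruning needs the default top-down order.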

Additional parameters can be used to filter which filepaths are included in the output and to add columns of file metadata.

Parameters:

dirname (str) – Top-level directory name for filename search.
fnpat (str, re, default=None) –
Regular expression pattern that defines the filenames to return. The only filenames in the result set will be those that return a match for re.search(fnpat, filename).

Note

If you use named captures in fnpat, new columns corresponding to the capture groups will be added to the output dataframe as dtype ‘Categorical’.

If you need to use a flag with your pattern, you can use a precompiled regex for the value of fnpat. For example, you can do case-insensitive matching with re.compile(pattern, re.IGNORECASE).
dirpat (str, re, default=None) – Same as fnpat, but applied to the path of each directory relative to dirname. Relative paths that do not match dirpat will be skipped.

addcols (str, list of str, default=[]) –

One or more additional columns to include in the output. The available names and the values they provide are:

Columns that can be added
Name	Value
‘dirname’	the user-provided top-level directory
‘barename’	the filename without path or extension
‘ext’	the filename extension
‘mtime’	the last modification time of the file
‘bytes’	the size of the file in bytes

The ‘mtime’ column is cast to Pandas Timestamps automatically unless to_datetime is False. Resolution of the time-based stats is dependent on your platform; see the os.stat documentation.

sentinel (str, default='') – Name of the sentinel file, which marks a directory tree to be ignored. No filenames from the directory containing the sentinel file will be included in the output, nor will any filenames from any of its subdirectories. If the value of sentinel is ‘’ or None, the sentinel file check will not be performed.
to_datetime (boolean, default=True) – If True, ‘mtime’ stats will be converted from Unix epoch to datetime. If False, the values will not be converted.
dotfiles (boolean, default=False) – If True, include filenames beginning with ‘.’ in the output. Otherwise, omit these names.
dotdirs (boolean, default=False) – If True, descend into directories with names that begin with ‘.’. If False, do not descend into these directories.
sort_by (list of str, default=['relpath', 'fname']) – Sort output dataframe rows by the columns named in the list. Specify an empty list [] if no sorting is desired.
kwargs (various) – Remaining kwargs are passed to os.walk. If not used, then os.walk will be called with default kwargs. Note that using os.walk(topdown=False) is not compatible with dotdirs=False.
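
As a rough illustration of the values behind the ‘barename’, ‘ext’, ‘mtime’ and ‘bytes’ columns, the sketch below gathers them for a single file with the standard library. The helper file_metadata is hypothetical. Two details are assumptions rather than documented behaviour: that the extension keeps its leading dot, and that the epoch value is converted to local time.

```python
import os
import tempfile
from datetime import datetime


def file_metadata(path, to_datetime=True):
    """Collect the extra per-file values for one path (hypothetical helper)."""
    st = os.stat(path)
    barename, ext = os.path.splitext(os.path.basename(path))
    mtime = st.st_mtime  # seconds since the Unix epoch, as a float
    if to_datetime:
        mtime = datetime.fromtimestamp(mtime)  # assumption: local time
    return {"barename": barename, "ext": ext, "mtime": mtime, "bytes": st.st_size}


def demo():
    """Write a 5-byte placeholder file and return its metadata."""
    with tempfile.TemporaryDirectory() as top:
        path = os.path.join(top, "s09101.wav")
        with open(path, "wb") as fh:
            fh.write(b"RIFF\x00")  # placeholder bytes, not a real WAV file
        return file_metadata(path)


info = demo()
print(info["barename"], info["ext"], info["bytes"])  # s09101 .wav 5
```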

Returns:

fnamedf – Dataframe of filename rows.

Return type:

DataFrame

Example

In this example we list the .wav files in the Dimex corpus of Mexican Spanish. A typical filename is s09101.wav, which indicates that the file is of speaker s091 reading sentence 01. To capture this information, the fnpat argument has two named captures. The first, (?P<subj>s\d\d\d), matches ‘s’ followed by three digits and stores that sequence in the column subj. The second, (?P<sentence>\d+), stores the remaining one or more digits in the column sentence. Note the final two columns in the dataframe.

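from pathlib import Path

from fileorganize import dir_to_df

path_to_corpus = Path('./dimex')

# Named captures in fnpat become the 'subj' and 'sentence' columns.
corpus_list = dir_to_df(path_to_corpus,
                        fnpat=r'(?P<subj>s\d\d\d)(?P<sentence>\d+)\.wav',
                        addcols=["dirname", "barename"])
corpus_list.head()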

The first few lines of the dataframe corpus_list, which was created by the above code.
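
The named captures in the example pattern can be checked without the corpus by using the standard re module directly. The sketch below applies the same pattern with re.search, as the fnpat description specifies, and shows a precompiled case-insensitive variant. The helper parse_name and the uppercase filename are hypothetical.

```python
import re

FNPAT = r'(?P<subj>s\d\d\d)(?P<sentence>\d+)\.wav'


def parse_name(filename, pattern=FNPAT):
    """Return the named captures for a filename, or None if it does not match."""
    m = re.search(pattern, filename)  # accepts a str or a precompiled pattern
    return m.groupdict() if m else None


print(parse_name('s09101.wav'))  # {'subj': 's091', 'sentence': '01'}
print(parse_name('S09101.WAV'))  # None: matching is case-sensitive by default

# A precompiled pattern carries its own flags.
nocase = re.compile(FNPAT, re.IGNORECASE)
print(parse_name('S09101.WAV', nocase))  # {'subj': 'S091', 'sentence': '01'}
```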

fileorganize.today_YYYYMMDD()

Return today’s date in YYYYMMDD format.
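
For reference, the YYYYMMDD layout corresponds to the strftime format "%Y%m%d". The sketch below is a stand-in written with the standard library, not the library's own code, and the helper yyyymmdd is hypothetical.

```python
from datetime import date


def yyyymmdd(day=None):
    """Format a date as YYYYMMDD; defaults to today (hypothetical helper)."""
    if day is None:
        day = date.today()
    return day.strftime("%Y%m%d")


print(yyyymmdd(date(2024, 3, 5)))  # 20240305
```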

fileorganize.timestamp_now()

Create a timestamp for an acquisition, using current local time.

Returns:

timestamp, utcoffset – A tuple of strings representing the datetime in YYYY-MM-DDTHHMMSS format and the timezone offset from UTC, e.g. ‘-0700’.

Return type:

tuple(str, str)
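
The two strings in the return value can be reproduced with strftime on a timezone-aware datetime. This is a minimal sketch of the described format, assuming the offset is rendered with "%z". The helper local_timestamp and the fixed example moment are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def local_timestamp(now=None):
    """Return (timestamp, utcoffset) strings (hypothetical helper)."""
    if now is None:
        now = datetime.now().astimezone()  # timezone-aware local time
    return now.strftime("%Y-%m-%dT%H%M%S"), now.strftime("%z")


# A fixed moment seven hours behind UTC, so the result is reproducible.
fixed = datetime(2024, 3, 5, 14, 7, 9, tzinfo=timezone(timedelta(hours=-7)))
print(local_timestamp(fixed))  # ('2024-03-05T140709', '-0700')
```

Omitting the colons from the time portion keeps the string usable in filenames on platforms that reserve that character.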

fileorganize.cp_backup(fname, bkdir=None, hidden=True)

Make a backup copy of the file fname and return the name of the copied file.

By default the copy will have the same name as fname with ‘.’ prepended and a suffix of the form ‘.N’, where N is an integer. Multiple calls to this function result in increasing values of N, starting with ‘1’.

Parameters:

fname (str) – The name of the file to be copied.
bkdir (str, default=None) – By default, the backup file will be written in the same directory as the source file. If bkdir is provided, the backup file will be written to that directory instead. A FileNotFoundError will be raised if the backup directory does not exist.
hidden (bool, default=True) – If True, prepend ‘.’ to the backup filename, resulting in a ‘hidden’ file. If False, do not prepend anything.

Returns:

dst – The name of the copied backup file.

Return type:

str
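
The naming scheme described for cp_backup can be sketched with os.path and shutil. This is an illustration under stated assumptions, not the library's implementation: the helpers backup_name and copy_with_backup are hypothetical, and the sketch assumes N is chosen as the first integer for which no file of that name exists yet.

```python
import os
import shutil
import tempfile


def backup_name(fname, bkdir=None, hidden=True):
    """Return the first unused '<prefix><name>.N' path (hypothetical helper)."""
    srcdir, base = os.path.split(fname)
    dstdir = srcdir if bkdir is None else bkdir
    prefix = "." if hidden else ""
    n = 1
    while True:
        dst = os.path.join(dstdir, f"{prefix}{base}.{n}")
        if not os.path.exists(dst):
            return dst
        n += 1


def copy_with_backup(fname, bkdir=None, hidden=True):
    """Copy fname to its backup name and return that name."""
    dst = backup_name(fname, bkdir, hidden)
    # shutil.copy2 raises FileNotFoundError if the target directory is missing.
    shutil.copy2(fname, dst)
    return dst


def demo():
    """Back up one throwaway file twice and return the backup basenames."""
    with tempfile.TemporaryDirectory() as top:
        src = os.path.join(top, "notes.txt")
        with open(src, "w") as fh:
            fh.write("draft")
        first = copy_with_backup(src)
        second = copy_with_backup(src)
        return [os.path.basename(first), os.path.basename(second)]


print(demo())  # ['.notes.txt.1', '.notes.txt.2']
```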